Last modified on 03/13/18 15:33:35

dCache Tier I meeting March 13, 2018

[part of a series of meetings]

Present

dCache.org(Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Jens), PIC(Elena), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

KIT

dCache doing just fine.

LHCb staging problem

Last week saw a problem with LHCb production.

The PostgreSQL log file showed errors like:

2018-03-09 00:12:32 CET ERROR: could not serialize access due to concurrent update

Restarting the pools and then dCacheDomain (which hosts pinmanager) helped, but the problem didn't go away.

Restarted the database on Friday (2018-03-09T10:42). That had a positive effect: the PostgreSQL errors became less frequent, but they still persisted.

The staging activity stopped on Friday: 2018-03-09T20:53.

The problem has now gone away: the last error was at 2018-03-10T23:44.

This is with dCache v2.16 using PostgreSQL v9.6.

LHCb saw this error message ("could not serialize access due to concurrent update").

It looks like the dCache pinmanager simply isn't retrying the database transaction, as it is supposed to, and instead propagates the error back to the client.
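For illustration, the expected behaviour (retry the whole transaction when PostgreSQL reports a serialization failure, SQLSTATE 40001) can be sketched as below. This is only a minimal sketch, not pinmanager code: dCache is written in Java, and the names SerializationFailure and run_with_retry, the attempt limit and the backoff values are all assumptions made for the example.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001
    ("could not serialize access due to concurrent update")."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run `transaction` (a callable that performs one complete database
    transaction) and retry it from the start on a serialization failure.

    The error is only propagated to the caller once `max_attempts`
    attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up: let the client see the error
            # Exponential backoff with jitter, so that competing
            # transactions are less likely to collide again.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            sleep(delay)
```

The important point is that the complete transaction is re-run, not just the failing statement; any other error is deliberately not caught here.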

Xavier to open a support ticket describing this problem.

Monitor direct memory

A method to monitor the committed memory

See RT 9364

Pool remove command

RT 9187

Globs are now supported in dCache for the target command.

xrootd pool

RT 9213

Memory cache requirements

The underlying problem is that ATLAS is able to crash pools, seemingly by starting too many xrootd movers.

Responded by lowering the number of concurrent transfers.

This triggered another problem where ATLAS would complain if a mover is queued for more than 6 minutes.
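The trade-off can be illustrated with a toy model: a pool with a fixed number of mover slots serves transfers in arrival order, and any transfer that waits longer than 6 minutes (360 s) would trigger the complaint. This is only a sketch under simplifying assumptions (strict FIFO queue, known transfer durations); the function names are made up for the example and do not correspond to anything in dCache.

```python
import heapq


def queue_waits(transfers, max_movers):
    """Return the queue wait (in seconds) of each transfer.

    `transfers` is a list of (arrival_s, duration_s) tuples sorted by
    arrival time; `max_movers` is the number of concurrent mover slots.
    """
    busy_until = []  # min-heap: time at which each occupied slot frees up
    waits = []
    for arrival, duration in transfers:
        # Drop slots whose mover finished before this transfer arrived.
        while busy_until and busy_until[0] <= arrival:
            heapq.heappop(busy_until)
        if len(busy_until) < max_movers:
            start = arrival  # a slot is free: no queuing
        else:
            start = heapq.heappop(busy_until)  # wait for the next free slot
        waits.append(start - arrival)
        heapq.heappush(busy_until, start + duration)
    return waits


def queued_too_long(waits, limit_s=360):
    """Indices of transfers that were queued for longer than `limit_s`."""
    return [i for i, wait in enumerate(waits) if wait > limit_s]
```

For example, three simultaneous 10-minute transfers on a pool limited to two movers leave the third one queued for 600 s, which is over the limit: queued_too_long(queue_waits([(0, 600), (0, 600), (0, 600)], 2)) returns [2].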

(no current GGUS ticket, but ongoing discussion via email)

It looks like the fix involves changing how the door handles the case where the pool is queuing the mover.

200 movers per pool -- this is working just fine for production.

Production is a write-heavy workload, so no queuing is currently seen.

PIC

Nothing to report.

Jens

Nothing to report.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.