
dCache Tier I meeting January 9, 2018

[part of a series of meetings]

Present

dCache.org(Tigran, Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Ulf), PIC(), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

NDGF

Working fine over the Christmas period.

ALICE pools ran out of direct memory: the same file was being requested many times from different worker nodes (WNs).
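
If the direct-memory ceiling itself needs raising, the per-domain JVM memory limits are set in the dCache layout file. A minimal sketch, assuming a pool domain called poolsDomain (the domain name and the sizes are illustrative, not NDGF's actual settings):

[poolsDomain]
# maximum JVM heap (-Xmx)
dcache.java.memory.heap=4096m
# maximum NIO direct memory (-XX:MaxDirectMemorySize)
dcache.java.memory.direct=8192m

The affected domains have to be restarted for new JVM memory limits to take effect.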

Ulf is to leave NDGF in February. The role will probably become a rotating one after he leaves; the contact email is <support@…>.

KIT

Worked pretty well over Christmas and New Year.

xrootd DoS

Last weekend there was a problem with the ATLAS instance. One user was trying to download many files from dCache.

Pools had a limit of 200 active transfers. This was insufficient, with movers being queued.

After a while, the door refused to accept new movers, yet it was still redirecting clients to the pool.

There were errors describing the problem as "Server shutdown":

level=ERROR ts=2018-01-05T00:00:00.158+0100 event=org.dcache.xrootd.request session=door:Xrootd-f01-080-105-e@xrootd-f01-080-105-eDomain:AAVh+0cHQqg request=open path=//pnfs/gridka.de/atlas/disk-only/atlasscratchdisk/rucio/panda/50/01/panda.0104111815.304447.lib._12918370.12134974529.lib.tgz options=0x450 response=error error.code=ServerError error.msg="Server shutdown"

The pool initially showed low throughput: 100 MiB/s.

After Xavier increased the allowed concurrent movers to 800, pool throughput improved and the number of queued movers decreased.

Xavier reintroduced the 200 mover limit, but the problem came back.

The 200-mover limit had been introduced because ATLAS transfers could spike, resulting in out-of-memory errors.
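
For reference, this kind of mover-limit change can be made at run time from the pool's admin cell; an annotated sketch follows (the pool name is a placeholder, and the trailing comments are annotations rather than shell input):

\c pool_f01-080-123        # connect to the affected pool cell (hypothetical name)
mover set max active 800   # raise the limit on concurrent movers
save                       # persist the new limit in the pool's setup file

Without the save step the old limit would return after the next pool restart.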

The xrootd access was unauthenticated and the network connection was NAT-ed.

Tigran believes the job deadlocks itself by trying to read too many files.

Xavier to open a support ticket.

This is with dCache 2.16.

Random xrootd authentication error

RT 9129

No progress to report.

Glob support

RT 9187

Delegated to HTW-Berlin

ZooKeeper losing connectivity

RT 9277

Need to add more information.

Empty core domains

RT 9307

Manually renaming the interface name in ZooKeeper "fixed" the problem: the nodes reconnected to the core domain using the preferred interface.

Except that one pool "disappeared".
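
As a sketch of how such an entry can be inspected before it is edited, the standard ZooKeeper CLI can be used; the server name is a placeholder, and the exact znode path below /dcache for the core-domain registration depends on the dCache release, so the last line shows a placeholder rather than a real path:

zkCli.sh -server zk1.example.org:2181
ls /dcache
get /dcache/<path-to-core-domain-entry>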

An upload was made using SRM/gsiFTP; however, the srmcp client failed:

level=ERROR ts=2018-01-09T11:19:20.669+0100 event=org.dcache.srm.request session=NQY:6838026:srm2:putDone socket.remote=192.108.45.48:38110 request.method=srmPutDone user.dn="/C=DE/O=GermanGrid/OU=KIT/CN=Robot - grid client - Xavier Mol (GridKa Monitoring)" user.mapped=17001:5900 request.token=fa3b35f0:-1626924167 status.code=SRM_FAILURE status.explanation="The operation failed for all SURLs" user-agent=dCache/3.2.7

Through the admin interface, Xavier was able to issue the 'ps' command on the System cell of that domain. This showed the pool cell as running.

However, direct communication with the pool cell was not possible.
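
A sketch of the checks described above, using the ssh admin shell (the domain and pool names are placeholders, and the trailing comments are annotations rather than shell input):

\l                        # list the well-known cells visible to the admin shell
\s System@poolDomain ps   # 'ps' on that domain's System cell still lists the pool cell
\c pool_f01-123           # yet a direct connection to the pool cell does not succeed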

The access log file shows the file was uploaded successfully:

level=INFO ts=2018-01-09T11:17:04.098+0100 event=org.dcache.ftp.response session=door:GFTP-f01-080-118-e-AAViVTdWkqA@gftp-f01-080-118-eDomain command="ENC{PUT pasv;path=//upload/4/5188f216-b401-496a-97ff-33df778fc71e/srmtest-dcachesrm-kit.gridka.de-1515493021.tmp;}" reply="ENC{226 Transfer complete.}"

It looks like the problem is with the final srmPutDone verification: the SRM failed to contact the pool when verifying that the upload had succeeded.

Most likely, the domain hosting the ftp door *could* talk to the pool, while the domain hosting the srmmanager service *could not* talk to the pool.
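
One way to confirm such an asymmetry, sketched below, is to compare the cells routing tables of the two domains from the admin shell; 'route' on a domain's System cell lists the message routes that domain currently has (domain names are placeholders, trailing comments are annotations):

\s System@gftpDoorDomain route   # routes known to the domain hosting the GridFTP door
\s System@srmDomain route        # routes known to the domain hosting the srmmanager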

The instance has three core domains.

Need to add more diagnostic information.

CMS plugin

RT 9280

Needs fixing and releasing.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.