Last modified on 06/19/18 15:39:57

dCache Tier I meeting June 19, 2018

[part of a series of meetings]

Present

dCache.org(Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Jens), PIC(Elena), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

KIT

Two major issues for ATLAS.

Main issue

Both triggered by a network outage: the network router line cards broke, bringing the interfaces down.

Restart domains -- to re-establish connection with the database. Hostname of the database could not be resolved.

All nodes of two racks -- for dCache, no pools were affected, all core services were affected.

Connections to zookeeper (seemed to) reestablish themselves.

Didn't check whether the tunnels were reestablished.

Database connections (via network) did not reestablish themselves.

Restarting dCache on those nodes resolved the problem; otherwise, dCache continued to report an UnknownHostException.

Only the nodes on the affected machines needed to be restarted.

This is with dCache v2.16.

Please open a ticket.

Issue happened at Thursday midnight.

Network problems were resolved Friday 01:00 (one hour later).
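
The failure mode above (a long-running service that keeps failing after a temporary name-resolution outage) can be illustrated with a small retry loop that resolves the hostname afresh on every attempt. This is only a sketch in Python, not dCache code and not a description of how dCache connects to its database; the function name resolve_with_retry, the host db.example.org and the retry parameters are hypothetical.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying when name resolution fails.

    The resolver is called again on every attempt, so a hostname that
    was unresolvable during an outage is picked up once DNS is back.
    All names and defaults here are illustrative assumptions.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error


if __name__ == "__main__":
    # Hypothetical database host; replace with a real one to try it out.
    try:
        print(resolve_with_retry("db.example.org", 5432, attempts=3))
    except socket.gaierror as exc:
        print("still unresolvable:", exc)
```

A service that only resolves the name once at startup, or caches a failed lookup indefinitely, would need a restart to recover, which matches what was observed.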

cleaner

Cleaner service in a separate domain -- didn't restart this.

The cleaner domain was (mistakenly) not restarted, so was failing for 5 days, reporting UnknownHostException.
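
A missed domain like this could be spotted by scanning the log files for the exception after an outage. The following is a minimal sketch, assuming each domain writes to its own *.log file in a single directory; the directory path and file naming are assumptions, not taken from the KIT setup.

```python
from pathlib import Path


def domains_reporting(log_dir, needle="UnknownHostException"):
    """Return {log file name: number of lines containing needle}.

    Only files with at least one matching line are included.
    The one-log-file-per-domain layout is an assumption.
    """
    hits = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            count = sum(1 for line in handle if needle in line)
        if count:
            hits[path.name] = count
    return hits


if __name__ == "__main__":
    # Hypothetical log directory; adjust to the local installation.
    for name, count in domains_reporting("/var/log/dcache").items():
        print(f"{name}: {count} matching lines")
```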

Slow srmManager startup

Restarted core services for ATLAS

srmManager cell took a long time to appear (more than 10 minutes) -- not sure what was causing it.

Restarted domain; dCache logged that the srmManager startup was interrupted.

Restarted with debug enabled.

Saw that some 50,000 bringonline requests were being processed.

Restarted domain, disabling debug output.

(gplazma prints whole certificate chain in debug output)

Can this be done in the background?

xrootd memory problem update

With the support of Tigran, using VisualVM to monitor direct memory usage (with an additional plugin).

ssh admin interface

SSH admin interface still has a problem with 'echo' pipe commands.

RT 9418

Paul to

REST api support in Web browser

Only in dCache v4.1

NDGF

Going very well and very badly.

Upgraded to dCache 3.2 -- for most part that went well.

Someone had set the tags for SRM so that it didn't publish into SRM.

Later it turned out that the tape pools didn't work.

In 3.1, guava library was upgraded .. no longer compatible

dCache tries to load the module.

Downgrading the pool as an immediate attempt to recover from this problem.

The pool export didn't work -- RT ticket.

Problem is currently not believed to be critical, but this could change.

Postgres updates took less than an hour.

The goal is to go for 4.2 at the end of August.

Preproduction system is progressing. Will try installing dCache 4.2 along with an Ansible playbook.

PIC

Everything is going fine.

Problems continue with the WebDAV door.

http://rt.dcache.org/Ticket/Display.html?id=9433

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.