Last modified on 11/13/18 15:23:52

dCache Tier I meeting November 13, 2018

[part of a series of meetings]

Present

dCache.org(Tigran, Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Jens), PIC(Elena), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

NDGF

Things are going fairly well. Had a few incidents over the past few weeks. We have a double front-end setup. Trying to make this fully redundant, so that either front-end can be shut down for maintenance.

They are supposed to be the same, and behave the same; but there are some differences in how they behave.

One machine has 12 GiB of active memory. The other uses 20 GiB (now updated to 30 GiB).

Trying to figure out ways to instrument the dCache instance.

Run ucarp to switch a virtual IP address.

Take a heap dump of both dCache domains and send them to us (support@…); we can try to see where the extra memory is being used.

Could also try unregistering the heavily loaded door from BDII and see if that has an effect.

One door (which one?) couldn't be run through haproxy; that door has a DNS alias to support round-robin.

PIC

dCache running fine this week.

We want to apply the latest patch for 3.2 this year.

Plan is to upgrade to dCache v4.2 next year (2019 Q2).

KIT

Principally running fine.

Still have issues with the HTTP domain: it crashes with out-of-memory.

Xavier has uploaded the heap dump to the KIT share service.

CMS pool

A CMS pool has been failing every day for the last week. There was a problem reading files from its partition. Restarting the pool "fixes" the problem.

GPFS people say there was no problem.

This pool is connected to tape.

There is a problem where a cancelled flush request (due to the file being removed) caused the metadata repository to be interrupted -- triggering the pool to disable itself.

info

dCache v3.2 has different info-base info output -- the 'mover' queue is no longer reported. This used to be the sum of all other queues.

Webadmin still shows this, but it is missing from the info service output.
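
Since the old 'mover' figure was just the sum of the other queues, a site that still needs it could recompute it from the per-queue numbers. A minimal sketch, assuming the per-queue counts have already been extracted into a dictionary; the queue names, the "active"/"queued" keys and the numbers are hypothetical and are not the actual info service schema:

```python
from typing import Dict, Mapping


def total_movers(queues: Mapping[str, Mapping[str, int]]) -> Dict[str, int]:
    """Sum the active and queued counts over all named queues."""
    totals = {"active": 0, "queued": 0}
    for counts in queues.values():
        # A queue that lacks one of the counters contributes zero to it.
        totals["active"] += counts.get("active", 0)
        totals["queued"] += counts.get("queued", 0)
    return totals


# Hypothetical per-queue figures for one pool (not real site data).
example = {
    "regular": {"active": 12, "queued": 3},
    "p2p": {"active": 2, "queued": 0},
    "store": {"active": 1, "queued": 4},
}

print(total_movers(example))
```

How the per-queue counts are obtained from the info service is left out here, as it depends on the output format in use at the site.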

ATLAS

2.16 a couple of pools dropped from their well-known status and can only

Occurred some five days ago. The pool has been up since before this.

There are three core domains.

Please open a support ticket.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.