wiki:developers-meeting-20141120
Last modified on 11/26/14 16:11:32

dCache Tier I meeting November 20, 2014

[part of a series of meetings]

Present

dCache.org(Tigran, Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(), PIC(), KIT(Xavier), Fermi(Natalia), CERN()

Agenda

(see box on the other side)

Site reports

KIT

Some issues with ATLAS. It looks like, for FZK-to-RAL transfers, they expected the data to be on disk, but it was only on tape. This triggered a lot of staging.

Lots of services then started to fail.

ATLAS finally understood that they must stop what they were doing.

No protection in 2.6; upgrade to 2.10 to get this protection.

How to find out why a file was staged?

If you bump the log level in pool-manager or pin-manager, you should see the log entry.

Need to know why a file was staged: what operation triggered it and client IP address.

Unreliable poolmanager

Pool-manager became very unreliable for normal operations.

Pool-manager had some 70k staging requests.

At first it didn't run out of memory.

There was an 11 GiB heap dump; unfortunately the file became corrupted when attempting to compress it.

The only errors were:

"request to <pool-name> timed out."
"No candidates left configured for staging"
"PoolUp: pool <change> from ENABLED to ENABLED"

On the pool:

"Message queue overflow"

dCache v2.6.33

Normally one pool is used for staging, with a second pool as a fall-back.

The second pool was being used.

Ticket is 8536

dispatch failed for BlockingHttpConnection

Not normal, but unclear if directly linked to SRM problems.

2014-11-14 (Friday), 2014-11-15 (Saturday), 2014-11-16 (Sunday), 2014-11-17 (Monday)

2014-11-14 ~9 am, noticed the problem
2014-11-14 ??     tried to contact ATLAS to diagnose the problem, no success; disconnected alarm system.
2014-11-15 18:10  Adreas Pezlt was called by ATLAS (Rod?) and told that SRM was dead.
2014-11-15 18:18  the SRM was manually restarted and the problem went away.  SRM didn't shut down cleanly, and the thread stack traces show many srmRm operations being processed.
2014-11-16 05:01  dCacheDomain JVM ran out of memory, domain didn't shut down.
2014-11-17 08:19  dCacheDomain was manually restarted.

Since Monday, SRM seems to be behaving correctly again.

rc set max restores

Another problem concerns the pool-manager command "rc set max restores". They tried to set the limit to 30,000. No new recalls were created, even though the number of concurrent restores was below the limit.

Xavier will create an RT ticket about this.

2.10 migration

Properties files: the max queue may be set to a negative number if you want to disable it.

dcap.limit.clients may be set to a negative number, but the effect is not documented in the defaults file.

prestage with dcap

See RT ticket 8532

2.6 dcacheReadonly

dcache.authz.read-only = true

Need to update documentation in dcap defaults so it makes sense.

Changing anon-access to READONLY fixes the issue

WebDAV reporting error

RT # 8543

Looks like a race condition between registering the location in PnfsManager and the pool putting the file entry into the final state.

xrootd plugin

xrootd Plugin FAX

ATLAS and CMS name-to-name plugin. The pool monitoring plugin wasn't run at DESY.

Fermi

Many things.

Last time migration -- Eugint -- did it for the instance -- pool in READ-ONLY state.

IO errors likely due to RAID system.

new issue

"Out of memory alarms" -- stress tests when production was running.

Three pools have out-of-memory errors. ~30 pools had the same thing.

Dmitry was looking into the heap dump; he thinks it could be too many movers.

Three different alarms in Zabbix: clumping issue, OOM, something-else. OOM error triggered by an OOM being logged in the log file.
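
A log-triggered OOM alarm of this kind can be sketched as a simple scan of the pool log. This is only an illustration, not the check used at Fermi: the function name find_oom_lines is hypothetical, and it assumes the logged error contains the standard JVM exception name java.lang.OutOfMemoryError.

```python
import re
from typing import Iterable, List

# Assumption: the pool log records the standard JVM exception name.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError")


def find_oom_lines(lines: Iterable[str]) -> List[str]:
    """Return every log line that mentions an OutOfMemoryError."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def should_raise_alarm(lines: Iterable[str]) -> bool:
    """Hypothetical alarm rule: fire as soon as one OOM line is logged."""
    return len(find_oom_lines(lines)) > 0
```

Because such an alarm fires on the log entry and not on actual heap usage, it reports that an OOM has already happened; it gives no early warning.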

30 nodes that were restarted.

CMS uses old dcap library. This is fixed in updated dcap library (that CMS doesn't use) and in dCache versions.

Ganglia plots for pool: memory/RAM usage for pool decreased.

You can watch the number of files in the pool. A rough rule is that a single file uses ~1 kiB of memory.

Each pool node has 16 GiB; so 5 GiB memory limit per pool.

Each dcap mover has 256 kiB buffer for read/write.
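
The figures above (~1 kiB per file, 256 kiB per dcap mover, 5 GiB limit per pool) allow a back-of-envelope estimate of how many movers fit into a pool's memory. The sketch below is only that: the function names are hypothetical, it assumes a single 256 kiB buffer per mover, and it ignores all other memory a pool uses.

```python
KIB = 1024
GIB = 1024 ** 3

FILE_ENTRY_BYTES = 1 * KIB            # rough rule: ~1 kiB of memory per file
DCAP_MOVER_BUFFER_BYTES = 256 * KIB   # assumption: one 256 kiB buffer per dcap mover
POOL_MEMORY_LIMIT_BYTES = 5 * GIB     # 5 GiB memory limit per pool


def estimate_pool_memory(num_files: int, num_dcap_movers: int) -> int:
    """Rough memory estimate in bytes: file entries plus dcap mover buffers."""
    return num_files * FILE_ENTRY_BYTES + num_dcap_movers * DCAP_MOVER_BUFFER_BYTES


def max_movers_within_limit(num_files: int) -> int:
    """Number of dcap movers that fit in what the file entries leave free."""
    remaining = POOL_MEMORY_LIMIT_BYTES - num_files * FILE_ENTRY_BYTES
    return max(0, remaining // DCAP_MOVER_BUFFER_BYTES)


if __name__ == "__main__":
    files, movers = 1_000_000, 1_000
    used = estimate_pool_memory(files, movers)
    print(f"{files} files + {movers} movers: about {used / GIB:.2f} GiB")
    print(f"movers that fit next to {files} files: {max_movers_within_limit(files)}")
```

On these numbers the mover buffers alone would need tens of thousands of concurrent movers to exhaust 5 GiB, so this estimate is a lower bound on memory use and does not by itself explain an OOM.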

pool unresponsive

Please create a thread dump with the 'dcache dump threads' command and send the output to Dmitry.

You can run three pools inside the same JVM; this will save some memory.

Use 'mover ls' through jython to find the number of movers. Ask Dmitry for help in getting it working.

Patches

Patches were discussed; they want to back-port several.

NFS

NFS v4.1 --> v4.0. See the problem come back to v4.0

Problem hardware

Hardware admins looking into how to replace the dead disk.

If attached by fibre-channel, then just attach the cable to another pool node and start serving the pool from there.

At DESY, system disks are usually in a mirror configuration (RAID-1), so they can be replaced without shutting down the pool node.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.