
dCache Tier I meeting September 28, 2010

[part of a series of meetings]

Present

dCache.org (Patrick, Owen, Antje, Christian, Tanja, Tigran, Paul), Triumf (Simon), PIC (Gerard), GridKa (Doris)

Apologies:

Agenda

(see box on the other side)

Site reports

KIT

Doris reported that everything is currently fine.

On Saturday, the head-node for the CMS instance crashed. The dCacheDomain was using some 6 GiB of memory and the service had stopped responding. Simply restarting the services fixed the problem, and the service is now stable.

Tigran asked whether there had been a large number of restores. Doris said the number was minimal: there were ~293 restores for the whole day, so not very many.

However, the problem appeared when there was a massive number of stores (roughly 1,074,000).

One pool node was completely full of precious files. The problem was traced to a fault with the underlying filesystem (GPFS): the pool node ran out of memory and couldn't create enough processes to store all the files.

Please ask the operators to update their procedure so that, if the problem reoccurs, they take a heap-dump before restarting nodes that are using excessive memory. The /opt/d-cache/bin/dcache command may be used to achieve this.
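For illustration only (not the exact command the operators would run), here is a minimal Java sketch of the HotSpot heap-dump facility that tools such as jmap build on. The class name and output path are assumptions, and the snippet dumps the heap of the JVM it runs in:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    // A minimal sketch, assuming a HotSpot JVM: write a heap dump of the
    // JVM this code runs in, via the HotSpotDiagnostic MXBean (the same
    // mechanism behind "jmap -dump"). The output path is illustrative.
    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // live=true dumps only reachable objects (forcing a GC first);
            // pass false to capture everything, including garbage.
            diag.dumpHeap("/tmp/dCacheDomain-heap.hprof", true);
        }
    }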

There was no out-of-memory exception in PoolManager Java log files: the JVM did not run out of memory.
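As a side note on how the two cases are told apart, a minimal Java sketch (an illustration, not dCache code) that reports heap versus non-heap usage via the standard MemoryMXBean; a process can grow well beyond its Java heap through non-heap and native allocations without ever throwing OutOfMemoryError:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    // A minimal sketch: report JVM heap and non-heap usage, which helps
    // distinguish "the process uses 6 GiB" from "the Java heap is full".
    public class MemoryReport {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = mem.getHeapMemoryUsage();
            MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
            System.out.printf("heap:     used=%,d max=%,d%n",
                    heap.getUsed(), heap.getMax());
            System.out.printf("non-heap: used=%,d max=%,d%n",
                    nonHeap.getUsed(), nonHeap.getMax());
        }
    }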

Do you use the billing database or billing files? KIT uses both.

The billing database is running on a different host from the dCacheDomain.

How many requests were coming in per hour? The peak was 700.

In the hour(s) before, between 10 and 11 o'clock, there were no errors.

Please create a ticket with this information, and also please send the log file of the pool node that failed due to out-of-memory.

PIC

Everything is OK, except for an xrootd issue: users are trying to open several files in ROOT at the same time. They've tried this with dCache 1.9.10, and there it works.

Is the ROOT code + data something we can run at DESY? Yes.

Triumf

Everything OK.

There was one case where p2p replication hangs. When this happens, a lot of transfers fail.

Can you get a stack trace from both pools involved? Not easily; the problems are identified by the end-users and they tend to notice only when the job fails.
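For reference, a minimal Java sketch (an illustration, not a dCache tool) of the information such a thread dump contains; equivalent output can be obtained from a running pool JVM with jstack or by sending it SIGQUIT:

    import java.util.Map;

    // A minimal sketch: print the stack traces of all live threads in
    // this JVM, roughly what "jstack <pid>" (or kill -QUIT) captures
    // from a pool node when a p2p transfer appears hung.
    public class ThreadDumper {
        public static void main(String[] args) {
            for (Map.Entry<Thread, StackTraceElement[]> entry
                    : Thread.getAllStackTraces().entrySet()) {
                Thread t = entry.getKey();
                System.out.println("Thread: " + t.getName()
                        + " (state: " + t.getState() + ")");
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }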

Is the problem correlated with any network problem? Could it be triggered by a network glitch during the p2p transfer?

Most of the observed problems are for files where the p2p transfer is reported as having completed successfully.

Are you saturating your network interfaces?

Simon mentioned that he has a separate partition for p2p transfers.

Patrick asked whether the problematic file had been p2p-transferred earlier or at about the same time as the problem appeared. Simon wasn't sure.

Is the file listed in the restore queue? The restore queue holds two kinds of transfers: those that trigger restoring from tape and those that trigger pool-to-pool copying.

Simon would try to collect more information.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.