dCache Tier I meeting June 28, 2011
[part of a series of meetings]
Present
dCache.org (Tigran, Paul, Antje, Tanja), Triumf (Simon), PIC (Gerard), GridKa (Doris)
Agenda
(see box on the other side)
Site reports
PIC
Gerard reported that everything is OK.
PIC had a small issue: they hit their maximum number of concurrent open files. They used ulimit to increase the limit to 1,000 files. Unfortunately, the ulimit command had been moved from /opt/d-cache/job/dcache-local.sh to a separate file. This file was executed (rather than sourced), which meant the ulimit command did not affect the dCache environment.
They moved the ulimit command back into the dcache-local.sh file and the problem was fixed.
Tigran noted that it is important to edit the correct "startup" file. There are two: dcache.local.sh and dcache.local.run.sh. The dcache.local.sh file is sourced, so changes made in it affect the dCache environment. The dcache.local.run.sh file is executed, so changes made in it do not affect the dCache environment.
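The difference between a sourced and an executed file can be reproduced outside dCache. The Python sketch below is illustrative only and is not part of dCache: the helper names (soft_nofile, pick_target, lower_in_child, lower_in_process) are hypothetical, and it assumes a Unix-like system whose sh accepts "ulimit -S -n" (as bash and dash do). A limit changed in a child process, as with an executed script, is lost when that process exits; a limit changed in the current process, as with a sourced script, persists.

```python
import resource
import subprocess


def soft_nofile():
    """Return the current soft limit on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def pick_target(current):
    """Pick a soft limit below the current one (lowering is always permitted)."""
    if current == resource.RLIM_INFINITY:
        return 1024
    return max(current // 2, 1)


def lower_in_child(new_soft):
    """Mimic an executed script: ulimit runs in a separate shell process."""
    result = subprocess.run(
        ["sh", "-c", f"ulimit -S -n {new_soft}; ulimit -S -n"],
        capture_output=True, text=True, check=True,
    )
    # The limit as seen inside the child shell.
    return int(result.stdout.strip())


def lower_in_process(new_soft):
    """Mimic a sourced script: the limit is changed in the current process."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return soft_nofile()


if __name__ == "__main__":
    before = soft_nofile()
    target = pick_target(before)
    print("limit seen inside child:", lower_in_child(target))
    print("parent after child exits:", soft_nofile(), "(was", before, ")")
    print("parent after in-process change:", lower_in_process(target))
```

Running the script should show the parent's limit unchanged after the child exits, and lowered only after the in-process change.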
GridKa
Doris reported that they have had a few problems.
LHCb problem
LHCb had a problem with their dCache instance for ~2 weeks. The problem was due to a change in their behaviour that resulted in their throughput being limited to a single server. Because of this, there was high load on the pool-manager from the pin-manager. The underlying cause was that some disk-only pools were off-line.
Some years ago there was a support ticket about this problem (RT #4405). The problem is that dCache submits a request to stage a file even though the file is not stored on tape.
Tigran mentioned that there is a potential fix for this problem. It is being tested at the moment; part of the difficulty is that some sites may be relying on this bug/feature, so we have to be careful before changing the behaviour. However, it is anticipated that the fix will go into 1.9.5.
ATLAS problem
Doris mentioned another problem; this time, with their ATLAS instance. The problem is new, so she hasn't fully investigated it yet.
ATLAS tries to read a file with many clients. The file is replicated on many pools. These pools seem to work sometimes, but not always. There are many errors reported in billing, something like "Transfer fails EOF on input socket." What does this mean?
This is for the dcap protocol? Yes.
The "fails with EOF on input socket" message normally means that the client did not finish its session with dCache correctly. This, in turn, is symptomatic of the client not calling POSIX close() on its open files. This would also explain why there are so many movers.
Doris reported that she always sees a new pool chosen on the restore page: there are 144 active requests for this file. Whenever she does a reload on the transfers page, there's a different pool selected. Tigran suggested this was due to the cost calculation: the large number of movers means that future requests target a different pool.
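Tigran's explanation can be illustrated with a toy model. The Python sketch below is not dCache's cost module: it assumes, purely for illustration, that each request goes to the pool with the fewest active movers and that movers are never released (as when clients do not close their files). The names select_pool and simulate and the pool names are hypothetical.

```python
def select_pool(movers):
    """Return the pool with the fewest active movers (first one wins ties)."""
    return min(movers, key=movers.get)


def simulate(pool_names, n_requests):
    """Assign requests one at a time; movers accumulate and are never freed."""
    movers = {name: 0 for name in pool_names}
    chosen = []
    for _ in range(n_requests):
        pool = select_pool(movers)
        movers[pool] += 1  # the mover stays active: the client never closes
        chosen.append(pool)
    return chosen


if __name__ == "__main__":
    print(simulate(["pool_a", "pool_b", "pool_c"], 6))
```

Under these assumptions every new request raises the load on the chosen pool, so the next request lands on a different one, which matches the changing pool seen on each page reload.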
Tigran added that this is a clear indication that the user application is broken and is not calling close() on files that it has opened.
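For reference, the client-side pattern at issue can be sketched as follows. This Python example uses a plain local file through os.open and os.close, which map onto the POSIX calls; it does not use the dcap library, and the function names (read_leaky, read_clean, is_open) are hypothetical. It shows a read that leaves the descriptor open next to one that guarantees close() is called even if the read fails.

```python
import os
import tempfile


def read_leaky(path):
    """Open and read, but never close: the descriptor stays open."""
    fd = os.open(path, os.O_RDONLY)
    data = os.read(fd, 1024)
    return fd, data  # fd is returned only so the demo can inspect it


def read_clean(path):
    """Open and read, closing the descriptor even if the read raises."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, 1024)
    finally:
        os.close(fd)


def is_open(fd):
    """Return True if fd still refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example data")
        path = tmp.name
    fd, _ = read_leaky(path)
    print("leaky read left descriptor open:", is_open(fd))
    os.close(fd)  # clean up what the leaky reader left behind
    print("clean read returned:", read_clean(path))
    os.unlink(path)
```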
Triumf
Simon reported that everything looks good.
Triumf is currently upgrading their tape system; they are adding two new high-density frames to their HSM. The tape system will be online again in 2 days.
dCache is operating normally; nothing to report.
Support tickets for discussion
[Items are added here automagically]
DTNM
Same time, next week.
