Last modified on 04/22/14 14:58:25

dCache Tier I meeting April 22, 2014

[part of a series of meetings]

Present

dCache.org(Tigran, Paul, Gerd), NDGF(Ulf), PIC(Marc), KIT(Xavier)

Agenda

(see box on the other side)

Site reports

NDGF

Ulf reported that things are currently running fine at NDGF. There are no big problems; there was heavy traffic over the Easter holiday, but dCache coped well.

PIC

Marc reported that everything is OK at PIC.

Their upgrade plans are blocked until a dCache version newer than v2.2 that is compatible with Enstore is available. Other than that, everything is running OK.

PIC plans to upgrade to dCache v2.10 as soon as it is available and passes their testing. Paul described the release schedule for v2.10, which has an anticipated release date of 1st July 2014.

There are two (known) problems with v2.2 that prevent PIC from using NFS:

  1. if a client attempts to append to a file then that file is truncated,
  2. an export that is marked 'read-only' on the server may still be mounted read-write on the client.
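
The two symptoms above could be checked from an NFS client with a small script. The following is a minimal sketch, not a dCache tool: the function names and temporary file names are hypothetical, and it assumes the directory passed in is the mounted export under test. A refused append is treated as acceptable; only silent loss of the existing data counts as a failure. Note that the second check tests whether a write actually succeeds, which is a stricter symptom than the mount option alone.

```python
import os
import tempfile


def append_preserves_content(directory):
    """Return True if the original data survives an append attempt."""
    path = os.path.join(directory, "append-check.tmp")  # hypothetical name
    try:
        # Write the initial content and close the file.
        with open(path, "wb") as f:
            f.write(b"first\n")
        try:
            # Reopen in append mode; this must not truncate the file.
            with open(path, "ab") as f:
                f.write(b"second\n")
        except OSError:
            # The server may refuse the append; that is not a truncation.
            pass
        with open(path, "rb") as f:
            return f.read().startswith(b"first\n")
    finally:
        if os.path.exists(path):
            os.remove(path)


def mount_accepts_writes(directory):
    """Return True if a file can be created in the directory."""
    path = os.path.join(directory, "write-check.tmp")  # hypothetical name
    try:
        with open(path, "wb") as f:
            f.write(b"x")
    except OSError:
        return False
    os.remove(path)
    return True


if __name__ == "__main__":
    # A local temporary directory stands in for the NFS mount point here.
    with tempfile.TemporaryDirectory() as d:
        print("append keeps data:", append_preserves_content(d))
        print("writes accepted:", mount_accepts_writes(d))
```

On an export that the server marks read-only, `mount_accepts_writes` returning True would indicate the second problem; on any export, `append_preserves_content` returning False would indicate the first.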

Both issues are fixed in dCache v2.6, but the problems with Enstore compatibility prevent PIC from upgrading.

Paul described how all the known problems with Enstore are fixed in dCache v2.7, v2.8 and the forthcoming v2.9 release. However, previous experience has shown that real testing with Enstore can reveal additional problems; therefore, we need further testing before we can say definitively that all problems are fixed. Once this is done, we will back-port the NFS changes to v2.6, allowing sites running Enstore to upgrade to v2.6.

Marc offered to test dCache releases using their test instance to check Enstore compatibility. This offer was gratefully received.

KIT

Xavier reported that the future is looking brighter at KIT. The cause of the "connection reset by peer" problem has been identified and fixed. It was due to a bug in the ALICE software stack that caused their software to ignore the local SE and read all data from CERN. This filled the firewall and triggered the problem. The cause was discovered before Easter, and the fix has brought a huge improvement for KIT.

Xavier has further investigated the problem with statistics gathering for their ATLAS instance. He collected the data with debug output enabled, waited for one pool to time out, and captured a heap-dump.

In capturing the heap-dump, he noticed that when the 'dcache' script failed to find the command needed for the heap-dump, the advice it gave contained the wrong Java version number. Gerd mentioned he'd also seen this problem. Xavier will open a ticket about this: http://rt.dcache.org/Ticket/Display.html?id=8303.

There was some speculation as to where the ATLAS problem may originate. The output from 'rep ls' includes information (such as file-count) that is not contained within the Berkeley DB. Displaying this information, therefore, triggers disk-IO activity, which could be the cause of the slow-down. Perhaps dCache could cache this information.

Xavier mentioned that the problems didn't seem to correlate with IO-load on the pools: pools showing negligible CPU time spent in WAIT state still demonstrated the problem. Although not definitive, this suggests the problem lies elsewhere.

Support tickets for discussion

[Items are added here automagically]

DTNM

The next Thursday meeting is:

Thursday 24th April at 16:00 CEST, 09:00 CDT, 10:00 EDT, 07:00 PDT

The next Tuesday meeting is:

Tuesday 29th April at 14:00 CEST, 16:00 MSK.

Next meeting: Thursday this week, or at the same time next week.