
dCache Tier I meeting August 31, 2010

[part of a series of meetings]

Present

dCache.org (Gerd, Antje, Paul, Patrick), Triumf (Simon), BNL (Pedro), NDGF (Gerd), PIC (Gerard), GridKa (Doris, Silke), CERN (Andrea)

Agenda

(see box on the other side)

Site reports

NDGF

Gerd reported that he doesn't believe NDGF have any current issues.

GridKa

Doris reported that everything is fine, except for a little problem with ATLAS and dcap.

She doesn't believe it is the same problem that Owen reported last week.

The issue is with hanging dcap transfers. The client (on the worker node) issues a dccp command and, sometimes, this hangs. Doris is able to see the mover in the admin and web interfaces; it says "transferred -1 byte" and stays like that for days.

There is nothing unusual written in the dcap-door log files.

The problem doesn't affect very many of the transfers; she guessed that it was less than 10% of dcap transfers. The affected transfers seem to be randomly distributed: different PNFS-IDs, different pools, different doors.

CERN

Nothing special to report.

PIC

Gerard reported that things are OK at PIC.

He reported that he has seen some dcap transfers hanging on some worker nodes, but has been unable to reproduce the problem so far.

There were some problems previously that were due to faulty worker nodes; however, fixing those nodes did not make all of the problems go away.

The symptoms are that the job just hangs. The job efficiency drops while it is waiting for dccp to complete.

Gerard hasn't found any error messages in the job output. Looking at the door, Gerard sees an error message like: transfer failed: 666 error code, "end of input".

The client is "dccp" and PIC are using the latest version.

The problem affects only a small fraction of the transfers: maybe 1%, or perhaps less. Currently, PIC are not receiving any complaints about these transfers, so it is more of a curiosity at the moment; however, someone may complain about them eventually.

Gerd asked whether the transfers are active or passive. Gerard replied that they should always be "active": the pool connects to the client.

Gerd asked whether this was the same problem that FZK had reported. Doris gave some additional information: she does see the dcap door on the server side and the mover is still around, but she does not see a connection between the pool and the client.

Gerd recalled that "-1" is a default value reported when the connection between the pool and the client is yet to be established.

Doris was able to kill the mover, but this didn't unblock the client.

Does the connection between the client and door still exist? Yes.

Doris and Gerd will chat via Jabber tomorrow morning to diagnose a specific example of this problem.
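A minimal sketch of the kind of check this involves on a Linux worker node, assuming netstat is available and that the door listens on the default dcap port 22125 (adjust for the local setup): it lists established TCP connections so one can see whether a hanging dccp still holds its door control connection and whether a pool data connection was ever set up.

{{{#!python
#!/usr/bin/env python
"""Minimal sketch: list the ESTABLISHED TCP connections on a worker node
and flag the dcap door control connection, to help check whether a hanging
dccp still talks to the door and whether a pool data connection exists.
Parses 'netstat -tn' output (Linux); the default dcap door port (22125)
is an assumption and may differ per site."""

import subprocess

DOOR_PORT = 22125   # default dcap door port; adjust for the local setup


def established_connections():
    """Return (local, remote) address pairs of ESTABLISHED TCP connections."""
    output = subprocess.Popen(["netstat", "-tn"],
                              stdout=subprocess.PIPE).communicate()[0]
    pairs = []
    for line in output.decode("ascii", "replace").splitlines():
        fields = line.split()
        # Data lines look like: tcp 0 0 <local>:<port> <remote>:<port> ESTABLISHED
        if len(fields) >= 6 and fields[0].startswith("tcp") \
                and fields[5] == "ESTABLISHED":
            pairs.append((fields[3], fields[4]))
    return pairs


if __name__ == "__main__":
    for local, remote in established_connections():
        note = ""
        if remote.endswith(":%d" % DOOR_PORT):
            note = "   <-- dcap door control connection"
        print("%-25s -> %-25s%s" % (local, remote, note))
}}}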

BNL

Pedro reported that, last week, BNL had an issue with an 85 TB pool that was failing very often. This led to them restarting the pool, which ran for some 3 hours before simply refusing to proceed any further with its start-up. There was nothing in the log files saying why the start-up was blocking, and running dCache under strace did not reveal the cause either.

They came to the conclusion that they had to delete all of the pool's metadata.

[EDITOR'S NOTE: this is not the recommended recovery procedure]

The pool took one week to recover and come back online, which is too slow for BNL. The transaction rate was roughly 100,000 requests per day.

Pedro asked whether this start-up recovery procedure is something dCache is considering improving.

Gerd reported that it is, and that work is underway; the hope is to have this improved for the next major release (1.9.10). The improvement would allow end users to read existing data from a pool before the pool has completed its start-up; writing would still be disabled until the start-up has completed.

Pedro also mentioned that, during the start-up, they used a web server to make the files available. During this period they noticed that the web server was approximately twice as fast as dcap: a file transfer that takes ~40 s with HTTP takes ~80 s with dcap. Pedro was unable to provide additional information about this testing.

Pedro reported that he has also been working on their test of the new Terracotta-based distributed SRM. They feel that the major overhead now is starting up the JVM on the client side.

Pedro went on to discuss their PNFS instance. They have noticed that when the "block hits" figure in PostgreSQL ranges from 1,000,000 to 10,000,000 everything runs smoothly; however, when it increases to around 40,000,000, performance slows down.

Pedro mentioned that there were many messages in PnfsManager about requests timing out, and asked whether this could be the cause of the problem. Gerd explained that it is not: these messages sit in a queue within PnfsManager and, if they expire (because PNFS is too slow), they do so without consuming any database resources.

Paul asked what Pedro meant by "block hits" but Pedro was unable to say.

Pedro noted that, when the latency in PNFS is around 250 ms, their dCache instance struggles. When PNFS slows down further, so that the latency increases to 400 ms, dCache has real problems.

Gerd asked whether the database is dedicated to PNFS. Yes: there are no dCache databases in it.

Restarting PNFS "fixes" the problem and the number of "block hits" drops back down to an acceptable number (10,000,000).
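The "block hits" figure most likely refers to PostgreSQL's buffer-cache counters. A minimal sketch, assuming the counter in question is blks_hit from the pg_stat_database view and that the PNFS database is named "pnfs" (the connection parameters are placeholders):

{{{#!python
#!/usr/bin/env python
"""Minimal sketch: read PostgreSQL buffer-cache counters for the PNFS
database.  Assumes the "block hits" figure corresponds to
pg_stat_database.blks_hit; the database name and connection parameters
are placeholders."""

import psycopg2

conn = psycopg2.connect(host="localhost", user="postgres", database="postgres")
cur = conn.cursor()
cur.execute("""
    SELECT datname, blks_hit, blks_read
      FROM pg_stat_database
     WHERE datname = %s
""", ("pnfs",))

for datname, blks_hit, blks_read in cur.fetchall():
    total = blks_hit + blks_read
    hit_ratio = 100.0 * blks_hit / total if total else 0.0
    print("%s: blks_hit=%d blks_read=%d (%.1f%% served from cache)"
          % (datname, blks_hit, blks_read, hit_ratio))

cur.close()
conn.close()
}}}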

Triumf

Simon reported that production was OK last week. He also reported that he has increased the "hot replication" threshold from 90% to 95%. So far, the hot-replication issue reported last week has not returned, but he is still keeping a careful watch in case it comes back.

Simon's testing of the Java memory cost on Solaris continues; he'll write a ticket once he's finished.

Bandwidth on Solaris

Gerard noted that PIC has observed dCache making less use of the available bandwidth than might be expected: roughly 2 MB/s per stream rather than 20 MB/s. He wondered whether anyone else has seen this, or knows of a solution.

Paul suggested starting a thread on the Tier-1 support mailing list, <srm-deployment@…>, as the topic is likely to be of interest to multiple Tier-1 centres.
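One way to see whether the per-stream limit lies in the network path rather than in dCache itself is to measure raw single-stream TCP throughput between a pool host and a remote client. A minimal sketch, independent of dCache; the port and transfer size below are arbitrary choices:

{{{#!python
#!/usr/bin/env python
"""Minimal sketch: measure raw single-stream TCP throughput between two
hosts, independent of dCache.  Run 'server' on one host and
'client <server-host>' on the other; the port and transfer size are
arbitrary choices."""

import socket
import sys
import time

PORT = 40000           # arbitrary test port
CHUNK = 1024 * 1024    # 1 MiB send/receive buffer
TOTAL = 512 * CHUNK    # transfer 512 MiB in total


def server():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", PORT))
    listener.listen(1)
    conn, addr = listener.accept()
    received = 0
    start = time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    elapsed = max(time.time() - start, 1e-6)
    print("received %.1f MiB in %.1f s = %.2f MiB/s"
          % (received / 1048576.0, elapsed, received / 1048576.0 / elapsed))


def client(host):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, PORT))
    payload = b"\0" * CHUNK
    sent = 0
    while sent < TOTAL:
        sock.sendall(payload)
        sent += CHUNK
    sock.close()


if __name__ == "__main__":
    if len(sys.argv) >= 2 and sys.argv[1] == "server":
        server()
    elif len(sys.argv) >= 3 and sys.argv[1] == "client":
        client(sys.argv[2])
    else:
        sys.exit("usage: tcp_probe.py server | client <server-host>")
}}}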

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.