wiki:developers-meeting-20090512
Last modified on 05/12/09 18:20:04

dCache Tier I meeting May 12, 2009

Present

dCache.org (Gerd, Owen, Paul, Tigran), IN2P3 (), SARA (Ron), Triumf (Simon), BNL (), NDGF (Mattias), PIC (Gerard), GridKa (), Fermi (Jon), CERN ()

Apologies

BNL

Agenda

  • Site reports,
  • Release notes,
  • DTNM.

Site reports

FZK

Everything seems OK.

NDGF

No issues

Triumf

No issues.

Simon reported that they have received a ticket about a batch of production jobs failing. This looks like a timeout issue with getTURL requests to the SRM. The problem hasn't recurred, so it has not been possible to investigate the cause.

PIC

Gerard reported that they have successfully upgraded to v1.9.2-5 this morning. It appears to be working fine, although they have noticed a number of SRM requests in PENDING state, which they didn't observe in the previous version. Is this normal?

They also reported seeing less information in their SRM-Watch than is available from the Fermi CMS dCache instance. This may be due to not running the latest version of SRM-Watch, which, as of today, is v1.1-2.

SARA

Ron reported that they had a problem with their mass-storage system last week. This resulted in an imbalance in the number of staging requests between pools. There was some discussion about how to include this in the cost calculation, so that the scheduler selects pools based on the number of staging requests.

The problem appears to be that each pool has a large number of allowed movers. As a result, existing activity has little impact on pool selection; instead, space costs dominate. A suggestion was to partition off the stage pools and adjust the cost factor to give a stronger dependency on the number of movers.
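The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not dCache's actual cost module: the Pool class, the linear cost formula, the factor names and all the numbers are hypothetical and chosen purely for illustration. It assumes a performance cost equal to the fraction of allowed movers in use, and a total cost that is a weighted sum of space cost and performance cost.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical pool description; not a dCache data structure."""
    name: str
    active_movers: int
    max_movers: int
    space_cost: float  # assumed to be precomputed, lower is better


def performance_cost(pool: Pool) -> float:
    # Assumption: load is the fraction of allowed movers currently in use.
    return pool.active_movers / pool.max_movers


def total_cost(pool: Pool, space_factor: float = 1.0,
               perf_factor: float = 1.0) -> float:
    # Assumption: total cost is a weighted sum of the two cost terms.
    return (space_factor * pool.space_cost
            + perf_factor * performance_cost(pool))


def select_pool(pools, space_factor: float = 1.0,
                perf_factor: float = 1.0) -> Pool:
    # The pool with the lowest total cost is chosen.
    return min(pools, key=lambda p: total_cost(p, space_factor, perf_factor))


# Illustrative numbers only: 40 active stage requests barely register
# when 1000 movers are allowed, so the small space-cost difference decides.
pools = [
    Pool("busy", active_movers=40, max_movers=1000, space_cost=0.30),
    Pool("idle", active_movers=0, max_movers=1000, space_cost=0.35),
]

default_choice = select_pool(pools)                     # space cost dominates
weighted_choice = select_pool(pools, perf_factor=5.0)   # movers weigh more
```

In this toy model the busy pool is still preferred with equal factors, while raising the performance factor (or lowering the number of allowed movers) makes the idle pool win. This mirrors the suggestion to adjust the cost factor for a separate group of stage pools.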

SARA had a problem with getTURLs for LHCb being very slow. This was found to be due to a lack of heap inside the JVM, which caused the JVM to garbage collect very aggressively. Increasing the memory available to the JVM fixed the problem.

Ron also reported that, over the course of 3 hours, a job made 600,000 requests to the SRM for a file using rfio. The SRM was able to sustain this load, perhaps because SARA runs the SRM configured not to attempt any retries.

There was a question about the getRecMaxWaitingReq parameter in SRM. What does it do? Gerd reported that this appears to be parsed but is not currently used within SRM; it is safe to ignore this parameter.

Fermi

Jon has a newly installed version (v1.9.2-4). What is the progress with the dcap problem? Tigran reported that he's looking at removing duplications in how information is set in PnfsManager; however, the information needs to be set if AL/RP values are to be used. Jon mentioned that he doesn't use spaces or AL/RP, so he would be happy if writing this information were made optional. Tigran agreed to provide a new version with this switch. This should allow testing whether the problem disappears.

Paul asked whether the problem had been narrowed down to dcap write operations; Jon confirmed this. It was also recommended to set the slow PNFS logging option within PnfsManager (Tigran recommended an initial value of around 500 ms). Jon agreed to do this.

Simon asked Jon how he is using dcap. Jon reported that he's using the limited dcap door, which uses a shared secret between the client and server. However, Tigran reported that, although these doors are supplied with dCache, they do not work with implicit space reservation: one cannot use them within a dCache instance that has a Space Manager. On a related note, as of 1.9.2, gsidcap doors support implicit space reservation.
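The idea behind a slow-operation logging threshold can be sketched as follows. This is not PnfsManager's implementation or configuration syntax. The function name timed_call and its parameters are hypothetical, and only the 500 ms starting value comes from the discussion above.

```python
import time


def timed_call(operation, threshold_ms=500.0, clock=time.monotonic):
    """Run operation() and report whether it exceeded threshold_ms.

    Hypothetical helper for illustration; `clock` is injectable so the
    behaviour can be checked without really waiting.
    """
    start = clock()
    result = operation()
    elapsed_ms = (clock() - start) * 1000.0
    is_slow = elapsed_ms > threshold_ms
    if is_slow:
        # A real service would write this to its log file instead.
        print(f"slow operation: {elapsed_ms:.0f} ms "
              f"(threshold {threshold_ms:.0f} ms)")
    return result, elapsed_ms, is_slow
```

Starting with a fairly high threshold keeps the log quiet during normal operation. It can then be lowered if the slow dcap writes do not show up.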

BNL

"BNL is doing PNFS postgres database upgrade to 64-bit today. We will not attend meeting. There is no operation issues to report."

Release notes

Gerard mentioned that he has difficulty providing experiments with a succinct summary of the changes between two releases; the release notes contain this information, but their format makes it hard to extract.

Several ideas for improving this were discussed. The developers will discuss this further and try to come up with concrete improvements.

DTNM

Due to the forthcoming SRM developers workshop being held at DESY, it was agreed to skip the usual Tier-1 support meeting next week. Instead, the next meeting will be in a fortnight's time: Tuesday 26th May 2009 16:15 CEST.