Last modified on 09/15/09 18:03:00

dCache Tier I meeting September 15th, 2009

[part of a series of meetings]

Present

dCache.org(), IN2P3(), Sara(Onno), Triumf(), BNL(), NDGF(), PIC(), GridKa(), Fermi(Jon,Gene), CERN(Andrea)

Agenda

(see box on the other side)

Site reports

SARA

Onno reported that SARA currently have two problems.

The first problem concerns pools being disabled because of too many open files. This was reported an hour ago, and Gerd has already replied with some information and a request for more details. Onno reported that staging was in progress when the problem occurred. They restarted the SRM and the pin manager; since both run on the same node, they restarted all components running on that node.
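
When chasing a "too many open files" error, one useful detail is how close the affected process (for example, a pool's JVM) is to its open-file limit. The following is a minimal Python sketch, assuming a Linux host with /proc and enough privileges to inspect the target process; the function names are hypothetical helpers, not part of dCache.

```python
import os
import sys


def open_fd_count(pid: int) -> int:
    """Count the file descriptors a process currently holds (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(pid: int):
    """Return the soft 'Max open files' limit of a process, or None if unlimited."""
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft = line[len("Max open files"):].split()[0]
                return None if soft == "unlimited" else int(soft)
    raise RuntimeError("no 'Max open files' entry in /proc limits")


def fd_usage(pid: int) -> str:
    """Summarise open-file usage against the soft limit."""
    used = open_fd_count(pid)
    limit = soft_nofile_limit(pid)
    if limit is None:
        return f"pid {pid}: {used} open files (no limit)"
    return f"pid {pid}: {used}/{limit} open files ({100 * used / limit:.0f}% of soft limit)"


if __name__ == "__main__":
    # Pass the PID of the process to inspect; defaults to this script itself.
    target = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(fd_usage(target))
```

Inspecting another user's process (such as a dCache service running as a dedicated account) usually requires running the script as that user or as root.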

The second issue is that dCache reports many files as unavailable. However, if the user attempts to copy these files, they are copied without any problem. This holds for SRM, direct gridftp and dccp. The issue is causing end-users (LHCb in particular) problems with their submitted jobs.

The problem affects many pools, and the affected files are present on the disk pools.

Tigran asked whether the files had been moved. Some had, as two pool nodes were drained; however, the problem affects all files, not just those that were moved.

lcg-cp reports the files as "unavailable". The verbose output of the SRM ls command (srm ls -l) is extensive and includes a file locality of "unavailable".

Tigran also asked whether the files are in a space token (answer uncertain).

SRM ls does not know which protocol the end-user will use to access a file, so it must assume a hard-coded default. This default is currently dcap, so files that are not accessible via dcap are reported with locality "UNAVAILABLE" (perhaps this should be changed to gsiftp in the future).
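
The behaviour described above can be illustrated with a simplified model. The Replica class, the locality function and the per-pool protocol sets below are hypothetical, not dCache's actual implementation; only the dcap default and the UNAVAILABLE result come from the discussion, and the other locality names are taken from the SRM v2.2 TFileLocality values.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the hard-coded default protocol used by SRM ls.
DEFAULT_PROTOCOL = "dcap"


@dataclass
class Replica:
    """A disk copy of a file (hypothetical model, not a dCache class)."""
    pool: str
    pool_enabled: bool
    # Protocols through which this pool may serve the file (assumed).
    protocols: set = field(default_factory=set)


def locality(replicas, on_tape=False, protocol=DEFAULT_PROTOCOL):
    """Return an SRM-style locality string for one file.

    SRM ls cannot know which protocol the client will use, so it falls
    back to DEFAULT_PROTOCOL unless a protocol is given explicitly.
    """
    online = any(r.pool_enabled and protocol in r.protocols for r in replicas)
    if online and on_tape:
        return "ONLINE_AND_NEARLINE"
    if online:
        return "ONLINE"
    if on_tape:
        return "NEARLINE"
    # Disk copies exist, but none is reachable with the assumed protocol.
    if replicas:
        return "UNAVAILABLE"
    return "LOST"


if __name__ == "__main__":
    gsiftp_only = [Replica("pool_a", True, {"gsiftp"})]
    print(locality(gsiftp_only))                     # uses the dcap default
    print(locality(gsiftp_only, protocol="gsiftp"))
```

For a pool that only permits gsiftp, the default call yields "UNAVAILABLE" while the gsiftp call yields "ONLINE", which is the motivation for switching the default. In this model, a file that dccp can read would be reported as "ONLINE", so the model alone does not account for the SARA case.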

This might explain the problem, except that in this case the files are accessible using dccp.

Investigation will continue through the normal ticket system.

NDGF

AFAIK, no big issues.

One ticket concerned a timeout in the pin manager. It has been investigated and is believed to be caused by a 1-hour time-out in the pin manager.

Triumf

Simon reported that Triumf are planning to upgrade to dCache v1.9.3 tomorrow.

No issues.

Fermi

Jon reported that things are running well here.

Fermi are in the process of deploying a modified dccp with a longer time-out. This allows transfers to complete when checksumming is enabled and the site is very busy (a known problem).

BNL

From Pedro, via email:

I would like to report, though that there is nothing to report from our side.
dcache is working well.

PIC

From Gerard, via email:

today we're in scheduled downtime and won't be able to connect to the meeting. Besides this we've nothing to report.

DTNM

Same time, next week.