Last modified on 06/17/09 16:00:33

dCache Tier I meeting June 16, 2009

Present

dCache.org(Patrick, Gerd, Owen, Paul), IN2P3(), Sara(Ono), Triumf(Simon), BNL(), NDGF(), PIC(Gerard), GridKa(), Fermi(Jon), CERN()

Apologies

BNL, NDGF

Technical issues

The FZK and Fermi dCache teams were unable to connect due to an ongoing issue with the DESY MeetingPlace system.

Agenda

(see side-panel)

Site reports

Fermi

Nothing to report: dCache working fine.

NDGF

Gerd reported for NDGF that they have no outstanding issues.

NDGF have deployed a pool using an alpha-quality, pre-release version of v1.9.3; this is to help out with the QA of the release process. This pool lost its metadata catalogue (this was not due to any problem in dCache, but rather to a mistake made when reinstalling the operating system). This gave NDGF a chance to test the automated "healer" recovery process, which can cope with this sort of problem. The recovery succeeded: no data was lost and the pool is now operational.

SARA

Things are fine.

Triumf

Last week dCache worked fine.

PIC

No issues to report

For the last three days CMS has been failing SAM tests due to a pool being overloaded; they needed to reduce the threshold so that pool-to-pool transfers could take place, alleviating the load on the pool.

Gerard asked if there is an environment variable for the SRM client that makes it default to SRM v2, as an alternative to explicitly setting the "-2" command-line option.

BNL

(via email)

The only issue we have to report is that we have some files which are in disk-only areas, but the SRM takes too long to answer and eventually the client times out. We've traced this to the fact that pinning (I repeat: pinning, not the retrieval of the pnfsid+storageinfo from pnfs) is taking too long.

xrootd

Gerd reported that Alice / xrootd recently upgraded their central catalogue. An effect of this upgrade is that file paths can now contain a double forward-slash.

The xrootd doors (in currently available versions of dCache) do not support filenames with a sequence of two forward-slash characters. Because of this, transfers were failing. Alice have implemented a work-around.

A patch for fixing this issue has been added. The next releases of 1.9.1 and 1.9.2 will include the fix.
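The kind of normalisation involved can be illustrated with a minimal sketch. This is not the dCache patch itself (which is not shown in these minutes); the function name collapse_slashes and the example path are hypothetical, and the sketch assumes it is applied only to the path component, not to a full URL containing "://".

```python
import re


def collapse_slashes(path: str) -> str:
    """Collapse every run of two or more '/' characters into a single '/'.

    Hypothetical helper for illustration only; apply it to the path
    component, not to a full URL (where '//' after the scheme is meaningful).
    """
    return re.sub(r"/{2,}", "/", path)


if __name__ == "__main__":
    # Example path is made up for illustration.
    print(collapse_slashes("/alice//data/run1//file.root"))
```

A client-side work-around could apply such a function before handing the path to the door; paths without repeated slashes pass through unchanged.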

Step09 debrief

NDGF: the problems we had were not dCache's fault: ATLAS had some problems with pre-staging, but these were resolved.

Restoring into a token

Gerard asked a question about how to restore into a space token.

This can't be done with dCache, but there may be a work-around. The problem is that PIC is using the same poolgroup for reading/writing and for staging from tape. CMS allow this mixed mode, so they have no issue.

The immediate work-around to the problem is to allocate some 10 TB of space for staging; but this solution isn't sustainable, as the storage was borrowed from another activity.

Patrick suggested one possible solution: create a linkgroup and assign a token that reserves less than that linkgroup's capacity. Re-staging could go into that linkgroup, so allowing load-balancing. This can potentially break the space reservation, as the space manager is unaware of restores, so one must manually protect the reservation by monitoring the free space within the poolgroup and manually intervening when this is less than the free space in the token. This is the solution NDGF have implemented and they monitor this by hand.

Gerard wasn't keen on this, concerned that a bug in ATLAS software might trigger many restores; such an inrush of activity would block their ability to write data.

Patrick: then you need to have a physically separate staging area. There are some limitations; for example, one cannot do pool-to-pool into a space token.
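The manual check Patrick described for the shared-linkgroup work-around (intervene when the free space in the poolgroup drops below the free space in the token) can be sketched as follows. This is only an illustration under stated assumptions: the names SpaceSnapshot and reservation_at_risk are hypothetical, and the two byte counts are assumed to be gathered by the site's own monitoring, not by any dCache API shown here.

```python
from dataclasses import dataclass


@dataclass
class SpaceSnapshot:
    """Hypothetical record of two values an operator would read by hand."""
    poolgroup_free_bytes: int  # free space currently left in the poolgroup
    token_free_bytes: int      # unused space still promised by the token


def reservation_at_risk(snapshot: SpaceSnapshot) -> bool:
    """Return True when restores have eaten into the reserved space.

    The space manager is unaware of restores, so the reservation is only
    honoured while the poolgroup has at least as much free space as the
    token still promises.
    """
    return snapshot.poolgroup_free_bytes < snapshot.token_free_bytes


if __name__ == "__main__":
    TB = 10**12
    # Made-up numbers for illustration.
    snapshot = SpaceSnapshot(poolgroup_free_bytes=3 * TB, token_free_bytes=5 * TB)
    if reservation_at_risk(snapshot):
        print("Intervene: free space in the poolgroup is below the token's free space")
```

Such a check only detects the problem; it does not prevent a burst of restores from filling the pools, which is the concern Gerard raised.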

Chasing up tickets

Patrick asked Jon whether the ticketed issue, in which dCache re-stages a file multiple times into different pools, is still affecting Fermi.

Jon replied that it is, but to a lesser extent: occurrences are correlated with dCache being busy, when the maximum number of movers is exceeded. The time-out then expires and the re-stage is retried.

Patrick and Jon to chase this issue further off-line.

DTNM

"Same time, next week"

Tuesday 23rd June 2009.