Last modified on 11/17/09 17:22:47

dCache Tier I meeting November 17, 2009

[part of a series of meetings]

Present,Paul,Owen,Tanja,Irina), IN2P3(), Sara(), Triumf(Simon), BNL(), NDGF(Gerd), PIC(Gerard), GridKa(Doris,Silke), Fermi(Jon), CERN()



Site reports


Gerd reported that NDGF have updated to a pre-release version of 1.9.5-9. This release included a back-port of the forthcoming WebDAV door, which now includes support for writing. After the upgrade everything seems to be working OK.


Simon reported that everything is good with Triumf production service.

Triumf continue to test dCache 1.9.5 with Chimera. They recently updated their test instance to 1.9.5-6 and will likely upgrade to 1.9.5-8 soon.

The testing has identified a couple of issues. Those that are dCache in origin have been reported to support@….

The overall impression of the testing process is that things are looking good. However, more testing is needed before a decision can be made about when Triumf will move to 1.9.5.


Gerard reported that things are OK at PIC: no issues.

He is going to investigate the new info-provider either this afternoon or tomorrow. This is a pre-release copy of the 1.9.5-9 info-provider that includes support for reporting near-line (i.e., tape) capacity in GLUE.


Jon reported that, locally, things are going fine at Fermi.

There have been some problems with SRM transfers from CERN. There are two issues:

  1. Once the CERN SRM server has issued a TURL, the remote site has some three minutes to start to use it. After this period has elapsed a client attempting to connect will get a connection refused message. Jon is experimenting with putting things into different queues.
  2. The CERN server uses the wrong SRM return code under certain circumstances: when reporting that some requested work is partially completed, the CERN SRM server mistakenly returns SRM_PARTIAL_SUCCESS instead of SRM_IN_PROGRESS.
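
To illustrate why the second issue matters to a client, here is a minimal sketch of a polling loop. It is not taken from the dCache or CERN SRM code. The function names (`is_terminal`, `poll_until_done`) and the grouping of codes into "terminal" and "in progress" are assumptions made for illustration; the return-code names are used as written in these minutes, with SRM_SUCCESS and SRM_FAILURE added as assumed terminal codes.

```python
# Hypothetical sketch: how a client might interpret request-level return codes.
# The grouping below is an assumption, not the behaviour of any real SRM client.
IN_PROGRESS = {"SRM_IN_PROGRESS"}
TERMINAL = {"SRM_SUCCESS", "SRM_PARTIAL_SUCCESS", "SRM_FAILURE"}


def is_terminal(code):
    """Return True when the client should stop polling for this request."""
    if code in TERMINAL:
        return True
    if code in IN_PROGRESS:
        return False
    raise ValueError(f"unexpected return code: {code}")


def poll_until_done(status_source):
    """Consume codes from an iterable until a terminal one is seen.

    Returns a tuple (final_code, number_of_polls).
    """
    polls = 0
    for code in status_source:
        polls += 1
        if is_terminal(code):
            return code, polls
    raise RuntimeError("status source exhausted before a terminal code")


# A server that reports unfinished work correctly keeps the client polling.
correct = ["SRM_IN_PROGRESS", "SRM_IN_PROGRESS", "SRM_SUCCESS"]
# A server that reports unfinished work as SRM_PARTIAL_SUCCESS stops it early.
premature = ["SRM_PARTIAL_SUCCESS", "SRM_SUCCESS"]

print(poll_until_done(correct))
print(poll_until_done(premature))
```

Under these assumptions, the client in the second case stops after the first poll and treats the request as finished with only part of the work done, which is consistent with the failures only showing up under heavy load.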

Jon had a good meeting with the CERN SRM developers and they hope to have a patch for their system tomorrow. No changes are needed in dCache SRM server or client software.

The good news is that sites will only see these effects when they are working at very high rates. Jon discovered them because he was deliberately stressing the system.


Doris reported that everything is fine at FZK.

The ongoing problem with dcap movers hanging continues, but the recent upgrade to 1.9.5-8 (?) fixed some (most ?) of the problem and hanging movers now occur less frequently.

Tigran and Doris discussed the problem some more: in the problem cases the client is gone and the door is gone, but the mover is still present.

Tigran asked whether the mover had transmitted any bytes, indicated by the "-1" value when doing an ls. Doris didn't know.

Tigran also asked whether the script that kills the dcap movers could be made to kill a mover only the second time it is found. The rationale is that the door waits some 10 seconds before it sends the kill, so it is possible that, through unlucky timing, the script discovers and kills a mover that was about to die anyway. Doris suggested that this wasn't happening, since there were movers that had been hanging for far longer than 10 seconds.
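
The "kill only the second time around" idea can be sketched as follows. This is an illustrative outline, not the FZK script: the class name `TwoPassKiller`, its `scan` method, and the use of plain strings as mover identifiers are all hypothetical, and how the list of orphaned movers is obtained is left out entirely.

```python
class TwoPassKiller:
    """Select movers for killing only if they were orphaned on two consecutive scans.

    Hypothetical helper: the caller supplies the IDs of movers that currently
    look orphaned (client and door gone) and performs the actual kill.
    """

    def __init__(self):
        # Movers seen orphaned on the previous scan and not yet selected.
        self._previous = set()

    def scan(self, orphans_now):
        """Return the set of mover IDs that should be killed on this scan."""
        current = set(orphans_now)
        # Only movers that were already orphaned last time are selected.
        to_kill = current & self._previous
        # Newly seen orphans become candidates for the next scan.
        self._previous = current - to_kill
        return to_kill


killer = TwoPassKiller()
print(killer.scan(["mover-1", "mover-2"]))  # first sighting: nothing selected
print(killer.scan(["mover-2", "mover-3"]))  # mover-2 seen twice: selected
```

For this to give the door time to send its own kill, consecutive scans would presumably need to be spaced further apart than the roughly 10 seconds the door waits; a mover that disappears between scans, such as mover-1 above, is never selected.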

FZK are still noticing the black-hole issue, where a pool accepts all pool-to-pool transfers but all such transfers fail. However, this now comes only from the ATLAS pools still running the older version of dCache; as the pools are upgraded the problem is going away.


Release plans

Gerard asked when the next 1.9.5 version will be released.

Tigran: 1.9.5-9 will include a new version of the info-provider and will fix the issue Pedro reported where the SRM throws an ArrayIndexOutOfBoundsException.

... and after that?

We don't know of any outstanding issue that would result in a 1.9.5-10 release.


Same time, next week.