wiki:developers-meeting-20091215
Last modified on 12/18/09 12:59:35

dCache Tier I meeting December 15, 2009

[part of a series of meetings]

Present

dCache.org(), IN2P3(), Sara(Ron and Onno), Triumf(Simon), BNL(), NDGF(Gerd), PIC(Gerard), GridKa(Doris/Silke?), Fermi(), CERN(Andrea)

Agenda

(see box on the other side)

Site reports

Sara

Onno reported that things are fine.

Sara are planning to upgrade all their nodes to the latest version on January 11th.

They are also planning to migrate from PNFS to Chimera in the week starting Monday 11th January. The migration process has been tested with a small test-database and with a copy of their production PNFS database. So far things seem to be going well.

Ron reported that there was one remaining issue to do with how Chimera and dCache handle locality information. This results in tests failing. Paul and Tigran will continue talking with Ron and Onno about this issue.

Short reads in gsi-dcap

A problem was reported with gsi-dcap that results in a COM.claymoresystems.ptls.SSLPrematureCloseException: Short read exception being thrown. The problem happens relatively infrequently: about 1 or 2 times per day.

IN2P3 have reported the same issue.

This issue has been reported as ticket RT #5313.

Tigran had seen the report but didn't have any specific further information. A "short read" normally means that the connection was closed prematurely, or that the client is attempting to connect using an unencrypted connection, among other possibilities.

The problem is unlikely to be caused by the batch system killing the transfer, as the VO would know this had happened and would not complain to the site.

Speeding up migration checks

Ron reported that they had attempted to split the md5sum check to allow it to run on eight cores concurrently when testing against PNFS. This proved disappointing, as they didn't see the performance increase they were expecting.

Paul explained that, whilst it is fine to run the md5sum check against PNFS, the main motivation is to run it against Chimera. The other important thing to realise is that, in PNFS, the dbserver daemons are single-threaded. This means that concurrent queries against the same (PNFS) database will not yield any performance increase, as PNFS will serialise those queries. If the md5sum work is split by PNFS database, then running those checks concurrently should show a performance increase, provided the "fast" PNFS is used.
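Paul's point about splitting by database rather than by core can be sketched as follows. This is a minimal illustration, not the actual test script; the database names, file lists, and use of threads are assumptions made for the example (a real run would use separate processes or hosts for true CPU parallelism).

```python
# Sketch: one worker per (PNFS) database, since the single-threaded
# dbserver serialises concurrent queries within a database.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of(path):
    """Checksum one file in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def check_database(paths):
    """Checksum all files of one database sequentially; queries within
    a single database would serialise anyway."""
    return [md5_of(p) for p in paths]

def check_all(files_by_db):
    """Run one worker per database, so databases are checked concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(files_by_db))) as pool:
        results = pool.map(check_database, files_by_db.values())
        return dict(zip(files_by_db, results))
```

The split here mirrors the advice in the discussion: parallelism only pays off across databases, not within one.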

Another possibility is to install the PNFS databases on multiple machines. This would allow truly concurrent testing at a cost of increased complexity in the testing environment. This is the approach DESY are taking for their forthcoming migration on January 4th and Gerd confirmed that NDGF took this approach when they migrated their instance.

FZK

Doris reported on several issues from the past week.

Chimera migration

Doris reported that FZK are planning to migrate their ATLAS instance from PNFS to Chimera on Monday 1st February 2010. This is predicated on successful test-migrations before that date.

Tigran and Paul will check their diaries to confirm whether they will be available for that week.

Problem with SRM in 1.9.5-9

Doris reported last week that FZK upgraded their dCache instance to 1.9.5-9, only to revert to 1.9.5-4 after ATLAS transfers started failing. This was surprising, as FZK's internal tests were succeeding.

At that time it was unclear whether the problem was with the dCache version or with the new (now unified) dCacheSetup file. Doris reported that they are now running the new (unified) configuration with 1.9.5-4 without any problems, so the dCache version is the culprit.

Gerd explained that this problem is believed to be understood. There is a race condition in a library that dCache uses. The race condition appears when dCache reads the CRL files for a CA; this happens on startup and periodically during normal operation. If a request arrives from a user whose certificate was signed by a CA whilst dCache is (re)loading that CA's CRL file, then the CRL is considered broken. Any CA with a broken CRL is disabled, which explains why an ATLAS user may find their transfers failing while the FZK internal tests (which likely use a different CA) succeed.
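The race Gerd describes can be illustrated with a toy model (this is not the JGlobus code; the class and method names are invented for the sketch): a reader that looks up a CA's CRL while an unsynchronised refresh briefly removes the entry will wrongly conclude the CRL is broken, whereas building the new data first and swapping it in under a lock leaves no such window.

```python
import threading

class CrlStore:
    """Toy CRL store: a CA whose CRL entry is absent is treated as disabled."""

    def __init__(self):
        self._crls = {"ca-atlas": ["crl-v1"]}
        self._lock = threading.Lock()

    def reload_racy(self, ca):
        # Broken pattern: the entry is briefly missing during the refresh,
        # so a concurrent validate() can see the CA with "no CRL".
        del self._crls[ca]
        self._crls[ca] = ["crl-v2"]

    def reload_safe(self, ca):
        # Fixed pattern: build the replacement first, then swap it in
        # under the lock, so readers never observe a missing entry.
        new = ["crl-v2"]
        with self._lock:
            self._crls[ca] = new

    def validate(self, ca):
        with self._lock:
            return ca in self._crls
```

With the safe reload, validation can never land in the gap between removal and re-insertion, which is the essence of the library fix mentioned below.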

The problem has been fixed by patching the JGlobus library. The new JGlobus library will be included with 1.9.5-11.

Disappearing pools

Doris also reported a problem where the PoolManager (in their CMS instance) suddenly thought that all pools were off-line, so no transfers were succeeding.

Before restarting the PoolManager, Doris obtained a stack-trace. The stack-trace showed all threads were waiting. Tigran asked her to send the stack-trace to support@…, if it still exists. Doris said she can send it.

After obtaining the stack-trace, Doris restarted the PoolManager. This allowed the system to recover and the dCache instance started satisfying user requests again.

PIC

Gerard reported that everything is fine.

PIC are planning to upgrade their dCache instance to 1.9.5-10 next Friday. Given the issue where a CA may disappear, which will be fixed in 1.9.5-11, Gerard asked when 1.9.5-11 will be released.

Tigran reported that 1.9.5-11 is expected either at the end of this week or early next week. Given the importance of fixing the SRM issue, the release may be brought forward.

Triumf

Simon said he had nothing to report.

CERN

Andrea said there was nothing special from CERN.

NDGF

Gerd reported that NDGF have been suffering quite badly from the CA "disappearing" problem that Doris reported. This problem is believed to be fixed.

Only one problem is still showing up in the ATLAS dashboard: FTS reporting "no performance markers found" [or something similar]. This error message can happen if FTS receives performance markers but no progress is made within a certain time-limit; so the problem seems to be that there is no progress in the first 20 seconds.
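The behaviour described above (markers arriving but no byte progress within the time-limit) can be sketched as follows; the function and the numbers are invented for illustration, not taken from FTS.

```python
def first_stall(markers, timeout):
    """markers: list of (seconds, bytes_transferred) performance markers,
    in time order.  Return the time at which the transfer first went
    longer than `timeout` seconds without byte progress, or None.

    Note that markers can keep arriving throughout a stall: it is the
    byte count, not the marker stream, that must advance."""
    last_progress_t, last_bytes = markers[0]
    for t, b in markers[1:]:
        if b > last_bytes:
            last_progress_t, last_bytes = t, b
        elif t - last_progress_t > timeout:
            return t
    return None
```

In this model a transfer that sends markers every few seconds but moves no data in its first 20+ seconds is flagged as failed, matching the symptom seen on the dashboard.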

Doris reported that they've seen the same problem at FZK: CMS have reported the same error message for transfers between FZK and some German Tier-2 centres, including DESY. Tigran was interested in whether DESY was the server or the client in these transfers: if DESY is the server then the dCache developers would have easy access to configure any additional logging. It wasn't clear which end originated the failing transfers, but Doris will try to find out.

There was a short discussion about possible causes. Networking was raised as a potential problem, but Tigran thought that this was unlikely as there is a 10Gb dedicated link between FZK and DESY.

Fermi

Jon reported via email:

Nothing to report.

BNL

Pedro reported via email:

The only unanswered 'issue' we currently have is "queued p2p transfers spawn a thread" post I sent last week.

SRM-PUT space reservation issue

Simon brought up a topic that was the subject of recent discussion on the user-forum. The issue is that transfers fail because a failed previous transfer for the same file was not cleaned up properly. The problem occurs when the client cancels a request after a very short time (some four seconds) and then retries the transfer. Under these circumstances, it appears that the space manager is unaware that the transfer was cancelled, so the record of the previous transfer is not removed. The presence of this stale record prevents the fresh transfer from succeeding.
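A toy model of the suspected failure mode (all names invented; this is not the actual space-manager code): if the cancellation never reaches the space manager, the stale record blocks the retried upload of the same file.

```python
class SpaceManager:
    """Toy space manager tracking one active PUT record per path."""

    def __init__(self):
        self.active = {}  # path -> transfer id

    def start_put(self, path, tid):
        # A fresh upload is refused while a record for the path exists.
        if path in self.active:
            raise RuntimeError("file busy: stale transfer %s" % self.active[path])
        self.active[path] = tid

    def cancel(self, path, notify_space_manager=True):
        # The reported bug behaves as if this notification never arrives,
        # leaving the record behind after the client's early abort.
        if notify_space_manager:
            self.active.pop(path, None)
```

In this model a cancel that bypasses the space manager leaves the record in place, and the client's quick retry fails exactly as described.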

Timur reported that Dmitry has been able to reproduce the problem only by suspending a transfer, without killing it, and then attempting to start a new transfer for the same file. However, since it is claimed that FTS is not doing this, this reproduction of the problem is not satisfactory.

Gerd reported that he has received reports from the ARC community that they have observed the same class of problems. They, too, say they properly kill the transfer.

Effort in investigating this issue is ongoing.

1.9.6 released

Patrick announced that dCache.org has now released 1.9.6-1. The major new feature of the 1.9.6 series is the availability of a WebDAV door.

This door will allow a user with a web-browser to:

  • see file listings
  • download data (as anonymous user)
  • download data (authenticated with grid certificate)

As a WebDAV service, the end-user can (in addition to the above list),

  • undertake namespace operations (rename, move, delete)
  • upload data.

All WebDAV client tools should work with the dCache WebDAV door. Clients exist for Windows, Mac, and Linux (both Gnome and KDE).

Gerd reported that the Swedish Tier-2 are planning to upgrade to 1.9.6-1 in the following days.

Tigran asked Ron whether he would be interested in trying the WebDAV door. If it proves sufficient for SARA then support for the existing HTTP door will be dropped.

Ron asked how the new door is enabled. Gerd replied that one only needed to modify the door's node_config file to enable the WebDAV for insecure (http) communication. For secure (https) connections, a little more configuration is needed along with suitable certificates.
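As a sketch only: enabling the insecure door might look something like the fragment below. The key names here are an assumption made for illustration, not confirmed settings; consult the 1.9.6 documentation for the authoritative node_config syntax.

```
# Hypothetical node_config fragment -- key names are assumptions,
# check the 1.9.6 release documentation before use.
NODE_TYPE=door
WEBDAV=yes      # enable the insecure (http) WebDAV door
```

For https, per the discussion above, additional configuration and suitable host certificates would also be required.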

Support tickets for discussion

[Items are added here automagically]

DTNM

Since we are approaching the festive season, the meetings on Tuesday 22nd December and Tuesday 29th December are cancelled. The next meeting will be in the new year: Tuesday 5th January.

Merry Christmas and a Happy New Year everyone!