wiki:developers-meeting-20141216
Last modified on 12/16/14 15:53:42

dCache Tier I meeting December 16, 2014

[part of a series of meetings]

Present

dCache.org(Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(), PIC(), KIT(), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

PIC

Marc reported that PIC upgraded to v2.10.13 yesterday. There were some initial problems, in particular the SRM was slow at first. However, everything is now working fine.

They had some problems with the dcap protocol having anonymous access disabled by default, but it turned out that this affected their testing and not their users.

They see some errors in the srm server log file but it is unclear whether or not these errors are expected.

Some LHCb tests are failing. This is under investigation, but Marc reported that the tests passed when he ran the test script himself.

The tests use the gfal packages and, in particular, the lcg-cp command is failing.

A SAM test is failing, which is still under investigation.

CMS and ATLAS are currently generating only a light load. Everything looks OK just now, but this could change when CMS or ATLAS start increasing their load.

Marc also reported that running a find or an ls command on a directory with a large number of files sometimes hangs. This is under investigation and Marc will open a ticket once the situation is better understood.

NDGF

Ulf reported that the NDGF production system is working fine.

The development system is not working so well: ATLAS has noticed that they can't delete files using WebDAV. They claim they're getting a 401 error message; however, no corresponding error is logged by dCache.

Ulf is confident the problem is not related to SSLv3, as such problems are now logged.

Ulf also commented that the logging for WebDAV is "quite bad". He'll continue debugging tomorrow; however, this isn't too bad for NDGF as the problems are reported against only their test instance.

KIT

Xavier focused mostly on the KIT WLCG instances.

ATLAS

ATLAS has also reported a problem with KIT's ATLAS instance, similar to the problem reported against the NDGF test instance. In this case, ATLAS has opened a GGUS ticket.

The ticket currently contains a single example demonstrating the problem. Xavier has checked and the file in this example does not exist at KIT. Given this, dCache appears to be responding correctly.

Xavier checked and there is no mention of this file in today's billing log file. During the meeting, he also checked the Nov--Dec archive and found no mention of the file there.

Some specifics about the problem: ATLAS is using exactly one door and claims to receive a 401 response.

This is in contrast with KIT internal monitoring, which regularly uploads, downloads and deletes files through WebDAV. These tests continue to pass.
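A monitoring probe of this kind can be sketched as follows. This is only an illustration, not KIT's actual test: the function name `webdav_roundtrip`, the example URL and the file name are hypothetical, and a real dCache WebDAV door would normally also require credentials (for example an X.509 client certificate), which this sketch leaves out.

```python
import urllib.error
import urllib.request


def webdav_roundtrip(base_url, name, payload=b"probe",
                     opener=urllib.request.urlopen):
    """Upload, download and delete one file; return the HTTP status per step.

    `opener` is any callable that takes a urllib Request and returns a
    context-manager response with a `.status` attribute; it is injectable
    so the probe can be exercised without a real door.
    """
    url = base_url.rstrip("/") + "/" + name
    statuses = {}
    for method, data in (("PUT", payload), ("GET", None), ("DELETE", None)):
        request = urllib.request.Request(url, data=data, method=method)
        try:
            with opener(request) as response:
                statuses[method] = response.status
        except urllib.error.HTTPError as err:
            # urllib raises on 4xx/5xx responses, so a 401 such as the one
            # ATLAS reports is recorded here instead of aborting the probe.
            statuses[method] = err.code
    return statuses


# Hypothetical usage (endpoint and file name are placeholders):
# print(webdav_roundtrip("https://door.example.org:2880/data", "probe.txt"))
```

Recording the status of every step, rather than stopping at the first failure, makes it easier to see whether only the DELETE is rejected, as in the ATLAS report, or whether the whole round trip fails.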

During KIT's upgrade, the database upgrade was slow and initially failed because of inconsistencies in the database dating from before 2008. These were fixed manually.

KIT has also found a bug: if the dcap or ftp doors have an unlimited number of connections ("-1") then the door leaks memory until it runs out of memory. A work-around is to specify some large limit. A fix has been proposed and merged, and will be part of the next set of releases.

The worst part of the upgrade was the xrootd plugin. The "documentation is updated after you need it", but at least the WLCG repo now contains packages with the correct property files, and it's clear which properties to set for ATLAS.

CMS

As CMS does not use space reservations, there were no problems upgrading the CMS instance. CMS does have an xrootd plugin, for which there is still no regular RPM. The old plugin is working for 2.10.

LHCb

The LHCb instance was updated today. After the experience gained from upgrading the ATLAS instance, the LHCb upgrade was rather painless and was already finished by lunch-time. Everything seems to be working now. Xavier has notified CMS that SSLv3 is disabled with 2.10. CMS responded that this was not a problem for them.

Another (small, non-WLCG) dCache instance at KIT still needs to be upgraded to 2.11. This will happen in the new year.

Tickets

8547 -- NFS not notifying when file is deleted through NFS.

Tigran is working to fix this.

8548 -- Split-root archive: files are split into several pieces. If you have only one of the pieces then you get an error. The ticket may be closed.

8561 -- xrootd

The problem is now understood. It happens when the client is redirected to the door and the pool's transfer-finished message is somehow lost. Although the problem is understood, it isn't clear how to fix this issue.

8284 -- statistics

Not fixed yet.

Support tickets for discussion

[Items are added here automagically]

DTNM

The next Tier-1 meeting is Thursday 16 Dec. After this, the next meeting is Tuesday 6th January.