Last modified on 11/25/09 15:22:37

dCache Tier I meeting November 24, 2009

[part of a series of meetings]

Present

dCache.org(Timur), IN2P3(), Sara(Onno), Triumf(), BNL(), NDGF(), PIC(Gerard), GridKa(), Fermi(Jon), CERN(Andrea)

Agenda

(see box on the other side)

Site reports

Sara

Onno reported that things are running "quite fine."

There are two issues:

  1. There was a problem with the SRM's stability. This is now fixed. The problem was due to usage of a non-thread-safe library, which was fixed in 1.9.5-9. Since upgrading the SRM node to this version of dCache, the problems have gone away.
  2. There was a problem where the gsi-dCap doors appeared not to accept end-users' VOMS certificates. Ron reported that, after a reboot, they worked.

Onno also reported that Sara are in the process of commissioning new hardware. They are introducing 28 additional pool nodes. These will have ~100 TiB per node, so the overall increase in capacity will be ~2.7 PiB. The new hardware is being introduced in batches; the current batch is 12 nodes, with 8 already running within dCache. The remaining nodes will be commissioned early next year.
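As a quick check of the capacity figure above, the arithmetic can be sketched as follows. The node count and per-node size are the approximate values quoted in the report; the function name is only illustrative.

```python
# Approximate figures quoted in the site report.
NEW_NODES = 28
TIB_PER_NODE = 100      # "~100 TiB per node", so the result is approximate too
TIB_PER_PIB = 1024      # binary prefixes: 1 PiB = 1024 TiB


def added_capacity_pib(nodes: int, tib_per_node: float) -> float:
    """Return the total added capacity in PiB."""
    return nodes * tib_per_node / TIB_PER_PIB


total_pib = added_capacity_pib(NEW_NODES, TIB_PER_NODE)
print(f"{total_pib:.2f} PiB")  # 2800 TiB / 1024
```

This gives 2800 TiB, which rounds to the ~2.7 PiB stated in the report.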

Fermi

Jon reported that things are going fine at Fermilab.

They had some configuration changes late last week.

Some delays were discovered and traced to the transfer manager. The Fermi-dCache team are working to understand these delays further. Fermi are observing ~1% of transfers failing, at 600 MiB/s. At that level, Jon is happy to take the time to investigate the cause of the problem fully.

The transfers that CERN are initiating are also sometimes failing. The observed rate is somewhat lower: ~0.5%. CERN are investigating this.

Part of the problem is the SRM's retries: it retries even though the transfers are "doomed to fail". After a while, FTS will retry the transfer, which then succeeds.

Andrea asked whether this is related to how SRM is implemented in dCache.

Jon explained that, early last week, CERN made a patch to their SRM implementation to fix a problem where Castor would return an incorrect result-code for SRM COPY commands. Once this patch was in place, Fermi/dCache then also had to apply a patch to update dCache's handling when the (now correct) return code is supplied.

However, the SRM COPY command isn't responsible for the remaining errors; the problem lies elsewhere.

Owen asked on what timescale Jon would be moving to the 64-bit version of PNFS. Timur and Jon reported that Fermi has been using the 64-bit version for some time now (many months, IIRC), albeit not the official version. Jon anticipates moving the CMS instance to the official 64-bit PNFS during the Christmas period.

PIC

Gerard reported that PIC is having no issues: everything is running smoothly.

He had three questions: GSI-dCap, tape protection and 1.9.5 release plans.

GSI-dCap

He asked for more information about gsi-dcap and VOMS authorisation.

Onno reported that the problem at Sara was with a user who is a member of two VOs (ATLAS and LHCb). The user was mapped to the wrong VO because the gsi-dcap door was not respecting the VOMS FQAN; instead, it was looking only at the kpwd file. That file has the user mapped to both VOs; however, the door was taking only the first matching line in the file. The quick-fix solution was to change the ordering of the user's mappings in the kpwd file: the user is still mapped to a single VO, but now it is the "correct" VO.

Ron says that gsidcap should first look at the VOMS FQAN.
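The difference between the observed first-match behaviour and the FQAN-first behaviour Ron describes can be sketched as below. This is only an illustrative model: it is not dCache or gPlazma code, the mapping list is not the real kpwd syntax, and the DN, FQANs and function names are hypothetical.

```python
from typing import Optional

# Hypothetical stand-in for the mapping file: ordered (DN, VO) entries.
# This is NOT the real kpwd file format.
Mappings = list[tuple[str, str]]

USER_DN = "/O=Example/CN=Some User"  # made-up DN, for illustration only


def map_first_match(dn: str, mappings: Mappings) -> Optional[str]:
    """Observed behaviour: the first line matching the DN wins."""
    for entry_dn, vo in mappings:
        if entry_dn == dn:
            return vo
    return None


def vo_from_fqan(fqan: str) -> str:
    """Take the VO to be the first path component of the FQAN."""
    return fqan.strip("/").split("/")[0]


def map_fqan_first(dn: str, fqan: Optional[str], mappings: Mappings) -> Optional[str]:
    """Suggested behaviour: prefer the mapping that agrees with the FQAN."""
    if fqan:
        wanted = vo_from_fqan(fqan)
        for entry_dn, vo in mappings:
            if entry_dn == dn and vo == wanted:
                return vo
    # No FQAN, or no mapping agrees with it: fall back to first match.
    return map_first_match(dn, mappings)


original_order = [(USER_DN, "atlas"), (USER_DN, "lhcb")]
reordered = [(USER_DN, "lhcb"), (USER_DN, "atlas")]
```

With `original_order`, `map_first_match` always yields `atlas`, whatever VO the user's proxy was issued for. Swapping the lines (`reordered`) only moves the problem to the other VO, whereas `map_fqan_first` picks the VO named in the FQAN under either ordering.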

gPlazma can have several mappings.

Onno to send gPlazma if not found.

Onno reported that Sara is using dCache v1.9.5-8 on most nodes, except for the SRM, which is running 1.9.5-9.

Tigran asked whether the problem of dcap transfers being mixed up between different clients has been fixed. Onno said he would have to ask Ron.

Tape protection

What were people's experiences when enabling tape protection in 1.9.5-9?

Onno said Sara has experience with running dCache with tape protection enabled.

Gerard also asked whether one can enable tape protection for a single VO. The rationale is that only a single VO (ATLAS) has asked for tape protection; the other VOs have not (yet) asked for this.

Tigran reported that, no, one cannot enable tape protection for a subset of VOs: it is either enabled or not. However, it should be possible to simulate enabling tape protection for a single VO by including a VO-specific wildcard in the tape-protection file.
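The wildcard idea can be illustrated with a small model. It assumes the tape-protection file acts as an allow-list of patterns describing who may stage files from tape; the minutes do not show the file's syntax, so the patterns, VO names other than ATLAS, the role name and the function below are all hypothetical and are not the real dCache format.

```python
import re

# Hypothetical allow-list of FQAN patterns permitted to stage from tape.
# Assumption: anything not matched by a pattern is refused.
ALLOWED_FQAN_PATTERNS = [
    r"/atlas/Role=production",  # ATLAS is protected: only one role may stage
    r"/cms(/.*)?",              # VO-specific wildcard: every CMS user may stage
    r"/lhcb(/.*)?",             # VO-specific wildcard: every LHCb user may stage
]


def may_stage(fqan: str, patterns=ALLOWED_FQAN_PATTERNS) -> bool:
    """Return True if the FQAN fully matches any allow-list pattern."""
    return any(re.fullmatch(pattern, fqan) for pattern in patterns)
```

In this model, protection is switched on globally, but the wildcard entries let every member of the other VOs through, so in effect only ATLAS users are restricted.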

new 1.9.5 releases?

Gerard also asked whether there were any pending bugs that are likely to be fixed in 1.9.5. The reason is that PIC are planning to upgrade to 1.9.5-9 on 21st December, but would like to know if a newer version of dCache is likely to be available by then.

Tigran reported that we don't know of anything specific but, given the current trend, we'd anticipate having a version of 1.9.5 newer than 1.9.5-9 by 21st December. The 1.9.5 series consists of patch-fix-only releases: a site may choose to upgrade to the latest 1.9.5 release to fix a bug they are experiencing; however, it's fine for them to stick with a less-recent version of 1.9.5 if they are not seeing any problems.

CERN

Andrea asked for access to the dCache bug-tracking system. Tigran said that RT (the ticketing system) is more for site support, so it may contain confidential information; the trac ticketing system, however, is where actual code changes are logged. Is trac access what Andrea wants? Andrea said yes.

DTNM

Same time, next week.