Last modified on 10/22/08 19:40:43

Tier I, SRM deployment phone conference, Oct 21

Participants: Paul (chair), Irina, Owen, Gerd (NDGF), Gerard (PIC), Doris (FZK), Flavia (CERN), Iris (BNL), Timor (FermiLab)

Apologies: Ron (SARA)

Agenda

  • Operational issues,
  • Experiences with 1.9.x-series (Gerd),
  • GFAL usage,
  • AOCB

Operational issues

Site report for SARA (via email)

Ron was unable to make the meeting, but sent this via email:

I will not be able to join the phone conf today. The issue that I reported last week has been solved. There is one issue that we had last Friday. It resulted in a number of failed SE SAM tests and GGUS ticket 42486, submitted by ATLAS. It led to error messages like "transport end point not connected" for the gridftp door and "CGI-SOAP...connection closed" for the SRM door. The problem went away by itself, but we are still investigating the issue. We will get back to you on this one.

Site report for NDGF

Business as usual: 1.9.1 pre-release on nodes and a mixture of versions on pool nodes (May-CCRC up to 1.9.1).

Site report for PIC

Nothing to report.

Site report for FZK

Nothing to report; at the beginning of next year FZK will be splitting off ATLAS.

Issues from CERN

There are (or were) some transient timeout problems with doors.

Doris: FZK has a problem with hardware/memory issues on the PNFS node, which has rebooted itself a couple of times. New memory is being added after a pending down-time.

Flavia: We've also seen these problems with SARA and IN2P3.

Gerd: Do we have input from the experiments on what triggers these problems?

Doris: We can replicate this. Our own testing runs lcg-cr with a timeout of 60 s, and this sometimes fails.

Gerd: Could these error messages be some (strange) response to the system being slow?

Flavia: There are loads of GGUS tickets; one example is #42486.
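Gerd's question, whether the errors are a response to a slow system, can be probed by recording timeouts separately from outright failures when repeating a test with a 60 s limit, as Doris describes. The following is only a minimal sketch: the function names are hypothetical, and the stand-in command is a placeholder for whatever client command a site actually runs.

```python
import subprocess
import sys


def run_probe(command, timeout_s=60.0):
    """Run one probe; return 'ok', 'failed' (non-zero exit) or 'timeout'."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The command was still running when the limit expired.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def summarise(command, attempts, timeout_s=60.0):
    """Repeat the probe and count each outcome."""
    counts = {"ok": 0, "failed": 0, "timeout": 0}
    for _ in range(attempts):
        counts[run_probe(command, timeout_s)] += 1
    return counts


# Stand-in command (an assumption): replace with the site's real test command.
STAND_IN = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(summarise(STAND_IN, attempts=3))
```

A high share of "timeout" outcomes relative to "failed" ones would point towards slowness rather than a hard error in the door.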

Site report for BNL

Things are fine.

Site report for Fermi

Things are fine.

Fermi are having an issue with checksums of files stored on Enstore.

GridFTP v2 in dCache now supports checksum negotiation, with checksums in MD5 format. There is a possible complication with this: ATLAS and CMS both use ADLER32. Flavia knows of no experiment requiring different checksums.

Flavia was able to confirm that, with an SRM File Metadata request, the server can specify the checksum type and checksum value.

GridFTP v2 is needed to take advantage of checksum negotiation; GridFTP v1 does not support this.

Patched clients (for GridFTP v2) are in WLCG certification, and the SL5 clients also support this out-of-the-box. GridFTP v2 in dCache has supported both passive pools (i.e., without using the door as a proxy) and checksum negotiation since v1.8.0-*. What is happening here is that the clients are catching up.

Fermi are working on a solution to this problem and will present this at the dCache team meeting tomorrow.
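To illustrate why the checksum type matters as well as the value, here is a minimal sketch that computes both ADLER32 and MD5 for the same file using only the Python standard library. The function names are hypothetical and this is not dCache, Enstore or GridFTP code; it only shows that a stored value is meaningful solely when compared against a checksum computed with the same algorithm.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=1 << 20):
    """Return the ADLER32 and MD5 checksums of a file as lowercase hex strings."""
    adler = 1  # zlib's initial value for Adler-32
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            adler = zlib.adler32(chunk, adler)
            md5.update(chunk)
    return {"ADLER32": format(adler & 0xFFFFFFFF, "08x"), "MD5": md5.hexdigest()}


def matches(path, checksum_type, expected_hex):
    """Compare a stored checksum with one computed using the same algorithm."""
    computed = file_checksums(path)[checksum_type.upper()]
    return computed == expected_hex.lower()
```

An ADLER32 value is 8 hex digits and an MD5 value is 32, so a value recorded under one type can never validate against the other; both sides have to agree on the type first.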

Experiences with 1.9.x-series

Gerd: no real issues.

1.9.0 isn't a big upgrade from 1.8.0-15p12.

It should be a fairly safe upgrade. Sites must make sure they upgrade the head nodes at the same time as, or before, upgrading the pools. This is a strong suggestion.

Do you plan to upgrade everything in one go? Yes, because we must shut down anyway. 1.8.0-15p8 ...

Owen: 15p12 was considerably more stable than 1.8.0-15p8.

The major improvement in 1.9.1 is the refactored pool component. This has been engineered so 1.9.1 pools can be run in an otherwise 1.9.0 deployment of dCache.

GFAL

Owen: GFAL can't work with the dCache SRM v1 interface, as it misinterprets negative request IDs as an error code.

Flavia: This is known and will be fixed in future versions.
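The class of bug Owen describes can be sketched as follows. This is a simplified illustration under stated assumptions: `SrmV1Reply` and both functions are hypothetical and are not the GFAL or SRM API. The point is only that a request ID is an opaque value, so its sign should not be read as a status.

```python
from dataclasses import dataclass


@dataclass
class SrmV1Reply:
    """Hypothetical, simplified reply: a request ID plus a separate error field."""
    request_id: int
    error: str = ""


def submitted_ok_buggy(reply):
    # Faulty assumption: a negative ID means the request failed.
    return reply.request_id >= 0


def submitted_ok(reply):
    # Decide from the explicit error field; the ID is treated as opaque.
    return reply.error == ""


reply = SrmV1Reply(request_id=-2147483000)
print(submitted_ok_buggy(reply), submitted_ok(reply))
```

With a valid but negative ID, the first check rejects a request that was in fact accepted, while the second does not.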

AOCB

Owen: Any progress on the FTS time-out problems?

(There doesn't seem to be any.)

Owen: announcing that 1.8.0-15p12 is now in PPS.