Last modified on 03/05/09 10:43:29

dCache developers meeting March 4, 2009

Participants

Dmitry, Gene, Gerd, Owen, Paul, Tanja, Tigran, Timur, Vladimir

Topics

  • Checksums and GridFTP
    • Flavia reported a problem with Castor
  • SRM-ls asynch reports
  • Important outstanding bugs:
  • Preparations for GridKa ATLAS split
  • Making releases:
    • Releasing next 1.9.0 (1.9.0-11)
    • Releasing next 1.9.1 (1.9.1-7)
    • Releasing next 1.9.2 (1.9.2-4)
    • Releasing 1.9.3

GridFTP and Castor

An srm-cp transfer fails when it performs a GridFTP transfer against Castor. The srm-cp client tries to perform checksum negotiation; this fails (which should be a recoverable error). We then attempt the FTP transfer, but this also fails (with a "file not found"-like error).

Several issues with Castor:

  1. they don't support TURL lifetime (the TURL is invalidated before SRM PUT-DONE)
  2. transient errors cause problems (the checksum error is supposed to be recoverable)

A server advertises which features it supports; previously the Castor servers didn't advertise checksums. Now they do, but the implementation is broken.

Flavia also mentioned that transfers fail with BeStMan and StoRM clients.

SRM ls

Dmitry reported that, with the asynchronous ls, dCache now passes Flavia's S2 tests.

Gerd said that reviewing is problematic whilst at OGF but he'll do a review once back in Copenhagen.

Owen asked whether this patch includes help for large directories. It does.

Problems with VOMS and SRM for ARC

This is ticket #4163.

Since Ted wasn't present, we didn't discuss this further.

Artem checksum

This is ticket #4352.

The problem is that the server rejects the transfer because the uploaded checksum (ADLER32) doesn't match the MD5Sum calculated by the server. This comparison will always fail, if for no other reason than that the two checksums have different lengths.
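To illustrate why such a comparison cannot succeed, here is a minimal Python sketch. It is not taken from the dCache client or the Globus server; the helper names `adler32_hex` and `md5_hex` and the sample payload are made up for illustration. It only shows that an ADLER32 value is 32 bits (8 hex digits) while an MD5 digest is 128 bits (32 hex digits).

```python
import hashlib
import zlib


def adler32_hex(data: bytes) -> str:
    """Return the ADLER32 checksum of data as 8 lowercase hex digits."""
    # Mask to an unsigned 32-bit value and zero-pad to a fixed width.
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 lowercase hex digits."""
    return hashlib.md5(data).hexdigest()


# Hypothetical payload standing in for the uploaded file.
payload = b"example file contents"
uploaded = adler32_hex(payload)   # what the client would send
calculated = md5_hex(payload)     # what an MD5-only server would compute

print(len(uploaded), len(calculated))
print(uploaded == calculated)     # the strings differ in length, so never equal
```

Comparing the two strings directly can never report a match, whatever the file contents, which is consistent with the behaviour described in the ticket.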

Currently, this looks like a bug in the Globus GridFTP server, but it could still be our client. Some more information is needed.

We need to verify that the client is sending the correct information. Checksums are sent as the ordered pair: (checksum-type, checksum-value). If the Globus GridFTP server is rejecting the transfer then either:

  1. we're sending the wrong checksum-type with an ADLER32-generated value, or
  2. the server is comparing values from differing checksum-algorithms.

The Globus GridFTP server currently only supports MD5Sum (i.e., not ADLER32), so it is possible that it is confused by the uploaded ADLER32 checksum. A work-around may be to refrain from sending a checksum if we don't have one the remote site understands.
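The work-around could be sketched roughly as follows. This is an illustration in Python, not the actual client code: the `Checksum` type, the `select_checksum` function and the idea of passing in a list of algorithm names the server supports are all assumptions made for the example. The (checksum-type, checksum-value) pair is modelled as a named tuple.

```python
from typing import NamedTuple, Optional, Sequence


class Checksum(NamedTuple):
    """Hypothetical model of the (checksum-type, checksum-value) pair."""
    algorithm: str  # e.g. "ADLER32" or "MD5"
    value: str      # hex-encoded checksum value


def select_checksum(
    known: Sequence[Checksum], supported: Sequence[str]
) -> Optional[Checksum]:
    """Pick the first checksum we hold whose algorithm the server supports.

    Returns None when there is no overlap, in which case the caller
    would refrain from sending any checksum at all.
    """
    supported_names = {name.upper() for name in supported}
    for checksum in known:
        if checksum.algorithm.upper() in supported_names:
            return checksum
    return None


# We only hold an ADLER32 value, and the server only understands MD5:
ours = [Checksum("ADLER32", "11e60398")]
print(select_checksum(ours, ["MD5"]))      # None, so send no checksum
print(select_checksum(ours, ["adler32"]))  # the ADLER32 pair is safe to send
```

Returning `None` rather than falling back to an unsupported algorithm keeps the type and the value together, so the server is never asked to compare values from different algorithms.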

If, in resolving the ticket, changes are made to the client then Timur would like to be informed.

Owen to chase this up.

gPlazma fails with ConcurrentModificationException

This is ticket #4354.

As Ted wasn't here, this topic was skipped.

FZK OutOfMemory in PinManager & PnfsManager

This is ticket #4353.

This problem was only just reported, so people haven't had a chance to look into it yet. Gerd has already asked Doris for more information.

Tigran has agreed to look into it.

ATLAS split

We believe FZK have all necessary tools and information to achieve this.

We're unsure which version of dCache they are planning to use. There was talk (some time ago) about using the 1.9.2 branch, but apparently the decision about which version they will use is in a state of flux.

1.9.0

The most immediate change is for the info system. This will make a 1.9.0-branch release that would be acceptable to gLite.

Tigran mentioned some changes that Gerd suggested go in 1.9.0. If this is done, then they too would appear in 1.9.0-11.

The plan is to release 1.9.0-11 as soon as the info system patches are in.

1.9.1

With 1.9.1, the most immediate change is for the info system. With 1.9.1-7, we have a version that is suitable for gLite and has been quite well tested.

Since 1.9.1-6 is the current recommended release, Owen may choose to base the gLite version around 1.9.1 rather than 1.9.0.

The plan is to release 1.9.1-7 as soon as the info system patches are in.

1.9.2

Tigran commented that he has committed the new background pictures, so a release is imminent.

Currently the test system is broken, so a new round of tests is needed.

A question was raised about whether to delay 1.9.2-4 until the outstanding gPlazma issues are fixed. Gerd recommended that we DON'T wait; the problems are transient ones that clients should recover from (by retrying). So, the gPlazma bugs are not release-critical. There was general agreement with this position.
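As an illustration of what recovering by retrying might look like on the client side, here is a minimal Python sketch. It does not describe any existing dCache or SRM client: `TransientError`, `call_with_retry` and the attempt and delay parameters are hypothetical names chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a failure that is expected to clear on retry."""


def call_with_retry(
    operation: Callable[[], T], attempts: int = 3, delay: float = 0.0
) -> T:
    """Call operation, retrying on TransientError up to `attempts` times.

    The wait grows linearly with the attempt number. Any other exception,
    or a TransientError on the final attempt, is passed to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: report the failure
            time.sleep(delay * attempt)
    raise AssertionError("unreachable")
```

A client would wrap the authorisation-dependent request in `call_with_retry`, so that a single transient failure does not surface as a failed transfer.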

We anticipate making a 1.9.2-4 release tomorrow, provided all the patches make it in.

Security

Owen mentioned the gLite stress-testing of the SCAS software. There were several problems that this testing illustrated. Perhaps the most important was severe memory leaks, requiring a server restart every five minutes.

Gerd asked what SCAS was. Owen said it stands for "Site Certificate Authorisation Service". Timur added that it's a development based on the idea of the LCMAPS file: SCAS is a service that authorises (or denies) people at the site-level.

Owen agreed to forward links describing the testing to the dTeam mailing list.

DTNM

Wednesday 11th March 2009 16:30 CET