wiki:developers-meeting-20091125
Last modified on 11/25/09 18:49:11

[part of a series of meetings]

Coordinates

  • H323: mcu2.es.net
  • E164: 0011349850044744
  • Meeting ID: 0044744
  • Jabber room: dcache at groupchat.nordu.net

Participants are requested to log into the chat room during the meeting.

Participants

Agenda

[see box to the side]

Postcards

Up to two minutes (uninterrupted) per person where they can answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Gerd

  • Last week: The usual support and bug fixing activities; Worked on polishing WebDAV door
  • Next week: Make sure 1.9.6 is releasable; possibly implement HTTP Basic Authentication

Paul

  • Last week: helped others update ancient LDAP queries, improved pnfsDump performance, started implementing GLUE validation tests.
  • Next week: completion of GLUE tests; finishing off pnfsDump improvements (hopefully!)

Timur:

Really busy debugging transfers from CERN and Fermi. Now doing better than ever. Worked on several patches. Making SRM aware of multiple hosts. Testing this under a DNS load balancer.

Will work on RT tickets and needs to write documentation.

Vijay: able to get a handful of patches; more than a few need further testing. Plans to do this after a week or so. Enstore patches. A few patches on the review board.

Two days off for holidays: Thanksgiving.

Tanja: PoolManagerAdapter synchronous work: this is work on having common code for the doors, encapsulating the communication with the PoolManager. All logic for error handling and retries is held in common. Some doors (dcap, xroot, and ..) tell PoolManager not to do ...

This will be done for space manager and namespace?

Creation of the file? Not yet. We tried
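As an illustration of the "error handling and retries held in common" idea above, here is a minimal sketch of a shared retry helper that several doors could call. It is not the PoolManagerAdapter code: TransientError, call_with_retries, FlakyPoolManager and select_pool are all hypothetical names invented for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def call_with_retries(request, attempts=3, delay=0.0, sleep=time.sleep):
    """Run request() up to `attempts` times, retrying only on TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: let the calling door see the failure
            sleep(delay)


class FlakyPoolManager:
    """Stand-in for the real PoolManager: fails a few times, then answers."""

    def __init__(self, failures=2):
        self.calls = 0
        self.failures = failures

    def select_pool(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError("no answer")
        return "pool_1"
```

Each door would pass its own request to call_with_retries rather than carrying its own copy of the retry loop, which is the point of keeping the logic in one place.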

Owen: failed dCache for certification. Waiting on ~3--4 patches for fixing how info serv. Now have (in branch) full autoconf for dcap. Needs testing on Solaris.

Mostly got clients through cert. but SRM client doesn't work for DPM directories.

Update in info-server configurator. Done some OS management stuff.

Tigran: work on NFS, code review, fixing tickets, abstract door, ...

Jan: continued on the unit test project. Committed some changes: down to 7 failing. Currently changing OS to Ubuntu. This is taking some time. Next week: get to end of the unit testing and pretty printing code.

Irina: managing tickets and fixing some bugs in dcap. Some problems with the HSM cleaner. Making a command-line interface for getting ACLs.

S.Traylon until auto-tools.

Deprecation of old releases (max 10 minutes)

  • What's the status?

Move away from 1.9.1? Pushed to CERN, but not accepted.

1.9.2: VDT?

Timur: don't know. Tigran: they have put 1.9.5-8 into VDT, but don't know if this is official or release release.

1.9.3: ?

Sara was running 1.9.3, but they aren't now.

Mark it as "no longer supported".

Move old releases to a page of "deprecated releases".

Status of work for 1.9.5

A (quick?) review of activity needed for the 1.9.5 release

  • At least one critical patch which still needs to be approved (1132); causes transfers to hang on some configurations.
  • New release needed for updated dCacheConfigure.
  • dcap remove fix : no permission check,
  • dcap tunnel gsi doesn't support multiple VOMS rules.

What doesn't support ACLs? SRM: it gets it as a side-effect of the final transfer.

PUT and GET don't, but ... REMOVEDIR does.

Status of work for 1.9.6

A (quick?) review of activity needed for the 1.9.6 release

  • Release date is approx. 1st of December. Means we need to stop adding stuff ASAP and fix whatever is broken. So what is broken?
    • some of the SRM operations are broken because they do read-permission checks when they shouldn't.
    • "get permission" only works if you have read permission for the file.

There is a FileAttribute to request which requests you are allowed to do, but the problem with the trunk is how it fetches the information: a second check is done.

In an ACL there is a permission to read attributes but not the file itself.

Fixed: "just needs to be done." Whoever has time should do it. Gerd tried to have a look at the code and come up with a proposal who does what.

Can't you do this through NFS4.1 clients? They're not installed.

  • ROOT doesn't work with WebDAV

ROOT doesn't handle the HTTP redirect that the door sends to redirect the client to the pool. The work-around is to use a proxy code.

  • Degradation of debugging support.

During the debug sessions for the transfer failures between CERN and Fermi, debugging had to be enabled and disabled quickly.

Go through the GUI: set debug=3 on System cells of the pools, wait for a few seconds, then switch debug. Would you consider having the same commands available, so any cell in the domain can execute the same commands?

It would be nice to have a single command. Setting both an appender and all the loggers connected to it to a specific log level would be very convenient.

Yes, both are reasonable ideas.
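The "single command" idea can be sketched as follows. This uses Python's standard logging module purely as a stand-in (a handler plays the role of an appender); it says nothing about how dCache's own logging commands are or will be implemented, and set_level_everywhere is a made-up name.

```python
import logging


def set_level_everywhere(handler, level):
    """Set `level` on the handler and on every logger that uses it."""
    handler.setLevel(level)
    changed = []
    # loggerDict holds every named logger created so far; it can also hold
    # placeholder entries, which the isinstance check filters out.
    candidates = [logging.getLogger()] + [
        entry for entry in logging.root.manager.loggerDict.values()
        if isinstance(entry, logging.Logger)
    ]
    for logger in candidates:
        if handler in logger.handlers:
            logger.setLevel(level)
            changed.append(logger.name)
    return changed
```

One call then replaces the "set debug on each cell, wait, switch it back" sequence: call it once with the debug level, and once more with the previous level to undo it.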

ACL checks for SRM

Timur worked on this. Transfer manager is now ACL aware. Check is on the transfer manager side. Could be optimised to offload the check to PnfsManager by putting the subject in the create message.

It will behave correctly, but it may be sending too many commands.

Flag needed to enable proper full-path checking of permission. Perhaps on 1.9.7 this could be fixed in Chimera.

Timur to go through the SRM code again to check that code.

We have a hard date for cutting 1.9.6? December 1st was the agreement.

SRM-ls to not use ACLs and mounted file-system.

If not in 1.9.6 then

terracotta

Can Tanja do the OSG testing on Trunk? Timur.

Should get the SRM-ls related problems fixed first, though.

What's the status of Xrootd ACLs? It has no users.

Webdav support

  • Integration with billing and monitoring done (needs review)
  • Improved logging (needs review)
  • Improved error handling (needs review)
  • Fixed some problems with SRM integration for HTTP
  • Have been unable to test HTTPS via SRM as clients don't appear to like HTTPS

Fast list for SRM

Terracotta for SRM

Single port xrootd mover

  • Postponed until after 1.9.6; turns out I cannot implement this without fixing the clients first.

Easy domain composition

  • Further work is post 1.9.6.

HSM cleaner for Chimera

Don't know. "Yes"

Unittest cleanup (5mins)

Jan emails to: Ted, Timur and Team. Don't know if unit tests or code is broken. Please reply: nobody replied. When did you send it? At least a week ago. Thread manager test (to team). Testing setter methods that don't do anything.

Two more failing tests: Tigran already took them. Then all tests will

The unit tests become part of the build process: failing tests => failing build.

Code Format-Program (10mins)

If we could agree on a standard. You can't use the pretty-print format of Eclipse or Netbeans as this will change all the code.

Can we agree on using one standard for formatting?

Provide a standard file for eclipse (and Netbeans, ...)

It's a discussion we had multiple times. Back then we agreed on: no trailing white space, no tabs.

If we agree on a standard method.

The problem really split into two groups: whether it is safe

We cannot agree on a common format.

If we adopt something then someone would have to make a change across the whole repository. Don't think this is a big problem.

Owen thinks that using a beautifier is the correct solution.

Massive white-space changes.

Diff with white-space changes.

A file with history: remove trailing white-spaces, replaced tabs.

We have "strip patches" a couple per week, so the problem is spread out over multiple patches.

Existing files already have trailing white-space.

Tigran: supports this if we do it once. Add a hook that rejects commits that include trailing white-space.

Everyone agreed to a hook.

Somebody will write a patch to remove trailing white-space, tabs.
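A minimal sketch of the check such a hook could run, assuming the hook is handed the new content of each changed file. How the content is obtained from the version control system is deliberately left out, and both function names are invented for this example.

```python
def find_whitespace_violations(text):
    """Return (line_number, reason) pairs for tabs and trailing white-space."""
    violations = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            violations.append((number, "tab"))
        if line != line.rstrip():
            violations.append((number, "trailing white-space"))
    return violations


def hook_exit_code(files):
    """`files` maps path -> new content; a non-zero result rejects the commit."""
    bad = False
    for path, content in sorted(files.items()):
        for number, reason in find_whitespace_violations(content):
            print(f"{path}:{number}: {reason}")
            bad = True
    return 1 if bad else 0
```

The same find_whitespace_violations function could drive the one-off clean-up patch, so the hook and the clean-up agree on what counts as a violation.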

Issues from yesterday's Tier-1 meeting

None really. (Paul) thinks it would be nice to hear how work is going on fixing the failing transfers Jon reported. Do Fermi people need any help?

Monday last week: all transfers from CERN stopped working.

CERN had a transparent update of their server. They started returning PARTIAL-SUCCESS return code. This was an illegal return code because PARTIAL-SUCCESS requires. dCache didn't allow any requests to have . Always returned PS since some were always .

CERN agreed to change the code.

Received transfer URL. The GridFTP server was not there anymore. After we explained this to CERN: their TURLs are only valid for 180 seconds. Why does it take so long for the server to connect (190s on average)? They closed the port after 180s. On Friday, Cataline found that the network configuration of the PNFS node was incorrect. Messages were taking a long time to get from SRM to other components.

The ping in cell webpage for SRM was 50 seconds (!!)

Seen this before with something blocking the cell, but this time this wasn't the cause.

Split the SRM transfer (runs the transfer manager for SRM COPY), PinManager, SpaceManager. Splitting TransferManager into a separate cell helped a lot. Seems to work worse when communicating within the cell. The call of the message is a call-through. If they are in the same domain then there is a coupling between the two threads.

When you do a send, it's the same thread that calls the registered call-backs: the answerArrived() method. This is something we ought to fix.
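The coupling described above can be shown with a toy model. This is not the cells code: Cell, deliver and callback_thread_name are invented for the sketch, and the callback plays the role of answerArrived(). Without an executor the sender's own thread runs the callback; with one, the callback runs on a worker thread and the sender returns immediately.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Cell:
    """Toy cell: deliver() invokes the registered answer callback."""

    def __init__(self, executor=None):
        self.executor = executor
        self.callback = None

    def register(self, callback):
        self.callback = callback

    def deliver(self, message):
        if self.executor is None:
            # Call-through: the sender's thread runs the callback itself.
            self.callback(message)
            return None
        # Decoupled: hand the callback to a worker thread and return at once.
        return self.executor.submit(self.callback, message)


def callback_thread_name(executor=None):
    """Report the name of the thread that ended up running the callback."""
    seen = []
    cell = Cell(executor)
    cell.register(lambda msg: seen.append(threading.current_thread().name))
    future = cell.deliver("reply")
    if future is not None:
        future.result()  # wait only so the sketch can report the result
    return seen[0]
```

In the call-through case a slow callback blocks whoever called send, which is one plausible reading of why moving TransferManager into its own cell helped.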

Timur to send a bug report.

SRM v2 for copies. When you pull the transfer of the success, you pull the request for only the remaining requests. ArrayIndexOutOfBound: this is now fixed. After they reconfigured the network, everything now works. Now it always takes us less than

About 25% of transfer failures are CASTOR related: they return an SRM failure. The message says "timeout". Internally, they cannot get something from a queue.

Another issue: 0.5%-1% of transfers fail. Timur will investigate this further.

Yesterday: RT #5285 Lionel reported a problem. Could this be related?

CASTOR always has a TURL lifetime of 180s. Is that compliant with the SRM spec? They say the clients aren't behaving well: they had problems with the client not coming back. This was crashing their data server.

This is bad.

FTS has the same. They have a three-minute time-out. That's valid: they are the client.

Transfer slot should be independent of the TURL.

Debugging levels.

Compilation on CentOS and Debian (5 minutes)

dCache server doesn't compile on Debian stable and CentOS 5. The batik library.

Next face-to-face meeting (5 minutes)

Before Patrick went on

F2F meeting in Copenhagen.

When this could be: some time in April 2010.

Timur to start the wheels in motion for the Fermi meeting.

Current working directories (10 minutes)

What people complain about is the top-link (JPL) SQL files. These are in CWD. There are two options to specify the SQL file name, but those cannot be absolute; they are relative to the CWD.

Submitted a couple of patches to address.

These are not temporary files because they allow you to put additional configuration in these files.

Suggested that we move these to /var.

Owen: no! We're going the wrong way. General practice is that processes should change CWD to / to allow unmounting.

CWD for all processes is to be /

PID files are already moved with 1.9.5.

Is it temporary? No. Then it should be /var (e.g., /var/opt/dcache)

Set CWD to $PREFIX (from RPM install; default is /opt/d-cache).

Owen: "In an ideal world we'll have a prefix of /, we should use the same prefix."
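If every process runs with CWD set to /, relative file names have to be resolved against a configured directory instead. A minimal sketch of that resolution, assuming the /var/opt/dcache location mentioned above as the base; resolve_state_file and the file names are hypothetical, and posixpath is used so the behaviour is the same on any platform.

```python
import posixpath


# /var/opt/dcache is the example location from the discussion; the function
# name and the file names below are made up for this sketch.
def resolve_state_file(name, base_dir="/var/opt/dcache"):
    """Resolve a configured file name without consulting the CWD."""
    if posixpath.isabs(name):
        return posixpath.normpath(name)
    return posixpath.normpath(posixpath.join(base_dir, name))


print(resolve_state_file("example.sql"))
print(resolve_state_file("/etc/example.sql"))
```

With this shape, the two existing options could accept absolute names as well, and the result no longer depends on where the process was started.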

Documentation of SRM client (5 minutes)

There are lots of commands that are not documented.

There is a man page for srmcp, but it's out of date.

It could do with being updated.

Any SRM command with "-h" isn't enough? Doesn't include all the arguments.

Remove options.

Please follow up with concrete issues.

Certification of srmclient DPM (5 minutes)

Owen has a special certificate for the testing.

Demtri sent an email with a suggestion.

In Sept. 2007, the clients were made to work against BestMaN.

GSI problem (5 minutes)

The issue is that all servers need to have new certificates. As soon as you do this, then people become aware of the issue.

There is no requirement on the client to verify that the FQDN in the certificate matches the host it was connecting to.

In globus-url-copy 4.2.1.

Entry in the bugzilla: from 2006, describing the problem.

Add this as an option so it can be enabled.
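The check under discussion can be sketched in a few lines. This is a deliberately simplified illustration: it does exact, case-insensitive name matching only, ignores wildcards and other details of real certificate validation, and is not how globus-url-copy or any dCache client implements it. The function name and the host names in the usage lines are made up.

```python
def host_matches_certificate(connected_host, certificate_names):
    """True if the host we dialled appears among the certificate's names."""
    wanted = connected_host.rstrip(".").lower()
    return any(
        wanted == name.rstrip(".").lower() for name in certificate_names
    )


def allow_connection(connected_host, certificate_names, verify_fqdn=False):
    """Optional check: only enforced when verify_fqdn is switched on."""
    if not verify_fqdn:
        return True
    return host_matches_certificate(connected_host, certificate_names)
```

Keeping verify_fqdn off by default mirrors the "add this as an option" suggestion: sites can switch it on once their servers have matching certificates.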

Outstanding RT Tickets

[This is an auto-generated item. Don't add items here directly]

Outstanding RT Tickets discussed during the previous meeting

No progress on the tickets: 5072, 5075, 4681, 5257, 5265 :

http://www.dcache.org/rt/index.html?q=5072

http://www.dcache.org/rt/index.html?q=5075

http://www.dcache.org/rt/index.html?q=4681

http://www.dcache.org/rt/index.html?q=5257

http://www.dcache.org/rt/index.html?q=5265

Tickets in progress: 5189, 5196, 5272.

Ticket reopened by Alex, assigned to Owen: 2285 http://trac.dcache.org/projects/dcache/ticket/2285

Timur, do you agree to take over these tickets: 4671, 4681, 5120?

1) http://www.dcache.org/rt/index.html?q=4671

with feature request: http://trac.dcache.org/projects/dcache/ticket/269

2) http://www.dcache.org/rt/index.html?q=4681

3) http://www.dcache.org/rt/index.html?q=5120

Review of RB requests

DTNM

Proposed: same time, next week.