wiki:developers-meeting-20091111
Last modified on 11/11/09 18:46:16

[part of a series of meetings]

Participants

Timur, Gene, Vijay, Gerd, Patrick, Jan, Irina, Tigran, Paul

Agenda

[see box to the side]

Status of work for 1.9.5

A (quick?) review of activity needed for the 1.9.5 release

We would like a release very soon.

There's an issue with the tape protection in both 1.9.4 and 1.9.5. Prepare the releases this week.

Another issue was reported at SARA: under normal conditions things were fine, but under particular circumstances the file sizes didn't match. According to the log files, both the client and the server were happy.

The problem turned out to be that if two movers are started at exactly the same time (millisecond granularity) then, in passive mode, the challenge is insufficient to discriminate between the two movers. This allows the client to connect to the pool and receive data from the wrong mover.

Patch 1.9.4 and 1.9.5.
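The collision described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the challenge was derived from the mover's start time; the class and method names are hypothetical and are not dCache code. A challenge drawn from a random source does not depend on the start time, so two movers started in the same millisecond still get different values (with overwhelming probability).

```java
import java.security.SecureRandom;

// Hypothetical helper, not dCache code: hands out per-mover challenges.
final class MoverChallenges {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Collision-prone variant (assumed for illustration): two movers started
    // in the same millisecond receive the same challenge.
    static String fromTimestamp(long startMillis) {
        return Long.toHexString(startMillis);
    }

    // Alternative sketch: 128 random bits, independent of the start time.
    static String random() {
        byte[] bytes = new byte[16];
        RANDOM.nextBytes(bytes);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b)); // two hex digits per byte
        }
        return sb.toString();
    }
}
```

Whether the actual patch uses a random value, a counter, or something else is not recorded in these minutes.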

Q: patch for JGlobus as a work-around to the performance issue with the CRL handling. The stack trace from catalina.log shows the same problem can affect 1.9.5, too. JGlobus is clever enough not to always reload the CRLs; instead it checks the files' mtime. Unfortunately, this mtime check is itself a bottleneck. The patch introduces a time window during which the file system is not checked.

This patch certainly doesn't fix Simon's problem ("SRM crashes"). It is purely a scalability issue. This is going in.
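The idea of a time window for the mtime check can be sketched as follows. This is not the JGlobus API or the actual patch; the class name, the constructor parameters and the explicit clock argument are assumptions made so the example is self-contained.

```java
import java.io.File;

// Hypothetical sketch, not JGlobus code: rate-limits mtime checks on one file.
final class ThrottledMtimeCheck {
    private final File file;
    private final long windowMillis;
    private long lastCheck;      // when the file system was last consulted
    private long lastSeenMtime;  // mtime observed at that point
    private boolean checkedOnce;

    ThrottledMtimeCheck(File file, long windowMillis) {
        this.file = file;
        this.windowMillis = windowMillis;
    }

    // Returns true if the file should be (re)loaded. The caller passes the
    // current time, e.g. System.currentTimeMillis().
    synchronized boolean needsReload(long nowMillis) {
        if (checkedOnce && nowMillis - lastCheck < windowMillis) {
            return false; // inside the window: do not touch the file system
        }
        long mtime = file.lastModified();
        boolean changed = !checkedOnce || mtime != lastSeenMtime;
        checkedOnce = true;
        lastCheck = nowMillis;
        lastSeenMtime = mtime;
        return changed;
    }
}
```

The trade-off is that a changed CRL file is noticed up to one window length late, in exchange for far fewer file-system calls under load.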

Status of work for 1.9.6

TODO: Paul to update the template to include these items.

A (quick?) review of activity needed for the 1.9.6 release

ACL checks for SRM

Still needs changes to PutCompanion.

It would be nice to add a "mkdir -p" like operation to PnfsManager.

Timur is happy about how to proceed.

Gerd to send Timur a link (URL) to the list of operations in PnfsManager that are checked based on subject.

Webdav support

Initial version is committed.

Continuing work on authentication support.

Currently username / password authentication is handled in the doors themselves. Can we move support for this to gPlazma?

Timur: would be nice but it isn't clear who will do this.

Gerd: I'm implementing a short-term solution for WebDAV: kauth class path or htaccess file format.

Tigran said that we need to encrypt communication between door and gPlazma.

Also need a mechanism (plugins) in gPlazma to hook into a site-level user database (e.g., LDAP, PAM module).

Fast list for SRM

Slow progress, nothing to report.

The SRM handler is accessing the mounted file-system.

Terracotta for SRM

Back-seat whilst working on the cell communication and ACL support.

Requested DNS load-balancing (round-robin, most likely).

Slightly related: a new record was set, 295 Hz ping (140 Hz). Even with this high rate of pings, the SRM server is still only at 30% load, due to another contention point in JGlobus.

Single port xrootd mover

No progress.

Is ALICE happy with it? Who knows?

There is a problem when clients are redirected to a pool without any initial handshake. Unable to switch on the debugging in ALICE jobs to see what's happening.

The number of movers that don't do anything seems to have increased after upgrading to the new 1.9.5 xrootd pools. This may be a coincidence: ALICE may have updated their batch system.

Easy domain composition

First version is in: the rest is for post 1.9.6.

Timur: we run 10 dcap doors in 10 different domains. Similarly for several GridFTP doors.

Create multiple batch files, each four lines long. The differences (e.g., port range) are specified as parameters in the exec line.

TODO: Gerd to send an example of this.

HSM cleaner for Chimera

No progress.

Any other issues?

Crash on PnfsManager startup

LPCD dCache is managed separately. They've been testing upgrades from 1.7 to 1.9.5 on their test node, where they have everything running. The PnfsDomain startup fails with a segmentation fault.

Email sent 4th November.

Where is the segmentation fault printed? Daemon script, line 39.

All other domains start OK. Machine is: Scientific Linux 5

Usually when the JVM crashes it creates a file: hs_err_pidnnnn.log where nnnn is the PID of the process.

Can also try to start PnfsManager from the command line: switch on verbose mode for the shell to discover the command line used by the daemon script.

Pnfs release

Owen has tested it. When he's back tomorrow he'll probably be ready to release it.

Might need another small change to have a hard-coded port-number for mountd.

Publish 32-bit and 64-bit compiled PNFS RPMs, but we'll push only the 32-bit version to gLite.

Jon is running the 64-bit version of pnfsd and dbserver. He's happy to deploy the new version of PNFS but would need it soon.

EOL of Java 5

http://java.sun.com/products/archive/eol.policy.html

Starting 1st Nov., Java 5 is no longer supported by Sun.

Suggest switching to compiling Java 6 bytecode. This should be safe, since the pool now uses Java 6 library calls.

Download page shows Java 7. We should look at this soon.

Can we use Java 7 as a compile platform? It offers strings in switch statements and multi-catch clauses.
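For reference, a small sketch of the two Java 7 language features mentioned here: switching on a String and catching several exception types in one clause. The class, method names and protocol strings are illustrative only.

```java
// Illustrative only: the two Java 7 language features named above.
final class Java7Features {

    // Strings in switch: earlier Java versions need an if/else chain of equals().
    static String describe(String protocol) {
        switch (protocol) {
            case "dcap":
            case "gsidcap":
                return "dcap family";
            case "gsiftp":
                return "GridFTP";
            default:
                return "unknown";
        }
    }

    // Multi-catch: one handler for several unrelated exception types.
    // Returns -1 when the text is null, not a number, or out of range.
    static int parsePort(String text) {
        try {
            int port = Integer.parseInt(text.trim());
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("out of range: " + port);
            }
            return port;
        } catch (NullPointerException | IllegalArgumentException e) {
            return -1;
        }
    }
}
```

Note that the types in a multi-catch must not be subclasses of one another; NumberFormatException is covered here because it extends IllegalArgumentException.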

Talk to Scientific Linux to get them to switch to Java 7 "soon".

PoolManager.conf

Can we deploy this as a template file?

Remove the PoolManager.conf file by pushing the defaults into the code. This means we can distribute dCache without any PoolManager.conf.

ticket 5257

dCacheSetup file between v1.9.2 and v1.9.5.

Info-provider tape accounting

Where to store user-created XML file?

/opt/d-cache/var/tape-info.xml

NO!

/var/opt/dcache/tape-info.xml

or similar.

Issues from yesterday's Tier-1 meeting

BNL

Four issues reported by Pedro:

Waiting for input from Pedro. If an exception happens very often then the stack trace is optimised out by the JVM. There's an option that suppresses this behaviour (on HotSpot JVMs, -XX:-OmitStackTraceInFastThrow).

Two different problems:

Move a sticky-bit owned by the pin manager.

Two solutions:

  1. remove the sticky bit from those files,
  2. create a fake pin from the lifetime of the sticky bit.

Gerd: go for option 1.

Tigran: does it send the pin lifetime? Gerd: No.

Tigran: perhaps we can use this opportunity to resync the database against the pool.

No, just delete the pin-manager sticky bit.

The likely cause is that the pool was down when the pin manager tried to remove the pin.

Pool is pinned (via sticky bit).

  • need to use vorole-mapping with gPlazma

SARA

Onno reported a problem with replication for ATLASHOTDATA

Outstanding RT Tickets

[This is an auto-generated item]

RT 4671: wildcard in vorolemap file

Timur to ping Ted

RT 5064: resilient dcache reduction times out

When replica is deleted, PnfsManager is notified.

Turn on events logging and trace the issue.

Pedro doesn't want to debug this further.

Broadcaster not configured to send it messages.

Close ticket

RT 5072: dcap movers hanging

The case where the client goes away is fixed; but if the door node has crashed then movers will still hang.

TODO: Tigran to talk to Doris to ask if these are correlated with door crashing.

Maybe implement ping from mover to door.

RT 5075: Random write-pool selection choosing pools that cannot hold the file.

Site admin needs to change the batch file.

Gerd: I believe there was a fix for this about a year ago: we make the cost infinite when the pool is full. Random selection should honour this.

TODO: Gerd to look into it.

TODO: re-assign the ticket to Gerd.
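A minimal sketch of what "random selection honouring an infinite cost" could look like, under stated assumptions: the Pool record and the select method are hypothetical and do not reflect the actual PoolManager code. Pools whose cost is infinite (i.e. full) are dropped before the random draw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not PoolManager code.
final class RandomPoolSelection {

    // Illustrative pool record: a name plus a cost, infinite when the pool is full.
    static final class Pool {
        final String name;
        final double cost;

        Pool(String name, double cost) {
            this.name = name;
            this.cost = cost;
        }
    }

    // Picks uniformly among pools with finite cost; returns null if none qualify.
    static Pool select(List<Pool> pools, Random random) {
        List<Pool> candidates = new ArrayList<>();
        for (Pool pool : pools) {
            if (!Double.isInfinite(pool.cost)) {
                candidates.add(pool);
            }
        }
        if (candidates.isEmpty()) {
            return null;
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```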

RT 5189: Fwd: Lost files

TODO: Tigran further investigation.

Remove DEV-discuss.

RT 5196: bug: migration module cannot handle pin manager failures

Remove DEV-discuss.

RT 5215: no SRMSPACEMANAGER - noise in out file

Dmitry to reply to ticket, closing it.

Review of RB requests

Move the time of the meeting?

Are there alternative slots for this meeting?

Gerd: Monday and Thursday. Fermi: not clear, since it depends on the availability of the video room.

Time-box meetings

Keep meetings down to 1.5 hours.

DTNM

Proposed: same time, next week.