Table of Contents
- Participants
- Agenda
- Status of work for 1.9.5
- Status of work for 1.9.6
- Crash on PnfsManager startup
- Pnfs release
- EOL of java5
- PoolManager.conf
- ticket 5257
- Info-provider tape accounting
- Issues from yesterday's Tier-1 meeting
- Outstanding RT Tickets
- Review of RB requests
- Move the time of the meeting?
- Time-box meetings
- DTNM
[part of a series of meetings]
Participants
Timur, Gene, Vijay, Gerd, Patrick, Jan, Irina, Tigran, Paul
Agenda
[see box to the side]
Status of work for 1.9.5
A (quick?) review of activity needed for the 1.9.5 release
We would like a release very soon.
There's an issue with tape protection affecting 1.9.4 and 1.9.5. Prepare the releases this week.
Another issue was reported at SARA: under normal conditions things were fine but, under particular circumstances, the file sizes didn't match. According to the log files, both the client and server were happy.
The problem turned out to be that, if two movers are started at exactly the same time (millisecond granularity), then, in passive mode, the challenge is insufficient to discriminate between the two movers. This allows the client to connect to the pool and receive data from the wrong mover.
Patch 1.9.4 and 1.9.5.
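The collision described above can be illustrated with a minimal sketch. The minutes do not record how the challenge is actually built or what the patch changes, so both functions below are hypothetical: `timestamp_challenge` assumes a challenge derived only from the mover start time in milliseconds, and `random_challenge` shows one possible alternative that does not depend on the clock.

```python
import secrets


def timestamp_challenge(start_ms: int) -> str:
    """Hypothetical challenge derived only from the start time in ms.

    Assumption for illustration: two movers started within the same
    millisecond get the same value, so a client cannot be matched to
    the right mover.
    """
    return f"{start_ms:016x}"


def random_challenge() -> str:
    """Hypothetical alternative: 128 random bits, independent of the clock."""
    return secrets.token_hex(16)


# Two movers started in the same millisecond (made-up timestamp).
same_ms = 1_257_300_000_123
mover_a = timestamp_challenge(same_ms)
mover_b = timestamp_challenge(same_ms)
time_based_collides = mover_a == mover_b

# Random challenges differ even when created back to back.
random_collides = random_challenge() == random_challenge()
```

Here `time_based_collides` is true while `random_collides` is, for all practical purposes, false.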
Q: patch for JGlobus as a work-around for the performance issue with the CRL handling. The stack-trace from catalina.log shows the same problem can affect 1.9.5, too. JGlobus is clever enough not to always reload the CRLs; instead it checks the files' mtime. Unfortunately, this checking of file mtime is also a bottleneck. The patch introduces a time window during which the file system will not be checked.
This patch certainly doesn't fix Simon's problem ("SRM crashes"). It is purely a scalability issue. This is going in.
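The idea of a time window between mtime checks can be sketched as follows. This is not the JGlobus patch itself (JGlobus is Java, and the minutes give no details of the patch); the class name `ThrottledMtimeCheck` and its parameters are invented for illustration.

```python
import os
import time


class ThrottledMtimeCheck:
    """Report file changes, but stat the file at most once per window.

    Hypothetical helper: inside the window the file system is not
    touched at all and the file is assumed to be unchanged.
    """

    def __init__(self, path, window_seconds, clock=time.monotonic):
        self._path = path
        self._window = window_seconds
        self._clock = clock
        self._last_check = None   # time of the last real stat() call
        self._last_mtime = None   # mtime seen at the last real stat() call

    def has_changed(self):
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self._window:
            return False  # still inside the window: skip the stat() call
        self._last_check = now
        mtime = os.stat(self._path).st_mtime_ns
        # The first call always reports a change, so the caller loads the file once.
        changed = mtime != self._last_mtime
        self._last_mtime = mtime
        return changed
```

The trade-off is that an updated file is noticed up to one window late, in exchange for far fewer file system calls under load.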
Status of work for 1.9.6
TODO: Paul to update the template to include these items.
A (quick?) review of activity needed for the 1.9.6 release
ACL checks for SRM
Still needs changes to PutCompanion.
It would be nice to add a "mkdir -p" like operation to PnfsManager.
Timur is happy about how to proceed.
Gerd to send Timur a link to the list of operations that are checked, based on subject, in PnfsManager.
Webdav support
Initial version is committed.
Continuing work on authentication support.
Currently username / password authentication is handled in the doors themselves. Can we move support for this to gPlazma?
Timur: would be nice but it isn't clear who will do this.
Gerd: I'm implementing a short-term solution for WebDAV: kauth class path or htaccess file format.
Tigran said that we need to encrypt communication between the door and gPlazma.
Also need a mechanism (plugins) in gPlazma to hook into site-level user databases (e.g., LDAP, PAM module).
Fast list for SRM
Slow progress, nothing to report.
The SRM handler is accessing the mounted file-system.
Terracotta for SRM
Back-seat whilst working on the cell communication and ACL support.
Requested DNS load-balancing (round-robin, most likely).
Slightly related: a new record of 295 Hz ping (140 Hz). Even with this high rate of pings, the SRM server is still only at 30% load, due to another contention point in JGlobus.
Single port xrootd mover
No progress.
Is ALICE happy with it? Who knows?
There is a problem when clients are redirected to a pool without any initial handshake. Unable to switch on the debugging in ALICE jobs to see what's happening.
The number of movers that don't do anything seems to have increased after upgrading to the new 1.9.5 xrootd pools. This may be a coincidence: ALICE may have updated their batch system.
Easy domain composition
First version is in: the rest is for post 1.9.6.
Timur: we run 10 dcap doors in 10 different domains. Similarly, several GridFTP doors.
Create multiple batch files, each four lines long. The differences (e.g., port range) are specified as parameters in the exec line.
TODO: Gerd to send an example of this.
HSM cleaner for Chimera
No progress.
Any other issues?
Crash on PnfsManager startup
LPCD dCache is managed separately. They've been testing upgrades from 1.7 to 1.9.5 on their test node, where they have everything running. The PnfsDomain startup fails with a segmentation fault.
Email sent 4th November.
Where is the segmentation fault printed? Daemon, line 39.
All other domains start OK. Machine is: Scientific Linux 5
Usually when the JVM crashes it creates a file: hs_err_pidnnnn.log where nnnn is the PID of the process.
Can also try to start PnfsManager from the command line: switch on verbose mode for the shell to discover the command line used by the daemon script.
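As a small aid for the first suggestion, the sketch below looks for JVM crash logs named hs_err_pidnnnn.log in a given directory. The function name and the choice of directory are assumptions; where the JVM writes the file depends on the working directory of the crashed process.

```python
import os
import re

# Matches the crash-log naming scheme mentioned above: hs_err_pid<PID>.log
_HS_ERR = re.compile(r"^hs_err_pid(\d+)\.log$")


def find_jvm_crash_logs(directory):
    """Return (pid, path) pairs for crash logs in `directory`, newest first."""
    found = []
    for name in os.listdir(directory):
        match = _HS_ERR.match(name)
        if match:
            path = os.path.join(directory, name)
            found.append((os.path.getmtime(path), int(match.group(1)), path))
    found.sort(reverse=True)  # newest modification time first
    return [(pid, path) for _, pid, path in found]
```

For example, `find_jvm_crash_logs(".")` lists any crash logs in the current directory.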
Pnfs release
Owen has tested it. When he's back tomorrow he'll probably be ready to release it.
Might need another small change to have a hard-coded port-number for mountd.
Publish 32-bit and 64-bit compiled PNFS RPMs, but we'll push only the 32-bit version to gLite.
Jon is running the 64-bit version of pnfsd and dbserver. He's happy to deploy the new version of PNFS but would need it soon.
EOL of java5
Starting 1st Nov., Java 5 is no longer supported by Sun.
Suggest switching to compiling Java 6 bytecode. This should be safe, since the pool now uses Java 6 library calls.
Download page shows Java 7. We should look at this soon.
Can we use Java 7 as a compile platform? It offers strings in switch statements and multi-catch blocks.
Talk to Scientific Linux to get them to switch to Java 7 "soon".
PoolManager.conf
Can we deploy this as a template file?
Remove the PoolManager.conf file by pushing the defaults into the code. This means we can distribute dCache without any PoolManager.conf.
ticket 5257
dCacheSetup file between v1.9.2 and v1.9.5.
Info-provider tape accounting
Where to store user-created XML file?
/opt/d-cache/var/tape-info.xml
NO!
/var/opt/dcache/tape-info.xml
or similar.
Issues from yesterday's Tier-1 meeting
BNL
Four issues reported by Pedro:
Waiting for input from Pedro. If an exception happens very often then the stack-trace is optimised out by the JVM. There's an option that suppresses this behaviour (on HotSpot, -XX:-OmitStackTraceInFastThrow).
Two different problems:
Move a sticky-bit owned by the pin manager.
Two solutions:
- remove the sticky bit from those files,
- create a fake pin from the lifetime of the sticky bit.
Gerd: go for option 1.
Tigran: does it send the pin lifetime? Gerd: No.
Tigran: perhaps we can use this opportunity to resync the database against the pool.
No, just delete the pin-manager sticky bit.
Likely cause was the pool was down when the pin-manager tried to remove the pin.
Pool is pinned (via sticky bit).
- need to use vorole-mapping with gPlazma
SARA
Onno reported a problem with replication for ATLASHOTDATA
Outstanding RT Tickets
[This is an auto-generated item]
RT 4671: wildcard in vorolemap file
Timur to ping Ted
RT 5064: resilient dcache reduction times out
When replica is deleted, PnfsManager is notified.
Turn on events logging and trace the issue.
Pedro doesn't want to debug this further.
Broadcaster not configured to send it messages.
Close ticket
RT 5072: dcap movers hanging
Fix the case where the client goes away; but, if the door node crashes, then movers will hang.
TODO: Tigran to talk to Doris to ask if these are correlated with door crashing.
Maybe implement ping from mover to door.
RT 5075: Random write-pool selection choosing pools that cannot hold the file.
Site admin needs to change the batch file.
Gerd: I believe there was a fix for this about a year ago: we make the cost infinite when the pool is full. Random selection should honour this.
TODO: Gerd to look into it.
TODO: re-assign the ticket to Gerd.
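What "random selection should honour an infinite cost" could look like is sketched below. This is not PoolManager code; the function name, the dictionary of costs and the pool names are made up for illustration, and the only idea taken from the discussion is that a full pool has infinite cost.

```python
import math
import random


def select_random_pool(pool_costs, rng=random):
    """Pick a pool at random, skipping pools whose cost is infinite.

    `pool_costs` maps a (hypothetical) pool name to its cost; a full
    pool is assumed to have been given cost math.inf.
    """
    candidates = sorted(
        name for name, cost in pool_costs.items() if math.isfinite(cost)
    )
    if not candidates:
        raise LookupError("no pool can hold the file")
    return rng.choice(candidates)


# Example with made-up pools: pool_b is full and must never be chosen.
costs = {"pool_a": 0.4, "pool_b": math.inf, "pool_c": 1.2}
chosen = select_random_pool(costs)
```

Filtering before the random draw keeps the selection uniform over the pools that can actually hold the file, and makes the "no pool available" case explicit rather than a late write failure.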
RT 5189: Fwd: Lost files
TODO: Tigran further investigation.
Remove DEV-discuss.
RT 5196: bug: migration module cannot handle pin manager failures
Remove DEV-discuss.
RT 5215: no SRMSPACEMANAGER - noise in out file
Dmitry to reply to ticket, closing it.
Review of RB requests
Move the time of the meeting?
Are there alternative slots for this meeting?
Gerd: Monday and Thursday. Fermi: not clear, since it depends on availability of the video room.
Time-box meetings
Keep meetings down to 1.5 hours.
DTNM
Proposed: same time, next week.