dCache Tier I meeting June 22, 2009
Present
dCache.org(Patrick, Tigran, Gerd, Owen, Paul), IN2P3(), Sara(), Triumf(), BNL(Pedro), NDGF(), PIC(Gerard), GridKa(Silke, Doris), Fermi(Jon), CERN(Andrea)
Apologies
Technical issues
None.
Site reports
NDGF
Upgraded to 1.9.3 last Thursday.
PIC
Everything is OK.
GridKa
No major issues.
Just ran into an old problem: files asking for Replica Online.
The file is available on a pool, but is staged anyway.
When the pool has a very high load, dCache thinks it can get the file from tape.
Entry in pool-manager
History: client doesn't even notice that file is missing. This was requested to support
Pedro: what does "high load" mean? Is the cost factor too high?
If the regular "ping" isn't received three times, then a restage is triggered.
If the cost is too high, too many copies are happening on the pool.
PoolManager can cause a restage only if "stage on load" is set.
The pool had been up since February. Can't say when the restage of the file was triggered.
Gerd asked: how are the test conversions going?
Some tables that are needed are not created.
Will download the latest version to create the tables and see if we run into the same problems.
Two problems: error cases in the directories.
Remove date ~2007.
Filename path file.
BNL
Updated to 1.9.1-8 on the pools and 1.9.0-9 on the core machines.
Step09 went very well. The only issue: we need many more write pools, spread over more machines.
No PNFS load problems.
21000 transfers in an hour.
At this high transfer rate, cancelled requests cause SRM to misbehave.
Q: from Patrick: did you
Storage manager on duty: during office hours someone is looking.
Outside office hours, we have probes that send emails if problems occur.
Write pools needed attention, but quiet otherwise.
Patrick: What needs to be fixed?
For the time being, we feel ready for end-of-the-year.
PNFS, avoid using /pnfs.
Scalability of SRM in terms of multiple front-ends.
Doing analysis.
Fermi
Things are running smoothly.
CERN
Nothing special
Maybe it would be useful to have it documented somewhere how to configure dCache.
Patrick: Who is the intended audience of this information?
Jon: Fermi copies information across to a website; people are free to look at this. Gerard: we're already doing this, too.
Set up a page pointing to those URLs.
Jon and Gerard to send links
Questions from Gerard
Trash table support in PNFS
From Gerard:
I'd like to know more about the special PNFS release being used at FNAL, the one which has a removed files table which should avoid the creation of orphan files (we've a minor issue with them).
When is this to be released?
Fermi is running this?
Do we have any issues releasing it?
Jon says it works fine in Fermi.
We understand that it's important. We will be working on it.
Release in 1.9.2 branch?
PNFS is a separate release process, independent of dCache.
Delaying HSM writes
From Gerard:
I'd like to talk about the possibility of adding HSM migration thresholds to dCache pools in order to delay tape writing until a certain amount of data is present at the pool (e.g. 100MB). Of course, this would need a timeout (configurable?).
Some data lands slowly at pools. We're wondering if we could add a flag so that data is not copied to tape until there's a certain amount of it. Would this be in scope for this
Three thresholds:
Amount of data for a storage group.
Time: minimum time.
Number of files that have not been flushed.
The first threshold met will trigger the flush: configured for each
Per-pool, not global configuration.
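The "first threshold met triggers the flush" rule noted above can be sketched as follows. This is only an illustration of the decision logic, not dCache code or configuration: the class names, field names and limit values are hypothetical, and the time threshold is treated here as a maximum age of the oldest unflushed file (the timeout Gerard asked about), which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlushThresholds:
    """Hypothetical per-pool limits for one storage group."""
    max_bytes: int           # amount of unflushed data
    max_age_seconds: float   # age of the oldest unflushed file (assumed meaning)
    max_files: int           # number of unflushed files


@dataclass
class QueueState:
    """Hypothetical snapshot of what is waiting to go to tape."""
    pending_bytes: int
    oldest_file_age_seconds: float
    pending_files: int


def flush_reason(state: QueueState, limits: FlushThresholds) -> Optional[str]:
    """Return the name of the first threshold that is met, or None."""
    if state.pending_files == 0:
        return None  # nothing to flush
    if state.pending_bytes >= limits.max_bytes:
        return "bytes"
    if state.oldest_file_age_seconds >= limits.max_age_seconds:
        return "age"
    if state.pending_files >= limits.max_files:
        return "files"
    return None


# Example limits, chosen only for illustration.
limits = FlushThresholds(max_bytes=100 * 1024 * 1024,
                         max_age_seconds=3600,
                         max_files=500)
small_and_young = QueueState(10 * 1024 * 1024, 60, 3)
small_but_old = QueueState(10 * 1024 * 1024, 7200, 3)
print(flush_reason(small_and_young, limits))
print(flush_reason(small_but_old, limits))
```

Because any single threshold is enough, slowly arriving data is still flushed once it is old enough, even if the size threshold is never reached.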
Pedro: what's the motivation here?
HPSS, send it as soon as possible and use the HPSS disk buffer.
Gerard: the reason is with Enstore, we've seen we've not been able to transfer for
retain disk-to-tape files and write 3..4 tapes
If you don't have a disk cache between dCache and the tape back-end, it's more efficient to write to tape in bunches. If you don't do this, then the tape system will constantly mount and unmount tapes.
Questions from Pedro
GridFTP copy wait errors
Pedro asks:
occasional gridftp_copy_wait errors, which in 99% of the cases are solved by a restart of the doors (stalled connections?)
Doors are restarted once or twice per week.
pinid takes a long time
Pedro also asks:
under some stress conditions, the retrieval of pinid takes a long time (2 minutes which is more than the retrieval of the pnfsid <= 1 minute) and makes the retrieval of files on disk to time out (FTS timeout is 3 minutes)
Email to user-forum.
FTS time-out of three minutes.
Pin manager can take (approx.) two minutes just to .
Files are on disk.
We don't have a special thread for the location information.
Could you have a look how long the queues are in the PnfsManager?
It may be a good idea to add a separate thread for the location information.
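The timing problem Pedro describes can be checked with a small calculation. The sketch below uses the figures quoted above (pin id about 2 minutes, pnfsid up to 1 minute, FTS timeout 3 minutes) and assumes the two lookups happen one after the other, which the notes do not state; the function and constant names are made up for illustration.

```python
FTS_TIMEOUT_S = 180     # FTS timeout: 3 minutes
PIN_LOOKUP_S = 120      # pin id retrieval under stress: approx. 2 minutes
PNFSID_LOOKUP_S = 60    # pnfsid retrieval: up to 1 minute


def remaining_budget(timeout_s, *step_durations_s):
    """Seconds left for the transfer itself after the listed steps.

    Assumes the steps run sequentially (an assumption, not from the notes).
    """
    return timeout_s - sum(step_durations_s)


def fits_in_timeout(timeout_s, *step_durations_s):
    """True only if some time is left once all steps have completed."""
    return remaining_budget(timeout_s, *step_durations_s) > 0


left = remaining_budget(FTS_TIMEOUT_S, PIN_LOOKUP_S, PNFSID_LOOKUP_S)
print(f"time left for the transfer: {left} s")
```

Under this assumption the lookups alone use up the whole FTS timeout, even though the files are already on disk.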
News on releases
1.9.3 is now available.
1.9.0 is no longer supported; please migrate as soon as possible.
DTNM
Tuesday 7th July.
