wiki:developers-meeting-20090630
Last modified on 06/30/09 17:19:06

dCache Tier I meeting June 22, 2009

Present

dCache.org(Patrick, Tigran, Gerd, Owen, Paul), IN2P3(), Sara(), Triumf(), BNL(Pedro), NDGF(), PIC(Gerard), GridKa(Silke, Doris), Fermi(Jon), CERN(Andrea)

Apologies

Technical issues

None.

Site reports

NDGF

Upgraded to 1.9.3 last Thursday.

PIC

Everything is OK.

GridKa

No major issues.

Just fell into an old problem. Files asking

Replica Online. File available on pool, but staged anyway.

Pool has a very high load, then dCache thinks it can get the file from tape.

Entry in pool-manager

History: client doesn't even notice that file is missing. This was requested to support

Pedro: what does "high load" mean? Is the cost factor too high?

If the regular "ping" is missed three times, then a restage is triggered.

If the cost is too high, too many copies are happening on the pool.

Only if "stage on load" is set can PoolManager cause a restage.

Pool had been up since February. Can't say when the file has been triggered.

Gerd asked: how are the test conversions going?

Some tables that are needed are not created.

Download latest version to create tables; see if we fall into the same problems.

Two problems: error cases in the directories.

Remove date ~2007.

Filename path file.

BNL

Updated to 1.9.1-8 on the pools and 1.9.0-9 on the core machines.

STEP09 went very well. The only issue: many more write pools are needed, spread over more machines.

No PNFS load problems.

21000 transfers in an hour.

At this high transfer rate, cancelled requests cause SRM to misbehave.

Q: from Patrick: did you

Storage manager on duty: during office hours someone is looking.

Out of office hours, we have probes that send emails if problems occur.

Write pools needed attention, but quiet otherwise.

Patrick: What needs to be fixed?

For the time being, we feel ready for the end of the year.

PNFS, avoid using /pnfs.

Scalability of SRM in terms of multiple front-ends.

Doing analysis.

Fermi

Things are running smoothly.

CERN

Nothing special

Maybe it would be useful to have documented somewhere how to configure dCache.

Patrick: Who is the intended audience of this information?

Jon: Fermi copies information across to a website; people are free to look at this. Gerard: we're already doing this, too.

Set up a page pointing to those URLs.

Jon and Gerard to send links

Questions from Gerard

Trash table support in PNFS

From Gerard:

I'd like to know more about the special PNFS release being used at FNAL, the
one which has a removed files table which should avoid the creation of
orphan files (we've a minor issue with them).

When is this to be released?

Fermi is running this?

Do we have any issues releasing it?

Jon says it works fine in Fermi.

We understand that it's important. We will be working on it.

Release in 1.9.2 branch?

PNFS is a separate release process, independent of dCache.

Delaying HSM writes

From Gerard:

I'd like to talk about the possibility to add a HSM migration thresholds in
dCache pools in order to delay tape writing until a certain amount of data
is present at the pool (ie 100MB). Of course, this would need a timeout
(configurable?).

Some data lands slowly at pools. We're wondering if we could add a flag so that data is not copied to tape until there's a certain amount of it. Would this be in scope for this

Three thresholds:

Amount of data for a storage group.

Time: minimum time.

Number of files that have not been flushed: once reached, they are flushed.

First threshold met will trigger the flush: configured for each

Per-pool, not global configuration.
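
The "first threshold met triggers the flush" rule discussed above can be sketched as follows. This is only an illustration of the logic in these notes, not dCache code or configuration: the class names, field names and the `should_flush` helper are hypothetical, and apart from the 100MB figure from Gerard's mail the limits are made-up values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlushThresholds:
    """Hypothetical per-pool limits for one storage group."""
    max_bytes: int          # amount of unflushed data
    timeout_seconds: float  # how long the oldest unflushed file may wait
    max_files: int          # number of files not yet flushed


@dataclass
class PendingQueue:
    """Files of one storage group waiting on a pool to go to tape."""
    total_bytes: int = 0
    file_count: int = 0
    oldest_arrival: Optional[float] = None  # arrival time of the oldest file

    def add(self, size: int, now: float) -> None:
        self.total_bytes += size
        self.file_count += 1
        if self.oldest_arrival is None:
            self.oldest_arrival = now


def should_flush(queue: PendingQueue, limits: FlushThresholds, now: float) -> bool:
    """Return True as soon as any one of the three thresholds is met."""
    if queue.file_count == 0:
        return False  # nothing to write to tape
    age = now - queue.oldest_arrival
    return (
        queue.total_bytes >= limits.max_bytes
        or age >= limits.timeout_seconds
        or queue.file_count >= limits.max_files
    )


# Illustrative limits: 100MB of data, one hour, or ten files.
LIMITS = FlushThresholds(max_bytes=100 * 1024 * 1024,
                         timeout_seconds=3600.0,
                         max_files=10)
```

With these limits, a pool holding a single 10MB file would keep it on disk until an hour has passed, while a pool receiving data quickly would flush as soon as 100MB or ten files have accumulated, so tapes are written in bunches rather than file by file.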

Pedro: what's the motivation here?

HPSS, send it as soon as possible and use the HPSS disk buffer.

Gerard: reason with enstore, we've seen we're not been able to transfers for

retain disk=to-tape files and write 3..4 tapes

If you don't have a disk cache between dCache and the tape back-end, it's more efficient to write to tape in bunches. If you don't do this, then the tape system will constantly mount and unmount tapes.

Questions from Pedro

GridFTP copy wait errors

Pedro asks:

occasional gridftp_copy_wait errors which are 99% of the cases
solved by restart of the doors (stalled connections?)

Doors are restarted once or twice per week.

pinid takes a long time

Pedro also asks:

under some stress conditions, the retrieval of pinid takes a long
time (2 minutes which is more than the retrieval of the pnfsid <= 1
minute) and makes the retrieval of files on disk to time out (FTS
timeout is 3 minutes)

Email to user-forum.

FTS time-out of three minutes.

Pin manager can take (approx.) two minutes just to .

Files are on disk.

We don't have a special thread for the location information.

Could you have a look at how long the queues are in the PnfsManager?

It may be a good idea to add a separate thread for the location information.
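
The timing problem Pedro describes comes down to simple arithmetic, sketched below. The figures are the ones from his mail; the assumption that the two lookups run one after the other, and the `remaining_budget` helper itself, are illustrative only.

```python
# Figures quoted in Pedro's mail, in seconds.
FTS_TIMEOUT_S = 3 * 60   # FTS gives up after three minutes
PINID_S = 2 * 60         # pinid retrieval under stress
PNFSID_S = 1 * 60        # pnfsid retrieval, upper bound


def remaining_budget(timeout_s: int, *step_durations_s: int) -> int:
    """Time left before the timeout, assuming the steps run sequentially."""
    return timeout_s - sum(step_durations_s)


# Under the sequential assumption the two lookups alone use up the whole
# FTS timeout, leaving nothing for the transfer of a file already on disk.
budget_left = remaining_budget(FTS_TIMEOUT_S, PINID_S, PNFSID_S)
```

Here `budget_left` comes out as zero, which is consistent with transfers timing out even though the files are on disk.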

News on releases

1.9.3 is now available.

1.9.0 is no longer supported; please migrate as soon as possible.

DTNM

Tuesday 7th July.