
dCache Tier I meeting November 11, 2014

[part of a series of meetings]

Present

dCache.org(Tigran, Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(), PIC(), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

KIT

Feature request: the ability to filter by storage class when doing rep ls.
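
Until such a filter exists natively, the output can be filtered externally; a minimal sketch, assuming the si={store:group} storage-info field that rep ls includes for each replica (storage class and file name here are illustrative):

    # replica listing captured from the pool's admin cell
    grep 'si={atlas:tape}' rep-ls-output.txt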

Xavier reported that things are going well, most of the time.

ATLAS and dCacheDomain

There were some issues with ATLAS that they were unable to identify. The result was that dCacheDomain used 90% CPU and eventually ran out of memory. Unfortunately, due to incorrect permissions on the log directory, the Java application was unable to write a heap dump when the OOM occurred. This has now been fixed.

So far, the problem has not reoccurred.
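
As an aside, the heap dump on OOM comes from standard HotSpot options; a minimal sketch, assuming the domain's JVM options can be extended through dcache.conf (the exact property name should be verified against the release in use):

    # dcache.conf -- property name is an assumption; check your release
    dcache.java.options.extra=-XX:+HeapDumpOnOutOfMemoryError \
        -XX:HeapDumpPath=/var/log/dcache

The directory given to HeapDumpPath must be writable by the user running the domain, which is exactly the permission problem described above.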

NFS

Xavier has now increased the maximum number of movers, which solved the problem: the benchmarking tool the user community was using now completes. Xavier didn't understand why this was necessary, though.

Tigran explained that ROOT in particular (and potentially other applications) will open all the files it needs before starting any IO. If it tries to open more files on a pool than the number of allowed NFS movers (or dcap movers, when proxying), then the application and dCache deadlock: the client is waiting to open a file, while dCache will only allow this once already-open files have been closed.
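
The mitigation is therefore to make the pool's mover limit comfortably larger than the number of files a single client may hold open at once. A sketch using the pool's admin-cell commands (pool name and limit are illustrative; whether NFS movers use the default queue or a named one may depend on the pool configuration):

    cd pool_kit_01
    mover set max active 1000
    save    # persist the new limit in the pool's setup file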

There is another issue in which dCache does not always clean up movers. This is a bug where, when the IO is proxied and the client gives up waiting, the door does not remove the queued mover. This bug will be fixed with the next release of dCache, hopefully today.
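
Until that release is deployed, leftover movers can be inspected and cleaned up by hand from the pool's admin cell; a hedged sketch (the job id comes from the listing):

    mover ls       # list active and queued movers with their job ids
    mover kill 42  # remove a stuck mover; 42 is an illustrative id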

Upgrading

Xavier has been testing the migration to 2.10 and has some questions after reading the upgrade guide.

With NFS, how should he find out whether a client has banned a pool?

Currently this is only possible on the client. On the door, a client banning a pool will show up as a large number of proxy requests from one particular client to one particular pool.

What is the purpose of the pool health script: doesn't dCache already test the health?

Yes, but dCache tests only that it can create and remove a zero-length file. The filesystem or RAID system may be able to provide additional information; for example, that one disk has gone bad and the RAID is currently rebuilding. Under these circumstances it may be wise to switch the pool to read-only; however, dCache cannot discover on its own that a RAID is rebuilding. This is why the script is necessary.
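
As an illustration only (the property wiring and the exit-code contract must be taken from the upgrade guide, not from this sketch), such a script might inspect the software-RAID state:

    #!/bin/sh
    # Hypothetical pool health script. The exit-code semantics below are
    # an assumption; consult the dCache documentation for the real contract.
    # Report the pool as unhealthy while an md RAID is rebuilding.
    if grep -q 'recovery' /proc/mdstat; then
        exit 1    # assumed: not healthy
    fi
    exit 0        # assumed: healthy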

Xavier also reported that they will likely want to upgrade to 2.11 immediately after upgrading to 2.10. This is mostly because of the HSM bulk operations. Is there any reason why they shouldn't?

No. 2.11 will be supported just as 2.10 is; as NDGF already runs it in production, there seems to be limited risk associated with upgrading.

LHCb and SRM

Last time, we talked about a problem where LHCb reported FTS transfers failing. This was triggered by GGUS ticket #109825.

Xavier asked the networking people; they do have a rule that closes idle TCP connections, but only if there's been no traffic for 3 hours.
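
If long-idle control connections need to survive such firewall rules, one common workaround is kernel-level TCP keepalive with a period well below the firewall's timeout; an illustrative sysctl sketch (values are examples, and they only help for sockets that enable SO_KEEPALIVE):

    # /etc/sysctl.d/tcp-keepalive.conf -- probe after 1 hour idle
    net.ipv4.tcp_keepalive_time = 3600
    net.ipv4.tcp_keepalive_intvl = 60
    net.ipv4.tcp_keepalive_probes = 5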

Paul has been in touch with the FTS developers; they say that the error message indicates the problem happened during the initial handshake.

AP/ need to check whether dCache logs which side (client or server) closed the connection during the handshake.

---

AP/ Paul to contact Illya about support for the ATLAS xrootd n2n plugin.

Support tickets for discussion

[Items are added here automagically]

DTNM

Next meeting is this week on Thursday at 16:00 CET (via Google Hangout).

The meeting on the following Tuesday (2014-11-18) is cancelled as both Tigran and Paul will be away at CERN.