
dCache Tier I meeting January 4, 2011

[part of a series of meetings]

Present

dCache.org (Paul, Antje, Tanja), Triumf (Simon), PIC (Gerard)

Agenda

(see box on the other side)

Site reports

PIC

Gerard reported that everything was working fine over the festive period. There is concern about the Chimera bug that he reported with 1.9.10-4: is the problem in Chimera-specific code, or does it also affect PNFS-based installations? Paul replied that it is a Chimera-specific problem. Gerard also asked if the problem affects the 1.9.5 branch. Paul wasn't sure but would ask Tigran for more details.

Gerard also mentioned that they are planning to upgrade their production Tier-1 instance to a newer version of dCache (e.g., 1.9.10). The likely time-scale for this would be February. The upgrade would keep PNFS, as they haven't yet thoroughly tested dCache and Enstore with Chimera.

PIC currently have three dCache instances: a Tier-1, a Tier-2 and a preproduction test instance. The Tier-1 currently runs 1.9.5. The Tier-2 runs the latest "green" version (currently 1.9.10-4); for the Tier-2, being down for one or two days isn't such a big deal.

The motivation for the upgrade is support for HTTP / WebDAV.

One of Gerard's concerns was that nobody is running PNFS with the latest dCache versions (e.g., 1.9.10): could some of the changes between 1.9.5 and 1.9.10 have broken something for PNFS? Paul thought it unlikely, but appreciated the concern.

Gerard also wondered about downgrading: should something go wrong when installing 1.9.10, would it be possible to revert to 1.9.5? In particular, are there any database schema changes that would prevent switching back to 1.9.5 if there's a problem?

Paul would find out.

Triumf

Simon reported that everything was OK over the festive period, but that they were having problems just before it. At that time there was a large number of Tier-1-to-Tier-1 transfers. They noticed many problems where pools suffered "slow writing". Also, during this time, the pools apparently lost their connection to the pool-manager, and the pool-manager would declare the pool offline. Simon limited each pool's maximum number of movers to 8-10; however, this didn't help.

This is described in RT ticket #6013.

Empirical evidence suggested that the problem wasn't due to network saturation: Simon was able to log into the pool node and there was no appreciable delay when running commands.

The pools were observed spending a lot of time in IO-wait: Simon estimated between 20% and 50%.

Triumf now has various pools with different configurations: SL-4, SL-5, different network cards, etc. They find it hard to see which parts of the system need tuning, so help here would be appreciated.

Simon also asked for more details about the communication between the pool and the pool-manager.

Paul explained that all dCache internal communication between domains goes over TCP connections that form a star topology, with the dCacheDomain (which runs the PoolManager) at the centre. Each pool therefore talks to the PoolManager over a single TCP connection, over which all messages to and from that pool travel (all door messages, messages to PnfsManager, and so on).
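To make the routing arrangement concrete, here is a minimal sketch (illustrative only, not dCache source code; the class names StarTopologySketch, DomainLink and CentralDomain, and the message payloads, are invented for the example). It shows every kind of message from a pool being multiplexed over that pool's one link to the central domain, which is why a backlog on that link delays pings along with everything else.

    // Illustrative sketch only, not dCache source: each domain holds a single
    // connection to the central dCacheDomain, and every message that domain
    // sends or receives is multiplexed over that one link.
    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    public class StarTopologySketch {

        /** A cell message, e.g. from a pool to the PoolManager or PnfsManager. */
        record Message(String sourceCell, String destinationCell, String payload) {}

        /** Stands in for one domain's single TCP connection to dCacheDomain. */
        static class DomainLink {
            final Queue<Message> pending = new ArrayDeque<>();
            void send(Message m) { pending.add(m); }
        }

        /** The hub (dCacheDomain): forwards messages between the spoke domains. */
        static class CentralDomain {
            private final Map<String, DomainLink> links = new HashMap<>();

            void register(String domain, DomainLink link) { links.put(domain, link); }

            /** Drain a spoke's link: pings, door traffic and PnfsManager requests
             *  all arrive over the same connection, in order. */
            void pump(String domain) {
                DomainLink link = links.get(domain);
                for (Message m; (m = link.pending.poll()) != null; ) {
                    System.out.printf("dCacheDomain: %s -> %s : %s%n",
                            m.sourceCell(), m.destinationCell(), m.payload());
                }
            }
        }

        public static void main(String[] args) {
            CentralDomain hub = new CentralDomain();
            DomainLink poolLink = new DomainLink();
            hub.register("pool1Domain", poolLink);

            // All of a pool's traffic shares this one link, so a backlog on the
            // link delays the PoolManager pings along with everything else.
            poolLink.send(new Message("pool1", "PoolManager", "ping"));
            poolLink.send(new Message("pool1", "PnfsManager", "file-attributes update"));
            poolLink.send(new Message("pool1", "GFTP-door", "transfer complete"));
            hub.pump("pool1Domain");
        }
    }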

Pool IO is handled by a pool of worker threads; however, the ping messages from pools to the PoolManager are sent from a separate thread. This means that a pool that has all of its IO threads blocked (due to a slow disk) should still be able to send the ping messages. One explanation is CPU starvation: the IO threads generate so much IO-wait activity that the ping thread is unable to run; however, Paul felt this explanation was unlikely.
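The threading pattern Paul described can be sketched as follows (again illustrative only, not dCache source; the class name PoolHeartbeatSketch, the thread counts and the timings are invented). The point of the sketch is that blocking all of the bounded IO threads on slow disk writes does not stop the separately scheduled ping, which is why CPU starvation would be needed to explain missing pings.

    // Illustrative sketch only, not dCache source: mover IO runs on a bounded
    // pool of worker threads while the PoolManager ping is sent from its own
    // scheduled thread, so blocked IO threads alone should not stop the pings.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PoolHeartbeatSketch {

        public static void main(String[] args) throws InterruptedException {
            // Bounded mover threads, analogous to a pool's maximum-movers limit.
            ExecutorService ioThreads = Executors.newFixedThreadPool(4);

            // Dedicated thread that keeps telling the PoolManager "I'm alive".
            ScheduledExecutorService pingThread =
                    Executors.newSingleThreadScheduledExecutor();
            pingThread.scheduleAtFixedRate(
                    () -> System.out.println("ping -> PoolManager"),
                    0, 2, TimeUnit.SECONDS);

            // Simulate movers stuck writing to a slow disk: all IO threads block,
            // yet the ping task above continues to fire on schedule.
            for (int i = 0; i < 8; i++) {
                final int mover = i;
                ioThreads.submit(() -> {
                    System.out.println("mover " + mover + " writing (blocked on disk)...");
                    Thread.sleep(30_000);   // stand-in for a very slow write
                    return null;
                });
            }

            TimeUnit.SECONDS.sleep(10);     // let the demo run briefly
            ioThreads.shutdownNow();
            pingThread.shutdownNow();
        }
    }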

Simon confirmed that he has checksum-on-write enabled on the pools.

Another issue is that nothing is logged by the pool-manager if a pool goes down, so it's hard to know when this has happened.

Another issue is that doors are apparently killing transfers due to unresponsive pools; however, the mover is left on the pool, which prevents other transfers from starting.

Pools are tested first in Triumf's preproduction system (PPS); however, this behaviour was not seen there. The tests involve firing off many accesses from worker nodes. The pools themselves are unchanged when they move from PPS into production: the only change is in the pool's configuration.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.