dCache Tier I meeting May 3, 2011
[part of a series of meetings]
Present
dCache.org (Antje, Paul), Triumf (Simon), GridKa (Doris)
Agenda
(see box on the other side)
Site reports
GridKa
Incident
Incident last night with ATLAS dCache
PoolManager 8 GiB
Didn't die, but no progress being made.
The on-call team restarted the dCacheDomain. This fixed the problem and the ATLAS dCache instance is working again.
The cause isn't known. Yesterday, they saw pools under heavy load. This may be partly due to the ongoing migration of data to new hardware: migration jobs are running on these pools to move the data.
There are 8 pools that are being migrated.
There were quite a large number of staging requests.
System is running again now.
The load on the pools being migrated is now lower than it was yesterday, but staging requests are still coming in. Currently, there's roughly the same number of requests as yesterday.
Doris will open a support ticket with information about the problem.
Migration of PNFS to Chimera
Doris asked about support for migrating their old dCache instance to Chimera.
Paul will look into this.
Triumf
During the weekend, one of the pools dropped out of production for a bit. It recovered by itself; however, the pool's disappearance resulted in a lot of jobs failing.
It looks like the problem was caused by heavy I/O load: the pool dropped out during a period of high I/O. The jobs that failed gave an error message saying that dccp had timed out.
The hosts have a maximum of 25 movers for dccp. Each pool has 5 movers for dccp.
Is the pool connected to a DDN system? Yes. Is it using XFS? Yes.
For DDN systems, you could try using the noop I/O scheduler. Note: this is just an idea and hasn't been tested at DESY yet.
In "recent enough" kernels, you can change the I/O scheduler without rebooting. The command to set the I/O scheduler to noop looks like this:
echo noop > /sys/block/{DEVICE-NAME}/queue/scheduler
where {DEVICE-NAME} is the name of the DDN block-device.
To make this change permanent, adjust your kernel boot options by adding the following option:
elevator=noop
Note that this affects all disks, not just the DDN, so I/O to the system disk may slow down as a result.
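Before and after switching, it can help to check which scheduler a device is actually using. The following is a minimal sketch, not something tested at any of the sites: the helper names (active_scheduler, set_scheduler) and the SYSFS_ROOT override are assumptions made for illustration. It relies only on the fact that the kernel lists the available schedulers in the queue/scheduler file and marks the active one with square brackets.

```shell
#!/bin/sh
# Sketch only: helper names and the SYSFS_ROOT override are illustrative
# assumptions, not part of dCache or of any site's tooling.

# Root of the sysfs block tree; overridable so the helpers can be tried
# against a scratch directory instead of the live system.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

# Print the active scheduler for a device. The kernel lists all available
# schedulers on one line and marks the active one with square brackets,
# e.g. "noop deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$SYSFS_ROOT/$1/queue/scheduler"
}

# Switch a device to the given scheduler, but only if the kernel offers it.
set_scheduler() {
    dev="$1"
    want="$2"
    file="$SYSFS_ROOT/$dev/queue/scheduler"
    if tr -d '[]' < "$file" | tr ' ' '\n' | grep -qx "$want"; then
        echo "$want" > "$file"
    else
        echo "scheduler '$want' not available for $dev" >&2
        return 1
    fi
}
```

For example, with {DEVICE-NAME} standing for the DDN block-device as above, running `active_scheduler {DEVICE-NAME}` as root shows the current setting and `set_scheduler {DEVICE-NAME} noop` changes it for that device only, leaving the system disk untouched until the next reboot.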
Simon mentioned that he's currently evaluating a change to when checksums are calculated, switching from onWrite to onTransfer. The results so far are promising.
Simon said there were no other problems to report.
Support tickets for discussion
[Items are added here automagically]
DTNM
Same time, next week.
