wiki:developers-meeting-20160322
Last modified on 03/22/16 15:31:00

dCache Tier I meeting March 22, 2016

[part of a series of meetings]

Present

dCache.org(Tigran, Paul), NDGF(Ulf), KIT(Xavier)

Agenda

(see box on the other side)

Site reports

KIT

Xavier reported that regular production is running fine now.

Draining with insufficient target capacity

Xavier encountered a problem with their ATLAS instance.

They were setting up new resources for them and wanted to reconfigure existing pools that are 15 TiB in size.

Xavier started a migration job to drain the pools, but miscalculated and filled all other ATLAS pools.

A small number of pools attempted to write data despite there being no free space on the underlying filesystem.

This resulted in an IOException (either from an ATLAS upload or from the p2p transfers draining the pools), which caused the pool to go into DEAD state.

Once in DEAD state, the pool needed to be restarted to resume activity.

This behaviour is undesirable: a pool that runs out of capacity should go into READ-ONLY state instead.

Xavier to send the stack-trace as a support ticket.
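The miscalculation described above (draining pools without enough free space on the targets) is the kind of thing a simple pre-flight check can catch. The following is a minimal sketch, not dCache code: the `Pool` record, the `drain_shortfall` function, the pool names and the `headroom` reserve are all hypothetical, and the sizes are illustrative apart from the 15 TiB pool size mentioned above.

```python
from dataclasses import dataclass

TIB = 1024 ** 4  # bytes in one TiB


@dataclass
class Pool:
    name: str  # hypothetical pool name
    used: int  # bytes currently stored on the pool
    free: int  # bytes still available on the pool


def drain_shortfall(sources, targets, headroom=0.05):
    """Return how many bytes the targets fall short by (0 if the drain fits).

    `headroom` is an assumed fraction of each target's free space that is
    kept in reserve so that regular uploads can continue during the drain.
    """
    needed = sum(p.used for p in sources)
    available = sum(int(p.free * (1 - headroom)) for p in targets)
    return max(0, needed - available)


if __name__ == "__main__":
    sources = [Pool("atlas-old-1", used=15 * TIB, free=0)]
    targets = [Pool("atlas-new-1", used=0, free=10 * TIB)]
    short = drain_shortfall(sources, targets, headroom=0.0)
    print(f"shortfall: {short / TIB:.1f} TiB")
```

A non-zero result means the migration job should not be started until more target capacity is available.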

NDGF

Ulf reported that things are pretty nice.

NDGF have updated to v2.15.1. They experienced some minor problems with the SRM database, but these are fixed now.

The database update took some time to complete: "slightly" over 4 hours predicted for NDGF.

Some pools have been updated to v2.15 already, but most are still running v2.14.

Ulf noted that, if you create a new migration job, you cannot cancel it during the initialisation stage. For large jobs, the initialisation stage can take some time (measured in tens of minutes), so this limitation is annoying.

The motivation is that NDGF are migrating a large amount of data off ZFS RAID-Z2, as this does not work for ALICE. ALICE is requesting small random reads, using xrootd. Unfortunately, ZFS verifies a checksum on read and always reads whole blocks of data. The result is that the pools spend all their time reading data for (and calculating) checksums rather than delivering data to the client. These are 80 TiB pools, so the draining will take some time.
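The effect of whole-block reads on small random requests can be estimated with some simple arithmetic. The sketch below is illustrative only: the 128 KiB block size is the ZFS default `recordsize`, not a value reported for the NDGF pools, the 4 KiB request size is an assumption, and caching is ignored. The function name `read_amplification` is hypothetical.

```python
KIB = 1024


def read_amplification(request_bytes, record_bytes, offset=0):
    """Ratio of bytes read from disk to bytes requested by the client.

    Assumes every record touched by the request is read in full so that
    its checksum can be verified, and that nothing is served from cache.
    """
    first = offset // record_bytes
    last = (offset + request_bytes - 1) // record_bytes
    records = last - first + 1
    return records * record_bytes / request_bytes


if __name__ == "__main__":
    # An assumed 4 KiB random read against a 128 KiB record.
    print(read_amplification(4 * KIB, 128 * KIB))
    # The same read straddling a record boundary touches two records.
    print(read_amplification(4 * KIB, 128 * KIB, offset=126 * KIB))
```

Under these assumptions, an aligned 4 KiB read costs 32 times its size in disk I/O, and twice that when it straddles two records.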

Support tickets for discussion

[Items are added here automagically]

Other items

SRM work-around

Paul mentioned that the next bug-fix release will include a work-around for SRM clients that request unreasonably short lifetimes. It is disabled by default but, when enabled, may allow the CMS requests at KIT to succeed. Xavier will ask CMS about upgrading.

Video conference

People were generally happy with the new video service and are willing to use it for next week's meeting. The URL should be the same; Paul will circulate it before the meeting.

DTNM

Same time, next week.