Last modified on 02/13/18 17:33:48

dCache Tier I meeting February 13th, 2018

[part of a series of meetings]

Present

dCache.org(Paul, Tigran), IN2P3(), Sara(), Triumf(), BNL(Jane), NDGF(Dmytro), PIC(), KIT(Xavier), Fermi(), CERN(),

Agenda

(see box on the other side)

Site reports

KIT

Things are running fine again.

ATLAS transfers failing

Last week there were problems with ATLAS transfers. See RT ticket 9340.

After some investigation it was determined that the transfers were failing because they were queuing.

This was because the transfers were writing to tape and there are only two pools (one main pool and a smaller backup) that are tape-attached.

The pool was configured to accept 100 concurrent writes, which suggests that CUSTODIAL writes should have a different limit from REPLICA writes.

FTS/ATLAS was trying to do too many transfers. This is now fixed by ATLAS separating their disk and tape writes and imposing different limits.

Where are the parallel transfers

RT 9344

Xavier would like to know the number of streams involved in a transfer, in order to investigate whether the number of streams affects transfer performance.

We could log this in the protocol-info part of the billing record.

Accommodation for workshop

Will there be rooms reserved in the DESY hostel?

No, but you can reserve directly at the DESY hostel.

Using HA at DESY

Yes, we have had HA deployed for some time on the cloud instance.

We are upgrading the XFEL instance next week and will switch on HA support there.

CA problems

Last year there was a problem with the UK CA changing their intermediate CA certificate from SHA-1 to SHA-256.

Is this now resolved? Yes upstream, but the fix is not yet in dCache.

Xavier is happy for ticket RT 9310 to be closed.

NDGF

Going fine!

Missing cell

The dCache ssh admin interface claimed that a cell hosted on a pool under heavy load did not exist.

Yes, this can happen as the admin cell sends a message to verify that the cell exists when the '\c' command is issued.

The cell still appears in the \l output because that information is cached.

We can look at changing how the \c command checks for the existence of a cell.
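The disagreement between \c and \l can be pictured with a toy model. The sketch below is not dCache code: `ToyCell`, `cell_exists` and `known_cells` are hypothetical names, and it assumes the existence check gives up after a timeout, which these minutes do not state. It only illustrates how a live check against a slow cell can fail while a cached listing still shows the cell.

```python
import queue
import threading
import time


class ToyCell:
    """Hypothetical stand-in for a cell; not part of dCache."""

    def __init__(self, name, reply_delay):
        self.name = name
        self.reply_delay = reply_delay  # seconds before the cell answers

    def ping(self, reply_box):
        # Answer asynchronously, after the configured delay.
        def answer():
            time.sleep(self.reply_delay)
            reply_box.put(self.name)

        threading.Thread(target=answer, daemon=True).start()


def cell_exists(cell, timeout):
    """Live check: send a message and wait up to `timeout` seconds for a reply."""
    reply_box = queue.Queue()
    cell.ping(reply_box)
    try:
        reply_box.get(timeout=timeout)
        return True
    except queue.Empty:
        return False  # no reply in time, so the cell looks as if it were missing


# Cached listing (assumption): filled earlier and not refreshed by the check.
known_cells = {"pool_a", "pool_b"}

idle_cell = ToyCell("pool_a", reply_delay=0.0)
busy_cell = ToyCell("pool_b", reply_delay=0.5)  # simulates a pool under heavy load
```

In this model `cell_exists(busy_cell, timeout=0.05)` returns `False` even though `"pool_b"` is still in `known_cells`, which mirrors the symptom reported above.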

BNL

Jane reported things are going fine.

CLOSE_WAIT

Jane described the problem they saw at BNL: the door had many TCP connections (to pools) in the CLOSE_WAIT state.

There seem to be two problems here:

  1. The transfer didn't succeed.
  2. It wasn't cleaned up correctly when the client disconnected.

Thanks to the information provided by Jane, we believe there is enough information to reproduce the problem.
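For sites that want to check whether they see the same symptom, the following is a minimal sketch that counts CLOSE_WAIT sockets per remote host from text in the Linux `/proc/net/tcp` format. It assumes a little-endian host (e.g. x86) and IPv4 only; the function names are illustrative and not part of dCache.

```python
from collections import Counter

CLOSE_WAIT = "08"  # Linux kernel code for the CLOSE_WAIT TCP state


def decode_ipv4(hex_addr):
    """Turn '0100007F:0016' (as in /proc/net/tcp) into ('127.0.0.1', 22)."""
    ip_hex, port_hex = hex_addr.split(":")
    # On little-endian hosts the address bytes appear reversed.
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets), int(port_hex, 16)


def count_close_wait(proc_net_tcp_text):
    """Count CLOSE_WAIT sockets per remote IPv4 address."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        remote, state = fields[2], fields[3]
        if state == CLOSE_WAIT:
            remote_ip, _port = decode_ipv4(remote)
            counts[remote_ip] += 1
    return counts
```

On the door host, passing the contents of `/proc/net/tcp` to `count_close_wait` gives a per-pool tally that can be sampled over time to see whether the number of CLOSE_WAIT sockets keeps growing.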

Draining the GridFTP door

Jane reported that she has had difficulty draining the GridFTP door. This is with dCache v3.0 and seems to be a regression against their earlier version.

Paul asked Jane to open a ticket describing the problem.

Released 4.0 version

BNL is currently running 3.0. To which version should they upgrade? What will be "production ready" in ~6 months' time?

All dCache releases are (to the best of our knowledge) production ready. When upgrading at DESY, we always take the latest version. For example, next week's upgrade is to 4.0. If we upgrade after that, then it should be to dCache v4.1.

Jane described how BNL has a test bed instance, exactly like their production, with which they can test dCache releases.

Paul: This should give you additional confidence that whichever version you take will at least work. There is always the possibility of performance-related regressions, ones that are only visible when deployed in production; however, those are relatively unlikely.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.