wiki:developers-meeting-20111011

dCache Tier I meeting October 11, 2011

[part of a series of meetings]

Present

dCache.org (Tanja, Paul), IN2P3 (Nicolas), Triumf (Simon), PIC (Gerard), GridKa (Doris)

Agenda

(see box on the other side)

Site reports

IN2P3

Nicolas reported that everything is OK for IN2P3.

IN2P3 successfully migrated their non-LCG production instance to 1.9.12-10 on Tuesday. Everything went well.

Two issues were found: first, missing uid/gid information when flushing to tape (RT #6698); second, a problem with memory consumption on pools.

The memory problem causes affected pools to run out of memory. IN2P3 had to increase the pool memory from the default value (0.5 GiB) to 2 GiB to allow the pools to start. The pools in question have a storage capacity of 1 TiB; they also held a large number of precious files (~100,000) despite not being attached to any HSM.

Paul will bring this up during tomorrow's dev. meeting.

PIC

Gerard reported that things are OK in production.

He also asked whether there had been any progress with ticket RT #6561. The ticket describes a problem where the WebDAV door's CPU load increases over time. Gerard supplied a snapshot of Ganglia graphs from the machine. These show CPU activity that remains constant most of the time but increases step-wise at approximately midnight, seemingly every 10 days.

The problem has only been observed when encryption (SSL) is enabled. Restarting the door provides a temporary work-around.

RT #6561 is currently not a problem as the WebDAV door isn't in production. However, Gerard would like to deploy the WebDAV door in production soon.

Paul mentioned that he and Gerd had seen spikes of activity; however, these were relatively short-lived, lasting a few minutes. While undesirable, such spikes went away by themselves.

Paul will talk with Gerd about this, to see if he has any further ideas.

GridKa

Doris said that GridKa has nothing to report.

Triumf

Simon reported that they upgraded their production dCache instance to 1.9.5-28 last week. The upgrade went OK, but there was an issue with a pool after the restart. Simon remembered a ticket (opened some time ago) on a related topic. He will investigate and either reopen that ticket or open a fresh one.

Simon also mentioned that he's continuing his testing of 1.9.12-10 in Triumf PPS.

He asked if anyone was using Scientific Linux v6.

Paul didn't know of anyone.

Gerard reported that they were running two pools on Red Hat Enterprise Linux v6.

Simon mentioned a problem with movers on the pool. The problem was present with GridFTP v2 transfers, but not with GridFTP v1 transfers. The observation was that a client would connect to dCache's GridFTP door and initiate a transfer, but dCache would disconnect the control channel before the data channel (between pool and client) was established.

The problem is not present with 1.9.5 on SL6: that works fine.

Simon felt that a potentially contributing factor was the server's network cards: it has four 10 Gbit/s cards.

Simon has looked at the pool in "debug mode" but has not yet done the same for the door.

There was some discussion on the Job Timeout Manager, about killing off movers. PIC mentioned that they have a last-access criterion of 30 minutes.
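As an illustration only, a last-access criterion of this kind can be sketched as follows. This is not dCache's Job Timeout Manager code: the `Mover` record, its field names and the `movers_to_kill` helper are hypothetical, and only the 30-minute limit comes from the discussion.

```python
from dataclasses import dataclass

# The 30-minute last-access criterion mentioned by PIC, in seconds.
IDLE_LIMIT_SECONDS = 30 * 60


@dataclass
class Mover:
    """Hypothetical record of a mover; not a dCache data structure."""
    mover_id: int
    last_access: float  # time of last activity, seconds since the epoch


def movers_to_kill(movers, now, idle_limit=IDLE_LIMIT_SECONDS):
    """Return the ids of movers idle for strictly longer than idle_limit."""
    return [m.mover_id for m in movers if now - m.last_access > idle_limit]


movers = [
    Mover(mover_id=1, last_access=1000.0),  # idle for 2000 s at now=3000
    Mover(mover_id=2, last_access=2500.0),  # idle for 500 s at now=3000
]
stale = movers_to_kill(movers, now=3000.0)
```

With these example values only mover 1 exceeds the limit. A mover that is idle for exactly 30 minutes is kept, since the comparison is strict.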

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.