
dCache Tier I meeting July 19, 2011

[part of a series of meetings]

Present

dCache.org (Patrick, Gerd, Tanja, Paul), Sara (Onno), NDGF (Gerd), PIC (Gerard)

Agenda

(see box on the other side)

Site reports

Sara

Onno reported that, in general, things are OK. He reported a small stability issue with their SRM node.

Last Thursday evening Sara noticed that the SRM had become sluggish: it sometimes responded but was slow to do so. For example, srmLs sometimes got a response but the response would take a long time (~2 minutes). Onno mentioned that they have Nagios tests for port 8443; these were failing most of the time. They found that increasing the timeout for the test helped: it failed less often.

After suffering from this problem for a few hours they decided to reboot the SRM node. This fixed the problem; however, the problem came back on Sunday.

Onno expressed concern that this problem has appeared twice in quick succession. It has been reported as RT ticket #6474.

There was a discussion about some of the limitations of the SRM implementation in dCache. One limitation is what happens if a client times out (internally) and disconnects. If this happens, due to a limitation of Axis, the SRM is unaware that the client has disconnected and will continue processing the user's request.

This is particularly a problem for blocking operations, such as srmLs. If the operation takes some time and the client disconnects (due to an internal timeout) then the client may reconnect and retry while the SRM is still processing the abandoned request. Each retry takes up a further connection, which can result in the 500 connection limit being exhausted.

This is consistent with restarting dCache "fixing" the problem: the restart will kill the outstanding connections (where the client had disconnected), so returning everything to normal.
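The exhaustion argument can be illustrated with a small simulation. This is only a sketch under assumed numbers: the 500 connection limit and the roughly 2 minute response time come from the discussion above, while the number of clients, the 30 second client timeout and the function name `simulate_slots` are hypothetical and not taken from dCache or from Sara's setup.

```python
def simulate_slots(clients, service_time, client_timeout, limit, duration):
    """Return (peak_busy_slots, rejected_requests) over `duration` seconds.

    Simplified model (all parameters are assumptions for illustration):
    every client sends a request at t=0. If no reply arrives within
    `client_timeout` seconds the client disconnects and retries at once.
    The server does not notice the disconnect, so each request keeps its
    slot for the full `service_time` seconds.
    """
    release_times = []  # time at which each busy slot becomes free again
    peak = 0
    rejected = 0
    for now in range(0, duration, client_timeout):
        # Free the slots whose requests have finished processing.
        release_times = [t for t in release_times if t > now]
        for _ in range(clients):
            if len(release_times) < limit:
                release_times.append(now + service_time)
            else:
                rejected += 1
        peak = max(peak, len(release_times))
        if service_time <= client_timeout:
            break  # replies arrive in time, so nobody retries
    return peak, rejected


if __name__ == "__main__":
    for n in (10, 150):
        peak, rejected = simulate_slots(n, 120, 30, 500, 600)
        print(f"{n} clients: peak {peak} busy slots, {rejected} rejected")
```

In this model each client ends up holding four slots at once (120 s of processing divided by a 30 s timeout), so 10 clients settle at 40 busy slots, whereas 150 clients reach the limit of 500 and further requests are turned away.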

There was a discussion on whether we can rule out this possibility. Sara monitors the number of SRM actions in Ganglia; however, in 1.9.5, unless "async. ls" is enabled, srmLs operations are not reported by the "ls" command in the SRM's admin interface.

Onno has a thread-dump from one of the earlier incidents that Gerd looked at.

All the threads are blocked on authentication; in particular, on collecting VOMS attributes.

There are lots of threads blocking on a gPlazma x509 cert util object but, in the thread-dump, no thread appears to hold this object's monitor. This is likely a thread-dumping artefact, indicating high activity.

The concurrent requests, when the thread-dump was made, seem to be of different types: srmPing, srmGetSpaceToken, srmPrepareToGet, etc. This goes against the theory that a multitude of srmLs commands was causing the problem.
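This kind of check can be done mechanically on a thread-dump. The sketch below assumes the dump is in the usual HotSpot text format, where stack frames are annotated with lines such as `- waiting to lock <0x...> (a some.Class)` and `- locked <0x...> (a some.Class)`. The function name `summarise_monitors` is hypothetical, and the script is not part of dCache.

```python
import re
from collections import defaultdict

# A thread section starts with the quoted thread name.
THREAD_RE = re.compile(r'^"(?P<name>[^"]+)"')
# Annotations that HotSpot adds below stack frames.
WAIT_RE = re.compile(r"- waiting to lock <(?P<addr>0x[0-9a-f]+)> \(a (?P<cls>[^)]+)\)")
LOCKED_RE = re.compile(r"- locked <(?P<addr>0x[0-9a-f]+)> \(a (?P<cls>[^)]+)\)")


def summarise_monitors(dump_text):
    """Map each contended monitor address to its class, owner and waiters.

    `owner` is None when no thread in the dump reports holding the monitor.
    """
    waiters = defaultdict(list)
    owners = {}
    classes = {}
    current = None
    for line in dump_text.splitlines():
        m = THREAD_RE.match(line)
        if m:
            current = m.group("name")
            continue
        m = WAIT_RE.search(line)
        if m:
            waiters[m.group("addr")].append(current)
            classes[m.group("addr")] = m.group("cls")
            continue
        m = LOCKED_RE.search(line)
        if m:
            owners[m.group("addr")] = current
            classes[m.group("addr")] = m.group("cls")
    return {
        addr: {"class": classes[addr], "owner": owners.get(addr), "waiters": names}
        for addr, names in waiters.items()
    }
```

A monitor with many waiters and an `owner` of None matches the situation described above; comparing the result across several dumps would show whether the same threads stay blocked or whether the waiters change from dump to dump.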

If the problem repeats itself, try to collect multiple thread-dumps: say, 10 thread-dumps in total, one every 10 seconds. A single heap-dump would also be useful.
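Collecting the dumps by hand during an incident is error-prone, so a small helper might be useful. The following is a minimal sketch, assuming the JDK's `jstack` tool is on the PATH and that the script runs as the user owning the SRM's Java process; the function name, the file naming scheme and the `command` parameter are hypothetical.

```python
import subprocess
import time
from pathlib import Path


def collect_thread_dumps(pid, out_dir, count=10, interval=10.0, command=("jstack",)):
    """Write `count` thread-dumps of process `pid`, `interval` seconds apart.

    `command` is the dump tool to run with the pid appended; it defaults to
    the JDK's jstack. Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        result = subprocess.run(
            [*command, str(pid)], capture_output=True, text=True, check=True
        )
        path = out / f"threads-{pid}-{i:02d}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i < count - 1:
            time.sleep(interval)  # no need to wait after the last dump
    return paths
```

For the heap-dump, one option is the JDK's `jmap -dump:format=b,file=<path> <pid>`; note that taking a heap-dump pauses the JVM while the file is written.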

In the meantime, have a look at the code to see what could be taking this long.

Onno noted that they switched off the attribute caching in gPlazma. This was to support users using dCache with different VOMS attributes. This may be making the problem worse.

Upgrading to the next Golden release

Onno also reported that they are currently testing the next golden release. Once they're happy with it, and the migration procedure, they will upgrade.

This will be after the summer.

NDGF

Gerd had nothing to report.

Last week NDGF upgraded their head-nodes to 1.9.13. Tomorrow they will roll out a minor upgrade.

PIC

Gerard reported that everything is OK.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.