wiki:developers-meeting-20110927

dCache Tier I meeting September 27, 2011

[part of a series of meetings]

Present

dCache.org(Tigran, Tanja, Paul), IN2P3(Nicolas), Triumf(Simon), GridKa(Doris)

Agenda

(see box on the other side)

Site reports

Triumf

Simon reported that there is nothing to report from Triumf's production service. He deployed dCache v1.9.12-10 on the pre-production machines yesterday; so far, it is looking good.

Simon had a question: how do I configure a domain to have limited memory usage?

Paul and Tigran explained that, within the layout file, properties placed immediately under a domain definition apply only to that domain. The Java heap size can be limited there.

Here's an example:

[myDomain]           # create a domain called "myDomain"
 dcache.java.memory.heap = 128m  # limit this domain's Java heap to 128 MiB
 dcache.user = fred  # run this domain as Unix/Linux user "fred"
[myDomain/foo]       # host service 'foo' in domain "myDomain"
[myDomain/bar]       # host service 'bar' in domain "myDomain"

[myOtherDomain]      # create a domain called "myOtherDomain", all properties inherited from dcache.conf or defaults
[myOtherDomain/baz]  # host service 'baz' in domain "myOtherDomain"

Paul to send Simon the slides from a talk explaining this with worked examples.

IN2P3

Nicolas reported that currently everything is OK.

IN2P3 plans to upgrade its "EGEE" instance of dCache (i.e., the instance supporting non-LHC VOs like biomed, ilc, etc.) to version 1.9.12 next week. The upgrade of the LCG dCache instance will be done in January or February 2012.

One ticket was on the dcap connection.

The second ticket was on the SRM and TCP backlog. The problem was identified as a kernel-level limit on the TCP accept backlog. The value was set far too low, so a flood of connection attempts exceeded it and clients received RST. The backlog has now been increased and IN2P3 is watching to see if the problem returns.

Yvan also reports that ...

concerning the TCP backlog, we indeed increased the TCP parameters mentioned by Dmitry. However
we are not sure that these new values have been taken into account. We probably need to restart
the network service and this will therefore be done next week during the scheduled site downtime.
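Whether new kernel values are in effect can be checked without restarting anything, because the files under /proc/sys show what the running kernel is using. The following is a minimal sketch in Python. The minutes do not name the parameters that were changed, so the two names below are an assumption (they are the ones commonly associated with the accept backlog on Linux), and the helper function names are hypothetical.

```python
from pathlib import Path

# Assumed parameter names: the minutes do not say which ones were changed.
ASSUMED_PARAMS = ("net.core.somaxconn", "net.ipv4.tcp_max_syn_backlog")


def read_live_sysctl(name, proc_sys=Path("/proc/sys")):
    """Return the value the running kernel uses for `name`, or None if absent."""
    # A sysctl name such as "net.core.somaxconn" maps to /proc/sys/net/core/somaxconn.
    path = Path(proc_sys).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def effective_backlog(requested, somaxconn):
    """Linux caps the backlog passed to listen() at net.core.somaxconn."""
    return min(requested, somaxconn)


if __name__ == "__main__":
    for name in ASSUMED_PARAMS:
        print(f"{name} = {read_live_sysctl(name)}")
```

If the printed values match the new settings, the kernel has already taken them into account. A socket that is already listening keeps the limit computed when listen() was called, so it may be the listening service, rather than the network service, that needs a restart.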

GridKa

Doris had a question about a ticket concerning a staging problem on the ATLAS instance. They had to restart a pool. While the pool was restarting, the staging script fetched a number of files. On startup, the pool discovered these files and recovered by deleting them. This seems to waste the effort spent fetching the files from tape.

Tigran explained that, since the files are still available on tape, the safest option is to assume that the copies found on the pool are broken and to delete them. A fresh stage will fetch the files from tape again.

Paul will raise this at the developers meeting tomorrow.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.