
dCache Tier I meeting November 27, 2014

[part of a series of meetings]

Present

dCache.org(Tigran, Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(), PIC(), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

KIT

Regular production is running fine.

NFS

CMS benchmark running.

Stale movers on pool, door.

Fails only for certain files.

Reading files normally works fine.

It is the same set of files that have the problem.

This is dCache 2.6.

Have increased the number of movers to 1,000.
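
For reference, a hedged sketch of how such a mover limit is typically raised from the admin interface; the pool name (pool_A) is a placeholder and the exact shell syntax varies between dCache versions:

    # admin shell, connected to the pool cell (names are examples)
    cd pool_A
    mover set max active 1000
    save    # persist the new limit in the pool setup file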

Running some CMS tests.

There are a lot of log lines starting with "decode" or "encode".

Only NFS movers.

During the last two weeks of December.

Questions

Does the ATLAS federation (FAX) work with 2.10?

Tried to install something at DESY, but it didn't work.

Will contact Shaun McKee at the University of Michigan (with you in CC) to kickstart this.

---

Deletion on tape is enabled by default.

How is the pool selected? (We think) it chooses one that is connected to the HSM. If the pool has "set hsm <label>" then the URI will use the osm:// scheme.
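
A hedged illustration of this (values invented; the pool-side command is shown in the "hsm set" form and the exact syntax depends on the dCache version):

    # on the pool: declare an HSM instance with the label "osm"
    hsm set osm -command=/opt/hsm/hsmcp
    # a flushed file then gets a storage URI whose scheme matches that label, e.g.
    osm://osm/?store=cms&group=tape&bfid=0001234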

---

info-provider.xml hasn't changed

---

Property files; httpd.conf

---

A setup for IPv6 tests; problems appear when trying to use FTS-3 with this instance.

FTS not able to initiate transfers to this instance.

srmcp works fine.

The domains' .access files suggest the problem is network related, as dCache didn't see any activity.

---

2.10 dcap permission denied errors.

authz-readonly to "true".

---

dcap readonly and anonymous-
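
A hedged guess at the settings under discussion; the property names and values below are assumptions (not confirmed in the meeting) and would live in dcache.conf or the layout file:

    # assumed 2.10-style dcap properties (illustrative)
    dcap.authz.readonly = true
    dcap.authz.anonymous-operations = READONLY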

---

What are volatile pools?

Fermilab

OOM

Out of memory -- not that much happening now.

Tried to upload, but this failed.

NFS problems

Monday: problems with NFS mount points; the system was very slow.

Previously mount points were stuck.

Have stuck jobs.

The ls / df commands were working, but took ~ 40 seconds.

Saw "rm" processes on the node -- for a file that doesn't exist.

Fix to check if the

Comes from a production plugin.

Incident on Monday where prod was stuck: saw 110 nodes hanging for 1--2 days.

NFS v4.0 didn't fix it; NFS v3 was quick.

Did that and had improvement.
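
A minimal sketch of that change, assuming a standard Linux remount of the dCache NFS export with version 3 instead of 4.0 (host and mount point are placeholders):

    # remount the dCache NFS export with NFSv3 (illustrative names)
    umount /pnfs
    mount -t nfs -o vers=3 nfs-door.example.org:/pnfs /pnfs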

It turned out that the SRM server hosting the NFS was swamped with log messages and had a full disk.

Started looking into

Gerard noticed high load on NFS v4 on the Chimera.

Monitoring shows nodes taking longer than 2 minutes to respond to the 'ls' command.

There were ~50 nodes with the alarm; nodes came and went from this set.

Found IllegalArgumentException: it comes from mixing NFS v4 and NFS v3.

---

Support tickets for discussion

[Items are added here automagically]

DTNM

Proposed: same time, next week.