wiki:developers-meeting-20140805
Last modified on 08/05/14 15:19:43

dCache Tier I meeting August 5, 2014

[part of a series of meetings]

Present

dCache.org(Paul, Gerd), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Gerd), PIC(), KIT(), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

NDGF

NDGF is running mostly 2.8 on pool nodes with a single pool node running 2.9; head nodes are mostly running 2.9 except for a single FTP door running 2.10.

Last weekend, NDGF suffered several dCacheDomain auto-restarts due to out-of-memory problems.

The memory dump showed that the problem was due to pool-manager processing many stage requests. Further investigation suggests the problem is triggered when a stage request is retried. On each retry, pool-manager sends another stage request to the pool and registers another call-back within pool-manager. These additional registrations could have triggered the out-of-memory problem. A patch has been developed that should fix this problem.
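The effect of registering a call-back on every retry can be illustrated with a small, self-contained sketch. This is not dCache code: the class names, the `send_stage_request` method and the `simulate` helper are hypothetical and only model the behaviour described above. The "dedup" variant shows one possible way to bound the growth; the actual patch may work differently.

```python
from collections import defaultdict


class NaiveStageTracker:
    """Hypothetical model: every (re)send registers one more call-back."""

    def __init__(self):
        self.callbacks = defaultdict(list)  # file id -> registered call-backs

    def send_stage_request(self, file_id, callback):
        # A retry appends a new call-back without removing the old one.
        self.callbacks[file_id].append(callback)

    def pending(self):
        return sum(len(cbs) for cbs in self.callbacks.values())


class DedupStageTracker:
    """Hypothetical alternative: at most one call-back per file."""

    def __init__(self):
        self.callbacks = {}  # file id -> single registered call-back

    def send_stage_request(self, file_id, callback):
        # A retry replaces the previous registration for the same file.
        self.callbacks[file_id] = callback

    def pending(self):
        return len(self.callbacks)


def simulate(tracker, files=1000, retries=5):
    """Send one initial request plus `retries` retries for each file."""
    for file_id in range(files):
        for _ in range(1 + retries):
            tracker.send_stage_request(file_id, lambda reply: None)
    return tracker.pending()


naive_pending = simulate(NaiveStageTracker())
dedup_pending = simulate(DedupStageTracker())
print(naive_pending, dedup_pending)
```

In this model the number of registered call-backs grows with files × (1 + retries) in the naive tracker, while the deduplicating tracker stays at one entry per file regardless of how often a request is retried.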

Gerd also noticed significant numbers of certificate-related byte-arrays. It is unclear to what extent this contributed to the out-of-memory problem.

NDGF also suffered from high CPU usage in the dCacheDomain, which caused Nagios tests to fail. The problem seems to originate from the NFS door. The node hosting the NFS door is dual-stacked, but dCache was mounted using the IPv4 address, so it is unclear to what extent the dual-stack setup is related.

There were several problems with tape pools becoming unresponsive. In one case this was due to a broken tape system, but in the other cases the tape system was not at fault. In some cases increasing the admin shell timeout allowed commands to succeed, but at other times this did not help.

The problem seems to be due to the tunnel being unable to send replies fast enough. This could be due to underlying networking issues; however, a patch has been developed that should drastically decrease the number of messages a pool needs to send to pool-manager.

Another potential issue with retrying staging requests is that additional memory is required on the pool for each retry. When staging a large number of files, this could lead to an out-of-memory problem on the pool.

NDGF plans to upgrade to 2.10 within the next two weeks.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.