Changes between Initial Version and Version 1 of developers-meeting-20140805


Timestamp:
08/05/14 15:19:43
Author:
paul

  • developers-meeting-20140805

    v1 v1  
     1[[TOC(depth=0)]] 
     2= dCache Tier I meeting August 5, 2014 = 
     3[part of a [wiki:developers-meetings series of meetings]] 
     4 
     5== Present == 
     6 
     7dCache.org(Paul, Gerd), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Gerd), PIC(), KIT(), Fermi(), CERN() 
     8 
     9= Agenda = 
     10 
     11(see box on the other side) 
     12 
     13= Site reports = 
     14 
     15== NDGF == 
     16 
     17NDGF is running mostly 2.8 on pool nodes with a single pool node running 2.9; head nodes are mostly running 2.9 except for a single FTP door running 2.10. 
     18 
     19Last weekend, NDGF suffered several dCacheDomain auto-restarts due to out-of-memory problems. 
     20 
     21A memory dump showed that the problem was caused by pool-manager processing many stage requests.  Further investigation suggests that the problem is triggered when a stage request is retried: on each retry, pool-manager sends another stage request to the pool and registers another call-back within pool-manager.  These additional registrations could have triggered the out-of-memory problem.  A patch has been developed that should fix this problem. 
     22 
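The retry behaviour described above can be illustrated with a minimal sketch. This is not dCache code: the class and function names (`NaivePoolManager`, `DedupPoolManager`, `simulate`) are hypothetical, and the only assumption is that each retry registers one more call-back that stays referenced until the stage request completes.

```python
from collections import defaultdict


class NaivePoolManager:
    """Hypothetical model: every (re)send registers another call-back."""

    def __init__(self):
        # request id -> call-backs still waiting for the stage to finish
        self.callbacks = defaultdict(list)

    def stage(self, request_id, callback):
        self.callbacks[request_id].append(callback)

    def retry(self, request_id, callback):
        # A retry re-sends the stage request and registers a new call-back;
        # the earlier ones remain referenced until the request completes.
        self.stage(request_id, callback)

    def pending(self):
        return sum(len(cbs) for cbs in self.callbacks.values())


class DedupPoolManager(NaivePoolManager):
    """Hypothetical alternative: keep at most one call-back per request."""

    def stage(self, request_id, callback):
        self.callbacks[request_id] = [callback]


def simulate(manager, requests, retries):
    """Issue `requests` stage requests, retry each `retries` times,
    and return the number of call-backs held by the manager."""
    for request_id in range(requests):
        manager.stage(request_id, lambda: None)
        for _ in range(retries):
            manager.retry(request_id, lambda: None)
    return manager.pending()
```

In the naive model the number of retained call-backs grows with requests × (retries + 1), whereas the second model stays at one per request. The second class is only one conceivable way to bound the growth; the minutes do not describe how the actual patch works.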
     23Gerd also noticed significant numbers of certificate-related byte-arrays.  It is unclear to what extent this contributed to the out-of-memory problem. 
     24 
     25NDGF also suffered from high CPU usage in the dCacheDomain, which caused Nagios tests to fail.  The problem seems to originate from the NFS door.  The node hosting the NFS door is dual-stacked, but dCache was mounted using the IPv4 address, so it is unclear whether the dual-stack configuration is related. 
     26 
     27There were several problems with tape pools becoming unresponsive.  In one case, this was due to a broken tape system, but in the other cases the tape system was not at fault.  In some cases increasing the admin shell timeout allowed commands to succeed, but at other times this did not help. 
     28 
     29The problem seems to be due to the tunnel being unable to send replies fast enough.  This could be due to underlying networking issues; however, a patch has been developed that should drastically decrease the number of messages a pool needs to send to pool-manager. 
     30 
     31Another potential issue with retrying staging requests is that additional memory is required on the pool for each retry.  When staging a large number of files, this could lead to an out-of-memory problem on the pool. 
     32 
     33NDGF plans to upgrade to 2.10 within the next two weeks. 
     34 
     35= Support tickets for discussion = 
     36 
     37[Items are added here automagically] 
     38 
     39= DTNM = 
     40 
     41Same time, next week.