Changes between Version 1 and Version 2 of developers-meeting-20140422

04/22/14 14:58:25



  • developers-meeting-20140422

    v1 v2  
    1515== NDGF == 
    17 Things are running fine; no big problem -- over Easter holiday, heavy traffic but running fine. 
     17Ulf reported that things are running fine with NDGF currently.  There are no big problems; over the Easter holiday there was heavy traffic, but dCache is running fine. 
    1919== PIC == 
    21 Everything is OK at PIC.  Waiting for upgrade to 2.6 on enstore compatibility to be fixed.  Other than that, everything is running OK. 
     21Marc reported that everything is OK at PIC. 
     23Their upgrade plans are blocked until there is a dCache release newer than v2.2 that is compatible with Enstore.  Other than that, everything is running OK. 
    24 Plan to upgrade to dCache v2.10.  Plan to upgrade to ASAP. 
     25PIC plans to upgrade to dCache v2.10 as soon as it is available and passes their testing.  Paul described the release schedule for v2.10, with an anticipated release of 1st July 2014. 
    26 Need to limit dCache v2.2 not working correctly,  
     27There are two (known) problems with v2.2 that prevent PIC from using NFS: 
     28  1. if a client attempts to append to a file then that file is truncated, 
     29  1. an export that is marked 'read-only' on the server may still be mounted read-write on the client. 
    28 Those issues are already fixed in newer versions: 
     31Both issues are fixed with dCache v2.6, but the problems with Enstore compatibility prevent PIC from upgrading. 
    30   1. you append a file; file is truncated; fix in 2.6. 
     33Paul described how all the known problems with Enstore are fixed in dCache v2.7, v2.8 and the forthcoming v2.9 release.  However, previous experience has shown that real testing with Enstore can reveal additional problems; therefore, we need further testing before we can say definitively that all problems are fixed.  Once this is done, we will back-port the NFS changes to v2.6, allowing sites running Enstore to upgrade to v2.6. 
    32   2. mount exported as a read-only file-system as read-write. 
    34 Test instance to check Enstore. 
     35Marc offered to test dCache releases using their test instance to check Enstore compatibility.  This offer was gratefully received. 
    3637== KIT == 
    38 Is looking for a bringer future.  Connection reset for pier -- Alice had a bug in their -- ignored their local SE and read the data from CERN.  Filled the firewall ..  Discovered before Easter, now fixed.  Huge improvement for KIT. 
     39Xavier reported that the future is looking brighter at KIT.  The cause of the "connection reset by peer" problem has been identified and fixed.  The problem was due to a bug in the ALICE software stack that resulted in their software ignoring their local SE and reading all data from CERN.  This filled the firewall and triggered the problem.  The problem was discovered before Easter, and is now fixed.  The result was a huge improvement for KIT. 
    40 Statistics files for one instance --  collected the data in debug-mode, waited for first pool to time-out and .. heap-dump  
     41Xavier has further investigated the problem with statistics gathering for their ATLAS instance.  He has collected the data with debug output enabled, waited for one pool to time-out and captured a heap-dump. 
    42 BUG: the error message is misleading but describes a real problem. 
     43In capturing the heap-dump, he noticed a problem where, when the 'dcache' script failed to find the command needed for the heap-dump, the advice contained the wrong Java version number.  Gerd mentioned he'd also seen this problem.  Xavier will open a ticket about this [#8303]. 
    44 20 seconds per-pool deadline -- many pools barely manage this.  Pools are running the pool-monitoring plugin.  With the Berkley-DB database, file-count is certainly one of the information.  Maybe dCache should cache this information. 
     45There was some speculation as to where the ATLAS problem may originate.  The output from 'rep ls' includes information (such as file-count) that is not contained within the Berkeley DB.  Displaying this information, therefore, triggers disk-IO activity, which could be the cause of the slow-down.  Perhaps dCache could cache this information. 
     47Xavier mentioned that the problems didn't seem to correlate with IO-load on the pools: pools showing negligible CPU time spent in WAIT state still demonstrated the problem.  Although not definitive, this suggests the problem lies elsewhere. 
    4649= Support tickets for discussion = 
    5053= DTNM = 
     55The next Thursday meeting is: 
     57    Thursday 24th April at 16:00 CEST, 09:00 CDT, 10:00 EDT, 07:00 PDT 
     59The next Tuesday meeting is: 
     61    Tuesday 29th April at 14:00 CEST, 16:00 MSK. 
    5264Thursday this week or same time, next week.