Changes between Version 3 and Version 4 of developers-meeting-20141216


Timestamp:
12/16/14 15:53:42 (5 years ago)
Author:
paul
Comment:

--

  • developers-meeting-20141216

    v3 v4  
    1515== PIC == 
    1616 
    17 Upgraded to 2.10 -- yesterday.  Some problems to begin with; SRM in particular -- slow response. 
     17Marc reported that PIC upgraded to v2.10.13 yesterday.  There were some problems to begin with, in particular a slow response from the SRM.  However, everything is now working fine. 
    1818 
    19 Now it is working. 
     19They had some problems with the dcap protocol having anonymous access disabled by default, but it turned out that this affected their testing and not their users. 
    2020 
    21 dcap protocol -- disabled -- ANONYMOUS user. 
     21They see some errors in the SRM server log file, but it is unclear whether or not these errors are expected. 
    2222 
    23 Errors in the log file.  SRM server -- message  
     23Some LHCb tests are failing.  This is under investigation, but Marc reported that the tests passed when he ran the test script himself. 
    2424 
    25 Failing LHCb tests. 
     25The tests use the gfal packages and, in particular, the lcg-cp command is failing. 
    2626 
    27 Marc checked test script manually and it works for him. 
     27A SAM test is failing, which is still under investigation. 
    2828 
    29 Latest gfal packages. 
     29CMS and ATLAS are currently generating only light load.  Everything looks OK just now, but this could change when CMS or ATLAS start increasing their load. 
    3030 
    31 The command that's failing is 'lcg-cp'. 
    32  
    33 Running several jobs in the farm.  A SAM test is failing. 
    34  
    35 Still under investigation. 
    36  
    37 CMS and ATLAS have only light load currently.  Everything looks OK just now. 
    38  
    39 Installed 2.10.13. 
    40  
    41 If you do a `find` over a directory, sometimes it hangs. 
     31Marc also reported that running `find` over a directory, or `ls` on a directory with a large number of files, sometimes hangs.  This is under investigation and Marc will open a ticket once the situation is better understood. 
    4232 
    4333== NDGF == 
    4434 
    45 Prod system is working fine. 
     35Ulf reported that the NDGF production system is working fine. 
    4636 
    47 Development system is not so good. 
     37The development system is not working so well: ATLAS has noticed that they can't delete files using WebDAV.  They claim they're getting a 401 error message; however, no corresponding error is logged by dCache. 
    4838 
    49 ATLAS has noticed that they can't delete files using WebDAV.  They claim they're getting a 401 error message.  Nothing is being logged in dCache log files. 
     39Ulf is confident the problem is not with SSLv3 as such problems are now logged. 
    5040 
    51 It's not an SSLv3 problem: this is logged. 
    52  
    53 Logging for the webdav is "quite bad". 
    54  
    55 Continue debugging tomorrow.  Not too bad as it's the test instance. 
     41Ulf also commented that the logging for WebDAV is "quite bad".  He'll continue debugging tomorrow; however, this isn't too bad for NDGF as the problems are reported only against their test instance. 
    5642 
    5743== KIT == 
    5844 
     45Xavier focused mostly on the KIT WLCG instances. 
     46 
    5947=== ATLAS === 
    6048 
    61 ATLAS problems: got a GGUS ticket.  The example shows ATLAS attempting to delete a file that does not exist.  The dCache responds correctly to this situation. 
     49ATLAS has also reported a problem with KIT's ATLAS instance, similar to the problem reported against the NDGF test instance.  In this case, ATLAS has opened a GGUS ticket. 
    6250 
    63 There is no mention of this file in today's billing log file: Nov--Dec found no mention of this file. 
     51The ticket currently contains a single example demonstrating the problem.  Xavier has checked and the file in this example does not exist at KIT.  Given this, dCache appears to be responding correctly. 
    6452 
    65 ATLAS is using exactly one door.  There are 401 response  
     53Xavier checked and there is no mention of this file in today's billing log file.  During the meeting, he also checked the Nov--Dec archive and found no mention of the file there. 
    6654 
    67 Regular monitoring tests for uploading, downloading and deleting files through WebDAV continue to pass. 
     55Some specifics about the problem: ATLAS is using exactly one door and claims to receive a 401 response. 
    6856 
    69 Slow/failing upgrade for the inconsistency since before 2008.  These need to be handled manually. 
     57This is in contrast with KIT internal monitoring, which regularly uploads, downloads and deletes files through WebDAV.  These tests continue to pass. 
    7058 
    71 Found first bug: dcap and ftp doors leak memory when the "-1" limit is specified. 
     59KIT's upgrade experience included a slow and failing database upgrade, due to inconsistencies in the database dating from before 2008.  These were fixed manually. 
    7260 
    73 xrootd plugin -- "documentation is updated after you need it".  WLCG repo now has property files.  Now know which properties to set -- for ATLAS. 
     61KIT has also found a bug: if the dcap or ftp doors are configured with an unlimited number of connections ("-1"), the door leaks memory until it runs out.  A work-around is to specify 
     62some large limit.  A fix has been proposed and merged, and will be part of the next set of releases. 
     63 
     64The worst part of the upgrade was the xrootd plugin.  The "documentation is updated after you need it", but at least the WLCG repo now contains packages with the correct property files, and it's clear which properties to set for ATLAS. 
    7465 
    7566=== CMS === 
    7667 
    77 For CMS there were no problems upgrading.  CMS does have xrootd plugin -- still no regular RPM.  The old plugin is working for 2.10. 
     68As CMS does not use space reservations, there were no problems upgrading the CMS instance.  CMS does have an xrootd plugin, but there is still no regular RPM for it.  The old plugin is working for 2.10. 
    7869 
    7970=== LHCb === 
    8071 
    81 For LHCb, updated today.  This was rather painless.  Already finished at lunch-time.  Everything seems to be working now.  Notified that SSLv3 is disabled with 2.10. 
     72The LHCb instance was updated today.  After the experience gained from upgrading the ATLAS instance, the LHCb upgrade was rather painless and was already finished by lunch-time.  Everything seems to be working now.  Xavier has notified CMS that SSLv3 is disabled with 2.10.  CMS responded that this was not a problem for them. 
    8273 
    83 Another dCache instance at KIT still needs to be upgraded to 2.11.  This will happen in the new year. 
     74Another (small, non-WLCG) dCache instance at KIT still needs to be upgraded to 2.11.  This will happen in the new year. 
    8475 
    8576 
    8677=== Tickets === 
    8778 
    88 8547 -- Tigran is working to fix this. 
     798547 -- NFS not notifying when file is deleted through NFS. 
     80 
     81    Tigran is working to fix this. 
    8982 
    90838548 -- Split-root archive: split files into several pieces.  If you have only one of these then you get an error.  Ticket may be closed. 
     
    92858561 -- xrootd  
    9386 
    94 The problem is that when redirection happen.  Door is missing that pool sends the transfer-done message. 
     87The problem is now understood.  It happens when the client is redirected to the door and the pool's transfer-finished message is somehow lost.  Although the problem is understood, it isn't clear how to fix this issue. 
    9588 
    96898284 -- statistics 
     
    10497= DTNM = 
    10598 
    106 Same time, next week. 
     99The next Tier-1 meeting is Thursday 16 Dec.  After this, the next meeting is Tuesday 6th January.