
Developers Meeting Aug 01 2007

Present

Dmitry, Gerd, Martin, Patrick, Timur, Andrew, Ted, Vladimir


Synchronizing on CDF Problems (in preparation for Monday's meeting)

  • Restore Handling
    • If a pool is misbehaving, it affects restores for the whole pool group
    • If a pool goes down (in some cases even when the node becomes unpingable), the pending restore requests are not rescheduled on other pools, even after 30 hours, as we recently observed in the case of CDF
    • If the PoolManager configuration is reloaded, all pending restore requests are ignored and new requests are scheduled, leading to files being restored twice
    • Restore requests sometimes get stuck in the waiting state even though no other requests for the same files are in progress; the only way to make them restart is to manually request a retry or to reload the PoolManager configuration
  • Pool instabilities
    • Memory on the pool nodes keeps growing, even for pools that do not do anything (the pools in the TestPools group at CDF)
  • Doors
    • Sometimes access is denied because the door claims that the maximum login count has been reached, even though the real number of doors and open files is far below that maximum; the only workarounds are to raise the max login into the thousands or to keep restarting the doors (a sketch of the suspected failure mode follows below).
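
A login counter that drifts when sessions end abnormally would produce exactly this symptom. Purely as an illustration (none of these class or method names come from the actual dCache door code), a minimal sketch of a semaphore-based login gate and the way a missed release leaks slots:

    import java.util.concurrent.Semaphore;

    // Hypothetical sketch of a door login limit; not the dCache implementation.
    public class LoginGate {
        private final Semaphore slots;

        public LoginGate(int maxLogins) {
            slots = new Semaphore(maxLogins);
        }

        /** Called when a client connects; denies access once all slots are taken. */
        public boolean tryLogin() {
            return slots.tryAcquire();
        }

        /**
         * Called when a session ends. If an abnormal disconnect skips this
         * call, the slot leaks permanently: the gate keeps reporting "max
         * login reached" although the real number of open sessions is much
         * lower, and only raising the limit or restarting the door (which
         * recreates the counter) helps.
         */
        public void logout() {
            slots.release();
        }
    }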

Issues of the deployed 1.8 sites

  • SOLVED : FTS fails to transfer LHCb data from CERN to GridKa and IN2P3. Evidently the correct space tokens are not present.

Workshop preparation

Tutorials and discussions

  • SRM design talk by Timur
  • What's new in the dCache core (Pool/Repository/PoolManager)
  • PoolManager in general and Cost Functions.
  • Resilient Manager session.
    • Report on changes in the Pool/Repository/File states
    • Report by Alex on the design of the Resilient Manager and implementation problems
  • Jay Packard and the TelePath project
  • REMARK : Gerd suggests going deep into the code and not only discussing designs, so please be prepared.

Missing code for the final dCache 1.8

  • The largest part is certainly making the resilient code compliant with the new pool/repository file state schema (cached, cached/pinned, precious)
  • Is 1.8 compliant with the NDGF requirements for multi-HSM support?
  • We need to make sure that 'enstore' can work with the multi-HSM code. As a first step it is OK to support only a single HSM when enstore is involved, because we don't know how to multiply the level-4 (enstore-specific) content.
  • Is the FTP II code (door and mover) fully ported to 1.8? The door can be set to FTP II by an option (or does the door already detect the protocol version itself?). We should run the FTP II mover; it should handle protocol version I as well.
  • Does srmcp support FTP II, to allow dCache-to-dCache SRM copies?
  • Does the SRM (and later the Resilient Manager) send the pin owner?
  • P2P checksum : currently we detect whether the source sends wrong data to the destination, in which case the PoolManager retries.
  • StorageInfo file on pool disk
    • StorageInfo will be (customized) XML encoded (serialized); a sketch follows after this list.
    • Only GenericStorageInfo will be supported. The system is backward compatible: OSM/Enstore StorageInfo will still be understood but no longer generated.
    • Migration schema :
      • A restart of 1.8 will read the old StorageInfo format and will write the XML-serialized form.
      • An offline tool will be available to convert old to new. This can be done in a parallel directory while the old system is still running; on restart, only the directories have to be moved/renamed.
      • Talk to Sun to get more info on how to speed up startup on ZFS. (Patrick)
  • BUG in SpaceManager reported by Gerd.
  • Authentication information (and its setup) is moved from the PoolManager to the SpaceManager
  • Important improvements from the 1.7-NDGF branch to be merged into 1.8
    • Improvements in PoolManager memory consumption
    • Pools are set to read-only while the system is starting up. (!!! What happens with the Resilient Manager query?)
    • New Cleaner which is a) multi-HSM aware and b) can talk to Chimera
  • (Before the end of the year) 'Change Space For File' is to be implemented.
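
To make the XML-encoded StorageInfo concrete: a minimal sketch using the stock java.beans.XMLEncoder. The real dCache encoding is described above as 'customized', so the actual format will differ, and the bean fields here are illustrative assumptions rather than the real GenericStorageInfo attributes:

    import java.beans.XMLDecoder;
    import java.beans.XMLEncoder;
    import java.io.*;

    // Illustrative stand-in for GenericStorageInfo; field names are assumptions.
    public class StorageInfoBean {
        private String storageClass = "";
        private String hsm = "";
        private boolean stored;

        public String getStorageClass() { return storageClass; }
        public void setStorageClass(String s) { storageClass = s; }
        public String getHsm() { return hsm; }
        public void setHsm(String h) { hsm = h; }
        public boolean isStored() { return stored; }
        public void setStored(boolean b) { stored = b; }

        /** Write the bean as XML, e.g. next to the pool's data file. */
        public static void save(StorageInfoBean info, File f) throws IOException {
            XMLEncoder enc = new XMLEncoder(
                    new BufferedOutputStream(new FileOutputStream(f)));
            enc.writeObject(info);
            enc.close();
        }

        /** Read it back. An offline converter could emit this format into a
            parallel directory while the old system is still running. */
        public static StorageInfoBean load(File f) throws IOException {
            XMLDecoder dec = new XMLDecoder(
                    new BufferedInputStream(new FileInputStream(f)));
            StorageInfoBean info = (StorageInfoBean) dec.readObject();
            dec.close();
            return info;
        }
    }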

Making dCache 1.8 compliant with Flavia's stress tests (Performance/Scalability/Ramp-up of parallel client threads)

  • FIX : maxLogin doesn't behave correctly, as described by Timur on 18 Jul 2007 (RT1912).
  • FIX : Doors are still dying (at GridKa). This needs to be fixed. Martin will talk to Doris to get more information on that.
  • In order for us to be able to work on SRM performance, we (DESY, NDGF and Art) need to understand more of the SRM details. I would suggest that Timur gives a tutorial on the SRM design as a kind of introduction on Monday.
  • Are we able to get sufficient information out of the Postgres log files to understand the SRM database behaviour and timing? Maybe Art can help us here.
  • There is still the issue that SRM and gPlazma cell usage don't work together.
  • I asked BNL (Iris) to provide a client machine to run the stress tests ourselves. Maybe they can give us access to this machine in advance of the meeting so that we can prepare the stress tests in order not to waste time at BNL. Iris will let us know if this is possible.
  • We (Martin and myself) have to talk to Flavia to get the stress test code. (She mentioned that this code is not yet checked in).
  • Code missing to fully integrate Chimera
    • The Cleaner needs to be adapted.
    • The 'ls' in GridFTP and SRM needs to use the Chimera 'ls' API to get rid of the dependency on the mounted filesystem.
    • Migration tools :
      • Regular files and directories are OK. A Java tool scans the source (pnfs) system and talks directly to the Chimera database.
      • Links can't be handled that way, because Java doesn't know about symbolic links (see the sketch below).
      • How to handle tags is not yet clear.
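
On the 'Java doesn't know links' point: before NIO.2 the standard library has no symlink API, so a migration scan has to detect links indirectly. A minimal sketch, assuming the classic canonical-vs-absolute path comparison for link detection; insertIntoChimera() is a placeholder for the real JDBC calls into the Chimera database, not part of any existing tool:

    import java.io.File;
    import java.io.IOException;

    // Sketch of a pnfs-to-Chimera scan; insertIntoChimera() is a placeholder.
    public class PnfsScan {

        /** Pre-NIO.2 heuristic: canonicalize the parent directory, then
            compare the canonical and absolute forms of the entry itself.
            They differ exactly when the entry is a symbolic link. */
        static boolean isSymlink(File f) throws IOException {
            File probe = (f.getParentFile() == null)
                    ? f
                    : new File(f.getParentFile().getCanonicalFile(), f.getName());
            return !probe.getCanonicalFile().equals(probe.getAbsoluteFile());
        }

        static void scan(File dir) throws IOException {
            File[] entries = dir.listFiles();
            if (entries == null) return;
            for (File f : entries) {
                if (isSymlink(f)) {
                    // Cannot be migrated through the file API; record the link
                    // for separate handling (e.g. resolving its target externally).
                    System.err.println("SKIP link: " + f);
                } else if (f.isDirectory()) {
                    insertIntoChimera(f);  // placeholder: create directory entry
                    scan(f);
                } else {
                    insertIntoChimera(f);  // placeholder: create file entry
                }
            }
        }

        static void insertIntoChimera(File f) {
            // JDBC inserts into the Chimera tables would go here.
        }
    }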

Help for the BNL crew

  • This is the BNL PPS dCache setup. The information below was provided by Iris.
    1. Admin node (dcache02.usatlas.bnl.gov)
     
    Memory: 7973 Megabytes
    Available disk: 2TB  
    
    It hosts dCache admin components like PnfsManager, PoolManager, the
    administrative interface, etc. In the test bed we did not separate PNFS
    from the dCache admin node, so this machine also hosts PNFS. For
    convenience, we also installed a dcap door on this node. 
    
    
    2. SRM door (dcsrmv2.usatlas.bnl.gov)
    
    Memory: 7973 Megabytes
    Available disk: 2TB  
    
    Services: SRM2.2, Utility and SRM database  
    
    
    3. GridFtp Door (dcdoor99.usatlas.bnl.gov)
    
    Memory: 4952 Megabytes
    Available Disk:  ~30G
    The machine is outside the BNL firewall and has two NICs  
    
    Services: GridFTP door with two interfaces. 
    
    
    4. Read/Write pools (dc002.usatlas.bnl.gov)
    
    Thumper with SunOS 5.10
    Memory: 16288 Megabytes
    Disks: 16 TB 
    
    Services: Four dCache pools (read/write).
    
