Last modified on 08/06/07 17:58:15

Meeting on CDF dCache 1.7 issues

First Results:

  • A slow PnfsManager seems to be at least one of the reasons for the Waiting Requests in the PoolManager. It is recommended to:
    • Reduce the number of external scripts hammering on pnfs.
    • Run the PnfsManager3 with companion enabled.
    • Use the EnstoreExtractor instead of the GenericExtractor.
  • We might have found the problem with the maxLogin. Timur found an exception in the secure protocol in the door log files. The code is fixed and will be deployed with the next release.
  • We need to find out why the pool misbehaves. If it happens again, please check:
    • psu ls -l POOLNAME in the PoolManager
    • cm ls -a in the PoolManager
    • ps -f POOLNAME in the pool domain system cell.
    • info in the pool cell itself
    • Hoping for more information from Timur on pool behaviour with 1.6 and JConsole connected.
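
The checks listed above can be collected into a small helper, so that the same commands are issued each time a pool misbehaves. This is only a sketch: it formats the commands from the list and does not talk to dCache. The function names and the pool name `examplePool_1` are placeholders chosen for illustration.

```python
def pool_diagnostics(pool_name):
    """Return (cell, command) pairs from the checklist for one pool."""
    return [
        ("PoolManager", f"psu ls -l {pool_name}"),
        ("PoolManager", "cm ls -a"),
        ("System cell of the pool domain", f"ps -f {pool_name}"),
        (pool_name, "info"),
    ]


def format_checklist(pool_name):
    """Render the checklist as plain text, one command per line."""
    lines = [f"Diagnostics for pool {pool_name}:"]
    for cell, command in pool_diagnostics(pool_name):
        lines.append(f"  [{cell}] {command}")
    return "\n".join(lines)


if __name__ == "__main__":
    # examplePool_1 is a hypothetical pool name.
    print(format_checklist("examplePool_1"))
```

The output is meant to be pasted into the admin session by hand, keeping the result of each command together with the time of the incident.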

Synchronizing on CDF Problems (as preparation for the Monday meeting)

  • Restore Handling
    • If a pool is misbehaving, it affects restores for the whole pool group
    • If a pool goes down (in some cases even the node becomes unpingable), the pending restore requests are not rescheduled on new pools, even after 30 hours, as we recently observed at CDF
    • If the PoolManager configuration is reloaded, all pending restore requests are ignored and new requests are scheduled, leading to files being restored twice
    • Restore requests sometimes enter the waiting state even though no other requests for the same files are in progress. The only way to restart them is to request a retry manually or to reload the PoolManager configuration
  • Pool instabilities
    • Memory on the pool nodes keeps growing, even for pools that do nothing (the pools in the TestPools group at CDF). Investigation:
      • Run one or more pools with 1.6 and check whether they show the same behaviour.
      • Run one or more pools with JConsole attached to visualize the increasing memory consumption.
  • Doors
    • Sometimes access is denied because the door claims that the max login limit is reached, although the real number of doors and open files is much lower than the maximum. The only workarounds are to increase the max login to thousands or to keep restarting the doors.
  • PnfsManager: it seems that under high load the performance of the PnfsManager or pnfs degrades significantly. This leads to requests accumulating in the PoolManager, waiting for pools to be assigned.
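
For the pool memory investigation listed under pool instabilities, a rough numeric check could complement the JConsole view. The sketch below is an assumption, not part of the agreed procedure. It reads the resident set size of a process from `/proc` (Linux only) and fits a straight line through timestamped samples to estimate growth. The function names and the 1024 kB per hour threshold are arbitrary illustrations. Note that resident memory of the whole process is not the same as the Java heap shown by JConsole.

```python
def read_rss_kb(pid):
    """Read the resident set size in kB from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise ValueError(f"no VmRSS entry for pid {pid}")


def growth_kb_per_hour(samples):
    """Least-squares slope of (seconds, rss_kb) samples, in kB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 3600.0


def is_growing(samples, threshold_kb_per_hour=1024.0):
    """True if the fitted growth exceeds the (arbitrary) threshold."""
    return growth_kb_per_hour(samples) > threshold_kb_per_hour
```

Sampling an idle pool from the TestPools group and a pool running 1.6 in the same way would give two slopes that can be compared directly.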

Last modified by Patrick Wed Mar 3 08:44:49 2021