Last modified on 07/23/10 17:57:23

Monitoring at DESY

This page contains the current (2010-07-22) list of monitoring activity at DESY for their dCache instances. It is likely to be incomplete.

The list groups monitoring activity by person.

Lusine

Lusine has three scripts:

  1. Check PnfsManager 'threads'
  2. Check poolmanager status
  3. File usage: a check written for ATLAS that identifies files that are never used, used once, or used more than once.
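The file-usage check could be sketched as follows. This is a minimal sketch: it assumes per-file read counts have already been extracted from some site-specific source (e.g. billing records), which is not part of the code.

```python
from collections import Counter

def classify_usage(read_counts):
    """Bucket files by how often they have been read.

    read_counts: dict mapping file path -> number of read accesses.
    The data source (e.g. dCache billing records) is site-specific
    and assumed to exist outside this sketch.
    """
    buckets = Counter()
    for path, reads in read_counts.items():
        if reads == 0:
            buckets["never used"] += 1
        elif reads == 1:
            buckets["used once"] += 1
        else:
            buckets["used more than once"] += 1
    return buckets
```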

Yves

Yves has written some tests for poolgroups and space tokens. Data for each poolgroup (taken from the info service) is written out to an RRD database and plots are generated. If a threshold value is exceeded, an email is sent.
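The threshold-and-alert logic could look roughly like this. The actual RRD fetch and the email delivery are site-specific and omitted; the function names and the "atlas-pools" example name are illustrative, not taken from Yves's scripts.

```python
def check_threshold(samples, threshold):
    """Return the (timestamp, value) samples that exceed a threshold.

    samples: sequence of (timestamp, value) pairs, e.g. as fetched
    from an RRD database; None values (unknown data points) are skipped.
    """
    return [(t, v) for t, v in samples if v is not None and v > threshold]

def alert_body(name, breaches):
    """Format a plain-text email body for the breached samples."""
    lines = ["Threshold exceeded for %s:" % name]
    lines += ["  %s: %s" % (t, v) for t, v in breaches]
    return "\n".join(lines)
```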

Andreas

Regular testing of overall functionality. A file is copied into dCache through each door (gsiftp, dcap, etc.) and with lcg-cp, then copied back out again. This is repeated every 30 minutes.
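The round-trip test could be structured like this. The actual transfer commands (lcg-cp, dccp, and so on, one per door) are site-specific, so they are injected as callables; only the write/copy/compare skeleton is shown.

```python
import filecmp
import os
import tempfile

def round_trip_test(copy_in, copy_out, payload=b"dCache round-trip test\n"):
    """Copy a test file into storage and back, then compare contents.

    copy_in(src, remote) and copy_out(remote, dst) are callables that
    perform the transfer -- in practice wrappers around lcg-cp or a
    per-door copy command. Returns True if the file survived intact.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.dat")
        back = os.path.join(tmp, "out.dat")
        with open(src, "wb") as f:
            f.write(payload)
        remote = "round-trip-test-file"   # placeholder remote name
        copy_in(src, remote)
        copy_out(remote, back)
        return filecmp.cmp(src, back, shallow=False)
```

In a cron job run every 30 minutes, one such call would be made per door, mailing or logging any False result.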

Dima

Dima keeps an external list of pools: a list of all pools (automatically updated every night) and a list of write-to-HSM pools (maintained manually).

This list of pools is used to check connectivity (each pool responds to ping) and liveness (it responds to the "info" command).
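The two-stage check could be sketched as below. How to ping a host and how to send "info" through the dCache admin interface are deployment details, so both probes are passed in as callables returning True on success.

```python
def check_pools(pools, ping, info):
    """Classify pools by connectivity and liveness.

    pools: iterable of pool host names.
    ping(host) -> bool: does the host respond to ping?
    info(host) -> bool: does the pool respond to the "info" command?
    Both probes are site-specific wrappers and injected here.
    """
    status = {}
    for host in pools:
        if not ping(host):
            status[host] = "unreachable"
        elif not info(host):
            status[host] = "not responding to info"
        else:
            status[host] = "ok"
    return status
```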

Check all files in the namespace. For each file, check that at least one of the pools registered in Chimera as holding it has the replica marked "precious". The recovery procedure is manual.
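The core of that check, once the replica locations have been dumped from Chimera, could look like this (the (pool, state) representation is an assumption about how the dump is parsed):

```python
def files_without_precious(replicas):
    """Find files that have no precious replica on any pool.

    replicas: dict mapping PNFS id -> list of (pool, state) pairs as
    registered in Chimera, where state is e.g. "precious" or "cached".
    Returns the ids that need manual recovery, sorted for stable output.
    """
    return sorted(
        pnfsid for pnfsid, copies in replicas.items()
        if not any(state == "precious" for _pool, state in copies)
    )
```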

Check all files on the pools: does Chimera know about each replica? The recovery procedure is manual.
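This orphan-replica check reduces to a set difference once the pool inventories and the namespace ids are available; a sketch, assuming both have already been collected:

```python
def orphan_replicas(pool_contents, chimera_ids):
    """Find replicas sitting on pools that Chimera does not know about.

    pool_contents: dict mapping pool name -> set of PNFS ids stored
    on that pool; chimera_ids: set of ids known to the namespace.
    Returns (pool, pnfsid) pairs for manual handling.
    """
    return sorted(
        (pool, pnfsid)
        for pool, ids in pool_contents.items()
        for pnfsid in ids - chimera_ids
    )
```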

For each pool, count the number of files in error state.

Check for any new pools that are HSM connected.

Maintain a list of directories where files have been flushed to tape. Check all files in these directories: do they have a storage-info? Check that t_accesslatency and t_retentionpolicy are defined appropriately for these files.
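A sketch of that audit, assuming the relevant Chimera fields have been dumped into one record per file. The field names ("storage_info", "access_latency", "retention_policy") and the expected values (NEARLINE/CUSTODIAL for tape-backed files) are illustrative assumptions, not the actual schema.

```python
def audit_flushed_files(records):
    """Check files in tape-backed directories for consistent metadata.

    records: iterable of dicts with keys "path", "storage_info"
    (None if missing), "access_latency" and "retention_policy".
    For files flushed to tape we expect storage-info to be present,
    access latency NEARLINE and retention policy CUSTODIAL
    (assumed values for a tape-backed directory).
    """
    problems = []
    for rec in records:
        if rec.get("storage_info") is None:
            problems.append((rec["path"], "missing storage-info"))
        if rec.get("access_latency") != "NEARLINE":
            problems.append((rec["path"], "unexpected access latency"))
        if rec.get("retention_policy") != "CUSTODIAL":
            problems.append((rec["path"], "unexpected retention policy"))
    return problems
```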

The address of the file in OSM has a particular form. Validate that stored addresses have this form.
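The validation itself is a regular-expression match over the stored addresses. The page does not state the actual OSM address form, so the pattern below (store, group and bitfile-id components) is purely an illustrative assumption:

```python
import re

# Illustrative pattern only: the real OSM address form validated at
# DESY is not specified here. This one assumes store, group and
# bitfile-id query components.
OSM_ADDRESS = re.compile(
    r"^osm://[\w-]+/\?store=[\w-]+&group=[\w-]+&bfid=[\w-]+$"
)

def invalid_osm_addresses(addresses):
    """Return the stored addresses that do not match the expected form."""
    return [a for a in addresses if not OSM_ADDRESS.match(a)]
```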

Check replicated instances:

  1. every file has at least two copies
  2. no two replicas on the same host. Violations appear to be due to a bug when replication is triggered by a pool going down.
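Both conditions above can be checked in one pass over a replica dump; a sketch, assuming a pool-to-host mapping is available:

```python
def replica_problems(replicas, pool_host):
    """Check replicated files: at least two copies, no two on one host.

    replicas: dict mapping PNFS id -> list of pools holding a copy;
    pool_host: dict mapping pool name -> the host it runs on.
    Returns (pnfsid, problem) pairs.
    """
    problems = []
    for pnfsid, pools in replicas.items():
        if len(pools) < 2:
            problems.append((pnfsid, "fewer than two copies"))
        hosts = [pool_host[p] for p in pools]
        if len(hosts) != len(set(hosts)):
            problems.append((pnfsid, "two replicas on the same host"))
    return problems
```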

Checksum check on all files, run every two days. This is currently disabled due to apparent interference with the pool's selection of which replica files to delete.
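The per-file part of such a check recomputes the checksum and compares it with the stored value. dCache sites often store adler32 checksums; md5 is used below only because it is available in hashlib, so the algorithm choice is an assumption:

```python
import hashlib

def verify_checksum(path, expected_hex, algorithm="md5"):
    """Recompute a file's checksum and compare with the stored value.

    The algorithm is an assumption (dCache commonly uses adler32);
    any name accepted by hashlib.new() works here. Reads the file in
    1 MiB chunks to keep memory use bounded on large files.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```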

Count the number of deactivated files. This can happen if the HSM script returns an error code.

Check how long it takes to flush a file.

For how long is each mover alive? If a mover lives longer than the lifetime of a job on the WNs (e.g. several days), this may indicate a stuck mover. A mover that has opened a file but is not transferring any data is also considered stuck.
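Both stuck-mover symptoms could be detected like this. The record field names, the three-day job-lifetime default, and the one-hour idle window are all assumed values for illustration:

```python
def stuck_movers(movers, now, max_age_seconds=3 * 24 * 3600):
    """Flag movers that look stuck.

    movers: iterable of dicts with "id", "started" (epoch seconds)
    and "bytes_transferred" (illustrative field names). A mover is
    flagged if it is older than the longest expected WN job lifetime
    (default three days, an assumed value), or if it has been open
    for over an hour without moving any data.
    """
    flagged = []
    for m in movers:
        age = now - m["started"]
        if age > max_age_seconds:
            flagged.append((m["id"], "older than maximum job lifetime"))
        elif age > 3600 and m["bytes_transferred"] == 0:
            flagged.append((m["id"], "open but not transferring data"))
    return flagged
```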

Feature requests

List of all pools not in a poolgroup.

Read/write bandwidth per pool-queue, per queue, per poolgroup, and overall.

It would be useful to see test results going back in time, in order to look for correlations (multiple items failing at the same time).