wiki:dCacheCloud
Last modified on 03/28/14 14:34:55

Official Website Registration Website

Production System

Machine name | IP | OS | CPU | RAM | Disk Space | Function
dcache-cloud01.dcache.org | 131.169.98.105 | SL 6.5 | 24 Intel® Xeon® CPU E5645 @ 2.40GHz (6 cores) | 24 GB | 37 TB | Head Node + Pool Node
dcache-cloud02.dcache.org | 131.169.98.106 | SL 6.5 | 24 Intel® Xeon® CPU E5645 @ 2.40GHz (6 cores) | 24 GB | 37 TB | Pool Node

Software

Current dCache version installed: cloud-2.8.1
Current postgres version installed: postgresql92-server-9.2.5-1PGDG.rhel6.x86_64

Configuration

dCache: dcache-cloud is configured to replicate all the files that land on dcache-cloud01 pools to dcache-cloud02.

Postgres: The postgres database is backed up regularly; the interval is adjusted as needed. See the crontab on dcache-cloud01 for details. Postgres is configured as a hot standby.

SmallFiles

For files tagged with hsmInstance=dcache, the pools are configured to call the hsm-interal.sh script, which uses MongoDB. The script uses the database ceph-mds1/cloud.

The packing is configured to take files from the Tape subdirectory and pack them into 1 GB archives.

Backup

There are multiple backups of the database.

  1. A daily dump of all databases together and of each database individually.
  2. The database is running in hot standby mode.
  3. A third backup, based on Barman, is in preparation to allow point in time recovery.

Daily backups

The sql dumps of the postgresql database on dcache-cloud01 can be found on dcache-cloud02:/var/lib/pgsql/9.2/backups
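
The cron job itself is not reproduced on this page. As a rough sketch of what such a daily dump could look like (one dump of everything plus one dump per database), assuming it runs as a user that may connect as postgres and that BACKUP_DIR is writable; the function names and the file naming scheme are hypothetical, and copying the files to dcache-cloud02 is not shown:

```shell
#!/bin/sh
# Sketch only: BACKUP_DIR default and the file naming scheme are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/var/lib/pgsql/9.2/backups}"

# Build the dump file name for a database name ($1) and a date ($2).
dump_file() {
    echo "$BACKUP_DIR/$1-$2.sql"
}

daily_dump() {
    today=$(date +%F)
    # One dump of all databases together ...
    pg_dumpall -U postgres > "$(dump_file all "$today")" || return 1
    # ... and one dump for every non-template database.
    psql -U postgres -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate" |
    while read -r db; do
        pg_dump -U postgres "$db" > "$(dump_file "$db" "$today")" || return 1
    done
}
```

Calling daily_dump from the postgres user's crontab once a day would match the "daily" interval mentioned above.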

Hot Standby

This means that all operations performed on the postgresql master server (dcache-cloud01.desy.de) are mirrored on the slave (dcache-cloud02.desy.de). The hot standby status is monitored via Nagios: http://dcache-cloud-auxiliary.dcache.org/nagios/cgi-bin//extinfo.cgi?type=2&host=dcache-cloud01.desy.de&service=Postgresql+Hot+Standby

To start the hot standby postgres instance on the slave, execute:

    su postgres -c "/usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/hotstandby -l hotstandby.log -U postgres restart"

There is a simple way to check whether the replication really works:

  1. "ps aux |grep wal" on the master
        [root@dcache-cloud01-test ~]# ps aux |grep wal
        postgres  6520  0.0  0.0 220728  2820 ?        Ss   18:01   0:00 postgres: wal sender process postgres 131.169.5.194(41293) streaming 0/410003B0
    
  2. "ps aux |grep wal" on the slave
        [root@dcache-cloud02-test 9.2]# ps aux |grep wal
        postgres 16102  0.0  0.0 226552  2748 ?        Ss   18:01   0:00 postgres: wal receiver process   streaming 0/410003B0
    
  3. Compare the two outputs: the replication works if the streaming positions (0/410003B0 in this example) are the same
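
The manual comparison can be scripted. The following is a minimal sketch, assuming the ps output looks like the examples above (the WAL position is the last field of the "wal sender" or "wal receiver" line); the function names are made up for illustration:

```shell
#!/bin/sh
# Print the WAL position (last field) of the first "wal sender" or
# "wal receiver" process line read from stdin.
wal_position() {
    awk '/wal (sender|receiver) process/ { print $NF; exit }'
}

# Succeed if both positions are non-empty and identical.
same_wal_position() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Possible usage (host names as used on this page, ssh access assumed):
#   master=$(ssh dcache-cloud01 ps aux | wal_position)
#   slave=$(ssh dcache-cloud02 ps aux | wal_position)
#   same_wal_position "$master" "$slave" && echo "replication in sync"
```

On a busy master the two positions can differ for a moment while the slave catches up, so a single mismatch is not necessarily an error.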

Point in time recovery

We are using Barman to provide point in time recovery. Barman is quite easy to set up. Just follow this document: Barman HowTo.

We are backing up the Barman files to the local dCache storage, which is mounted like this:

    mount -t nfs4 -o rw,minorversion=1 dcache-dir-desy02:/pnfs/desy.de/desy/dcache-cloud /home/backup/backup_dcache-dir-desy02
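
Before anything is copied to this directory it may be worth checking that the NFS mount is actually present, otherwise the files end up on the local disk. A small sketch of such a check; is_mounted is a hypothetical helper, and its optional second argument (a mounts table, /proc/mounts by default) exists only to make it easy to test:

```shell
#!/bin/sh
# Succeed if directory $1 is listed as a mount point in the mounts
# table $2 (default: /proc/mounts).
is_mounted() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Possible usage:
#   is_mounted /home/backup/backup_dcache-dir-desy02 || echo "backup directory is not mounted" >&2
```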

Barman backup and recovery process:

  1. The user barman runs a cron job that executes "barman backup main" and then rsyncs the result to the /home/backup/backup_dcache-dir-desy02/barman/main directory
  2. If a recovery is needed, check whether the local barman directory on the slave is intact
  3. Execute barman recover main <time stamp> <path to recover to>
  4. Copy the recovered directory from the slave to the master
  5. Assuming dCache has been stopped, stop any running postgres instance and start a new one from the recovered directory, e.g.:
        su postgres -c "/usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data_recover start"
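
Steps 1 and 3 could be wrapped roughly as follows. This is a sketch under assumptions, not the script that is actually installed: BARMAN_HOME (/var/lib/barman is Barman's usual default), the rsync options and the DRY_RUN switch are illustrative. The recover call follows Barman's documented form, barman recover --target-time <time> <server> <backup ID> <directory>, which is more explicit than the shorthand in step 3; check it against the installed Barman version. Backup IDs are listed by "barman list-backup main".

```shell
#!/bin/sh
# Sketch only: BARMAN_HOME, BACKUP_MOUNT and DRY_RUN are illustrative assumptions.
BARMAN_HOME="${BARMAN_HOME:-/var/lib/barman}"
BACKUP_MOUNT="${BACKUP_MOUNT:-/home/backup/backup_dcache-dir-desy02}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: back up the server "main", then copy the result to the dCache mount.
barman_backup_and_sync() {
    run barman backup main &&
    run rsync -a "$BARMAN_HOME/main/" "$BACKUP_MOUNT/barman/main/"
}

# Step 3: recover a backup up to a point in time.
# $1: backup ID, $2: target time, $3: directory to recover to
barman_recover() {
    run barman recover --target-time "$2" main "$1" "$3"
}
```

Running with DRY_RUN=1 first prints the commands without touching any data, which is a cheap way to check paths before a real recovery.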
    

Monitoring

Machine name | IP
dcache-cloud-auxiliary.dcache.org | 131.169.5.195

Nagios is running on this machine. The following pages can be used to monitor dCache:

http://dcache-cloud-auxiliary.dcache.org/ganglia/
http://dcache-cloud-auxiliary.dcache.org/nagios/
http://dcache-cloud-auxiliary.dcache.org/pnp4nagios/

General tests against the dCache cloud are gathered under:
http://dcache-cloud-auxiliary.dcache.org/nagios/cgi-bin//status.cgi?host=cloud.dcache.org


Test System

The test system is installed on a Xen cluster, it-xen20, which is a sub machine of it-xen27. It is not possible to create snapshots of the machines.

Machine name | IP
dcache-cloud01-test.dcache.org | 131.169.5.193
dcache-cloud02-test.dcache.org | 131.169.5.194


Mailing Lists

cloud-announcements(ad)dcache.org



Support

cloudsupport(ad)dcache.org