wiki:tier-one-meeting-20180814
Last modified on 08/14/18 15:50:55

dCache Tier I meeting August 14, 2018

[part of a series of meetings]

Present

dCache.org(Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(Jens, Dmytro), PIC(), KIT(Xavier), Fermi(), CERN()

Agenda

(see box on the other side)

Site reports

NDGF

Update on ZooKeeper problem

Dmytro wanted to know the status of ticket RT 9451.

In particular, which files he should send.

These are the .zookeeper files in /var/log/dcache directory; e.g., /var/log/dcache/dCacheDomain.zookeeper.

pool changing to read-write

NDGF had a mysterious switch where pools that were previously read-only became read-write.

The only change was adding an admin's SSH key to the admin service.

Could this have triggered the change?

No, not unless there were several bugs: the admin service is an independent service.

Most likely explanations (in no particular order):

  1. pool(s) were restarted
  2. someone logged in and changed the pools
  3. some automated agent (outside of dCache) discovered the pools were read-only and fixed the problem.

The poolmanager logs changes to pool state. This should allow you to verify that the pool(s) actually became read-only, and also at what time they reverted to being read-write.

The pool log file should allow you to discover if the pool was restarted.

Finally, the access log file (/var/log/dcache/<domain>.access) logs when an SSH client connects, authenticates and disconnects from the admin interface. This should allow you to discover if anyone was logged in when the pool changed state.
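The checks above can be sketched as a couple of grep commands. This is only a rough guide, assuming the standard log locations mentioned here; the grep patterns are guesses, since the exact log-line format depends on the dCache version, so adjust them to match what the files actually contain.

```shell
# Hypothetical domain name; the standard log directory is assumed.
LOG_DIR=${LOG_DIR:-/var/log/dcache}
DOMAIN=${DOMAIN:-dCacheDomain}

# 1. Was the domain restarted?  Look for start-up messages in the domain log.
grep -i "starting" "$LOG_DIR/$DOMAIN.log" 2>/dev/null | tail

# 2. Was anyone logged in via the admin interface at the time?  The access
#    log records SSH connect, authenticate and disconnect events.
grep -i "ssh" "$LOG_DIR/$DOMAIN.access" 2>/dev/null | tail
```

Correlating the timestamps from these greps with the poolmanager's record of the state change should narrow down which explanation applies.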

KIT

dCache is running fine.

Database switch

The CMS instance had an issue last week: the Chimera database ran out of disk space. They decided to switch to the backup database instance, only to discover that the backup database was ~1 week behind the master instance. Therefore, after giving the database more capacity, they switched back to the master.

Although the data for files written while dCache was using the backup database is still present on the pools, those files no longer exist in the namespace (since the master does not have them).

Given that the inumber is a simple counter, the inumber values for files written via the slave server have since been reused.

Therefore, KIT decided that the simplest recovery procedure is to declare that data written while dCache was using the slave database is lost.

The question is how to get the pools to "forget" about this data.

The suggestion was to run the pnfs register command on the pools to ensure this old data is deleted.
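With many pools, the admin-shell commands can be generated as a batch and piped into the admin SSH interface. A minimal sketch, in which the pool names and admin host are placeholders and the `\s` broadcast syntax is assumed to be available in the admin shell:

```shell
# Build one "pnfs register" admin-shell command per pool.
# Pool names given to the function are hypothetical examples.
make_register_batch() {
    for pool in "$@"; do
        printf '\\s %s pnfs register\n' "$pool"
    done
}

make_register_batch pool-a_1 pool-a_2
# To run it for real, pipe the batch into the admin interface, e.g.:
#   make_register_batch pool-a_1 pool-a_2 | ssh -p 22224 admin@dcache-head.example.org
```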

Testing HA SRM door

KIT are still investigating HA deployment and are struggling to find the correct configuration / deployment.

They have DNS aliases that round-robin over two HA-proxy nodes, which in turn balance load over multiple SRM doors.

This was not working correctly with srmcp, as the host certificate presented by the SRM doors did not include the ha-proxy node's hostname.

This should be fixable by including the DNS alias and ha-proxy hostname as Subject Alternative Name entries in the SRM host certificate.
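As a sketch, a certificate request with the required Subject Alternative Names can be generated with OpenSSL (1.1.1 or later, for the -addext option). All hostnames below are placeholders; substitute the real DNS alias, ha-proxy and door names.

```shell
# Generate a key and CSR whose subjectAltName covers the DNS alias,
# the ha-proxy node and the door itself (hostnames are placeholders).
openssl req -new -newkey rsa:4096 -nodes \
    -keyout srm.key -out srm.csr \
    -subj "/CN=srm.example.org" \
    -addext "subjectAltName=DNS:srm.example.org,DNS:haproxy1.example.org,DNS:srmdoor1.example.org"

# Inspect the SANs in the resulting request before sending it to the CA:
openssl req -in srm.csr -noout -text | grep -A1 "Subject Alternative Name"
```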

KIT are also investigating ctdb as an alternative. This would involve DNS round-robin over multiple virtual IP addresses, with ctdb ensuring that any virtual IP address(es) assigned to a "dead" node are reassigned to live nodes. A reverse lookup on any virtual IP address should return the DNS round-robin name, and the host certificates should include the DNS round-robin name as a Subject Alt Name entry.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.