Last modified on 04/21/09 12:48:13

Developers meeting April 15, 2009

Agenda

  • Operational issues at Fermi: more input from Catalin or Jon?
  • Show stoppers for 1.9.3 release
    • The default of srm-ls needs to be synchronous.
    • A certain slowness is observed with 1.9.3
    • ACLs (reported by Anton Mitterer)
    • Others ?
  • Tanja: trouble-shooting a problem
  • Replica manager
  • ACL performance
  • ACL correctness
  • Point-in-Time backup script
  • How to handle the counters
  • DTNM

Operational issues at Fermi

dcap performance

dcap performance is not an issue any more. Fermi are currently running the normal dcap door with gPlazma configured to run as a module.

The performance problems were probably due to the backport of restrictedDcapDoor. Timur took the 1.9.1-branch HEAD version and copied it into the 1.9.2 tree, but under a new name. The door's batch file was changed to alter the interpreter. A change was needed to support an additional constructor, but this was simple to implement.

The new code was added to the existing jar file using the jar command-line utility. This new jar file was used in place of the existing jar file.

In production they reverted to the original jar file (as delivered as part of the release) and configured gPlazma to reproduce the behaviour of the restricted dcap door with the regular door. gPlazma is being used as a module rather than as a cell.

Billing NPE

Fermi have noticed a large number of errors in billing. These manifest as NullPointerExceptions. Timur agreed to send more details to Tigran.

Show stoppers for 1.9.3 release

Currently known show-stoppers are:

  1. SRM server must use asynchronous srm-ls by default.
  2. Further investigation of Tigran's suspicion of a slow-down observed with 1.9.3.
  3. Some ACL issues reported by Anton Mitterer.
  4. FTP list and sym-links (see patch 178).
  5. Broken static initialisation of new NFS code.
  6. FTP LIST permission handler interface.

Current status of each item:

  1. This is a simple fix and should be implemented "Real Soon Now".
  2. This needs further investigation by Tigran.
  3. This may require some changes in documentation.
  4. ?
  5. Tigran is aware of the problem.
  6. FTP LIST permission handler interface seems to be a limitation. Should be fixed, but maybe only as a short-term solution.

Tanja: trouble-shooting

Uni Chicago SRM transfers failing

Around 1% of SRM transfers fail with "No Write Pools Configured". The symptom is that all transfers fail for a short period of time; after this, transfers start to succeed again.

Owen asked how much memory they had. Tanja didn't know.

Since the discussion started to centre on information that was missing, it was agreed that Tanja should ask the Uni Chicago people to submit a ticket to support@… and the investigation should continue there: Tanja would be CC-ed on the ticket.

What is the recommended version?

Is 1.9.2-4 production ready? Yes: the dCache team will make it "green" so it becomes the recommended version.

Replica manager

There is an issue when the ReplicaManager reduces the number of replicas of a particular file. When it requests that a pool delete a (non-sticky) file, the ReplicaManager expects to receive a FileRemovedMessage from the pool. However, since it uses the admin interface to request the deletion, the message it receives is the ASCII string "removed".

Because the ReplicaManager never receives its expected FileRemovedMessage, the handle associated with the file removal request is kept until it times out. The default time-out is 12 hours; a work-around is to set this to something much smaller (5 seconds).
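
The effect of the mismatched reply can be illustrated with a small sketch. This is not dCache code: the class `PendingRemovals`, the helper `handles_left_after`, and the Python stand-in for `FileRemovedMessage` are hypothetical names chosen for illustration, and the real ReplicaManager may track its requests differently. The sketch only assumes what is described above: a handle is released by the typed message and otherwise lingers until the time-out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRemovedMessage:
    """Hypothetical stand-in for the typed reply the ReplicaManager waits for."""
    pnfs_id: str


class PendingRemovals:
    """Toy table of outstanding removal requests, keyed by file id."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._started = {}  # file id -> time the request was issued

    def request(self, pnfs_id, now):
        self._started[pnfs_id] = now

    def on_reply(self, pnfs_id, reply):
        # Only the typed message releases the handle. The plain string
        # "removed" from the admin interface is not recognised, so the
        # handle stays in the table.
        if isinstance(reply, FileRemovedMessage) and reply.pnfs_id == pnfs_id:
            self._started.pop(pnfs_id, None)

    def expire(self, now):
        # Drop every handle that is at least as old as the time-out.
        stale = [p for p, t in self._started.items()
                 if now - t >= self.timeout_seconds]
        for p in stale:
            del self._started[p]

    def outstanding(self):
        return set(self._started)


def handles_left_after(timeout_seconds, elapsed_seconds):
    """Issue one removal, answer it with the string reply, then expire."""
    table = PendingRemovals(timeout_seconds)
    table.request("0001", now=0)
    table.on_reply("0001", "removed")  # what the admin interface returns
    table.expire(now=elapsed_seconds)
    return table.outstanding()
```

One minute after the request, `handles_left_after(12 * 3600, 60)` still holds the handle, whereas `handles_left_after(5, 60)` has already dropped it, which is why the shorter time-out works as a stop-gap.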

The problem exists in all 1.9.2- and 1.9.1- releases (and possibly earlier).

This is not a show-stopper for 1.9.3 release since it isn't a regression.

ACL performance

This involves the interaction between PnfsManager, the doors, and SRM.

Tigran explained that the ACLs are stored in a separate database. This is code from DESY-Zeiten. He didn't recall the details, but remembered it as a fairly simple format.

All doors will need non-local database access. Gerd pointed out that this is the first time non-local database access is needed: up to now, information has been requested by sending messages.

Introducing a separate PnfsManager message would allow non-local ACL support without requiring direct DB access.

It was suggested this be discussed further at the forthcoming developers meeting at Chicago.

ACL permission correctness

Current support is "broken" since it doesn't check the complete directory path: a lack of permission to enter a directory somewhere within the path is ignored.

This behaviour is present currently (with PnfsManager without ACLs).
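
A minimal sketch of what a complete path check involves, assuming a simplified permission model. The tables `can_enter` and `can_read` and the functions `ancestors` and `may_access` are hypothetical and do not reflect how PnfsManager or the ACL module store or evaluate permissions; the point is only that every directory on the way to the target must be checked, not just the target itself.

```python
from pathlib import PurePosixPath


def ancestors(path):
    """Return the directories that must be entered to reach path, root first."""
    return list(PurePosixPath(path).parents)[::-1]


def may_access(user, path, can_enter, can_read):
    """Check the whole path: every ancestor directory, then the target.

    can_enter maps a directory to the set of users allowed to enter it;
    can_read maps a file to the set of users allowed to read it.
    Both tables are illustrative assumptions.
    """
    for directory in ancestors(path):
        if user not in can_enter.get(str(directory), set()):
            return False  # blocked somewhere along the path
    return user in can_read.get(path, set())
```

For example, if `can_enter` is `{"/": {"alice", "bob"}, "/data": {"alice"}, "/data/run1": {"alice", "bob"}}` and `can_read` is `{"/data/run1/file": {"alice", "bob"}}`, then `may_access("bob", "/data/run1/file", can_enter, can_read)` is false because bob cannot enter `/data`, even though a check of the target alone would let him through.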

This requires passing the user record when asking for ACL information, so that the check has the necessary context. Gerd said this should be done with all messages.

Gerd mentioned that the expense of doing the ACL check properly can be reduced by caching. This is what operating systems do. Naturally, one should be careful to invalidate any caches when changes are made to the ACLs.
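
The caching idea can be sketched as follows, under the assumption that decisions are cached per user and directory and that any ACL change on a directory discards the cached decisions for it. `CachedAclStore` and its `lookups` counter are hypothetical and exist only to make the behaviour visible; a real implementation would also have to consider cache size, expiry, and changes made through other components.

```python
class CachedAclStore:
    """Toy ACL store that caches per-directory decisions."""

    def __init__(self):
        self._acl = {}     # directory -> set of users allowed to enter
        self._cache = {}   # (user, directory) -> cached decision
        self.lookups = 0   # how often the backing table was consulted

    def set_acl(self, directory, users):
        self._acl[directory] = set(users)
        # Invalidate every cached decision for the changed directory.
        self._cache = {key: value for key, value in self._cache.items()
                       if key[1] != directory}

    def can_enter(self, user, directory):
        key = (user, directory)
        if key not in self._cache:
            self.lookups += 1
            self._cache[key] = user in self._acl.get(directory, set())
        return self._cache[key]
```

Repeated calls to `can_enter` for the same user and directory consult the backing table only once; after `set_acl` changes that directory, the next call looks it up again and sees the new ACL.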

Point-in-Time backup script

Timur requested further input on Catalin's point-in-time backup script.

Open questions included:

What to do about the review process?

Those who had reviewed the first iteration agreed to review the updated version.

Who was going to support the script?

Timur agreed to provide support for the script.

How was the script to be distributed?

Suggestions were to include the script in the dCache RPM or to distribute it as part of a contrib RPM (cf. Ron's info provider, some of Lionel's work). No consensus emerged on this issue.

How to handle the counters

The plan is for those who reviewed the code to check the new version ASAP. Once it is submitted, Tigran can merge it into modules/dCache source tree as he sees fit.

DTNM

Wednesday 22nd April 2009


Last modified by Patrick @ Sun Mar 7 00:02:42 2021