Last modified on 04/27/09 08:48:23

Developers meeting April 22, 2009

Agenda

  • Status of show-stoppers for 1.9.3 release:
    • SRM server must use asynchronous srm-ls by default,
    • Further investigation regarding Tigran's suspicion about a slow-down observed with 1.9.3,
    • Some ACL issues reported by Anton Mitterer,
    • FTP list and sym-links (see patch 178),
    • Broken static initialisation of new NFS code,
    • FTP LIST permission handler interface,
    • Performance of LIST command ACLs.
  • Progress on Replica Manager issue,
  • ACL performance,
  • Planning for forthcoming dCache developers meeting,
  • RT #4476: the NLST command,
  • Proposal: fixing the LIST verbose format in 1.9.2 branch.
  • Berkeley DB in pools and how to deploy
  • Releases

Show-stopper

The SRM server in trunk is now asynchronous. This item is closed.

Tigran has not yet started investigating the suspected slow-down. DESY has a JProbe license, so this may help.

With the ACLs there is only one issue left: how to represent inherit-only ACLs. There is a ZFS paper that defines inherit-only ACL using D?. In dCache the 'O' flag is used. The advantage of fixing this is that people who mount dCache and use the NFS interface to alter ACLs will see the same representation as within the admin interface. We agreed that this isn't a show-stopper.

The FTP list patch has been committed, so this is no longer considered a show-stopper RC bug.

Broken static initialization of NFS; Gerd has committed the PortRange class. Tigran to use this to fix the static initialization.

FTP LIST command and the Permission handler interface. The problem with the FTP LIST command performance *may* be due to its increased usage of the permission handler. The problem may be endemic within Trunk. The ACL code was committed today. This allows us to fix the permission handler interface, which in turn allows more performant queries.

Performance of LIST cmd with ACL: depends on fixing permission handler interface (see previous item).

Remaining issues are:

  • Tigran investigating suspected slow-down,
  • Broken static initialiser for NFS code,
  • LIST commands and ACL performance generally.

Replica Manager

No ticket has been opened in RT yet, but there is a bug in Trac (see ticket #230).

Tigran to open an RT ticket describing the problem.

The recommendation to reduce the time-out from 12 hours to 5 seconds is bad advice. RM may attempt to remove too many replicas, resulting in potential data loss.

Tigran reported the problem is with 1.9.1 (maybe 1.9.1-7) with Chimera and the new pool. Gerd reported the propagation of the delete message is done by PnfsManager rather than the individual NameSpaceProvider, so Chimera shouldn't be the problem here.

Alex asked DESY to send Replica Manager log files for further investigation; also, they should check the broadcaster configuration.

Gerd noted that ReplicaManager has a handler for PoolsRemovedFileMessage unrelated to cache location.

We see the files are removed; the RM receives the "removed" message (an ASCII string). Could it be that RM is waiting for a ClearCacheLocation message that it will never receive?

The RM task-ls shows tasks that last forever. Alex suggested checking the broadcast configuration: either the message is corrupt or it is not being delivered.

ACL Performance

Already discussed in show-stoppers (see above).

Planning for forthcoming dCache developers meeting

Timur has sent a link to the wiki page with the dCache agenda. People are encouraged to look at this page so that the agenda can be fixed.

The NLST command

The dCache FTP door currently treats NLST as an alias for the non-verbose LIST. This is wrong: NLST is supposed to be for listing directories only. If a file is given as an argument it should give an empty reply. The current behaviour confuses the Arc client.

Gerd to look into fixing this.
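The intended NLST behaviour described above can be sketched as follows. This is only an illustration, not the FTP door's code: the function name `nlst` is hypothetical, and the local filesystem stands in for the dCache namespace.

```python
import os
import tempfile


def nlst(path):
    """Return the names an NLST reply would contain for `path`.

    Hypothetical sketch of the behaviour agreed in the meeting:
    a directory yields its entry names, a file yields an empty reply.
    """
    if os.path.isdir(path):
        # Directory argument: list the names only, no verbose details.
        return sorted(os.listdir(path))
    # File argument (or anything that is not a directory): empty reply.
    # A real door would presumably send an error reply for a missing
    # path; that case is not modelled here.
    return []
```

With this sketch, calling `nlst` on a directory returns its entries, while calling it on one of the files inside that directory returns an empty list rather than the file's own name.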

Fixing the LIST verbose format in 1.9.2 branch

We had a patch to fix the LIST verbose format before ACLs were committed to trunk. It was not committed to 1.9.2 or 1.9.1 branches as 1.9.3 was expected soon. Given the apparent delay in releasing 1.9.3, would it be worth committing the fix to stable branches?

The consensus was that this is a bug-fix so OK, go ahead.

New Berkeley DB version and how to deploy

Gerd mentioned that a new, major release of Berkeley DB came out about 6 months ago. With this release, the on-disk format of the database has changed. The new version of the database will convert to the new format automatically; however, once the migration has taken place it is impossible to migrate back to the old Berkeley DB format. Because of this, when a site upgrades to a new version of dCache that is using the new Berkeley DB format they will be unable to roll back to an earlier version.
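The roll-back problem can be illustrated with a minimal sketch of a one-way format upgrade. Everything here is hypothetical: `PoolStore`, `open_store` and the integer format numbers are invented for illustration and are not Berkeley DB or dCache APIs.

```python
from dataclasses import dataclass


@dataclass
class PoolStore:
    # On-disk format currently stored (hypothetical numbering).
    format_version: int


def open_store(store, software_format):
    """Open `store` with software that uses `software_format`.

    Older data is upgraded automatically and irreversibly; data that is
    already in a newer format cannot be opened by older software.
    """
    if store.format_version > software_format:
        raise RuntimeError(
            "on-disk format is newer than this software; roll-back is not possible"
        )
    if store.format_version < software_format:
        # Automatic one-way migration on first open with the new version.
        store.format_version = software_format
    return store
```

In this model, opening a format-1 store with format-2 software silently upgrades it, and a later attempt to open the same store with format-1 software fails. That is the situation a site would be in after upgrading dCache and then trying to return to an earlier version.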

The consensus was to make this change in a major dCache release that focuses *only* on the Berkeley DB version change. For example: once 1.9.3 is released (and stable), a 1.9.4-1 release is made, which is the stable 1.9.3 release with the new version of Berkeley DB.

Tigran asked whether anything forces us to migrate to the new version of Berkeley DB. The answer was no.

He also mentioned the problem with people jumping releases.

Releases

Should we have 1.9.1 and 1.9.2 releases soon? Yes. They would include various fixes, including one for the hopping manager and a fix from Irina to the partition manager.

Could Fermi people review this fix to check whether Jon's instance will still work with the patch deployed?

What are our expectations for when a v1.9.3-1 release will appear? We do not expect it to come out for at least a couple of weeks.

DTNM

Wednesday, next week.