wiki:developers-meeting-20090909
Last modified on 09/10/09 11:52:56

[part of a series of meetings]

Participants

Tigran, Jan, Paul; Gerd; Timur, Dmitry, Gene, Vijay, Vladimir

Agenda

[see box to the side]

Status of work for 1.9.5

A (quick?) review of activity needed for the 1.9.5 release

New info service/info provider

Work is progressing. Most of the patches are now in. There are two largish ones outstanding. The first Gerd and Paul are working on; the second Paul will submit to RB (it's largely independent of the other big one).

With NDGF deploying Trunk, it's clear that there are some issues with the info service. A specific example is that pools no longer have pool size metrics.

Paul is able to reproduce the problem locally and is investigating.

Active/passive fixes for SRM client

Gerd thinks that the fixes are now all committed. Dmitry confirmed this when he joined the meeting.

We should do another SRM client release; but this is dependent on confirming that the code-changes have actually fixed the problem. We need someone to test this: Owen seems a natural choice.

Tigran mentioned it would be nice to have tests for this in the functional test suite. These tests should check active and passive mode of transfers.

PnfsManager based listing in SRM

Gerd reported that code has gone in to the dCache SRM interface to allow PnfsManager-based changes.

Unfortunately, this is not sufficient to eliminate the need to mount the namespace since the generic SRM component does a file-system 'ls' to collect information for the non-verbose ls.

The other unfortunate aspect is that the SRM queries the metadata for each file as a separate activity. One of the speed benefits comes from querying all (required) metadata of a file as part of the PnfsManager ls operation. This is currently not possible with the SRM, so directory listings will involve an extra round-trip.
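The cost difference can be illustrated with a minimal Python sketch. Everything in it is hypothetical: `FakeNamespace` and its methods are not dCache or SRM APIs, and the model only counts round-trips for per-file metadata queries versus a listing that returns the metadata together with the names.

```python
class FakeNamespace:
    """Hypothetical stand-in for a namespace service; it only counts round-trips."""

    def __init__(self, entries):
        self.entries = dict(entries)  # file name -> metadata dict
        self.round_trips = 0

    def list_names(self):
        """One round-trip that returns names only."""
        self.round_trips += 1
        return sorted(self.entries)

    def get_metadata(self, name):
        """One round-trip per file."""
        self.round_trips += 1
        return self.entries[name]

    def list_with_metadata(self):
        """One round-trip that returns names and metadata together."""
        self.round_trips += 1
        return sorted(self.entries.items())


def listing_per_file(ns):
    """List the names first, then query each file's metadata separately."""
    return [(name, ns.get_metadata(name)) for name in ns.list_names()]


def listing_batched(ns):
    """Fetch names and metadata in a single listing operation."""
    return ns.list_with_metadata()
```

In this toy model a directory of N files costs N + 1 round-trips with the per-file approach and a single round-trip with the batched one; both return the same result.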

Terracotta and SRM

Timur reported that he still has a few patches that need to be reviewed, and that he committed a couple of patches yesterday.

Timur mentioned that he doesn't feel confident that he'll be able to get in all the patches needed to use terracotta in SRM before the 1.9.5 cut-off.

There was a discussion about what it would mean if this happens: could these patches be applied to the 1.9.5 branch after the cut?

The opinion was "no". We have stable branches that specifically exclude non-bug-fixes because this increases the stability of dCache. Although 1.9.5 will be supported for 1 year, other feature releases (e.g., 1.9.6 etc) are anticipated. Sites that require new features can jump to 1.9.6, but we will support 1.9.5 for one year / first run.

What happens with PnfsManager without ACLs

TODO: remove from template (done)

new http door with https support

Nothing has changed: Tanya is currently on holiday.

xrootd mover reimplementation

This is a less ambitious patch than was originally anticipated. The patch retains the pool/mover model of starting a new cell as a mover.

As part of the NDGF deployment of Trunk, the patched pool was deployed as an Alice pool yesterday. It seems to be working well so far.

The patch has been updated and is now just waiting for review.

HSM cleaner for chimera

TODO: add to template (done)

Tigran is currently merging the work Irina has completed into current Trunk. This will be a series of patches. The first patch will be submitted today.

Fix ReplicaManager to use new pool msgs

TODO: add to template (done).

The replica manager now listens for the wrong message. This needs to be fixed.

ACL rename checking

TODO: add to template (done).

The ACL permission handler in PnfsManager in

No specific rename perm. in Chimera.

Permission handler for the doors (in services.acl) AclPermissionHandler

Namespace Permission handler has PosixPermissionHandler ...

Get rid of three when dropping the ACLs check on the doors.

Tigran and Gerd to iterate on this.

It doesn't work

Three things Tigran found didn't work:

o space manager stopped working.
o regression gsidcap doesn't work.
o start-up session.

1st: Tigran to investigate this further. 2nd and 3rd: the problem is understood and patches for these are in RB.

SRM w/o Space Manager

If the SrmSpaceManager is not enabled then the SRM log file contains many entries warning that the space manager isn't available. This confuses operators.

Timur agreed to suppress these messages. Tigran and Timur to discuss this off-line.

Is Jon going to upgrade?

Gerd asked whether Jon was planning on upgrading to 1.9.5. This release has many performance improvements that reduce the load on PNFS.

Timur said that he is intending to stay with 1.9.2 and have custom patches. He is concerned that it took a long time to stabilise his system after the upgrade to 1.9.2.

Gerd reiterated that 1.9.5 contains many fixes to reduce the load on PNFS. Timur will relay this information to Jon.

Resilient manager and space manager

Fermi talked about how to support resilient manager

RT#5050

Timur suggested that pools be made precious (lsf=precious). The idea was that incoming files would be marked precious, and so be candidates for the replica manager. Gerd said that lsf=precious merely means that an HSM isn't attached; whether a file is precious is determined by the Retention Policy of the file.

There was some discussion about how differences could have arisen between the BNL system and Jon's. It is believed that this is because the default AL/RP at BNL is not CUSTODIAL/NEARLINE but rather REPLICA/ONLINE. This could have happened by specifying Default-AL/RP tags or by altering the system default.

There was a discussion on how the current replica manager achieves replication and how the migration module does this. Gerd elaborated on how the migration module preserves stickiness by using a different interface to the p2p module. There is a new command that includes the target state.

Another topic was briefly discussed: what about the space calculation? If files are replicated then the available space will be wrong. Timur described how this is a side-effect of achieving the replication. Users should take this into account.

An idea that was floated was using the Retention Policy OUTPUT as a marker that a file should be replicated. Two problems with this approach: any change would need to be discussed with WLCG (we can't "just do it") and RP=OUTPUT can only be specified when copying in the file.

Another approach is to support plugin policies for the resilient manager. These plugins would allow the replica manager to choose different sets of files for replication. Timur asked if there were any example policies: replication on space-token, replication of all files in a pool.
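A minimal Python sketch of what such plugin policies could look like follows. It is purely illustrative: `FileInfo`, `by_space_token`, `by_pool` and `select_for_replication` are hypothetical names, not part of the replica manager, and the two policies simply mirror the examples mentioned above (replication by space-token, replication of all files in a pool).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical description of a file as seen by the replica manager."""
    pnfs_id: str
    pool: str
    space_token: Optional[str] = None


# A policy is any callable that decides whether a file should be replicated.
Policy = Callable[[FileInfo], bool]


def by_space_token(tokens) -> Policy:
    """Replicate files written into one of the given space tokens."""
    wanted = frozenset(tokens)
    return lambda f: f.space_token in wanted


def by_pool(pools) -> Policy:
    """Replicate every file that lives on one of the given pools."""
    wanted = frozenset(pools)
    return lambda f: f.pool in wanted


def select_for_replication(files, policy):
    """Return the files that the plugged-in policy selects."""
    return [f for f in files if policy(f)]
```

Keeping a policy to a single yes/no decision per file means that new selection rules could be added without touching the code that performs the replication.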

Replica manager

Decides it needs to replicate when it receives an AddCacheLocation message from the pool.

Timur: AddCacheLocation needs to be amended to say whether the file is cache+sticky.

This won't work, however, since the new pool no longer emits an AddCacheLocation message. The ReplicaManager must now listen for the SetFileAttributes message.

Another, related issue is that it would be nice if the ReplicaManager registered itself with the broadcaster. This would eliminate the need to configure it by hand. The registration is soft-state, so it might be lost.
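The soft-state idea can be sketched in a few lines of Python. This is a hypothetical model, not the dCache broadcaster: `SoftStateRegistry` and its methods are invented for illustration, and time is passed in explicitly so that the behaviour is easy to follow.

```python
class SoftStateRegistry:
    """Hypothetical broadcaster-side registry; entries expire unless renewed."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self._expires = {}  # subscriber name -> expiry time

    def register(self, name, now):
        """Create or renew a registration; both are the same operation."""
        self._expires[name] = now + self.lifetime

    def subscribers(self, now):
        """Drop expired entries and return the names still registered."""
        self._expires = {n: t for n, t in self._expires.items() if t > now}
        return sorted(self._expires)
```

In this model a subscriber that stops renewing simply disappears from the registry after `lifetime` has elapsed, which is why a lost registration has to be covered by periodic re-registration.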

This led to a brief discussion on the merits of JMS and how this would solve this particular problem.

Issues from yesterday's Tier-1 meeting

Nothing serious; just to mention it: Triumf was asking about this.

Outstanding RT Tickets

[This is an auto-generated item]

RT 4699: GGUS-Ticket-ID: #50779 ASSIGNED to dCache Developers srm-{get,set}-permissions does not work on DPM

This is a problem with the SRM client. We can make a new release, but it should be tested first, otherwise there would be little point.

The testing should be done against DPM.

Paul: to ask Owen to test new SRM client against a DPM endpoint supplied by WLCG. (Done).

TODO: Drop ticket from further discussion. (Done).

RT 4712: retry setting for suspended files

The problem here is the files ending up in the suspended state. This is probably due to some message timing out under heavy load.

Patrick has agreed to look into this.

TODO: Drop ticket from further discussion. (Done).

RT 4716: list 10k on chimera as root takes O(2-4) more than as a regular user!

Tigran to investigate.

TODO: Drop ticket from further discussion (done).

RT 4721: Problem in GridftpClient when copying files to dCache with srmcp

The problem is with the cryptix library. Jan is investigating and gave a description of the problem.

Jan is now trying to create a minimal test-case to demonstrate the problem. Once this is done, we can go to the IBM people to look into why their JVM is broken.

TODO: Drop ticket from further discussion (done)

RT 4733: gsidcap failures with new CAs

The decision was that the ticket should be resolved via RT in the normal fashion.

TODO: Drop ticket from further discussion (done).

RT 5055: thousands of mounted pnfs .(access) entries

TODO: Paul to ping them to get more info.

We should keep this for now.

RT 5089: gPlazma mapping not based the attribute that the voms proxy has been initialized with

Gerd: would like Ted to have a look at this. Timur: ask Ted to have a look. Gerd to reassign to Ted.

TODO: Drop ticket from further discussion (done).

Review of RB requests

DTNM

Same time, next week.