wiki:developers-meeting-20090819
Last modified on 08/21/09 17:07:23

[part of a series of meetings]

Participants

Timur, Vijay, Alex, Vladimir; Gerd, Marco; Patrick, Owen, Jan, Paul

Agenda

[see box to the side]

Status of work for 1.9.5

[standing item]

A (quick?) review of activity needed for the 1.9.5 release

New info service/info provider

Active/passive fixes for SRM client

On hold 'til Dmitry is back from vacation.

Moving tape protection inside pool manager

Irina handed this item over to Gerd. He is going to look into it, maybe make some changes then re-submit the patch to RB.

One change Gerd would like to do is to move where the tape-protection check is done. Currently the patch does this for all incoming requests, even if they would not trigger a restage.

Another potential change is in which thread the check is done. Currently the check is done in the message thread. It would be better if the check was handled in another thread; for example, in the request handler thread.

Patrick: is the pattern matching done correctly? Gerd: don't know. Paul: should be, there are unit tests.

Gerd mentioned that, if performance is an issue, it is trivial to speed the check up: we can simply cache the result for a given DN.
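
A minimal sketch of the per-DN caching idea, assuming the tape-protection decision depends only on the DN. The names `is_stage_allowed` and `ALLOWED_DN_PATTERNS`, and the example pattern, are hypothetical and not taken from the patch.

```python
import re
from functools import lru_cache

# Hypothetical allow-list of DN patterns; the real configuration
# format is not described in these notes.
ALLOWED_DN_PATTERNS = [re.compile(r"^/O=Example/OU=Staging/CN=.*$")]

# Counts how often the patterns are actually evaluated (for illustration).
calls = {"count": 0}


@lru_cache(maxsize=1024)
def is_stage_allowed(dn: str) -> bool:
    """Return True if the DN matches any allowed pattern.

    The result is cached per DN, so repeated requests from the same
    user skip the pattern matching.
    """
    calls["count"] += 1
    return any(p.match(dn) for p in ALLOWED_DN_PATTERNS)
```

If the pattern list can change at run time, the cache would need to be dropped at that point, for example with `is_stage_allowed.cache_clear()`.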

Patrick asked how this works with the door optionally doing the check.

If the check is done at the door then the door sends the root DN (?) and a bit describing whether to allow staging (if staging is triggered).

If the door is configured not to do the check then the proper Subject is sent and the bit is not set.
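
The two door configurations can be sketched as follows. This is only an illustration of what the door sends in each case: `StageRequest`, `build_request`, `ROOT_DN` and the `check` callback are hypothetical names, and the notes themselves are unsure whether the root DN is really what is sent.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder for whatever identity the door uses when it has done
# the check itself (assumed here to be a root DN).
ROOT_DN = "root"


@dataclass
class StageRequest:
    subject_dn: str       # identity sent along with the request
    stage_allowed: bool   # the "bit": only meaningful if the door did the check


def build_request(user_dn: str, door_checks: bool,
                  check: Callable[[str], bool]) -> StageRequest:
    """Build the request a door would send in either configuration."""
    if door_checks:
        # Door performs the check: send root identity plus the result bit.
        return StageRequest(subject_dn=ROOT_DN, stage_allowed=check(user_dn))
    # Door does not check: send the real Subject and leave the bit unset.
    return StageRequest(subject_dn=user_dn, stage_allowed=False)
```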

PnfsManager based listing in SRM

Timur: this item is blocking on other activity completing.

Gerd said he has been looking at this and asked whether he can submit patches for it. Timur: OK.

Refactoring of Pin manager

Gerd reviewed first part: working on improving the patch.

Timur to get in touch with Gerd after the meeting to discuss the logic of the patch.

Patch only adds some simple new behaviour.

Gerd and Paul described how the logic for the pin manager wasn't clear from the code; but that we are not expecting a complete restructure in the next two weeks.

Timur and Gerd (and hopefully Paul) to meet up to discuss this at some point next week.

Terracotta and SRM

Gerd submitted request to refactor again into smaller patches. Timur to talk to Gerd about this.

Gerd: volume of patch (3,000 lines) makes it overwhelming. Appreciate that majority is mechanical, but can we split this into multiple patches?

Timur: can we do this code review over the phone?

Gerd: let me have another look.

PnfsManager based listing in dirDomain

Gerd: worked on this. Two patches, one is a simple clean-up. The other is pending.

There is some stuff with the root path that is confusing.

There is a parameter pnfsRoot that doesn't seem to work with Chimera. This is to construct the path to return to the client.

Patrick: can't you do an ls with PNFS-IDs?

Owen: was experimenting with dcap; listing of directories only works for PNFS and not for Chimera.

What path does the client provide when asking for a directory; is it a subpath or the full path?

Gerd: to write email with details

PnfsManager based permission handling in all doors

Same status as last week: patch waiting for review.

Paul to look at this.

New http door with https support

Not complete yet; Tanja is having problems getting secure connections to work. She's in contact with the Grizzly developers about how best to achieve this.

Gerd mentioned that Marco suffered from what sounds like the same problem. He tried to get Grizzly to work with secure connections and got stuck for ~1 week. He then switched to Jetty and got it to work in half a day.

Even if it goes in now it would be too late for Marco.

However, the code would be nice to have in 1.9.5, so we should keep this for now.

xrootd mover reimplementation

No progress, waiting on someone to review the patch.

the p2p trigger-on-load

New patch submitted and marked ship it. Just need to commit it.

HSM, Chimera and its cleaner

Gerd: I have suspected for quite a while that the HSM cleaner doesn't send a message to the pool, so it doesn't support cleaning of the HSM files.

HSM cleaner for PNFS does this.

There were patches earlier in the year from Irina; Gerd had comments, but thinks that the patch didn't get committed and somehow this slipped off the radar.

Tigran needs to look into this when he's back.

NDGF are still deleting files and have noticed that files are no longer being deleted from tape.

The removed files should remain in the HSM trash table. It would be a problem if this was not so.

SRM-rm and ACLs

During last week's meeting, Gerd mentioned that there is no ACL support on SRM-rm.

Patrick reminded us that this is important for customers.

Timur: we should discuss how to implement this.

There is a new field in all Message classes to specify the Subject: the person who made the request.

Gerd has been looking into this. There is currently a remove companion. If the decision for whether a remove operation should go ahead is delegated to the PnfsManager then the remove companion can be made much simpler.

There is a pending patch to restrict delete operations to certain types (e.g., only allow deleting of a directory). This is because some protocols have commands that should only work for certain name-space entries.

With recursive deletes, we would still need to do directory listings, but the complete list of entries could be sent to PnfsManager: those that succeed would be deleted and SRM would trust that PnfsManager takes care that only authorised users can delete entries.

Timur: recursive deletion only allowed for empty directories. This was a decision from WLCG MoU.

Gerd: even this shouldn't be a problem. The crucial question is whether a partially successful recursive delete should be allowed?

We don't know.

Timur (+ others?) to look into the answer to this question.
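
The delegation idea discussed above can be sketched as follows, assuming the name-space service makes the authorisation decision per entry. `FakeNamespace` and `delete_entries` are hypothetical stand-ins, not dCache classes, and the ownership test is a deliberately simplistic placeholder for real ACL handling.

```python
class FakeNamespace:
    """Stand-in for the name-space service; `owners` maps path -> owner."""

    def __init__(self, owners):
        self.owners = dict(owners)

    def delete(self, path, subject):
        # The authorisation decision lives here, not in the caller.
        if self.owners.get(path) != subject:
            return False
        del self.owners[path]
        return True


def delete_entries(namespace, paths, subject):
    """Send the whole list for deletion; return (deleted, refused) paths."""
    deleted, refused = [], []
    for path in paths:
        if namespace.delete(path, subject):
            deleted.append(path)
        else:
            refused.append(path)
    return deleted, refused
```

In this sketch a non-empty `refused` list is exactly the partially successful case; whether that outcome should be reported as success or failure is the open question recorded above.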

Issues from yesterday's Tier-1 meeting

[standing item]

PIC and pinning states

Gerard reported a problem (there's a ticket).

Owen: This is part of a series: we now have four tickets about this bug in pin manager.

Timur: what I'm fixing just now isn't directly related to the bug.

The bug report is that, if a file is pinned on a pool that is currently down, then further attempts to pin this file will fail. This results in SRM transfers failing, since a transfer will attempt to pin a file before returning the TURL.

Timur's current work is fixing a problem where, under certain circumstances, a pin stays forever in PINNING state.

Action: Owen to update the tickets -> Done.

dCache recovering from db interruption

NDGF reported that they noticed that some dCache components do not recover if the database becomes unavailable. Instead, they must be manually restarted before they start to work. The components that are badly behaved in this respect are: Chimera cleaner and SRM.

Gerd: we should try to fix these problems. They are probably not too difficult to fix and allowing database restarts makes dCache more robust.
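
One common way to let a component survive a database restart is to drop the broken connection and retry with a growing delay. The sketch below illustrates that general pattern only; it is not how the Chimera cleaner or SRM are written, and `run_with_reconnect` and all of its parameters are hypothetical.

```python
import time


def run_with_reconnect(operation, connect, is_connection_error,
                       retries=5, base_delay=0.01, sleep=time.sleep):
    """Run operation(conn), reconnecting when the database is unavailable.

    connect             -- callable returning a fresh connection
    is_connection_error -- callable deciding whether an exception means
                           "database unavailable" (retry) or a real error
    """
    conn = None
    for attempt in range(retries + 1):
        try:
            if conn is None:
                conn = connect()
            return operation(conn)
        except Exception as exc:
            # Re-raise real errors, and give up after the last attempt.
            if not is_connection_error(exc) or attempt == retries:
                raise
            conn = None                          # force a reconnect
            sleep(base_delay * (2 ** attempt))   # exponential back-off
```

A long-running service would typically keep retrying indefinitely rather than give up after a fixed number of attempts; the bound here just keeps the example finite.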

Owen to put a ticket into trac to keep track of this long-term goal. -> Done

Release of PNFS

Not released yet.

If not done for Wednesday then it will be done for the following Wednesday.

There was some discussion about whether to supply a statically linked version; the conclusion was to supply both: statically and dynamically linked.

Preferred platform: SL4 or SL5? Owen to investigate this further.

Owen to look into compiling PNFS. If there's no progress by mid-day Thursday he'll contact Vladimir for help.

Update: Owen is currently working on this with Vladimir

Outstanding RT Tickets

[auto-generated standing item]

RT 4571: PnfsManager set log slow threshold

Although ostensibly this is a simple problem, it's actually part of a deeper issue: how dCache interacts with its configuration files. Some configuration options are settable within dCache (acting more like saved state) whereas others may be altered but set only in the configuration file.

Gerd has no idea of the proper way of doing this. No one else expressed an opinion.

The problem is likely to be fixed at one of the dCache workshops ... or assign a task to someone to look into the market.

Patrick asked if there is something in Spring we could use. Yes, there's something like JavaConfig (sp?) that uses Java property files to call setters on objects, but Gerd doesn't have experience of using it.
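
The "property file calls setters" idea can be illustrated independently of Spring. The sketch below is a Python analogue, not the Spring mechanism itself; `apply_properties`, `PnfsManagerSettings` and the property name `log_slow_threshold` are all hypothetical.

```python
def apply_properties(obj, text):
    """Call obj.set_<key>(value) for each `key = value` line in text."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        setter = getattr(obj, "set_" + key.strip(), None)
        if setter is None:
            raise KeyError("unknown property: " + key.strip())
        setter(value.strip())


class PnfsManagerSettings:
    """Hypothetical component with one configurable option."""

    def __init__(self):
        self.log_slow_threshold_ms = None

    def set_log_slow_threshold(self, value):
        # The setter owns parsing and validation of the raw string.
        self.log_slow_threshold_ms = int(value)
```

With this shape the same setter can be called both at start-up from the file and later from an admin command, which is one way to narrow the gap between saved state and configuration described above.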

Paul to add the short-term fix.

Trac entry to be added to find a general solution.

RT 4699: GGUS-Ticket-ID: #50779 ASSIGNED to dCache Developers srm-{get,set}-permissions does not work on DPM

Timur commented that a patch from Dmitry has been committed but it looks like it was only a fix for the SRM-GET part. Waiting on Dmitry to get back for further investigation.

RT 4701: feature request: catch java ConnectionException when SRM-server is not running

We can redirect stdout/stderr from this stop command, since the script checks whether tomcat stops.

Disadvantage is that any real error messages would also disappear.

Alternative: we have the process-id file. Before telling tomcat to stop, check whether tomcat is still running.
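
A minimal sketch of the process-id check, assuming a POSIX system and a PID file that contains just the process number. `is_running` and `stop_if_running` are hypothetical helpers; the real stop script is not shown in these notes.

```python
import os


def is_running(pid_file):
    """Return True if the process named in pid_file appears to be alive."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no PID file, or unreadable contents
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def stop_if_running(pid_file, stop):
    """Call stop() only when the server is still running."""
    if is_running(pid_file):
        stop()
        return True
    return False
```

This keeps genuine error messages from the stop command visible, because the command is simply not run when there is nothing to stop.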

Timur to have a look.

RT 4702: feature request: pools smaller than 4 GB

Solved.

Gerd explained what the gap was.

RT 4706: Default owner for the chimera database has administrative privileges

No discussion.

RT 4708: XML parser vulnerability

Gerd: looked at the Xerces project. There hasn't been a new release since 2007. There was a commit to SVN in June 2009, but no announcement about a new release.

Our options seem to be:

  • compile our own version,
  • move to using the Java built in support,
  • ignore the issue,
  • take an already-patched version from someone.

Some distributions are shipping a patched version of the JVM.

Do we know to what extent we can depend on the OS providing the Java dependencies here? Owen reported that gLite doesn't do this well.

Owen: to put into trac explicit dependency on Xerces as a long-term goal. Update: Done

Paul & Gerd to do something short-term.

RT 4712: retry setting for suspended files

The site claims that pools are not going offline but that files are going into suspended state.

Gerd: may happen if the pool doesn't respond to three successive poolmanager requests.

Can switch this suspend off, but then file transfers will fail instead. However, with a sufficiently long time-out, this isn't a problem.

The user reported that switching to a higher timeout value was not respected by PoolManager.

If the problem is something they can reproduce then switching on debug logging in the pool manager would catch useful information.

Owen to follow this up.

RT 4716: list 10k on chimera as root takes O(2-4) more than as a regular user!

Gerd: most likely explanation is caching.

Paul to copy Gerd's reply "how to test Chimera" to wiki page.

RT 4721: Problem in GridftpClient when copying files to dCache with srmcp

Gerd suggested using a different JVM and checking whether it works.

RT 4731: PinManager: files in pinning state while their pinning have already expired

Already discussed.

RT 4733: gsidcap failures with new CAs

This is potentially a very serious ticket.

Owen: did a full fresh install, with the new RPM and was unable to reproduce the problem.

Waiting on a reply from Lionel.

RT 5039: SpaceManager in 1.9.3-p3 not picking up policy directives

Original problem was due to dCacheConfig being altered on the wrong host.

However, this illustrated that the Space manager still has options for default AL and RP. These are displayed in the info admin command output.

Gerd speculated that these may be used as defaults when doing a reserve space without specifying the AL and RP values. Timur agreed that this is how the default values are used.

Gerd recommended that the option be removed: it's too confusing. Instead the Space Manager should get default values from the PnfsManager.

Timur: we need to revisit the issue.

RT tickets and actions

Does everyone know what they are? Do people feel they are useful? Are we missing any obvious actions?

In general yes.

Policies

Should items be automatically unchecked once the meeting is over?

This should be a manual process on a ticket-by-ticket basis. Keep all tickets for next week.

How long before a meeting should the list be generated?

There was a discussion extolling the benefits of auto-generating the list.

Half an hour.

What criteria should be used to decide if a ticket is to be brought up at a meeting?

Gerd: ones that people feel should be discussed.

Patrick: any urgent Tier-1 ticket that has not been handled in a few days should be marked for discussion.

Trac wiki templates

When you create a new wiki page, you will see a new dialogue box asking which template to use. Don't panic, just select the "empty page" one.

Review of RB requests

[standing item]

Issue with GGUS

Owen reported that there has been an issue with the pairing of GGUS tickets and our own RT tickets. This has resulted in the two no longer being paired.

Owen has asked Sven (DESY RT sysadmin) to investigate this.

Sven's reply to the team was lost; Owen will summarize it.

Update: Owen has now done this.

SVN access

Vijay now enjoys SVN access.

DTNM

Proposed: same time, next week.

People to email to figure out what system to use; e.g., whether esnet is an alternative.