wiki:developers-meeting-20100120

[part of a series of meetings]

Participants

Timur, Vijay, Gerd, Patrick, Jan, Tigran, Tanja, Irina, Owen, Paul

Agenda

[see box to the side]

Postcards

Up to two minutes (uninterrupted) per person where they can answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Gerd: busy fixing PUT with WebDAV on multihomed machines; the address relayed to was determined in a naive way; fixing scripts; locking issues with the SRM. No plans for next week.

Tigran: releasing 1.9.5-12 and 1.9.6-2. Finding out why ROOT isn't working with dcap. Fixing a few problems found at DESY. Trying to install a kernel for SL5 that will use NFS v4. We're building up a "grid lab" at DESY: dCache with WebDAV and NFS v4.1. Next week: more of the same.

Jan: two things: tried to set up a multi-node dCache to reproduce Gerd's serialisation issue; investigating the HTTP-door problem that Tanja found; working with ETICS. Next week: ETICS and web-admin work.

Paul:

Owen: ETICS, deployment and certification test-bed. Next week: ETICS and tag a release of dcap.

Patrick: last week: at CERN participating in the GDB; the outcome is quite good. 5--8 hours from data taking till first results, so data-management is working fine. People are saying that there are still problems with "dCache file access", without much detail. Next week: try to convince people at ATLAS in Munich to write up what they did to check file-access open times: 0.8--1 seconds have been reduced to 0.1 seconds. Ask them to write a small paper about this. ATLAS write files in a layout that is fine for writing but a pain for reading. ATLAS want to fix this, but it may take up to a year for new code to be deployed. ATLAS have also been trying changes to the dcap library to improve file-access throughput; this can be deployed quickly. Time-out of three weeks for writing the paper; if not, then we'll write it.

We're building up a test system where we can show that dcap and NFS4 have the same throughput.

Irina: trying to set up dCache on multiple nodes for future tests. Porting HSM cleaner code from 1.9.6 to trunk. Next week: much the same.

Tanja: last week: looking into the RT tickets. Next week: hope to complete the CSS support for WebDAV and also look at the PoolManagerAdapter: common code in all doors.

Timur: last week: preparation for the pending upgrade of the public dCache; testing with enstore and kerberised dcap, mostly the things that are not a "standard" part of dCache. Next week: more testing and, if things go well, the actual upgrade on Thursday. Deployment is a standard dCache RPM, but an additional Fermi-specific RPM is installed on top of the dcache.org RPM to configure the dCache system.

Vijay: last week: able to clean up files from the pool: cleaner was running in the wrong mode. Also, trying to release encp work, but delayed due to illness. Found two problems: one now fixed, but the other needs discussion (see below). Next week: get the encp released.

Dmitry: last week: extended w/e; discussing a patch for marshalling SRM-ls requests: if there are multiple SURLs then we create a container request with sub-requests. The existing state-machine is not sufficient and needs an additional state. Since Timur is busy with the upgrade, this work is delayed a bit. Looking at why the requests in 5168 (see below) fail. Next week: modify real-encp; in addition, there is one RB request on which Gerd has comments (regexp on Linkgroup auth. file). Seven RT tickets need some attention.

Questions

Small patch for dcap. When testing this, it was discovered that ROOT isn't working with dcap. ROOT doesn't handle the URL syntax: if the URL starts with "dcap:" then it uses the locally mounted filesystem. There was another issue with ROOT using xrootd as a fall-back protocol; the result is that the file is opened but no data is read. It is unclear what effect this has.

Still have the long-standing issue with ALICE & NDGF that some connections hang, but this seems unrelated. Tigran will try with an old ROOT version, but will contact Gerd if there are problems.

The developer from BNL managed to get his fix into all ROOT branches; the Savannah ticket mentions that ROOT committed this to all branches.

Status of work for 1.9.5

A (quick?) review of activity needed for the 1.9.5 release

Gerd: submitted several patches, some for 1.9.5.

File checksum even for failed restores.

Tigran submitted a small patch for the dcap door.

Jon also submitted an issue today: he cannot kill a dead mover. This is because we don't use NIO for all mover implementations. Either switch to using NIO or use an alternative (non-interrupt-based) means to stop the mover.

Gerd to look at how we could fix this and then make a decision on whether it goes into 1.9.5.
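As a rough illustration of why NIO matters here (this is not dCache's actual mover code): a thread blocked in a classic java.io read ignores Thread.interrupt(), whereas a read on an NIO FileChannel aborts with ClosedByInterruptException, so a mover built on NIO can be stopped cleanly. A minimal sketch, with a hypothetical source path:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.ClosedByInterruptException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    /** Sketch only: a transfer loop that can be stopped with Thread.interrupt(). */
    public class InterruptibleTransfer implements Runnable {
        private final Path source;  // hypothetical data file

        InterruptibleTransfer(Path source) {
            this.source = source;
        }

        @Override
        public void run() {
            ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
            try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
                while (in.read(buffer) != -1) {   // blocked reads are interruptible
                    buffer.clear();               // ... here the data would be sent to the client
                }
            } catch (ClosedByInterruptException e) {
                // The "kill mover" path: the channel is closed and the thread exits.
                System.out.println("mover stopped by interrupt");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

With a plain InputStream the read would keep blocking and the interrupt would be lost, which matches the "cannot kill the dead mover" behaviour.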

Status of work for 1.9.6

A (quick?) review of activity needed for the 1.9.6 release

(as above for 1.9.5 patches)

Nothing specific for 1.9.6

Status of work for Trunk (a.k.a future 1.9.7)

A (quick?) review of activity needed for the 1.9.7 release

Long-standing issue with the CA certificates in SRM. This hasn't happened yet with … Speculated that this might be related to timing issues, perhaps because JGlobusFX is not used in Jetty.

At about the same time, the CRL handling was changed.

Could be a non-atomic update on the CA CRL.

Which script are you using? The grid-update-crls script from the ARC distribution.
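If a non-atomic update is indeed the cause, the usual remedy is to write the new CRL to a temporary file and rename it into place, so a concurrent reader never sees a half-written file. A minimal sketch of that idea (the paths are illustrative, and this is not what grid-update-crls actually does):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    /** Sketch only: replace a CRL file atomically so readers never see a truncated file. */
    public class AtomicCrlUpdate {
        public static void replaceCrl(Path crlFile, byte[] newCrlBytes) throws IOException {
            // Write the new CRL to a temporary file in the same directory,
            // so the final rename stays within one filesystem.
            Path tmp = Files.createTempFile(crlFile.getParent(),
                                            crlFile.getFileName().toString(), ".tmp");
            Files.write(tmp, newCrlBytes);

            // Atomically swap the new file into place: readers see either the
            // old CRL or the new one, never a partially written file.
            Files.move(tmp, crlFile, StandardCopyOption.ATOMIC_MOVE,
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }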

Issues from yesterday's Tier-1 meeting

No major issues to report. FZK are planning to migrate to Chimera in the week of Feb 1st--5th. TRIUMF are also looking at migrating; they plan to make a final decision next week.

Chimera functionality for enstore

Vijay:

Two problems.

First: the way the dCache namespace is used:

  1. create the namespace entry in Chimera,
  2. use '.(fset)' to set the size,
  3. use the encp command to …

Whenever the '.(fset)' command is used, a "file I/O error" is returned when the entry is then accessed via '.(access)'.
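For reference, a minimal sketch of steps 1 and 2 of that workflow as filesystem operations on an NFS-mounted namespace. The mount point is hypothetical and the '.(fset)' magic-path syntax shown is the one documented for PNFS; whether Chimera handles it the same way is exactly what is being investigated here:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    /** Sketch only: create a namespace entry and set its size via the '.(fset)' magic path. */
    public class NamespaceSizeSketch {
        public static void main(String[] args) throws IOException {
            Path dir = Paths.get("/pnfs/example.org/data");   // hypothetical mount point
            String name = "myfile";

            // Step 1: create the namespace entry (an empty file).
            Files.createFile(dir.resolve(name));

            // Step 2: set the file size, the equivalent of
            //   touch ".(fset)(myfile)(size)(1048576)"
            Files.write(dir.resolve(".(fset)(" + name + ")(size)(1048576)"), new byte[0]);

            // Step 3, copying the data with encp, is outside this sketch.
        }
    }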

If the file-size is zero then the NFS client (the Linux kernel) …

The "cat" command is the wrong way to exercise the problem: cat is not equivalent to the fstat call that enstore makes.

Vijay will check what enstore does and forward it to Tigran.

Problem 2:

If I touch a file and then try to cat level-2, PNFS and Chimera behave differently.

In Chimera, a level doesn't exist until you write into it.

What to do with a stage request that points to a dead pool

Timur:

Long-outstanding issue. Observed in CDF: a pool that had a number of requests went down. This blocked those requests for some time.

What state were these requests in: WAITING, STAGING or SUSPENDED?

"Waiting for get pool" is a state of the door. There should be a corresponding entry on the "restore page". The state of these requests is STAGING.

Advice was to do "rc store retry" in the PoolManager, but this didn't help.

What was the status of this pool in the PoolManager? If the pool was "offline" then the PoolManager should retry the staging on a different pool.

CDF are still running dCache v1.7.

If a pool comes back up again, with files staged, then the files can be served from the pool.

What to do with the existing stage request?

We can redirect the request to the pool that came up.

Don't want to do an "rc retry". We want to check the …

Before it starts a new stage request it ...

If the file is not registered in the companion then we don't know.

More book-keeping: we drop the stage request only once the mover starts.

Just need to implement this; details to be discussed on the mailing list and Review Board.
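A very rough sketch of the bookkeeping idea as discussed (illustrative pseudologic only, not dCache's PoolManager classes): the pending stage request is dropped only once a mover has actually started serving the file from the pool that came back.

    /** Illustrative sketch only, not dCache code: drop a pending stage request
        only once a mover actually starts on the recovered pool. */
    public class StaleStageRequestSketch {

        enum State { STAGING, SERVING_FROM_POOL }

        static class StageRequest {
            final String pnfsId;
            State state = State.STAGING;
            StageRequest(String pnfsId) { this.pnfsId = pnfsId; }
        }

        /** Called when a pool that may hold a staged copy of the file comes back up. */
        void onPoolUp(StageRequest request, boolean fileRegisteredInCompanion,
                      boolean moverStarted) {
            if (!fileRegisteredInCompanion) {
                // Without a companion entry we cannot tell whether the pool
                // really has the file, so the stage request is left alone.
                return;
            }
            if (moverStarted) {
                // Only now is it safe to drop the pending stage request.
                request.state = State.SERVING_FROM_POOL;
            }
            // Otherwise: keep the stage request queued; it may still be needed.
        }
    }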

Problems with 64-bit PNFS

Charles Waldman has reported a problem with pnfsDump on their 64-bit PNFS instance. Initial investigation suggests that the problem isn't limited to pnfsDump but affects other PNFS tools (shmcom, for example). So, the suspicion is that something is slightly broken with PNFS.

Timur to ask …

It's a bit urgent because PIC are intending to move to 64-bit PNFS.

Create a new ticket with this information.

dcap deployment thing

Owen: we have good news and bad news. The good news is that the October release of dcap (1.xxx) has finally reached production. The bad news:

The 64-bit releases are multi-arch: they contain both the 64-bit and 32-bit files.

Two sets of RPMs: one for 32-bit and one for 64-bit. For the tar-balls, only the 32-bit version was picked.

Owen is investigating.

Refactoring patches on stable branches?

Paul: I believe we have a policy of not introducing new features in a stable branch, but I'm unclear what our policy is on refactoring code in a stable branch. Refactoring does not introduce any additional functionality, but it does touch the code. A (good) refactoring patch will make the code more maintainable, but touching the code introduces the risk that something breaks (if that code has insufficient unit tests).

Since we plan to support 1.9.5 for a long period, should we allow refactoring patches in 1.9.5?

Timur: doesn't think it's a good idea. Patrick: agrees with Timur. Owen: it depends. Gerd: preferably these patches don't go to the stable branches, as they are more disruptive; however, there may be exceptions.

Tigran: maybe we need a back-port of features?

The rule is to reduce the risk for the stable branches.

If Jon and others are suffering and we can justify it then changes can go into 1.9.5.

Should a domain be allowed to die?

Paul: we use a wrapper script to automatically restart a domain whenever it dies. Is this always the best behaviour? Some issues are not fixed by restarting but require manual intervention, e.g. an out-of-memory exception or poor configuration. Are more problems caused by a crash-restart cycle than by a domain crashing where a simple restart helps?

Gerd: short answer "no", always restart. Not restarting doesn't help with configuration.

The problem of log files filling up is "solved" by throttling.

Owen: legacy hack left from Patrick; any large site should have suitable monitoring to do the restart themselves.

Timur: add an auto-restart flag to the script, so sites have the option to disable the auto-restart.

Perhaps we can check how long the service was up; if too short then don't restart. Try exponential back-off for the sleep-time.

Gerd: a concrete suggestion for which auto-restart service to use: monit. It even understands protocols and runs locally.

Have a flag to disable it.

If we detect that the cells successfully started then restart, but if the cells fail to start then don't restart.

Another issue (Timur): we still expect people to have their own monitoring solution.

Agreed:

A flag to prevent restarts for people who don't want them. A minimum up-time before a restart.
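A minimal sketch of the agreed behaviour (the real wrapper is a shell script; the property name and launch command here are hypothetical): honour a disable flag, give up if the domain dies before a minimum up-time, and back off exponentially between restarts.

    /** Sketch only: restart policy with a disable flag, minimum up-time and
        exponential back-off. Property name and launch command are hypothetical. */
    public class RestartWrapperSketch {
        public static void main(String[] args) throws Exception {
            boolean autoRestart = Boolean.parseBoolean(
                    System.getProperty("dcache.autoRestart", "true"));  // hypothetical flag
            long minUptimeMillis = 60_000;   // domains dying faster than this are not restarted
            long backOffMillis = 1_000;      // initial sleep before a restart

            do {
                long started = System.currentTimeMillis();
                Process domain = new ProcessBuilder("java", "-jar", "domain.jar")  // illustrative
                        .inheritIO()
                        .start();
                domain.waitFor();
                long uptime = System.currentTimeMillis() - started;

                if (uptime < minUptimeMillis) {
                    // Died too quickly: probably a configuration problem, don't loop.
                    System.err.println("domain died after " + uptime + " ms; not restarting");
                    break;
                }
                Thread.sleep(backOffMillis);
                backOffMillis = Math.min(backOffMillis * 2, 300_000);  // capped exponential back-off
            } while (autoRestart);
        }
    }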

It would be nice to have How-Tos on integrating dCache with existing systems: Upstart, Solaris services, monit, etc.

Workshop

We should start work on the agenda for the workshop.

Timur will create an agenda page for the workshop in Fermi's Indico and send around the URI and the modification key.

We should try to focus the workshop on a few themes.

Outstanding RT Tickets

[This is an auto-generated item. Don't add items here directly]

RT 5168: Design flaw in how AL and RP is propagated

Dmitry to email Jon.

Drop ticket from discussion.

RT 5325: BerkeleyDB metadata repository import does not make files cached in volatile pool

The existing logic with LSF mode does not set the sticky bit correctly. Consider having a per-file volatile setting.

Gerd to create a ticket so this isn't forgotten.

Drop this ticket from discussion.

Review of RB requests

Move to new disk

The synchronisation between SVN and mercurial is currently slow. This affects check-in times.

Tigran is investigating, but so far hasn't found anything.

Slow check-out

Timur reported a problem last week with checking out. It was very slow.

DTNM

Proposed: same time, next week.