wiki:developers-meeting-20110302
Last modified on 03/02/11 18:36:06

[part of a series of meetings]

Participants

Agenda

[see box on the right-hand side]

Postcards

Up to two minutes (uninterrupted) per person where they can answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Tanja: tickets and working on reading over multiple-movers.

Tigran: NFS testing in Chicago, recovering. Only real bug is supporting NFS v4.0: Apple, .. Still fine with the current Linux client .. better understanding of the NFS spec. Better understanding of the parallel stuff, so we can implement it in dCache.

Gerd: deployed 1.9.12 on production system yesterday. Mostly only (a few bugs turned up again). Otherwise, everything is OK. Fixing liquibase.

Dmitry: ported SRM scalability to 1.9.5 and installed on CMS prod. Introduced a second machine for load-balancing and found that lcg-cp started to fail in this mode, because it does a reverse-IP mapping and complains. Work-around is to return the same number. Server code seems to be stable. Fixed a couple of bugs: creation of an index in space-manager (bug reported by IN2P3).

Antje: Improving our documentation.

Paul: Fixing migration and talking lots.

Christian: several EMI training sessions. Testbed training session was one. Prepared our testbed machines for this session along with some slides. Still trying to get ETICS to build. Preparing presentations for Goettingen. Testing and validation plan for EMI is needed this week.

Karsten: follow-up patches to VO-Rolemap patches. Preparation for Goettingen.

Plans for patch-releases

Should we make a new patch release?

Built 1.9.10-5 yesterday. Gerd will provide the release notes and we can put it on the web-page.

Next one is 1.9.5

1.9.12

We have branched.

Now on our test system. There was a problem with the plain upgrade, now fixed. Currently have six tests failing, but this might be a timing issue with space-manager not updating information from pools quickly enough. The tests run continuously, so the next cycle should pass.

What shall we do next?

See the error message in build.

ETICS problem may be that it has / as the home directory and passes HOME with the actual directory. This may confuse some tools that query the actual home directory.
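
A minimal sketch of the suspected mismatch, assuming a POSIX build host. Some tools take the home directory from the HOME environment variable, others from the account's passwd entry; if ETICS sets the two differently, they disagree. The function names and the example path are hypothetical, chosen only for illustration.

```python
import os
import pwd  # POSIX only


def home_directories(environ=None, uid=None):
    """Return the home directory as seen via HOME and via the passwd entry."""
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    return {
        "env": environ.get("HOME"),            # what shell-style tools see
        "passwd": pwd.getpwuid(uid).pw_dir,    # what passwd-based tools see
    }


def homes_disagree(env_home, passwd_home):
    """True if HOME is set and points somewhere other than the passwd entry."""
    if env_home is None:
        return False
    return os.path.normpath(env_home) != os.path.normpath(passwd_home)
```

Calling homes_disagree() on the two values returned by home_directories() inside an ETICS build would show whether this is the situation we are in.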

Tigran is away for a couple of weeks; Paul is away, too. What is our release plan?

15th March +/- a day or so.

1st April for official dCache.org release (1.9.12-1).

Certification is our state saying we've tested our software on our testbed.

There's then an additional testing phase in the EMI testbed, which is poorly defined at the moment.

When is this supposed to happen?

Does this happen before or after 1st April?

We are in RC1 phase. As soon as 91% build in the ETICS repository, these are deployed on the testbed. This should be handled by someone outside the dCache team, but in reality we will do the deployment. Once the testbed is deployed with EMI-1 RC1, there is a testing phase.

EMI-1 release timescale isn't well understood, at least by us.

No merges to 1.9.12 for the 14 days Tigran is away. So, bug-fixes will not go into 1.9.12 until 21st March (unless they're done in the next few days).

There are about 10 items.

Two issues with info-provider:

UNDEFINEDVALUE interfaces IP for the doors is published incorrectly (an internal IP published)

Issue with the pool and the Diagnostic Context. Tigran can't find a nice solution, but can probably hack around the problem.

Looking into the SRM, found a number of issues with 3rd party copies. But these are not believed to be new.

From history at NDGF, there have always been bugs discovered when upgrading to a major release.

Releasing 1st April should be fine; the only concern is the EMI release.

2nd March code-freeze.

4th March RC1 will be available.

RC-0 had roughly 90% success, but this dropped back down to roughly 70% due to changing the ETICS client. This is because the client doesn't packaging.

22nd April is the release date for EMI-1; 29th April is the announcement date.

S2 tests are currently. putDone is currently failing (availability test)

One problem could be the synchronous reply? This was off by default but, with 1.9.12, it is enabled by default.

There are two tests with the same name in different directories. With one of them, the problem is present both in the automated S2 tests and when run manually. The other test, with the same name but in a different directory, passes (when run manually).

Dmitry will try the S2 tests too and have a look at the problem.

Trunk activity

Progress with new features...

Currently we focus on 1.9.12, so don't anticipate any further work here.

Unclear how much of Karsten's future new plugins will go into 1.9.12. vorole plugin configured: testing existing plugins.

Will any sites upgrade 1.9.12 and

Only two sites said they would upgrade to 1.9.12. Whether they switch to gPlazma2 is unknown.

If CMS Tier-1 were to upgrade to 1.9.12 then many sites would consider upgrading.

Dmitry has talked to Catalin; he felt that they have a stable system and don't see a good reason to risk this for 1.9.12.

BNL are more receptive to upgrading to 1.9.12. From their talk at CHEP, they tried to solve some problems themselves: parallel databases, etc.

Tanja: multiple movers per transfer. As soon as we prove it works, it looks like the new xrootd client can stripe transfers across multiple pools.

Will it help CMS who are trying to parallelise their IO, fork()ing processes to read different parts of the data? No. It wouldn't help them.

Their approach is to try to cheat dCache's cost model.

Gerd has started to get rid of movers completely, switching to using a protocol engine instead. Remove everything about movers from the pool. A lot of complexity comes from sharing things between protocols. A pool could simply be a container of protocol engines. A pool exposes queues to the protocol engine(s). The cost module still gets the same information. A legacy implementation can provide the same behaviour as we have currently. Several layers of encapsulation to make things generic enough to satisfy all protocols.
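
A rough sketch of the idea, not the actual design: the pool is just a container of protocol engines, it owns one queue per engine, and the cost module reads the same active/queued numbers as before. All class, method and field names are hypothetical.

```python
from collections import deque


class ProtocolEngine:
    """Hypothetical engine: owns all protocol-specific transfer logic."""

    def __init__(self, protocol):
        self.protocol = protocol
        self.active = []

    def start(self, transfer):
        self.active.append(transfer)


class Pool:
    """Hypothetical pool: a container of engines plus one queue per engine."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.engines = {}
        self.queues = {}

    def add_engine(self, engine):
        self.engines[engine.protocol] = engine
        self.queues[engine.protocol] = deque()

    def submit(self, protocol, transfer):
        # Start immediately if the engine has a free slot, otherwise queue.
        engine = self.engines[protocol]
        if len(engine.active) < self.max_active:
            engine.start(transfer)
        else:
            self.queues[protocol].append(transfer)

    def cost_info(self):
        # The numbers a cost module would consume, per protocol.
        return {
            name: {"active": len(engine.active), "queued": len(self.queues[name])}
            for name, engine in self.engines.items()
        }
```

With a limit of one active transfer, submitting two xrootd transfers leaves one active and one queued, which is what cost_info() reports.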

Plan to review Brian's patches. One patch ~OK, one patch is ugly. Tigran to look into this when back from holiday. Paul to have a look too.

Scalable SRM: Dmitry is working on it.

Issues from yesterday's Tier-1 meeting

Need a gPlazma plugin to do DIGEST auth. You also need to extract the username+password or token in the WebDAV door.

Issues from EMI

We have to get the testing reports in by Friday 2011-03-04 and building in ETICS as soon as possible.

One issue that came up is with opening up firewalls. This is more a DESY issue since we currently host the testbed services.

EMI 'guidelines' are now 'policies'. Unclear what effect this will have.

Fedora guidelines for packaging. Our directory structure doesn't comply with these guidelines. Perhaps this is an issue for EMI-2, not for EMI-1. Problem isn't for us: we can be compliant; the problem comes for clients who upgrade their system.

Goettingen things

There is now a trac page for Goettingen workshop presentations.

Gerd: have started the major release notes .. hope to have this finished for Goettingen workshop.

JMS

Not much to gain for using JMS right now, but it's a stepping stone towards a better, brighter dCache world.

Gerd to send Christian a picture; Christian to tidy this up.

New version of ActiveMQ and resolution of well-known names.

No negative reply: the only reply is a timeout. Annoying.

Due to this broadcasting, with low rate, sometimes well-known cells don't resolve. We retry inside dCache, so the failed

Upgrade to ActiveMQ 5.4.x releases. When the broker is restarted, some cells don't resolve. Have to restart to recover.

Fixing this: two approaches. The first: introduce a central service that does well-known cell resolution. The other approach: create a queue for each well-known cell. The issue was that doors are created for each transfer and ActiveMQ didn't garbage-collect queues. Since 5.4.x, ActiveMQ GCs queues, so this may be an option.

Maybe first approach could be done and maintain compatibility, which would be needed for 1.9.12.
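
A minimal sketch of the first approach, assuming a single in-memory registry. The point is that an unknown name gets an immediate negative answer rather than a timeout. The class name and the example cell addresses are illustrative only, not the dCache or ActiveMQ API.

```python
class CellRegistry:
    """Hypothetical central service mapping well-known cell names to addresses."""

    def __init__(self):
        self._cells = {}

    def register(self, name, address):
        # Called when a well-known cell starts (or re-announces itself).
        self._cells[name] = address

    def unregister(self, name):
        # Called when a well-known cell shuts down; unknown names are ignored.
        self._cells.pop(name, None)

    def resolve(self, name):
        # Returns the address, or None as an explicit negative reply.
        return self._cells.get(name)
```

A caller that gets None back knows straight away that the cell is not registered, instead of waiting for a broadcast to time out and retrying.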

(If you pin file and then repin with a different pin-time; debug tomorrow).

pin manager

From a user perspective, this shouldn't be very noticeable. This should have the same functionality as the old one.

Outstanding RT Tickets

[This is an auto-generated item. Don't add items here directly]

GGUS tickets

For more than half a year, Gerd has been on a team of people who deal with GGUS tickets before they hit dCache.org.

If Jens, as 1st line support, can't decide whether a problem is a configuration or a code problem, then he asks Gerd as "dCache resolver": the expert that decides if the problem is code-related or not. The team is the DMSU ("Deployed Middleware Support Unit").

People tend to bypass this process and assign tickets directly to dCache.org.

Tickets should go through EMI support process as user errors and configuration problems shouldn't hit dCache.org.

Jens is supposed to do the day-to-day stuff. Gerd only gets involved if it's unclear what to do.

Review of RB requests

DTNM

Same time, next week.