wiki:developers-meeting-20100119
Last modified on 01/20/10 14:42:17

dCache Tier I meeting January 19, 2010

[part of a series of meetings]

Present

dCache.org(Patrick, Owen, Tigran, Paul, Gerd), Triumf(Simon), NDGF(Gerd), GridKa(Silke/Doris?), CERN(Andrea)

Agenda

(see box to the right)

Site reports

NDGF

Gerd reported that NDGF have not yet suffered a recurrence of the CA certificate problem. The problem is that, after running for some time, the SRM no longer accepts certificates issued by some specific CA. It is believed that the problem occurs when the CA certificate and/or CRLs are re-read whilst a request from a user with a certificate from that CA is being processed. A similar problem with GSI-dcap was discovered and fixed, but the problem appears to persist with the SRM. It is hoped that running the SRM inside Jetty (as NDGF do) will allow additional logging that Tomcat appears to hide, so allowing the root cause to be discovered.
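As an illustration only of the general class of problem suspected here (a request being checked while trust data is re-read), the following minimal Python sketch contrasts an in-place reload with a snapshot swap. It is not dCache or SRM code: the class names, the `mid_reload` hook and "ExampleCA" are all hypothetical, and it does not reproduce the persistent failure described above, whose root cause was still unknown at the time of the meeting.

```python
class InPlaceTrustStore:
    """Hypothetical store that empties and refills its CA set in place."""

    def __init__(self, cas):
        self.cas = set(cas)

    def is_trusted(self, ca):
        return ca in self.cas

    def reload(self, cas, mid_reload=lambda: None):
        self.cas.clear()
        mid_reload()  # stands in for a request arriving mid-reload
        self.cas.update(cas)


class SnapshotTrustStore:
    """Hypothetical store that builds a new set, then swaps the reference."""

    def __init__(self, cas):
        self._cas = frozenset(cas)

    def is_trusted(self, ca):
        return ca in self._cas

    def reload(self, cas, mid_reload=lambda: None):
        fresh = frozenset(cas)
        mid_reload()  # a request arriving now still sees the old snapshot
        self._cas = fresh


def check_during_reload(store, ca, cas):
    """Return what a request for `ca` would see if it arrived mid-reload."""
    seen = []
    store.reload(cas, mid_reload=lambda: seen.append(store.is_trusted(ca)))
    return seen[0]


in_place = InPlaceTrustStore(["ExampleCA"])
snapshot = SnapshotTrustStore(["ExampleCA"])
rejected_mid_reload = not check_during_reload(in_place, "ExampleCA", ["ExampleCA"])
accepted_mid_reload = check_during_reload(snapshot, "ExampleCA", ["ExampleCA"])
```

In this sketch the in-place store rejects the CA only for the duration of the reload; whether anything comparable happens inside the SRM is exactly what the additional logging is hoped to reveal.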

Gerd also reported that a deadlock was found in the SRM (present only in the 1.9.6 series of the code). It is now fixed.

NDGF are still seeing some issues transferring data between FZK and themselves. Doris and Gerd are already in discussion over this issue. After some changes to FZK's networking, the problem was believed to be fixed, but it seems to persist.

The symptom of the problem is that FTS logs an error coming from the GridFTP door at FZK: "connection failed connection timed-out". Gerd believes the problem is that an FZK pool is trying to connect to an NDGF pool, but the connection fails because the FZK pool doesn't receive the reply from NDGF, and so times out.

Doris reported that the FZK networking people have changed some routing, which should have fixed this issue.

Gerd believes the problem persists: in the past 24 hours, 2,500 transfers from FZK have failed. Asked for which VO these transfers are failing, the answer was ATLAS.

Doris and Gerd will continue to investigate this issue over Jabber.

FZK

Doris reported that FZK don't have any problems.

They are currently investigating the Chimera migration process and planning out the procedure to ensure it runs smoothly.

Triumf

Simon reported that Triumf upgraded their dCache instance to 1.9.5-11 last Wednesday. The upgrade went smoothly.

They noticed a problem affecting one user (or group); restarting the SRM fixed the problem. Two weeks ago they had a similar problem, which involved a space token.

Triumf also experienced a few hours of SAM tests failing, but production users were unaffected and continued to work.

Paul suggested that this may be the same issue that Gerd and others have reported: that sometimes a CA certificate is no longer considered valid by the SRM. Simon offered to send a thread dump when the problem reoccurs. Paul said this was fine and would allow us to check the cause but, if it is the SRM CA-certificate issue, the thread dump would be of little use in identifying the root cause since, by the time the dump is taken, the CA certificate will already have been marked invalid.

Simon also mentioned that Triumf continue to prepare for a possible Chimera migration. They intend to decide next week whether to move to Chimera.

PIC

Gerard reports via email:

At PIC we've no issues and we're running 1.9.5-11.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.