Last modified on 05/26/09 16:48:58

dCache Tier I meeting May 26, 2009

Present

dCache.org(), IN2P3(), SARA(), Triumf(Simon), BNL(Pedro), NDGF(), PIC(Marc), GridKa(Silke, Doris), Fermi(), CERN()

Apologies

SARA(Ron)

Agenda

  • Site reports,
  • DTNM.

Site reports

BNL

Everything is OK.

We are still having problems with PNFS performance, which is impacting SRM. This is understood, and there isn't much that can be done to fix the problem.

We're having a look at Chimera and evaluating its performance. We're also looking at the migration process and will be having a meeting with the dCache team tomorrow (Wednesday 27th May) about this.

BNL are also in the process of upgrading the pools to 1.9.1. This is to fix a bug with the sweeper select (LRU).

FZK

FZK reported that they don't have any operational issues.

They just had a configuration problem and opened a ticket. FZK were investigating how the system behaves when a staging pool is unavailable. The behavior seems wrong and FZK have opened a ticket. Irina is investigating this; feedback will be via the ticket.

Triumf

Triumf don't have any operational issues.

Triumf have moved pools to using gridftp2. This is working OK so far.

Simon mentioned finding that some dcap transfers hang; the door is in state "waiting for door transfer OK". Some doors have been in that state for 1-2 days. The transfer speed is zero: no bytes are flowing.

Tigran asked: is there still a client connected? If the door is serving someone executing dccp then the above symptoms would definitely indicate a problem. If the door is for clients running some libdcap-linked application then the door may hang around until the application terminates.

SARA

Ron reports via email:

Sorry, I am not able to attend the meeting. But currently we have three issues:

  1. The gsidcap doors run into memory problems (GC overhead limit exceeded) when they are moderately used (up to 44 logins per door).

Ticket 4527. Gerd is looking into this.

I have increased the heap from 2GB to 8GB but that did not help. In the log there are a lot of messages like:

26 May 2009 12:25:56 (DCap-gsi-bee22) [] Initializing CA certificate store from directory: /etc/grid-security/certificates

  2. We see numerous SQLExceptions in our SRM log. Messages like:

2009-05-22 04:02:18.958 (SrmSpaceManager) [v2:srmPutDone:83418722 SRM-srm CancelUse]
dmg.cells.nucleus.CellNucleus.log(CellNucleus.java:847) ERROR  - cancelUseSpace for path /pnfs/grid.sara.nl/data/atlas/atlasmcdisk/mc08/log/mc08.106023.PythiaWhadtaunu.simul.log.e347_s462_tid044924/log.044924._30000.job.log.tgz.2 failed with java.sql.SQLException: Multiple records found, disallowed

This is ticket 4534. Dmitry will look at it.

  3. Users sometimes have difficulties getting files from our SRM. In the SRM log I see messages like:

 2009-05-23 04:35:11.010 (SRM-srm) []
    dmg.cells.nucleus.CellNucleus.log(CellNucleus.java:847) ERROR  -
    srm: creating a failed request status with a message:
    getRequestStatus(): request #-2092754132 does not belong to user
    AR:-7102685611343022596 atlas /O=dutchgrid/O=users/O=nikhef/CN=Kors
    Bos 1900 read-write 0 / / < 1 groupLists :
    GL:/atlas/Role=production 1 groups : [1213,]; >

This is ticket 4535.
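
When judging how often messages like the ones quoted above occur, it can help to tally log entries per cell. The following is only a minimal Python sketch: the line pattern is inferred from the three excerpts in this report and is an assumption, not a documented dCache log format, and the function name `count_entries_by_cell` is hypothetical.

```python
import re
from collections import Counter

# Matches the two timestamp styles seen in the excerpts above, followed by
# the cell name in parentheses. This pattern is inferred from those excerpts
# only; it is an assumption, not a documented dCache log format.
LINE_START = re.compile(
    r"^\s*(?:\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"
    r"|\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2})"
    r" \((?P<cell>[^)]+)\)"
)


def count_entries_by_cell(lines):
    """Count log entries per cell name; continuation lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_START.match(line)
        if match:
            counts[match.group("cell")] += 1
    return counts


# Sample lines copied from the excerpts in this report.
sample = [
    "26 May 2009 12:25:56 (DCap-gsi-bee22) [] Initializing CA certificate "
    "store from directory: /etc/grid-security/certificates",
    "2009-05-22 04:02:18.958 (SrmSpaceManager) "
    "[v2:srmPutDone:83418722 SRM-srm CancelUse]",
    "dmg.cells.nucleus.CellNucleus.log(CellNucleus.java:847) ERROR  -",
    " 2009-05-23 04:35:11.010 (SRM-srm) []",
]

counts = count_entries_by_cell(sample)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

For a real log, the same function can be given an open file object instead of the sample list, since it only iterates over lines.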

PIC

(report via email due to connection problems) Nothing to report

DTNM

Some time next week.