Last modified on 01/12/10 14:19:15

dCache Tier I meeting January 5, 2010

[part of a series of meetings]

Present

dCache.org(Tigran, Patrick, Owen, Tanja, Paul, Gerd, Timur), IN2P3(), Sara(Onno), Triumf(), BNL(), NDGF(Gerd), PIC(Gerard), GridKa(), Fermi(Timur), CERN()

Agenda

(see box on the other side)

Site reports

PIC

Gerard reported that they have seen one or two issues over the Christmas period.

The first issue was with the PinManager. They noticed that some files were staying in the PINNING state for a long time. The second occurrence seemed to correlate with a networking issue. The problem hasn't come back.

NDGF

Gerd reported that NDGF was looking pretty good over the holiday period.

There were a couple of issues they found:

First, SRM suffered a partial failure. The issue is that the SRM would refuse all requests from users whose certificates came from one or more specific CAs: the SRM no longer considered those CAs' certificates to be valid. This is believed to be similar to an existing problem with the GridFTP door.

NDGF are planning to upgrade to 1.9.6 this coming Thursday. With this upgrade, they hope that the problem will get better.

Next week, NDGF are looking to deploy a Jetty-based version of SRM. Jetty can run web servlets, so within the dCache architecture it is a replacement for Tomcat.

The hope is that Jetty will provide more diagnostic information, allowing a better understanding of these CA problems.

Sara

Quite good over Christmas.

Onno reported that they have now upgraded the door nodes to 1.9.5-11.

We have one issue: for some transfers, a connection is attempted on a TCP port that is not in the defined range. Did we miss a configuration statement somewhere? Ron has posted a message on the user-forum about this.

Are you sure this is from an FTP mover? This wasn't clear from Ron's email.

Do you see the problem on the door or on the pool? It's seen on the door.

This concerns the client port range attribute, which Sara have configured as 20000--25000.

The suggestion was to try enabling debug-level logging within the door. This should report the port-range the door is using.

Fermi

Timur reported that US-CMS upgraded to 1.9.5-10 shortly before Christmas. There was one major issue: pools would not start up if there were any corrupted SI files. The issue is understood and fixed in the code; the next release won't suffer from this problem.

US-CMS will likely have one more upgrade slot before the end of the month.

Timur reported that they ran into some other issues. For them, the most serious was that staging (via the PinManager's admin interface) was broken. This was an issue because US-CMS were relying on this to achieve pinning.

Support tickets for discussion

[Items are added here automagically]

RT 5388: dCache 1.9.5.-11 pool OutOfMemoryError (Direct buffer) with dcap transfers

Onno asked how many movers per machine people have configured.

Tigran reported that, at DESY, we have about 300--500 dcap movers for random access clients analysing data. For FTP, people generally configure fewer movers, since each connection can have multiple streams (10, 20, etc.): 10--20 would be a reasonable number.

Each mover is a thread and takes some memory: a 128 kB buffer per mover.
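
As a rough illustration of the buffer memory this implies, the sketch below multiplies the mover count by the per-mover buffer size. It assumes that "128 kB" means 128 KiB and it ignores thread stacks and any other per-mover overhead; the class and method names are made up for this note and are not part of dCache.

```java
public class MoverBufferEstimate {
    // Assumption: the "128 kB" buffer quoted in the meeting is 128 KiB.
    static final long BUFFER_BYTES_PER_MOVER = 128L * 1024;

    /** Total buffer memory, in bytes, for the given number of movers. */
    static long totalBufferBytes(int movers) {
        return movers * BUFFER_BYTES_PER_MOVER;
    }

    /** The same figure expressed in MiB. */
    static double totalBufferMiB(int movers) {
        return totalBufferBytes(movers) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Mover counts taken from the discussion: 300--500 dcap movers at DESY
        // and the ~1,000 movers PIC would like to support.
        for (int movers : new int[] {300, 500, 1000}) {
            System.out.printf("%d movers -> %.1f MiB of mover buffers%n",
                    movers, totalBufferMiB(movers));
        }
    }
}
```

Under these assumptions, 500 movers need about 62.5 MiB and 1,000 movers about 125 MiB for buffers alone.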

There is an issue with NIO direct byte buffers and the JVM garbage collector. A direct byte buffer lives outside the JVM's heap memory, with only a small proxy object inside the JVM's heap. The garbage collector doesn't notice the space taken up by the direct byte buffer, so it doesn't garbage-collect the buffer.
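
The split between heap and direct memory can be observed from inside a JVM. The sketch below is illustrative only and is not dCache code: it allocates some 128 KiB direct buffers (the count and size are assumptions) and reads the JVM's "direct" buffer pool through `BufferPoolMXBean`. That is a standard API, but it exists only from Java 7 onwards, so it would not run on older JVMs.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferDemo {
    // Assumption: one 128 KiB buffer per mover, as quoted in the meeting.
    static final int BUFFER_BYTES = 128 * 1024;

    /** Bytes currently held by direct buffers, or -1 if the pool is not reported. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    /** Allocates the given number of direct buffers and keeps them reachable. */
    static List<ByteBuffer> allocateDirectBuffers(int count) {
        List<ByteBuffer> buffers = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            buffers.add(ByteBuffer.allocateDirect(BUFFER_BYTES));
        }
        return buffers;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long directBefore = directMemoryUsed();

        // 100 buffers is an arbitrary, small number chosen for the demo.
        List<ByteBuffer> buffers = allocateDirectBuffers(100);

        long heapAfter = rt.totalMemory() - rt.freeMemory();
        long directAfter = directMemoryUsed();

        System.out.println("buffers held:          " + buffers.size());
        System.out.println("heap growth (bytes):   " + (heapAfter - heapBefore));
        System.out.println("direct growth (bytes): " + (directAfter - directBefore));
    }
}
```

The heap figure is only approximate, since the garbage collector may run between the two readings; the point is that the bulk of the allocation shows up in the direct pool and not in the heap.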

Yet another issue is that each uses some of the available address space.

Tigran relayed Sun's general advice: if you use less than 4 GiB of memory, it is better to switch off 64-bit mode. The JVM itself is 64-bit, but the heap available depends on which mode is selected.
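
To see which situation a given JVM is in, something like the following sketch could be run with the same options as the pool. It is not dCache code and the class name is invented for this note. The `sun.arch.data.model` system property is specific to Sun/HotSpot-style JVMs and may be absent elsewhere, in which case the sketch reports "unknown".

```java
public class JvmMemoryReport {
    static final long FOUR_GIB = 4L * 1024 * 1024 * 1024;

    /** True if the given heap limit is below the 4 GiB threshold from Sun's advice. */
    static boolean belowFourGiB(long maxHeapBytes) {
        return maxHeapBytes < FOUR_GIB;
    }

    /** One-line summary of the JVM's data model and maximum heap size. */
    static String describe() {
        long maxHeap = Runtime.getRuntime().maxMemory();
        // HotSpot-specific property; other JVMs may not define it.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        return "data model: " + dataModel
                + ", max heap: " + (maxHeap / (1024 * 1024)) + " MiB"
                + ", below 4 GiB: " + belowFourGiB(maxHeap);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```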

Guidelines on how to set the JVM memory sizes would be useful.

JVisualVM (which comes with the JVM) shows the amount of memory used by Java and by the DirectByteBuffers.

At DESY, we have pools with a maximum of 256 GiB. 40 TiB

Gerard reported that, at PIC, they have one server (a single pool with 120 TiB). They want this to support ~1,000 movers. It has 16 cores.

This should be fine. It may be necessary to tune some of the GC parameters; there are tonnes of parameters for the GC.

What kind of hardware? Sun HP blades; Intel CPUs (8 cores in two physical packages); FC connection to shared 2 PiB storage. Could you send Tigran the "uname -a" output?

There was an issue with newer Broadcom NICs, where the OpenSolaris drivers had problems with hung transfers. This has been fixed now, and the tests look good: 3 Gib/s without jumbo frames, ~9 Gib/s with jumbo frames.

GSI dcap

Are you still observing the problem with gsi-dcap? We think we may have seen the problem at DESY, which will help in solving the issue.

Yes.

The problem could be related to the CA issue. This is a race condition between JGlobus updating certificates and a client attempting to authenticate itself using that CA. If a client attempts to authenticate itself while the CA certificate is being updated, then that CA is considered corrupt and is disabled. This problem has been fixed, as of 1.9.5-11.

Reports that a user managed to use the same proxy-cert successfully may be explained by their having used a different door.

Are you sure that the CA certificates are the same on all (gsi-dcap) doors? Onno: I'll check.

With the dCache release 1.9.5-10, the developers rolled back the version of jglobus to v1.4; however, we continued to see the problem.

At NDGF, it was the FTP door where we used to see the problem; now NDGF only see occasional problems with the SRM.

Is the error message "Short read" present on all doors? It was found on many doors, if not all.

We'll investigate further and get back in touch.

DTNM

Proposed: same time, next week.