
dCache Tier I meeting December 1, 2009

[part of a series of meetings]

Present

dCache.org(Tigran, Irina, Paul), IN2P3(), Sara(Ron, Onno), Triumf(Simon), BNL(), NDGF(), PIC(Gerard), GridKa(Silke, Doris), Fermi(Timur), CERN(Andrea)

Agenda

(see box on the other side)

Site reports

BNL

Pedro expanded these issues and discussed a few additional ones:

BNL have noticed that, on their doors, there are roughly six times as many connections on the internal interface as on the external interface. They are using only a single stream, so they expect only a single connection to the pools. While this isn't causing any noticeable problem, the large number of connections is undesirable.

Pedro also asked about the time-scale for having a fix for the PinManager (see description above). Paul didn't know the time-scale, but would ask at the next developers meeting.

Pedro also reported that BNL now have a switch that supports SSL hardware acceleration. It was unclear precisely what this switch would do, but since GSI is incompatible with SSL it is unlikely to help. Should secure protocols (e.g., SRM) switch to using SSL, the switch would certainly require additional code in dCache to support it.

Fermi

Jon reported via email:

Nothing to report.

PIC

Gerard reported that everything is running OK and that they have no issues.

FZK

Doris reported that they have upgraded their ATLAS instance to 1.9.5-9 and that most of the pools are now running 1.9.5-4. There was an issue with dcap (movers on pools?) becoming stuck due to insufficient available ports, but this has now been fixed.

An issue was noticed after the upgrade where ATLAS transfers were failing, despite FZK's in-house testing showing the newly deployed dCache instance to be OK. One of the symptoms was that SRM ping did not work.

The dCacheSetup file for that node was updated at the same time as the upgrade. This was part of a harmonisation effort to have a common dCacheSetup for all dCache nodes.

The solution was to downgrade the SRM node to the previous version, 1.9.5-4. This downgrade also reverted the dCacheSetup file to its previous values.

The ATLAS instance is now working fine.

Sara

Ron reported that everything is fine at Sara; there are no significant issues.

Migration to Chimera

Ron also reported that they are currently preparing to migrate their namespace to Chimera. The current work is on migrating their test instance; after a few attempts, everything went through OK.

Ron asked whether the example migration process (migrating each VO in turn) is recommended or whether a single pnfsDump run would work. Tigran and Paul replied that either is fine. A site can migrate a subset of the tree or the whole tree. The example (VO-by-VO migration) comes from a DESY-specific migration use-case where it was desired to migrate only a subset of all VOs.

The planned migration will take place during the second week of January: Monday 4th January to Friday 8th January.

Ron reported that they are planning to migrate the PNFS instance to some "very fast hardware" (with 8 cores) and to do the migration from there. The plan is also to split the md5sum check over multiple processes, one for each core, to improve the throughput of the checks.
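As a rough sketch of the parallel checksum idea (illustrative only: the input format, the command-line argument and the use of md5 hex digests are assumptions, not SARA's actual tooling), one worker process per core could verify the files like this:

{{{
#!/usr/bin/env python
# Illustrative sketch: verify md5 checksums of pool files using one worker
# process per CPU core, roughly as described above.  The input format (one
# "path expected-md5" pair per line) is an assumption.

import hashlib
import multiprocessing
import sys

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def check(entry):
    """Return (path, ok) for a single (path, expected-md5) entry."""
    path, expected = entry
    try:
        return path, md5_of_file(path) == expected
    except (IOError, OSError):
        return path, False

def main(list_file):
    entries = []
    with open(list_file) as f:
        for line in f:
            if line.strip():
                path, expected = line.split()
                entries.append((path, expected))

    # One worker process per core, as suggested in the meeting.
    workers = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    for path, ok in workers.imap_unordered(check, entries):
        if not ok:
            print('MISMATCH: %s' % path)

if __name__ == '__main__':
    main(sys.argv[1])
}}}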

Tigran asked whether Ron knew roughly how many files are in the system. Ron thought around 14--15 million.

Supporting multiple roles over GSI-dcap

Tigran reported that he has spent some time investigating the issue. The problem is two-fold: the GSI transport doesn't send the multiple records (group+role) to the dcap door, and the door doesn't do anything with the (currently absent) multiple records.

The first problem has been fixed but the second part will be harder to fix.

Tigran will investigate this issue further, but the solution will likely take longer to implement than was initially thought. He'll be in touch when there is more information.

Triumf

Simon reported that dCache has worked well during the last week.

There were a couple of issues he wished to report.

PNFS performance

First, they have noticed problems caused by poor PNFS performance.

Initially, jobs were processing 250 events, but this number has since increased, resulting in greater IO demand on dCache.

They are looking at improving throughput of PNFS by obtaining additional memory for the PNFS / PnfsManager node.

kpwd file broken

Triumf are using a kpwd file to configure their authentication. They noticed yesterday that this file was broken. dCache continued to run OK, but SAM test DNs were mapped to the wrong group. This took some time to fix.

It's not clear what caused the kpwd file corruption; investigation is on-going.

Anything new on the SRM COPY problem

Andrea wanted to know what progress has been made in solving the IN2P3 issue with SRM-COPY. It seems that they are suffering from a problem where all SRM COPY transfers from other sites into IN2P3 are failing. (This is ticket 5285).

Timur reported that the problem is currently not completely understood.

The initial investigation discovered a bug where SRM URIs that do not include the port number (8443) are considered remote. Because of this, SRM COPY commands that omit the port number will fail. A fix has already been developed and should be part of the next release. Until then, it is recommended to always include the port number in the SRM URI.
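Purely as an illustration (the helper below is hypothetical and not part of dCache or any SRM client), making sure the port is explicit before issuing an SRM COPY could look like this:

{{{
# Hypothetical helper: ensure an SRM URI carries an explicit port,
# adding the conventional 8443 when it is missing.

try:
    from urllib.parse import urlsplit, urlunsplit   # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit       # Python 2

def with_explicit_port(uri, default_port=8443):
    parts = urlsplit(uri)
    if parts.port is not None:
        return uri                                   # already explicit
    netloc = '%s:%d' % (parts.hostname, default_port)
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

# srm://srm.example.org/pnfs/example.org/data/file
#   becomes
# srm://srm.example.org:8443/pnfs/example.org/data/file
print(with_explicit_port('srm://srm.example.org/pnfs/example.org/data/file'))
}}}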

However, when the port number is included in the URI, a different problem appears: the SRM reports that it doesn't understand the httpg protocol. Work is underway to discover why this could be.

A similar problem was reported by another site and was fixed by reinstalling dCache on the SRM node. The suspicion is that a broken configuration option was responsible. This information may help IN2P3.

Update

As of 2009-12-03 14:43, the problem is fixed.

Pool file-system integrity

Simon asked how the pool tests its file system and how it responds to problems.

Tigran reported that the pool will try to create and remove a file every 30 seconds. This is the OK file in the repository directory.

As the pool attempts to delete the file after creating it, one should never see this file under normal circumstances.

If there is an IO error when creating or removing this file, the pool will disable itself. This is done, rather than making the pool "read-only", because the cause of the problem is unknown: it may be something that affects existing files, so marking the pool read-only may be insufficient to protect data integrity.

Tigran also reported that some sites deploy pools on network-attached filesystems. With NFS-based systems, creating an empty file and immediately deleting it might test nothing: the NFS client may defer the file creation until the first write; if the file is deleted before that happens, the server may never be aware that the file was created.

Future versions of dCache may also write a small amount of data to counter this client-side caching problem.
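A minimal sketch of the probe described above (illustrative only: dCache's pool code is Java, and the probe file location, interval and disable action here are assumptions) might look like this, including a small write to defeat the client-side caching just mentioned:

{{{
# Illustrative sketch of a pool's periodic file-system probe.  This is not
# dCache's actual implementation; the file name, interval and disable
# action are assumptions.

import os
import time

PROBE_INTERVAL = 30                   # seconds, as mentioned in the meeting
PROBE_FILE = '/pool/repository/OK'    # assumed location of the probe file

def probe_once(path):
    """Create the probe file, write a little data, flush it and remove it."""
    with open(path, 'wb') as f:
        # Writing (and syncing) some data defeats NFS clients that defer
        # file creation on the server until the first write.
        f.write(b'x')
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)

def disable_pool(reason):
    """Stand-in for disabling the pool; the real pool takes itself offline."""
    print('disabling pool: %s' % reason)

def run():
    while True:
        try:
            probe_once(PROBE_FILE)
        except (IOError, OSError) as e:
            disable_pool(str(e))
            break
        time.sleep(PROBE_INTERVAL)
}}}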

Support tickets for discussion

[Items are added here automagically]

RT 5181: Problem with Xrootd door at BNL

BNL are currently running dCache v1.9.4-3 on their xrootd doors, but ran v1.9.0-10 when the ticket was reported.

This ticket appears to have been submitted by BNL local users without consulting their local site support. Pedro indicated that (??) was the correct contact person, and will take charge of the ticket.

DTNM

Same time, next week.