
dCache Tier I meeting November 2, 2010

[part of a series of meetings]

Present

dCache.org (Tigran, Paul, Antje, Tanja), IN2P3 (Yvan), PIC (Gerard), GridKa (Silke)

Agenda

(see box on the other side)

Site reports

PIC

Gerard reported that everything is, more or less, OK.

They've upgraded to the latest patch release of dCache 1.9.5 (version 1.9.5-23) and will come out of their site downtime in a few minutes.

Gerard mentioned that they are still suffering from high pool-to-pool load when pools start up. Once a machine is under normal load, this isn't a problem.

Tigran asked whether Gerard could confirm that the xrootd problem is now fixed in 1.9.5-23. Gerard said he couldn't just yet; the users who reported the problem will check it in a few days.

GridKa

Silke reported that things at GridKa are fine.

They are having a minor issue with LHCb files: files have been written into dCache that should be in a space but are not.

Tigran thought dCache may need an administrative command to fix this kind of problem: one that adds existing files to a space.
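
No such command existed at the time; purely to illustrate the idea discussed, a session in the admin shell might look like the following (the cell name is real, but the command and its syntax are hypothetical, not part of dCache):

    (local) admin > cd SrmSpaceManager
    (SrmSpaceManager) admin > add file to space <space-token> <pnfs-id>   # hypothetical command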

However, we should understand the underlying cause and fix that.

Paul asked whether the files were originally written into a space. Yes, they were.

Silke also mentioned that this is similar to a problem previously reported with ATLAS files. She'll try to find the ticket number.

Paul asked if Silke had a recovery procedure. Silke was going to obtain a list of files from LHCb, identify those that aren't in the token, delete them, and ask LHCb to copy the files back into dCache.
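
As a minimal sketch of the comparison step (the input file names are hypothetical: one list obtained from LHCb, one list of the files dCache reports as belonging to the token):

    # Print files LHCb expects that are not accounted in the space token.
    # lhcb_files.txt and token_files.txt are hypothetical input names.
    with open("lhcb_files.txt") as f:
        lhcb = {line.strip() for line in f if line.strip()}
    with open("token_files.txt") as f:
        in_token = {line.strip() for line in f if line.strip()}

    # These are the candidates for deletion and re-transfer.
    for path in sorted(lhcb - in_token):
        print(path)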

Tigran and Paul felt that it should be possible to fix the problem without asking LHCb to (re-)copy files. They asked Silke to open a ticket, describing the situation, so we can propose a suitable recovery procedure.

IN2P3

Yvan reported that the previously reported problem with Solaris persists. The downgrade of Solaris to the previous version was initially thought to fix the issue; however, it now seems that this is not the case.

Yvan reported that IN2P3 are using dCache v1.9.5-22 on their head nodes and pools. They are using Solaris 10 (see RT ticket 5906, http://rt.dcache.org/Ticket/Display.html?id=5906, for more details).

He noted that some pools are heavily loaded while others are not: there are 12 ATLAS pools but only 7 pools are having problems.

Yvan mentioned that he performed iperf tests on the network; the bandwidth seems to be fine.
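
For reference, a basic iperf throughput test between two nodes looks like this (the host name and options are illustrative):

    # On the pool node:
    iperf -s
    # On a remote host, run a 30-second test with 4 parallel streams:
    iperf -c pool-node.example.org -t 30 -P 4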

Yvan asked some questions:

  1. Could the problem be due to the PoolManager PSU (pool selection unit)?

It's possible, but if the dCache configuration hasn't changed then this is unlikely.

  2. Could the problem be due to sharing bandwidth between import and export pools?

Paul felt that was unlikely. These pools are often separated, but not because of network bandwidth issues.

  3. There's more traffic on some pools since they're also running doors. Could this be a problem?

Fermilab use a similar deployment, although they limit the number of concurrent logins to three. If the doors are limited then this shouldn't be a problem.
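
As a sketch of such a limit in a 1.9.5-era configuration (the parameter name should be checked against the documentation for the deployed release; the value is illustrative):

    # In config/dCacheSetup on the door node:
    gsiftpMaxLogin=100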

Tigran suggested some further avenues for investigation:

  • Do you observe high IO-wait on the machines: could the problem be due to slow disks? Yvan wasn't sure.
  • Are the problems more with GridFTP "PUT" or "GET" operations? Both are present, but there seemed to be more GETs.
  • Can you run IOZone on the machines to check that the disks are working correctly and providing sufficient throughput? (See the example commands after this list.)
  • Have the end-users changed their usage patterns recently, e.g. accessing different files or being more active? Yvan wasn't sure.
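
For the disk checks, a minimal starting point on a Solaris 10 pool node might be the following (the test-file path and sizes are illustrative; the test file should be well over the machine's RAM size so caching doesn't mask disk performance):

    # Watch per-device utilisation and wait times every 5 seconds (Solaris):
    iostat -xn 5

    # IOZone sequential write (-i 0) and read (-i 1) test, 8 GB file, 1 MB records:
    iozone -i 0 -i 1 -s 8g -r 1m -f /pool/data/iozone.tmp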

It was agreed that Tigran would send Yvan his SSH public key and the IP address of his machine. Yvan would create an account to allow Tigran to investigate further.
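
The usual exchange for this kind of access looks like the following (the key file name and account name are illustrative):

    # Tigran: generate a key pair and send the .pub file (never the private key):
    ssh-keygen -t rsa -f ~/.ssh/id_rsa_dcache
    # Yvan: append the received public key to the new account's authorized keys:
    cat id_rsa_dcache.pub >> ~dcache-debug/.ssh/authorized_keys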

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.