wiki:developers-meeting-20090609
Last modified on 06/15/09 15:00:05

dCache Tier I meeting June 9, 2009

Present

dCache.org (Owen, Paul), IN2P3 (), SARA (Ono), Triumf (), BNL (), NDGF (Mattias), PIC (Gerard), GridKa (Doris), Fermi (Jon), CERN ()

Apologies

BNL (too busy), Fermi-dCache-team (problem with conference system), Triumf (problem with conference system)

Agenda

  • Site reports,
  • Step09,
  • Break-even experiences,
  • Sites enabling GridFTPv2,
  • DTNM.

Site reports

NDGF

Things are going OK.

We have been spending effort readjusting poolgroups to maximise throughput.

We've also been using the migration module to create cached copies of files held on pools prior to scheduled downtime. This allows users to continue using the files and not see the effect of the pools being unavailable.

We currently observe a maximum throughput of 200-300 MBytes/s through the GridFTP doors. We are eagerly waiting for the new FTS to be deployed (particularly at CERN) so that GridFTP v2 can be used instead.

FZK

Everything fine. We seem to have reached the network limits on some pools. With the ATLAS instance, we had to adjust some of the parameters on the SRM node. The system is surviving.

Doris also asked how the pool-HSM interface should behave when a pool goes down. In particular, they have experienced an issue with a pool crashing (due to the JVM running out of memory). When this happens, any files that have been requested from the HSM but not yet delivered are delivered while the pool is down. When the pool is restarted, the requested files are present on disk but the pool has no knowledge of them. It required some effort to recover from this inconsistency.

It seems that the restore scripts are not killed when the JVM dies due to a JVM out-of-memory. Is it possible to fix this?
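The inconsistency described above (files on disk that the restarted pool does not know about) can be illustrated with a minimal sketch. It assumes the pool's data files sit in a single directory and that a list of the file IDs the pool knows about has been obtained separately; the function name, the directory layout and the ID list are all hypothetical and not part of dCache.

```python
import tempfile
from pathlib import Path


def find_unregistered_files(data_dir, known_ids):
    """Return the names of files on disk that are missing from known_ids.

    data_dir  -- directory holding the pool's data files (assumed layout)
    known_ids -- iterable of file IDs the pool reports it holds (assumed input)
    """
    on_disk = {p.name for p in Path(data_dir).iterdir() if p.is_file()}
    return sorted(on_disk - set(known_ids))


if __name__ == "__main__":
    # Hypothetical demonstration with three files, one unknown to the pool.
    with tempfile.TemporaryDirectory() as d:
        for name in ("0001", "0002", "0003"):
            (Path(d) / name).write_text("x")
        print(find_unregistered_files(d, {"0001", "0003"}))
```

Files reported by such a check would still need to be verified by hand before being registered or removed.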

Paul will forward this issue to the dCache team.

SARA

No special issues. "dCache working great" :-)

Availability of files is now bandwidth-limited.

Have sent a link to a page with SARA's configuration; who is putting together a page with all the links? Owen volunteered to host this in the dCache wiki.

PIC

Everything OK. dCache is sustaining a high network bandwidth: 2 GiB/s to outside dCache and 1 GiB/s internally.

Gerard reported seeing an occasional issue with the SRM reporting that it already has a record when a user requests a file upload. This is an issue with the space-manager and the manual fix is to remove the space-manager reservation for the transfer. Is this problem being addressed?

Paul to relay question to dCache team.

Fermi

Jon reported that they've switched from being a WLCG-compliant site to an OSG-compliant site. The implication of this is that they are now critically dependent on the info service working.

They have noticed an issue with the info cell reporting many NullPointerExceptions. Although the cell doesn't die, it is reporting "crazy stuff".

Paul to investigate further.

Also, Jon has been collecting some historic data on PNFS usage and on the cost information for about a month. This logging data is taking up O(100 GiB) of space that he would like to recover. What should he do with the data?

Paul to ask team and get back to Jon.

The topic of PNFS and Berkeley-DB came up. Jon hasn't migrated to Berkeley-DB yet. There was an issue with taking backups on ext3. Fermi were experimenting with ext4 in the latest kernel.org kernel (~2.6.29) and found that, with that kernel version, the issues with ext3 were solved. They plan to migrate when they next have an 8-hour downtime window.

Triumf

(via email)

No problems.

Step09

NDGF have noticed that pre-staging requests have started to come in.

One issue the sites raised was that VOs (ATLAS is one) have asked to have staging pools cleared. This is so they can test the worst-case scenario for staging in files. How often these requests are received varies from site to site. NDGF have (within Step09) received their first request since the last CCRC, whilst FZK receive these requests weekly as a VO tests the tape recall process.

Sites that have received these requests (FZK, NDGF, PIC) have scripts to support these requests. The scripts are written at the different sites and the sites are happy with maintaining their independent scripts.

At FZK, end users have requested a mechanism to clear the pools on demand. Currently none of the sites provide a mechanism for end-users to trigger removal of files from staging pools and none anticipate providing this service.

Inconsistencies between billing files and billing db

Gerard reported that at PIC they were routinely deleting the billing files. They were doing this because they believed that all information contained within the billing files is also recorded in the billing db.

When investigating file lifecycles, it turned out that file removal messages are recorded in the billing files but not in the billing db. Is this deliberate or a bug?

Paul: It's a bug: there is work underway towards fixing it.
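Before deleting billing files, a site could check for this kind of gap itself. The sketch below assumes that records from both sources have already been parsed into (message type, file ID) pairs; the parsing step, the function name and the message-type labels are hypothetical and do not reflect the actual billing file or database layout.

```python
from collections import Counter


def missing_message_types(file_records, db_records):
    """Count, per message type, records found in the files but not in the db.

    Both arguments are iterables of (message_type, file_id) pairs
    (an assumed, already-parsed representation).
    """
    in_db = set(db_records)
    return Counter(
        msg_type
        for msg_type, file_id in file_records
        if (msg_type, file_id) not in in_db
    )


if __name__ == "__main__":
    # Hypothetical records: the removal messages never reached the db.
    files = [("transfer", "A"), ("remove", "A"), ("remove", "B")]
    db = [("transfer", "A")]
    print(missing_message_types(files, db))
```

A message type with a non-zero count here is information that would be lost if the billing files were deleted.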

Break-even experiences

Gerard reported that at PIC they were suffering from a Solaris/JVM bug where, if a pool was given too much activity, it would crash. When a new pool was introduced to the dCache system, there was sufficient activity to trigger this bug.

As a work-around for this bug, they attempted to limit the activity on the pool. They tried adjusting the max movers on the pool but, because the new pool's space cost was low, transfers were still predominantly scheduled for that pool.

They were looking at adjusting the break-even parameter, so were wondering what other people's experiences were with adjusting it.

  • FZK: Doris reported that she believes they adjusted the parameter at one stage in the past. They use the space cost factor to achieve pool rebalancing.
  • NDGF: Mattias reported that they might have adjusted this parameter in the past, but he can't remember. They use the migration module to achieve pool rebalancing. They have some scripts / procedures to do this and will send information to Owen.

Owen reported a poster-presentation by Christopher Jung at CHEP about dCache cost calculation.
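Why a new, empty pool attracts most transfers can be shown with a simplified model. The sketch assumes the pool with the lowest weighted sum of space cost and performance cost is selected; the pool names and cost values are made up, and dCache's real cost calculation (including how the break-even parameter enters the space cost) is more involved than this.

```python
def total_cost(space_cost, perf_cost, space_factor=1.0, cpu_factor=1.0):
    """Weighted sum of the two cost components (simplified model)."""
    return space_factor * space_cost + cpu_factor * perf_cost


def pick_pool(pools, space_factor=1.0, cpu_factor=1.0):
    """Return the name of the pool with the lowest total cost.

    pools maps a pool name to a (space_cost, perf_cost) pair.
    """
    return min(
        pools,
        key=lambda name: total_cost(*pools[name], space_factor, cpu_factor),
    )


if __name__ == "__main__":
    # Hypothetical values: the new pool is nearly empty (low space cost)
    # but already busy (high performance cost).
    pools = {"old-1": (0.8, 0.2), "old-2": (0.7, 0.3), "new": (0.05, 0.6)}
    print(pick_pool(pools))                    # equal weights
    print(pick_pool(pools, space_factor=0.1))  # space cost weighted down
```

With equal weights the new pool wins despite being busy; reducing the weight given to space cost shifts the choice back to a less loaded pool, which is the kind of effect the sites were after when tuning the space cost factor.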

Sites enabling GridFTP v2

Owen asked which sites have enabled GridFTP v2.

  • Fermi have this enabled but haven't seen clients using it; Jon believes their local FTS service is configured to use it, but would need to check,
  • NDGF have this enabled,
  • FZK don't have this in production yet but are testing it in PPS; local FTS has this disabled,
  • PIC not enabled,
  • SARA don't know if it's enabled.

DTNM

"Same time, next week..."

Tuesday 2009-06-16 14:15 UTC (16:15 CEST, ...)