
dCache Tier I meeting Apr 28, 2009

Present

dCache.org(Patrick, Timur, Tigran, Owen, Paul), IN2P3(), Sara(), Triumf(), BNL(), NDGF(), PIC(Gerard, Marc), GridKa(Silke, Doris), Fermi(Jon), CERN()

Apologies

Agenda

  • Site reports,
  • New info provider,
  • DTNM.

Site reports

PIC

Everything OK.

Currently running v1.9.0, but with one pool running the v1.9.1 pool software.

We are planning to upgrade to v1.9.2 on 12 March. That doesn't allow much time for testing. Is that OK?

The v1.9.2 dCache has been heavily tested at Fermi and NDGF, so it is basically OK. It would still be a very good idea to test v1.9.2 on a pre-production setup, using a local dCache instance for a short while (e.g., a week).

Timur asked about the SRM performance: PIC have reported a very good throughput rate. Was this a peak or sustained?

It was a peak value, but one that hasn't been seen again. The experiment that generated the throughput is not running just now; PIC plan to tweak the settings when the experiment starts generating load again.

Gerard reported that the SRM database is running on the same host as the SRM front-end. PIC had a long-term plan of migrating the database to another node; this has been accelerated to a medium- or possibly short-term plan.

Karlsruhe / FZK / KIT / …

Everything is OK with the ATLAS instance.

FZK reported a problem with the old / non-ATLAS dCache instance. There is no problem writing data; however, there is an issue with reading via SRM, where reads block. Reading via gsiftp works OK. After restarting the SRM front-end, it is OK for a while, but then performance decreases until SRM read requests start to time out.

Timur asked if this was the result of PNFS overload. No. Had they looked at the log files? Yes, but they didn't see anything alarming.

Timur also asked if Doris could check whether there are a large number of requests in the READY state. This can be checked by looking at the SRM cell in the admin interface (the 'ls' and 'info' commands) or, for historic plots, using SRM Watch.
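For reference, a minimal sketch of such a check via the admin interface, assuming the default admin port (22223) and an SRM cell named SRM-<hostname>; the hostname, and whether the ssh-1 options are needed, depend on the local installation:

    ssh -1 -c blowfish -p 22223 admin@srm.example.org
    (local) admin > cd SRM-srm.example.org
    (SRM-srm.example.org) admin > info
    (SRM-srm.example.org) admin > ls
    (SRM-srm.example.org) admin > ..
    (local) admin > logoff

Here 'info' should print a summary that includes the number of requests in each state, while 'ls' lists the individual requests; a persistently large READY count is what Timur asked to look for.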

Doris reported seeing occasional peaks in the number of TCP sockets in the TIME_WAIT state on the SRM front-end. It was unclear where these may have come from, or even whether they are related to the SRM GET problems (it may be a result of a bulk of requests all timing out at the same time).
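One way to watch for such peaks, sketched here with standard Linux tools (nothing dCache-specific is assumed):

    # count the TCP sockets currently in TIME_WAIT on the SRM front-end
    netstat -tan | awk '$6 == "TIME_WAIT"' | wc -l

    # the same count broken down by remote host, to see where the connections come from
    netstat -tan | awk '$6 == "TIME_WAIT" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

Sampling these counts periodically (e.g., from cron) would show whether the peaks coincide with the periods when SRM GET requests time out.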

The problem could be an effect of the limited number of pending connections that Tomcat allows, controlled by the "acceptCount" parameter in the Tomcat configuration file (server.xml). This value can be increased; Fermi observed improved throughput when they did this.

Doris reported that the problem heals itself after a few hours; however, once the problem has gone away, the number of GET requests starts to increase and the problem reappears.

How long requests remain in the QUEUED state is plotted in SRM Watch.

Someone (Gerard?) reported several issues with SRM Watch (missing graphs, exceptions, etc.). Timur reported that a new version may be available soon, which may fix these issues.

Paul mentioned that, if the SRM front-end is rejecting clients (due to Tomcat and kernel backlog exhaustion), it should be possible to see the TCP RESET packets being sent from the SRM front-end. This could be monitored using, for example, tcpdump. Timur mentioned that this should be reported in the Tomcat log files, too.
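A sketch of such a check (the interface name and SRM hostname are placeholders):

    # capture outgoing TCP packets with the RST flag set on the SRM front-end;
    # a burst of these while clients are being turned away would support the idea
    # that the listen backlog (kernel and/or Tomcat acceptCount) is being exhausted
    tcpdump -n -i eth0 'src host srm.example.org and tcp[tcpflags] & tcp-rst != 0'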

There was some discussion about whether to increase the acceptCount parameter in Tomcat, and to what value. Timur recommended increasing it to 1,000, as this value is used at Fermi. Others were more cautious about increasing this value, due to the potential for increased load on PNFS.
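For illustration, the parameter lives on the Connector element in Tomcat's conf/server.xml; the port and the other attributes shown here are placeholders, and 1,000 is the value Timur mentioned:

    <!-- acceptCount is the queue of incoming connections held while all worker
         threads are busy; connections beyond this limit are refused -->
    <Connector port="8443"
               maxThreads="200"
               acceptCount="1000"
               connectionTimeout="20000" />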

Doris agreed to open a ticket to track this. Further work, involving log files and tweaking settings, will take place there. She also agreed to check what parameters are set for ATLAS.

Fermi

Jon reported that things are going fine for CMS.

There is an issue with the Replica Manager's interactive SSH / admin interface. There is a ticket related to this issue.

New info provider

Gerard brought up the issue of publishing more than one SRM endpoint. PIC would like to publish multiple SRM endpoints: these are currently aliases for the same physical device but may allow migration to separate front-end machines in the future. The new info provider does not make publishing this information easy.

Paul agreed to look into this: both a short-term solution and a longer-term solution that would allow easier management.

DTNM

Proposed next meeting is the same time next week (Tuesday 5 May 2009).

Several dCache developers are taking part in the dCache developers workshop at Fermi next week. They don't anticipate any problem conducting the Tier-1 support meeting but, should such a problem arise, the meeting may need to be rescheduled.