Last modified on 10/21/09 18:47:55

dCache Tier I meeting October 20, 2009

[part of a series of meetings]

Present

dCache.org(Owen, Paul, Tigran, Tanja, Irina), IN2P3(), Sara(Onno), Triumf(Simon), BNL(Pedro), NDGF(), PIC(Gerard), GridKa(), Fermi(Jon, Gene, Timur), CERN()

Agenda

(see box on the other side)

Site reports

Fermi

Things are working fine.

We have just deployed the last of a batch of new worker-nodes. This brings the number of WNs up to ...

75,000

We are now able to provide 8GiB/s of data to worker nodes.

Triumf

No problems currently.

Simon reported that Triumf has been experiencing SAM test failures with SRM time-out problems.

A similar problem has happened in the past. Fixing it was correlated with adjusting an SRM setting although, off the top of his head, Simon couldn't remember precisely which one (with Put/Get? SrmScheduler there are some settings like MaximumUser).

PIC

Issues from 1.9.5-3 upgrade

Upgrade to 1.9.5-3 was OK; we used puppet to upgrade.

There were only a few issues:

  • LinkGroupAuthFile cannot be commented out, because by default it's empty!
  • With srmAsynchronousLs=true we're getting errors like this:

[FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [USER_ERROR] Cannot process SrmLs Request [-2046270387]. SrmLs Request should not be asynchronous when asking for SURLs stats.] Source Host [srm-atlas.cern.ch]

  • Needed to update some Nagios scripts which monitor java daemons using ps, since syntax has changed (now all java procs look the same)
  • Needed to update ganglia wrappers for PNFS info command (3rd column is gone with 1.9.5-3).
  • pnfs root owned directories issue (automatic script used to work fine). This causes errors like this:

[FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [SECURITY_ERROR] at Tue Oct 20 15:56:17 CEST 2009 state Failed : user has no permission to write into path /pnfs/pic.es/data/atlas/atlasdatadisk/step09/ESD/closed/step09.20200943000675L.physics_A.recon.ESD.closed] Source Host [ccsrm.in2p3.fr]

This is ticket #5184 PNFS root owned directories issue at PIC
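PIC's own script for this was not shown in the meeting. As a rough sketch only, directories owned by a given user (root is uid 0) could be located as below. The function name `find_dirs_owned_by` is made up for illustration, and whether walking a PNFS mount this way performs acceptably is not something the minutes address.

```python
import os


def find_dirs_owned_by(top, uid=0):
    """Return the sorted paths of all directories at or below `top` owned by `uid`.

    uid defaults to 0 (root). Symbolic links are not followed.
    """
    hits = []
    for dirpath, _dirnames, _filenames in os.walk(top):
        # lstat reports the owner of the directory entry itself.
        if os.lstat(dirpath).st_uid == uid:
            hits.append(dirpath)
    return sorted(hits)
```

For example, `find_dirs_owned_by("/pnfs/pic.es/data/atlas")` would list the root-owned directories under that tree, which could then be reviewed before changing ownership.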

Furthermore, this is the most relevant puppet config we use to deploy dCache on Linux (if you want it for Solaris dCache pools, I have that too):

    # Inner double quotes and awk's $2 are escaped so that Puppet keeps them literal.
    if ( $downgrade == "true" ) {
        # Remove the installed RPM first, unless it already matches $dCacheVersion.
        exec { "yum -y remove dcache-server":
            before    => Package["dcache-server"],
            onlyif    => "if [ \"`rpm -q dcache-server | grep $dCacheVersion | cut -d- -f1`\" = \"dcache\" ]; then exit 1; else exit 0; fi",
            logoutput => true,
        }
    }
    # Stop dCache, kill leftover job and java processes, then archive /opt/d-cache
    # (excluding billing). Skipped when the wanted version is already installed.
    exec { "beforeRPM":
        command   => "cd /opt; /opt/d-cache/bin/dcache stop; ps -ef | grep /opt/d-cache/jobs | grep -v puppet | grep -v grep | awk '{print \$2}' | xargs kill; ps -ef | grep java | grep -v grep | grep -v puppet | awk '{print \$2}' | xargs kill; tar -cz --exclude billing -f dcache.`date +%y%m%d`.tgz d-cache",
        onlyif    => "if [ \"`rpm -q dcache-server | grep $dCacheVersion | cut -d- -f1`\" = \"dcache\" ]; then exit 1; else exit 0; fi",
        logoutput => true,
    }
    package { "jdk":
        ensure => latest,
    }
    package { "java-dummy":
        ensure  => installed,
        require => Package["jdk"],
    }
    # Install the requested version, then trigger the configuration steps.
    package { "dcache-server":
        ensure  => $dCacheVersion,
        alias   => dCache,
        require => [Yumrepo["dCache-$instance.repo"], Exec["beforeRPM"], Package["java-dummy"]],
        notify  => [Class["dc_dCache_Config"], Exec["$dCacheConfigureCMD"]],
    }
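The onlyif guard in the config above is easy to misread, since it exits 1 when the versions match. As an illustration only, its logic can be restated in Python. The function names and the sample `rpm -q` output strings are assumptions, and plain substring matching stands in for grep's regular-expression matching.

```python
def installed_version_matches(rpm_query_output, wanted_version):
    """Mimic: rpm -q dcache-server | grep $dCacheVersion | cut -d- -f1, compared to "dcache"."""
    # grep: keep only lines mentioning the wanted version (substring match here).
    matching = [ln for ln in rpm_query_output.splitlines() if wanted_version in ln]
    # cut -d- -f1: keep the text before the first '-' on each remaining line.
    first_fields = "\n".join(ln.split("-")[0] for ln in matching)
    return first_fields == "dcache"


def should_run_exec(rpm_query_output, wanted_version):
    """The guard exits 1 (skip the exec) on a match and 0 (run it) otherwise."""
    return not installed_version_matches(rpm_query_output, wanted_version)


# Illustrative rpm -q outputs (assumed format).
print(should_run_exec("dcache-server-1.9.5-3.noarch", "1.9.5-3"))            # wanted version installed
print(should_run_exec("dcache-server-1.9.4-3.noarch", "1.9.5-3"))            # different version installed
print(should_run_exec("package dcache-server is not installed", "1.9.5-3"))  # nothing installed
```

In other words, the stop, kill and backup steps run only when the installed dcache-server package is missing or differs from the wanted version.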

PIC are currently in down-time from their upgrade to dCache 1.9.5.

Paul asked what these scripts were doing (the motivation being to identify problems with dCache that site-admins have to script around). Gerard mentioned that he has seen cases where dCache thinks it's running OK but two pool daemons are running, so the script checks that the right number of java processes are present and that they're running the right command.
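PIC's actual monitoring scripts were not shown. A minimal sketch of the kind of check Gerard described, assuming each daemon can be recognised by some marker substring in its `ps` command line, might look like the following. The function names and the `domain=...` markers are invented for illustration, and with 1.9.5-3 the java command lines reportedly all look the same, so a real check would need a different way to tell daemons apart.

```python
from collections import Counter


def count_daemons(ps_lines, expected):
    """Count java processes per marker. `expected` maps marker substring -> wanted count."""
    counts = Counter()
    for line in ps_lines:
        if "java" not in line:
            continue  # only java daemons are of interest
        for marker in expected:
            if marker in line:
                counts[marker] += 1
    return {marker: counts[marker] for marker in expected}


def check_daemons(ps_lines, expected):
    """Return a list of problem descriptions; an empty list means all counts match."""
    actual = count_daemons(ps_lines, expected)
    return [
        f"{marker}: expected {wanted}, found {actual[marker]}"
        for marker, wanted in expected.items()
        if actual[marker] != wanted
    ]
```

Feeding it the lines of `ps -ef` output together with something like `{"domain=pool1": 1}` would flag the duplicated pool daemon case, since the count found would be 2 rather than 1.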

They officially come out of down-time in an hour's time. PIC is already accepting new jobs and transfers have started.

Sara

No issues.

Question: who else, apart from PIC and NDGF, is using 1.9.5? None of the other sites participating in the meeting were using dCache v1.9.5.

BNL

  • We cannot use gPlazma with SAML or XACML in 1.9.4-3; this leads to very frequent SRM_FAILURE errors due to permission denied. If we use a short lifetime for the cache, we also notice a very high load on the SRM machine. We are currently using grid VO role mapping, but we still see some SRM_FAILURE errors every once in a while. Is the problem understood? Is there a timeline for a fix?

FZK

Report from email

One info from us: we plan to update our ATLAS dCache instance next week
on Wednesday (28.10) to 1.9.5. The following week we will update the
other two dCache instances at GridKa.

DTNM

Proposed: same time, next week.