dCache Tier I meeting May 17, 2011
[part of a series of meetings]
Present
dCache.org(Tigran,Tanja,Antje,Paul,Lusine), IN2P3(Nicolas), Triumf(Simon), NDGF(Gerd), PIC(Gerard), KIT(Doris)
Agenda
(see box on the other side)
Site reports
KIT
Doris reported that everything is currently fine.
There was an issue two weeks ago. This turned out to be a denial-of-service attack from CMS. Gerd helped diagnose the problem and suggested switching off the p2p-on-cost option. The problem was CMS issuing lots of file prestaging requests. These prestaging requests triggered a bug in dCache, which triggered lots of pool-to-pool activity, resulting in about 50 replicas of files that CMS were reading, spread over the pools.
SIDE NOTE: the problem is that, when the dummy-stager (when staging with dcap) or pin-manager (when using SRM) asks pool-manager to stage a file, pool-manager misinterprets this as an imminent read of the file and is somewhat reluctant to send subsequent read requests to the same pool. Massive staging can then trigger pool-to-pool transfers (due to pool-to-pool on-load) should a client attempt to read a file located on an HSM pool. This problem is fixed with dCache v1.9.5-26.
Another problem was with the Auger VO. They were writing a large number of very small files, O(10 MB) in size; roughly 1,500 files were going to a single pool. Despite their small size, these transfers used up all of the concurrent GridFTP transfers that were allowed (the MaxLogin property), which was set to 400.
The problem was that the "excessive" use of dCache GridFTP transfers prevented other people, in particular LHCb, from using dCache. From LHCb's point of view, the GridFTP doors were unstable.
Doris implemented some short-term solutions and increased the gsiftpMaxLogin value from 400 to 600 the next morning. This may have helped alleviate the problem, although Auger may simply have stopped trying to write so many small files.
NDGF
Gerd reported that things are "quite fine."
NDGF have upgraded to the latest internal build on their central nodes. This involved switching to the new FHS compliant packaging.
SIDE NOTE: dCache.org provides FHS-compliant packaging as part of our EMI releases. These EMI-released RPM files deploy files within the /etc, /usr, /var directories according to the FHS rules. We continue to provide RPMs that deploy files into the traditional /opt/d-cache path, which are available from the downloads page of our website.
Gerd said that they have also enabled some of the new features that will be coming with dCache v1.9.13. These include checking permissions on the complete path of a file. In all current releases, dCache will check the permissions of the containing directory and (in the case of reading) that of the file itself; dCache assumes that all directories that are ancestors of the containing directory are accessible.
Enabling correct directory and file permissions checking caused one of NDGF's Nagios tests to fail. This failure is actually correct: the probe was testing something that should fail, according to the permissions of the complete path.
Gerd also mentioned that they are currently in the process of changing all the ownership and permissions of ATLAS files. This is in preparation for a switch to take advantage of features from the new gPlazma. This preparation includes having a unique uid for each user (as opposed to having group accounts for all ATLAS users).
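The difference between the two checks described above can be illustrated with a minimal sketch. This is not dCache code: the Entry class, the helper names, the example namespace and the uid/gid values are all hypothetical, and the sketch assumes plain POSIX-style mode bits where traversing a directory requires the execute (search) bit.

```python
from dataclasses import dataclass

READ, EXEC = 4, 1  # POSIX-style permission bits (assumed model, not dCache internals)

@dataclass(frozen=True)
class Entry:
    owner: int  # uid of the owner
    group: int  # gid of the owning group
    mode: int   # permission bits, e.g. 0o755

def allowed(entry, uid, gids, want):
    """Return True if the user holds the wanted bit on this entry."""
    if uid == entry.owner:
        bits = (entry.mode >> 6) & 7
    elif entry.group in gids:
        bits = (entry.mode >> 3) & 7
    else:
        bits = entry.mode & 7
    return bits & want == want

def ancestors(path):
    """'/a/b/c' -> ['/', '/a', '/a/b'] (the last item is the containing directory)."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def can_read_parent_only(ns, path, uid, gids):
    """Check only the containing directory and the file itself."""
    parent = ancestors(path)[-1]
    return allowed(ns[parent], uid, gids, EXEC) and allowed(ns[path], uid, gids, READ)

def can_read_full_path(ns, path, uid, gids):
    """Check every directory on the path, then the file itself."""
    dirs_ok = all(allowed(ns[d], uid, gids, EXEC) for d in ancestors(path))
    return dirs_ok and allowed(ns[path], uid, gids, READ)

# Hypothetical namespace: /data is closed to everyone except its owner.
ns = {
    "/": Entry(0, 0, 0o755),
    "/data": Entry(0, 0, 0o700),
    "/data/atlas": Entry(1000, 1000, 0o755),
    "/data/atlas/file": Entry(1000, 1000, 0o644),
}

parent_only = can_read_parent_only(ns, "/data/atlas/file", 2000, {2000})
full_path = can_read_full_path(ns, "/data/atlas/file", 2000, {2000})
```

In this made-up namespace the parent-only check grants the read while the full-path check denies it, because /data is not searchable by the user. A monitoring probe that relied on the laxer behaviour would start failing in the same way once the complete path is checked.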
NDGF have also enabled OpenMQ for some of their domains. This is an alternative (and potential successor) to the current JMS implementation: ActiveMQ.
So far, everything is running fine.
IN2P3
Nicolas reported that everything is going OK at IN2P3.
There are not so many problems now; the problems with dCache becoming overloaded have gone away.
dcap version
Nicolas described how they have a "really old" version of the dcap library deployed on their worker nodes. When queried, dccp returns "v1.2".
IN2P3 have observed high load on their billing server. Could the old version of the library be triggering this problem?
Tigran said that, if the clients are using dccp, then there's no difference compared with more recent versions of the dcap library.
Migration
Nicolas also mentioned that they are planning to migrate two dCache instances to the latest 1.9.5 version: the LCG and EGEE instances. The LCG instance is currently running 1.9.5-24; the EGEE version is older: 1.9.5-3.
They are also planning to migrate from PNFS to Chimera during this upgrade. Paul said to write an email to <support@…> if they have any questions or problems.
PIC
Gerard reported that everything is running fine at PIC.
They have upgraded their Tier-3 instance to dCache v1.9.12.
Gerard is planning on upgrading their Tier-1 instance from dCache v1.9.5-25 to v1.9.5-26 on June 8th.
He was considering upgrading the Tier-1 facility to v1.9.12 on June 8th but, on reflection, he considers it too soon.
Gerard mentioned the problem they had with the latest dcap client. This forced them to downgrade their version to a pre-dcap++ version.
SIDE NOTE: the problem is that, when the user community explicitly enables the local-caching buffer (LCB a.k.a "dcap++") by setting an environment variable, the LCB is applied to all dcap transfers, including dccp. Unfortunately, the LCB algorithm provides very poor performance for applications that stream the file's contents, such as dccp.
Gerard mentioned that they, too, had a high level of activity from CMS due to a software bug. CMS were generating a lot of file-opens; he mentioned 2,000 jobs where each job was opening 5,000 files. This high number of file-opens resulted in dCache filling up the billing disk but, apart from that, dCache continued to work.
Gerard remarked that he had been considering consolidating the billing and pool-manager services onto the same machine. However, he has decided to keep them on separate machines for now, having noticed high load on both machines during CMS's massive amount of activity.
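To give a feel for the scale of the CMS activity Gerard described, the following back-of-the-envelope sketch multiplies out the figures he quoted. The one-billing-record-per-open model and the 300-byte record size are assumptions made purely for illustration; neither was stated in the meeting.

```python
def billing_volume(jobs, opens_per_job, bytes_per_record):
    """Return (total file-opens, estimated billing bytes).

    Assumes one billing record per file-open (a simplification).
    """
    opens = jobs * opens_per_job
    return opens, opens * bytes_per_record

# 2,000 jobs x 5,000 opens are the figures from the report;
# 300 bytes per record is a hypothetical value.
opens, size_bytes = billing_volume(2000, 5000, 300)
print(f"{opens:,} file-opens, roughly {size_bytes / 1e9:.1f} GB of billing data")
```

Under these assumptions the burst amounts to ten million file-opens, which is consistent with a billing disk filling up even though the rest of dCache kept working.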
Triumf
Simon reported that Triumf are running OK and upgraded their dCache instance to 1.9.5-25 on the 11th of May.
dcap
Doris asked whether there has been any activity on the active transfer retry bug. No, we don't have a fix for this yet.
Paul asked if we have a ticket about the problem; Doris and Tigran said yes.
Support tickets for discussion
[Items are added here automagically]
DTNM
Same time, next week.
