Last modified on 02/24/10 15:46:46

dCache Tier I meeting February 23 2010

[part of a series of meetings]

Present

dCache.org (Owen, Patrick, Paul), Triumf (Simon), NDGF (Gerd), GridKa (Silke, Doris), CERN (Andrea)

Agenda

(see box on the other side)

Site reports

FZK

Doris reported that everything is fine.

They have one little issue with SRM space-tokens: the space-token database appears to be missing some files. ATLAS are reporting that quite a lot of files that should be inside of a space token are not. Doris has noticed that the catalina.out file contains error messages like "No file with PnfsID existing".

Patrick asked whether these error messages appear when reading or writing. They appear when writing.

These files may be read without any problem.

Does this only happen occasionally or all the time? A grep of yesterday's log file shows 338 instances of the error message in catalina.out.
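The count above came from a grep of the log. A minimal Python sketch of the same kind of count is shown here, in case it needs repeating over several days. The file path and the sample log lines are hypothetical, and the search string is taken from the message as quoted in these minutes, so the exact wording in a real catalina.out may differ.

```python
def count_matches(lines, needle):
    """Count the lines that contain the given substring."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path, needle):
    """Count matching lines in a log file, tolerating odd bytes in the log."""
    with open(path, errors="replace") as log:
        return count_matches(log, needle)


if __name__ == "__main__":
    # Hypothetical sample lines; real log entries will look different.
    sample = [
        "02/22 10:00:01 No file with PnfsID existing",
        "02/22 10:00:02 some unrelated message",
        "02/22 10:00:05 No file with PnfsID existing",
    ]
    print(count_matches(sample, "No file with PnfsID"))
```

For a real log, something like count_in_file("catalina.out", "No file with PnfsID") would be the equivalent call, with the path adjusted to the local installation.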

Overall, some 36,000 files were not in a space token when this was checked last week.

FZK is running the latest 1.9.5 release on their head-nodes: 1.9.5-15. The pools are running an earlier version.

Could you check your PostgreSQL log file? Perhaps it reports some problem that correlates with ATLAS writing files that end up not in a space token.
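One way to look for such a correlation is to compare timestamps from the two logs. The sketch below is only an illustration: it assumes the timestamps have already been extracted from catalina.out and from the PostgreSQL log (neither log format is described in these minutes), and the function name, the five-second window and the sample times are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def correlated(srm_errors, db_errors, window=timedelta(seconds=5)):
    """Return the SRM error times that have a database log entry within `window`."""
    db_sorted = sorted(db_errors)
    hits = []
    for t in srm_errors:
        # Find the slice of database entries inside [t - window, t + window].
        lo = bisect_left(db_sorted, t - window)
        hi = bisect_right(db_sorted, t + window)
        if hi > lo:
            hits.append(t)
    return hits


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    srm = [datetime(2010, 2, 22, 10, 0, 0), datetime(2010, 2, 22, 11, 0, 0)]
    db = [datetime(2010, 2, 22, 10, 0, 3)]
    for t in correlated(srm, db):
        print(t.isoformat())
```

If most SRM errors have a nearby database entry, that would point towards the database; if none do, the cause is more likely elsewhere.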

Is the space-reservation database hosted in the same PostgreSQL instance as other databases? At FZK, the same PostgreSQL instance provides the SRM, space-reservation and pin databases.

Please open a ticket (email support@…) and we'll investigate more.

NDGF

Gerd reported that things are fine: no incidents.

NDGF plans to upgrade to the latest 1.9.6 release (1.9.6-3) with their back-port of various items from Trunk.

Triumf

Simon reported that things at Triumf are currently not good.

Last week Triumf migrated to Chimera. The migration was successful; however, about 20 hours after the migration, the Chimera database suffered a database-level corruption. Two rows in the t_dirs table had a VARCHAR field with an impossibly large length. The index was not corrupted. Work is ongoing to establish the root cause; the current likely culprits are fabric hardware or the database software. The hardware has been checked, so suspicion is now falling on the PostgreSQL software, which is v8.3.4.

The PostgreSQL release notes mention a possible index corruption bug; however, Simon has checked this and the problem isn't due to this issue.

Simon asked about the possibility of using a newer version of PostgreSQL. Doris reported that FZK are using v8.3; Gerd reported that NDGF are using v8.4.

Another issue (not related to the migration) was a pool not coming up. This was initially thought to be an HSM problem, but it turned out to be a pool problem. The pool was restarted and it came up successfully.

CERN

Andrea had nothing specific.

root and xroot?

Doris asked what is happening about "root" vs "xroot". Paul reported the background to this issue: CERN allow access via the old root protocol and the newer xroot protocol. LHCb need to distinguish between these two.

One solution is to rename the protocol in all xrootd implementations from "root" to "xroot". dCache currently advertises the xrootd protocol as root in GLUE, accepts root as a protocol (in SRM calls, like srmPrepareToGet) and returns TURLs like root://a-root-door.example.org.... These would change to publishing xroot in GLUE, accepting xroot as the protocol (in SRM) and returning TURLs like xroot://a-root-door.example.org....
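To illustrate what the change would mean for a TURL, here is a minimal Python sketch that rewrites the scheme of a URL from root to xroot. It is not how dCache implements anything; the function name is hypothetical, and the path in the example is made up because the minutes elide the rest of the TURL.

```python
from urllib.parse import urlsplit, urlunsplit


def rename_scheme(turl, old="root", new="xroot"):
    """Return the TURL with its scheme changed from `old` to `new`.

    URLs that use any other scheme are returned unchanged.
    """
    parts = urlsplit(turl)
    if parts.scheme != old:
        return turl
    return urlunsplit(parts._replace(scheme=new))


if __name__ == "__main__":
    # The host comes from the minutes; the path is a hypothetical placeholder.
    print(rename_scheme("root://a-root-door.example.org/some/path"))
```

A client that has to work with both old and new servers during a transition might need to accept either scheme, which is part of why a definitive statement on the approach matters.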

We're waiting on GDB to give a definitive statement that this is the correct approach before releasing a new version of dCache. There will likely be an announcement from GDB about this soon.

Support tickets for discussion

[Items are added here automagically]

DTNM

Same time, next week.