wiki:developers-meeting-20141015
Last modified on 10/15/14 17:02:49

[part of a series of meetings]

Participants

Karsten, Christian, Paul, Gerd, Al, Dmitry

Agenda

[see box on the right-hand side]

Postcards

Up to two minutes (uninterrupted) per person where they can answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Karsten:

  • Worked with Christian on srm-probe testbed and updated Jenkins job to build package.
  • Tickets, Reviews

Christian:

  • srm-probe testbed
  • fixed pcells
  • started again to mavenize pcells as a multi-module project

Gerd:

  • Mostly bugfixes: migration module, billing, ...
  • trained some people in dCache

Paul:

  • Meeting preparation
  • LSDMA AllHands meeting
    • Potential for collaboration with UniHH in analyzing logfiles
  • Set up Prometheus system
  • Presented SmallFiles to Atlas people -> they seem interested
  • SRM client release

Al:

  • Tied up plot issue with bug, but lib
  • RM: chatting with Gerd

Dmitry:

  • Built 2.6 with patches, will deploy tomorrow
  • Problems with DB
  • Testing, etc.
  • globus-url-copy problem: parsing issue, developer points to open ticket from Gerd
  • Will continue the investigation; suggests sending in ASCII mode

Special topics

2.11 + Java8

RHEL6 has Java8, but it is a bit too late to put Java8 code into dCache. Suggestion: set Java8 as a requirement, which makes rollback easy.

SRM-Client should stay Java7 compatible, so we will have to release it from the dCache 2.10 branch. We can also set the SRM-Client dependencies in dCache to be Java7.

How long should we support SRM-Client from 2.10, then? Previously we always released from the latest release... we still can if we set the dependencies to Java7

-> So will keep SRM-Client as Java7 as long as possible!

What about 2.11? Using Java8 could make it harder for sites that want to use feature releases.

Dmitry: No problem for us. We should consider Tigran's opinion.

Christian: Java8 seems to help with performance.

2.11 will wait for Tigran's input. 2.12 will definitely be Java8.

Prometheus

Paul: The initial idea was to have a running dCache instance based on master that is refreshed daily.

Then the CMS xoot6 issue gave us the idea to let customers and clients test their software against master, to catch problems early.

-> The system is set up and running

Gerd: CMS and Atlas probably won't use it, but it is still good that we do our part to avoid problems.

Paul: We are missing functional tests from xroot and dcap for more complicated scenarios

Dmitry: What would I need to do to trigger the bug?

-> Starting to give access to us developers and later maybe to others for testing

Trunk activity

Progress with new features...

Replica Manager

Al: At the beginning of summer we had a prototype. Then Gerd worked on it, and now we have finally got back to it. Started to pull in some things from Gerd and discovered a couple of issues:

  • It looks like a good idea to break up the replication module and divide it between the pools and the namespace
  • A lot of complicated things could be removed

-> Redesigning Replica Manager again (version 3). We need to be able to have a flag that prevents a replica from being placed on the same host.
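
A minimal sketch of what such a flag could mean when choosing a target pool for a new replica. This is only an illustration of the requirement, not the Replica Manager design: the `Pool` class, the `candidate_pools` function, the `same_host_allowed` parameter and all pool and host names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str  # hypothetical pool name
    host: str  # host the pool runs on


def candidate_pools(pools, replica_holders, same_host_allowed=False):
    """Return the pools that may receive an additional replica.

    Pools that already hold a replica are always excluded. Unless
    same_host_allowed is set, pools on a host that already holds a
    replica are excluded as well.
    """
    holder_names = {pool.name for pool in replica_holders}
    holder_hosts = {pool.host for pool in replica_holders}
    result = []
    for pool in pools:
        if pool.name in holder_names:
            continue
        if not same_host_allowed and pool.host in holder_hosts:
            continue
        result.append(pool)
    return result


# Hypothetical layout: two pools on host-a, one pool on host-b.
pools = [
    Pool("pool_a1", "host-a"),
    Pool("pool_a2", "host-a"),
    Pool("pool_b1", "host-b"),
]
holders = [pools[0]]  # the file currently lives on pool_a1 only
```

With the flag in its default state only `pool_b1` qualifies, because `pool_a2` shares a host with the existing replica. Passing `same_host_allowed=True` makes `pool_a2` eligible again.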

Gerd:

  • Suggests using the server side of the migration module, which would allow us to get rid of some stuff...
  • Suggests splitting off some code into other modules

Paul:

Does RM react if there are too many replicas? Which messages will RM respond to?

Gerd: The design allows ignoring synchronously, but asynchronously triggering the addition of caches.

Gerd: There might still be a small race condition that could be solved using annotated message handlers.

Paul: Could we solve this by having an additional field? The approach seems to try to solve the race condition by reordering messages, which seems like a fragile solution.

-> Will add additional field, Al to implement this.

Atlas -> Metadata for Datasets

Paul: It would be good if the files of a dataset were stored together.

They take hashes of file names and files are stored in up to 16 different directories. If we had the concept of groups of files, we could take care that they end up together.
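
To illustrate why name hashing scatters a dataset, here is a minimal sketch. It assumes MD5 and a single hex digit as directory name, which gives 16 buckets; the minutes do not say which hash function or directory layout is actually used, and the function names and file names below are made up for the example.

```python
import hashlib
from collections import defaultdict


def hashed_directory(file_name, buckets=16):
    """Map a file name to one of `buckets` directories via its hash.

    MD5 is an assumption made for this sketch; the real scheme may differ.
    """
    digest = hashlib.md5(file_name.encode("utf-8")).digest()
    return "{:x}".format(digest[0] % buckets)


def spread(file_names):
    """Group file names by the directory the hash assigns them to."""
    layout = defaultdict(list)
    for name in file_names:
        layout[hashed_directory(name)].append(name)
    return dict(layout)


# Hypothetical dataset of 64 files.
dataset_files = ["dataset42.file{:04d}.root".format(i) for i in range(64)]
layout = spread(dataset_files)

for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

Because the directory depends only on the file name, files of the same dataset are spread over the buckets with no regard to which dataset they belong to. A notion of file groups would let placement use the dataset instead.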

Gerd: It does not seem sensible to code specific grouping mechanisms into our code base; a more generalized concept would be better. That would allow PoolManager to select on some piece of metadata.

-> Rucio probably stores datasets in directories with subtrees

Ask Friederike about datasets

Issues from [FIXME: Add link to yesterday's Tier-1 meeting]

KIT: Problem with overloaded SRM caused by new policy.

Gerd: Did that actually kill the SRM? Paul: It was overloaded, probably having an impact on other transfers.

Talking to Tigran about this and we have this concept

Gerd: The authentication is taking up the time.

Paul: They could keep the connection open or use bulk request, will contact Oliver about this.

Tigran gave Xavier some advice on how to improve performance. Upgrading to 2.10 will also introduce more robust handling of overload situations.

Plans for patch-releases

Should we make a new patch release?

No patch releases this week. We had one on Friday.

  • We had an SRM-Client release

The next round of releases will be next Tuesday, including SRM-Client.

Outstanding RT Tickets

Upgrade checklist for 2.2 -> 2.6 http://rt.dcache.org/Ticket/Display.html?id=8479

They should try upgrading; many sites already did. If they have problems, they should contact us. Related: Should we have such a guide for upgrading to 2.10? Yes. Gerd to write this.

xrootd Domain doesn't reconnect to dCacheDomain http://rt.dcache.org/Ticket/Display.html?id=8485

PinManager fails to start when the nfs service is started first http://rt.dcache.org/Ticket/Display.html?id=8274

Solaris packages http://rt.dcache.org/Ticket/Display.html?id=8443

Uploading to dCache box does not work. Karsten to tackle this.

hsm cleaner concurrent requests http://rt.dcache.org/Ticket/Display.html?id=8460

It could be that the pool takes too long, so the cleaner times out and recommits it. Maybe raise the timeout. Check in the code whether we log the problem.

srmls logger messages should be STDERR http://rt.dcache.org/Ticket/Display.html?id=8434

Likely fixed with latest release.

Pool manager errors in billing files http://rt.dcache.org/Ticket/Display.html?id=8442

dcache uses DNS to determine its hostname http://rt.dcache.org/Ticket/Display.html?id=8366

We need to replace the references and use the context of the cells, which should give the FQDN and can be parameterized. Karsten to check.

High load on NFS server: http://rt.dcache.org/Ticket/Display.html?id=8407

Looks like certain requests are causing problems.

Review of RB requests

New, noteworthy and other business

SRM-Client release

DTNM

Proposed: same time, next week.