wiki:developers-meeting-20141126

[part of a series of meetings]

Participants

Paul, Tigran, Karsten, Gerd, Dmitry

Agenda

[see box on the right-hand side]

Postcards

Up to two minutes (uninterrupted) per person to answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Karsten:

  • Scrum, Jira Agile presentation
  • Small Files: preparing another test run -> container on tape corrupt
  • helped plan a second Small Files instance for DPHEP
  • Reviews

Paul:

  • Alternating between meetings and releases
  • optimizing the release process
  • various meetings
  • abstracts

Tigran:

  • Mostly working on cloud issues
  • fixing ownCloud -> fixing NFS
  • some patches in RB
  • tickets
  • populating Elasticsearch from billing using JSON instead of Logstash

Dmitry:

  • Addressing OOM on pool nodes; tested with own version, which seems to fix it
  • Server performance improved after the sweeper

Gerd:

  • Been in contact with Ivan and Xavier
    • IN2P3 uses import and export pools, overwriting db schema once a minute
  • worked on migration module
  • bugfixes
  • SSL problems
  • ATLAS is not able to delete files on the 2.11 instance, probably because of SSLv3 from SL5

Special topics

SSL

If there is an SSL handshake error, Jetty doesn't log it except at DEBUG level and just closes the connection (as an IOException). This makes it hard to discover issues with disabled algorithms.

Paul: Xavier recently had such a case with CRLs

Gerd: We could subclass a class in the chain and do proper logging or do it with an Aspect. Or submit a patch to Jetty.

Paul: Could we bump up the level of the specific logger?

Gerd: That would also cause other debug messages to be logged
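
A minimal sketch of Paul's suggestion, assuming logback (which dCache uses) and a hypothetical logger name; the exact Jetty logger that carries the handshake error would have to be confirmed, and, as Gerd notes, raising it to DEBUG also surfaces unrelated debug output:

    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import org.slf4j.LoggerFactory;

    public class JettySslDebug {
        public static void enableHandshakeLogging() {
            // "org.eclipse.jetty.io" is an assumption; whichever Jetty
            // logger actually records the handshake failure would go here.
            Logger jettyIo = (Logger) LoggerFactory.getLogger("org.eclipse.jetty.io");
            // DEBUG exposes the handshake error, but also every other
            // DEBUG message below this logger -- Gerd's objection.
            jettyIo.setLevel(Level.DEBUG);
        }
    }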

Tigran: We should ask Jetty people to "fix" this.

Gerd: If we continue disabling SSL3 in older branches this may cause problems for WLCG

Paul: We need to talk to them.

Tigran: Maybe we don't need to support SL5 for much longer (e.g., DESY only has SL6 worker nodes)

Gerd: Even SL5 should be able to use TLS

-> Paul to drop email to WLCG -> Need to log SSL handshake errors

Gerd: Moving to 2.11 in production will be delayed by some weeks because of this.

SRM Advice

Paul: Got a question about what a reasonable number of bring-online requests is. Currently we say 2k.

Gerd: Although this is site dependent, 2k should usually be a good number.

-> 2k is fine, but we should be able to handle tens of thousands.

For rm and ls the number should probably be lower.

Access Times

Dmitry: pool2pool modifies access times of files, but shouldn't.

Paul: On a hot-spot replication this could cause the wrong files to be GCed

Dmitry: Could the interval of the sticky bit be exposed?

Tigran: We don't rebalance cache

Paul: Would it make sense to preserve the access times?

Gerd: We could

Tigran: For migration files it could make sense to keep it as is.

Paul: Why do you want to adjust the 2min lock?

Dmitry: We would like for some use-cases to guarantee a longer period of online time

Tigran: For such cases you could have a partition that only removes the oldest files (sketched below)

Paul: It would be a nice feature, others do have it

-> They should
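
A minimal sketch of the distinction behind Tigran's suggestion, with hypothetical names (dCache's actual sweeper internals differ): an LRU ordering is perturbed when internal copies touch access times, while an oldest-first ordering is not.

    import java.time.Instant;
    import java.util.Comparator;

    /** Hypothetical replica record; the names are illustrative only. */
    record CachedReplica(String pnfsId, Instant createdAt, Instant lastAccessedAt) {}

    class EvictionOrder {
        /** LRU: evict the least recently used replica first. A pool-to-pool
         *  copy that updates lastAccessedAt changes this ordering. */
        static final Comparator<CachedReplica> LRU =
                Comparator.comparing(CachedReplica::lastAccessedAt);

        /** Oldest first: depends only on creation time, so internal copies
         *  cannot keep a file alive that clients never read. */
        static final Comparator<CachedReplica> OLDEST_FIRST =
                Comparator.comparing(CachedReplica::createdAt);
    }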

---

Dmitry: Hot replication

Trunk activity

Progress with new features...

DDN

Tigran: They are building a storage box with an SSD access layer. They want us to provide data about access patterns.

Gerd: Suggests using information from billing data

Issues from [FIXME: Add link to yesterday's Tier-1 meeting]

KIT had some issues with ATLAS staging -> instance collapsed (PoolManager crashed)

Dmitry: Did it crash with OOM? We had a case where adjusting memory on the dCacheDomain fixed the issue. (~30k requests)

Paul: Xavier had 70k, sadly the dump got corrupted.

Dmitry: Could he use rc set all restores?

Paul: This seems to be broken. If you set the limit, no further staging happens at all.

-> we can try to reproduce that

Gerd: Could be that the rm requests blocked. The question is "why".

Paul: Maybe because of a backed-up message queue. The pool reports message queue overflow.

What can we do?

Gerd: We could introduce a throttle in PoolManager (or other places) to avoid flooding pools.

We could also drop messages more aggressively. That's what we are doing in 2.10.
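
A minimal sketch of the throttle idea, assuming a bounded, fair semaphore in front of each pool's message queue (the names are hypothetical, not PoolManager's actual structure):

    import java.util.concurrent.Semaphore;

    /** Hypothetical per-pool throttle: caps the number of outstanding
     *  requests so a burst of stage requests cannot overflow the pool's
     *  message queue; excess callers wait their turn instead. */
    public class PoolThrottle {
        private final Semaphore slots;

        public PoolThrottle(int maxOutstanding) {
            slots = new Semaphore(maxOutstanding, true); // fair: FIFO waiters
        }

        /** Called before sending a request to the pool. */
        public void beforeSend() throws InterruptedException {
            slots.acquire();
        }

        /** Called when the pool's reply (or a timeout) arrives. */
        public void afterReply() {
            slots.release();
        }
    }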

-> Advice should be to keep testing 2.10 and upgrade. -> Investigate "rc set all restores"

---

Paul: WebDAV testing in 2.10. It looks like there is a race condition if you upload a file and immediately try to read it: the entry is already registered as final in PnfsManager, but since that registration is a synchronous call from the pool, the entry is not yet completely final on the pool.

-> The pool should report a "new file" error that would cause the door to retry.

Gerd: A proper way to avoid this would be to have a control channel
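
A minimal sketch of the retry behaviour proposed above, with hypothetical names (the real door and pool interfaces differ): the door retries a bounded number of times while the pool signals that the fresh replica is not yet final.

    import java.util.concurrent.Callable;

    /** Hypothetical error the pool would return while the replica
     *  from a just-finished upload is not yet final on disk. */
    class FileNotYetFinalException extends Exception {}

    class RetryingRead {
        /** Retry the read with a short back-off instead of failing the
         *  client request inside the upload/read race window. */
        static <T> T readWithRetry(Callable<T> read, int attempts, long backoffMillis)
                throws Exception {
            for (int i = 1; ; i++) {
                try {
                    return read.call();
                } catch (FileNotYetFinalException e) {
                    if (i >= attempts) throw e;
                    Thread.sleep(backoffMillis);
                }
            }
        }
    }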

---

xrootd-plugin: will it work with 2.10?

-> it should do

---

Natalya: Gave some general advice... they should upgrade.

---

Tuesday:

Xavier: generally happy

  • logback.xml
  • file deletions: all doors except NFS log file deletions... it would be nice if NFS did too.

NDGF:

  • SSLv3 issue

Plans for patch-releases

Should we make a new patch release?

Automation improvements are looking good.

G2 tests are failing. Should we say that SL5 is no longer supported and move to newer machines?

-> we should backport disabling SSLv3 to all supported branches.

Outstanding RT Tickets

[This is an auto-generated item. Don't add items here directly]

Review of RB requests

New, noteworthy and other business

DTNM

Proposed: same time, next week.