wiki:developers-meeting-20141022
Last modified on 10/22/14 16:48:29

[part of a series of meetings]

Participants

Patrick, Paul, Christian, Karsten, Dmitry, Gerd, Al

Agenda

[see box on the right-hand side]

Postcards

Up to two minutes (uninterrupted) per person where they can answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Karsten:

  • CHEP Abstracts

Patrick:

  • pushing people to write CHEP abstracts and wrote some
  • gave a presentation in Paris for a new Neutrino Detector Community for EGI (Antares)

Christian:

  • pcells: having fun with maven and class loaders, dependencies

Paul:

  • Uppsala computing meeting: gave a talk about dCache. They were interested; learned about their ideas
  • Poodle: Understood the thread and posted an announcement
  • CHEP: Abstracts

Al:

Dmitry:

  • Investigating NFS and making patches

Gerd:

  • Talking to Al
  • Followed some error logs
  • Pool health check, pool size issue with file system not supporting the free space -> workaround
  • Race condition problems
  • Database deadlock resolution in SpaceManager
  • Upgrade guide to 2.10

Special topics

NFS performance

  • Since June, Fermi has been experiencing slowdowns
  • Slowdowns are caused by recursive ls (python implementation)
  • Currently we have a workaround that restarts NFS door if stat takes longer than 10s
  • A patch from 2.2 helped in 2.6 to increase the transaction rate
  • delay seems to be in the interaction with the server
  • CPU consumption is at 20%
  • memory is not an issue
  • iowait is not critical
  • NDGF: Upgrade to SL7 has improved the situation, but Atlas FAX still causes problems -> iowait is steadily increasing
  • Reboots also "fixed" the problems
  • Kernel version in SL7: 3.10.
  • performance issues are not consistent, between '12 and '14 performance was fine
  • slowdown is significant, but does not happen for all clients
  • cannot unmount server, only -f -l works
  • Server has a virtual network interface; remounting on the same interface does not help, but mounting with a different name helps.

-> looks like server remembers client and slots for the client are full

! the problem also happens to the probing client doing only the stats.

Didn't see any correlation between SL5/SL6 version and slowdowns
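The probing client and the 10s stat threshold mentioned above could be sketched roughly as follows. This is only an illustration, not the probe actually in use: the names probe_stat and check_and_react, and the on_slow callback (which would stand in for whatever restarts the NFS door), are assumptions.

```python
import os
import threading
import time

STAT_TIMEOUT_S = 10.0  # threshold quoted in the minutes


def probe_stat(path, timeout_s=STAT_TIMEOUT_S, stat_fn=os.stat):
    """Return (ok, elapsed_seconds).

    ok is False if the stat call failed or did not finish within timeout_s.
    """
    result = {}

    def worker():
        try:
            stat_fn(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: a stat that hangs on a stuck mount must not block exit.
    thread = threading.Thread(target=worker, daemon=True)
    start = time.monotonic()
    thread.start()
    thread.join(timeout_s)
    timed_out = thread.is_alive()
    elapsed = time.monotonic() - start
    ok = (not timed_out) and result.get("ok", False)
    return ok, elapsed


def check_and_react(path, on_slow, timeout_s=STAT_TIMEOUT_S, stat_fn=os.stat):
    """Probe once and call on_slow(path, elapsed) if the probe was not ok."""
    ok, elapsed = probe_stat(path, timeout_s, stat_fn)
    if not ok:
        on_slow(path, elapsed)  # hypothetical hook, e.g. trigger a door restart
    return ok
```

Logging the elapsed time of every probe, not only the failures, would also give data for correlating slowdowns with other activity.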

Paul:

  • NFS server in Grizzly has a limited number of threads (around CPU count)
  • Tigran split
  • ls -R causes almost unlimited amount of work (single threaded, probably) -> There should be a way to limit the amount of work a client is able to impose on the server.
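One possible way to limit the work a single client can impose is a cap on its in-flight requests. The sketch below is generic and not taken from the dCache NFS server; the class PerClientLimiter and its methods are hypothetical names used only for illustration.

```python
import threading


class PerClientLimiter:
    """Cap the number of in-flight requests per client (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._lock = threading.Lock()
        self._in_flight = {}  # client id -> number of requests being served

    def try_acquire(self, client_id):
        """Reserve a slot for client_id; return False if its cap is reached."""
        with self._lock:
            count = self._in_flight.get(client_id, 0)
            if count >= self._max:
                return False
            self._in_flight[client_id] = count + 1
            return True

    def release(self, client_id):
        """Free one slot previously reserved with try_acquire."""
        with self._lock:
            count = self._in_flight.get(client_id, 0)
            if count <= 1:
                self._in_flight.pop(client_id, None)
            else:
                self._in_flight[client_id] = count - 1
```

A server loop would call try_acquire before handing a request to a worker thread and release when the request completes. A rejected request could be delayed or answered with a retry hint, so that one client running ls -R cannot occupy all worker threads.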

  • It would be interesting to know how often and how regularly the ls -R happens!
  • Clients didn't say anything about clients dying on reboot; they only complain if the request takes too long.
  • Need correlation between ls -R and slowdown.
  • ls -R is run on 3 different machines.
  • Could be that the directories are changing during ls -> probably the creation of new entries causes the slowdowns.

  • ls -f is much faster than ls -l
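The difference between the two listings can be illustrated in Python. As a rough analogy (not the ls implementation itself): ls -f only reads directory entries, unsorted, while ls -l sorts and needs the attributes of every entry, which over NFS means extra work per file. The function names are illustrative.

```python
import os


def list_names(directory):
    """Roughly like `ls -f`: names in directory order, no sorting, no stat.

    Unlike `ls -f`, os.scandir does not return "." and "..".
    """
    with os.scandir(directory) as entries:
        return [entry.name for entry in entries]


def list_long(directory):
    """Roughly like `ls -l`: sorted names plus one lstat call per entry."""
    rows = []
    for name in sorted(os.listdir(directory)):
        info = os.lstat(os.path.join(directory, name))
        rows.append((name, info.st_size, info.st_mtime))
    return rows
```

Timing both functions on a large directory on the affected mount would show how much of the cost comes from the per-entry attribute lookups.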

Does the caching issue with ls have something to do with this? This should be fixed in more recent versions of Linux.

What kind of loads does a typical T1 site see? Fermi sometimes sees about 20kHz.

ls on NFS 4.1 gets stuck while writing, but works on NFS 3.

Globus Online

Situation?

  • Globus library mangles reply from GO and seg faults on various occasions
  • They suggest to use mls with ascii

Is the recursive copy the only problem that keeps us from being compatible?

There are other issues: transfers fail at the very end. Between DESY and KIT, transfers failed with the Globus native server.

If Globus is not willing to fix it, we will have to 1) fix the Globus library, 2) escalate the issue to ensure the patch gets into GO.

Paul to contact Tiki to ensure the code would get promoted to GO

Trunk activity

Progress with new features...

Gerd: Found a bug in Pool Manager.

Issues from [FIXME: Add link to yesterday's Tier-1 meeting]

Tier-1s home alone

Plans for patch-releases

Should we make a new patch release?

Releases were made on Tuesday and there will be another one next week. Disabling SSL3 in jGlobus is in the pipeline. Globus advised not to disable SSL3 to avoid problems with other components.

2.11: started setting up the infrastructure. Please don't break master until Monday.

Outstanding Documentation

Outstanding RT Tickets

[This is an auto-generated item. Don't add items here directly]

Review of RB requests

Gerd, please commit srmclient.

New, noteworthy and other business

DTNM

Proposed: same time, next week.