wiki:developers-meeting-20100929
Last modified on 09/29/10 18:41:34

[part of a series of meetings]

Participants

..

Agenda

[see box on the right-hand side]

Postcards

Up to two minutes (uninterrupted) per person where they can answer two questions:

  • What I did last week (since the last meeting),
  • What I plan to do in the next week.

No questions until we get through everyone :)

Christian: worked on Apache SSHD framework to build a new admin shell. Have basic 'hello world' working system. Now trying to get dCache to compile. Got a 64-bit virtual machine; trying to install 1.9.9.

Tanja: holidays, preparing for NFS meeting ("bake-a-thon") in Boston.

Owen: EMI-ing, virtual machine updating. Managing testbeds. Got dCacheConfigure to (finally) work. It is ready to merge with dCache.

Paul: dcap++; almost ready for review.

Jan: EMI-ing, helped with Antje, 1.9.10 web-admin features, talk on web-admin / Wicket.

Antje: EMI-ing, installed 1.9.10 on a fresh virtual machine. Finally found out all the configuration .. tests are all running and passing. Need to repeat this on the next virtual machine, and document the process.

Dmitry: OSG forum .. reports on SRM scalability tests. General impression: people who have invested time in getting .. ; people installing dCache for the first time are frustrated and try something else.

Noted dcap++ questions.

Move PNFS from one place to somewhere .. public dCache service degradation with PNFS timeouts. File-locality patch for 1.9.5 now in RB.

Thomas: was on vacation .. today fixed a bug.

Tigran: mostly merging stuff into 1.9.10 prior to release.

Plans for patch-releases

Should we make a new patch release?

Today or tomorrow Tigran plans to merge Dmitry's fix for 1.9.5 and release. He expects to have 1.9.5-23 released tomorrow or Friday. On Friday we'll have 1.9.10-1.

SRM-client

Owen releases dcap client but who releases srm-client?

What is the procedure for releasing srm-client?

Add target to build the client, like we do currently for the server.

For web releases, srm-client is tested only against our supported dCache releases. Testing against dpm, castor, etc. is done by Owen and Jan when releasing into gLite / EMI.

Owen + Tigran between them will add ability to make srm-client RPMs with build system.

Trunk activity

Progress with new features...

Tigran hopes to be ready with "new" pool by the end of the week; this is the "threadless movers" patch.

Pool names

Motivation is:

  1. configuration system can discover if a pool is already configured .. and has been run with a name.
  2. solve the issue of inconsistency between cacheinfo (Chimera or companion) and the pool name

We can (as a short-term work-around) add a note to the Book about being careful not to rename pools ("we" -> Thomas).

Issues from yesterday's Tier-1 meeting

Issues from KIT

There's currently one issue from KIT

CMS headnode

"Our CMS head-node crashed on Saturday" Ticket ...

"Crash" here means that dCachedDomain was using lots of memory (~6 GiB) and the machine needed to be restarted by operators (nothing actually crashed).

PoolManager eating memory .. pool cannot flush (due to filesystem issue). We think these are related, but we don't have log files or any strong evidence of this.

Heap-dump is useful information to gather.

Paul to remind Doris for the information.

Tigran: we should add a JVM option to generate heap-dump on out-of-memory. Tigran to do this.
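
A minimal Java sketch of what this option amounts to, assuming a HotSpot-based JVM. On the command line the flag is `-XX:+HeapDumpOnOutOfMemoryError` (optionally combined with `-XX:HeapDumpPath=<dir>`); where it would be added in the dCache start-up scripts is not recorded in these minutes. The class name `HeapDumpOnOom` is hypothetical and not part of dCache; it only reads the flag and switches it on at run time through the JDK's `HotSpotDiagnosticMXBean`.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper (not dCache code): inspect and enable heap-dump-on-OOM. */
final class HeapDumpOnOom {
    /** Name of the HotSpot flag, without the -XX:+ prefix. */
    static final String FLAG = "HeapDumpOnOutOfMemoryError";

    /** Platform MXBean that exposes HotSpot VM options (HotSpot JVMs only). */
    static HotSpotDiagnosticMXBean bean() {
        return ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    }

    /** True if the running JVM will write a heap dump on OutOfMemoryError. */
    static boolean isEnabled() {
        return Boolean.parseBoolean(bean().getVMOption(FLAG).getValue());
    }

    /** Switch the flag on in the running JVM. */
    static void enable() {
        bean().setVMOption(FLAG, "true");
    }

    public static void main(String[] args) {
        System.out.println(FLAG + " at start-up: " + isEnabled());
        enable();
        System.out.println(FLAG + " after enable(): " + isEnabled());
    }
}
```

Setting the flag at start-up is the simpler route; the run-time switch is only useful for a domain that is already running and cannot be restarted.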

Issues from PIC

There is currently one issue from Gerard.

xrootd issue

Gerard's Tier-3 ATLAS users notice problems reading from dCache using ROOT + xrootd-client.

RT 5859: xrootd not working when multiple streams

Thomas would have to look into the source-code.

Can you run 1.9.5, Thomas? ... yeah, kinda. We can run the job against old and new xrootd implementations and see what happens.

Versioning is difficult: ROOT version (experiment), xrootd client version.

The problem is triggered when the client opens two files in parallel (not sequentially).

Tigran to try after the meeting ... he'll keep Thomas up-to-date with what he finds.

Issues from Triumf

There's currently one issue from Triumf

hanging pool-to-pool

Simon has reported problems where pool-to-pool transfer(s) hang, resulting in transfers hanging.

RT 5824: One hot file access case

Tigran: ... maybe I remember. If you have multiple clients (or a single one) that trigger a p2p and the p2p fails, then the p2p goes into the restore queue. Depending on configuration, this request may go into the suspended state. It is retried only if a pool says "I'm here" or if a timer goes off (rc set retry timeout kinda thing).
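
The retry rule Tigran recalls can be sketched as follows. This is an illustration of the description above only, not PoolManager code: the class `RestoreRetrySketch`, its fields and the millisecond clock are all assumptions made for the example.

```java
/** Hypothetical model of a suspended restore request (not PoolManager code). */
final class RestoreRetrySketch {
    private final long suspendedAtMillis;   // when the request was suspended
    private final long retryTimeoutMillis;  // assumed stand-in for the retry timer

    RestoreRetrySketch(long suspendedAtMillis, long retryTimeoutMillis) {
        this.suspendedAtMillis = suspendedAtMillis;
        this.retryTimeoutMillis = retryTimeoutMillis;
    }

    /**
     * A suspended request is retried only when a pool has announced itself
     * ("I'm here") or when the retry timer has expired.
     */
    boolean shouldRetry(boolean poolAnnounced, long nowMillis) {
        boolean timerExpired = nowMillis - suspendedAtMillis >= retryTimeoutMillis;
        return poolAnnounced || timerExpired;
    }
}
```

Under this model a request stays suspended for as long as no pool announces itself and the timer has not fired, which would match the hanging transfers described in the ticket.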

Tanja: to check ticket 5824 and ask Simon for PoolManager.conf file if it's not already there.

Tags

Reminder: Tigran, problem with enstore and Chimera.

Tigran to send Dmitry a chimera.jar file after meeting.

If a user has permission to write into a directory, then that user may create new tags there; a new tag is owned by that user (with permissions 0644).

To update an already existing tag, the user must have write-permission on the tag.
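
The two rules above can be sketched in Java as follows. This is a simplified model, not Chimera code: the names `TagPermissionSketch`, `Directory`, `Tag` and `setTag` are hypothetical, and only the owner and "other" permission bits are modelled (groups are ignored).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the tag permission rules (not Chimera code). */
final class TagPermissionSketch {
    static final class Tag {
        final String owner;
        final int mode;
        String value;

        Tag(String owner, int mode, String value) {
            this.owner = owner;
            this.mode = mode;
            this.value = value;
        }
    }

    static final class Directory {
        final String owner;
        final int mode;
        final Map<String, Tag> tags = new HashMap<>();

        Directory(String owner, int mode) {
            this.owner = owner;
            this.mode = mode;
        }
    }

    /** Simplification: checks the owner write bit or the "other" write bit only. */
    static boolean canWrite(String user, String owner, int mode) {
        int writeBit = user.equals(owner) ? 0200 : 0002;
        return (mode & writeBit) != 0;
    }

    /** Returns true if the tag was created or updated, false if permission is denied. */
    static boolean setTag(Directory dir, String user, String name, String value) {
        Tag existing = dir.tags.get(name);
        if (existing == null) {
            // Creating a tag needs write permission on the directory;
            // the new tag is owned by the creating user, with mode 0644.
            if (!canWrite(user, dir.owner, dir.mode)) {
                return false;
            }
            dir.tags.put(name, new Tag(user, 0644, value));
            return true;
        }
        // Updating an existing tag needs write permission on the tag itself.
        if (!canWrite(user, existing.owner, existing.mode)) {
            return false;
        }
        existing.value = value;
        return true;
    }
}
```

One consequence of mode 0644 in this model is that even the directory owner cannot update a tag that another user created.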

Dmitry to test Tigran's chimera.jar. If it is what they want then Tigran will merge the patch into stable dCache branches.

Pool tags

You may use them during selection. The "host" pool tag is documented. We don't do pool-to-pool to hosts that have the same "host" pool tag.

Pool tags are labels one can add to a pool. They are arbitrary key-value pairs .. we use "host" as the host-name of the pool, but other tags can be added .. for example, adding the rack within which the host is placed.

This should be documented.
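
As a starting point for that documentation, here is a small Java sketch of pool tags as key-value pairs and of the "no pool-to-pool onto the same host" rule. It is not the actual pool selection code: `PoolTagSketch`, `Pool`, `tags` and `p2pCandidates` are hypothetical names, and the "rack" tag is the example from the discussion rather than an existing convention.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of pool tags (not the dCache pool selection code). */
final class PoolTagSketch {
    static final class Pool {
        final String name;
        final Map<String, String> tags;  // arbitrary key-value pairs

        Pool(String name, Map<String, String> tags) {
            this.name = name;
            this.tags = tags;
        }
    }

    /** Builds a tag map from alternating keys and values. */
    static Map<String, String> tags(String... keyValues) {
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            map.put(keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    /**
     * Returns the pools that may receive a pool-to-pool copy from source:
     * every other pool whose value for tagKey differs from the source's.
     */
    static List<Pool> p2pCandidates(Pool source, List<Pool> pools, String tagKey) {
        String sourceValue = source.tags.get(tagKey);
        List<Pool> candidates = new ArrayList<>();
        for (Pool pool : pools) {
            if (pool == source) {
                continue;
            }
            if (sourceValue != null && sourceValue.equals(pool.tags.get(tagKey))) {
                continue;  // same host (or rack): skip
            }
            candidates.add(pool);
        }
        return candidates;
    }
}
```

Calling `p2pCandidates(source, pools, "host")` models the documented behaviour; passing `"rack"` instead shows how an additional tag could be used in the same way.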

Maintenance of book

New people document changes, but we're changing too fast for this.

Most of the features are not documented; e.g., we have WebDAV that isn't documented (WebDAV isn't alone in this).

One thing that might help is moving the book into the dCache code-base. This would allow us to prevent patches that touch functionality without updating the Book.

Could add a policy line saying people should update the dCache book when they touch functionality.

When you start from scratch .. and it doesn't work. Someone knowledgeable looks through the logs to see what is the problem. The person with the dCache doesn't know enough to update the Book or Troubleshooting guide.

Certificates are also poorly documented at the moment: what needs to be where.

People can put changes into ReviewBoard. It doesn't matter if the patch isn't perfect .. people will review the patch to ensure it makes sense.

Tigran: you have to be hard: go through procedure until it breaks and don't proceed until the book is updated; then, when the book is updated, reset the machine and start from scratch.

Paul to circulate slides from the dCache Book talk.

Antje: haven't really documented yet, but as soon as it's working (it's repeatable) Antje will push changes into the book.

Make a video about how to install dCache ... Patrick is keen on videos. The problem is that we'd have to keep these videos up-to-date, too.

Leave the videos for now, and focus on getting the Book up-to-date.

Who doesn't have this problem? Big companies ..

Comments by site-admins/end-users? We have a prototype system that allows comments. Paul tries to get this working.

Could use (some of) the comments as starting point for future troubleshooting items.

Troubleshooting guide

Paul to circulate a dummy version of document for people to look at.

We can also search in the user-forum for common problems.

We need a person who feels responsible for this, to keep the document up-to-date: Antje, as QA person.

Perhaps we could work on documentation during the next dev. workshop.

Logging of IP addresses in SRM

Thomas promised, two weeks ago. Unauthorised user more-or-less attacking them. Reported by Doris. The proper way of logging this would be to set the Origin principal in the Subject.

No direct connection between the connector and the door component. This is a common problem: also seen with web-admin; Jan's solution is to store a reference to the cell in JNDI and obtain the reference using JNDI. Tigran: in Grizzly, the WorkingThread is a subclass of Thread, which has attributes. By explicitly casting the Thread, you can get all the extra information.

Put the IP information into the logging MDC. That way, the IP address will be logged.
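
A minimal Java sketch of the two ideas combined: a Thread subclass that carries the client address, and a per-thread logging context that is filled from it. Everything here is hypothetical: `RequestThread` stands in for the connector's worker thread class, and the `ThreadLocal` stands in for the logging framework's MDC (with SLF4J one would call `MDC.put` and `MDC.remove` instead, assuming that is the logging API in use).

```java
/** Hypothetical sketch (not SRM code): pass the client IP to the log context. */
final class ClientAddressSketch {
    /** Stand-in for a connector worker thread that carries extra attributes. */
    static final class RequestThread extends Thread {
        volatile String remoteAddress;

        RequestThread(Runnable task) {
            super(task);
        }
    }

    /** Stand-in for the logging MDC: one value per thread. */
    static final ThreadLocal<String> CLIENT_IP = new ThreadLocal<>();

    /** Casts the current thread to obtain the address, or null if unavailable. */
    static String currentRemoteAddress() {
        Thread t = Thread.currentThread();
        return (t instanceof RequestThread) ? ((RequestThread) t).remoteAddress : null;
    }

    /** Formats a log line with the client IP taken from the per-thread context. */
    static String format(String message) {
        String ip = CLIENT_IP.get();
        return (ip == null ? "[unknown]" : "[" + ip + "]") + " " + message;
    }

    /** Fills the context for the duration of one request, then clears it. */
    static String handle(String message) {
        CLIENT_IP.set(currentRemoteAddress());
        try {
            return format(message);
        } finally {
            CLIENT_IP.remove();
        }
    }

    /** Runs handle() on a RequestThread carrying the given address. */
    static String handleOnRequestThread(String ip, String message) {
        final String[] result = new String[1];
        RequestThread t = new RequestThread(() -> result[0] = handle(message));
        t.remoteAddress = ip;
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }
}
```

Clearing the context in a `finally` block matters when worker threads are pooled; otherwise the address of one client could leak into log lines of the next request.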

Thomas: I can look into what Jan did.

dcap++

Customer Uni. of Chicago would like the patch released. Would like to be able to modify the size of the buffer.

Did they try the patch, or just heard the rumours that it fixes everything?

We need to make sure it doesn't break other users.

Dmitry: Charles gave a talk at OGF on dcap++.

There's the dccp man page and the API web-pages that need to be updated.

Outstanding RT Tickets

[This is an auto-generated item. Don't add items here directly]

RT 5572: Re: SRM dCache needs restart after lcg-CA update?!?

Can't fix this with the old SRM.

Since 1.9.9, jetty is the default. Since 1.9.10, we have the async. connector, so reloading is supported.

RT 5756: Problem of locality

Fixed, will be merged in 1.9.5, we can then close the ticket.

TODO: remove Discuss-DEV label.

RT 5824: One hot file access case

Already discussed: remove Discuss-DEV label.

RT 5846: Clean-up in table srmspacefile

Remove Discuss-DEV label.

RT 5851: PNFS problem on dcache pool nodes

Q: who supports PNFS?

He's going to install 2.6.31 on a pool. If that works, he'll upgrade the remaining.

Owen suggests also removing the "sync" option.

CDF have Scientific Linux 5.1 on the new hardware; it's not working. Two systems: "public" and "CDF" dCache run 2.6.18-1.nnn and don't see the problem. CMS runs 2.6.31.

Remove Discuss-DEV label.

Review of RB requests

Dmitry promises to look at Gerd's four SRM/SURL patches.

Dmitry happy with Tigran's comments about his patch for 1.9.5. The file written to the write-pool isn't available. The location is likely fresh ... doesn't matter.

If you disable a pool, you will also get SRM "UNAVAILABLE".

DTNM

Proposed: same time, next week.