[part of a series of meetings]
Participants
Jan, Paul, Tigran, Karsten, Antje, Tanja, Christian; Dmitry, Al, X
Tigran: last week at NetApp. Testing the NFS v4 implementation with vendors, in particular with RH 6.2. RH 6.2 had some problems that are now fixed.
Karsten: we have/had a Debian build machine; now rebuilt. Java-7 bug-fix needed for Debian.
Tanja: tickets, and fixing things (patches)
Al: complete Jaida implementation; new billing cell; write a new version of the HTTP servlet that will use Jaida. This will work inside our Jetty / http cell .. the old one. The old one is using
Dmitry: progress on request storage via DataNucleus. Have also implemented a memory-only Job DAO that doesn't depend on a database. This isn't noticeable .. copy isn't really working yet. Due to invocation of the client, managing state isn't trivial. Completed tests on enstore with Chimera .. CHEP conference "enstore working fine with Chimera."
Sasha: trying to get involved in the process .. replacing Gene as participant from management.
Gerd: finished the xrootd refactoring. Now have a stand-alone version (does file listing). Still need to do a little bit of work on loading the dCache plugins; code is a bit cleaner now. Maven-ising dCache. Built something that was installable and could be started. Started writing a script to do this maven conversion process on-demand. Updating the unit-tests to remove hard-coded paths that cause 108--~20 failing. Created a path in svn to demo the results. Looking good: lots of old stuff kicked out.
Christian: trying to build our stuff in ETICS. Work on srm probes. Committed ssh-v2 code.
Q: Examples for jython that interact with the admin interface. Do you check these? The jython uses SSH connector from python, which should work out-of-the-box (assuming it supports SSH v2).
Antje: documentation.
Jan: deeper into the import functionality; some fixes.
Paul:
Agenda
[see box on the right-hand side]
Postcards
Up to two minutes (uninterrupted) per person where they can answer two questions:
- What I did last week (since the last meeting),
- What I plan to do in the next week.
No questions until we get through everyone :)
Plans for patch-releases
Should we make a new patch release?
We will have a 1.9.12 release next week, 2.0.1 and maybe a 1.9.5.
Trunk activity
Progress with new features...
Pulling xroot plugins out of the dCache tree: how to distribute them?
We need to have a new package in EMI that supports these plugins.
Two:
installations that already have the plugin; after upgrade their dCache instance won't work. We need to be clear how these sites obtain their xroot plugin(s).
Gerd is retracting the merge request, only changed in 2.1-series.
What happens if we want to change the API? What do we do if the plugin authors don't update their plugin to match our new API?
Sites looking at how to develop the namespace mappings.
Our site should maintain a list of available plugins.
Keep in mind this is a public API and any changes may break plugins.
Changes to API (if ever) are announced on user-forum.
C.f. gPlazma: this has plugins too.
Distribute existing plugins?
Who hosts the plugins? Have a plugin repository?
We have to certify them? No.
How does Scala do it? It's very pluggable. Most of it comes from SLAC themselves.
Do we support sites that use a plugin?
Questions:
o RT ticket for support from a site that has installed a plugin: what is our response?
o RT ticket asking for the CMS plugin: what do we answer?
The main reason for doing this (beyond the wishful thinking ..): ALICE have switches in their software to correct for dCache bugs. We need a much faster release cycle to keep up with them.
Import tool
If metadata of a file is fetched from the namespace .. you're not updating any of the data. New approach: use createEntry on the repository; then it should work.
Issues from yesterday's Tier-1 meeting
Tigran has a couple of ideas:
o Couldn't kill movers: could be that the kill hits the wrong mover.
o Movers that hang never get a connection: can't kill a mover if the client never connects.
Will send a dtrace script to identify what's happening with interrupting the cell.
IN2P3 have an issue with connecting to their SRM: RT ticket xxxx
Changing cell names will break scripts; plus we probably still have some hard-coded references (e.g., "PnfsManager").
Could we support aliases? It's just an idea: not implemented yet.
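The alias idea could look roughly like the sketch below. It is only an illustration in Python, not dCache code: the CellAliases class and the new name "NamespaceManager" are made up for the example; only "PnfsManager" comes from the discussion above.

```python
class CellAliases:
    """Hypothetical alias table: old cell names keep resolving after a rename."""

    def __init__(self):
        self._aliases = {}  # alias -> canonical cell name

    def add(self, alias, canonical):
        if alias == canonical:
            raise ValueError("alias must differ from the canonical name")
        self._aliases[alias] = canonical

    def resolve(self, name):
        # Follow alias chains, guarding against cycles.
        seen = set()
        while name in self._aliases:
            if name in seen:
                raise ValueError(f"alias cycle involving {name!r}")
            seen.add(name)
            name = self._aliases[name]
        return name


aliases = CellAliases()
# "NamespaceManager" is an invented new name, for illustration only.
aliases.add("PnfsManager", "NamespaceManager")
```

With such a table, a script that still addresses "PnfsManager" would be routed to the renamed cell, while names without an alias resolve to themselves.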
Gerd can add binary mode to the SSH v2 GUI, but someone needs to update the GUI to use ssh v2.
Issues from EMI
Padova: RPM lint errors are to be fixed in the releases.
What kind of errors do we have?
AP/ Christian to post to team a link to the ETICS
AP/ Christian to ask about changing the package name.
Outstanding RT Tickets
[This is an auto-generated item. Don't add items here directly]
Ticket 6671.
Need to adjust our defaults. The things to recommend are the number of partitions and the number of connections.
Each service has its own BoneCP instance. Several components have their own hard-coded maxNumber of connections.
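To see why the defaults matter: a BoneCP pool can open up to (number of partitions) x (max connections per partition) connections, and every service brings its own pool. The sketch below adds these up. The per-service numbers are invented for illustration, and the limit of 100 is an assumption (it is PostgreSQL's default max_connections); real values would come from the site's configuration.

```python
def pool_size(partitions, max_per_partition):
    """Upper bound on connections one BoneCP pool can open."""
    return partitions * max_per_partition


def total_demand(services):
    """Sum the upper bounds over all services, each with its own pool."""
    return sum(pool_size(p, m) for p, m in services.values())


# Hypothetical (partitions, max connections per partition) per service.
services = {
    "PnfsManager": (2, 10),
    "SpaceManager": (2, 15),
    "PinManager": (2, 10),
    "billing": (3, 20),
}

DB_LIMIT = 100  # assumed server-side limit (PostgreSQL's default max_connections)

demand = total_demand(services)
if demand > DB_LIMIT:
    print(f"worst case {demand} connections exceeds the limit of {DB_LIMIT}")
```

The point of the exercise is that each pool looks modest on its own; it is the sum over services that has to fit under the database limit.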
Go (slowly) in the direction of having a single domain per node and starting/stopping services within that domain. The reason for domains is that an OutOfMemory doesn't kill other cells. We can't use a single domain until we have this isolation. For example, using OGSI or some application server.
Main motivation at DESY is to restart single domain without affecting the others. Split disk into 10 RAID arrays, rebuild just one pool .. although we can restart a cell .. we could add the ability to stop/start a service. However, we can't enforce this, unlike an application server.
Maybe we need to have this for version v3.0
RT 6561: dCache 1.9.12-8: WebDav door with constant CPU activity despite doing nothing
AP/ Paul to ask Gerard ... (suspect that it's something in Jetty). The other problem that forced us to downgrade. Gerd found lots of references to people complaining about high CPU usage.
Could be related to updating the CRL?
Gives the impression that something is happening out of cron (e.g., CRL update, port scan, ..).
Gerd: need to check whether latest version has really fixed bug. Initially they claimed to fix it but didn't.
If a port-scanner connects to our GridFTP server then it leaves ports in CLOSE_WAIT.
Port scans are using nmap; Paul to circulate the settings.
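One way to check whether a scan has left sockets behind is to count connections in the CLOSE_WAIT state per local port. The sketch below parses text in the column layout of Linux "netstat -tan" (proto, recv-q, send-q, local address, foreign address, state). The sample lines are made up, using documentation addresses and assuming the GridFTP door listens on the usual port 2811.

```python
from collections import Counter


def count_close_wait(netstat_output):
    """Count sockets in CLOSE_WAIT per local port from `netstat -tan`-style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "CLOSE_WAIT":
            continue
        # The local address is "host:port"; take the part after the last colon.
        port = int(fields[3].rsplit(":", 1)[1])
        counts[port] += 1
    return counts


# Invented sample input, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:2811         198.51.100.7:40112      CLOSE_WAIT
tcp        0      0 192.0.2.10:2811         198.51.100.7:40113      CLOSE_WAIT
tcp        0      0 192.0.2.10:22           198.51.100.9:51500      ESTABLISHED
tcp        0      0 0.0.0.0:2811            0.0.0.0:*               LISTEN
"""

leaked = count_close_wait(SAMPLE)
```

Running this before and after a scan with the circulated nmap settings would show whether the count on the door's port grows.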
RT 6688: feature request: collection of files
It's about being able to run commands on collections of files; e.g., migrating files to another pool or rebalancing them.
We have another request: how to unpin all files in the pool.
Cumbersome feature to provide: a very poor-man's version of Gerd's container idea.
One use-case: users provide a list of files (to admin) that they are going to work on.
Gerd: This seems like something that should be done with a script.
This is something that can be easily implemented as a front-end system.
If pool-manager is trying to respect these collections (e.g., spreading the files over different pools) then we should have something more well thought-out.
Why repinfoof in UserAdminShell rather than PnfsManager? Most operations in central shell, rather than specific
Tigran: Only 10--15 commands that people use on a daily basis. Have these in a single place.
Gerd's worried we're painting ourselves into a corner: if it's in UserAdminShell then it's there for ever.
Could maybe have an alias system, so that a command at the UserAdminShell is relayed to PnfsManager?
Component view?
Idea of having a directory as the collection.
ChangePolicyForSpace: too easy to tell dCache to recall a PB of data.
Directory might not reflect the
CMS: file-family-tag.
Tigran: this should be part of dCache's core functionality to discover access patterns and to move files accordingly; e.g., discovering data-sets and redistribute when clients read from this dataset then try to
Danger is adding a command for each use-case.
If the data is being read from the same tape then can we spread the files over multiple pools?
While a tape is loaded, try to read the files on different pools, to anticipate that files on the same tape are part of the same data-set.
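As a rough illustration of the spreading idea (not an existing dCache feature), files could be assigned to pools round-robin per tape, so that files staged from one tape end up on different pools. The function name, the tape labels, and the pool names below are all hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def spread_by_tape(files, pools):
    """Assign each (file_id, tape_label) to a pool, round-robin within each tape."""
    if not pools:
        raise ValueError("need at least one pool")
    # One independent round-robin iterator per tape.
    next_pool = defaultdict(lambda: cycle(pools))
    placement = {}
    for file_id, tape in files:
        placement[file_id] = next(next_pool[tape])
    return placement


# Invented example: three files on tape T1, one on tape T2.
files = [("f1", "T1"), ("f2", "T1"), ("f3", "T1"), ("f4", "T2")]
placement = spread_by_tape(files, ["pool_a", "pool_b"])
```

This ignores pool load and free space entirely; a real selection would have to weigh those against the spreading.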
Plugin system for PnfsManager allows us to write experimental stateful systems. ATLAS uses a single directory, same storage group. Some files (subset) should be distributed because (they claim) the performance isn't good enough.
Main problem is that they buy boxes when the .. rebalance..
Is the random distribution good enough?
Need a Wizard that says "might be time to run rebalance" ... something for Jan.
RT 6700: FHS and deb packages
We have "dcache" for FHS and "dcache-server" for the /opt/d-cache layout.
SRM client; "dcache-client" --> "dcap-client"
Installing in /opt is triggering RPM-lint errors.
Pulling dcap code out of dCache? Yes.
Review of RB requests
Pseudo-filesystem patch for NFS controlling exports.
DTNM
Same time, next week.
