Ticket #174 (new bugs)

Opened 12 years ago

pool <-> xrootd-door interaction

Reported by: hauki Owned by: omsynge
Priority: major Milestone: 1.8.0-15p6
Component: core Keywords: xrootd
Cc: Sub Version:

Description

The limit on simultaneous connections in the xrootd-door is not coordinated with the maximum number of movers in the pool.

When a pool queues an xrootd request because the maximum number of movers has been reached, the xrootd-door just throws an exception:

CacheException("Pool responded with invalid redirection address, transfer failed")

and leaves an orphaned mover behind in the queue. If the requests are retried faster than the pool works through the files, the queue keeps growing. At one point, for example, the xrootd-door said it knew about 41 logins while the pool already had 120 movers (the record so far is 13000 queued movers). It then happens that multiple movers in the pool are working on the same file:

grep " A "  mover_ls | sort -k 5 # from a snapshot of the admin-interface

10800 A R {Xrootd-madhatter-unknow-66070@xrootd-madhatterDomain:0} 0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@1d184373} bytes=74642734 time/sec=3061 LM=102170
11410 A R {Xrootd-madhatter-unknow-66752@xrootd-madhatterDomain:0} 0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@2e66af7d} bytes=74642734 time/sec=3043 LM=86451
11670 A R {Xrootd-madhatter-unknow-66752@xrootd-madhatterDomain:0} 0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@4bdf8c56} bytes=74642734 time/sec=2947 LM=83444
10610 A R {Xrootd-madhatter-unknow-66070@xrootd-madhatterDomain:0} 0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@ee5d3cc} bytes=74642734 time/sec=3026 LM=105292
21940 A R {Xrootd-madhatter-unknow-71654@xrootd-madhatterDomain:0} 0001000000000000004E62E8 h={diskCacheV111.movers.XrootdProtocol_2@1d350e1a} bytes=0 time/sec=0 LM=0
22040 A R {Xrootd-madhatter-unknow-71664@xrootd-madhatterDomain:0} 0001000000000000004E62E8 h={diskCacheV111.movers.XrootdProtocol_2@230ca63b} bytes=0 time/sec=0 LM=0
21970 A R {Xrootd-madhatter-unknow-71657@xrootd-madhatterDomain:0} 0001000000000000004E62E8 h={diskCacheV111.movers.XrootdProtocol_2@4d964636} bytes=0 time/sec=0 LM=0
22000 A R {Xrootd-madhatter-unknow-71660@xrootd-madhatterDomain:0} 0001000000000000004E62E8 h={diskCacheV111.movers.XrootdProtocol_2@578b4f57} bytes=0 time/sec=0 LM=0
21910 A R {Xrootd-madhatter-unknow-71651@xrootd-madhatterDomain:0} 0001000000000000004E62E8 h={diskCacheV111.movers.XrootdProtocol_2@7e13055} bytes=0 time/sec=0 LM=0
21920 A R {Xrootd-madhatter-unknow-71652@xrootd-madhatterDomain:0} 0001000000000000004E65D0 h={diskCacheV111.movers.XrootdProtocol_2@14fe99c9} bytes=0 time/sec=0 LM=0
22020 A R {Xrootd-madhatter-unknow-71662@xrootd-madhatterDomain:0} 0001000000000000004E65D0 h={diskCacheV111.movers.XrootdProtocol_2@34cb9a81} bytes=0 time/sec=0 LM=0
21980 A R {Xrootd-madhatter-unknow-71658@xrootd-madhatterDomain:0} 0001000000000000004E65D0 h={diskCacheV111.movers.XrootdProtocol_2@442c397b} bytes=0 time/sec=0 LM=0
21950 A R {Xrootd-madhatter-unknow-71655@xrootd-madhatterDomain:0} 0001000000000000004E65D0 h={diskCacheV111.movers.XrootdProtocol_2@6c8f46e0} bytes=0 time/sec=0 LM=0
22050 A R {Xrootd-madhatter-unknow-71665@xrootd-madhatterDomain:0} 0001000000000000004E72F0 h={diskCacheV111.movers.XrootdProtocol_2@54f73897} bytes=0 time/sec=0 LM=0
22010 A R {Xrootd-madhatter-unknow-71661@xrootd-madhatterDomain:0} 0001000000000000004E72F0 h={diskCacheV111.movers.XrootdProtocol_2@799b3b7c} bytes=0 time/sec=0 LM=0
21960 A R {Xrootd-madhatter-unknow-71656@xrootd-madhatterDomain:0} 0001000000000000004E79E8 h={diskCacheV111.movers.XrootdProtocol_2@11cec248} bytes=0 time/sec=0 LM=0
21990 A R {Xrootd-madhatter-unknow-71659@xrootd-madhatterDomain:0} 0001000000000000004E79E8 h={diskCacheV111.movers.XrootdProtocol_2@18b3ea5d} bytes=0 time/sec=0 LM=0
21930 A R {Xrootd-madhatter-unknow-71653@xrootd-madhatterDomain:0} 0001000000000000004E79E8 h={diskCacheV111.movers.XrootdProtocol_2@2fb8b6ad} bytes=0 time/sec=0 LM=0
21900 A R {Xrootd-madhatter-unknow-71650@xrootd-madhatterDomain:0} 0001000000000000004E79E8 h={diskCacheV111.movers.XrootdProtocol_2@3c5cbb02} bytes=0 time/sec=0 LM=0
22030 A R {Xrootd-madhatter-unknow-71663@xrootd-madhatterDomain:0} 0001000000000000004E79E8 h={diskCacheV111.movers.XrootdProtocol_2@791547a7} bytes=0 time/sec=0 LM=0

This is odd, since I guess some of these active movers are already dead as far as the xrootd-door is concerned. So what happens to those?
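The duplicates in the listing above can be counted per file. The following is a minimal Python sketch, not part of dCache, that does the same grouping as the `sort -k 5` above. It assumes that column 1 is the mover ID, column 2 the mover state ("A" being what the grep filters on) and column 5 the file ID; the function names are made up for illustration.

```python
from collections import defaultdict


def movers_per_file(lines):
    """Group mover IDs (column 1) by file ID (column 5) for active movers."""
    by_file = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Assumption: column 2 is the mover state and "A" means active,
        # which is what the grep in the ticket filters on.
        if len(fields) < 5 or fields[1] != "A":
            continue
        by_file[fields[4]].append(int(fields[0]))
    return dict(by_file)


def shared_files(by_file):
    """Keep only the files that are served by more than one mover."""
    return {f: ids for f, ids in by_file.items() if len(ids) > 1}


# Three lines copied from the listing above.
sample = [
    "10800 A R {Xrootd-madhatter-unknow-66070@xrootd-madhatterDomain:0} "
    "0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@1d184373} "
    "bytes=74642734 time/sec=3061 LM=102170",
    "11410 A R {Xrootd-madhatter-unknow-66752@xrootd-madhatterDomain:0} "
    "0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@2e66af7d} "
    "bytes=74642734 time/sec=3043 LM=86451",
    "22050 A R {Xrootd-madhatter-unknow-71665@xrootd-madhatterDomain:0} "
    "0001000000000000004E72F0 h={diskCacheV111.movers.XrootdProtocol_2@54f73897} "
    "bytes=0 time/sec=0 LM=0",
]

for file_id, mover_ids in shared_files(movers_per_file(sample)).items():
    print(file_id, len(mover_ids), mover_ids)
```

Applied to all 20 lines of the listing, this grouping gives 4, 5, 4, 2 and 5 movers for the five file IDs.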

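Looking at the transfer speed, the movers are really slow. For example, this one has moved 74642734 bytes in 3061 seconds, which is about 24 kB/sec: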

10800 A R {Xrootd-madhatter-unknow-66070@xrootd-madhatterDomain:0} 0001000000000000004D57C8 h={diskCacheV111.movers.XrootdProtocol_2@1d184373} bytes=74642734 time/sec=3061 LM=102170

This is not the fault of the disk system:

[root@pacific03 d-cache-pools]# find . -name 0001000000000000004D57C8
./4/pool/data/0001000000000000004D57C8
[root@pacific03 d-cache-pools]# time cat ./4/pool/data/0001000000000000004D57C8 >> /dev/null

real    0m5.623s
user    0m0.106s
sys     0m0.851s
[root@pacific03 d-cache-pools]# du -h ./4/pool/data/0001000000000000004D57C8
895M    ./4/pool/data/0001000000000000004D57C8

which gives about 160 MB/sec (895 MB read in 5.6 seconds).
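For comparison, the two rates can be worked out from the figures quoted in this ticket. This is only a back-of-the-envelope sketch in Python. It assumes that `bytes=` is the number of bytes the mover has transferred so far, that `time/sec=` is its elapsed time in seconds, and that the "895M" reported by `du -h` is 895 MiB and close to the real file size. The `cat` timing may also benefit from the page cache.

```python
MIB = 1024 * 1024


def rate(n_bytes, seconds):
    """Average throughput in bytes per second."""
    return n_bytes / seconds


# Mover 10800 from the listing: bytes=74642734, time/sec=3061.
mover_rate = rate(74642734, 3061)

# Local read of the same file: 895M (du -h) in 5.623 s (time cat).
disk_rate = rate(895 * MIB, 5.623)

print(f"mover: {mover_rate / 1024:.1f} KiB/s")
print(f"disk:  {disk_rate / MIB:.1f} MiB/s")
print(f"ratio: {disk_rate / mover_rate:.0f}x")
```

Under these assumptions the local read is several thousand times faster than the mover, so the disk is unlikely to be the bottleneck.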