Example xrootd setup and configuration - SLAC
The SLAC/BaBar xrootd system
SLAC operates the largest xrootd system in BaBar and uses many of xrootd's features
to serve data to O(2000) client jobs from tens of data servers (with
dynamic staging of data from HPSS mass storage when necessary). The
applications reading data from this system are a mixture of user analysis and
production skimming jobs. Here is a schematic of the system:
(Note: all of this is an Internet Free Zone (IFZ) and hence not visible to
the outside world.)
The redirector(s)
The redirector in such a large system is a critical element. In order to
support such a large number of clients and provide some redundancy to allow
the redirector machine(s) to go down periodically for updates, SLAC uses
DNS-style round-robin load balancing. All client jobs are configured to
open files via "bbr-rdr-a":
root://bbr-rdr-a//store/..../aaaFile.root
Behind the scenes, however, this is DNS load-balanced to two separate physical
servers (bbr-rdr03 and bbr-rdr04):
noric15> nslookup bbr-rdr-a
Server: ns4.slac.stanford.edu
Address: 134.79.18.41
Name: bbr-rdr-a.slac.stanford.edu
Addresses: 134.79.85.23, 134.79.85.24
noric15> nslookup 134.79.85.23
Server: ns4.slac.stanford.edu
Address: 134.79.18.41
Name: bbr-rdr03.slac.stanford.edu
Address: 134.79.85.23
noric15> nslookup 134.79.85.24
Server: ns4.slac.stanford.edu
Address: 134.79.18.41
Name: bbr-rdr04.slac.stanford.edu
Address: 134.79.85.24
If the TXNetFile client is configured to connect to "bbr-rdr-a" (as above) it
will automatically choose randomly between the two DNS possibilities bbr-rdr03
and bbr-rdr04. The olbd on each data server registers with the manager olbd
on both bbr-rdr03 and bbr-rdr04, hence both are capable of
redirecting to any data server in the system (full redundancy). If the client
tries one of the two redirector machines and does not succeed in connecting
to it, it simply tries the other one.
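The client behaviour described above (pick one of the DNS addresses at random, fall back to the other if the connection fails) can be sketched as follows. This is only an illustration in Python, not the TXNetFile implementation: the function names are hypothetical, and the port passed in is left to the caller (1094 is commonly used for xrootd, but that is an assumption here, not something stated on this page).

```python
import random
import socket


def resolve_all(host, port):
    """Return every distinct IPv4 address that DNS lists for host."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def connect_with_failover(host, port, resolve=resolve_all,
                          connect=socket.create_connection, rng=random):
    """Try the resolved addresses in random order.

    Returns (address, connection) for the first address that accepts
    the connection, or raises ConnectionError if none of them does.
    """
    addresses = list(resolve(host, port))
    rng.shuffle(addresses)          # spread clients across the redirectors
    last_error = None
    for address in addresses:
        try:
            return address, connect((address, port))
        except OSError as err:      # this redirector is down, try the next
            last_error = err
    raise ConnectionError(f"no redirector reachable for {host}") from last_error
```

Because the resolver and the connect function are passed in as parameters, the failover logic can be exercised without touching the network.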
In practice, reasonably low-end hardware can be used for the redirector
machines.
The specifications of the two SLAC bbr-olb* machines are:
- dual 440MHz sparc
- 1GB of memory
- 2GB swap
- Solaris9
- maximum number of file descriptors - 16384
The maximum number of file descriptors is increased from the default (I
think it is 1024 for Solaris) in order to allow many clients to have an
open connection to the redirector if necessary.
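A quick way to confirm that a machine really has the raised limit is to query it from the account that runs the daemons. The sketch below uses Python's standard resource module; the helper name fd_limit_ok is hypothetical, and 16384 is simply the value quoted above.

```python
import resource

REQUIRED_FDS = 16384  # value quoted for the SLAC machines


def fd_limit_ok(required=REQUIRED_FDS, limits=None):
    """Return True if the soft open-file limit is at least `required`.

    `limits` is an optional (soft, hard) pair; when omitted, the limit
    of the current process is read with getrlimit.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_NOFILE)
    soft, _hard = limits
    # RLIM_INFINITY means "no limit", which always satisfies the requirement.
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    print("soft/hard limits:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("sufficient:", fd_limit_ok())
```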
[Early versions of the TXNetFile client, used for the first half of 2004
at SLAC, systematically chose the first of these (bbr-rdr03) and would only
fail over to bbr-rdr04 if bbr-rdr03 was unavailable. This bug has been
fixed in more recent client versions, but it does demonstrate that in terms
of scaling even one of these machines was capable of handling rather larger
redirection loads.]
The data servers
The specifications of the SLAC kanNNN data servers are:
- dual 1GHz sparc
- 2GB of memory
- 32GB swap
- Solaris9
- 5TB of disk space
- maximum number of file descriptors - 16384
The maximum number of file descriptors is increased from the default (I
think it is 1024 for Solaris) in order to allow many clients to have an
open connection to the data server if necessary. It is less relevant here
than for the redirector, but the change should not cause any problems.
The 5TB of disk space is set up as 6 separate filesystems, for example:
/kanga (/dev/dsk/c2t1d0s6 ):86810656 blocks 1356395 files
/kanga/cache1 (/dev/dsk/c2t1d1s6 ):82408992 blocks 1287629 files
/kanga/cache2 (/dev/dsk/c2t1d2s6 ):88150720 blocks 1377341 files
/kanga/cache3 (/dev/dsk/c2t1d3s6 ):87299232 blocks 1364030 files
/kanga/cache4 (/dev/dsk/c2t1d4s6 ):88702720 blocks 1385973 files
/kanga/cache5 (/dev/dsk/c2t1d5s6 ):86944112 blocks 1358485 files
There is more on this in the section "The cache filesystem(s)" below.
The xrootd and olbd configurations
The actual config files used at SLAC (plus annotated versions with
explanations) are:
The same config file is used for the xrootd and the olbd on a given machine.
Wrapper scripts are used to actually start the servers; the effective
command line options (ignoring paths to the executables and config files) are:
- Redirectors:
olbd -m -l /var/adm/olbd/logs/olbdlog -c slac_redirector.cf
xrootd -r -l /var/adm/xrootd/logs/xrdlog -c slac_redirector.cf
- Dataservers:
olbd -s -l /var/adm/olbd/logs/olbdlog -c slac_dataserver.cf
xrootd -l /var/adm/xrootd/logs/xrdlog -c slac_dataserver.cf
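The SLAC wrapper scripts themselves are not shown here. As a rough illustration of what such a wrapper has to decide, the following Python sketch builds the command lines listed above from the machine's role. The function name daemon_commands is hypothetical; the flags, log paths and config file names are the ones quoted above.

```python
def daemon_commands(role,
                    olbd_log="/var/adm/olbd/logs/olbdlog",
                    xrootd_log="/var/adm/xrootd/logs/xrdlog"):
    """Return the olbd and xrootd argument lists for a given role.

    role is "redirector" or "dataserver"; both daemons on a machine
    share the same config file.
    """
    if role == "redirector":
        olbd_flag, xrootd_flags, config = "-m", ["-r"], "slac_redirector.cf"
    elif role == "dataserver":
        olbd_flag, xrootd_flags, config = "-s", [], "slac_dataserver.cf"
    else:
        raise ValueError(f"unknown role: {role!r}")
    return {
        "olbd": ["olbd", olbd_flag, "-l", olbd_log, "-c", config],
        "xrootd": ["xrootd", *xrootd_flags, "-l", xrootd_log, "-c", config],
    }


if __name__ == "__main__":
    for role in ("redirector", "dataserver"):
        for argv in daemon_commands(role).values():
            print(" ".join(argv))
```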
The cache filesystem(s)
/kanga          mountpoint for filesystem 0
/kanga/cache0   simple directory in filesystem 0
/kanga/cache1   mountpoint for filesystem 1
/kanga/cache2   mountpoint for filesystem 2
/kanga/cache3   mountpoint for filesystem 3
/kanga/cache4   mountpoint for filesystem 4
/kanga/cache5   mountpoint for filesystem 5
(More coming soon....)
Dynamic staging and HPSS
(Coming soon....)
Restarting crashed xrootd and olbd servers
The system is now past its commissioning phase and relatively stable. There
is of course always the possibility of rare bugs that cause one of
the xrootd or olbd daemons to crash. The client is (by design) robust enough
to survive this: it simply backs off and periodically tries again to connect
to the server which is down. In these cases the best thing to do is (a) to
save the core file (for examination by the developer) and (b) to arrange
for some process to check for and restart any xrootd or olbd daemons which
have gone down.
SLAC uses its own custom system "Ranger" to check that a daemon is running
and restart it if it is not. This check happens once every 15 minutes for
both daemons (the xrootd and the olbd) on all of the relevant machines.
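Ranger is specific to SLAC and is not described here. For sites without such a tool, the check-and-restart idea can be approximated with a small script run periodically (for example from cron every 15 minutes). The Python sketch below is an assumption-laden stand-in, not Ranger: it uses pgrep to test for a process by name, and the function names and command lists are hypothetical.

```python
import subprocess


def is_running(name):
    """Return True if pgrep finds a process whose name is exactly `name`."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def check_and_restart(daemons, probe=is_running, start=subprocess.Popen):
    """Start every daemon in `daemons` that is not currently running.

    `daemons` maps a process name to the argument list used to start it.
    Returns the list of names that had to be restarted.
    """
    restarted = []
    for name, command in daemons.items():
        if not probe(name):
            start(command)
            restarted.append(name)
    return restarted
```

A real deployment would also want to save any core file before restarting and to log each restart, so that repeated crashes do not go unnoticed.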
Adding new data to the system
The system described here is intended to provide read access to the large
amount of data needed for data analysis in BaBar. An important property of
this data is that it is write-once/read-only: it is written by
centrally-controlled production systems doing data (re)reconstruction,
production of simulated data and data/MC skimming, and is then read many times for
various analysis and production purposes. As the "write-once" and "read-many"
problems can benefit from different optimizations, we keep them separate: new
data flowing into the system thus passes via a separate set of
buffers from those described on this webpage. The new data is written into
mass storage (HPSS at SLAC) and from there becomes available for staging
and read access via the system described here.
Software installation
SLAC uses the same Solaris9 binary tarballs which are available from the
download page for any given xrootd version. These are placed in an area in
AFS at SLAC and a SLAC-specific distribution method (called "taylor") is used
to copy them to the local disk of each server machine (under /opt):
kan002> ls -l /opt/xrootd
drwxr-xr-x 5 bbdatsrv bfactory 512 Jun 10 01:56 20040609-1119
drwxr-xr-x 6 bbdatsrv bfactory 512 Aug 19 16:38 20040819-0249
drwxr-xr-x 6 bbdatsrv bfactory 512 Aug 23 11:47 20040823-1105
drwxr-xr-x 6 bbdatsrv bfactory 512 Aug 31 01:56 20040830-0105
lrwxrwxrwx 1 bbdatsrv br 13 Aug 23 11:47 prod -> 20040823-1105
kan002> ls -l /opt/xrootd/20040823-1105/
drwxr-xr-x 2 bbdatsrv bfactory 512 Aug 23 11:30 bin
drwxr-xr-x 2 bbdatsrv bfactory 512 Aug 23 11:47 etc
drwxr-xr-x 2 bbdatsrv bfactory 512 Aug 23 11:30 lib
drwxr-xr-x 2 bbdatsrv bfactory 512 Aug 23 11:30 utils
Since the full path of the loadable shared libraries (e.g. libXrdOfs.so) is
specified in the config file, a soft link "prod" points to the current version
so that the same config file can be used as the version changes.
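Switching "prod" to a new version is a single link update. The sketch below shows one way to do it in Python so that the link never disappears, even briefly: a temporary link is created and then renamed over the old one. The function name point_prod_at and the ".new" suffix are hypothetical; how the link is actually maintained at SLAC is not described here.

```python
import os


def point_prod_at(install_root, version, link_name="prod"):
    """Make install_root/link_name a symlink to the given version directory.

    Returns the link target that is in place afterwards.
    """
    if not os.path.isdir(os.path.join(install_root, version)):
        raise FileNotFoundError(f"version not installed: {version}")
    link = os.path.join(install_root, link_name)
    tmp = link + ".new"
    if os.path.lexists(tmp):        # clean up after an interrupted run
        os.remove(tmp)
    os.symlink(version, tmp)        # relative target, as in "prod -> 20040823-1105"
    os.replace(tmp, link)           # rename over the old link (atomic on POSIX)
    return os.readlink(link)
```

For example, point_prod_at("/opt/xrootd", "20040830-0105") would move the link in the listing above to the newest installed version.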
Until recently the client-side plugin was simply included in BaBar software
releases. It is now included directly in ROOT itself.