Example xrootd setup and configuration - SLAC


The SLAC/BaBar xrootd system

SLAC operates the largest xrootd system in BaBar and uses many of its features to serve data to O(2000) client jobs from tens of data servers (with dynamic staging of data from HPSS mass storage when necessary). The applications reading data from this system are a mixture of user analysis and production skimming jobs. Here is a schematic of the system:

[Figure: schematic of the SLAC xrootd system]

(Note: all of this is an Internet Free Zone (IFZ) and hence not visible to the outside world.)

The redirector(s)

The redirector in such a large system is a critical element. In order to support such a large number of clients and provide some redundancy to allow the redirector machine(s) to go down periodically for updates, SLAC uses DNS-style round-robin load balancing. All client jobs are configured to open files via "bbr-rdr-a":

root://bbr-rdr-a//store/..../aaaFile.root
Behind the scenes, however, this name is DNS load-balanced across two separate physical servers (bbr-rdr03 and bbr-rdr04):

noric15> nslookup bbr-rdr-a
Server:  ns4.slac.stanford.edu
Address:  134.79.18.41

Name:    bbr-rdr-a.slac.stanford.edu
Addresses:  134.79.85.23, 134.79.85.24

noric15> nslookup 134.79.85.23
Server:  ns4.slac.stanford.edu
Address:  134.79.18.41

Name:    bbr-rdr03.slac.stanford.edu
Address:  134.79.85.23

noric15> nslookup 134.79.85.24
Server:  ns4.slac.stanford.edu
Address:  134.79.18.41

Name:    bbr-rdr04.slac.stanford.edu
Address:  134.79.85.24
If the TXNetFile client is configured to connect to "bbr-rdr-a" (as above), it will automatically choose at random between the two DNS possibilities, bbr-rdr03 and bbr-rdr04. The olbd on each data server registers with the manager olbd on both bbr-rdr03 and bbr-rdr04, so both are capable of redirecting to any data server in the system (full redundancy). If the client tries one of the two redirector machines and does not succeed in connecting to it, it simply tries the other one.
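
The client-side behaviour described above (pick one of the DNS addresses at random, fall back to the other if the connection fails) can be sketched in Python. This is only an illustration of the idea, not the TXNetFile implementation: the function names resolve_all and connect_any, the 5 second timeout and the port number passed in by the caller are all assumptions made for this example.

```python
import random
import socket


def resolve_all(hostname, port):
    """Return the distinct IPv4 addresses that DNS gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def connect_any(addresses, port, connect=socket.create_connection, rng=random):
    """Try the addresses in random order and return (address, connection).

    Raises OSError if none of the addresses accepts a connection.
    The timeout of 5 seconds is an arbitrary choice for this sketch.
    """
    candidates = list(addresses)
    rng.shuffle(candidates)          # random choice among the DNS entries
    last_error = None
    for address in candidates:
        try:
            return address, connect((address, port), timeout=5)
        except OSError as err:       # this redirector is down, try the next one
            last_error = err
    raise OSError(f"no redirector reachable among {candidates}") from last_error
```

A caller would combine the two, for example connect_any(resolve_all("bbr-rdr-a", port), port) with whatever port the redirector listens on. The connect and rng parameters exist only so that the selection logic can be exercised without a network.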

In practice, reasonably low-end hardware can be used for the redirector machines. The specifications of the two SLAC bbr-olb* machines are:

The maximum number of file descriptors is increased from the default (I think it is 1024 for Solaris) in order to allow many clients to have an open connection to the redirector if necessary.
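
How the file descriptor limit is raised on the SLAC machines is not described here. As a rough illustration of what "increasing the maximum number of file descriptors" means for a single process, the following Python sketch inspects and raises the per-process soft limit through the standard resource module. The function name raise_fd_limit is an assumption for this example.

```python
import resource


def raise_fd_limit(wanted):
    """Raise the soft limit on open file descriptors towards `wanted`.

    Without extra privileges the soft limit can only be raised as far as
    the hard limit, so the request is capped there. The limit is never
    lowered. Returns the resulting (soft, hard) pair.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft, hard            # already high enough, nothing to do
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

A server that expects many simultaneous client connections would call something like this (or the shell equivalent, ulimit -n) before it starts accepting connections.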

[Early versions of the TXNetFile client, used for the first half of 2004 at SLAC, systematically chose the first of these (bbr-rdr03) and would only fail over to bbr-rdr04 if bbr-rdr03 was unavailable. This bug has been fixed in more recent client versions, but it does demonstrate that in terms of scaling even one of these machines was capable of handling rather larger redirection loads.]

The data servers

The specifications of the SLAC kanNNN data servers are:

The maximum number of file descriptors is increased from the default (I think it is 1024 for Solaris) in order to allow many clients to have an open connection to the data server if necessary. It is less relevant here than for the redirector, but the change should not cause any problems.

The 5 TB of disk space is set up as 6 separate filesystems, for example:

/kanga             (/dev/dsk/c2t1d0s6 ):86810656 blocks  1356395 files
/kanga/cache1      (/dev/dsk/c2t1d1s6 ):82408992 blocks  1287629 files
/kanga/cache2      (/dev/dsk/c2t1d2s6 ):88150720 blocks  1377341 files
/kanga/cache3      (/dev/dsk/c2t1d3s6 ):87299232 blocks  1364030 files
/kanga/cache4      (/dev/dsk/c2t1d4s6 ):88702720 blocks  1385973 files
/kanga/cache5      (/dev/dsk/c2t1d5s6 ):86944112 blocks  1358485 files
There is more on this in the section "The cache filesystem(s)" below.

The xrootd and olbd configurations

The actual config files used at SLAC (plus annotated versions with explanations) are:

The same config file is used for the xrootd and the olbd on a given machine. Wrapper scripts are used to actually start the servers; the effective command line options (ignoring paths to the executables and config files) are:

The cache filesystem(s)


/kanga             mountpoint for filesystem 0
/kanga/cache0      simple directory in filesystem 0
/kanga/cache1      mountpoint for filesystem 1
/kanga/cache2      mountpoint for filesystem 2
/kanga/cache3      mountpoint for filesystem 3
/kanga/cache4      mountpoint for filesystem 4
/kanga/cache5      mountpoint for filesystem 5
(More coming soon....)

Dynamic staging and HPSS

(Coming soon....)

Restarting crashed xrootd and olbd servers

The system is now past its commissioning phase and relatively stable. There is, of course, always the possibility of rare bugs that cause one of the xrootd or olbd daemons to crash. The client is (by design) robust enough to survive this: it simply backs off and periodically tries again to connect to the server that is down. In these cases the best thing to do is (a) to save the core file (for examination by the developers) and (b) to arrange for some process to check for and restart any xrootd or olbd daemons that have gone down.

SLAC uses its own custom system "Ranger" to check that a daemon is running and restart it if it is not. This check happens once every 15 minutes for both daemons (the xrootd and the olbd) on all of the relevant machines.
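
Ranger itself is SLAC-specific and is not shown here. For sites without such a tool, the "check and restart" idea can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the function names, the use of pgrep to detect a running daemon, and the start commands, which would have to be replaced by the site's own wrapper scripts.

```python
import subprocess
import time


def is_running(name):
    """Return True if a process with exactly this name exists (uses pgrep)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def check_and_restart(daemons, is_running=is_running, start=subprocess.Popen):
    """Restart every daemon in `daemons` (name -> command list) that is down.

    Returns the list of names that had to be restarted.
    """
    restarted = []
    for name, command in daemons.items():
        if not is_running(name):
            start(command)           # hypothetical start command for this daemon
            restarted.append(name)
    return restarted


def watch(daemons, interval_seconds=15 * 60):
    """Run the check forever, once per interval (15 minutes by default)."""
    while True:
        check_and_restart(daemons)
        time.sleep(interval_seconds)
```

The daemons argument maps a process name to the command that starts it, for example {"xrootd": [...], "olbd": [...]} with the site's own start commands filled in. A real deployment would also want to save any core file before restarting, as recommended above.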

Adding new data to the system

The system described here is intended to provide read access to the large amount of data needed for data analysis in BaBar. An important property of this data is that it is write-once/read-only: it is written by centrally controlled production systems doing data (re)reconstruction, production of simulated data and data/MC skimming, and it is then read many times for various analysis and production purposes. As the "write-once" and "read-many" problems can benefit from different optimizations, we keep them separate: new data flowing into the system passes through a separate set of buffers from those described on this webpage. The new data is written into mass storage (HPSS at SLAC) and from there becomes available for staging and read access via the system described here.

Software installation

SLAC uses the same Solaris 9 binary tarballs that are available from the download page for any given xrootd version. These are placed in an area in AFS at SLAC, and a SLAC-specific distribution method (called "taylor") is used to copy them to the local disk of each server machine (under /opt):

kan002> ls -l /opt/xrootd
drwxr-xr-x   5 bbdatsrv bfactory     512 Jun 10 01:56 20040609-1119
drwxr-xr-x   6 bbdatsrv bfactory     512 Aug 19 16:38 20040819-0249
drwxr-xr-x   6 bbdatsrv bfactory     512 Aug 23 11:47 20040823-1105
drwxr-xr-x   6 bbdatsrv bfactory     512 Aug 31 01:56 20040830-0105
lrwxrwxrwx   1 bbdatsrv br            13 Aug 23 11:47 prod -> 20040823-1105

kan002> ls -l /opt/xrootd/20040823-1105/
drwxr-xr-x   2 bbdatsrv bfactory     512 Aug 23 11:30 bin
drwxr-xr-x   2 bbdatsrv bfactory     512 Aug 23 11:47 etc
drwxr-xr-x   2 bbdatsrv bfactory     512 Aug 23 11:30 lib
drwxr-xr-x   2 bbdatsrv bfactory     512 Aug 23 11:30 utils
Since the full path of the loadable shared libraries (e.g. libXrdOfs.so) is specified in the config file, a soft link "prod" points to the current version so that the same config file can be used as the version changes.
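
How the "prod" link is updated at SLAC is not described here. One common way to switch such a link without a moment in which it is missing is to create the new link under a temporary name and rename it over the old one. The Python sketch below illustrates this; the function name point_prod_at and the temporary suffix ".new" are assumptions for this example.

```python
import os


def point_prod_at(base_dir, version, link_name="prod"):
    """Make base_dir/link_name a relative symlink to the given version directory.

    The new link is created under a temporary name and then renamed over
    the old one, so a reader never finds the link missing.
    Returns the target the link now points to.
    """
    if not os.path.isdir(os.path.join(base_dir, version)):
        raise FileNotFoundError(f"no such version directory: {version}")
    link = os.path.join(base_dir, link_name)
    temporary = link + ".new"
    if os.path.lexists(temporary):
        os.remove(temporary)              # leftover from an interrupted run
    os.symlink(version, temporary)        # relative target, like prod -> 20040823-1105
    os.replace(temporary, link)           # rename over the old link (atomic on POSIX)
    return os.readlink(link)
```

With the directory layout shown above, point_prod_at("/opt/xrootd", "20040830-0105") would move "prod" to the newer version, and the config file, which refers only to the "prod" path, would not need to change.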

Until recently the client-side plugin was simply included in BaBar software releases. It is now included directly in ROOT itself.

Last modified 29-Nov-2004, Peter.Elmer@cern.ch