Example of xrootd setup at BNL/STAR


The BNL/STAR xrootd system

The Solenoidal Tracker at RHIC (STAR) experiment at the Relativistic Heavy Ion Collider (RHIC), located at Brookhaven National Laboratory (BNL), produces data on the petabyte scale (raw and reconstructed) every year, which poses a serious data management challenge. STAR has 135 TB of disk space spread over 320 analysis nodes at its Tier-0 center at the RHIC Computing Facility (RCF), which offers a great opportunity to use XROOTD and to cope with the heavy load that users' jobs place on these nodes. Please refer here for more information.

Here is a schematic picture of STAR xrootd setup:


[Figure: STAR XROOTD setup]


The olbd daemon/layer (load balancing)


All STAR machines are heavily loaded by users' analysis jobs, which calls for running the additional olbd daemon and, with more than 64 data servers, an additional supervisor layer as well. STAR has 135 TB of disk space spread over 320 data servers. This implies at least 320/64 = 5 supervisors, plus one redundant supervisor node as a failsafe.
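The supervisor count follows from simple arithmetic. A minimal Python sketch, assuming the limit of 64 servers per node quoted above; the function name and the `spares` parameter are illustrative and not part of xrootd:

```python
import math

def supervisors_needed(data_servers, fanout=64, spares=1):
    """Return the number of supervisor nodes for a cluster.

    fanout: maximum number of servers one node manages (64 per the text).
    spares: redundant supervisors kept as a failsafe (illustrative parameter).
    """
    if data_servers <= fanout:
        # No supervisor layer is needed for 64 data servers or fewer.
        return 0
    return math.ceil(data_servers / fanout) + spares

print(supervisors_needed(320))  # 5 supervisors plus 1 spare
```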

Olbd can compute the overall load in many ways. The load is a combination of five main metrics (cpu, memory, networkio, pagingio, uptime) reported by each node (see the perf command in the olb/odc reference). The relative weights of these five parameters were chosen to reflect our workload and to keep the Xrootd deployment running smoothly (see the olb.sched configuration directive in our configuration).
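The idea of a weighted load figure can be sketched in a few lines of Python. This is only an illustration: the weights below are hypothetical (the real ones are set by olb.sched in the site configuration), each reported value is assumed to be normalized to the range 0-100, and the function names are not part of olbd.

```python
# Hypothetical weights in percent; the real values live in the olb.sched
# directive of the site configuration and are not reproduced here.
WEIGHTS = {"cpu": 40, "memory": 10, "networkio": 20, "pagingio": 20, "uptime": 10}

def overall_load(report, weights=WEIGHTS):
    """Combine per-metric loads (each assumed 0-100) into one weighted figure."""
    total = sum(weights.values())
    return sum(weights[name] * report[name] for name in weights) / total

def least_loaded(reports):
    """Return the name of the node with the smallest combined load."""
    return min(reports, key=lambda node: overall_load(reports[node]))

# Made-up load reports for two nodes.
reports = {
    "nodeA": {"cpu": 90, "memory": 50, "networkio": 20, "pagingio": 10, "uptime": 5},
    "nodeB": {"cpu": 30, "memory": 60, "networkio": 40, "pagingio": 10, "uptime": 5},
}
print(least_loaded(reports))
```

With these made-up numbers the CPU-heavy node loses, because cpu carries the largest weight.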

The redirector(s)


The redirector is the biggest single point of failure in such a large distributed environment. STAR adopted a DNS round-robin mechanism to reduce the load and to avoid losing access to the XROOTD cluster if a node crashes.


[Figure: DNS round robin]


This solution is not functional in Xrootd version 20060523-1741 or earlier for clusters with more than 64 servers, and the code needs to be reimplemented. We also placed each redirector node on a separate network switch to avoid losing both redirectors if a switch fails.
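The intended effect of DNS round robin can be illustrated with a toy model in Python. Nothing here talks to a real DNS server or to xrootd: the class, the function, and the addresses are hypothetical and only show how rotating answers spread the load and how a client can fall back to the second redirector.

```python
from collections import deque

class RoundRobinDNS:
    """Toy stand-in for one DNS name that has several address records."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        """Return all addresses, then rotate so the next query starts elsewhere."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer

def connect(dns, is_alive):
    """Try the resolved addresses in order and return the first live one."""
    for address in dns.resolve():
        if is_alive(address):
            return address
    raise ConnectionError("no redirector reachable")

dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2"])  # hypothetical redirector addresses
always_up = lambda address: True
print(connect(dns, always_up), connect(dns, always_up))
```

Successive clients land on different redirectors, and a client that finds the first address dead simply moves on to the next one.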

Redirectors hardware configuration:
The load caused by requests has not had any major impact, so there has been no need for more powerful hardware.

Dataserver(s)

The hardware configuration of dataservers in the STAR environment varies widely:
Each server has read-only access, and files can be dynamically staged from HPSS (High Performance Storage System). This makes it possible to build a uniquely fault-tolerant environment with almost no single point of failure (server out of the configuration, network glitch, etc.).

The Cache filesystem(s) / MPS configuration(s)

Each dataserver has one to four separate filesystems, which are combined into a uniform xrootd namespace. Our dataserver directives are therefore:


[Figure: cache_fs]


A real example is something like this:
[rcas6010] ~/> ls -alrt /home/starlib/home/starreco/reco/productionLow/FullField/P05ic/2004/038/st_physics_5038025_raw_1020003.MuDst.root
lrwxrwxrwx 1 starlib rhstar 120 Feb 6 17:41 /home/starlib/home/starreco/reco/productionLow/FullField/P05ic/2004/038/st_physics_5038025_raw_1020003.MuDst.root -> /data0/%home%starlib%home%starreco%reco%productionLow%FullField%P05ic%2004%038%st_physics_5038025_raw_1020003.MuDst.root
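The naming pattern visible in this listing can be reproduced with a short Python sketch. The rule (replace every "/" in the logical path with "%" and place the result in the cache filesystem) is inferred from the listing alone; the function name is illustrative and this is not taken from the xrootd source.

```python
def cache_file_name(logical_path, cache_root):
    """Flatten a logical path into a single file name inside a cache filesystem."""
    return cache_root.rstrip("/") + "/" + logical_path.replace("/", "%")

logical = ("/home/starlib/home/starreco/reco/productionLow/FullField/P05ic/"
           "2004/038/st_physics_5038025_raw_1020003.MuDst.root")
target = cache_file_name(logical, "/data0")
print(target)
print(len(target))  # compare with the symbolic link size reported by ls
```

The logical name stays the same on every server, while the symbolic link records which cache filesystem actually holds the file.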
The MPS configuration is used to manage multiple cache filesystems. STAR uses its features to select a cache during dynamic staging from HPSS and to purge used space dynamically when needed.

Used space is purged if and only if needed. A node is selected according to its load report, and any missing free space is purged as part of dynamic staging. Some of the cache filesystems also serve as temporary space for users' jobs (in other words, shared space), which required an additional external purging mechanism (see the thresholds for selecting a node for staging and for applying a purge action).
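Purging only when needed can be sketched as follows. This Python fragment is an illustration under stated assumptions: the least-recently-accessed policy, the function name, and the tuple layout are hypothetical and do not describe the actual MPS scripts or the STAR thresholds.

```python
def files_to_purge(files, free_bytes, needed_bytes):
    """Select files to remove until enough space is free.

    files: list of (path, size_bytes, last_access_time) tuples.
    Returns the paths to remove; the list is empty if no purge is needed.
    """
    selected = []
    # Oldest access time first (assumed policy, not taken from the source).
    for path, size, _atime in sorted(files, key=lambda f: f[2]):
        if free_bytes >= needed_bytes:
            break
        selected.append(path)
        free_bytes += size
    return selected

cache = [("a", 10, 3), ("b", 20, 1), ("c", 30, 2)]  # made-up cache content
print(files_to_purge(cache, free_bytes=5, needed_bytes=30))
```

When the free space already covers the incoming file, the function returns an empty list and nothing is removed.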

The dynamic staging from HPSS

Every request may involve an interaction with HPSS (a file missing from the distributed disk, or even a direct request to HPSS). Each dataserver may request data simultaneously and thus fall into the trap of uncoordinated requests. This matters when xrootd is used on such a large cluster by many users, and may ultimately lead to an HPSS collapse and downtime. Hence we decided to adapt our system called DataCarousel.

Its main purpose is to organize requests to HPSS by applying a sorting algorithm (an effort to group as many file requests as possible on the same tape) and by re-queuing requests in case of failure.
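The two ideas, sorting by tape and re-queuing on failure, can be sketched in Python. This is not the DataCarousel code: the function names, the "busiest tape first" heuristic, and the retry limit are assumptions made for illustration.

```python
from collections import defaultdict

def order_by_tape(requests):
    """Group (file, tape) requests so files on the same tape are restored together."""
    by_tape = defaultdict(list)
    for file_name, tape in requests:
        by_tape[tape].append(file_name)
    # Serve the tapes with the most pending files first (assumed heuristic).
    tapes = sorted(by_tape, key=lambda t: (-len(by_tape[t]), t))
    return [(tape, name) for tape in tapes for name in by_tape[tape]]

def process(requests, restore, max_attempts=3):
    """Restore files in tape order, re-queuing failures up to max_attempts."""
    queue = [(tape, name, 1) for tape, name in order_by_tape(requests)]
    done, failed = [], []
    while queue:
        tape, name, attempt = queue.pop(0)
        if restore(tape, name):
            done.append(name)
        elif attempt < max_attempts:
            queue.append((tape, name, attempt + 1))  # re-queue at the end
        else:
            failed.append(name)
    return done, failed

pending = [("f1", "T2"), ("f2", "T1"), ("f3", "T2")]  # made-up requests
print(order_by_tape(pending))
```

Here `restore` stands for whatever actually fetches a file from tape and reports success or failure; grouping by tape keeps the number of tape mounts low.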

Two scripts must be provided to enable MSS (Mass Storage System) staging in xrootd:
We implemented the mssgwcmd command as a Perl script that uses the FTP Perl module to communicate with HPSS.
The stagecmd script was designed and implemented to pass requests to DataCarousel and to report problems if needed.

All requests are handled asynchronously within the DataCarousel, so a single file restore in this scheme may be slower than a direct FTP approach. The solution pays off when many requests arrive simultaneously, and it also satisfies the need to share the resource with other tools outside XROOTD, as is the case in the STAR environment.

The xrootd security

STAR has also deployed the password-based authentication available in the xrootd distribution. Authentication is enabled only on the dataservers, since it is pointless to authenticate a user twice for one request (see sec.protocol in our configuration).

The xrootd and olbd configurations

Redirectors and supervisors cannot serve data by default. In the STAR environment, this means losing 6 servers that could serve data. This issue can be solved by running an additional xrootd/olbd on the same node.

We use two separate configuration files:


Managing xrootd and olbd daemons

We developed our own csh wrapper script for starting, restarting, and stopping the xrootd and olbd daemons with particular command-line options. The script is executed on each server from crontab, which also provides a way to restart crashed servers.
Each xrootd/olbd instance uses an instance name (xrootd terminology) based on the host name, which allows simpler and more transparent configuration files (see the if named "instance name" statement).
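How a wrapper might derive the instance name from the host name can be sketched in Python (the actual wrapper is a csh script and is not shown here). The file paths are hypothetical, and the options -n (instance name), -c (configuration file), and -l (log file) are assumed to match the installed xrootd/olbd version.

```python
import socket

def daemon_command(daemon, config_file, log_dir, host=None):
    """Build the argument list for one xrootd or olbd instance."""
    host = host or socket.gethostname()
    instance = host.split(".")[0]  # short host name used as the instance name
    # -n, -c and -l are assumed options; check them against the installed version.
    return [daemon, "-n", instance, "-c", config_file,
            "-l", f"{log_dir}/{daemon}.log"]

# Hypothetical paths; rcas6010 is the host seen in the listing earlier.
print(daemon_command("xrootd", "/etc/xrootd/xrootd.cf", "/var/log/xrootd",
                     host="rcas6010"))
```

Because the name is computed on each node, one shared configuration file can branch on it instead of being edited per host.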

Last modified 27-Jun-2006
Pavel Jakl