Basic xrootd monitoring

Xrootd can provide real-time monitoring. Basic (light) monitoring includes the following information:

Setup

Monitoring data is sent from xrootd servers via UDP packets to a collector, so xrootd itself is not disrupted no matter what happens to the collector. The collector can collect and decode packets from multiple xrootd servers in real time. Decoded data, in the form of ASCII files, is formatted so that it can easily be loaded into a relational database.

Setting up the light monitoring involves several simple steps:

  1. Prepare the database server
  2. Configure xrootd
  3. Start the collector
  4. Set up the MySQL database
  5. Run the MySQL loader
  6. Run Application to Prepare Statistics
  7. Set up Apache tomcat server
  8. Download the monitoring web application

Preparing Database Server

The collector/decoder and the applications that populate the database do not require sophisticated hardware; even a modest 1-CPU machine with a few hundred GB of disk space will do. Pick a host (<dbServerName>) and create the base directory <baseDir> and the site directory <baseDir>/<thisSite> for the site where the database server is running. Such a structure allows for future expansion of the system to include data from other sites. The output from the decoder is directed to <baseDir>/<thisSite>/logs.
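The directory layout described above can be prepared by hand or with a few lines of script. The following is a minimal Python sketch, not part of the xrootd distribution; the function name and the example path are hypothetical, and it creates only the two directories named in the text.

```python
from pathlib import Path


def make_site_dirs(base_dir: str, this_site: str) -> Path:
    """Create <baseDir>/<thisSite> and return the site directory."""
    site_dir = Path(base_dir) / this_site
    # parents=True also creates <baseDir>; exist_ok=True makes reruns harmless.
    site_dir.mkdir(parents=True, exist_ok=True)
    return site_dir
```

For instance, make_site_dirs("/data/xrdmon", "SLAC") would create /data/xrdmon/SLAC (both names are placeholders for your own <baseDir> and <thisSite>).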

Configuring xrootd

Optionally pick a port number <ctrPort> for the collector (default = 9930) and add the following line to the xrootd config file:

xrootd.monitor all dest files info user <dbServerName>:<ctrPort>

For instance:

xrootd.monitor all dest files info user xrootd-stats:9930

Xrootd has to be restarted for that change to take effect. Make sure you are using the same port number for all xrootd servers that will be monitored as one cluster.

Starting Collector

Simply run the command xrdmonCollector.pl <configFile> start.
If you used a non-default port number in the xrootd config file, you have to use the same number in the configFile. The script runs the binary xrdmonCollector, which has to be in your $PATH. The log file is written to
<baseDir>/<thisSite>/logs/out/xrdmonCollector.<date>

If you decide to run multiple collectors per machine (e.g. one for each group of hosts), you should repeat all of the steps described in this document for each cluster and use a different <baseDir> for each cluster, in addition to using a different port number (in the xrootd config and in <configFile>).

To stop the collector, use the command xrdmonCollector.pl <configFile> stop
The collector buffers some data in memory, and killing it in any other way will lose the buffered data. Please keep in mind that it can take a few minutes for the collector to shut down completely.

Setting up MySQL

To load data into MySQL you will need:

Carry out the following steps (if you run multiple collectors, repeat them for each collector):

  1. Pick a MySQL server <dbServerName>, user name <MySQLUser> and database name <dbName>.
  2. Make sure the mysqld server is running on the MySQL server <dbServerName>.
  3. Make sure the time_zone tables are loaded into MySQL. If not, use the appropriate command for your platform to load them.
  4. Grant the user appropriate privileges using an account with GRANT privileges. At the MySQL prompt enter:
    grant all on <dbName>.* to <MySQLUser>
  5. Run xrdmonCreateMySQL.pl <configFile>.
xrdmonCreateMySQL.pl will terminate without doing anything if <dbName> already exists. This protects against accidentally corrupting an existing database. If you want to recreate a database, you have to remove it first.

Platform | Command to load time_zone tables
Solaris | mysql_tzinfo_to_sql /usr/share/lib/zoneinfo/ | mysql -uroot mysql
Linux | mysql_tzinfo_to_sql /usr/share/zoneinfo/ | mysql -uroot mysql

Running MySQL Loader

To load data coming out of the collector into the MySQL database, start the MySQL loader using the command:

xrdmonLoadMySQL.pl <configFile> start >& <loaderLogFile>

But first you have to replace the dummy subroutine getFileTypes in xrdmonLoadMySQL.pl with one suitable for your file types (see the example for the BaBar experiment).

xrdmonLoadMySQL.pl will run indefinitely. To stop it, use the command

xrdmonLoadMySQL.pl <configFile> stop

This ensures that the program loads everything in the buffer before terminating. Killing it in any other way might leave the database tables in an inconsistent state. The loader also sets certain semaphore files that might be left behind if it is not stopped properly.

Preparing Statistics Tables

The data loaded into the database needs further processing to produce the statistics tables that are published on the web. You do this with the command:

xrdmonPrepareStats.pl <configFile> start >& <prepareLogFile>

xrdmonPrepareStats.pl will run indefinitely. To stop it, use the command

xrdmonPrepareStats.pl <configFile> stop

Killing it in any other way might leave the database tables in an inconsistent state.

Configuration File

All Perl scripts in the xrootd monitoring system use a single config file. Make sure that you do not alter the config file after you have created the database unless you are sure that the changes are compatible with the existing database and the applications that are already running.

Two of the definitions used in the config file are specific to this project and need some explanation. These are file types and job.

File Types

A file can be classified in many ways, which differ from experiment to experiment. One example is whether a file contains real or simulated data, each of which can in turn be filtered or unfiltered. This defines a class (or type) of files named, say, dataType, with the values (real, real-filtered, simulated, simulated-filtered). A file containing filtered data can also belong to a second class named filters, with the values (filter-1, filter-2, ..., filter-100). One can imagine further classes describing data-taking periods, beam energies, software releases, etc. The classification names (dataType, filters, etc.) are given in the configuration file as:

fileType: dataType 4
fileType: filters 100
etc
The numbers following the names are the maximum number of values expected. Allow for future increases.

getFileTypes

The subroutine getFileTypes takes the file path as its argument, determines which classes the file belongs to, and returns the corresponding values in a predetermined order. If a file does not belong to a given class, the value undef is returned for that class. The order is communicated in the initialization call: when getFileTypes is called without an argument, it returns the ordered list of class names.
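The real subroutine is written in Perl inside xrdmonLoadMySQL.pl. The calling convention described above can nevertheless be illustrated with a short Python sketch, in which None stands in for Perl's undef. The class names come from the fileType examples above, while the two matching rules (an SP directory under /store, and a path component starting with filter-) are purely hypothetical and would have to be replaced by the rules of your experiment.

```python
from typing import List, Optional

# Class names in the order in which values are returned.
CLASS_NAMES = ["dataType", "filters"]


def get_file_types(path: Optional[str] = None) -> List[Optional[str]]:
    # Initialization call: no argument, so return the ordered class names.
    if path is None:
        return list(CLASS_NAMES)

    parts = [p for p in path.split("/") if p]

    # Hypothetical rule: under /store, an "SP" directory marks simulated data.
    data_type = None
    if len(parts) > 1 and parts[0] == "store":
        data_type = "simulated" if parts[1] == "SP" else "real"

    # Hypothetical rule: a path component such as "filter-7" names the filter.
    filt = next((p for p in parts if p.startswith("filter-")), None)

    # One value per class, in CLASS_NAMES order; None means "not in this class".
    return [data_type, filt]
```

Calling get_file_types() first tells the caller how to interpret the positions of the list returned for each path.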

Definition of Job

A job is defined as a collection of sessions that have the same userName, processId and clientHost and satisfy the timing criterion described below. Each job is assigned a unique jobId. The start time of a job is the earliest session connect time and the end time is the latest session disconnect time. The connect time of any session in the job cannot exceed the latest session disconnect time by more than <maxJobIdleTime>. At any given time a job is in one of three states:
  1. running: Number of open sessions > 0
  2. finished: Number of open sessions = 0 AND latest disconnect time is at least <maxJobIdleTime> before current time.
  3. dormant: When none of the above is true. A dormant job can resume running if a new session is started before the <maxJobIdleTime> limit is reached.
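The three states can be written as a small decision function. This is a Python sketch of the rules as stated above, not the code used by the loader; the function and parameter names are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def job_state(open_sessions: int,
              latest_disconnect: Optional[datetime],
              now: datetime,
              max_job_idle: timedelta) -> str:
    """Classify a job as 'running', 'finished' or 'dormant'."""
    # running: at least one session is still open.
    if open_sessions > 0:
        return "running"
    # finished: no open sessions and the last disconnect happened
    # at least max_job_idle before the current time.
    if latest_disconnect is not None and now - latest_disconnect >= max_job_idle:
        return "finished"
    # dormant: neither of the above; a new session may still revive the job.
    return "dormant"
```

With the default maxJobIdleTime of 15 MINUTE, a job whose last session disconnected 3 minutes ago is dormant, and one whose last session disconnected 23 minutes ago is finished.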

Contents of Configuration File

The config file contains lines with the following format:

<token>: <Value 1> [ <Value 2> [ <Value 3> ] ]
(there is no space between the token and the colon, and all values are separated by a single space)

The valid tokens and explanation of values are given in the following table.

Time and interval values are integers followed by MINUTE | HOUR | DAY | WEEK | MONTH | YEAR

The date format is: YYYY-MM-DD<space>HH:MM:SS

token | explanation | default
dbName | Database name | None
MySQLUser | Name of the user with full permissions to the database | None
webUser | Name of the user with SELECT permission to the database | None
MySQLSocket | Location of the MySQL socket | /tmp/mysql.sock
baseDir | Pathname of the parent directory of the site subdirectories | None
thisSite | Name of the site where the database server is running | None
site | Value 1: site name; Value 2: time zone code; Value 3: date of first data from the site | None
backupInt | Value 1: site name; Value 2: time interval for starting a new backup file for the input data | 1 DAY
backupIntDef | Default backup interval to be used in the absence of a site-specific backup interval | 1 DAY
fileType | Value 1: file classification name; Value 2: maximum number of values | None
fileCloseWaitTime | Maximum time the loader waits for a file close signal after receiving the corresponding session close signal. The file is force closed and the file close time is set to the session close time. | 10 MINUTE
maxJobIdleTime | Maximum time an xrootd client job is allowed to run without an open session | 15 MINUTE
maxSessionIdleTime | Time after which the session close signal is assumed to be lost. The session is force closed and the session disconnect time is set to the latest file close time in the session. For a session with no files the session duration is set to zero. | 12 HOUR
maxConnectTime | Maximum time a session can remain connected with at least one open file. After this time the session and all open files are force closed. The session disconnect time is set to the latest file close time in the session. | 70 DAY
closeFileInt | Time interval for checking the open files associated with closed sessions | 15 MINUTE
closeIdleSessionInt | Time interval for checking the idle sessions | 1 HOUR
closeLongSessionInt | Time interval for checking the sessions running over the "maxConnectTime" limit | 24 HOUR
nTopPerfRows | Number of rows in Top Performance tables | 20
yearlyStats | ON/OFF switch for yearly statistics | OFF
allYearsStats | ON/OFF switch for AllYears statistics | OFF

See for example the configuration file for the BaBar setup where the lines corresponding to default values are included for completeness.
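To make the line format and the interval syntax concrete, here is a Python sketch of a reader for such lines. It is not the parser used by the Perl scripts, and the function names are invented for this example. Because intervals and dates themselves contain a space, the sketch only splits a line into words and leaves it to the caller to regroup them according to the token.

```python
import re
from typing import List, Tuple

UNITS = {"MINUTE", "HOUR", "DAY", "WEEK", "MONTH", "YEAR"}
# Token, colon with no space before it, one space, then the values.
LINE_RE = re.compile(r"^(\w+): (.+)$")


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split '<token>: <words...>' into the token and its words."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a '<token>: <values>' line: {line!r}")
    return m.group(1), m.group(2).split(" ")


def parse_interval(words: List[str]) -> Tuple[int, str]:
    """Turn ['15', 'MINUTE'] into (15, 'MINUTE')."""
    if len(words) != 2 or not words[0].isdigit() or words[1] not in UNITS:
        raise ValueError(f"bad interval: {words!r}")
    return int(words[0]), words[1]
```

For a line such as "backupInt: SLAC 1 DAY" (SLAC being a placeholder site name), parse_line yields the words SLAC, 1 and DAY; the first word is the site and the remaining two form the interval passed to parse_interval.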

Location of Perl Scripts

The Perl scripts are currently located in xrootd/src/XrdMon. This is a temporary location and will be fixed very soon.
You may have to adjust the very first line of each script to point to the location of perl on your system.

Setting up Tomcat Server

To deploy the web application for xrootd monitoring you need to set up Apache Tomcat.

  1. Choose a server <webServerName>, preferably not the one hosting the MySQL database (e.g. you might want to run MySQL and the collector inside an internet-free zone for security reasons).
  2. Download Apache Tomcat 5.5.* (we are using 5.5.7 at SLAC).
  3. Install the administration web application, which has to be downloaded separately.
  4. Create a user with manager and admin privileges by editing the file conf/tomcat-users.xml. Add the line:
    <user username="<userName>" password="<password>" roles="manager,admin"/>
  5. Configure the database connection by adding the following lines to conf/server.xml:

    <Resource
    name="jdbc/xrdmon-database"
    type="javax.sql.DataSource"
    password="<password>"
    driverClassName="org.gjt.mm.mysql.Driver"
    maxIdle="2"
    maxWait="5000"
    username="<webUser>"
    url="jdbc:mysql://<dbServerName>/<dbName>"
    maxActive="4"/>

  6. Start the tomcat server by running the startup script in the bin directory

Deploying Web Application

Download (location to be decided)/xrdmon.war (6 MB) and deploy the application through the Tomcat manager application by loading the war file.

Additional Information

More on real time

So how "real-time" is this system? Most of the information on the xrootd server side is flushed when a UDP packet fills up or when a specified time elapses (currently 1 min). Once sent, data should be (a) collected, (b) decoded and (c) flushed to the real time log file within a few seconds (this depends on the RT flush setting; the default is 10 sec). Loading into MySQL introduces another short delay; at SLAC it is 1 minute. So in practice, the information is available in MySQL a couple of minutes after the event happens.
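The "couple of minutes" figure follows from adding up the delays quoted above. The Python sketch below does that arithmetic under stated assumptions: each stage is taken at its worst case, the 1 minute MySQL delay is treated as an upper bound, and network transfer and decoding time are ignored.

```python
# Worst-case delays in seconds, taken from the paragraph above.
SERVER_FLUSH = 60  # xrootd flushes at the latest after 1 minute
RT_FLUSH = 10      # collector real time flush, default setting
MYSQL_LOAD = 60    # loading delay quoted for SLAC (assumed upper bound)


def worst_case_latency(server_flush: int = SERVER_FLUSH,
                       rt_flush: int = RT_FLUSH,
                       mysql_load: int = MYSQL_LOAD) -> int:
    """Sum of the per-stage delays, in seconds."""
    return server_flush + rt_flush + mysql_load
```

With these values the total is 130 seconds, i.e. a little over two minutes.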

More on log files

By default the collector decodes all the light data and produces 3 different types of log files:

type | format | default frequency | default location
collected packets | binary | N/A (default: flush when log size = 1GB) | ./logs/collector/<xrdHost><port>
decoded real time data | ASCII | 15 sec | ./logs/rt/rtLog_verXXX.txt
decoded history data | ASCII | 10 min | ./logs/decoder/<xrdHost>/<port>

Collected packets' logs contain segregated raw packets exactly as they arrive from xrootd servers, packets from each xrootd instance in a separate directory.

Decoded real time data logs contain information about the current snapshot of the system, for instance the current number of open sessions, users, or files. If you are not interested in watching the current "snapshot" of the system, you may consider turning it off (xrdmonCollector -rt off). This data is regarded as temporary/scratch, e.g. it becomes invalid when the collector is stopped.

History data logs contain information about closed sessions and closed files. It should be treated as a permanent record.

Collected packets

Collected packets' logs are segregated according to xrootd host and port. Log name: active.ctr. The default maximum size for the log file is 1GB. When active.ctr reaches that size it is renamed to <timestamp>.ctr, and a new active.ctr is opened. Never delete active.ctr while the collector is running. You can back up, copy or remove closed logs.

Real time logs

Real time data is kept in rtLog_verXXX.txt. If you use the provided script for loading data into MySQL, it removes rtLog_verXXX.txt each time it loads the data. It backs up the data in the backup directory for each site by appending it to the current backup file. A new backup file is started at intervals given by <backup interval>. You will also see a lock file, rtLog_verXXX.lock, which is used to synchronize access to rtLog_verXXX.txt. The lock file should not be touched.

Format of the log file:
u <uniqueSessionId> <userName> <processId> <clientHostName> <xrootdHostName>
d <uniqueSessionId> <durationInSec> <disconnectTime>
o <uniqueFileId> <userName> <processId> <clientHostName> <filePath> <openTime> <fileSize> <xrootdHostName>
c <uniqueFileId> <bytesRead> <bytesWritten> <closeTime>

Lines starting with 'u' indicate 'user connect' (session start).
Lines starting with 'd' indicate 'user disconnect' (session stop).
Use <uniqueSessionId> to correlate 'd' lines with the corresponding 'u' lines.

Lines starting with 'o' indicate 'file open'.
Lines starting with 'c' indicate 'file close'.
Use <uniqueFileId> to correlate 'o' lines with the corresponding 'c' lines.

All values are separated by TAB.
Format of timestamp is: YYYY-MM-DD<space>HH:MM:SS

Example:

u 115 dgustaf 12078 barb0045 kan012.slac.stanford.edu
o 120 owen 1960 barb0154 /store/SP/R12/001237/200010/12.4.0j/SP_001237_003641.02E.root 2005-03-16 22:43:07 186067283 kan006.slac.stanford.edu
c 66 0 0 2005-03-16 22:41:40
d 84 0 2005-03-16 22:42:40
o 107 palombo 26144 barb0077 /store/SP/R14/001237/200401/14.4.2a/SP_001237_007640.01.root 2005-03-16 22:43:00 1276663486 kan031.slac.stanford.edu
o 128 roethel 7553 barb0131 /store/SP/R12/000998/200301/12.6.0b/SP_000998_013189.01.root 2005-03-16 22:43:20 1353696936 kan031.slac.stanford.edu
o 131 owen 8791 barb0093 /store/SP/R12/001237/200007/12.4.0j/SP_001237_003653.01.root 2005-03-16 22:43:30 16643497 kan031.slac.stanford.edu
u 116 dgustaf 20445 noma0065 kan032.slac.stanford.edu
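The format above lends itself to a simple line parser. The following Python sketch is not part of the monitoring package; the function and field names are chosen for this example. It assumes the fields are separated by TAB characters as stated, so that the space inside a timestamp stays within one field.

```python
from typing import Dict

# Field names for each record type, in the order given by the format above.
FIELDS = {
    "u": ["sessionId", "userName", "processId", "clientHost", "xrootdHost"],
    "d": ["sessionId", "durationSec", "disconnectTime"],
    "o": ["fileId", "userName", "processId", "clientHost", "filePath",
          "openTime", "fileSize", "xrootdHost"],
    "c": ["fileId", "bytesRead", "bytesWritten", "closeTime"],
}


def parse_rt_line(line: str) -> Dict[str, str]:
    """Parse one TAB-separated real time log line into a dict of strings."""
    parts = line.rstrip("\n").split("\t")
    kind, values = parts[0], parts[1:]
    names = FIELDS.get(kind)
    # Reject unknown record types and lines with the wrong number of fields.
    if names is None or len(values) != len(names):
        raise ValueError(f"unrecognized real time log line: {line!r}")
    record = dict(zip(names, values))
    record["type"] = kind
    return record
```

A caller can keep a dictionary keyed by sessionId (or fileId) to pair each 'd' record with its 'u' record and each 'c' record with its 'o' record.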

Future Developments

This is a first attempt at providing real time monitoring for xrootd. More work is in progress, including:

Questions or comments? Send email to Tofigh Azemoon or Jacek Becla.

Last modified October 5, 2006 Tofigh Azemoon