Xrootd can provide real-time monitoring. Basic (light) monitoring includes the following information:

Monitoring data is sent from xrootd servers via UDP packets to a collector, so monitoring does not disrupt xrootd no matter what happens to the collector. The collector can collect and decode packets from multiple xrootd servers in real time. Decoded data is written to ASCII files formatted so that they can easily be loaded into a relational database.
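The fire-and-forget behaviour of UDP described above can be illustrated with a minimal Python sketch. It is not the xrootd or xrdmonCollector implementation: the function names and the payload are hypothetical, and the real monitoring packets have their own binary layout that is not reproduced here.

```python
import socket


def send_monitoring_packet(payload: bytes, host: str, port: int) -> int:
    """Send one datagram and return the number of bytes handed to the OS.

    UDP is connectionless: the sender does not wait for an acknowledgement,
    so a stopped or missing collector does not block the sender.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


def receive_one_packet(sock: socket.socket, max_size: int = 65535) -> bytes:
    """Read a single datagram from an already bound UDP socket."""
    data, _sender = sock.recvfrom(max_size)
    return data
```

The sender returns as soon as the datagram is handed to the operating system, whether or not anything is listening on the destination port. The price is that packets sent while the collector is down are simply lost.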
Setting up the light monitoring involves several simple steps:
Optionally pick a port number <ctrPort> for the collector (default = 9930) and add the following line to the xrootd config file:
xrootd.monitor all dest files info user <dbHostName>:<ctrPort>
For instance:
xrootd.monitor all dest files info user xrootd-stats:9930
Xrootd has to be restarted for that change to take effect. Make sure you are using the same port number for all xrootd servers that will be monitored as one cluster.
To start the collector, simply run the command
xrdmonCollector.pl <configFile> start
If you used a non-default port number in the xrootd config file, you have to use the same number in the configFile.
The above script runs the binary xrdmonCollector, which has to be in your $PATH. The log file is written to
<baseDir>/<thisSite>/logs/out/xrdmonCollector.<date>
If you decide to run multiple collectors per machine (e.g. one for each group of hosts), you should repeat all of the steps described in this document for each cluster, and use a different <baseDir> as well as a different port number (in the xrootd config and <configFile>) for each cluster.
To stop the collector, use the command
xrdmonCollector.pl <configFile> stop
The collector buffers some data in memory, and killing it in any other way will
result in losing buffered data. Please keep in mind that it can take a few minutes
for the collector to completely shut down.
To load data into MySQL you will need:
You need to do the following (if you run multiple collectors, repeat these steps for each collector):
| Platform | Command to load time_zone tables |
|---|---|
| Solaris | mysql_tzinfo_to_sql /usr/share/lib/zoneinfo/ \| mysql -uroot mysql |
| Linux | mysql_tzinfo_to_sql /usr/share/zoneinfo/ \| mysql -uroot mysql |
To load data coming out of the collector into the MySQL database, start the MySQL loader using the command:
xrdmonLoadMySQL.pl <configFile> start >& <loaderLogFile>
But first you have to replace the dummy subroutine getFileTypes in xrdmonLoadMySQL.pl with one suitable for your file types (see the example for the BaBar experiment).
xrdmonLoadMySQL.pl will run indefinitely. To stop it, use the command
xrdmonLoadMySQL.pl <configFile> stop
This will ensure that the program loads everything in the buffer before terminating. Killing it in any other way might leave the database tables in an inconsistent state. Also, the loader sets certain semaphore files that might be left behind if it is not stopped properly.
The data loaded into the database needs further processing to produce the statistics tables that are published on the web. You do this with the command:
xrdmonPrepareStats.pl <configFile> start >& <prepareLogFile>
xrdmonPrepareStats.pl will run indefinitely. To stop it, use the command
xrdmonPrepareStats.pl <configFile> stop
Killing it in any other way might leave the database tables in an inconsistent state.
All Perl scripts in the xrootd monitoring system use a single config file. Do not alter the config file after you have created the database unless you are sure that the changes are compatible with the existing database and with the applications that are already running.
Two of the definitions used in the config file are specific to this project and need some explanation. These are file types and Job.
The config file contains lines with the following format:
<token>: <Value 1> [ <Value 2> [ <Value 3> ] ]
(no space between the token and the colon, and all values are separated by a single space)
The valid tokens and explanation of values are given in the following table.
Time and interval values are integers followed by MINUTE | HOUR | DAY | WEEK | MONTH | YEAR
The date format is:
YYYY-MM-DD<space>
| token | explanation | default |
|---|---|---|
| dbName | Database Name | None |
| MySQLUser | Name of the user with full permissions to the database | None |
| webUser | Name of the user with SELECT permission to the database | None |
| MySQLSocket | Location of MySQL Socket | /tmp/mysql.sock |
| baseDir | Pathname for the parent directory of the sites subdirectories. | None |
| thisSite | Name of the site where the database server is running | None |
| site | Value 1: Site Name; Value 2: Time Zone Code; Value 3: Date of first data from the site | None |
| backupInt | Value 1: Site Name; Value 2: Time interval for starting a new backup file for the input data | 1 DAY |
| backupIntDef | Default backup interval to be used in the absence of the site specific backup interval. | 1 DAY |
| fileType | Value 1: file classification name; Value 2: maximum number of values | None |
| fileCloseWaitTime | Maximum time the loader waits for a file close signal after receiving the corresponding session close signal. The file is force closed and the file close time is set to the session close time. | 10 MINUTE |
| maxJobIdleTime | Maximum time an xrootd client job is allowed to run without an open session. | 15 MINUTE |
| maxSessionIdleTime | Time after which the session close signal is assumed to be lost. The session is force closed and the session disconnect time is set to the latest file close time in the session. For a session with no files the session duration is set to zero. | 12 HOUR |
| maxConnectTime | Maximum time a session can remain connected with at least one open file. After this time the session and all open files are force closed. The session disconnect time is set to the latest file close time in the session. | 70 DAY |
| closeFileInt | Time interval for checking the open files associated with closed sessions. | 15 MINUTE |
| closeIdleSessionInt | Time interval for checking the idle sessions. | 1 HOUR |
| closeLongSessionInt | Time interval for checking the sessions running over the "maxConnectTime" limit. | 24 HOUR |
| nTopPerfRows | Number of rows in Top Performance tables | 20 |
| yearlyStats | ON/OFF switch for yearly statistics | OFF |
| allYearsStats | ON/OFF switch for AllYears statistics | OFF |
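As an illustration of the line format and the interval syntax described above, here is a minimal Python sketch of a reader for such a config file. It is not the parser used by the Perl scripts: the function names are hypothetical, the fixed lengths chosen for MONTH and YEAR are assumptions of this sketch, and comment lines or other syntax the real scripts may accept are not handled.

```python
from datetime import timedelta

# Seconds per unit. MONTH and YEAR use fixed lengths (30 and 365 days),
# which is an assumption of this sketch, not documented behaviour.
_UNIT_SECONDS = {
    "MINUTE": 60,
    "HOUR": 3600,
    "DAY": 86400,
    "WEEK": 7 * 86400,
    "MONTH": 30 * 86400,
    "YEAR": 365 * 86400,
}


def parse_interval(text: str) -> timedelta:
    """Convert a value such as '10 MINUTE' into a timedelta."""
    parts = text.split()
    if len(parts) != 2 or parts[1] not in _UNIT_SECONDS:
        raise ValueError(f"not a valid interval: {text!r}")
    return timedelta(seconds=int(parts[0]) * _UNIT_SECONDS[parts[1]])


def parse_config(lines):
    """Map each token to the list of its occurrences.

    Every occurrence is the list of space-separated fields after the colon,
    so a token that appears several times (e.g. 'site') keeps all entries.
    Multi-word values such as '1 DAY' span two fields.
    """
    config = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        token, colon, rest = line.partition(":")
        if not colon:
            raise ValueError(f"missing ':' in config line: {line!r}")
        config.setdefault(token.strip(), []).append(rest.split())
    return config
```

A caller would look up a token and rejoin the fields that form an interval, for example `parse_interval(" ".join(config["fileCloseWaitTime"][0]))`.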
The Perl scripts are currently located in xrootd/src/XrdMon. This is a temporary location and will be fixed very soon.
You may have to tweak the very first line in each script to point to the location of perl on your system.
To deploy the web application for xrootd monitoring you need to set up Apache Tomcat.
<Resource
name="jdbc/xrdmon-database"
type="javax.sql.DataSource"
password="<password>"
driverClassName="org.gjt.mm.mysql.Driver"
maxIdle="2"
maxWait="5000"
username="<webUser>"
url="jdbc:mysql://<dbServerName>/<dbName>"
maxActive="4"/>
Download (location to be decided)/xrdmon.war (6 MB) and deploy the application through the Tomcat manager application by loading the war file.
So how "real-time" is this system? Most of the information on the xrootd server side is flushed when a UDP packet fills up or when a specified time elapses (currently 1 min). Once sent, data should be (a) collected, (b) decoded and (c) flushed to the real-time log file within a few seconds (this depends on the RT flush setting; the default is 10 sec). Loading to MySQL introduces another short delay; at SLAC it is 1 minute. So in practice, the information is available in MySQL a couple of minutes after the event happens.
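The "couple of minutes" figure can be checked with a rough upper bound that simply adds the three waiting periods quoted above. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes the stages run one after another and ignores network and processing time.

```python
def worst_case_delay_seconds(server_flush=60, rt_flush=10, load_interval=60):
    """Upper bound on the delay, in seconds, before an event reaches MySQL.

    Defaults are the values quoted in the text: 1 min server-side flush,
    10 sec real-time log flush, and a 1 min MySQL load interval.
    """
    return server_flush + rt_flush + load_interval


print(worst_case_delay_seconds())  # 130 seconds, i.e. a little over 2 minutes
```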
By default, the collector decodes all the light data and produces 3 different types of log files:
| type | format | default frequency | default location |
|---|---|---|---|
| collected packets | binary | N/A (default: flush when log size = 1GB) | ./logs/collector/<xrdHost><port> |
| decoded real time data | ASCII | 15 sec | ./logs/rt/rtLog_verXXX.txt |
| decoded history data | ASCII | 10 min | ./logs/decoder/<xrdHost>/<port> |
Collected packets' logs contain raw packets exactly as they arrive from the xrootd servers, with packets from each xrootd instance kept in a separate directory.
Decoded real time data logs contain a current snapshot of the system, for instance the current number of open sessions, users, or files. If you are not interested in watching the current snapshot of the system, you may consider turning it off (xrdmonCollector -rt off). This data is regarded as temporary/scratch, e.g. it becomes invalid when the collector is stopped.
History data logs contain information about closed sessions and closed files. They should be treated as a permanent record.
Collected packets
Collected packets' logs are segregated according to xrootd host and port. The log is named active.ctr, and its default maximum size is 1GB. When active.ctr reaches that size it is renamed to <timestamp>.ctr, and a new active.ctr is opened. Never delete active.ctr while the collector is running. Closed logs can be backed up, copied, or removed.
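A backup or cleanup job therefore has to skip active.ctr and touch only the renamed logs. The following Python sketch shows one way to select them; the function name is hypothetical and the directory passed in is whichever per-host log directory your collector writes to.

```python
from pathlib import Path

ACTIVE_LOG_NAME = "active.ctr"  # the log the collector is still writing to


def closed_packet_logs(log_dir):
    """Return the closed *.ctr logs in log_dir, sorted by file name.

    active.ctr is always excluded, so the result is safe to back up,
    copy, or remove while the collector is running.
    """
    return sorted(
        path
        for path in Path(log_dir).glob("*.ctr")
        if path.is_file() and path.name != ACTIVE_LOG_NAME
    )
```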
Real time data is kept in rtLog_verXXX.txt. If you use the provided script for loading data to MySQL, it removes rtLog_verXXX.txt each time it loads the data. It backs up the data in the backup directory for each site by appending it to the current backup file. A new backup file is started at intervals given by <backup interval>. You will also see a lock file, rtLog_verXXX.lock, which is used to synchronize access to rtLog_verXXX.txt. The lock file should not be touched.
Format of the log file:
u <uniqueSessionId> <userName> <processId> <clientHostName> <xrootdHostName>
d <uniqueSessionId> <durationInSec> <disconnectTime>
o <uniqueFileId> <userName> <processId> <clientHostName>
<filePath> <openTime> <fileSize> <xrootdHostName>
c <uniqueFileId> <bytesRead> <bytesWritten> <closeTime>
Lines starting with 'u' indicate 'user connect' (session start).
Lines starting with 'd' indicate 'user disconnect' (session stop).
Lines starting with 'o' indicate 'file open'.
Lines starting with 'c' indicate 'file close'.
Format of timestamp is: YYYY-MM-DD<space>
Example:
u 115 dgustaf 12078 barb0045 kan012.slac.stanford.edu
o 120 owen 1960 barb0154 /store/SP/R12/001237/200010/12.4.0j/SP_001237_003641.02E.root
2005-03-16 22:43:07 186067283 kan006.slac.stanford.edu
c 66 0 0 2005-03-16 22:41:40
d 84 0 2005-03-16 22:42:40
o 107 palombo 26144 barb0077 /store/SP/R14/001237/200401/14.4.2a/SP_001237_007640.01.root
2005-03-16 22:43:00 1276663486 kan031.slac.stanford.edu
o 128 roethel 7553 barb0131 /store/SP/R12/000998/200301/12.6.0b/SP_000998_013189.01.root
2005-03-16 22:43:20 1353696936 kan031.slac.stanford.edu
o 131 owen 8791 barb0093 /store/SP/R12/001237/200007/12.4.0j/SP_001237_003653.01.root
2005-03-16 22:43:30 16643497 kan031.slac.stanford.edu
u 116 dgustaf 20445 noma0065 kan032.slac.stanford.edu
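The record layout above lends itself to a small parser. The following Python sketch is one possible reader, not the code used by xrdmonLoadMySQL.pl. It assumes that fields never contain spaces, that ids, process ids and sizes are integers, and that a line not starting with u, d, o or c is a continuation of the previous record (as with the wrapped 'o' lines in the example). All function and key names are hypothetical.

```python
from datetime import datetime

RECORD_TYPES = {"u", "d", "o", "c"}
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def _timestamp(date_part, time_part):
    """Build a datetime from the two space-separated timestamp fields."""
    return datetime.strptime(f"{date_part} {time_part}", TIME_FORMAT)


def join_wrapped_lines(lines):
    """Merge continuation lines into the record line that precedes them."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        first = line.split(None, 1)[0]
        if first in RECORD_TYPES or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def parse_record(line):
    """Parse one complete record line into a dictionary."""
    f = line.split()
    kind = f[0]
    if kind == "u" and len(f) == 6:  # user connect (session start)
        return {"type": "connect", "session_id": int(f[1]), "user": f[2],
                "pid": int(f[3]), "client_host": f[4], "server_host": f[5]}
    if kind == "d" and len(f) == 5:  # user disconnect (session stop)
        return {"type": "disconnect", "session_id": int(f[1]),
                "duration_sec": int(f[2]), "time": _timestamp(f[3], f[4])}
    if kind == "o" and len(f) == 10:  # file open
        return {"type": "open", "file_id": int(f[1]), "user": f[2],
                "pid": int(f[3]), "client_host": f[4], "path": f[5],
                "open_time": _timestamp(f[6], f[7]), "size": int(f[8]),
                "server_host": f[9]}
    if kind == "c" and len(f) == 6:  # file close
        return {"type": "close", "file_id": int(f[1]),
                "bytes_read": int(f[2]), "bytes_written": int(f[3]),
                "time": _timestamp(f[4], f[5])}
    raise ValueError(f"unrecognized record: {line!r}")
```

Typical use is `[parse_record(r) for r in join_wrapped_lines(open_file)]`. Because the lock file guards the real log, anything reading rtLog_verXXX.txt directly should work on a copy rather than on the live file.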
This is a first attempt at providing real-time monitoring for xrootd. More work is in progress, including:
Questions or comments? Send email to Tofigh Azemoon or Jacek Becla.