Xrootd Performance
Bill Weeks & Andrew Hanushevsky
Stanford Linear Accelerator Center, February 4, 2005
Xrootd is representative of the next generation of high performance,
scalable random access data servers. Like other servers in its class, xrootd achieves
high levels of performance by extensively using parallel, low-latency
algorithms. To test the real-world performance of the server, a series
of BaBar analysis jobs was run against a single
file. Using a single file allowed data to be served from the file-system memory
cache and avoided disk-speed anomalies that make performance results hard to
interpret. The CPU-intensive work in the analysis job was removed to force the
maximum possible request rate from each client while preserving the original
data access pattern. Thus an “event” in this context represents a bounded
series of server transactions.
The test was run using a
single server on a Sun Microsystems Sun Fire V20Z, whose characteristics are:
A series of eight workloads was run, stepping from 50 clients to 400 clients in increments of 50; apart from the client count, each run was identical.
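The schedule of runs can be sketched as a simple driver loop. The function names and the per-client bookkeeping below are hypothetical, intended only to illustrate the structure of the test (each client replays the BaBar access pattern with the CPU-intensive work stripped out):

```python
# Hypothetical sketch of the benchmark schedule: eight runs,
# stepping from 50 to 400 clients in increments of 50.

def client_counts():
    """Client counts for the eight benchmark runs."""
    return list(range(50, 401, 50))

def run_benchmark(n_clients, events_per_client=1000):
    """Placeholder for one run: n_clients replay the (CPU-stripped)
    BaBar access pattern against a single, memory-cached file."""
    # In the real test each client issued the original sequence of
    # read requests; here we only model the run's bookkeeping.
    return {"clients": n_clients, "events": n_clients * events_per_client}

results = [run_benchmark(n) for n in client_counts()]
print([r["clients"] for r in results])  # eight runs, 50 through 400
```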
Scaling:
The most significant feature of the performance graph is that xrootd scales linearly with the number of clients (as
indicated by the “CPU remaining” line). Linear scalability means that the
number of clients a single xrootd server can support is limited not by the server software
but by hardware factors such as memory, CPU, disk speed, and network interface. Linear
scaling also explains why network bandwidth utilization and the number of
events per second increase uniformly as more clients use the server.
Overhead:
The per-client user-space load that the server adds on top of the
system is low. In fact, xrootd's
user-level overhead (i.e., protocol framing, queue manipulation, thread
scheduling, etc.) accounts for only 12% of the total CPU utilized by xrootd. The vast
majority of the CPU goes to NIC processing overhead.
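The split can be made concrete with a small sketch. The 12% user-level fraction is the measured figure; the total CPU utilization used here is purely illustrative, since the report does not state it:

```python
# Hedged breakdown of xrootd's CPU budget. USER_LEVEL_FRACTION is the
# measured 12%; total_cpu is an illustrative figure, not a measurement.
USER_LEVEL_FRACTION = 0.12   # protocol framing, queues, thread scheduling
total_cpu = 40.0             # % of one CPU consumed by xrootd (illustrative)

user_level = total_cpu * USER_LEVEL_FRACTION
nic_processing = total_cpu - user_level  # remainder: NIC processing overhead

print(f"user-level: {user_level:.1f}%  NIC processing: {nic_processing:.1f}%")
```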
Latency:
We measured the average server-side latency per request. Our tests showed that the
server added an average of 59 µs to a 4 KB data-transfer operation. The client will likely
see a larger latency once network and client-side overhead are included.
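A back-of-the-envelope calculation shows what this latency implies for a single request stream. The sketch below assumes fully serialized requests and ignores network and client-side overhead, so it is an upper bound on one client's serialized rate, not a measured result:

```python
# Throughput ceiling implied by the measured 59 microsecond server-side
# latency for a 4 KB transfer, assuming one fully serialized request
# stream and ignoring network and client-side overhead (simplifications).
LATENCY_S = 59e-6          # measured server-side latency per request
TRANSFER_BYTES = 4096      # 4 KB read

requests_per_sec = 1 / LATENCY_S
throughput_mb_s = requests_per_sec * TRANSFER_BYTES / 1e6

print(f"{requests_per_sec:.0f} requests/s, "
      f"{throughput_mb_s:.1f} MB/s per serialized request stream")
```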
Efficiency: Given the above observations, it is not surprising that the number of
events per second increases with the number of clients. However, the graph also shows
that the rate of increase unexpectedly slows after about 200 clients. This
effect is, unfortunately, a benchmark-induced aberration. The first two hundred clients were each run
on a dedicated machine; beyond that, up to two clients were run on each machine.
Our measurements show that running more than one client on a machine reduced
each client's performance on that machine by 9.7%. This loss of efficiency
appears as a deviation from the expected event rate.
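The deviation can be modeled with a simple sketch. The 9.7% per-client penalty on a shared machine is the measured figure, and the client layout (one client per machine up to 200, then doubling up) follows the test description; the per-client base rate is illustrative:

```python
# Hedged model of the aggregate event rate when clients double up.
# SHARED_PENALTY is the measured 9.7% per-client loss; BASE_RATE is an
# illustrative per-client event rate, not a measured value.
SHARED_PENALTY = 0.097
BASE_RATE = 1.0  # events/s per client on a dedicated machine (illustrative)

def aggregate_rate(n_clients, n_machines=200):
    """Aggregate event rate when clients beyond n_machines double up."""
    doubled = max(0, n_clients - n_machines)  # machines hosting 2 clients
    solo = n_clients - 2 * doubled            # clients alone on a machine
    shared = 2 * doubled                      # clients sharing a machine
    return BASE_RATE * (solo + shared * (1 - SHARED_PENALTY))

ideal = aggregate_rate(400, n_machines=400)   # no sharing: linear scaling
modeled = aggregate_rate(400)                 # all 200 machines doubled up
print(f"ideal {ideal:.1f}, modeled {modeled:.1f} "
      f"({100 * (1 - modeled / ideal):.1f}% below linear)")
```

At 400 clients every machine hosts two clients, so the model predicts the full 9.7% shortfall from linear scaling; between 200 and 400 clients the shortfall grows gradually, matching the slowing rate of increase seen in the graph.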