Introduction to the XrootdFS

Wei Yang

Stanford Linear Accelerator Center

yangw@slac.stanford.edu

 

What is XrootdFS

Introduction to FUSE

Introduction to Xrootd

How XrootdFS works

Xrootd Composite Name Space

Installation Guide

Install FUSE

Install XrootdFS

Performance

Limitations

Acknowledgement

Reference

 

What is XrootdFS

 

XrootdFS is a POSIX filesystem for an Xrootd storage cluster. It is based on FUSE (Filesystem in Userspace) and runs in user space. Many storage tools in the Grid computing world, such as GridFTP and the Storage Resource Manager (SRM), were designed for filesystem-like storage. The goal of XrootdFS is to allow these tools to work with Xrootd-based storage systems. The filesystem also gives users a convenient way to keep track of what is in their Xrootd storage systems.

Introduction to FUSE

XrootdFS is designed under the FUSE framework. Using XrootdFS requires installing the FUSE software, which can be downloaded from the FUSE web page. That page also provides very useful information for understanding how XrootdFS works and for troubleshooting problems with XrootdFS.

Introduction to Xrootd

Xrootd is a high-performance network storage system widely used in High Energy Physics experiments such as BaBar, STAR and the LHC. The underlying Xroot data transfer protocol provides highly efficient access to ROOT-based data files. For more information about Xrootd, please refer to the Xrootd web page.

 

Using XrootdFS to access data files does not take advantage of the efficiencies provided by the Xroot data transfer protocol. For this reason, the preferred uses of XrootdFS are data import, export and data management, not the actual data analysis. Also, because Xrootd is designed with large data files in mind, it is not efficient to use XrootdFS for a large number of small files.

How XrootdFS works

XrootdFS is designed under the FUSE framework. FUSE provides a kernel module that intercepts user requests to the XrootdFS filesystem and directs them to a user-space program that does the actual work. Please refer to the FUSE web page for technical details on how FUSE works.

 

XrootdFS and FUSE together provide a seamless filesystem interface to the underlying storage system upon which the user-space program operates. It is the XrootdFS user-space program that implements normal I/O operations such as stat(), open(), close(), read(), write(), seek(), opendir(), readdir() and closedir() against the Xrootd storage system. The following figure shows how all these pieces work together.

 

An Xrootd storage cluster only supports querying file status, not directory status. One can optionally use a Composite Name Space (CNS) to record file and directory information. Without CNS, most XrootdFS functions still work, except listing directory contents. (In other words, the readdir() function returns nothing without CNS.)
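
As a concrete illustration (a sketch only; the mount point and file names below are hypothetical examples), this is the behavior one would expect on an XrootdFS mount without CNS:

    $ ls /mnt/xrootdfs/atlas/data                  # readdir() without CNS: returns nothing
    $ ls -l /mnt/xrootdfs/atlas/data/file1.root    # stat() on a known file: still works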

 

 

Because a user program can access an Xrootd storage system directly or via XrootdFS, both XrootdFS and the Xrootd storage system itself need to update the CNS when there are operations such as create, remove, rename, mkdir and rmdir.

Xrootd Composite Name Space

XrootdFS can optionally use a dedicated instance of Xrootd to store the file and directory information of an Xrootd cluster. File and directory creation, deletion and rename on the Xrootd cluster are replicated to this special Xrootd instance by XrootdFS. We call this special instance of Xrootd the Composite Name Space (CNS). If a user chooses to use the Xroot protocol to access the Xrootd storage cluster directly, a Cluster Name Space daemon running on every storage data server node forwards these operations to the CNS.

 

The CNS hosts all file and directory information of an Xrootd storage cluster. Each file on the CNS has exactly the same size, modification time, etc. as its peer in the storage cluster, except that files on the CNS do not actually contain data, and therefore use only a minimal amount of disk space. However, they do use inodes.

 

The CNS is only used to store file and directory information. Most of XrootdFS can function without it. The notable exception without CNS is that listing a directory returns no files. Listing a single file still works without CNS.

Update (2010-07-12): XrootdFS releases >= 3.0rc3 function without CNS. If CNS is not used, these newer XrootdFS releases collect directory entries directly from the Xrootd data servers.

Installation Guide

Install FUSE

Please refer to the FUSE web page for installation instructions, troubleshooting and the mailing list. Installing FUSE, including the FUSE kernel module, is required in order to use XrootdFS.
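
Before starting XrootdFS it is worth verifying that the FUSE kernel module is loaded and its device node exists. The following is only a sketch; command locations and packaging may differ between distributions:

    $ /sbin/modprobe fuse        # load the FUSE kernel module (requires root)
    $ /sbin/lsmod | grep fuse    # confirm the module is loaded
    $ ls -l /dev/fuse            # device node used by FUSE user-space programs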

Install XrootdFS

Download the XrootdFS tar ball and expand it under the top FUSE source directory. This creates an xrootdfs directory. Go to that directory, set the locations of the Xrootd source and library directories in the Makefile (variables XRDSRC and XRDLIB), then run make to build the necessary executable.
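
A sketch of the build steps follows; the tar ball name and the FUSE and Xrootd locations are placeholders that should be replaced with your own paths:

    $ cd /path/to/fuse-source              # top of the FUSE source tree
    $ tar xzf /path/to/xrootdfs.tar.gz     # creates the xrootdfs directory
    $ cd xrootdfs
    $ vi Makefile                          # set XRDSRC and XRDLIB to the Xrootd source and library directories
    $ make                                 # builds the XrootdFS executable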

 

All runtime-configurable environment variables are located in start.sh and stop.sh. Of these, MOUNT_POINT and XROOTDFS_RDRURL are required; XROOTDFS_CNSURL and XROOTDFS_FASTLS are optional. The variables are described below, followed by a sample configuration sketch.

 

MOUNT_POINT should be the absolute path to the actual mount point of the XrootdFS. The mount point directory itself should be empty (otherwise the -o nonempty option is needed in start.sh).

 

XROOTDFS_RDRURL is the URL of the top Xrootd path to the Xrootd storage cluster. The host name and port should be those of the Xrootd redirector in the cluster.

 

XROOTDFS_CNSURL is the URL of the top Xrootd path to the CNS. This environment variable is optional. If there is no CNS instance running, this environment variable must not be defined.

 

XROOTDFS_FASTLS is also optional. If it is defined, XrootdFS will perform stat() calls only against the CNS. This can significantly speed up directory listing. XROOTDFS_FASTLS can be set to any value. If it is not defined, stat() calls are performed against the Xrootd redirector.
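
Putting these together, the variable settings in start.sh might look like the following sketch. The host names, port numbers and paths are hypothetical, and the exact URL form expected may vary between XrootdFS releases:

    # Required
    MOUNT_POINT=/mnt/xrootdfs                                       # an empty directory to mount on
    XROOTDFS_RDRURL=root://xrd-redirector.example.org:1094//atlas   # top Xrootd path at the redirector
    # Optional: define only if a CNS instance is running
    XROOTDFS_CNSURL=root://xrd-redirector.example.org:2094//atlas   # top Xrootd path at the CNS
    # Optional: perform stat() against the CNS to speed up directory listing
    XROOTDFS_FASTLS=1
    export MOUNT_POINT XROOTDFS_RDRURL XROOTDFS_CNSURL XROOTDFS_FASTLS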

 

We suggest running the CNS on the same machine as the Xrootd redirector, using the same xrootd.path value in the Xrootd configuration. The Xrootd redirector normally does not use many system resources and we do not expect the CNS to be resource intensive either, so a reasonably modern machine should be able to host both. Doing so requires that the CNS instance of Xrootd be run under a different port number and a different Xrootd name space (see the -n option of xrootd). One advantage of this setup is that, with the Xrootd Posix Preload library (LD_PRELOAD), one can do an ls of a directory against the Xrootd redirector, and the redirector will return information for all files in that directory. This is similar to what the CNS provides without XROOTDFS_FASTLS.
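
For example, one could list a directory directly against the redirector along the following lines. This is only a sketch: the preload library name and location, and the URL form it accepts, are assumptions that depend on your Xrootd installation:

    $ export LD_PRELOAD=/path/to/xrootd/lib/libXrdPosixPreload.so   # Xrootd Posix Preload library (path is an assumption)
    $ ls -l root://xrd-redirector.example.org:1094//atlas/data      # listing served by the redirector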

Performance

Our testing environment has one Xrootd redirector and two Xrootd data servers. We run the CNS on the redirector. These three hosts are all (very old) VA 1220 machines (2 x 866MHz Intel Pentium, 1GB memory, and 100Mbit/sec Ethernet) running Red Hat Enterprise Linux 3.

 

XrootdFS runs on a more powerful machine (2 x dual-core AMD Opteron 275 at 2.2GHz, 4GB memory, 1Gb/sec Ethernet, and 32-bit Red Hat Enterprise Linux 4). With Unix dd, globus-url-copy and uberftp, the read and write speeds range from 5-7MB/sec when using a 128KB I/O block size. With Unix cp and other tools that use a 4KB block size, the I/O speed is around 0.9MB/sec.
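
A sketch of how one might reproduce the large-block dd measurement through the mount point (the mount point, file name and size are illustrative):

    $ dd if=/dev/zero of=/mnt/xrootdfs/testfile bs=128k count=800   # write ~100MB with a 128KB block size
    $ dd if=/mnt/xrootdfs/testfile of=/dev/null bs=128k             # read it back with the same block size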

 

As a rough estimate of ls speed, we get roughly 25-30 stat() calls per second without XROOTDFS_FASTLS (so all stat() calls are performed against the redirector), and roughly 100 stat() calls per second with XROOTDFS_FASTLS (all stat() calls are done against the CNS).

Limitations

FUSE always reports a 4KB block size for the filesystems it is responsible for (ignoring the st_blksize returned by the stat() call). Unix tools such as cp and tar use this value as the read/write block size. Therefore, cp is significantly slower than dd with a large block size. There is a kernel patch that solves this problem; however, this patch is not in the mainstream Red Hat kernel we are using.
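
The reported block size can be observed with GNU stat; the paths below are hypothetical examples, and the first command is expected to print 4096 on a FUSE mount:

    $ stat -c '%o' /mnt/xrootdfs/atlas/data/file1.root   # I/O block size seen through FUSE
    $ stat -c '%o' /tmp/localfile                        # compare with a local filesystem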

 

Currently the mv command is very costly: it essentially does a cp plus an rm. And since we have not implemented the equivalent of utime() in XrootdFS, file timestamps are lost under mv.

 

Xrootd storage systems impose a 5-second delay when creating a new file, so untarring a tar ball containing a large number of small files takes a long time (for example, 1,000 new files would spend roughly 5,000 seconds, well over an hour, in create delays alone).

File ownership and permissions have not been implemented. All files are owned by the user running the XrootdFS program.

 

The current version of the Xrootd software does not provide the total space and total free space of the storage system. Thus the numbers returned by the df command on an XrootdFS filesystem are predefined, fixed numbers. A future version of the Xrootd software will address this issue.

Acknowledgement

Thanks to Andrew Hanushevsky of SLAC for adding new functions to the Xrootd software to aid XrootdFS development, and for the idea of using a Composite Name Space. Thanks also to Fabrizio Furano of INFN and Karl Amrhein of SLAC for the debugging, new ideas and experiments that helped XrootdFS development.

Reference

The Scalla Software Suite: xrootd/olbd: http://xrootd.slac.stanford.edu

 

Filesystem in Userspace: http://fuse.sourceforge.net