
The Lustre Distributed Filesystem

Petros Koutoupis

Issue #210, October 2011

Deploy a scalable and high-performing distributed filesystem in your clustered environment with Lustre.

There comes a time in a network or storage administrator's career when a large collection of storage volumes needs to be pooled together and distributed to a cluster or to multiple clients, while maintaining high performance with few or no bottlenecks when accessing the same files. That is where Lustre comes into the picture. The Lustre filesystem is a high-performance distributed filesystem intended for large network and high-availability environments.

The Storage Area Network and Linux

Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI. To better explain what a SAN is, it may be more beneficial to begin with what it isn't. For instance, a SAN shouldn't be confused with a Local Area Network (LAN), even if that LAN carries storage traffic (that is, via networked filesystem shares and so on). Only if the LAN carries storage traffic using the iSCSI or FCoE protocols can it then be considered a SAN. Another thing that a SAN isn't is Network Attached Storage (NAS). Again, the SAN relies heavily on a SCSI protocol, while the NAS uses the NFS and SMB/CIFS file-sharing protocols.

An external storage target device will represent storage volumes as Logical Units within the SAN. Typically, a set of Logical Units will be mapped across a SAN to an initiator node (in our case, the server or servers managing the Lustre filesystem). In turn, the server(s) will identify one or more SCSI disk devices within its SCSI subsystem and treat them as if they were local drives. The number of SCSI disks identified is determined by the number of Logical Units mapped to the initiator. If you want to follow along with the examples here, it is relatively simple to configure a couple of virtual machines: one as the server node, with one or more additional disk devices to export, and the second to act as a client node and mount the Lustre-enabled volume. Although it is bad practice, for testing purposes, it also is possible to have a single virtual machine configured as both server and client.
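
If you want to confirm which Logical Units the initiator has actually identified as SCSI disks, a minimal check (assuming a RHEL 5-era system; the device names that appear will vary with your setup) is to query the SCSI subsystem and list the partition tables:

[petros@lustre-host ~]$ cat /proc/scsi/scsi
[petros@lustre-host ~]$ sudo /sbin/fdisk -l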

The Distributed Filesystem

A distributed filesystem allows access to files from multiple hosts sharing a computer network. This makes it possible for multiple users on multiple client nodes to share files and storage resources. The client nodes do not have direct access to the underlying block storage but interact with it over the network using a protocol, which makes it possible to restrict access to the filesystem through access lists or capabilities on both the servers and the clients. This is unlike a clustered filesystem, where all nodes have equal access to the block storage on which the filesystem resides and access control, therefore, must reside on the client. Another advantage of distributed filesystems is that they may provide facilities for transparent replication and fault tolerance, so when a limited number of nodes in a filesystem go off-line, the system continues to work without any data loss.

Lustre (a blend of Linux and cluster) is one such distributed filesystem, usually deployed for large-scale cluster computing. Licensed under the GNU General Public License (GPL), Lustre provides a solution in which high performance and scalability to tens of thousands of nodes and petabytes of storage become a reality, and it is relatively simple to deploy and configure. Although Lustre 2.0 has been released, for this article, I work with the generally available 1.8.5.

Lustre has a somewhat unique architecture, with three major functional units. The first is a single metadata server, or MDS, that contains a single metadata target, or MDT, for each Lustre filesystem. The MDT stores namespace metadata, which includes filenames, directories, access permissions and file layout. The MDT data is stored in a single disk filesystem mapped locally to the serving node; it is a dedicated filesystem that controls file access and informs the client node(s) which object(s) make up a file. Second are one or more object storage servers (OSSes) that store file data on one or more object storage targets, or OSTs. An OST is a dedicated object-based filesystem exported for read/write operations. The capacity of a Lustre filesystem is determined by the sum of the total capacities of the OSTs. Finally, there are the clients that access and use the file data.

Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, which allows concurrent and coherent read and write access to the files in the filesystem. When a client accesses a file, it completes a filename lookup on the MDS, and either a new file is created or the layout of an existing file is returned to the client. After locking the file on the OST, the client then runs one or more read or write operations on the file but does not directly modify the objects on the OST. Instead, it delegates those tasks to the OSS. This approach ensures scalability and improves security and reliability, because clients are not given direct access to the underlying storage, which would increase the risk of filesystem corruption from misbehaving/defective clients. Although all three components (MDT, OST and client) can run on the same node, they typically are configured on separate nodes communicating over a network (see the details on LNET later in this article). In this example, I'm running the MDT and OST on a single server node while the client accesses the OST from a separate node.

Installing Lustre

To obtain Lustre 1.8.5, download the prebuilt binaries packaged as RPMs, or download the source and build the modules and utilities for your respective Linux distribution. Oracle provides server RPM packages for both Oracle Enterprise Linux (OEL) 5 and Red Hat Enterprise Linux (RHEL) 5, and client RPM packages for OEL 5, RHEL 5 and SUSE Linux Enterprise Server (SLES) 10 and 11. If you will be building Lustre from source, ensure that you are using Linux kernel 2.6.16 or greater. Note that in all deployments of Lustre, the servers that run an MDS, MGS (discussed below) or OSS must utilize a patched kernel. Running a patched kernel on a Lustre client is optional and required only if the client will be used for multiple purposes, such as running as both a client and an OST.
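
If you do go the source route, the build follows the usual autotools flow. The sketch below assumes a patched kernel source tree already is in place; the tarball name and kernel source path are only examples and will differ on your system:

$ tar xzf lustre-1.8.5.tar.gz && cd lustre-1.8.5
$ ./configure --with-linux=/usr/src/linux-2.6.18-lustre
$ make
$ make rpms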

If you already have a supported operating system, make sure that the patched kernel, lustre-modules, lustre-ldiskfs (a Lustre-patched backing filesystem kernel module package for the ext3 filesystem), lustre (which includes userspace utilities to configure and run Lustre) and e2fsprogs packages are installed on the host system, resolving their dependencies from a local or remote repository. Use the rpm command to install all necessary packages:

$ sudo rpm -ivh kernel-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-modules-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm 
$ sudo rpm -ivh lustre-ldiskfs-3.1.3-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm

$ sudo rpm -ivh e2fsprogs-1.41.10.sun2-0redhat.oel5.i386.rpm

After these packages have been installed, list the boot directory to reveal the newly installed patched Linux kernel:

[petros@lustre-host ~]$ ls /boot/
config-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
grub
initrd-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.img
lost+found
symvers-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.gz
System.map-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
vmlinuz-2.6.18-194.3.1.0.1.el5_lustre.1.8.4

View the /boot/grub/grub.conf file to validate that the newly installed kernel has been set as the default kernel. Now that all packages have been installed, a reboot is required to load the new kernel image. Once the system has been rebooted, an invocation of the uname command will reveal the currently booted kernel image:

[petros@lustre-host ~]$ uname -a
Linux lustre-host 2.6.18-194.3.1.0.1.el5_lustre.1.8.4 #1 
 ↪SMP Mon Jul 26 22:12:56 MDT 2010 i686 i686 i386 GNU/Linux

Meanwhile, on the client side, the packages for the lustre client (utilities for patchless clients) and lustre client modules (modules for patchless clients) need to be installed on all desired client machines:

$ sudo rpm -ivh lustre-client-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-client-modules-1.8.4-2.6.18_194.3.1.0.1.el5_
↪lustre.1.8.4.i686.rpm

Note that these client machines need to be within the same network as the host machine serving the Lustre filesystem. After the packages are installed, reboot all affected client machines.
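
Before moving on, it may be worth verifying that the client modules actually are available on each client. A minimal sanity check, assuming the RPMs installed cleanly, is to query the module metadata:

[petros@client1 ~]$ /sbin/modinfo lnet
[petros@client1 ~]$ /sbin/modinfo lustre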

Configuring Lustre Server

In order to configure the Lustre filesystem, you need to configure Lustre Networking, or LNET, which provides the communication infrastructure required by the Lustre filesystem. LNET supports many commonly used network types, which include InfiniBand and IP networks. It allows simultaneous availability across multiple network types with routing between them. In this example, let's use tcp, so use your favorite editor to append the following line to the /etc/modprobe.conf file:

options lnet networks=tcp

This step restricts LNET to using only the specified network type (tcp, in this example) and prevents LNET from using all available network interfaces.
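
Once that option is in place, LNET can be sanity-checked by loading the module and listing the node's network identifiers (NIDs). On a server whose address is 10.0.2.15, the output would look something like the following (the address is, of course, specific to this example setup):

[petros@lustre-host ~]$ sudo /sbin/modprobe lnet
[petros@lustre-host ~]$ sudo /usr/sbin/lctl network up
[petros@lustre-host ~]$ sudo /usr/sbin/lctl list_nids
10.0.2.15@tcp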

Before moving on, it is important to discuss the role of the Management Server or MGS. The MGS stores configuration information for all Lustre filesystems in a clustered setup. An OST contacts the MGS to provide information, while the client(s) contact the MGS to retrieve information. The MGS requires its own disk for storage, although there is a provision that allows the MGS to share a disk with a single MDT. Type the following command to create a combined MGS/MDT node:

[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre 
 ↪--fsname=lustre --mgs --mdt /dev/sda1

   Permanent disk data:
Target:     lustre-MDTffff
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x75
              (MDT MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups

device size = 1019MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sda1
        target name  lustre-MDTffff
        4k blocks     261048
        options        -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-MDTffff  -i 4096 -I 512 -q -O 
 ↪dir_index,uninit_groups -F /dev/sda1 261048
Writing CONFIGS/mountdata

If no name is provided, the fsname defaults to lustre. If more than one of these filesystems is created, it is necessary to use a unique name for each labeled volume. These names become very important when you access the target on the client system.
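
For example, if the same MGS were to serve a second filesystem, its MDT could be formatted with its own name along the following lines (the name scratch and the device /dev/sdb1 are purely hypothetical):

[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre --fsname=scratch 
 ↪--mdt --mgsnode=10.0.2.15@tcp0 /dev/sdb1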

Create the OST by executing the following command:

[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre --ost 
 ↪--fsname=lustre --mgsnode=10.0.2.15@tcp0 /dev/sda2

   Permanent disk data:
Target:     lustre-OSTffff
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x72
              (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.0.2.15@tcp

checking for existing Lustre data: not found
device size = 1027MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sda2
        target name  lustre-OSTffff
        4k blocks     263064
        options        -J size=40 -i 16384 -I 256 -q -O 
 ↪dir_index,extents,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-OSTffff  
 ↪-J size=40 -i 16384 -I 256 -q -O 
 ↪dir_index,extents,uninit_groups -F /dev/sda2 263064
Writing CONFIGS/mountdata

When the target needs to provide information to the MGS, or when the client contacts the MGS for an information lookup, both need to know where the MGS is, which you have defined for this target as the server's IP address followed by the network type used for communication. As a reminder, that interface was defined earlier in the /etc/modprobe.conf file.
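
If you ever need to double-check which MGS NID a target was formatted with, tunefs.lustre can read the stored parameters back without reformatting the device:

[petros@lustre-host ~]$ sudo /usr/sbin/tunefs.lustre --print /dev/sda2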

Now you easily can mount these newly formatted devices local to the host machine. Mount the MDT:

[petros@lustre-host ~]$ sudo mkdir -p /mnt/mdt
[petros@lustre-host ~]$ sudo mount -t lustre /dev/sda1 /mnt/mdt/

Verify that it is mounted with the df command:

[petros@lustre-host ~]$ df -t lustre
Filesystem          1K-blocks   Used  Available Use% Mounted on
/dev/sda1              913560   17752 843600    3%   /mnt/mdt

The /var/log/messages file will log the following messages:

Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000: 
 ↪new disk, initializing
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000: 
 ↪Now serving lustre-MDT0000 on /dev/sda1 with recovery enabled
Jun 25 10:26:54 lustre-host kernel: Lustre: 
 ↪3285:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) 
 ↪lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000.mdt: 
 ↪set parameter group_upcall=/usr/sbin/l_getgroups

Mount the OST:

[petros@lustre-host ~]$ sudo mkdir -p /mnt/ost
[petros@lustre-host ~]$ sudo mount -t lustre /dev/sda2 /mnt/ost/

Verify that it is mounted with the df command:

[petros@lustre-host ~]$ df -t lustre
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sda1            913560     17768    843584   3% /mnt/mdt
/dev/sda2           1035692     42460    940620   5% /mnt/ost

The /var/log/messages file will log the following messages:

Jun 25 10:41:33 lustre-host kernel: Lustre: 
 ↪lustre-OST0000: new disk, initializing
Jun 25 10:41:33 lustre-host kernel: Lustre: 
 ↪lustre-OST0000: Now serving lustre-OST0000 on 
 ↪/dev/sda2 with recovery enabled
Jun 25 10:41:39 lustre-host kernel: Lustre: 
 ↪3417:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: 
 ↪add target lustre-OST0000_UUID
Jun 25 10:41:39 lustre-host kernel: Lustre: 
 ↪3188:0:(quota_master.c:1716:mds_quota_recovery()) 
 ↪Only 0/1 OSTs are active, abort quota recovery
Jun 25 10:41:39 lustre-host kernel: Lustre: lustre-OST0000: 
 ↪received MDS connection from 0@lo
Jun 25 10:41:39 lustre-host kernel: Lustre: MDS lustre-MDT0000: 
 ↪lustre-OST0000_UUID now active, resetting orphans

Even though they are mounted like typical filesystems, you will notice that the MDT and OST labeled volumes do not behave like typical filesystems, and their contents cannot be browsed directly. That is where the client comes into play:

[petros@lustre-host ~]$ ls /mnt/ost/
ls: /mnt/ost/: Not a directory
[petros@lustre-host ~]$ sudo ls /mnt/ost/
ls: /mnt/ost/: Not a directory
[petros@lustre-host ~]$ sudo ls -l /mnt/
total 4
d--------- 1 root root    0 Jun 25 10:22 ost
d--------- 1 root root    0 Jun 25 10:22 mdt
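
On the server side, the state of the newly mounted targets is better observed through Lustre's own device list than through the mountpoints themselves. A quick check, given the setup above, is to ask lctl for the list of local Lustre devices, which should report the MGS, MDT and OST as UP:

[petros@lustre-host ~]$ sudo /usr/sbin/lctl dl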

Configuring Lustre Client(s)

Remember that you named the Lustre-enabled volume lustre. When you mount the volume over the network on the client node, you must specify this name. It can be observed below, following the network method (tcp0) by which you are mounting the remote volume. Again, TCP was defined in the /etc/modprobe.conf file as the supported LNET network type:

[petros@client1 ~]$ sudo mkdir -p /lustre
[petros@client1 ~]$ sudo mount -t lustre 
 ↪10.0.2.15@tcp0:/lustre /lustre/

After successfully mounting the remote volume, you will see the /var/log/messages file append:

Jun 25 10:44:17 client1 kernel: Lustre: 
 ↪Client lustre-client has started

Use df to list the mounted Lustre-enabled volumes:

[petros@client1 ~]$ df -t lustre
Filesystem              1K-blocks   Used  Available Use% Mounted on
10.0.2.15@tcp0:/lustre    1035692  42460     940556   5% /lustre
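
To have the client remount the filesystem automatically at boot, an entry along the following lines can be added to the client's /etc/fstab (a sketch assuming the same MGS NID and mountpoint as above; the _netdev option defers the mount until the network is up):

10.0.2.15@tcp0:/lustre  /lustre  lustre  defaults,_netdev  0 0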

Once mounted, the filesystem can be accessed by the client node. For instance, you now can read and write files located on the mounted OST. In the following example, let's write approximately 40MB of data to a new file on /lustre:

[petros@client1 lustre]$ sudo dd if=/dev/zero 
 ↪of=/lustre/test.dat bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 1.30103 seconds, 32.2 MB/s

A df listing will reflect the changes of available and used capacities to the /lustre mountpoint:

[petros@client1 lustre]$ df -t lustre
Filesystem              1K-blocks   Used  Available Use% Mounted on
10.0.2.15@tcp0:/lustre  1035692     83420 899596    9%   /lustre
[petros@client1 lustre]$ ls -l
total 40960
-rw-r--r-- 1 root root 41943040 Jun 25 10:47 test.dat
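
Because the file's data actually lives as one or more objects on the OST, the lfs utility on the client can show how the new file is laid out across the OSTs and how much space each OST is consuming; a quick look (output omitted here) would be:

[petros@client1 lustre]$ lfs getstripe /lustre/test.dat
[petros@client1 lustre]$ lfs df -h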

Summary

Although I covered only a simple introduction and tutorial of the Lustre distributed filesystem, there still is much uncharted territory and many methods by which the filesystem can be configured for a truly rich computing environment. For instance, Lustre can be configured for high availability to ensure that, in the event of a failure, the system's services continue without interruption. That is, access between the client(s), server(s) and external target storage always remains available through a process called failover. Additional HA is provided by implementing disk-drive redundancy in some sort of RAID configuration (hardware, or software via mdadm) to handle drive failures. These high-availability techniques normally would apply to the server nodes hosting the MDS and OSS. It is suggested to place the OST storage on RAID 5 or, preferably, RAID 6, while the MDT storage should be RAID 1 or RAID 0+1.
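
As one concrete illustration, failover is typically declared when a target is formatted, by telling it the NID of its backup server. The sketch below reuses the OST from this article, and the failover address 10.0.2.16 is purely hypothetical:

[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre --ost --fsname=lustre 
 ↪--mgsnode=10.0.2.15@tcp0 --failnode=10.0.2.16@tcp0 /dev/sda2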

It isn't a difficult technology to work with, and there exists an entire community with excellent administrator and developer resources, ranging from articles to mailing lists and more, to aid the novice in installing and maintaining a Lustre-enabled system. Commercial support for Lustre is available from a non-exhaustive list of vendors selling bundled computing and Lustre storage systems. Many of these same vendors also contribute to the Open Source community surrounding the Lustre Project.

Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than seven years and enjoys discussing the same technologies.
