Abstract

PUFS - Peer Union File System - is a poor man's naïve distributed file system built on top of FUSE, hence running totally in user space.

It is distributed under the GPL license.

PUFS' philosophy is somewhat in line with a network grid paradigm in the form of a file system: its contents consists in a unified view of the directories exported by each peer and can become larger or smaller as these peers join or leave this "union".

It is designed to work best on LANs where multicast is readily available, allowing seamless peer auto-discovery and notification. It can also be used in unicast mode, but it this case all the contributing peers must be enumerated in advance.

The peers constitute a topological mesh where any peer is virtually connected to every other one, so that the unified file hierarchy will be the same all around, whichever node a snapshot is taken.

Since there's no central authority enforcing membership or maintaining connectivity status, peers are actually loosely bound to the "union" and can leave or join any time.

Changes in the file system, both data and metadata are reflected on a best effort basis. In particular, the most recently accessed metadata is typically cached and is considered valid throughout a configurable period of time, after which it is refreshed.
One the other hand, a failure or a downtime in one node will be reflected in PUFS by the disappearance of the segment or top directory exported by that very peer.

PUFS is an almost POSIX compliant file system with the current exception of of both symbolic and hard links which are lacking.
File locking, both mandatory and advisory, is also missing due to a limitation in FUSE which doesn't dispatch fcntl(2) calls to the applications.

Rationale

Sometime ago I thought of exporting a tmpfs file system via NFS. This might look as a silly idea but I was on a quest to put together a quick-and-dirty read-only distributed shared memory system. The problem is that NFS does not support some pseudo file systems, tmpfs being one of them.

At that point I thought it wouldn't be too difficult to write a FUSE based file system that would act as a simple wrapper to tmpfs and at the same time act as a daemon providing a protocol for requesting file system operations through a TCP/IP network.

While I was at it, I realised that such a project shouldn't limit itself to wrap tmpfs but in fact could act as proxy to any POSIX compatible file system.

There are, actually, many distributed file systems out there, but most look overly complex, overkill, or cumbersome to use, by either:

PUFS curtails these particularities by taking a simpler and decentralised approach to the problem which is sufficient for many applications.

Usage scenarios

PUFS setting up is simple, just throw in a bunch of cooperating peers and you can have instantly a distributed file system up and running. This simplicity comes, however, with a price: despite a limited support for replication, PUFS is essentially non-redundant an not failsafe.
It is best suited for applications focused mainly in distributing and sharing data, where the impact of a temporary outage on a peer is not significant or can be remedied without loss.

Typical usage scenarios comprise:

Networking

Peers communicate with each other by two means: multicast datagrams and stream oriented connections.

Multicast packets are typically used for sending notifications and location requests.
Notifications (or traps in PUFS terminology) are sent whenever modifications on local files are made which affect metadata, ie. namespace changes (creation, renaming and removal of nodes), or file attribute changes (such as modification times). Traps are collected and sent in deferred time so as to avoid excessive traffic such as that which occurs when writing multiple blocks to a file.

Location requests are used upon discovery time, ie. when there's a need for a peer to locate a requested node (file or directory) which isn't found locally.

Namespaces and Brokering

Namespace management is currently completely decentralised in PUFS. In other words this means that there is no entity who knows, at any given time, where each file resides.
At join time each peer advertises itself by multicasting its identity, among other initialisation tasks, discussed below.
Upon receiving this trap, every other peer inserts the brand new host in their list of known peers.

File (vnode in PUFS terminology) discovery is then done by multicasting special location requests and waiting a configurable amount of time for responses from every known peers. Failure to get a response from any of them is not considered to be a node failure until an explicit TCP connection attempt to that peer fails.
Subsequent lookups on the same file made by applications use the cached metadata information stored in the respective peer.

The cached data per node has a validity which is limited in time and its expiration triggers a new round of location requests/responses. The expiry time can also be postponed when a trap concerning that node is received, which has the effect of an unsolicited metadata update.

Directories undergo a slightly more elaborate process when getting their contents. Instead of blindingly requesting a full listing each time the respective cache entry expires, peers use a global timestamp as a checkpoint value for requesting only an incremental list of changes made in that directory.
These changes may comprise node deletions or additions. This global timestamp is negotiated at the time a peer joins the union, by computing a rudimentary estimate for the clock offset between it and the peer with the greatest uptime using an algorithm similar to NTP.

This approach is simple to implement and requires no special central broker. However, in a clustered environment, it has admittedly several drawbacks, first and foremost is the consistency issue. Since multicasting is inherently unreliable, the discovery mechanism may end up not accomplishing its purpose leaving an undetermined number of nodes undiscovered. This is particularly serious when using replicated nodes discussed below. Lowering the node validity time is a possible way to minimise the impact of such errors but adversely affects response time and increases network traffic.

The future use of a reliable multicast protocol/framework (like Spread) is currently an open issue.

Buffer Cache

PUFS implements a buffer cache just like the Linux kernel. In fact bypasses the kernel cache whenever possible by using direct IO. Such approach is necessary as PUFS tries to achieve a best-effort single node consistency. In other words if a file is opened by several nodes, data written by one of them must invalidate the corresponding buffers in the other nodes, so the data can be reread again and reflect the changes made in the meantime.

The Linux kernel, by default, implements a write behind strategy, unless fdatasync(2) is explicitly called, and thus gives no control to an application as to when data is really written to the storage device (typically a hard disk). Therefore the only means to regain that control is for the applications to do the cache management themselves. PUFS reimplementation of the buffer cache also follows a write behind policy by 'syncing' dirty buffer data periodically to disk and multicasting the file offsets changed to all peers.
Besides asynchronous write operations PUFS also supports a simple adaptive read ahead strategy. The number of prefetched buffers is proportional to the sequentially buffers read so far up to a configurable limit.

Replication

PUFS has currently a very rudimentary support for file replication.
Directories specified in a suitable configuration directive are assumed to hold files replicated across the peers which run under with the same configuration. These "clustered" directories are established by the same discovery procedure discussed before.

Due to the very uncentralised nature of PUFS, the replication mechanism is inherently flawed. Since a peer can join or leave the union any time, the cluster membership is volatile, so changes made in a file are not attempted to be committed at once in all the clustered nodes before being considered successful. This would be the approach database systems use and is known as the 2 (or 3) phase commit protocol. Instead, PUFS assigns to each mutating operation on a clustered file, a global operation serial number (OSN) and stores it, along with some file status metadata, persistently on a disk based btree. It then asynchronously notifies all peers of the occurrence of an operation using the very same trap expedient described above. Peers pertaining to the same cluster will use the OSN to tell whether a given file is lagging and in that case will request all the changes made since the last local recorded OSN to a node having a more recent version. Currently the replication mechanism in PUFS is lazy, meaning that a replica will only pull changes when a file is explicitly accessed. One major flaw in this approach occurs in the event of all peers in a cluster being down. In this scenario a node which comes up but has a lagging file may erroneously assume that its version is the last one. Disputes can only be resolved by comparing OSN's which is not enough and may require manual intervention.

The use of replication in PUFS will be most useful in a write seldom/read often scenario. Non mutating operations made to any "clustered" file are despatched to each peer in a round robin fashion. This allows calls such as read(2) to be balanced among the cluster. More elaborate balancing policies might be added in the future, most notably one that despatches calls based on the idleness factor or load in each node.

Name Collisions

In the process of "unifying" the directory contents, name collisions are bound to appear.
Name clashes, if present are disambiguated by appending to each duplicate name the suffix @peer-uri where peer-uri is a URI-like expression which denotes how that peer can be contacted.

Performance

I've not yet done a systematic and extensive performance assessment, but informal measuring puts PUFS au pair with the NFS user mode server on Linux 2.6.11. NFS does tricks like reporting a 128KB block size in the stat(2) output structure, thus forcing utilities like cp(1) to use this large buffer in order to do network transfers more efficiently. One can also fiddle with PUFS parameters to tune performance, key parameters are the buffer size and the total cache size.
See further details here.

Security

There's no provision for secure communications in PUFS. Since it was designed for use in private LANs which are a multicast friendly medium, that was not a concern. If security is an issue, one can always set up an Stunnel to provide such a capability although limited to the encryption of TCP connections only. Anyway, the use of UDP in PUFS is limited to traps and discovery datagrams which do not convey sensitive data (apart from file names). On the OS level, the directories exported by a peer can be modified by any other peer, provided that they are not exported in read-only mode. The operations will be issued with the uid and gid of the user which runs the PUFS daemon.

Locking

Locking is a complex issue I've not tackled with yet mainly because FUSE doesn't proxy fcntl(2) system calls.