PUFS' philosophy is somewhat in line with a network grid paradigm in the form of a file system: its contents consists in a unified view of the directories exported by each peer and can become larger or smaller as these peers join or leave this "union".
It is designed to work best on LANs where multicast is readily available, allowing seamless peer auto-discovery and notification. It can also be used in unicast mode, but it this case all the contributing peers must be enumerated in advance. The peers constitute a topological mesh where any peer is virtually connected to every other one, so that the unified file hierarchy will be the same all around, whichever node a snapshot is taken. Since there's no central authority enforcing membership or maintaining connectivity status, peers are actually loosely bound to the "union" and can leave or join any time. Changes in the file system, both data and metadata are reflected on a best effort basis. In particular, the most recently accessed metadata is typically cached and is considered valid throughout a configurable period of time, after which it is refreshed.fcntl(2) calls to the applications.
Sometime ago I thought of exporting a tmpfs file system via NFS.
This might look as a silly idea but I was on a quest to put together a
quick-and-dirty read-only distributed shared memory system.
The problem is that NFS does not support some pseudo file systems,
tmpfs being one of them.
tmpfs and at the
same time act as a daemon providing a protocol for requesting file system
operations through a TCP/IP network.
While I was at it, I realised that such a project shouldn't limit itself to
wrap tmpfs but in fact could act as proxy to any POSIX compatible
file system.
There are, actually, many distributed file systems out there, but most look overly complex, overkill, or cumbersome to use, by either:
PUFS setting up is simple, just throw in a bunch of cooperating peers and you can have instantly a distributed file system up and running.
This simplicity comes, however, with a price:
despite a limited support for replication,
PUFS is essentially non-redundant an not failsafe.
It is best suited for applications focused mainly in distributing and
sharing data, where the impact of a temporary outage on a peer is not significant or can be remedied without loss.
tmpfs based directory/etc or, wilder, /proc or /sysPeers communicate with each other by two means: multicast datagrams and stream oriented connections.
Multicast packets are typically used for sending notifications and location
requests.
Notifications (or traps in PUFS terminology) are sent whenever
modifications on local files are made which affect metadata, ie. namespace
changes (creation, renaming and removal of nodes), or file attribute changes
(such as modification times). Traps are collected and sent in deferred time
so as to avoid excessive traffic such as that which occurs when writing multiple
blocks to a file.
Location requests are used upon discovery time, ie. when there's a need for a peer to locate a requested node (file or directory) which isn't found locally.
The cached data per node has a validity which is limited in time and its expiration triggers a new round of location requests/responses. The expiry time can also be postponed when a trap concerning that node is received, which has the effect of an unsolicited metadata update.
Directories undergo a slightly more elaborate process when getting their
contents. Instead of blindingly requesting a full listing each time the
respective cache entry expires, peers use a global timestamp as a checkpoint
value for requesting only an incremental list of changes made in that directory.
These changes may comprise node deletions or additions.
This global timestamp is negotiated at the time a peer joins the union, by
computing a rudimentary estimate for the clock offset between it and the peer
with the greatest uptime using an algorithm similar to NTP.
This approach is simple to implement and requires no special central broker. However, in a clustered environment, it has admittedly several drawbacks, first and foremost is the consistency issue. Since multicasting is inherently unreliable, the discovery mechanism may end up not accomplishing its purpose leaving an undetermined number of nodes undiscovered. This is particularly serious when using replicated nodes discussed below. Lowering the node validity time is a possible way to minimise the impact of such errors but adversely affects response time and increases network traffic.
The future use of a reliable multicast protocol/framework (like Spread) is currently an open issue.
PUFS implements a buffer cache just like the Linux kernel. In fact bypasses the kernel cache whenever possible by using direct IO. Such approach is necessary as PUFS tries to achieve a best-effort single node consistency. In other words if a file is opened by several nodes, data written by one of them must invalidate the corresponding buffers in the other nodes, so the data can be reread again and reflect the changes made in the meantime.
The Linux kernel, by default, implements a write behind strategy, unless
fdatasync(2) is explicitly called, and thus gives no control to an application
as to when data is really written to the storage device (typically a hard disk).
Therefore the only means to regain that control is for the applications to
do the cache management themselves. PUFS reimplementation of the buffer cache
also follows a write behind policy by 'syncing' dirty buffer data periodically
to disk and multicasting the file offsets changed to all peers.
Besides asynchronous write operations PUFS also supports a simple adaptive read ahead strategy. The number of prefetched buffers is proportional to the sequentially buffers read so far up to a configurable limit.
PUFS has currently a very rudimentary support for file replication.
Directories specified in a suitable configuration directive are
assumed to hold files replicated across the peers which run under with the
same configuration. These "clustered" directories are established by the same
discovery procedure discussed before.
read(2) to
be balanced among the cluster. More elaborate balancing policies might be added
in the future, most notably one that despatches calls based on the idleness
factor or load in each node.
In the process of "unifying" the directory contents, name collisions are bound
to appear.
Name clashes, if present are disambiguated by appending
to each duplicate name the suffix @peer-uri where peer-uri is a URI-like
expression which denotes how that peer can be contacted.
I've not yet done a systematic and extensive performance assessment, but
informal measuring puts PUFS au pair with the NFS user mode server on Linux 2.6.11. NFS does tricks like reporting a 128KB block size in the stat(2) output structure, thus forcing utilities like cp(1) to use this large buffer in order to do
network transfers more efficiently. One can also fiddle with PUFS parameters to
tune performance, key parameters are the buffer size and the total cache size.
See further details here.
There's no provision for secure communications in PUFS. Since it was designed for use in private LANs which are a multicast friendly medium, that was not a concern. If security is an issue, one can always set up an Stunnel to provide such a capability although limited to the encryption of TCP connections only. Anyway, the use of UDP in PUFS is limited to traps and discovery datagrams which do not convey sensitive data (apart from file names). On the OS level, the directories exported by a peer can be modified by any other peer, provided that they are not exported in read-only mode. The operations will be issued with the uid and gid of the user which runs the PUFS daemon.
Locking is a complex issue I've not tackled with yet mainly because FUSE
doesn't proxy fcntl(2) system calls.