BGmailFS

Posted by Benjamin Close on November 13, 2008 under Projects

This page documents a concept that was discussed between a number of members of the Wearable Computer Lab. We have not investigated whether the concept violates the Gmail terms and conditions of use. Hence, please consider what follows nothing more than an idea for now. If you have questions about this or would like to implement it, please contact me.


Background

With the growing amount of personal data every person is starting to accumulate, one of the biggest problems lies in how to back that data up. The introduction of cheap broadband has made network bandwidth relatively quick and also affordable. Despite this, backups still require disk space in alternative locations. With Google [1] now offering 3GB Gmail [2] email accounts, one possibility is to back up your data in your email.

However, emailing files to yourself is not only cumbersome but also very error prone. With a maximum message size limit [3] smaller than a number of the files you may want to back up, things become even more complex. As a user you would have to send multiple emails, each containing a section of a file.
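To make the splitting step concrete, here is a minimal Python sketch of cutting a file's bytes into numbered parts that each fit under a size cap, and of joining them again. The names `split_into_parts`, `join_parts` and `MAX_PART_BYTES` are made up for illustration, and the 20 MB value is only an assumed cap, not a figure taken from Gmail's documentation.

```python
from typing import Iterable, Iterator, Tuple

# Assumed cap for illustration only; the real limit is whatever the provider enforces.
MAX_PART_BYTES = 20 * 1024 * 1024

Part = Tuple[int, int, bytes]  # (index, total number of parts, chunk)


def split_into_parts(data: bytes, part_size: int = MAX_PART_BYTES) -> Iterator[Part]:
    """Yield numbered chunks so each one can be sent as its own email."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Ceiling division; an empty file still produces one (empty) part.
    total = max(1, -(-len(data) // part_size))
    for index in range(total):
        start = index * part_size
        yield index, total, data[start:start + part_size]


def join_parts(parts: Iterable[Part]) -> bytes:
    """Reassemble parts in index order, refusing to continue if any are missing."""
    ordered = sorted(parts, key=lambda part: part[0])
    total = ordered[0][1]
    if [part[0] for part in ordered] != list(range(total)):
        raise ValueError("missing or duplicate part")
    return b"".join(part[2] for part in ordered)
```

Carrying the index and the total with every part is what lets the receiving side notice a lost email, which is exactly the bookkeeping that is error prone when done by hand.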

Introducing the Backup Gmail FS

Soon after the creation of a Gmail-based filesystem [4] built on FUSE [5], I got to thinking that Gmail could be used to back up data via a file system.

The problem with the existing GMail FS is that it doesn’t deal with redundancy, has issues with large files, puts all the data content in one place and is limited to the size of one Gmail account.

Backup Gmail FS (BGMailFS for short) is the next step up. The aim of it is to feature:

  • data redundancy
  • data security
  • self sizing filesystem
  • no limit on file size

Each of these is detailed below.

Data Redundancy

Considering this is aimed at storing backups, data redundancy is a prime consideration. As such, it is important to take an approach where CRCs of the data are stored alongside it. Ideally the data would also be spread across multiple accounts in some kind of mirror or striped set.
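As a rough illustration of the mirror-plus-CRC idea, the Python sketch below stores a CRC32 in front of each block, writes the same record to every account, and on read returns the first copy whose checksum still matches. Accounts are modelled as plain dictionaries, and the function names are assumptions made for this example only.

```python
import zlib
from typing import Dict, List

Account = Dict[int, bytes]  # stand-in for one mail account: block number -> stored record


def write_mirrored(accounts: List[Account], block_no: int, payload: bytes) -> None:
    """Store the payload, prefixed by its 4-byte CRC32, in every account (a simple mirror)."""
    crc = zlib.crc32(payload).to_bytes(4, "big")
    for account in accounts:
        account[block_no] = crc + payload


def read_mirrored(accounts: List[Account], block_no: int) -> bytes:
    """Return the first copy whose stored CRC32 matches its payload."""
    for account in accounts:
        record = account.get(block_no)
        if record is None or len(record) < 4:
            continue  # this mirror has no usable copy
        stored, payload = record[:4], record[4:]
        if zlib.crc32(payload).to_bytes(4, "big") == stored:
            return payload
    raise IOError(f"no intact copy of block {block_no}")
```

A striped set would instead place different blocks in different accounts; the checksum handling stays the same.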

Data Security

The one problem with backing up to Google is: how much do you trust them? Whilst I know their motto is “Don’t Be Evil” [6], I can’t help thinking that a giant company might take a sneak peek at my data. Since the data is backups, it could very well be IP sensitive.

To combat this, a method like the Information Dispersal Algorithm (IDA) [7] [8], proposed by Michael Rabin in 1989, would enable us to distribute the data and also keep it secure. It does, however, rely on the pieces of data stored in different accounts never being pieced together (though encryption could help out there).
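The following Python sketch shows the general shape of such a dispersal scheme: a block is split into n pieces so that any m of them rebuild it. It is a simplified illustration in the spirit of the IDA, working modulo the prime 257, and not a faithful copy of the published algorithm. The names `disperse` and `recover` are assumptions, and piece values are kept as lists of integers because a value can be 256.

```python
from typing import Dict, List

P = 257  # smallest prime above 255, so every byte value is a field element


def disperse(data: bytes, n: int, m: int) -> List[List[int]]:
    """Split data into n pieces; any m of them are enough to rebuild it."""
    if not 1 <= m <= n < P:
        raise ValueError("need 1 <= m <= n < 257")
    padded = data + bytes(-len(data) % m)  # zero-pad to a multiple of m
    pieces: List[List[int]] = [[] for _ in range(n)]
    for offset in range(0, len(padded), m):
        coeffs = padded[offset:offset + m]  # m bytes = coefficients of one polynomial
        for i in range(n):
            x = i + 1  # piece i holds the polynomial evaluated at x = i + 1
            pieces[i].append(sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
    return pieces


def _invert(matrix: List[List[int]]) -> List[List[int]]:
    """Invert a square matrix modulo P using Gauss-Jordan elimination."""
    size = len(matrix)
    aug = [row[:] + [int(r == c) for c in range(size)] for r, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse, Python 3.8+
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * p) % P for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def recover(pieces: Dict[int, List[int]], m: int, length: int) -> bytes:
    """Rebuild the original bytes from any m pieces, keyed by piece index."""
    ids = sorted(pieces)[:m]
    if len(ids) < m:
        raise ValueError("not enough pieces")
    inverse = _invert([[pow(i + 1, k, P) for k in range(m)] for i in ids])
    out = bytearray()
    for pos in range(len(pieces[ids[0]])):
        values = [pieces[i][pos] for i in ids]
        for row in inverse:
            out.append(sum(a * v for a, v in zip(row, values)) % P)
    return bytes(out[:length])
```

Each piece is roughly 1/m of the original size, so n pieces cost about n/m times the storage. Anyone holding m pieces can rebuild the block, which is why encrypting the data before dispersal would still be worthwhile.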

Self Sizing Filesystem

One thing that would be great is to have the filesystem automatically grow when its size reaches a certain point. This is quite possible using Gmail accounts: all you have to do is register another account and add it to the file system.
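A small Python sketch of that growth rule, under stated assumptions: every account offers the same capacity, a new account is added once usage would pass a chosen fraction of the total, and the account names come from a list of accounts someone has already registered. `AccountPool` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class AccountPool:
    account_bytes: int              # assumed capacity of a single account
    grow_threshold: float = 0.8     # grow once usage would exceed 80% of capacity
    accounts: List[str] = field(default_factory=list)
    used_bytes: int = 0

    @property
    def capacity(self) -> int:
        return self.account_bytes * len(self.accounts)

    def needs_growth(self, incoming: int = 0) -> bool:
        """True if storing `incoming` more bytes would cross the threshold."""
        return self.used_bytes + incoming > self.capacity * self.grow_threshold

    def store(self, size: int, spare_accounts: Iterator[str]) -> None:
        """Account for `size` new bytes, pulling in spare accounts as needed."""
        while self.needs_growth(size):
            # Raises StopIteration when no spare account is left to add.
            self.accounts.append(next(spare_accounts))
        self.used_bytes += size
```

The sketch only does the bookkeeping; registering the extra accounts is still a manual step.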

No Limit on File Size

The one thing you don’t want is a file system in which no file can be larger than 20MB. Hence the implementation must accommodate files of any size.

Possible Implementation

With the idea in place, it now comes down to how you would go about implementing something like this.

With the advent of ZFS[9] under FreeBSD things suddenly became much more possible.

ZFS, FreeBSD & Geom

ZFS is a great filesystem / logical volume manager. It supports resizing of the file system on the fly, software based parity & double parity, read checksums, caching and compression. This makes it perfect for use as the frontend to the BGmailFS.

FreeBSD has had ZFS ported to it by pjd. It’ll be an experimental feature in FreeBSD 7.0 and hopefully a fully fledged FS in 7.1. ZFS on FreeBSD makes use of the Geom storage system.

Geom is an abstraction of the block devices available to the operating system. It consists of consumers and producers. A consumer makes use of a geom device; for example, a filesystem consumes a geom device. A producer provides the infrastructure for a consumer to write blocks.

Now under FreeBSD a zpool (the LVM side of ZFS) is both a geom producer and a geom consumer. It consumes disks and produces a storage pool.

Putting Things together

So the question becomes: how can we make use of ZFS to provide the file system front end, and somehow provide a backend which talks to Gmail? The diagram below gives some insight into how this could fit together.

[Diagram: bgmailfs]
The key behind the design is the use of geom to map a Gmail account to a geom producer. This can then be used with a ZFS zpool to let ZFS do most of the work.

The concept is simple. Someone implements:

  • The BGFSDeamon
  • The BGFSDriver
  • The BGFSCommand

These all work together to give the user a redundant file system.

BGFSDeamon

This daemon is responsible for mapping low level disk requests into the relevant Gmail account commands. For instance, when a write is done to a ‘disk’, the relevant block is treated as an email in a Gmail account. It can be written either via SMTP or, now that Gmail supports IMAP [10], via IMAP. For security, the block may be spread between multiple Gmail accounts using the IDA. Likewise, a read obtains all the relevant pieces of the data block from the known accounts and presents the disk block back to the BGFSDriver.
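To illustrate the block-to-email mapping, here is a minimal Python sketch that wraps one disk block in a message and unwraps it again, using only the standard `email` package. The subject layout (`bgfs <device> block <number>`), the attachment name and the function names are assumptions made for this example; nothing here talks to a mail server.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Tuple


def block_to_email(device: str, block_no: int, payload: bytes, address: str) -> EmailMessage:
    """Wrap one disk block in an email addressed to the backing account itself."""
    msg = EmailMessage()
    msg["From"] = address
    msg["To"] = address
    # Hypothetical subject convention so a block can be found again by searching.
    msg["Subject"] = f"bgfs {device} block {block_no}"
    msg.add_attachment(
        payload,
        maintype="application",
        subtype="octet-stream",
        filename=f"{block_no}.bin",
    )
    return msg


def email_to_block(raw: bytes) -> Tuple[str, int, bytes]:
    """Parse a stored message back into (device, block number, payload)."""
    msg = message_from_bytes(raw, policy=policy.default)
    _, device, _, block_no = msg["Subject"].split()
    attachment = next(msg.iter_attachments())
    return device, int(block_no), attachment.get_content()
```

A message built this way could then be handed to something like `smtplib.SMTP.send_message` or `imaplib.IMAP4.append` for the actual write; those calls are left out here because they need a live account.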

Communication between the driver and the daemon happens via ioctls, preferably creating an mmapped region that both can read for efficiency.

BGFSDriver

The driver is actually fairly simple. It works similarly to the fuse driver, but at a block level rather than a file system level. That is, every request the driver gets to read or write a block is passed via the device file back to the BGFSDeamon. The driver is really just a translation layer!

BGFSCommand

BGFSCommand is a command line tool that can be used for adding more space (i.e. more accounts), checking network communications or retiring accounts. It talks to the BGFSDeamon, which actually does the work. Consider it to be what apachectl is to apache.
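A possible shape for such a tool, sketched in Python with `argparse`. The subcommand names (`add`, `retire`, `check`) and the request dictionary are assumptions for illustration; how the request would actually reach the daemon is deliberately left open.

```python
import argparse
from typing import Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Command line with one subcommand per task described above."""
    parser = argparse.ArgumentParser(
        prog="bgfscommand",
        description="Control tool sketch: every action is forwarded to the daemon.",
    )
    sub = parser.add_subparsers(dest="action", required=True)
    add = sub.add_parser("add", help="add an account to grow the file system")
    add.add_argument("account")
    retire = sub.add_parser("retire", help="retire an account")
    retire.add_argument("account")
    sub.add_parser("check", help="check network communications")
    return parser


def to_request(argv: List[str]) -> Dict[str, Optional[str]]:
    """Turn command line arguments into the request the daemon would receive."""
    args = build_parser().parse_args(argv)
    # 'check' takes no account, so fall back to None for it.
    return {"action": args.action, "account": getattr(args, "account", None)}
```

For example, `to_request(["add", "someone@example.com"])` produces a request to add that account, while the tool itself holds no state, much like a control script that only relays commands.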

Afterthoughts / Conclusion

It’s ironic that, after writing this page, there are many other possibilities for BGmailFS. With most of the work done in the daemon, it’s entirely plausible that this same system could work with Hotmail, Yahoo or, for that matter, a separate daemon that listens on another machine somewhere. This may make even more sense, as many people now run servers 24×7 behind their broadband connection. Many of them have the same problem of how to back up their data securely.

Perhaps they could run one daemon, one kernel module and then it just works!

References

  1. http://google.com
  2. http://www.gmail.com
  3. http://mail.google.com/support/bin/answer.py?answer=6589
  4. http://richard.jones.name/google-hacks/gmail-filesystem/gmail-filesystem.html
  5. http://fuse.sourceforge.net/
  6. http://en.wikipedia.org/wiki/Don’t_Be_Evil
  7. http://www.answers.com/topic/information-dispersal-algorithm
  8. http://bryanmills.net/archives/2007/09/information-dispersal-algorithms/
  9. http://www.opensolaris.org/os/community/zfs/
  10. http://mail.google.com/support/bin/topic.py?topic=12760



  • Alexis Valencia said,

    Very nice, very elegant, if only the TOS didn’t get in the way…
