Set up your own distributed, redundant, and encrypted storage grid with Tahoe-LAFS
Published: 08-11-2012 | Author: Sven Slootweg
Note: this guide was written by Sven Slootweg, AKA joepie91, and is released by him under the WTFPL.
If you have a few different VPSes, you'll most likely have a significant amount of unused storage space across all of them. This guide will be a quick introduction to setting up and using Tahoe-LAFS, a distributed, redundant, and encrypted storage system - some may call it 'cloud storage'.
- At least 2 VPSes required, at least 3 VPSes recommended. More is better.
- Each VPS should have at least 256MB RAM (for OpenVZ burstable), or 128MB RAM (for OpenVZ vSwap and other virtualization technologies with proper memory accounting).
- Reading comprehension and an hour of your time or so :)
From the Tahoe-LAFS website:
Tahoe-LAFS is a Free and Open cloud storage system. It distributes your data across multiple servers. Even if some of the servers fail or are taken over by an attacker, the entire filesystem continues to function correctly, including preservation of your privacy and security.
The short version: Tahoe-LAFS uses a RAID-like mechanism to store 'shares' (parts of a file) across the storage grid, according to the settings you specified. When a file is retrieved, all storage servers are asked for shares of this file, and the ones that respond fastest are used to retrieve the data. The requesting client then reconstructs the original file from those shares.
All shares are encrypted and checksummed; storage servers cannot read the contents of a share or the file it derives from, and cannot modify them without the change being detected.
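To make the 'RAID-like' idea a bit more concrete, here is a toy sketch of a 2-of-3 scheme built on XOR parity. This is not what Tahoe-LAFS does internally (it uses a more general erasure code and also encrypts and checksums the shares, which is left out here); the function names are made up for illustration. The point is only to show how any 2 out of 3 shares can be enough to rebuild a file.

```python
from typing import Dict, List


def xor_bytes(left: bytes, right: bytes) -> bytes:
    """XOR two equally long byte strings together."""
    return bytes(x ^ y for x, y in zip(left, right))


def make_shares(data: bytes) -> List[bytes]:
    """Split data into two halves plus one XOR parity share (toy 2-of-3)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length; the caller keeps the real length
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return [first, second, xor_bytes(first, second)]


def recover(shares: Dict[int, bytes], original_length: int) -> bytes:
    """Rebuild the data from any two shares, keyed by share number (0, 1 or 2)."""
    if len(shares) < 2:
        raise ValueError("need at least 2 of the 3 shares")
    if 0 in shares and 1 in shares:
        first, second = shares[0], shares[1]
    elif 0 in shares:
        first = shares[0]
        second = xor_bytes(first, shares[2])  # second half = first XOR parity
    else:
        second = shares[1]
        first = xor_bytes(second, shares[2])  # first half = second XOR parity
    return (first + second)[:original_length]


data = b"hello tahoe"
shares = make_shares(data)
# Pretend the server holding share 0 has disappeared.
print(recover({1: shares[1], 2: shares[2]}, len(data)))
```

In this toy setup, shares.needed would be 2 and shares.total would be 3.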
There are (roughly) two types of files: immutable (these cannot be changed afterwards) and mutable (these can be changed). Immutable files will result in a "read capability" (an encoded string that tells Tahoe-LAFS how to find it and how to decrypt it) and a "verify capability" (that can be used for verifying or repairing the file). A mutable file will also yield a "write capability" that can be used to modify the file. This way, it is possible to have a mutable file, but restrict the write capability to yourself, while sharing the read capability with others.
There is also a pseudo-filesystem with directories; while it isn't required to use this, it makes it possible to, for example, mount part of a Tahoe-LAFS filesystem via FUSE.
For more specifics, read this documentation entry.
1. Install dependencies
Follow the below instructions for all VPSes.
To install and run Tahoe-LAFS, you will need Python (with development files), setuptools, and the usual tools for compiling software. On Debian, this can be installed by running
apt-get install python python-dev python-setuptools build-essential
If you use a different distro, your package manager or package names may differ.
Python setuptools comes with a Python package manager (or installer, rather) named easy_install. We'd rather have pip as our Python package manager, so we'll install that instead:
After installing pip, we'll install the last dependency we need to install manually (pip install twisted), and then we can install Tahoe-LAFS itself: pip install allmydata-tahoe.
When you're done installing all of the above, you'll have to make a new user (adduser tahoe) that you're going to use to run Tahoe-LAFS under. From this point on, run all commands as the tahoe user.
2. Setting up an introducer
First of all, you'll need an 'introducer' - this is basically the central server that all other nodes connect to, to be made aware of other nodes in the storage grid. While the storage grid will continue to function if the introducer goes down, no new nodes will be discovered, and there will be no reconnections to nodes that went down until the introducer is back up.
Preferably, this introducer should be installed on a server that is not a storage node, but it's possible to run an introducer and a storage node alongside each other.
Run the following on the VPS you wish to use as an introducer, as the tahoe user:
tahoe create-introducer ~/.tahoe-introducer
tahoe start ~/.tahoe-introducer
Your introducer should now be started successfully. Read out the file ~/.tahoe-introducer/introducer.furl and note the entire contents down somewhere. You will need this later to connect the other nodes.
3. Setting up storage nodes
Now it's time to set up the actual storage nodes. This will involve a little more configuration than the introducer node. On each storage node, run the following command:
tahoe create-node
If all went well, a storage node should now be created. Now edit ~/.tahoe/tahoe.cfg in your editor of choice. I will explain all the important configuration values - you can leave the rest of the values unchanged. Note that the 'shares' settings all apply to uploads from that particular server - each machine connected to the network can pick its own encoding settings.
- nickname : The name for this particular storage node, as it will appear in the web panel.
- introducer.furl : The FURL for the introducer node - this is the address that you noted down before.
- shares.needed : The number of shares that will be needed to reconstruct a file.
- shares.happy : The number of different servers that have to be available for storing shares, for an upload to succeed.
- shares.total : The total number of shares that should be created on upload. One storage node may hold more than one share, as long as it doesn't violate the shares.happy setting.
- reserved_space : The amount of space that should be reserved for other applications on this server.
Tahoe-LAFS has a somewhat interesting way of counting space - instead of keeping track of how much space it can use for itself, it will try to make sure that a certain amount of space is available for other applications. In practice, this means that if another application fills up 1GB of disk space, this 1GB will be subtracted from the amount of space that Tahoe-LAFS can use, not from the amount of space that it can't use. The end result is that Tahoe-LAFS is very conservative in the way it uses disk space. This means that you can typically set the amount of reserved space to a very low value like 1GB to 5GB, because by the time you hit that amount of free space, you will still have plenty of time to clean up your VPS before the last gigabytes are used up by other applications.
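To see how the options above fit together, here is a small Python sketch that renders what the relevant parts of tahoe.cfg might look like. The section names ([node], [client], [storage]) are an assumption based on the usual layout of a generated tahoe.cfg, so check them against your own file. The nickname, the FURL placeholder, and the numbers are made-up example values, and the function name is hypothetical. The sketch only prints text; it does not touch your real configuration.

```python
import configparser
import io


def render_tahoe_cfg(nickname, introducer_furl, needed, happy, total, reserved_space):
    """Render the settings discussed above as INI text (illustrative only)."""
    # interpolation=None so that a '%' in any value is written literally.
    config = configparser.ConfigParser(interpolation=None)
    config["node"] = {"nickname": nickname}
    config["client"] = {
        "introducer.furl": introducer_furl,
        "shares.needed": str(needed),
        "shares.happy": str(happy),
        "shares.total": str(total),
    }
    # Assumed section for storage-related settings.
    config["storage"] = {"enabled": "true", "reserved_space": reserved_space}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


example = render_tahoe_cfg(
    nickname="storage-node-1",  # example name, shown in the web panel
    introducer_furl="pb://<paste the contents of introducer.furl here>",
    needed=4,
    happy=6,
    total=8,
    reserved_space="1G",
)
print(example)
```

Compare the printed result with the tahoe.cfg that was generated for you, and copy over only the values you actually want to change.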
At first, share settings may seem very tricky to configure correctly. My advice would be to set them as follows:
- shares.total : about 80% of the number of servers you have available.
- shares.happy : 2 lower than shares.total
- shares.needed : half of shares.total
This means that if you have, for example, 10 storage servers: shares.total = 8, shares.happy = 6, shares.needed = 4.
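The rule of thumb above can be written down as a small helper. This is just a sketch of the advice in this guide, not something Tahoe-LAFS provides; the function name is made up, and the rounding and the lower bound of 1 are assumptions for very small grids.

```python
def suggested_share_settings(server_count):
    """Apply the rule of thumb: total = ~80% of servers, happy = total - 2, needed = total / 2."""
    total = max(1, round(server_count * 0.8))  # rounding is an assumption
    happy = max(1, total - 2)                  # never go below 1
    needed = max(1, total // 2)                # half of total, rounded down
    return {
        "shares.total": total,
        "shares.happy": happy,
        "shares.needed": needed,
    }


print(suggested_share_settings(10))
```

For 10 servers this gives the same 8 / 6 / 4 split as the example above. For very small grids the result may not be sensible, so treat it as a starting point only.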
You can't just set arbitrary values here, though - your share settings will influence the 'expansion factor': how many times more space you use than the file would take up on its own. You can calculate the expansion factor by doing shares.total / shares.needed - for example, with the above suggested setup the expansion factor would be 2, meaning that a 100MB file would take up 200MB of space.
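As a quick worked example of that arithmetic, here is a minimal sketch; the function names are made up for illustration.

```python
def expansion_factor(shares_total, shares_needed):
    """How many times more space a file uses on the grid than on its own."""
    return shares_total / shares_needed


def space_used(file_size, shares_total, shares_needed):
    """Space taken up on the grid, in the same unit as file_size."""
    return file_size * expansion_factor(shares_total, shares_needed)


print(expansion_factor(8, 4))   # suggested settings for 10 servers
print(space_used(100, 8, 4))    # a 100MB file
```

With shares.total = 8 and shares.needed = 4, this prints 2.0 and 200.0, matching the numbers in the text.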
The level of redundancy can be calculated quite easily as well: the number of servers you can lose while being guaranteed to still have access to your data is shares.happy - shares.needed (this assumes the worst case scenario). In most cases, the amount of servers you can lose will be
4. Starting your storage nodes
On each node, simply run the command tahoe start as the tahoe user, and you should be in business!
5. (optional) Install a local client
To more easily use Tahoe-LAFS, you may want to install a Tahoe-LAFS client on your local machine. To do this, you should basically follow the instructions in step 3 - however, instead of running tahoe create-node, you should run tahoe create-client. Configuring and starting works the same, but you don't need to fill in the reserved_space option (as you're not storing files).
Using your new storage grid
There are several ways to use your storage grid:
Via the web interface
Simply make sure you have a client (or storage node) installed, and point your browser at - you will see the web interface for Tahoe-LAFS, which will allow you to use it. The "More info" link on a directory page (or for a file) will give you the read, write, and verify capability URIs that you need to work with them using other methods.
I recently started working on a Python module named pytahoe, that you can use to easily interface with Tahoe-LAFS from a Python application or shell. To install it, simply run pip install pytahoe as root - you'll need to make sure that you have libfuse/libfuse2 installed. There is no real documentation for now other than in the code itself, but the below code gives you an idea of how it works:
>>> import pytahoe
>>> fs = pytahoe.Filesystem()
>>> d = fs.Directory("URI:DIR2:hnncfsbzsxv5fhdymxhycm3xc4:qjipiqg3bozb5evb6krdwfmsgks6j4ymivopgx7eoxcjb3avslqq")
>>> d.upload("devilskitchen.tar.gz")
The result of this is something like this.
Mounting a directory
You can also mount a directory as a local filesystem using FUSE (on OpenVZ, make sure your host supports FUSE). Right now, the easiest way appears to be using pytahoe (this can be done from a Python shell as well). Example:
>>> import pytahoe
>>> fs = pytahoe.Filesystem()
>>> d = fs.Directory("URI:DIR2:hnncfsbzsxv5fhdymxhycm3xc4:qjipiqg3bozb5evb6krdwfmsgks6j4ymivopgx7eoxcjb3avslqq")
>>> d.mount("/mnt/something")
Via the web API
If you're using something that is not Python, or want a bit more control over what you do, you may want to use the Tahoe-LAFS WebAPI directly - documentation for this can be found here.
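As a rough idea of what talking to the WebAPI looks like, here is a minimal sketch using only the Python standard library. It assumes that your local node's web interface listens on http://127.0.0.1:3456 (adjust BASE_URL to whatever your node uses), and it relies on two WebAPI operations: PUT /uri to upload an immutable file (the response body is its capability) and GET /uri/<cap> to download it. The helper names are made up for this example; check the WebAPI documentation before relying on the details.

```python
import urllib.request
from urllib.parse import quote

# Assumption: address of the local node's web interface.
BASE_URL = "http://127.0.0.1:3456"


def upload_request(data, base_url=BASE_URL):
    """Build the PUT request that uploads data as a new immutable file."""
    return urllib.request.Request(base_url + "/uri", data=data, method="PUT")


def download_url(cap, base_url=BASE_URL):
    """Build the URL for fetching a file by capability (URL-encoded)."""
    return base_url + "/uri/" + quote(cap, safe="")


def upload(data, base_url=BASE_URL):
    """Upload bytes and return the capability string from the response body."""
    with urllib.request.urlopen(upload_request(data, base_url)) as response:
        return response.read().decode("ascii").strip()


def download(cap, base_url=BASE_URL):
    """Download a file's contents by its read capability."""
    with urllib.request.urlopen(download_url(cap, base_url)) as response:
        return response.read()


# Usage (needs a running node, so it is left commented out):
# cap = upload(b"hello grid")
# print(cap)
# print(download(cap))
```

Keeping the request building separate from the network calls makes it easy to inspect what would be sent before you point it at a real node.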
HalfEatenPie November 8:
Out of curiosity @joepie91, what if one of the servers suddenly just "disappear" from the network? What happens to the files?
joepie91 November 8:
This doesn't really matter; if you have set up your share settings as I advised above, for example, you can usually lose half the servers before it becomes a problem. It's usually worth repairing (via a deep check) now and then if you often lose nodes, because this will redistribute shares over new nodes to meet the original settings again.
From a practical viewpoint, I've had many (and I mean MANY) nodes disappear from my storage grid over time, and barely ever had an issue with it. If you get to the point where you have 20 shares spread over 20 nodes and you only need 10 to reconstruct the file... your storage grid is pretty much practically invincible. Just be sure to do a deep check now and then :)
rm_ November 8:
okay assuming I have 10 nodes with 10 GB of space each, with your recommended settings: - how many of those 10 can disappear with data still intact? - what is the amount of usable space out of the raw 10x10GB capacity?
joepie91 November 8:
- how many of those 10 can disappear with data still intact?
Total shares would be 8, happy would be 6, and needed would be 4 - this means you can lose 6 - 4 = 2 servers (worst case scenario) without losing access to your data. It's likely possible to lose 3 or 4 servers (this depends on whether the servers you are losing hold 1 or more shares). In this, with "losing" servers I only mean the (max.) 8 servers that you uploaded a share to, to start with. Since your total amount of servers is 10, you could lose 2 more servers without any issues if those servers happen to not hold any shares for this file.
Summary: worst case scenario, you can lose any 2 servers. Best case scenario, you can lose 6 servers. It'll usually be somewhere in the middle.
- what is the amount of usable space out of the raw 10x10GB capacity?
Since your expansion factor is 8 / 4 = 2, and every storage server has an equal amount of space available, you should be able to use 100 / 2 = 50GB of practical space.
pubcrawler November 8:
How much space are you combining in nodes, and doing so all over the internet?
joepie91 November 8:
iqj5wkzuo2x3tdcjhauzsafpe5gwcojq  [name removed] CA    13.41GB
a2bjjtujmabiwfqungzlywzyjszm2gyp  [name removed]      265.96GB
fzu6dmqq23u2km6ywtlym4tvmtefn25b  Box                   3.35GB
oywsltqtxm6su6gu54j6bxmgh5qf6o5r  Git                   4.29GB
mbbs6staiw56f7dtyxxnzecixjoz2m2r  Haless               44.04GB
n3fhesvxzg5mpq3gsov76lf2sdwfwo45  Konjassiem            9.16GB
z3hc2nw2g2jjhb7vntt5z3mtdcebiho6  Arvel                 7.14GB
cqq4hmk7flrfwmlt6mldulfrc4swdrhl  Eris                 26.86GB
akd5kzq4bsmdr6yeyltaro3t2rtap5xo  [name removed]      600.95GB
u5ygxnwa25ztku4qpubsjjahlp2pl5bp  Discordia            11.01GB
sxbcue26orebknqpzchx5yl63ywep66n  Alba                 69.10GB
s72mw7cw3ojzki5wz7qxhxs2eex4ethf  CVM-VZ               54.00GB
6ck5rd7g46o6kx2wxcym3ku3obwv645d  [name removed]       26.60GB
hepqdbu7mohz6jg4uzozouotapfm74pk  [name removed] US    11.37GB
qenkbcotohq4c4vhsfmzjmixqhj7ohww  Shi                   4.45GB
mhelfzivcdzjisxrlwkxo3rnmp5bef3m  Basket               43.67GB
jxba3idp4epcvfughxsni5c7pprgrxkw  Aarnist              33.83GB
5yunndzcq7a2bqvlyqjj6kxedgiymhtt  [name removed] ZNC   13.46GB
y3hgi5fi3qdnoamemuj5qpfrnmopy5ra  equinox               5.03GB
jyq6lzjwff3a7ijae54y3zfg2mcv2ykr  Nijaxor              48.43GB
pu5m53joaxfdc5zwbcvzu3gv65v3wab3  Sabit                17.66GB
Total free storage space: 1313.78GB
The nodes are distributed geographically fairly evenly.
The 600.95GB node is a bit lost, because it's connected to the old introducer address (which no longer exists), so I can't use that space right now. I'm having some issues tracking down the owner :)
pubcrawler November 8:
Fascinating post with the storage amounts. So Tahoe doesn't care that nodes have different storage amounts available? No sort of disclaimer or worry or best case against such?
joepie91 November 8:
No, the actual amount of storage space that you have available doesn't really matter. The only caveat is that you won't be able to use up all of it in all situations - say that you, for example, have total/happy shares set to 10, but only 2 servers offer more than 30GB of space, then your ceiling for storing files will be at about 30GB - after all, at some point, you simply only have 2 servers left that have more space to store files, and that wouldn't satisfy shares.happy.
craigb November 8:
also, isn't it the case that by default, nodes closest in latency terms get filled up faster on average?
joepie91 November 8:
No. Nodes are, as far as I am aware, only chosen by latency when downloading. Uploading will happen with deterministic randomness - as it should, because if the storage servers were picked on basis of latency, it would create a single (geographical) point of failure.
That being said, if you're planning on for example building a CDN with Tahoe-LAFS as backend, you'll probably want to make sure that you either have an expansion factor of at least 3, or heavy caching, so that it's likely that data can be retrieved entirely from the same geographical area as the request originates from :)
Need more help?
There's plenty more (very clear) documentation on the Tahoe-LAFS website! :)
EDIT: For those interested in copying this guide - it's released under the WTFPL, meaning you can basically do with it whatever you want, including copying it elsewhere. Credits or a donation are both appreciated, but neither is required :)