Managing large datasets on the decentralized web is a largely unsolved problem, without many established precedents or best practices.
Two large-scale IPFS deployments are relevant:
the IPFS Gateway, maintained by Protocol Labs at https://gateway.ipfs.io/ipfs/…
Textile, a decentralized photo manager
Interface & Goals
As far as the Underlay is concerned, IPFS is a storage and distribution layer exposing put: (file) => hash and get: (hash) => file interfaces, so our goals in designing an IPFS deployment look like the goals of any data store:
Durability. We really cannot afford to lose data and we want as strong guarantees as possible that it will not happen.
Availability. Separately from persisting data, we want to serve files to people who ask for them.
Availability is a little nuanced in the IPFS world: IPFS clients (nodes) that request files by their hash get to leverage all sorts of peer-to-peer magic to assemble their files BitTorrent-style from whoever happens to have them. In the ideal case, we pin all of our files on a “server of last resort” that nodes eventually fall back to when nobody closer to them has the file they want, but they usually get what they need from a nearer peer.
But that ideal future isn’t quite here yet - most browsers don’t know how to resolve ipfs:// URIs, and many use cases will still end up interfacing over HTTP. Protocol Labs maintains a public gateway - you can access any file ipfs://Qmfoo… via the gateway at https://gateway.ipfs.io/ipfs/Qmfoo… - but we probably want to maintain our own gateway so that we don’t rely on their site. Plus, our own gateway would be able to serve files it has pinned locally without hitting the IPFS network at all.
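To make that concrete, here’s a rough sketch of the client-side view of HTTP availability: try our own gateway first and fall back to the public one. The gateway.underlay.example hostname is hypothetical, and gateways serve content at paths of the form /ipfs/<hash>.

```typescript
// Sketch: resolve a hash over plain HTTP via a chain of gateways.
const GATEWAYS = [
  "https://gateway.underlay.example", // hypothetical: our own gateway, serving locally pinned files
  "https://gateway.ipfs.io",          // Protocol Labs' public gateway, as a fallback
];

async function fetchFromGateway(hash: string): Promise<Uint8Array> {
  for (const base of GATEWAYS) {
    try {
      const res = await fetch(`${base}/ipfs/${hash}`);
      if (res.ok) return new Uint8Array(await res.arrayBuffer());
    } catch {
      // network error: try the next gateway
    }
  }
  throw new Error(`no gateway could resolve ${hash}`);
}
```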
The bottom line is that we need to provide durability by running one or more IPFS nodes that pin our files, and satisfy availability by running one or more IPFS gateways that serve them - but these two services are largely orthogonal. We can replicate our storage across many nodes but serve from a single gateway, set up multiple gateways backed by a single storage node, or anything in between.
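Put together, the put / get interface from the top of this section (plus the pin call that gives us durability) might look something like this against a single node’s HTTP API. The /api/v0/add, /api/v0/cat, and /api/v0/pin/add endpoints and the default API port 5001 are go-ipfs defaults; everything else here is an illustrative sketch rather than production code.

```typescript
// Minimal put/get/pin wrappers around one IPFS node's HTTP API (default port 5001).
const IPFS_API = "http://127.0.0.1:5001/api/v0";

// put: (file) => hash
async function put(file: Uint8Array): Promise<string> {
  const body = new FormData();
  body.append("file", new Blob([file]));
  const res = await fetch(`${IPFS_API}/add`, { method: "POST", body });
  const { Hash } = await res.json(); // e.g. "Qm…"
  return Hash;
}

// get: (hash) => file
async function get(hash: string): Promise<Uint8Array> {
  const res = await fetch(`${IPFS_API}/cat?arg=${hash}`, { method: "POST" });
  return new Uint8Array(await res.arrayBuffer());
}

// pin: tell the node to keep this content around (our "server of last resort")
async function pin(hash: string): Promise<void> {
  await fetch(`${IPFS_API}/pin/add?arg=${hash}`, { method: "POST" });
}
```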
IPFS Cluster
IPFS Cluster is a separate Protocol Labs project that provides “pinset orchestration for IPFS nodes”. When you have a fleet of several IPFS nodes that you want collectively pinning some large set of files, you can run a separate ipfs-cluster daemon alongside each of your IPFS nodes. IPFS Cluster uses Raft to maintain “pinset consensus” and distributes your files between the nodes in the cluster with a replication factor of your choice.
(IPFS Cluster is also cool because of the way large files are stored in IPFS. The lowest-level data objects are “blocks”, which are at most 4 megabytes each. Any file larger than that is sharded into a hash array mapped trie, which lets a cluster of nodes pin a file larger than any one node’s storage capacity.)
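As a sketch of what “pinset orchestration” looks like from the outside: instead of pinning on one node, we ask the cluster to pin, and it allocates the pin across peers according to the configured replication factor. The default REST API port (9094) and the /pins/{cid} endpoints below should be double-checked against the cluster version in use.

```typescript
// Sketch: asking the cluster (rather than a single node) to pin a hash.
// Port 9094 and the /pins/{cid} endpoints are assumed defaults; verify them
// against your ipfs-cluster version.
const CLUSTER_API = "http://127.0.0.1:9094";

// Pin cluster-wide; the cluster allocates the pin to peers according to its
// configured replication factor.
async function clusterPin(hash: string): Promise<void> {
  const res = await fetch(`${CLUSTER_API}/pins/${hash}`, { method: "POST" });
  if (!res.ok) throw new Error(`cluster pin failed: ${res.status}`);
}

// Report which peers have (or are still fetching) the pin.
async function clusterStatus(hash: string): Promise<unknown> {
  const res = await fetch(`${CLUSTER_API}/pins/${hash}`);
  return res.json();
}
```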
Textile
Textile is an open-source decentralized photo manager built on IPFS, and their team has published several well-written reports of their experience setting up a production-grade IPFS deployment.
AWS Services Used
ECS
EC2
Elastic Load Balancer
Scaling
A tricky part of working with IPFS Cluster is starting the cluster peers such that they successfully enter consensus with each other. There are two ways of starting a cluster:
Fixing a set of cluster peers by listing every peer’s address (DNS or IP) in a peerstore file before starting the cluster daemon, and then launching all peers at the same time. The peers will jointly start a new cluster together.
Starting the daemon with a --bootstrap <address> flag pointing to any existing stable peer. The peer will bootstrap itself into the existing cluster.
Starting the daemon without either of these options will start a new cluster with just the one peer.
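For concreteness, here’s a sketch of both startup paths driven from a small Node script. The peer multiaddresses are hypothetical placeholders, and the ~/.ipfs-cluster/peerstore location and the ipfs-cluster-service binary name are assumptions worth verifying against the cluster version in use.

```typescript
// Sketch of the two cluster startup paths. Addresses are hypothetical
// placeholders; the peerstore location and binary name are assumptions.
import { writeFileSync } from "node:fs";
import { spawn } from "node:child_process";
import { homedir } from "node:os";
import { join } from "node:path";

const peers = [
  "/dns4/cluster-0.example.com/tcp/9096/p2p/<peer-id-0>", // hypothetical
  "/dns4/cluster-1.example.com/tcp/9096/p2p/<peer-id-1>", // hypothetical
];

// Option 1: write every peer's multiaddress (one per line) into the peerstore
// file, then launch all of the daemons at the same time so they jointly form
// a new cluster.
writeFileSync(join(homedir(), ".ipfs-cluster", "peerstore"), peers.join("\n") + "\n");
spawn("ipfs-cluster-service", ["daemon"], { stdio: "inherit" });

// Option 2: bootstrap a new peer into an existing cluster by pointing it at
// any stable peer:
// spawn("ipfs-cluster-service", ["daemon", "--bootstrap", peers[0]], { stdio: "inherit" });
```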