Skip to main content

Underlay Architecture

Published onApr 25, 2019
Underlay Architecture


The Underlay is a peer-to-peer network of semantic databases united by a universal query interface. These databases compose a decentralized knowledge graph of public structured data that can be freely consumed and replicated by humans and machines alike.

Two principles framing this design are autonomy and flexibility: virtually no components are “required”, and nodes are delegated maximal freedom to act as they see fit. This is done for simplicity, to avoid making decisions in spaces that we don't yet understand, and is justified by the good-faith nature of the nodes we anticipate the network comprising - largely public institutions like universities and libraries that are already invested in supporting a better knowledge infrastructure. The system will be more inclusive and more sustainable if these institutions are able to support exactly those services that align with their interests.

The purpose of the Underlay is not to prescribe the operation of specific services, but rather to define common interfaces at useful points of coordination.


The Underlay is built on top of IPFS, a peer-to-peer hypermedia protocol. IPFS uses a distributed hash table to create an immutable distributed filesystem where files are addressed by their hash. A node “adding a file to IPFS” doesn’t force other nodes to store the file, but rather uses the distributed hash table to associate the file’s hash with that node’s network addresses. This lets nodes request files by their hash from the network at large without prior knowledge of the file’s location.

IPFS itself is built on the libp2p networking stack, which provides abstract interfaces for a variety of interoperable network components such as “Transports” and “Connections”. Applications using libp2p register handlers for the custom protocols that they define, which are named by versioned strings. Libp2p has seen increasingly widespread adoption in decentralized projects like Filecoin, Substrate, Polkadot, and Ethereum 2.0.

An Underlay node is an IPFS node extended with a custom libp2p protocol. This means that nodes implement at least /mplex for libp2p, /ipfs/id and /ipfs/bitswap for IPFS, and /ul/0.1.1/cbor-ld for the Underlay.

Underlay nodes also maintain an abstract integral dataset containing a select subset of the messages they send and receive. Integrating a message involves pinning the normalized message to IPFS and adding its contents into the local integral dataset (however implemented). Nodes can also disintegrate messages by unpinning them from IPFS and removing them from the dataset.



Nodes send and receive messages over a custom protocol /ul/0.1.1/cbor-ld, which will be incremented according to semantic versioning as it evolves.

Messages are expressed as RDF Datasets. For an introduction to RDF, review the W3C RDF 1.1 Primer. Datasets, graphs, triples, blank nodes, and grounding are referenced here extensively.

The last path element of the libp2p protocol string identifies an RDF dataset serialization format. In the initial versions of the protocol, all messages are serialized as CBOR-encoded JSON-LD, following section 4 of RFC 7049. “CBOR-LD” is only prescribed as an ephemeral, transmission-only serialization. Nodes may choose to persist the RDF content of messages in any format they choose.

Nodes are encouraged to use JSON-LD contexts to compact their messages, but must not reference remote contexts unless the remote context URI carries in-band cryptographic resource integrity (e.g. IPFS/IPLD URIs or Hashlinks). This is a serious security concern - domain names may expire or otherwise get taken over by malicious parties that should not have the ability to redefine the semantic content of past statements.

Other serialization formats may be supported for messages in the future.


The receipt of a message does not require a response. Instead, each node implements its own routing and logic to determine what messages to send to which other nodes, if any. The node may consider the content of each message, local state accumulated over messages, or anything else at its disposal. Nodes may send messages that were not “triggered” by other messages at all.

This is similar to the actor model or the original conception of object-oriented programming as implemented by Smalltalk. Messages are only sent - not even acknowledged. Nodes act as black boxes with private internal state that may spontaneously send anything to anyone.

In combination with the view of queries as computations that the network performs, the Underlay can be seen as a distributed semantic computer or knowledge engine.


Messages are RDF Datasets and their interpretation is governed by a semantic extension of RDF referred to here as UnderlaySemantics. Its entailment regime will be formally described in a later document, but a few key properties are highlighted here:

  • The graph name of a named graph denotes that named graph, as described in section 3.3 of the W3C working group note on the semantics of RDF Datasets.

  • Only blank graph names are valid. Named graphs with IRI graph names have no meaning.

  • Underlay nodes, messages, graphs, queries, and triples imply their corresponding rdf:type and prov:hadMember relations.

All messages are interpreted in this regime. Nodes may internally follow monotonic extensions to this regime, but those extensions must not negate the existing rules of entailment under UnderlaySemantics.

The content of messages are not considered facts themselves; they’re treated as assertions of truth made in an open-world framework.


Supporting provenance and reification in the Underlay involves referring to nodes and messages, and the graphs, blank nodes, and triples within messages.

It’s important for these URIs to be one-to-one: given isomorphic messages in different serializations, nodes should derive identical URIs for the corresponding messages and their graphs, blank nodes, and triples. An identifier scheme that is not one-to-one does not truly identify the dataset as an abstract syntactic construct; it introduces redundancy by identifying a specific serialization.

Any scheme for deriving one-to-one identifiers for graphs is guaranteed to be NP-complete, since it can be used to decide graph isomorphism. Paying for this “up-front” means that identifiers, once they exist, can be used for constant-time equality checks (or “containment” checks of nodes within graphs or graphs within messages). A one-to-many scheme would mean pushing this complexity onto users, where isomorphism would be redundantly re-computed many times in a world where any two identifiers might identify the same thing.

In the initial versions of the Underlay, these identifiers use the URDNA2015 RDF Dataset Normalization algorithm published by the W3C Credentials Community Group. The algorithm produces a normalized N-Quads serialization of an RDF dataset as a UTF-8 string ending in a newline character.



The canonical URI for a message is ul:/ipfs/{CIDv0}, where {CIDv0} is the base-58 CID v1 of the UnixFSv1 file containing its URDNA2015 string. In practice, it just means copying the output of ipfs add and pasting it after ul:/ipfs/.

Nodes are encouraged to pin relevant messages to IPFS. Doing so ensures that other nodes are able to retrieve the full content of a message given only its URI.

Messages have an implied rdf:type of ul:message.

Blank nodes


Messages in the Underlay will invariably contain blank nodes, and nodes may wish to reference them in subsequent messages. Grounding blank nodes has historically been a tricky undertaking, but the canonical blank node labels that the URDNA2015 algorithm generates for each blank node in the dataset during normalization suggest a simple approach using fragment identifiers.

In the Underlay, canonical message URIs with fragment identifiers that label a blank node in the normalize dataset are considered semantically equivalent to the blank node in the dataset with that node label.

Named graphs


Since messages can only contain named graphs with blank graph names, and since the semantics of messages dictate that the graph name denotes the named graph, the same approach to identifying blank nodes can be re-used for identifying named graphs.

Each named graph will be assigned a canonical blank graph name during normalization; canonical message URIs with fragment identifiers that label a blank graph name in the normalized dataset are considered semantically equivalent to the named graph in the dataset with that graph name.

Named graphs have an implied rdf:type of ul:Graph, unless they are explicitly typed to have an rdf:type of ul:Query (in which case they are not of type ul:Graph). More types beyond ul:Query may be added in the future.

Default graphs


The default graph of a message, as distinct from the message itself, is identified by an empty fragment identifier (i.e. just # appended to the canonical message URI).

Default graphs always have an implied rdf:type of ul:Graph.



Nodes may also wish to reference individual triples from previous messages. Since the normalization algorithm produces a canonical ordering of all quads in the dataset, they can be unambiguously identified by their index within the canonical N-Quads serialization.

Quad URIs extend the message URI with a fragment identifier /{index}, where {index} is the (base-10) index of the triple in the normalized dataset. They are distinguished from blank node and named graph URIs by their leading slash; all blank node labels and blank graph names begin with _:.

Quads have an implied rdf:type of rdf:Statement.



IPFS and libp2p nodes are identified by their “PeerId”, which is the the sha2-256 multihash of the libp2p node’s public key. PeerIds are typically displayed as base-58 strings.

This identity scheme allows nodes to change network addresses, physical locations, or even institutional affiliation, since identity is governed by cryptographic key ownership. A node’s PeerId can be resolved to its advertised network addresses with a distributed hash table.

In RDF contexts, a node is referenced by ul:/node/{PeerId}, where {PeerId} is its base-58 PeerId. Note that node is distinct from the ipfs root used in the other URI formats.

This may conveniently double as a dereferenceable URI over IPFS or a public gateway (e.g.{PeerId}) if the node chooses to publish an IPNS record with its key pair. However it is only used in the Underlay as an identifier.

Nodes have an implied rdf:type of ul:Node.

Named graphs

There is no distinction between the semantics of named ul:Graph graphs and the default graph in messages, especially since even the default graph of a message becomes a named graph once integrated by nodes. Named graphs are treated as a means of reification within messages - allowing for statements about graphs as collections of statements themselves. Later messages can achieve the same semantic result (making statements about graphs) by using the message URI with a fragment of the canonized graph name.

This method was chosen over of other reification methods such as RDF* on the strength of a few advantages:

  • Concision: a content-addressed URI is significantly more concise for large graphs than the overhead of replicating the entire graph as reified rdf:Statement objects. This concision is achieved without loss of access to the graph if desired so long as the source assertion is persisted by any node on the network.

  • Vertical scoping: letting authors pick which statements to organize into which named graphs allows for space to express the natural grouping of statements that frequently exists in the real world.

  • Lateral scoping: this content-addressing scheme names graphs relative to their containing dataset, which means authors have control over what other data is permanently associated with any given set of statements. This is good for tentative, controversial or even incorrect statements that a node may want to publish, but only if it is certain that the data will always be accompanied by some qualifying provenance.


Queries are an extremely general and powerful primitive in the Underlay. They can be used to select and retrieve data, like traditional relational queries. They can be used to perform computation, à la Prolog. They are also used for meta-level API-style calls within the network by sending nodes queries about themselves.

Syntactically, queries are RDF graphs (and are always named graphs within the dataset), but they have different semantics from the other graphs in a message. This does respect the RDF 1.1 Semantics as well as the On Semantics of RDF Datasets Working Group Note, which indicate that “a dataset SHOULD be understood to have at least the same content as its default graph”, but allows semantic extensions to define a named graph’s relation to its graph name in addition to the interpretation of the named graph itself.

Therefore we define that the default graph of a query dataset expresses RDF content in the same entailment regime as assertions, but that the named graphs follow different semantics. In an Underlay query, the meaning of a query named graph Q is a resource that is syntactically a ground graph G such that:

  1. G is an instance of Q

  2. The local integral dataset entails a dataset P containing G.

Informally, queries express a computation that is executed by matching and arranging things that have been said, where blank nodes within queries are existentially quantified variables, and the meaning of a query is grounding of every blank node to an IRI or literal value such that the result is an implication of the merged graph of all assertions that the node knows about. And since that merged graph must imply all of its subgraphs, at the very least you can use queries to match subgraphs.

Note that there may be multiple meanings of a query, and number of possible results monotonically increases as more messages are integrated (and monotonically decreases as messages are removed from the integral dataset). It is again left to each node to implement its own heuristic for query resolution, which is optional and best-effort: nodes may entirely ignore some queries, or spend limited or unlimited resources in attempting to resolve them.

Query “results” are “returned” by way of messages: once resolved, nodes assemble an RDF dataset P containing the ground result G as a named graph, and a default graph that links the graph name with an IRI denoting the original (normalized) query by content-addressed URI via the ul:satisfies property. Provenance data describing the source or derivation of G will also be included in the default graph (this will be a necessary consequence of entailment under UnderlaySemantics: the integral dataset will only imply the validity of datasets that document every quad’s content-addressed source in the default graph, or some other minimal provenance requirement). This “result” dataset is sent as a message to the node that posed the query.

Many nodes will have recursive query resolution strategies, involving forwarding subqueries to other nodes based on content heuristics and performing joins across the results. Nodes may use the default graph of queries to include meta-level statements that may assist, guide, or constrain resolution. Heuristics for analyzing and routing subqueries will be explored and possibly formalized in future versions of the protocols.

Static queries

Since each node's entailment regime must be a monotonic extension, we know that graphs entail themselves. This means that at the very least, queries can be used to match subgraphs of a node's local integral dataset.

Node queries

UnderlaySemantics also include the entailment of specific properties about nodes themselves, which are used in lieu of an API for the Underlay network itself. These properties are described in the Schema section below.

Integral dataset

The abstract “database” that nodes maintain is a collection of messages combined into a single RDF dataset through a new operation called integration that produces a set of normalized, content-labeled, and content-ground graphs.

Combining multiple RDF graphs is straightforward except for the handling of blank nodes. The RDF 1.1 Semantics document describes two methods of combining multiple RDF graphs: graph union, which considers blank nodes from different graphs with the same label to be the same node in the union graph, and graph merge, which relabels blanks nodes to be independent in the merged graph.

While merging graphs involves relabeling blank nodes, a similar a process called “skolemization” is a “transformation on RDF graphs which eliminates blank nodes by replacing them with “new” IRIs, which means IRIs which are coined for this purpose and are therefore guaranteed to not occur in any other RDF graph (at the time of creation)”. Typically these introduce a universally unique identifier for every blank node in each graph. Skolemized graphs achieve merge- and union-safety by becoming ground, since merge and union are identical over ground graphs.

To formalize the storage and persistence of many messages as a single dataset, we introduce a concept roughly analogous to a “content-skolemized dataset merge” called the integral dataset. The integral dataset achieves merge- and union-safety by first normalizing and hashing each dataset, then replacing every blank node with its content-addressed blank node identifier, and every graph name (including the default graph) with its content-addressed named graph identifier.


The integral dataset over a set of datasets D_0...D_n is derived by:

  1. Normalizing every dataset D_i by the URDNA2015 algorithm. This produces an N-Quads serialization of each dataset as a UTF-8 string N(D_i), as well as a canonical blank node label c(n) and blank graph name c(g) for every blank node and named graph, respectively.

  2. Computing a content-addressed URI H(N(D_i)) for each normalized dataset. For version 0.1.1 of the assertion protocol these URIs are of the form ul:/ipfs/{hash}, where {hash} is the CID v1 in lowercase base 32 of the UnixFSv1 IPFS file containing the normalized dataset.

  3. Relabeling every blank node n in every normalized dataset N(D_i) to H(N(D_i)) | '#' | c(n), the dataset’s content-addressed URI with the canonical blank node label as a fragment identifier.

  4. Adding every named graph g of every normalized dataset N(D_i) to the integral dataset under the graph name H(N(D_i)) | '#' | c(g), the source dataset’s content-addressed URI with the named graph’s canonical graph name as a fragment identifier.

  5. Adding the default graph of every normalized dataset N(D_i) to the integral dataset as a named graph with graph name H(N(D_i)) | '#', the source dataset's content-addressed URI with an empty fragment identifier.


Useful properties of the integral dataset include:

  • The integral dataset is always ground.

  • The integral dataset always has an empty default graph.

Universal integral dataset

We frequently refer to the “integral dataset over all assertions”, or just “the integral over all assertions”. This universal integral dataset will be used in describing the entailment regime in UnderlaySemantics.

Local integral dataset

The “database” that each node maintains is a local integral dataset over assertions which it chooses to store, although the format and mechanism by which the integral dataset is actually persisted is left to each node. When a node receives an assertion, it may elect to integrate the assertion to its local dataset. Nodes may also integrate assertions that they generate themselves without publishing, or may choose not to integrate anything (all nodes are abstractly prescribed a database, but it may be permanently empty).

It’s unlikely that a literal, complete copy of the integral dataset over all assertions will ever exist in one place, and more likely that each Underlay node will store a subset of the assertions it receives (which is itself a subset of the assertions that have been published), thus only keeping a partial copy of the integral dataset. It may be the some assertions are not integrated by any node, and thus are lost forever, even though they may have induced side effects along the way.


The Underlay defines a namespace with preferred prefix alias ul.

Prefixes used in this schema include:



A ul:Node is a node in the Underlay.

ul:Node rdf:type rdfs:Class .

Nodes are identified by PeerId URIs as previously described.


The property ul:peer relates a ul:Node to a single peer ul:Node.

ul:peer rdf:type rdfs:Property .
ul:peer rdfs:domain ul:Node .
ul:peer rdfs:range ul:Node .


A ul:Message is a generalized RDF dataset expressing RDF data under the UnderlaySemantics entailment regime.

ul:Message rdf:type rdfs:Class .
ul:Message rdfs:subClassOf prov:Collection .

Messages are treated as digital resources and extend PROV Collections. Messages implicitly have a prov:hadMember relation to each of their constituent graphs (including the default graph), which are all themselves subclasses of PROV Collections.

All named graphs in a message must have blank graph names. Messages are identified a URI that content-addresses the normalized dataset.


A ul:Dataset is an integrated dataset over some number of ul:Message datasets.

ul:Dataset rdf:type rdfs:Class . ul:Dataset rdfs:subClassOf prov:Collection .

A ul:Dataset is identified by a URI that content-addresses the dataset in normalized N-Quads form.

A ul:Dataset differs from a ul:Message in the following ways:

  • A ul:Dataset is always ground.

  • A ul:Dataset has an empty default graph.

  • A ul:Dataset contains RDF graphs, while a ul:Message may contain queries that are generalized RDF graphs.

  • The named graphs in a ul:Dataset have URI graph names, while the named graphs in a ul:Message have blank graph names.


A ul:Graph is an RDF graph.

ul:Graph rdf:type rdfs:Class .
ul:Graph rdfs:subClassOf prov:Collection .

This class mainly exists to have a well-typed range for the ul:satisfies property. Like queries, graphs are a subclass of PROV Collections and implicitly have a prov:hadMember relation to each of their constituent rdf:Statement quads. Unlike queries, graphs do not need an explicit type. The default graph of a ul:Message is always implied to have an rdf:type of ul:Graph.


A ul:Query is an RDF graph expressing a query in the Underlay.

ul:Query rdf:type rdfs:Class .
ul:Query rdfs:subClassOf prov:Collection .

Queries must be explicitly typed in messages. Named graphs are assumed to be of type ul:Graph unless given an rdf:type of ul:Query.

Queries are identified a URI that content-addresses the normalized dataset. Queries are a subclass of PROV Collections and implicitly have a prov:hadMember relation to each of their constituent rdf:Statement quads


The ul:satisfies property expresses resolution of a ul:Query by a ul:Graph.

ul:satisfies rdf:type rdfs:Property .
ul:satisfies rdfs:domain ul:Graph .
ul:satisfies rdfs:range ul:Query .

A query may be resolved by an arbitrary number of unique satisfying graphs. In other words, a message may include multiple graphs satisfying the same query, so long as no two of those graphs are isomorphic.


The ul:index property relates a ul:Query to the blank nodes over which it is indexed. These blank nodes are referred to as the query’s index set, and they constrain the number and kind of graphs that may satisfy it in any particular message.

ul:index rdf:type rdfs:Property .
ul:index rdfs:domain ul:Query .
ul:index rdfs:range rdf:Resource .

Each graph that ul:satisfies a query must differ in at least one of the variables in the index set. In other words, in any one message, no two graphs satisfying the same query can have the same value for all of the variables in the query’s index set.

By default, a ul:Query is assumed to have an empty index set, which places no constraint on the graphs that may satisfy it.

Example Queries

These examples are written in JSON-LD and elide the following context object:

  "@vocab": "",
  "xsd":    "",
  "ul":     ""

They queries assume a hypothetical Underlay node serving Wikidata content under the schema, loosely translated for sake of illustration.

Provenance data is omitted and will be described in a later document.

The hashes in the results below can be reproduced with jsonld-cli and IPFS:

echo '{query}' | jsonld normalize | ipfs add -n -Q

Find something named Vincent van Gogh.

  "@type": "ul:Query",
  "@graph": { "name": "Vincent van Gogh" }

This normalizes to:

_:c14n0 <> "Vincent van Gogh" _:c14n1 .
_:c14n1 <> <> .

… with CID QmVtKcLCXDVgJXKKPReeaC5aLn471MzrL9mH4zMxRe9zLp. This means that the query graph has an absolute URI ul:/ipfs/…9zLp#_:c14n1.

A response to this query might look like:

  "ul:satisfies": "ul:/ipfs/…rEyU#_:c14n1",
  "@graph": {
    "@id": "",
    "name": "Vincent van Gogh"

Find a Person named Vincent van Gogh, along with their date of birth.

  "@type": "ul:Query",
  "@graph": {
    "@type": "Person",
    "name": "Vincent van Gogh",
    "birthDate": { }

This normalizes to:

_:c14n0 <> _:c14n2 _:c14n1 .
_:c14n0 <> "Vincent van Gogh" _:c14n1 .
_:c14n0 <> <> _:c14n1 .
_:c14n1 <> <> .

… with CID QmXvomf8FoomTJjD7urL6axEG2QQ7SD3mx2smoJUWD3ctt. This means that the query graph has an absolute URI ul:/ipfs/…3ctt#_:c14n1.

A response to this query might look like:

  "ul:satisfies": "ul:/ipfs/…3ctt#_:c14n1",
  "@graph": {
    "@id": "",
    "birthDate": {
      "@value": "1853-03-30",
      "@type": "xsd:date"

Open questions

Open problem statements of have been migrated to the Underlay research directions document.

Joel Gustafson:

Can I really just get away with throwing around the word “implied” like this? It seems like it?

Joel Gustafson:

I guess it only really checks out if/when we switch to our own URI scheme ul:/… and then we’ll be able to specify this in our semantic extension which we’ll definitely for sure formally specify some day

+ 1 more...
Samuel Klein:

Worth linking to a stub

Joel Gustafson:

I suspect this property will be short-lived; we’ll need to drop it as soon as we come up with a mechanism for universal quantification.

Joel Gustafson:

This is an absolutely absurd sentence to have to write, but is necessary. The W3C could not reach consensus on what it means to use a graph name as the subject or object of a triple, but the emerging consensus (and most practically useful for our purposes) is that - shocker - it means the graph whose name it is.

Samuel Klein:

This doc will read better if every hash only appears once and is given a shorthand (used thereafter). “bafy… (<VanGogh-1>)”

Samuel Klein:

Examples of standard extensions? Are there some core entailment functions that are not ‘extended’ at all?

Sandro Hawke:

It’s an odd choice to pick two standards and combine them in a new non-standard way.

Joel Gustafson:

Yeah - although “CBOR-LD” has been discussed on the JSON-LD working group mailing list and the CBOR spec includes a section on round-tripping to JSON and back. I feel like I want to prescribe some kind of compression, both to save space on the wire and to emphasize that JSON-LD is just a transmission format.

+ 1 more...
Samuel Klein:

Consider a parallel section like this per design choice

Samuel Klein:

many will be mistakes and spam, even if all nodes are benevolent

Samuel Klein:

It may be instructive to catalog some degenerate types of nodes and how others are expected to interact with them: {continuous broadcast} {only write to other nodes} {only listen and gather network statistics} {only listen and propagate}

Samuel Klein:

Worth making space in the open problems for the known problem areas that might fall outside this assumption, even among initial public institutions.

Samuel Klein:

How does immutability play with high volume? (e.g.: disappearance, deletion-equivalents)

Joel Gustafson:

So nodes are free and welcome to drop assertions they no longer care about - ‘immutability’ just means there’s no ‘editing’ of data within assertions, just more assertions that might come later

Joel Gustafson:

The more I think about it the more I think this should be required, and part of the definition of “local integral dataset” includes that the node pins all of the assertions it claims to have integrated.

Joel Gustafson:

It wouldn’t be materially restrictive because we already let nodes keep an empty integral dataset - it just strengthens the meaning of “integrate”

+ 1 more...
Joel Gustafson:

This is substantially different than a malicious party taking over a domain of a schema/ontology used as a predicate, since they’re treated as identifiers and should not be strictly considered as dereferenceable in the first place.

Joel Gustafson:

This was a tough call because the spec basically requires everyone to canonicalize queries and assertions on both ends. But the canonicalized string ends up being really big (since every instance of every URI/predicate/literal/etc is written out in full, absolute form, and *especially* because RDF doesn’t have native support for lists and JSON-LD uses native arrays) and JSON-LD lets us alias all of those super nicely, even though the recipient needs to canonicalize it right away. I suspect we’ll end up caring about space a lot (some assertion datasets will be huge). CBOR buys us another ~20-50% for free

Joel Gustafson:

With any luck at all, the Underlay will create pressure to develop faster implementations of the canonicalization algorithm