Introduction
The Underlay is a peer-to-peer network of semantic databases united by a universal query interface. These databases compose a decentralized knowledge graph of public structured data that can be freely consumed and replicated by humans and machines alike.
Two principles framing this design are autonomy and flexibility: virtually no components are “required”, and nodes are delegated maximal freedom to act as they see fit. This is done for simplicity, to avoid making decisions in spaces that we don't yet understand, and is justified by the good-faith nature of the nodes we anticipate the network comprising - largely public institutions like universities and libraries that are already invested in supporting a better knowledge infrastructure. The system will be more inclusive and more sustainable if these institutions are able to support exactly those services that align with their interests.
The purpose of the Underlay is not to prescribe the operation of specific services, but rather to define common interfaces at useful points of coordination.
Architecture
The Underlay is built on top of IPFS, a peer-to-peer hypermedia protocol. IPFS uses a distributed hash table to create an immutable distributed filesystem where files are addressed by their hash. A node “adding a file to IPFS” doesn’t force other nodes to store the file, but rather uses the distributed hash table to associate the file’s hash with that node’s network addresses. This lets nodes request files by their hash from the network at large without prior knowledge of the file’s location.
IPFS itself is built on the libp2p networking stack, which provides abstract interfaces for a variety of interoperable network components such as “Transports” and “Connections”. Applications using libp2p register handlers for the custom protocols that they define, which are named by versioned strings. Libp2p has seen increasingly widespread adoption in decentralized projects like Filecoin, Substrate, Polkadot, and Ethereum 2.0.
An Underlay node is an IPFS node extended with a custom libp2p protocol. This means that nodes implement at least /mplex
for libp2p, /ipfs/id
and /ipfs/bitswap
for IPFS, and /ul/0.1.1/cbor-ld
for the Underlay.
Underlay nodes also maintain an abstract integral dataset containing a select subset of the messages they send and receive. Integrating a message involves pinning the normalized message to IPFS and adding its contents into the local integral dataset (however implemented). Nodes can also disintegrate messages by unpinning them from IPFS and removing them from the dataset.
Protocol
/ul/0.1.1/cbor-ld
Nodes send and receive messages over a custom protocol /ul/0.1.1/cbor-ld
, which will be incremented according to semantic versioning as it evolves.
Messages are expressed as RDF Datasets. For an introduction to RDF, review the W3C RDF 1.1 Primer. Datasets, graphs, triples, blank nodes, and grounding are referenced here extensively.
The last path element of the libp2p protocol string identifies an RDF dataset serialization format. In the initial versions of the protocol, all messages are serialized as CBOR-encoded JSON-LD, following section 4 of RFC 7049. “CBOR-LD” is only prescribed as an ephemeral, transmission-only serialization. Nodes may choose to persist the RDF content of messages in any format they choose.
Nodes are encouraged to use JSON-LD contexts to compact their messages, but must not reference remote contexts unless the remote context URI carries in-band cryptographic resource integrity (e.g. IPFS/IPLD URIs or Hashlinks). This is a serious security concern - domain names may expire or otherwise get taken over by malicious parties that should not have the ability to redefine the semantic content of past statements.
Other serialization formats may be supported for messages in the future.
Mechanics
The receipt of a message does not require a response. Instead, each node implements its own routing and logic to determine what messages to send to which other nodes, if any. The node may consider the content of each message, local state accumulated over messages, or anything else at its disposal. Nodes may send messages that were not “triggered” by other messages at all.
This is similar to the actor model or the original conception of object-oriented programming as implemented by Smalltalk. Messages are only sent - not even acknowledged. Nodes act as black boxes with private internal state that may spontaneously send anything to anyone.
In combination with the view of queries as computations that the network performs, the Underlay can be seen as a distributed semantic computer or knowledge engine.
Messages
Messages are RDF Datasets and their interpretation is governed by a semantic extension of RDF referred to here as UnderlaySemantics
. Its entailment regime will be formally described in a later document, but a few key properties are highlighted here:
The graph name of a named graph denotes that named graph, as described in section 3.3 of the W3C working group note on the semantics of RDF Datasets.
Only blank graph names are valid. Named graphs with IRI graph names have no meaning.
Underlay nodes, messages, graphs, queries, and triples imply their corresponding rdf:type
and prov:hadMember
relations.
All messages are interpreted in this regime. Nodes may internally follow monotonic extensions to this regime, but those extensions must not negate the existing rules of entailment under UnderlaySemantics
.
The content of messages are not considered facts themselves; they’re treated as assertions of truth made in an open-world framework.
Identifiers
Supporting provenance and reification in the Underlay involves referring to nodes and messages, and the graphs, blank nodes, and triples within messages.
It’s important for these URIs to be one-to-one: given isomorphic messages in different serializations, nodes should derive identical URIs for the corresponding messages and their graphs, blank nodes, and triples. An identifier scheme that is not one-to-one does not truly identify the dataset as an abstract syntactic construct; it introduces redundancy by identifying a specific serialization.
Any scheme for deriving one-to-one identifiers for graphs is guaranteed to be NP-complete, since it can be used to decide graph isomorphism. Paying for this “up-front” means that identifiers, once they exist, can be used for constant-time equality checks (or “containment” checks of nodes within graphs or graphs within messages). A one-to-many scheme would mean pushing this complexity onto users, where isomorphism would be redundantly re-computed many times in a world where any two identifiers might identify the same thing.
In the initial versions of the Underlay, these identifiers use the URDNA2015 RDF Dataset Normalization algorithm published by the W3C Credentials Community Group. The algorithm produces a normalized N-Quads serialization of an RDF dataset as a UTF-8 string ending in a newline character.
Messages
ul:/ipfs/{CIDv0}
The canonical URI for a message is ul:/ipfs/{CIDv0}
, where {CIDv0}
is the base-58 CID v1 of the UnixFSv1 file containing its URDNA2015 string. In practice, it just means copying the output of ipfs add
and pasting it after ul:/ipfs/
.
Nodes are encouraged to pin relevant messages to IPFS. Doing so ensures that other nodes are able to retrieve the full content of a message given only its URI.
Messages have an implied rdf:type
of ul:message
.
Blank nodes
ul:/ipfs/{CIDv0}#{label}
Messages in the Underlay will invariably contain blank nodes, and nodes may wish to reference them in subsequent messages. Grounding blank nodes has historically been a tricky undertaking, but the canonical blank node labels that the URDNA2015 algorithm generates for each blank node in the dataset during normalization suggest a simple approach using fragment identifiers.
In the Underlay, canonical message URIs with fragment identifiers that label a blank node in the normalize dataset are considered semantically equivalent to the blank node in the dataset with that node label.
Named graphs
ul:/ipfs/{CIDv0}#{name}
Since messages can only contain named graphs with blank graph names, and since the semantics of messages dictate that the graph name denotes the named graph, the same approach to identifying blank nodes can be re-used for identifying named graphs.
Each named graph will be assigned a canonical blank graph name during normalization; canonical message URIs with fragment identifiers that label a blank graph name in the normalized dataset are considered semantically equivalent to the named graph in the dataset with that graph name.
Named graphs have an implied rdf:type
of ul:Graph
, unless they are explicitly typed to have an rdf:type
of ul:Query
(in which case they are not of type ul:Graph
). More types beyond ul:Query
may be added in the future.
Default graphs
ul:/ipfs/{CIDv0}#
The default graph of a message, as distinct from the message itself, is identified by an empty fragment identifier (i.e. just #
appended to the canonical message URI).
Default graphs always have an implied rdf:type
of ul:Graph
.
Quads
ul:/ipfs/{CIDv0}#/{index}
Nodes may also wish to reference individual triples from previous messages. Since the normalization algorithm produces a canonical ordering of all quads in the dataset, they can be unambiguously identified by their index within the canonical N-Quads serialization.
Quad URIs extend the message URI with a fragment identifier /{index}
, where {index}
is the (base-10) index of the triple in the normalized dataset. They are distinguished from blank node and named graph URIs by their leading slash; all blank node labels and blank graph names begin with _:
.
Quads have an implied rdf:type
of rdf:Statement
.
Nodes
ul:/node/{PeerId}
IPFS and libp2p nodes are identified by their “PeerId”, which is the the sha2-256 multihash of the libp2p node’s public key. PeerIds are typically displayed as base-58 strings.
This identity scheme allows nodes to change network addresses, physical locations, or even institutional affiliation, since identity is governed by cryptographic key ownership. A node’s PeerId can be resolved to its advertised network addresses with a distributed hash table.
In RDF contexts, a node is referenced by ul:/node/{PeerId}
, where {PeerId}
is its base-58 PeerId. Note that node
is distinct from the ipfs
root used in the other URI formats.
This may conveniently double as a dereferenceable URI over IPFS or a public gateway (e.g. https://gateway.ipfs.io/ipns/{PeerId}
) if the node chooses to publish an IPNS record with its key pair. However it is only used in the Underlay as an identifier.
Nodes have an implied rdf:type
of ul:Node
.
Named graphs
There is no distinction between the semantics of named ul:Graph
graphs and the default graph in messages, especially since even the default graph of a message becomes a named graph once integrated by nodes. Named graphs are treated as a means of reification within messages - allowing for statements about graphs as collections of statements themselves. Later messages can achieve the same semantic result (making statements about graphs) by using the message URI with a fragment of the canonized graph name.
This method was chosen over of other reification methods such as RDF* on the strength of a few advantages:
Concision: a content-addressed URI is significantly more concise for large graphs than the overhead of replicating the entire graph as reified rdf:Statement
objects. This concision is achieved without loss of access to the graph if desired so long as the source assertion is persisted by any node on the network.
Vertical scoping: letting authors pick which statements to organize into which named graphs allows for space to express the natural grouping of statements that frequently exists in the real world.
Lateral scoping: this content-addressing scheme names graphs relative to their containing dataset, which means authors have control over what other data is permanently associated with any given set of statements. This is good for tentative, controversial or even incorrect statements that a node may want to publish, but only if it is certain that the data will always be accompanied by some qualifying provenance.
Queries
Queries are an extremely general and powerful primitive in the Underlay. They can be used to select and retrieve data, like traditional relational queries. They can be used to perform computation, à la Prolog. They are also used for meta-level API-style calls within the network by sending nodes queries about themselves.
Syntactically, queries are RDF graphs (and are always named graphs within the dataset), but they have different semantics from the other graphs in a message. This does respect the RDF 1.1 Semantics as well as the On Semantics of RDF Datasets Working Group Note, which indicate that “a dataset SHOULD be understood to have at least the same content as its default graph”, but allows semantic extensions to define a named graph’s relation to its graph name in addition to the interpretation of the named graph itself.
Therefore we define that the default graph of a query dataset expresses RDF content in the same entailment regime as assertions, but that the named graphs follow different semantics. In an Underlay query, the meaning of a query named graph Q
is a resource that is syntactically a ground graph G
such that:
G
is an instance of Q
The local integral dataset entails a dataset P
containing G
.
Informally, queries express a computation that is executed by matching and arranging things that have been said, where blank nodes within queries are existentially quantified variables, and the meaning of a query is grounding of every blank node to an IRI or literal value such that the result is an implication of the merged graph of all assertions that the node knows about. And since that merged graph must imply all of its subgraphs, at the very least you can use queries to match subgraphs.
Note that there may be multiple meanings of a query, and number of possible results monotonically increases as more messages are integrated (and monotonically decreases as messages are removed from the integral dataset). It is again left to each node to implement its own heuristic for query resolution, which is optional and best-effort: nodes may entirely ignore some queries, or spend limited or unlimited resources in attempting to resolve them.
Query “results” are “returned” by way of messages: once resolved, nodes assemble an RDF dataset P
containing the ground result G
as a named graph, and a default graph that links the graph name with an IRI denoting the original (normalized) query by content-addressed URI via the ul:satisfies
property. Provenance data describing the source or derivation of G
will also be included in the default graph (this will be a necessary consequence of entailment under UnderlaySemantics
: the integral dataset will only imply the validity of datasets that document every quad’s content-addressed source in the default graph, or some other minimal provenance requirement). This “result” dataset is sent as a message to the node that posed the query.
Many nodes will have recursive query resolution strategies, involving forwarding subqueries to other nodes based on content heuristics and performing joins across the results. Nodes may use the default graph of queries to include meta-level statements that may assist, guide, or constrain resolution. Heuristics for analyzing and routing subqueries will be explored and possibly formalized in future versions of the protocols.
Static queries
Since each node's entailment regime must be a monotonic extension, we know that graphs entail themselves. This means that at the very least, queries can be used to match subgraphs of a node's local integral dataset.
Node queries
UnderlaySemantics
also include the entailment of specific properties about nodes themselves, which are used in lieu of an API for the Underlay network itself. These properties are described in the Schema section below.
Integral dataset
The abstract “database” that nodes maintain is a collection of messages combined into a single RDF dataset through a new operation called integration that produces a set of normalized, content-labeled, and content-ground graphs.
Combining multiple RDF graphs is straightforward except for the handling of blank nodes. The RDF 1.1 Semantics document describes two methods of combining multiple RDF graphs: graph union, which considers blank nodes from different graphs with the same label to be the same node in the union graph, and graph merge, which relabels blanks nodes to be independent in the merged graph.
While merging graphs involves relabeling blank nodes, a similar a process called “skolemization” is a “transformation on RDF graphs which eliminates blank nodes by replacing them with “new” IRIs, which means IRIs which are coined for this purpose and are therefore guaranteed to not occur in any other RDF graph (at the time of creation)”. Typically these introduce a universally unique identifier for every blank node in each graph. Skolemized graphs achieve merge- and union-safety by becoming ground, since merge and union are identical over ground graphs.
To formalize the storage and persistence of many messages as a single dataset, we introduce a concept roughly analogous to a “content-skolemized dataset merge” called the integral dataset. The integral dataset achieves merge- and union-safety by first normalizing and hashing each dataset, then replacing every blank node with its content-addressed blank node identifier, and every graph name (including the default graph) with its content-addressed named graph identifier.
Algorithm
The integral dataset over a set of datasets D_0...D_n
is derived by:
Normalizing every dataset D_i
by the URDNA2015 algorithm. This produces an N-Quads serialization of each dataset as a UTF-8 string N(D_i)
, as well as a canonical blank node label c(n)
and blank graph name c(g)
for every blank node and named graph, respectively.
Computing a content-addressed URI H(N(D_i))
for each normalized dataset. For version 0.1.1
of the assertion protocol these URIs are of the form ul:/ipfs/{hash}
, where {hash}
is the CID v1 in lowercase base 32 of the UnixFSv1 IPFS file containing the normalized dataset.
Relabeling every blank node n
in every normalized dataset N(D_i)
to H(N(D_i)) | '#' | c(n)
, the dataset’s content-addressed URI with the canonical blank node label as a fragment identifier.
Adding every named graph g
of every normalized dataset N(D_i)
to the integral dataset under the graph name H(N(D_i)) | '#' | c(g)
, the source dataset’s content-addressed URI with the named graph’s canonical graph name as a fragment identifier.
Adding the default graph of every normalized dataset N(D_i)
to the integral dataset as a named graph with graph name H(N(D_i)) | '#'
, the source dataset's content-addressed URI with an empty fragment identifier.
Properties
Useful properties of the integral dataset include:
Universal integral dataset
We frequently refer to the “integral dataset over all assertions”, or just “the integral over all assertions”. This universal integral dataset will be used in describing the entailment regime in UnderlaySemantics.
Local integral dataset
The “database” that each node maintains is a local integral dataset over assertions which it chooses to store, although the format and mechanism by which the integral dataset is actually persisted is left to each node. When a node receives an assertion, it may elect to integrate the assertion to its local dataset. Nodes may also integrate assertions that they generate themselves without publishing, or may choose not to integrate anything (all nodes are abstractly prescribed a database, but it may be permanently empty).
It’s unlikely that a literal, complete copy of the integral dataset over all assertions will ever exist in one place, and more likely that each Underlay node will store a subset of the assertions it receives (which is itself a subset of the assertions that have been published), thus only keeping a partial copy of the integral dataset. It may be the some assertions are not integrated by any node, and thus are lost forever, even though they may have induced side effects along the way.
Schema
The Underlay defines a namespace http://underlay.mit.edu/ns#
with preferred prefix alias ul
.
Prefixes used in this schema include:
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
prov: http://www.w3.org/ns/prov#
ul: http://underlay.mit.edu/ns#
ul:Node
A ul:Node
is a node in the Underlay.
ul:Node rdf:type rdfs:Class .
Nodes are identified by PeerId URIs as previously described.
ul:peer
The property ul:peer
relates a ul:Node
to a single peer ul:Node
.
ul:peer rdf:type rdfs:Property .
ul:peer rdfs:domain ul:Node .
ul:peer rdfs:range ul:Node .
ul:Message
A ul:Message
is a generalized RDF dataset expressing RDF data under the UnderlaySemantics
entailment regime.
ul:Message rdf:type rdfs:Class .
ul:Message rdfs:subClassOf prov:Collection .
Messages are treated as digital resources and extend PROV Collections. Messages implicitly have a prov:hadMember
relation to each of their constituent graphs (including the default graph), which are all themselves subclasses of PROV Collections.
All named graphs in a message must have blank graph names. Messages are identified a URI that content-addresses the normalized dataset.
ul:Dataset
A ul:Dataset
is an integrated dataset over some number of ul:Message
datasets.
ul:Dataset rdf:type rdfs:Class . ul:Dataset rdfs:subClassOf prov:Collection .
A ul:Dataset
is identified by a URI that content-addresses the dataset in normalized N-Quads form.
A ul:Dataset
differs from a ul:Message
in the following ways:
A ul:Dataset
is always ground.
A ul:Dataset
has an empty default graph.
A ul:Dataset
contains RDF graphs, while a ul:Message
may contain queries that are generalized RDF graphs.
The named graphs in a ul:Dataset
have URI graph names, while the named graphs in a ul:Message
have blank graph names.
ul:Graph
A ul:Graph
is an RDF graph.
ul:Graph rdf:type rdfs:Class .
ul:Graph rdfs:subClassOf prov:Collection .
This class mainly exists to have a well-typed range for the ul:satisfies
property. Like queries, graphs are a subclass of PROV Collections and implicitly have a prov:hadMember
relation to each of their constituent rdf:Statement
quads. Unlike queries, graphs do not need an explicit type. The default graph of a ul:Message
is always implied to have an rdf:type
of ul:Graph
.
ul:Query
A ul:Query
is an RDF graph expressing a query in the Underlay.
ul:Query rdf:type rdfs:Class .
ul:Query rdfs:subClassOf prov:Collection .
Queries must be explicitly typed in messages. Named graphs are assumed to be of type ul:Graph
unless given an rdf:type
of ul:Query
.
Queries are identified a URI that content-addresses the normalized dataset. Queries are a subclass of PROV Collections and implicitly have a prov:hadMember
relation to each of their constituent rdf:Statement
quads
ul:satisfies
The ul:satisfies
property expresses resolution of a ul:Query
by a ul:Graph
.
ul:satisfies rdf:type rdfs:Property .
ul:satisfies rdfs:domain ul:Graph .
ul:satisfies rdfs:range ul:Query .
A query may be resolved by an arbitrary number of unique satisfying graphs. In other words, a message may include multiple graphs satisfying the same query, so long as no two of those graphs are isomorphic.
ul:index
The ul:index
property relates a ul:Query
to the blank nodes over which it is indexed. These blank nodes are referred to as the query’s index set, and they constrain the number and kind of graphs that may satisfy it in any particular message.
ul:index rdf:type rdfs:Property .
ul:index rdfs:domain ul:Query .
ul:index rdfs:range rdf:Resource .
Each graph that ul:satisfies
a query must differ in at least one of the variables in the index set. In other words, in any one message, no two graphs satisfying the same query can have the same value for all of the variables in the query’s index set.
By default, a ul:Query
is assumed to have an empty index set, which places no constraint on the graphs that may satisfy it.
Example Queries
These examples are written in JSON-LD and elide the following context object:
{
"@vocab": "http://schema.org/",
"xsd": "http://www.w3.org/2001/XMLSchema#",
"ul": "http://underlay.mit.edu/ns#"
}
They queries assume a hypothetical Underlay node serving Wikidata content under the schema.org schema, loosely translated for sake of illustration.
Provenance data is omitted and will be described in a later document.
The hashes in the results below can be reproduced with jsonld-cli and IPFS:
echo '{query}' | jsonld normalize | ipfs add -n -Q
Find something named Vincent van Gogh.
{
"@type": "ul:Query",
"@graph": { "name": "Vincent van Gogh" }
}
This normalizes to:
_:c14n0 <http://schema.org/name> "Vincent van Gogh" _:c14n1 .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://underlay.mit.edu/ns#Query> .
… with CID QmVtKcLCXDVgJXKKPReeaC5aLn471MzrL9mH4zMxRe9zLp
. This means that the query graph has an absolute URI ul:/ipfs/…9zLp#_:c14n1
.
A response to this query might look like:
{
"ul:satisfies": "ul:/ipfs/…rEyU#_:c14n1",
"@graph": {
"@id": "http://www.wikidata.org/entity/Q5582",
"name": "Vincent van Gogh"
}
}
Find a Person named Vincent van Gogh, along with their date of birth.
{
"@type": "ul:Query",
"@graph": {
"@type": "Person",
"name": "Vincent van Gogh",
"birthDate": { }
}
}
This normalizes to:
_:c14n0 <http://schema.org/birthDate> _:c14n2 _:c14n1 .
_:c14n0 <http://schema.org/name> "Vincent van Gogh" _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> _:c14n1 .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://underlay.mit.edu/ns#Query> .
… with CID QmXvomf8FoomTJjD7urL6axEG2QQ7SD3mx2smoJUWD3ctt
. This means that the query graph has an absolute URI ul:/ipfs/…3ctt#_:c14n1
.
A response to this query might look like:
{
"ul:satisfies": "ul:/ipfs/…3ctt#_:c14n1",
"@graph": {
"@id": "http://www.wikidata.org/entity/Q5582",
"birthDate": {
"@value": "1853-03-30",
"@type": "xsd:date"
}
}
}
Open questions
Open problem statements of have been migrated to the Underlay research directions document.