
Content-Addressing Semantic Data

Published on Oct 24, 2019

This document describes alternative approaches to identification and reification in RDF based on content-addressing canonicalized datasets.

A basic understanding of RDF is assumed, especially the concepts of blank nodes and RDF datasets, but skimming the RDF Primer should be enough to get by. JSON-LD is also mentioned but the details aren’t important; for our purposes it’s “just another RDF serialization”.

For simplicity, the distinction between IRIs and URIs is ignored.


Blank nodes in RDF are anonymous nodes that don’t have any semantic identifier, and they’re a blessing and a curse. In practice most serializations end up needing locally-scoped labels to actually represent them, although in JSON-LD, every JSON object that doesn’t have an explicit "@id" property is implicitly interpreted as a new, distinct blank node. So the JSON-LD document

{
  "": "John Doe"
}

is equivalent to the N-Triples file

_:foo <> "John Doe" .

and they both encode the graph

The _:foo label in the N-Triples file only exists to distinguish it from other blank nodes in the same dataset. It’s not expected to have any relation to other _:foo nodes in other datasets.


Blank nodes are useful for representing data structures that don’t fit into a graph model well, like linked lists, where the elements must be ordered:

_:b0 <> "fee" .
_:b0 <> _:b1 .
_:b1 <> "fi" .
_:b1 <> _:b2 .
_:b2 <> "fo" .
_:b2 <> _:b3 .
_:b3 <> "fum" .
_:b3 <> <> .

We don’t want to come up with URIs for every link in our linked list, so we use blank nodes to structure our data without them.
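As a rough sketch of how such a list could be generated, the function below turns a JavaScript array into N-Triples. It assumes the standard RDF collection vocabulary (rdf:first, rdf:rest, rdf:nil) for the predicates, and the helper name listToNTriples is hypothetical. Literal escaping is approximated with JSON.stringify, which is only adequate for simple strings.

```javascript
// Assumption: the list uses the standard RDF collection vocabulary.
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

// Hypothetical helper: one blank node per list cell, labelled _:b0, _:b1, ...
function listToNTriples(items) {
  const lines = [];
  items.forEach((item, i) => {
    const node = `_:b${i}`;
    // The last cell points at rdf:nil instead of another blank node.
    const next = i + 1 < items.length ? `_:b${i + 1}` : `<${RDF}nil>`;
    lines.push(`${node} <${RDF}first> ${JSON.stringify(item)} .`);
    lines.push(`${node} <${RDF}rest> ${next} .`);
  });
  return lines.join("\n");
}

console.log(listToNTriples(["fee", "fi", "fo", "fum"]));
```

None of the four cells needs a URI of its own; the labels only have to be distinct within the file.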


But sometimes we want URIs for blank nodes, even though they were invented to avoid that! Suppose you came across the graph in the first example:

Maybe someone emailed it to you, or you found it on the ground, or a USB stick fell out of the sky. And suppose you wanted to reference the blank node in a graph of your own:

Hey, I think I know that guy! Here are some more facts about him…

There’s no clear way to do this! Ideally we’d be able to eat our cake and have it too: blank nodes should be a convenience for authors, but also shouldn’t prevent future authors from linking to them unambiguously. But how?


The RDF spec has a section on “skolemization”, which is what they call the process of replacing blank nodes with globally unique URIs:

In situations where stronger identification is needed, systems MAY systematically replace some or all of the blank nodes in an RDF graph with IRIs. Systems wishing to do this SHOULD mint a new, globally unique IRI (a Skolem IRI) for each blank node so replaced.

The spec prescribes minting well-known URIs with the registered name genid to get URIs that look like this:

It’s hard to tell exactly what kind of situation is meant by the vague reference to “systems” in the spec. This is probably intentional - the authors don’t want to make assumptions or prescriptions about how RDF data will be used - but some centralized entity will have to do the work of choosing a domain name and generating the unique identifiers.
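To make the mechanics concrete, here is a minimal sketch of skolemization over an N-Quads string. The authority https://example.com and the helper name skolemize are assumptions for illustration; the only part taken from the spec is the /.well-known/genid/ path. The regular expression is a simplification that would also rewrite a "_:" sequence inside a literal.

```javascript
const { randomUUID } = require("crypto");

// Hypothetical helper: replace every blank node label with a freshly minted Skolem IRI.
function skolemize(nquads, authority) {
  const minted = new Map(); // label -> Skolem IRI, so repeated labels stay consistent
  const output = nquads.replace(/_:([A-Za-z0-9]+)/g, (match, label) => {
    if (!minted.has(label)) {
      minted.set(label, `${authority}/.well-known/genid/${randomUUID()}`);
    }
    return `<${minted.get(label)}>`;
  });
  return { output, minted };
}

const input = '_:foo <http://schema.org/name> "John Doe" .\n';
console.log(skolemize(input, "https://example.com").output);
```

Note that the result depends on who runs it: two parties skolemizing the same file independently will mint different IRIs.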

Problems with skolemization

Consider a prototypical user story: a library publishes an RDF dataset describing their catalog, and uses linked lists to represent the sequence of books in a series. One book series was adapted into a movie, and later, a movie database application wishes to reference one of the library’s linked lists in an RDF dataset of their own. Unless the library anticipated this need and generated well-known URIs to replace their blank nodes, the movie application has no way of referencing them.

On the other hand, if the movie application generates well-known URIs for the library’s blank nodes, then there’s no effective link to the library’s dataset at all. Out of context, the well-known URI is no more useful than another blank node, since it leaves a user with no way to find out more about its referent from other sources, which is the fundamental premise of linked data.

Indexing datasets

Since blank nodes are strictly “scoped” to a dataset, the problem of identifying blank nodes reduces to the problem of identifying datasets. Given a URI for a dataset in a serialization that assigns local labels to every blank node, we could address blank nodes using fragment identifiers.

For example, if everybody in the world knew that the URI referred to the serialized file

_:foo <> "John Doe" .

… then we could globally identify the blank node labelled _:foo with the URI …#foo. It’s important that the specific serialization is the thing identified by the root URI, since

_:bar <> "John Doe" .

expresses the same graph using different blank node labels. But if we have a URI for the specific serialization, then we’re free to use the labels in the fragment identifier without ambiguity.


Why …#foo and not …#_:foo?

The first is simply more readable, and the precedent set by the RDFJS spec treats the _: as a “serialization specific prefix” that should be dropped whenever lifting the label out of that serialization.
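The convention can be sketched as a small helper. The name blankNodeUri and the dataset URI https://example.com/data.nq are hypothetical; the only behavior taken from the text is that the _: prefix is dropped before the label is used as a fragment.

```javascript
// Hypothetical helper: address a blank node by dataset URI plus label.
function blankNodeUri(datasetUri, label) {
  if (datasetUri.includes("#")) {
    throw new Error("dataset URI must not already carry a fragment");
  }
  // "_:" is a serialization-specific prefix, so it is not part of the fragment.
  const bare = label.startsWith("_:") ? label.slice(2) : label;
  return `${datasetUri}#${bare}`;
}

console.log(blankNodeUri("https://example.com/data.nq", "_:foo"));
// https://example.com/data.nq#foo
```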


This interpretation of fragment identifiers isn’t prescribed in the N-Quads spec but is compatible with the fragment identifier semantics in RFC 3986:

The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The identified secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations.

Tim Berners-Lee’s personal view is that fragment identifiers refer to the thing referred to by the fragment identifier in the document:

The fragment identifier on an RDF (or N3) document identifies not a part of the document, but whatever thing, abstract or concrete, animate or inanimate, the document describes as having that identifier.

This is less confusing than it may appear; all it means is that the referent of …#foo is “whatever it is that _:foo refers to”, not “_:foo itself”. Put another way, it means that the reference graph looks like the image on the left, not the image on the right.

referring to the referent (good)

referring to the reference (bad)

Broadly, this fits with all of our expectations about fragment identifiers on the web. When browsers fetch a web page, they don’t send the URL fragment to the server: they request the whole page, and then use the fragment to index a particular DOM element within the document. Similarly, blank node labels require the context of the entire dataset to interpret.
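A client resolving a blank node URI could follow the same two-step pattern: strip the fragment, retrieve the whole dataset, then interpret the fragment locally. The sketch below uses a hypothetical in-memory store in place of a real fetch, and an assumed schema.org predicate; matching on whitespace-separated tokens is a simplification rather than a real N-Quads parser.

```javascript
// Split a URI into the part that gets dereferenced and the part interpreted locally.
function splitFragment(uri) {
  const i = uri.indexOf("#");
  if (i === -1) return { dataset: uri, fragment: null };
  return { dataset: uri.slice(0, i), fragment: uri.slice(i + 1) };
}

// Hypothetical stand-in for retrieving a dataset by its URI.
const store = new Map([
  ["https://example.com/data.nq", '_:foo <http://schema.org/name> "John Doe" .\n'],
]);

// Return the statements in the dataset that mention the referenced blank node.
function statementsAbout(uri) {
  const { dataset, fragment } = splitFragment(uri);
  const nquads = store.get(dataset); // the fragment never reaches the "server"
  if (nquads === undefined || fragment === null) return [];
  const term = `_:${fragment}`;
  return nquads.split("\n").filter(line => line.split(/\s+/).includes(term));
}

console.log(statementsAbout("https://example.com/data.nq#foo"));
```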

URIs for datasets

Sometimes a dataset has a clear URI that is used to refer to it. Other times a dataset is available at an HTTP URL, but the URL is a mirror, and there is some other canonical URL that would be more appropriate to use as an identifier. And in some situations there may be no obvious URI at all, like the case of finding a dataset on a USB stick or attaching one to an email.

This may seem contrived, but decoupling data from its host is critical for its long-term utility. We can’t depend on the persistence of URLs, servers, or even organizations, and tying the usability (however marginal) of a dataset to their stability would be irresponsible.

Instead, the future-facing archive-friendly durability-first vision is to treat the container of RDF data as completely self-contained. This is not to say that there aren’t links to other objects - but rather that the interpretation of the data isn’t dependent on runtime results of attempts to dereference URLs. Datasets should speak for themselves!

There are many facets to this vision (such as linked data signatures for in-band self-authentication to replace the web’s origin-based authority model), but the one in focus here is identity. We’d ideally like a way of identifying a dataset that is independent of its host and even independent of its serialization: derived purely from its semantic content.

This is possible through the combination of two building blocks:

  • A canonicalization algorithm that produces a deterministic normalized serialization of any dataset, such that isomorphic datasets produce identical serializations

  • A content-addressing scheme that assigns a unique, deterministic URI to any serialization (i.e. file)


The former is by far the harder of the two, but fortunately the Credentials W3C Community Group has put an enormous amount of effort into publishing an RDF Dataset Normalization spec which achieves exactly that. In particular, the URDNA2015 canonicalization algorithm renames blank nodes with deterministic labels, derived from the structure of the graph and a lexicographic comparison of the URIs and literals.

For example, the N-Quads documents

_:foo <> "John Doe" .
_:foo <> "Firefighter" .
_:bar <> "Jane Doe" .
_:bar <> "Professor" .
_:bar <> _:foo .

and


_:jane <> "Professor" .
_:jane <> _:john .
_:jane <> "Jane Doe" .
_:john <> "Firefighter" .
_:john <> "John Doe" .

are equivalent, and both would normalize to the canonical N-Quads document

_:c14n0 <> "Firefighter" .
_:c14n0 <> "John Doe" .
_:c14n1 <> "Professor" .
_:c14n1 <> _:c14n0 .
_:c14n1 <> "Jane Doe" .
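To illustrate what "equivalent" means here, the sketch below checks whether two N-Quads documents differ only in their blank node labels by trying every relabeling. This is a brute-force check, not URDNA2015: it is exponential in the number of blank nodes and uses a regular expression instead of a real parser. The schema.org predicates are assumed for the example, and the function names are hypothetical.

```javascript
function parseLines(doc) {
  return doc.split("\n").map(l => l.trim()).filter(l => l.length > 0);
}

function blankLabels(lines) {
  const labels = new Set();
  for (const line of lines) {
    for (const m of line.matchAll(/_:([A-Za-z0-9]+)/g)) labels.add(m[1]);
  }
  return [...labels];
}

function* permutations(items) {
  if (items.length <= 1) { yield items.slice(); return; }
  for (let i = 0; i < items.length; i++) {
    const rest = [...items.slice(0, i), ...items.slice(i + 1)];
    for (const p of permutations(rest)) yield [items[i], ...p];
  }
}

// True if some bijection between blank node labels makes the two sets of lines equal.
function isomorphic(docA, docB) {
  const a = parseLines(docA), b = parseLines(docB);
  const la = blankLabels(a), lb = blankLabels(b);
  if (a.length !== b.length || la.length !== lb.length) return false;
  const target = [...b].sort().join("\n");
  for (const perm of permutations(lb)) {
    const mapping = new Map(la.map((label, i) => [label, perm[i]]));
    const relabeled = a
      .map(line => line.replace(/_:([A-Za-z0-9]+)/g, (m, l) => `_:${mapping.get(l)}`))
      .sort();
    if (relabeled.join("\n") === target) return true;
  }
  return false;
}

const docA = `
_:foo <http://schema.org/name> "John Doe" .
_:foo <http://schema.org/jobTitle> "Firefighter" .
_:bar <http://schema.org/name> "Jane Doe" .
_:bar <http://schema.org/jobTitle> "Professor" .
_:bar <http://schema.org/knows> _:foo .`;
const docB = `
_:jane <http://schema.org/jobTitle> "Professor" .
_:jane <http://schema.org/knows> _:john .
_:jane <http://schema.org/name> "Jane Doe" .
_:john <http://schema.org/jobTitle> "Firefighter" .
_:john <http://schema.org/name> "John Doe" .`;

console.log(isomorphic(docA, docB)); // true
```

A canonicalization algorithm avoids this pairwise search by mapping every member of an equivalence class to the same serialization, so that equality of datasets becomes equality of strings.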


The simplest scheme that assigns a unique, deterministic URI to every file is one that just encodes the entire file inside of the URI:


This is obviously unreasonable, but notice that beyond uniqueness and determinism, this URI scheme also gives a built-in way of recovering the content of the dataset from the URI. This is the same result as dereferencing an HTTP URL! Not only do we get a decentralized way of all agreeing on the same identifier for a dataset and all of its blank nodes (no matter how or where we came across it), but we get an automatic way of retrieving the content of the dataset from that identifier.
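A minimal sketch of this round trip, assuming a base64 data URI with the application/n-quads media type (the exact encoding is an assumption here, and the helper names are hypothetical):

```javascript
const PREFIX = "data:application/n-quads;base64,";

// Encode an entire N-Quads file inside a URI.
function toDataUri(nquads) {
  return PREFIX + Buffer.from(nquads, "utf8").toString("base64");
}

// "Dereference" the URI by decoding it; no network or host is involved.
function fromDataUri(uri) {
  if (!uri.startsWith(PREFIX)) throw new Error("not an N-Quads data URI");
  return Buffer.from(uri.slice(PREFIX.length), "base64").toString("utf8");
}

const file = '_:c14n0 <http://schema.org/name> "John Doe" .\n';
const uri = toDataUri(file);
console.log(uri.length, fromDataUri(uri) === file);
```

The URI grows linearly with the file (base64 adds roughly a third on top), which is why the scheme stops being practical almost immediately.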

We can expect datasets to be large, and data URIs are too long. However, we can turn unique long things into unique short things by hashing them! Two approaches are described here, each with their own trade-offs.


IPFS is a decentralized filesystem that uses a content-hash naming scheme called Content Identifiers, or CIDs, to address files. CIDs are hashes prefixed by a few bytes of metadata, like the base encoding of the CID itself, the hash algorithm used, and the format of the data that was hashed. These self-describing formats are called Multiformats and a few of them are IETF drafts.
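The layering of a CID can be sketched with nothing but Node's crypto module. The code below assembles a CIDv1 for a single raw block: a version byte (0x01), the multicodec code for raw bytes (0x55), a multihash (0x12 for sha2-256, 0x20 for its 32-byte length, then the digest), all encoded as lowercase base32 with the multibase prefix "b". It assumes the content fits in one block; larger files are chunked by IPFS and get a different kind of CID. The function names are hypothetical.

```javascript
const crypto = require("crypto");

const BASE32 = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32, lowercase, without padding.
function base32(bytes) {
  let bits = 0, value = 0, out = "";
  for (const byte of bytes) {
    value = (value << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      bits -= 5;
      out += BASE32[(value >>> bits) & 31];
    }
    value &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) out += BASE32[(value << (5 - bits)) & 31];
  return out;
}

// Multihash: <hash function code><digest length><digest>.
function sha256Multihash(data) {
  const digest = crypto.createHash("sha256").update(data).digest();
  return Buffer.concat([Buffer.from([0x12, 0x20]), digest]);
}

// CIDv1: <multibase prefix><version><content codec><multihash>.
function rawCidV1(data) {
  const bytes = Buffer.concat([Buffer.from([0x01, 0x55]), sha256Multihash(data)]);
  return "b" + base32(bytes);
}

console.log(rawCidV1('_:c14n0 <http://schema.org/name> "John Doe" .\n'));
```

Because the first four bytes are fixed, every CID built this way starts with bafkrei, which is the prefix visible on the CIDs in this document.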

The IPFS network then acts like a giant name resolver for CIDs, using a distributed hash table to associate CIDs with the network addresses of nodes who advertise that they’re hosting the associated file. This lets users request files from the network at large, without knowing the network address of any particular host.

URIs for files on IPFS use the dweb scheme, e.g. dweb:/ipfs/….


A typical use case for someone running an IPFS daemon in the background might look like this:

$ cat data.jsonld
{
  "@context": {
    "@vocab": ""
  },
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "knows": {
    "name": "John Doe",
    "jobTitle": "Firefighter"
  }
}
$ cat data.jsonld | jsonld normalize | ipfs add -Q --raw-leaves --cid-version 1

Then others can later refer to Jane with the canonical URI


… which gives anyone who sees it the ability to dereference bafkreih64vymm7y7tc7wm6jpopaqbqj4wujgnz6jozrz4ig2ektkccqely into a real dataset, as long as somebody in the world is pinning it to their IPFS node.

$ ipfs cat bafkreih64vymm7y7tc7wm6jpopaqbqj4wujgnz6jozrz4ig2ektkccqely
_:c14n0 <> "Firefighter" .
_:c14n0 <> "John Doe" .
_:c14n1 <> "Professor" .
_:c14n1 <> _:c14n0 .
_:c14n1 <> "Jane Doe" .

Of course, this isn’t a guarantee that it’ll resolve, but neither are HTTP URLs. IPFS at least lets content move between hosts and leverage mirrors without breaking addresses.


One drawback to using CIDs to identify canonical datasets is that just given an IPFS URI dweb:/ipfs/bafk…#c14n0, there’s no immediate way to tell if the referenced hash is an N-Quads dataset or some other kind of file. We might be using IPFS URIs to point to PDFs or CSVs, and CIDs don’t have any way of encoding MIME type.

This is even mentioned as a feature of skolemization in the RDF syntax spec:

Systems may wish to mint Skolem IRIs in such a way that they can recognize the IRIs as having been introduced solely to replace blank nodes. This allows a system to map IRIs back to blank nodes if needed.

Whether this is an actual limitation depends on usage. Would there be systems that want to distinguish these “blank node URIs” from regular URIs - maybe fetching the datasets containing referenced blank nodes but ignoring other files? Maybe!

One way around this might be to use a URI scheme other than dweb specifically for canonicalized N-Quads CIDs, such as x:/ipfs/bafk…#c14n0. This is a radical act but might be justified if there isn’t much overlap in the way these “dataset links” and regular links were treated within a system.

Another scheme for content-addressing data is described in the Cryptographic Hyperlink spec proposed last year, casually called Hashlinks.

Hashlinks are very similar to CIDs and use mostly the same building blocks (multihash and multibase, but not multicodec) to generate a self-describing content hash of arbitrary files. The major difference is that instead of using a multicodec table to map the first few prefix bytes to a known “format”, Hashlinks support a whole extra URI element where users can encode arbitrary MIME type and user-defined metadata.

The metadata is also binary-packed (with CBOR) so that both the hash itself and the metadata appear as compact base58-encoded strings.


The spec is very new and not widely used, although it has been implemented as a JavaScript library for the browser and Node.js.

Using our example canonicalized dataset:

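const hl = require("hashlink");
const jsonld = require("jsonld");

const data = {
  "@context": { "@vocab": "" },
  name: "Jane Doe",
  jobTitle: "Professor",
  knows: {
    name: "John Doe",
    jobTitle: "Firefighter"
  }
};

// Declare the media type of the hashed content in the Hashlink metadata.
const meta = { "content-type": "application/n-quads" };

// Canonicalize to N-Quads with URDNA2015, then hash the canonical serialization.
jsonld
  .normalize(data, { algorithm: "URDNA2015", format: "application/n-quads" })
  .then(data => hl.encode({ data, meta }))
  .then(uri => console.log(uri));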

… which gives us another URI for Jane:


This is pretty verbose! But it does explicitly declare a MIME type that lets us tell that it’s an RDF dataset “on sight”.


Hashlinks are in an awkward position of doing a little too much - they’re positioned as an easy way to encode both a file’s hash and its associated HTTP URLs (see the spec for an example of URLs encoded in a Hashlink’s metadata). This is definitely convenient for use cases where a centralized authority publishes a dataset, but prevents two users who acquired the same dataset from different sources from independently generating the same identifier.

Using Hashlinks in a truly decentralized way means committing to a micro-standard on top of Hashlinks: no URLs, only an explicit content-type of application/n-quads. This has the small added benefit of eliminating a CBOR decoding step: the serialized metadata element of a “dataset hashlink” URI will always be zkiKpvWP3HqVQEfLDhexQzHj4sN413x.

And then there’s dereferencing - Hashlinks are only an identifier scheme, and have no relation to IPFS or any other kind of retrieval network. And since they use a slightly different encoding scheme than CIDs, they won’t look like the addresses IPFS generates and can’t be pasted into ipfs cat.
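The difference in encoding is only skin deep, which the following sketch tries to show: the hash element of a Hashlink is a multihash (0x12 for sha2-256, 0x20 for the 32-byte length, then the digest) encoded as base58btc with the multibase prefix "z". That is the same multihash a CID carries, without the version and codec bytes. The code uses only Node's crypto module; the function names are hypothetical, and the CBOR-encoded metadata element is left out.

```javascript
const crypto = require("crypto");

const BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

// base58btc: treat the bytes as one big integer and repeatedly divide by 58.
function base58btc(bytes) {
  const hex = Buffer.from(bytes).toString("hex");
  let n = hex.length > 0 ? BigInt("0x" + hex) : 0n;
  let out = "";
  while (n > 0n) {
    out = BASE58[Number(n % 58n)] + out;
    n /= 58n;
  }
  // Each leading zero byte is represented by a leading "1".
  for (const b of bytes) {
    if (b !== 0) break;
    out = "1" + out;
  }
  return out;
}

// Multihash: <hash function code><digest length><digest>.
function sha256Multihash(data) {
  const digest = crypto.createHash("sha256").update(data).digest();
  return Buffer.concat([Buffer.from([0x12, 0x20]), digest]);
}

// The hash element of a Hashlink: multibase prefix "z" + base58btc(multihash).
function hashlinkHashElement(data) {
  return "z" + base58btc(sha256Multihash(data));
}

console.log(hashlinkHashElement('_:c14n0 <http://schema.org/name> "John Doe" .\n'));
```

Every value produced this way starts with zQm, so moving between a Hashlink and a CID for the same content is a matter of re-wrapping the same 34 bytes.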

That said, it is technically possible to use the two together. A revised usage of the ipfs CLI tool to create a Hashlink URI might look like:

$ cat data.jsonld
{
  "@context": {
    "@vocab": ""
  },
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "knows": {
    "name": "John Doe",
    "jobTitle": "Firefighter"
  }
}
$ cat data.jsonld | \
> jsonld normalize | \
> ipfs add -Q --raw-leaves | \
> ipfs cid format -f "%m" -b base58btc

… which is identical to the hash element of the previous Hashlink example. Subsequent retrieval over IPFS in a JavaScript application might look like:

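const CID = require("cids");
const multibase = require("multibase");
const jsonld = require("jsonld");
const ipfs = require("ipfs-http-client")();

// The hash element of the Hashlink is a multibase-encoded multihash.
const id = "zQmfVfBaKgFFkUtvnPjQ35khooRpbyrUK2wWq3SzvsKAavq";
const hash = multibase.decode(id);
// Wrap the multihash in a CIDv1 with the "raw" codec, matching --raw-leaves.
const cid = new CID(1, "raw", hash);

const context = { "@vocab": "" };

// Fetch the canonical N-Quads over IPFS and convert them back to compact JSON-LD.
ipfs
  .cat(cid)
  .then(file => jsonld.fromRDF(file.toString()))
  .then(doc => jsonld.compact(doc, context))
  .then(doc => console.log(doc));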

… which gives us our original data back:

{
  "@context": { "@vocab": "" },
  "@graph": [
    {
      "@id": "_:c14n0",
      jobTitle: "Firefighter",
      name: "John Doe"
    },
    {
      "@id": "_:c14n1",
      jobTitle: "Professor",
      knows: { "@id": "_:c14n0" },
      name: "Jane Doe"
    }
  ]
}

Additional uses

Our original motivation here was to find a scheme for addressing blank nodes, but the practice of content-addressing canonicalized datasets is also more broadly useful.

Referencing datasets

The most obvious application is using dataset identifiers in RDF data to refer to datasets as digital objects themselves. This could be used to make post-hoc claims about provenance, veracity, correction, or general annotation by the publisher or any third party.

Referencing graphs

Using blank node labels in fragment identifiers can be extended to index blank graph names as well - particularly convenient is the practice of using the empty fragment (e.g. dweb:/ipfs/…#) to refer to the default graph.

Referencing statements

Since URDNA2015 sorts each quad of the dataset into a normalized order, each statement can be uniquely and concisely identified by its integer line number in the serialized file with a syntax like dweb:/ipfs/bafk…#/81. A more idiomatic syntax could just use the path component of a URI directly (e.g. using an IPLD codec and a URI like dweb:/ipld/…/81) if it wouldn’t be confused with other ways to interpret paths.
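Taken together, the three fragment conventions (blank node label, empty fragment, line number) could be told apart by a small resolver like the sketch below. Whether line numbers start at 0 or 1 is not settled in the text; this sketch assumes 1-based numbering, and the function names are hypothetical.

```javascript
// Classify a fragment according to the conventions described above.
function interpretFragment(fragment) {
  if (fragment === "") return { kind: "default-graph" };
  const m = /^\/(\d+)$/.exec(fragment);
  if (m) return { kind: "statement", line: Number(m[1]) };
  return { kind: "blank-node", label: `_:${fragment}` };
}

// Look up a statement by line number in a canonical N-Quads document.
// Assumption: lines are numbered from 1.
function statementAt(canonicalNQuads, line) {
  const lines = canonicalNQuads.split("\n").filter(l => l.length > 0);
  return lines[line - 1];
}

const canonical =
  '_:c14n0 <http://schema.org/jobTitle> "Firefighter" .\n' +
  '_:c14n0 <http://schema.org/name> "John Doe" .\n';

console.log(interpretFragment("c14n0"));
console.log(interpretFragment(""));
console.log(statementAt(canonical, interpretFragment("/2").line));
```

The line-number form only works because the canonical serialization fixes the order of the quads; against an arbitrary serialization the same index would point at an arbitrary statement.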

This would mean that a dataset can’t refer to its own statements, but this could also be considered a feature: you shouldn’t be able to reference a statement until it has, in fact, been stated. Content-addressing permanently binds statement URIs to the context of a dataset that may further (and significantly!) qualify their meaning.

Questions and Challenges

This approach inherits many of the usability challenges common to all content-addressing schemes. Content-addresses only identify specific versions of resources, and don’t capture any sense of preserved identity across different versions of the same resource. It’s unclear how the “same” blank nodes could be co-identified after an edit to a dataset, or even whether that’s an appropriate kind of thing to want to do. New practices around sharing and linking versions of immutable datasets would have to emerge.

Other problems are more technical. N-Quads is one of the least space-efficient RDF serializations, which may make storing canonicalized datasets impractical. In these cases, the canonicalized dataset may not have to be stored as such, but could instead be stored in a more compact representation that is associated with its canonical hash.

A larger problem is that canonicalization itself can be an expensive operation, since it is at least as hard as graph isomorphism, a problem that is in NP but is not known to be either NP-complete or solvable in polynomial time. In practice, the cost of canonicalization depends on the number and density of the blank nodes in the dataset: the harder the blank nodes are to distinguish from each other (by e.g. a triple grounding them to a literal or URI), the harder the graph is to canonicalize. This does mean that most common RDF structures - even linked lists or trees of blank nodes - do canonicalize relatively easily. At one extreme, datasets without any blank nodes can be canonicalized by simply sorting their quads lexicographically. But there will always be pathological cases like densely connected all-blank-node graphs that take exponential time to canonically label.
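The easy extreme is simple enough to sketch directly. The hypothetical function below handles only datasets without blank nodes, and it assumes each line is already written in canonical N-Quads syntax (whitespace and escaping). JavaScript's default sort compares UTF-16 code units, which can differ from code point order for characters outside the Basic Multilingual Plane, so this is an approximation rather than a conforming implementation.

```javascript
// Canonicalize a dataset that has no blank nodes: deduplicate and sort its quads.
function canonicalizeGround(nquads) {
  const lines = nquads.split("\n").map(l => l.trim()).filter(l => l.length > 0);
  // Conservative check: refuse anything that looks like a blank node term.
  if (lines.some(l => /(^|\s)_:/.test(l))) {
    throw new Error("dataset contains blank nodes; a full algorithm such as URDNA2015 is needed");
  }
  return [...new Set(lines)].sort().join("\n") + "\n";
}

const one =
  '<http://example.com/jane> <http://schema.org/name> "Jane Doe" .\n' +
  '<http://example.com/jane> <http://schema.org/jobTitle> "Professor" .\n';
const two =
  '<http://example.com/jane> <http://schema.org/jobTitle> "Professor" .\n' +
  '<http://example.com/jane> <http://schema.org/name> "Jane Doe" .\n';

console.log(canonicalizeGround(one) === canonicalizeGround(two)); // true
```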


The scheme presented here for addressing in RDF is a specific application of a general paradigm that marries elements of the decentralized web and semantic web movements. In particular, treating the RDF Dataset as an immutable container of graph data is fertile ground for new approaches to some of the semantic web’s long-standing pain points, like reification, dereferencing, archiving, and link rot.

