Skip to main content

Open Research Questions (2019)

Published onNov 26, 2019
Open Research Questions (2019)

The Underlay is a network of semantic databases that compose a public decentralized knowledge graph.

These are some of the questions we’re working on. If these sound interesting to you, or you’re working on similar projects, or have relevant literature to suggest, get in touch! Open a discourse thread for longer discussions or comment on this document for shorter ones.

Query language

A critical piece of the Underlay is a new graph query language that is itself expressed in RDF syntax. Queries take the form of named graphs in message datasets whose (blank) graph names have an rdf:type of ul:Query in the default graph. Blank nodes in query graphs are used for simple subgraph matching, and more powerful queries are expressed by introducing rules of logical entailment that let queries match against larger (potentially infinite) databases of “virtual” edges and structures (like virtual lessThan edges between every appropriate integer term). This approach to expressive queries through logical entailment is closely patterned off of the logic programming and query languages Prolog and Datalog and their relatives.

Formal semantics

Laying out formal semantics of query graphs is important for uncovering edge cases that we might not otherwise notice. In particular, RDF datasets “share” blank nodes between all of their graphs, which makes their meaning unclear when one graph is a query. What do queries mean?

The current spec says that queries are digital resources that entail a ul:satisfies relation from every (non-query) graph that is a ground instance of the query. This feels unsatisfying because it amounts to queries meaning “look - this is a resource of type query”, without capturing any intent to evaluate or resolve. Alternatively, this might be desirable, since nodes aren’t required to attempt evaluating the queries they receive.

Quantification & negation

Blank nodes are naturally interpreted as existentially quantified variables, but queries also need a mechanism for expressing universal quantification, or negation, or both. What should this look like? What’s ergonomic for applications or tools that will produce queries? What is efficient to evaluate?

Negation could be done through RDF*-style reification within the graph, where negated edges are reified statements that get negated with an explicit property on the statement. Alternatively, all negated edges could be collected into a separate named graph that is related to the query graph with an statement in the default graph.

Specific blank nodes could be labeled as universal in the default graph, or as universal within a particular named graph.

What are the pros and cons of promoting primitive forms of both negation and universal quantification, as opposed to encouraging the emulation of one as the other?


Many common queries ask for “all of” something, which requires a mechanism for expressing enumeration of several results or sub-results.

This could be done through entailment of RDF lists, where “all of an actor’s movies” is expressed as “the list such that every element of the list is a movie with the actor”. Either Datalog-style “minimal model” semantics or an explicit “and no element of the list is not a movie with the actor” clause would also have to constrain the list.

This could alternatively be done through enumeration of entire satisfying graphs. This approach might involve something like a ul:index predicate in the default graph that relates a query to the blank nodes over which multiple graph results (if they exist) are to be uniquely indexed.

Distributed queries

Expressing queries as graphs suggests direct applications of graph analysis to intelligently split queries into overlapping subqueries that get forwarded to relevant peers, resolved, and reassembled.

This would probably require metadata in the default graph about which other peers have been given subqueries that share blank nodes, and mechanisms for peers to collaboratively join over shared variables before returning results.

Knowing which peers to delegate subqueries to requires both analysis of the semantic content of the query, and some stateful knowledge of the semantic content of each peer’s internal databases.

Constraining subgraph matching

The subgraph matching problem is famously NP-Complete in general, but may have faster algorithms if the graphs are sufficiently constrained. For example, matching two ground graphs (with no blank nodes) is equivalent to checking that the pattern graph’s edges are a subset of the target graph’s edges. Are there intermediate degrees of constraint that give intermediate bounds?

A related problem is graph simulation, where all nodes (in both the data and the pattern graphs) are known to have non-unique labels that much match (“types” as distinct from identifiers). Computing graph simulation can be done in quadratic time.

Routing schemes

The primitive operation of the Underlay network is just “sending” messages to individual peers. This is lower-level than most applications will find directly useful, so there need to be well-developed intermediate mechanisms for automatically distributing messages to relevant nodes.

Polling & subscription

Many use cases involve applications consuming streams of messages validating a shape, or following new results to an open query. This could be done statelessly (polling) or statefully (subscriptions) on the part of a “source”.

Semantic heuristics

More powerful than continuously querying a single source is expressing interest or intent to the network at large - analogous to a subscribing to a decentralized pub/sub topic where the topic is a query. This and distributed queries might be considered two sides of the same general goal of semantic routing.

This may be amount to “graph summarization”, or leverage some kind of knowledge graph embedding, or simply build on existing distributed hash tables.

Knowledge representation

No single ontology can serve everyone’s needs. This isn’t just because no ontology can be large enough to describe the entire world, but also because a domain is often defined by its context.

Every knowledge graph so far attempted has nevertheless followed the time-honored tradition of partitioning concepts into a class hierarchy, and relating them with properties from a fixed vocabulary.

This approach is doomed to fail at representing all knowledge. It’s fine for representing domain-specific knowledge, since common sets of abstractions are what structure domains in the first place. Different domains end up having such wildly different views of the same things that unifying them is impossible.

Dataset serialization

The current spec calls for every message to be addressed by the CIDv0 of the canonical n-quads serialization given by the URDNA2015 Dataset Normalization Algorithm. This is costly in space (expanding every URI to fully-qualified form with no compression) and doesn’t support serializing generalized RDF, which queries may wish to transition to in the future.

This suggests a new serialization for RDF datasets as an IPLD format that uses native (and still canonical) compression, and supports generalized RDF. JSON-LD supports blank nodes as predicates, which may be a promising start.


No comments here