Current open questions and problem statements
The Underlay is a network of semantic databases that compose a public decentralized knowledge graph.
These are the most relevant open questions we’re working on. If these sound interesting to you, or if you’re working on similar projects, or if you know of relevant literature we may have missed, or if you have ideas, get in touch! Open a discourse thread for longer discussions or comment on this document for shorter ones.
A critical piece of the Underlay is a new graph query language that is itself expressed in RDF syntax. Queries take the form of named graphs in message datasets whose (blank) graph names have an
ul:Query in the default graph. Blank nodes in query graphs are used for simple subgraph matching, and more powerful queries are expressed by introducing rules of logical entailment that let queries match against larger (potentially infinite) databases of “virtual” edges and structures (like virtual
lessThan edges between every appropriate integer term). This approach to expressive queries through logical entailment is closely patterned off of the logic programming and query languages Prolog and Datalog and their relatives.
Laying out formal semantics of query graphs is important for uncovering edge cases that we might not otherwise notice. In particular, RDF datasets “share” blank nodes between all of their graphs, which makes their meaning unclear when one graph is a query. What do queries mean?
The current spec says that queries are digital resources that entail a
ul:satisfies relation from every (non-query) graph that is a ground instance of the query. This feels unsatisfying because it amounts to queries meaning “look - this is a resource of type query”, without capturing any intent to evaluate or resolve. Alternatively, this might be desirable, since nodes aren’t required to attempt evaluating the queries they receive.
Blank nodes are naturally interpreted as existentially quantified variables, but queries also need a mechanism for expressing universal quantification, or negation, or both. What should this look like? What’s ergonomic for applications or tools that will produce queries? What is efficient to evaluate?
Negation could be done through RDF*-style reification within the graph, where negated edges are reified statements that get negated with an explicit property on the statement. Alternatively, all negated edges could be collected into a separate named graph that is related to the query graph with an statement in the default graph.
Specific blank nodes could be labeled as universal in the default graph, or as universal within a particular named graph.
What are the pros and cons of promoting primitive forms of both negation and universal quantification, as opposed to encouraging the emulation of one as the other?
Many common queries ask for “all of” something, which requires a mechanism for expressing enumeration of several results or sub-results.
This could be done through entailment of RDF lists, where “all of an actor’s movies” is expressed as “the list such that every element of the list is a movie with the actor”. Either Datalog-style “minimal model” semantics or an explicit “and no element of the list is not a movie with the actor” clause would also have to constrain the list.
This could alternatively be done through enumeration of entire satisfying graphs. This approach might involve something like a
ul:index predicate in the default graph that relates a query to the blank nodes over which multiple graph results (if they exist) are to be uniquely indexed.
Expressing queries as graphs suggests direct applications of graph analysis to intelligently split queries into overlapping subqueries that get forwarded to relevant peers, resolved, and reassembled.
This would probably require metadata in the default graph about which other peers have been given subqueries that share blank nodes, and mechanisms for peers to collaboratively join over shared variables before returning results.
Knowing which peers to delegate subqueries to requires both analysis of the semantic content of the query, and some stateful knowledge of the semantic content of each peer’s internal databases.
The subgraph matching problem is famously NP-Complete in general, but may have faster algorithms if the graphs are sufficiently constrained. For example, matching two ground graphs (with no blank nodes) is equivalent to checking that the pattern graph’s edges are a subset of the target graph’s edges. Are there intermediate degrees of constraint that give intermediate bounds?
A related problem is graph simulation, where all nodes (in both the data and the pattern graphs) are known to have non-unique labels that much match (“types” as distinct from identifiers). Computing graph simulation can be done in quadratic time.
The primitive operation of the Underlay network is just “sending” messages to individual peers. This is lower-level than most applications will find directly useful, so there need to be well-developed intermediate mechanisms for automatically distributing messages to relevant nodes.
Many use cases involve applications consuming streams of messages validating a shape, or following new results to an open query. This could be done statelessly (polling) or statefully (subscriptions) on the part of a “source”.
More powerful than continuously querying a single source is expressing interest or intent to the network at large - analogous to a subscribing to a decentralized pub/sub topic where the topic is a query. This and distributed queries might be considered two sides of the same general goal of semantic routing.
This may be amount to “graph summarization”, or leverage some kind of knowledge graph embedding, or simply build on existing distributed hash tables.
No single ontology can serve everyone’s needs. This isn’t just because no ontology can be large enough to describe the entire world, but also because needs change over time.
But every knowledge graph so far attempted has followed the same time-honored tradition of partitioning concepts into a class hierarchy, and relating them with properties from a fixed vocabulary.
This approach is doomed to fail at representing all knowledge, but it’s perfect for representing domain-specific knowledge, since common sets of abstractions are what structure domains in the first place. Different domains end up having such wildly different views of the same things that unifying them is impossible. This is true of grand concepts like entropy and for
The current spec calls for every message to be addressed by the CIDv0 of the canonical n-quads serialization given by the URDNA2015 Dataset Normalization Algorithm. This is costly in space (expanding every URI to fully-qualified form with no compression) and doesn’t support serializing generalized RDF, which queries may wish to transition to in the future.
This suggests a new serialization for RDF datasets as an IPLD format that uses native (and still canonical) compression, and supports generalized RDF. JSON-LD supports blank nodes as predicates, which may be a promising start.