Our Underlay efforts have been through many phases. One trend has been a movement from grandiose towards grounded. The vision and drive for what we’re building has been consistent, but identifying the steps to get there has forced compromises given the inherited reality.
To decide which compromises are acceptable, it helps to clarify the value proposition we are making with the initial Underlay build. Here is a preliminary proposal:
The key value proposition for the Underlay is increasing the number of people who actually use a published dataset.

The folks best suited for Underlay are those who want their data to be helpful and effective for a large number of people and applications.
This implies that we’re focused on public data: data that people want to be freely and widely used. To begin, Underlay is not trying to be the right data platform for an internal team (though it may prove useful for one); its features and decisions are based on publicly shared datasets whose producers want them used freely and widely.
There are many data repositories, GitHub repos with data files, sites with prominent ‘Download Data’ buttons, etc., yet using these datasets in a production environment is difficult and uncommon. There are a number of reasons why downloadable data isn’t heavily used:
It’s difficult to access the data in the format it’s uploaded in. Maybe it’s a CSV file, but you want an API; maybe it’s an API, but you need a single JSON file. Currently, the data producer is responsible for guessing the most helpful format for how their data will be consumed, which is expensive and difficult for the producer.
It’s difficult to grok the shape of the data. Building a mental model of the human-friendly “entities” within a dataset (e.g. the people, places, things, and relationships) is often hindered by formats optimized for computational processing rather than human understanding.
There is no connection between a dataset and how it has been used by others. There are no examples to inspire or guide you.
It’s hard to judge which datasets are useful. There is a dearth of social signals to help you curate the overwhelming number of available datasets.
It’s difficult to understand how a dataset has changed over time. This is challenging even if you are the data producer yourself! Keeping track of long-lived datasets and how they’ve evolved is hard, which can make it difficult to trust the data when you re-encounter it some time later. The larger the team working on the data, and the longer the time that has passed, the harder it becomes. Long-lived, collaborative datasets require a great deal of blind faith and trust.
There is no canonical reference to a dataset. In a Node.js project I can refer to [email protected] and get the exact copy I want, but I can rarely specify [email protected] when generating a reproducible build for an app.
It’s hard for data producers to measure data usage. This compounds the lack of usage examples. Data publishers face a dilemma: release a blob file for download and get only a download count, or spend an enormous amount of time and resources maintaining an API to get accurate usage data. Either way, it’s too difficult and expensive to know how useful the data you’ve produced actually is.
So, if we want folks to use Underlay because it’s the best way for data to be widely used, the sustainability model should align with that. While team/org accounts are an obvious thing to charge for (à la GitHub), charging for them does not necessarily align with our ability to deliver on that value proposition.
A better option would be a business model that lets people pay for hosted exports (e.g. a hosted API or hosted private database generated from a collection). This payment would then be split between the costs of hosting, payment to the data producer, and revenue for Underlay (e.g. charge a 10% overhead on hosting and split it between Underlay and the data producer). This incentivizes the data producer to make their collections widely used (they have revenue opportunities from doing so) and incentivizes Underlay to make that easy and effective. To the benefit of data consumers, it incentivizes Underlay to greatly reduce the time it takes for people to find a dataset and get it into production.
Organization/Team accounts could still exist and be a payment center to allow group billing or shared instances for hosted exports.
Some projects and organizations may not be able to take payment for their dataset. In those cases, we can offer a reduced fee (e.g. just a 3% overhead, which Underlay takes entirely as revenue).
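As a rough sketch of how such a split could work — the 10% and 3% rates and the even producer/Underlay split are just the illustrative figures above, and `split_payment` is a hypothetical helper, not an existing API:

```python
def split_payment(hosting_cost, overhead_rate=0.10, producer_share=0.5):
    """Split a consumer's payment for a hosted export.

    hosting_cost: base cost of hosting the export
    overhead_rate: markup charged on top of hosting (0.10 = the 10% example)
    producer_share: fraction of the overhead paid out to the data producer
    """
    overhead = hosting_cost * overhead_rate
    return {
        "hosting": hosting_cost,
        "producer_payout": overhead * producer_share,
        "underlay_revenue": overhead * (1 - producer_share),
        "total_charged": hosting_cost + overhead,
    }

# 10% overhead on $100 of hosting, split evenly between producer and Underlay
print(split_payment(100.00))

# Reduced 3% fee where the producer cannot take payment: Underlay keeps it all
print(split_payment(100.00, overhead_rate=0.03, producer_share=0.0))
```

The exact rates and split would be tunable per collection; the point is only that the consumer’s total tracks hosting cost plus a small overhead, and that overhead funds both the producer and Underlay.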
One challenge is that this makes the initial business model depend on payouts to data producers. Stripe supports this well, but it adds legal complexity over a simple pay-for-product/service model. At the moment, I think that’s worth it, given the strong alignment with the Underlay value proposition.