
EOY Musings: What is PubPub?

Published on Jun 07, 2022

Make knowledge.
Share it with people who care.
We’ll take care of the rest.

PubPub is where knowledge communities can focus on what they really want (having their work used) without needing to worry about fitting it into the form that academic circles require. It’s the bridge between web publishing, broadly defined, and the narrow niche of academic publishing.

As we saw during the pandemic, when everyone ditched journals for a distributed web of preprint servers and Twitter, web publishing is how researchers want to disseminate their work. You really just need:

  • A place to put your work (bioRxiv, etc.)

  • A way for others to find it (COVIDScholar, etc.)

  • A way to discuss it with others (Twitter)

  • A way to understand and communicate all of this activity back to funders and institutions as “impact.” (Altmetric)

That more or less describes the basic feature set of every CMS in existence, from horrible XSLT-driven legacy .NET behemoths that only update by generating a brand new build, content and all, every 15 minutes,[1] to Twitter to WordPress to Squarespace. You might argue that there are some specific considerations for academics that change that feature set, like citations, or LaTeX. But if we’re being completely honest with ourselves, none of that really matters. One of the most common requests we get is for a PDF viewer. Most people would be perfectly happy just to have a pretty place to park a PDF.

So, why does PubPub exist? Why not go use WordPress? Or a file server, for that matter? Mostly because the interface for digital academic publishing is a nightmare (somewhat by design, but that’s for another manifesto). Unlike in web publishing, in academic publishing it is not enough simply to produce an HTML document that can be crawled by search engines and shared to social media.

Publishing just a single paper in a way that guarantees even the most minimal acceptance into the academic record effectively requires,[2] at minimum, knowledge of the following technical interfaces, all of which are highly specialized, poorly documented, outdated and not well supported in modern web stacks, or some combination of the three:

  • JATS XML (for aggregation and archiving, even if you’re just publishing a PDF)

  • Crossref (for making work citable)

  • Dublin Core (DC) metadata tags (for SEO, Zotero, Mendeley, etc.)

  • Highwire Press metadata tags (for Google Scholar)

  • FTP (for aggregation and archiving)
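To give a sense of just the two metadata-tag items on that list, here is a minimal sketch of generating Highwire-style and Dublin Core tags from one record. The `PubMetadata` shape and the function names are hypothetical, not PubPub’s actual schema, and the tag names are the commonly used ones as we understand them; each consumer (Google Scholar, Zotero, and so on) has its own expectations that should be checked against its documentation.

```typescript
// Hypothetical minimal metadata record; field names are illustrative only.
interface PubMetadata {
  title: string;
  authors: string[];
  publicationDate: string; // ISO date, e.g. "2022-06-07"
  doi?: string;
}

// Escape characters that would break out of an HTML attribute value.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function metaTag(name: string, content: string): string {
  return `<meta name="${name}" content="${escapeAttr(content)}">`;
}

// Emit Highwire-style (citation_*) and Dublin Core (DC.*) tags for one record.
function buildMetaTags(pub: PubMetadata): string[] {
  const tags: string[] = [
    metaTag("citation_title", pub.title),
    metaTag("DC.title", pub.title),
  ];
  // One tag per author, in author order, for each vocabulary.
  for (const author of pub.authors) {
    tags.push(metaTag("citation_author", author));
    tags.push(metaTag("DC.creator", author));
  }
  // Assumption: the Highwire date is written with slashes (YYYY/MM/DD).
  tags.push(
    metaTag("citation_publication_date", pub.publicationDate.replace(/-/g, "/"))
  );
  tags.push(metaTag("DC.date", pub.publicationDate));
  if (pub.doi) {
    tags.push(metaTag("citation_doi", pub.doi));
    tags.push(metaTag("DC.identifier", `doi:${pub.doi}`));
  }
  return tags;
}
```

Even this toy version says the same thing twice in two vocabularies, which is more or less the point: the information is simple, and the interfaces are what make it painful.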

This basic picture excludes the technical complexity (mostly impedance mismatch and terrible docs/libraries) of converting a document from HTML (or, more commonly, Word or LaTeX) to JATS in the first place, which is done at most presses (including bioRxiv/medRxiv!) by hand by offshore vendors. It doesn’t take into account the relationship-building required to ensure that the content you publish is actually aggregated and archived by the companies that serve that role, all of whom have their own individual requirements and interfaces. And worst of all, it presumes that what you’re publishing can be described as a journal article as we know it today. Trying to publish a standalone dataset, or piece of code, or visualization, is infinitely harder, because no standard formats, or even best practices, exist yet, and even if they did, there’s no guarantee that your thing would fit in whatever that box looked like.

So (and this is really nothing new, just what the PubPub team has been saying for years with a particular spin on it), what the universe needs PubPub to be, at its core, is a place where you can work in the web publishing world while making that work legible to the academic publishing world. Put another way:

If you…

  1. give us your stuff (even if it’s a PDF)

  2. tell us what it is

We will…

  3. give it a nice place to live

  4. figure out the best way to stuff it in the box required for it to be acknowledged by academia

  5. tell you what other people are saying about it (and help you package that into “useful” “impact” “metrics”)

That’s obviously an oversimplification, and I think it sounds kinda boring from a product level as stated. But there’s a really, really, really cool thing that falls out of step #2, which makes steps 3, 4, and 5 much more interesting.

As ridiculous as the academic publishing system is, one thing it appears to have gotten right-ish that the web didn’t really think about is the difference between data and metadata, or as our friend Richard Severs likes to say, “content” and “content about content.” That’s largely because academic publishing has Crossref, which is objectively rough as a technology but simultaneously awesome as a resource: it’s a giant, free-to-query database in the sky that lets you understand what’s (supposed to be) living at a given URI independent of the content itself and, increasingly, what other people are saying about that URI. That all works because everyone voluntarily tells Crossref an enormous amount about their content when they publish, and Crossref uses that info to go and crawl for people talking about that content elsewhere. Crossref has therefore accidentally built, in academic publishing, a public version of the private graphs that power Google, Facebook, et al. in web publishing.
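To make “free-to-query” concrete, here is a rough sketch of looking up a DOI. It assumes Crossref’s public REST endpoint at `https://api.crossref.org/works/{doi}` and reads only a handful of response fields (`DOI`, `type`, `title`, `is-referenced-by-count`) as we understand them; the `WorkSummary` shape and function names are made up for illustration, and the network call is left out so the parsing can be checked offline.

```typescript
// Subset of a Crossref "work" response that this sketch reads.
// Assumption: field names follow the public REST API; verify against its docs.
interface CrossrefWorkResponse {
  status: string;
  message: {
    DOI: string;
    type?: string;
    title?: string[];
    "is-referenced-by-count"?: number;
  };
}

// Hypothetical summary shape for display next to a piece of content.
interface WorkSummary {
  doi: string;
  title: string;
  type: string;
  citedBy: number;
}

// Build the lookup URL; the DOI is percent-encoded because it contains "/".
function crossrefWorkUrl(doi: string): string {
  return `https://api.crossref.org/works/${encodeURIComponent(doi)}`;
}

// Reduce the response to the "content about content" we care about,
// falling back to placeholders when a field is absent.
function summarizeWork(response: CrossrefWorkResponse): WorkSummary {
  const message = response.message;
  return {
    doi: message.DOI,
    title: message.title?.[0] ?? "(untitled)",
    type: message.type ?? "unknown",
    citedBy: message["is-referenced-by-count"] ?? 0,
  };
}
```

In practice you would fetch `crossrefWorkUrl(doi)`, parse the JSON, and pass it to `summarizeWork`; none of that requires having the content itself, only its identifier.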

All this means that people have an incentive to give us a lot of information about what their content is, in highly structured ways. If we can make the information-collection process not too painful, it will allow us to do the hard work of figuring out how to make the most important of those kludgy academic interfaces understand that content, without our users having to think about it. And thanks to Crossref, it will allow us to fairly easily pull in information about what other people are saying about that content, which we can use to demonstrate social proof, and which our users can use, together with basic web metrics, to demonstrate impact to their funders and employers.

Knowing more about the content will also allow us to do some incredibly cool things with the content itself, especially if we think beyond the confines of the “document.”[3] For example, we could conditionally render the content people give us based on how they describe it to us. Oh, this thing is a presentation? Great, here’s a paginated version where each H1 is a new slide. Oh, this thing is an Underlay collection of a time-series? Great, here’s a timeline. Oh, this thing is a collection of documents, each one tagged with a birthday? Great, here’s an archive with a sortable birthDate field. And so on.[5] Even if your thing is just a PDF, if you tell us something interesting about it in the metadata, or even if we just pull in some relationships from Crossref, we can do cool things for you with that information.
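As a rough illustration of rendering by description, here is a sketch that picks a layout from a declared content kind. Every type, kind name, and layout name here is hypothetical; it only mirrors the examples in the paragraph above and says nothing about how PubPub or Underlay actually model content.

```typescript
// Hypothetical content descriptors: what the author "told us it is".
type Content =
  | { kind: "presentation"; html: string }
  | { kind: "timeSeries"; points: { date: string; label: string }[] }
  | { kind: "archive"; documents: { title: string; birthDate: string }[] }
  | { kind: "pdf"; url: string };

// Hypothetical view models a renderer could consume.
type View =
  | { layout: "slides"; slides: string[] }
  | { layout: "timeline"; entries: string[] }
  | { layout: "sortableList"; sortKey: "birthDate"; titles: string[] }
  | { layout: "pdfViewer"; url: string };

function chooseView(content: Content): View {
  switch (content.kind) {
    case "presentation":
      // Split just before each <h1> so every top-level heading starts a slide.
      return {
        layout: "slides",
        slides: content.html
          .split(/(?=<h1[\s>])/i)
          .filter((slide) => slide.trim() !== ""),
      };
    case "timeSeries":
      // ISO date strings sort chronologically when compared as text.
      return {
        layout: "timeline",
        entries: [...content.points]
          .sort((a, b) => a.date.localeCompare(b.date))
          .map((point) => `${point.date}: ${point.label}`),
      };
    case "archive":
      return {
        layout: "sortableList",
        sortKey: "birthDate",
        titles: [...content.documents]
          .sort((a, b) => a.birthDate.localeCompare(b.birthDate))
          .map((doc) => doc.title),
      };
    case "pdf":
      // Even "just a PDF" gets a sensible default view.
      return { layout: "pdfViewer", url: content.url };
  }
}
```

The design choice worth noticing is that the content never changes; only the declared `kind` decides the view, so adding a new kind means adding one case rather than a new product feature.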

Now, we’re not just bridging between academic publishing and web publishing. Or rather, because we’re bridging that gap, and using the best part of academic publishing to our advantage, we’re doing something much more interesting than bridging. We’re giving people the ability to create real hypermedia, the kind that the semantic web-ers dreamed of but never really found a broad use-case for, that’s interactive and interesting for creators and readers while being legible (enough) to machines and academia writ large.

This shift in thinking is also what opens up the potential for better community-building features. When we talk about the future of community-building features on PubPub, we usually circle around some kind of forum feature, aka Community-level discussions. And we’ll probably do something like that, because discussions are likely to be the one content object that survives the coming great galaxy pub collision. But when the serious community-builders (like Cursor) come to us with requests, they don’t actually ask us for forums. They ask for things like “wikis.” For “idea posts” that can be commented on and maybe voted on and built over time and eventually turned into releasable things, but maybe not. What Communities actually want is the ability to define their own content types for their purposes, and some basic tools to manage content lifecycles. Well, guess what? When you give up on the idea of everything being a “document,” you end up being able to offer much more flexibility for Communities to build the types of things they want, and probably even greatly reduce the need for lifecycle management, which in most cases is a vestige of peer review, and not actually what communities want. For example: “in this Collection, everything has a title field, a rich text field, and a status field. And we’d like the layout to be a kanban board by status, please, with documents ordered by the number of discussions.” Yes, that’s basically Notion. But it’s Notion with DOIs and public views of both collections and content that make sense.[6]
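The Collection request quoted above can be sketched in a few lines. The `CollectionItem` fields and the `toKanban` helper are hypothetical, and ordering by most discussions first is our assumption, since the request does not say which direction.

```typescript
// Hypothetical item shape for the Collection described in the example request.
interface CollectionItem {
  title: string;
  body: string; // rich text, kept as a plain string in this sketch
  status: string;
  discussionCount: number;
}

// Group items into one column per status, then order each column by
// discussion count, most discussed first (an assumed direction).
function toKanban(items: CollectionItem[]): Record<string, CollectionItem[]> {
  // Prototype-free object so a status like "constructor" cannot collide
  // with inherited properties.
  const board: Record<string, CollectionItem[]> = Object.create(null);
  for (const item of items) {
    const column = board[item.status] ?? [];
    column.push(item);
    board[item.status] = column;
  }
  for (const status of Object.keys(board)) {
    board[status].sort((a, b) => b.discussionCount - a.discussionCount);
  }
  return board;
}
```

Read this way, review is the same structure with a particular set of status values, which is part of why the lifecycle tooling starts to look optional.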

I know that’s a lot to think about, and a big departure from what PubPub is now. But compared to the codebase we currently manage… is it, really? What’s nice about this premise is it allows us to reduce the surface area of what we maintain to focus on content, and to start to say no to a lot of the vestiges of academic publishing output that don’t actually matter that much. Like, review. Yes, seriously. Review is a status field, a kanban board, and the ability to invite selected people to comment (maybe anonymously? maybe in a pre-generated connected pub submission?). Also, PDFs. Go home, you’re too static. We’ll spit something out with Paged.js if and only if your thing is a simple document, but otherwise, if you care that much about how your PDFs look, you should probably just make bespoke PDFs and upload them and forget about web documents. That’s totally fine by us! And so on down the list until we strip it down to the very basics that people actually need to have a leg in the academic world, which is actually less than what we do today: just enough metadata to generate JATS abstracts, Crossref deposits, ORCID records, and DC/Google Scholar tags. Which gets us right back to where we started: give us your stuff and tell us about it. So that feels pretty good.

And really, it all starts with building versions of #1 and #2, which, thankfully, are both things we’re either already building, or planning to.

#1, give us your stuff, is submissions, which we’re working on now. Interestingly, I think we already accidentally backed ourselves into the world described above by ditching the form builder concept and instead saying “which Pub fields do you need the author to fill out?” Similarly, I wonder if the way to build submissions in this universe is simply to define what metadata the objects within a given collection need, and let the admins say when they turn on submissions whether or not the user needs to fill out those fields. Or, in a more user-friendly version, adding “fields” to a “submissions form” is actually just adding metadata to content that we can render in a specific way in a submissions layout.
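Here is one way the “submissions are just required metadata” idea could look, as a sketch only. `FieldDefinition`, `requiredOnSubmission`, and `missingFields` are hypothetical names, not existing PubPub code.

```typescript
// Hypothetical metadata field attached to everything in a Collection.
interface FieldDefinition {
  name: string; // key under which the value is stored
  label: string; // what the submitter sees
  requiredOnSubmission: boolean; // set by admins when submissions are turned on
}

// Hypothetical submission payload: field name to raw text value.
type Submission = Record<string, string | undefined>;

// Return the labels of required fields the submitter left empty or blank.
function missingFields(
  schema: FieldDefinition[],
  submission: Submission
): string[] {
  return schema
    .filter((field) => field.requiredOnSubmission)
    .filter((field) => (submission[field.name] ?? "").trim() === "")
    .map((field) => field.label);
}
```

There is no separate form definition here: the “form” is whatever subset of the Collection’s metadata is marked as required, and a submissions layout only has to render those fields.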

#2, tell us what it is, is the scopes/facets idea that’s been floating around for months now. Basically, the task is to unify and genericize all the types of content that can exist on PubPub,[7] and allow them to be expanded on and customized in various ways. That’s obviously an extremely ambitious undertaking, but we don’t need to do it all at once. We can probably start with genericizing scopes, as Ian has suggested. Or maybe stuffing everything into ProseMirror, as Travis has. Or some other way we haven’t thought of yet. Either way, it’s going to be fun to experiment with and answer that question. And when we figure it out, it will lay the groundwork for making PubPub what it should be: the default place to go if you want to work in the future but need a tenuous anchor to the past so you don’t lose your job.

So, let’s figure out the best/simplest/fastest way to get from where we are now to that?
