Where Does Book Data Come From?

Any book tracking website needs a lot of data. At the time of this writing we have 1,655,833 books, 961,182 authors and 2,390,576 editions – a large database, but far from the largest.

Getting data on that many books is difficult – both functionally and legally. Each book data provider has different data available and different licensing. This is one of the first roadblocks anyone comes to when building something in the book space.

I’d love it if we were able to get all of our book data from one source. While there are services that sell book data, our bootstrapped startup couldn’t afford the high licensing fees. Because of that, we’ve built Hardcover around freely available data.

Data Sources

Here’s a breakdown of where this data comes from.

OpenLibrary Data Exports

OpenLibrary provides data dumps of their entire book database – an absolutely amazing service that we’re grateful for. When we started Hardcover, we used these data dumps to initially populate our database.

OpenLibrary API

To refresh data from OpenLibrary, we’ll occasionally hit their API if we don’t have a local cache of the data.
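For the curious, here’s roughly what that cache-or-fetch pattern looks like. This is a minimal sketch, not our production code: the in-memory Map and the 30-day freshness window are illustrative assumptions. The endpoint is OpenLibrary’s public edition-by-ISBN JSON endpoint.

```typescript
// Minimal cache-or-fetch sketch against the OpenLibrary API.
// The Map cache and 30-day window are illustrative, not our real policy.
const cache = new Map<string, { data: unknown; fetchedAt: number }>();
const MAX_AGE_MS = 1000 * 60 * 60 * 24 * 30; // hypothetical 30-day freshness window

async function fetchEditionByIsbn(isbn: string): Promise<unknown> {
  const cached = cache.get(isbn);
  if (cached && Date.now() - cached.fetchedAt < MAX_AGE_MS) {
    return cached.data; // local cache hit: no API call needed
  }
  // OpenLibrary serves edition records as JSON at this public endpoint.
  const res = await fetch(`https://openlibrary.org/isbn/${isbn}.json`);
  if (!res.ok) throw new Error(`OpenLibrary returned ${res.status}`);
  const data: unknown = await res.json();
  cache.set(isbn, { data, fetchedAt: Date.now() });
  return data;
}
```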

Reader Imports

When someone imports their library from Goodreads, The StoryGraph or using our (new) custom CSV format, we create those books, editions, authors, contributions, series, genres, moods, content warnings and tags in our system.
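To give a sense of what an import carries before it fans out into all of those records, here’s a hypothetical shape for one row of a library CSV. The column names are illustrative; this isn’t documentation of our actual format.

```typescript
// Hypothetical shape of one library-import row. Each row fans out into
// books, editions, authors, tags and so on; these column names are made up.
interface ImportRow {
  title: string;
  author: string;
  isbn13?: string;
  series?: string;
  seriesPosition?: number;
  tags?: string[]; // genres, moods, content warnings, etc.
  rating?: number;
}
```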

Goodreads

When someone imports a book from Goodreads, or when a Librarian associates an edition with a Goodreads ID, we fetch data from Goodreads.

Google Books API

We’ll fetch whatever data we can from the Google Books API. We also hit it when someone searches, to find other books with that title and add them to our database.
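Here’s a sketch of that title search, using the public Google Books volumes endpoint and its documented intitle: query syntax; what we do with each result here is illustrative.

```typescript
// Search Google Books by title and log the volume ID plus basic metadata.
// The endpoint and intitle: syntax are Google's; the handling is a sketch.
async function searchGoogleBooks(title: string): Promise<void> {
  const url = new URL("https://www.googleapis.com/books/v1/volumes");
  url.searchParams.set("q", `intitle:${title}`);
  const res = await fetch(url);
  const body = await res.json();
  for (const item of body.items ?? []) {
    // Each result carries a Google Books volume ID plus metadata we can
    // use to create (or match against) a book in our own database.
    console.log(item.id, item.volumeInfo?.title, item.volumeInfo?.authors);
  }
}
```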

Inventaire

Inventaire has some good-quality covers for books available through their API.

Abe Books

Abe Books has some good-quality covers for books available from their website.

Librarians & Reader Additions

Readers and librarians can add and edit books on Hardcover too! Whenever reader-entered data is available, we use it over data from external sites, on a field-by-field basis.

Data Caching

All of this data needs to be saved somewhere. We cache data from all of these services in line with their data-retention policies.

Data Parsing

This could get a little technical. In our database, we have a structure that looks like this:

  • A Book has many Editions
  • An Edition has many Book Mappings
  • A Book Mapping has an External Data Cache

Think of a Book as a placeholder that represents all editions that have ever been, or will ever be, written for a book. It could have many audio, physical and ebook editions, but it’ll still be the same Book.

An Edition is a specific release of a Book. An Edition has everything you’d expect – the format, release date, authors, publisher, etc.

A Book Mapping is a link to an external data provider (Google Books, Goodreads, etc). This is how we save the “Google Books ID” for a specific Edition.

An External Data Cache is a very basic cache of the data returned from the external system. That way we don’t need to hit those sources every time we change something on our side.
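Sketched as TypeScript interfaces, that hierarchy might look like this. Only the relationships come from the description above; the individual field names are illustrative, not our actual schema.

```typescript
// A sketch of the Book → Edition → Book Mapping → External Data Cache
// hierarchy. Field names beyond the relationships are illustrative.
interface Book {
  id: number;
  editions: Edition[]; // a Book has many Editions
}

interface Edition {
  id: number;
  bookId: number;
  bookMappings: BookMapping[]; // an Edition has many Book Mappings
}

interface BookMapping {
  id: number;
  editionId: number;
  provider: "goodreads" | "google_books" | "openlibrary" | "inventaire";
  externalId: string; // e.g. the Google Books ID for this Edition
  externalDataCache: ExternalDataCache; // a Book Mapping has an External Data Cache
}

interface ExternalDataCache {
  rawResponse: unknown; // the data as returned by the external system
  fetchedAt: Date;
}
```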

At the Book Mapping, Edition and Book level, we save something we call a Data Transfer Object (DTO). The DTOs at these levels all share the same structure.

You can think of the DTO as a JSON object that represents an Edition. Currently this object has about 35 fields – everything from title and author, to page count and audiobook length, to ISBNs and other identifiers.
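As a trimmed-down sketch, that object might look like the following. The real DTO has about 35 fields; these names are illustrative stand-ins for the ones mentioned above.

```typescript
// A trimmed-down sketch of the Edition DTO. The real object has ~35 fields;
// these names are stand-ins, not our actual field names.
interface EditionDto {
  title?: string;
  contributors?: { name: string; role: string }[];
  pageCount?: number;
  audioLengthSeconds?: number;
  releaseDate?: string;
  cover?: string;
  description?: string;
  isbn10?: string;
  isbn13?: string;
  // ...roughly 25 more fields along the same lines
}
```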

We cache this DTO object at the Book Mapping, Edition and Book level.

At the Book Mapping level this is easy.

At the Edition level, we need to combine multiple Book Mapping DTOs and generate a new DTO. We do that by prioritizing which data we use from which place. For example:

  • For title, we use Goodreads, then OpenLibrary, then Google Books.
  • For cover, we use Inventaire, then Abe Books, then Google Books.
  • For description, we use OpenLibrary, then Google Books.

This system allows us to comply with each external service’s data restrictions. For example, we wouldn’t want to use Goodreads descriptions.
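Here’s roughly what that per-field priority merge might look like. The priority table mirrors the examples above; the flat-JSON DTOs and the Map input are illustrative assumptions, not our exact pipeline.

```typescript
// Per-field priority merge: for each field, walk an ordered list of
// providers and take the first value present. Priorities mirror the post.
type Provider = "goodreads" | "openlibrary" | "google_books" | "inventaire" | "abebooks";

const FIELD_PRIORITY: Record<string, Provider[]> = {
  title: ["goodreads", "openlibrary", "google_books"],
  cover: ["inventaire", "abebooks", "google_books"],
  description: ["openlibrary", "google_books"], // no Goodreads descriptions
};

function mergeMappingDtos(dtos: Map<Provider, Record<string, unknown>>): Record<string, unknown> {
  const merged: Record<string, unknown> = {};
  for (const [field, providers] of Object.entries(FIELD_PRIORITY)) {
    for (const provider of providers) {
      const value = dtos.get(provider)?.[field];
      if (value !== undefined) {
        merged[field] = value; // first provider with data wins
        break;
      }
    }
  }
  return merged;
}
```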

We do that again at the Book level, combining multiple Editions’ DTOs and prioritizing them by language (we favor English for Book data), by which Editions are most read, and by which ones we have the most data for.
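A sketch of that Edition ranking: English first, then most read, then most complete. The tie-breaking order comes from the paragraph above; the scoring details and field names are assumptions.

```typescript
// Rank Edition DTOs before the Book-level merge. The ordering criteria
// (English, most read, most data) are from the post; scoring is a sketch.
interface RankedEdition {
  dto: Record<string, unknown>;
  language: string;
  readerCount: number; // how many readers have this edition
}

function rankEditions(editions: RankedEdition[]): RankedEdition[] {
  const filledFields = (e: RankedEdition) =>
    Object.values(e.dto).filter((v) => v !== undefined && v !== null).length;
  return [...editions].sort((a, b) => {
    const aEnglish = a.language === "en" ? 1 : 0;
    const bEnglish = b.language === "en" ? 1 : 0;
    if (aEnglish !== bEnglish) return bEnglish - aEnglish; // favor English
    if (a.readerCount !== b.readerCount) return b.readerCount - a.readerCount; // most read
    return filledFields(b) - filledFields(a); // most data
  });
}
```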

At this point we have a field in Edition and Book called dto_external. It’s everything we know about this book from external sources.

Overwriting Book & Edition Data

At the Edition and Book level, readers and librarians can also overwrite any field. We save that snapshot of user-entered data into a dto field on Edition and Book.

The last step is combining the dto and dto_external fields into a single dto_combined field, then writing each of its values out to the corresponding database columns.
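In code, that combine step can be as simple as this sketch, assuming dto holds only the fields a reader or librarian actually changed; writing the result back to the individual database columns happens separately.

```typescript
// Combine user-entered data with merged external data. The later spread
// wins, so any field a user set overrides the external value.
function combineDtos(
  dto: Record<string, unknown>,         // user-entered overrides only
  dtoExternal: Record<string, unknown>, // merged external data
): Record<string, unknown> {
  return { ...dtoExternal, ...dto };
}
```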

Verified Books

Only books that we consider Verified show up in search. For a book to be verified, we either need to be able to fetch data about it from an external source, or a librarian needs to verify it.

Can I Use Hardcover Book Data in My Project?

That’s a question we’re currently figuring out with our legal team (which is just one lawyer we pay by the hour whenever we have a question 😅). While I’m confident that we’re complying with all legalities, it’s unclear how this would work if we share data from Google Books, for example.

If you’re building a personal project, or using this data for personal purposes, I wouldn’t worry. If you’re planning to build a business, or share this data in any way, I’d suggest you hold off on using this data until we have a formal API data policy in place.
