Any book tracking website needs a lot of data. At the time of this writing we have 1,655,833 books, 961,182 authors and 2,390,576 editions – a large database, but far from the largest.
Getting data on that many books is difficult – both technically and legally. Each book data provider has different data available and different licensing. This is one of the first roadblocks anyone runs into when building something in the book space.
I’d love it if we were able to get all of our book data from one source. While there are services that sell book data, our bootstrapped startup couldn’t afford the high licensing fees. Because of that, we’ve built Hardcover around freely available data.
Here’s a breakdown of where this data comes from.
OpenLibrary provides data dumps of their entire book database – an absolutely amazing service which we’re grateful for. When we first started Hardcover, we used these dumps to populate our database.
To refresh data from OpenLibrary, we’ll occasionally hit their API if we don’t have a local cache of the data.
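As a rough sketch, refreshing a single edition from OpenLibrary’s public ISBN endpoint looks something like this (the endpoint is OpenLibrary’s documented API; the field selection is simplified and not our exact code):

```typescript
// Minimal sketch: refresh one edition from OpenLibrary's public ISBN endpoint.
// The fields kept below are illustrative – real responses have many more keys.
async function refreshFromOpenLibrary(isbn: string) {
  const response = await fetch(`https://openlibrary.org/isbn/${isbn}.json`);
  if (!response.ok) return null;

  const edition = await response.json();
  return {
    title: edition.title,
    pageCount: edition.number_of_pages,
    publishDate: edition.publish_date,
    publishers: edition.publishers,
  };
}
```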
When someone imports their library from Goodreads, The StoryGraph or using our (new) custom CSV format, we create those books, editions, authors, contributions, series, genres, moods, content warnings and tags in our system.
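As a sketch of what an import does, here’s roughly how one row of a Goodreads-style CSV might map onto our records (the column names mirror a Goodreads export; the record shapes are simplified stand-ins for our real tables):

```typescript
// Hypothetical sketch: map one row of an imported CSV onto our records.
interface ImportRow {
  Title: string;
  Author: string;
  ISBN13: string;
  Bookshelves: string; // comma-separated shelves in a Goodreads export
}

function mapImportRow(row: ImportRow) {
  return {
    book: { title: row.Title },
    edition: { isbn13: row.ISBN13 },
    author: { name: row.Author },
    // Shelves become tags/genres/moods on our side.
    tags: row.Bookshelves ? row.Bookshelves.split(",").map((s) => s.trim()) : [],
  };
}
```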
When someone imports a book from Goodreads, or when a Librarian associates an edition with a Goodreads ID, we fetch data from Goodreads.
We’ll fetch whatever data we can from the Google Books API. We also hit the Google Books API when someone searches, in order to find other books with that title and add them to our database.
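The title search is roughly this (the endpoint and intitle: query syntax are Google’s public Books API; the result shape we keep is trimmed down):

```typescript
// Sketch: search the Google Books API by title and keep a minimal result shape.
async function searchGoogleBooks(title: string) {
  const url =
    "https://www.googleapis.com/books/v1/volumes?q=" +
    encodeURIComponent(`intitle:${title}`);
  const response = await fetch(url);
  if (!response.ok) return [];

  const data = await response.json();
  return (data.items ?? []).map((item: any) => ({
    googleBooksId: item.id, // later stored on a Book Mapping
    title: item.volumeInfo?.title,
  }));
}
```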
Inventaire has some good quality covers for books available from their API.
Abe Books has some good quality covers for books available from their website.
Readers and librarians can add and edit books on Hardcover too! Whenever reader-entered data is available, we use it over data from external sites, on a field-by-field basis.
All of this data needs to be saved somewhere. We cache data from all of these services in line with their data-retention policies.
This could get a little technical. In our database, we have a structure that looks like this:
Think of a Book as a placeholder that represents every edition that has ever been, or will ever be, created for that work. It could have many audio, physical and ebook editions, but it’ll still be the same Book.
An Edition is a specific release of a Book. An Edition has everything you’d expect – the format, release date, authors, publisher, etc.
A Book Mapping is a link to an external data provider (Google Books, Goodreads, etc). This is how we save the “Google Books ID” for a specific Edition.
An External Data Cache is a very basic cache of the data returned from the external system. That way we don’t need to hit those sources every time we change something on our side.
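Sketched out as simplified types (real tables have many more columns; these names are illustrative, not our actual schema), that structure looks something like this:

```typescript
// Simplified sketch of the structure described above.
interface Book {
  id: number;
  title: string;
}

interface Edition {
  id: number;
  bookId: number;
  format: "physical" | "audio" | "ebook";
  releaseDate?: string;
  publisher?: string;
}

interface BookMapping {
  editionId: number;
  source: "goodreads" | "google_books" | "openlibrary" | "inventaire";
  externalId: string; // e.g. the Google Books ID for this Edition
}

interface ExternalDataCache {
  bookMappingId: number;
  rawResponse: unknown; // the data as returned by the external system
  cachedAt: string;
}
```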
At the Book Mapping, Edition and Book level, we save something we call a Data Transfer Object (DTO). The DTOs at these levels all share the same structure.
You can think of the DTO as a JSON object that represents an Edition. Currently this object has about 35 fields – everything from title and author, to page count and audiobook length, to ISBNs and other identifiers.
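A cut-down version of that DTO might look like this (the real object has around 35 fields; the names here are illustrative):

```typescript
// Partial sketch of the Edition DTO described above.
interface EditionDto {
  title?: string;
  description?: string;
  coverUrl?: string;
  authors?: string[];
  pageCount?: number;
  audiobookSeconds?: number;
  releaseDate?: string;
  isbn10?: string;
  isbn13?: string;
  // ...plus publisher, language, series info and other identifiers.
}
```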
We cache this DTO object at the Book Mapping, Edition and Book level.
At the Book Mapping level this is easy – it’s just the data from that one external source.
At the Edition level, we need to combine multiple Book Mapping DTOs and generate a new DTO. We do that by prioritizing which data we use from which place. For example:
For title, we use Goodreads, then OpenLibrary, then Google Books.
For cover, we use Inventaire, then Abe Books, then Google Books.
For description, we use OpenLibrary, then Google Books.
This system allows us to comply with each external service’s data restrictions. For example, we wouldn’t want to use Goodreads descriptions.
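Here’s a minimal sketch of that per-field merge, based on the priority lists above (the source names and helper shape are simplified assumptions, not our exact code):

```typescript
type Dto = Record<string, unknown>;
type Source = "goodreads" | "openlibrary" | "google_books" | "inventaire" | "abe_books";

// Field-by-field source priority, mirroring the examples above.
const FIELD_PRIORITY: Record<string, Source[]> = {
  title: ["goodreads", "openlibrary", "google_books"],
  coverUrl: ["inventaire", "abe_books", "google_books"],
  description: ["openlibrary", "google_books"],
};

function mergeEditionDto(dtosBySource: Partial<Record<Source, Dto>>): Dto {
  const merged: Dto = {};
  for (const [field, sources] of Object.entries(FIELD_PRIORITY)) {
    for (const source of sources) {
      const value = dtosBySource[source]?.[field];
      if (value != null) {
        merged[field] = value;
        break; // the first source in the list that has the field wins
      }
    }
  }
  return merged;
}
```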
We do the same thing at the Book level, combining multiple Edition DTOs and prioritizing them by language (we favor English for Book data), by which Editions are most read, and by which ones we have the most data for.
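Ordering the Editions before that Book-level merge might look something like this (the scoring below is illustrative, not our exact weighting):

```typescript
// Sketch: order Edition DTOs for the Book-level merge –
// English first, then most read, then most complete.
interface EditionForMerge {
  dto: Record<string, unknown>;
  language?: string;
  readerCount: number;
}

function orderEditionsForBookMerge(editions: EditionForMerge[]) {
  const completeness = (e: EditionForMerge) =>
    Object.values(e.dto).filter((v) => v != null).length;

  return [...editions].sort((a, b) => {
    const aEnglish = a.language === "en" ? 1 : 0;
    const bEnglish = b.language === "en" ? 1 : 0;
    if (aEnglish !== bEnglish) return bEnglish - aEnglish; // favor English
    if (a.readerCount !== b.readerCount) return b.readerCount - a.readerCount; // most read
    return completeness(b) - completeness(a); // most data
  });
}
```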
At this point we have a field on Edition and Book called dto_external. It’s everything we know about this book from external sources.

At the Edition and Book level, readers and librarians can also overwrite any field. We save that snapshot of user-entered data into a dto field on Edition and Book.

The last step is combining the dto and dto_external fields into a single dto_combined field, and then setting the other database columns from it.
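That final combine step is essentially a per-field override, with reader and librarian edits winning over external data (a sketch, not our exact code):

```typescript
type Dto = Record<string, unknown>;

// Sketch: user/librarian edits (dto) win over external data (dto_external),
// field by field; the result is what gets stored as dto_combined.
function combineDtos(dto: Dto, dtoExternal: Dto): Dto {
  const combined: Dto = { ...dtoExternal };
  for (const [field, value] of Object.entries(dto)) {
    if (value != null) {
      combined[field] = value;
    }
  }
  return combined;
}
```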
Only books that we consider Verified show up in search. In order for a book to be Verified, we need to be able to fetch data about it from an external source, or it needs to be verified by a librarian.
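As a tiny sketch of that rule (the flag names are hypothetical stand-ins for what we track internally):

```typescript
// Hypothetical flags standing in for what we track internally.
interface BookStatus {
  hasExternalData: boolean; // we could fetch data about it from an external source
  verifiedByLibrarian: boolean;
}

function isVerified(book: BookStatus): boolean {
  return book.hasExternalData || book.verifiedByLibrarian;
}
```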
Whether we can share this data through our own API is a question we’re currently figuring out with our legal team (which is just one lawyer we pay by the hour whenever we have a question 😅). While I’m confident that we’re complying with all legal requirements, it’s unclear how this would work if we were to share data from Google Books, for example.
If you’re building a personal project, or using this data for personal means, I wouldn’t be worried. If you’re planning to build a business, or share this data in any way, I’d suggest you hold off on using this data until we have a formal API data policy in place.