I recently started exploring the world of linked book metadata. The Internet Archive (IA) digitizes and stores millions of books, and I am trying to find more metadata for those books as well as discover ways to link the books together.
Open Library (an IA project) uses custom-built technology to do some of this work, but those tools were built before much of the data discussed here existed (at least in the open). Given the number of resources available, it seemed like we should be able to simplify this job somewhat.
I had a few goals in mind:
- Don’t duplicate digitization of books
- Don’t duplicate physical storage of books
- Provide identifiers for books that allow others to interact with us
- Find related metadata we can point to, include in items, and/or crawl
- Connect editions and authors
There are several million digital books on archive.org. These have varying levels of availability – some are freely downloadable, others must be borrowed, and some are only available to the print disabled. I ignored availability for the purposes of this research, since we want to achieve the goals above regardless of how many people can access the books.
A little over 2 million of the books on archive.org already have an OCLC control number (OCN) and/or an ISBN in their metadata. OCLC’s xISBN service can be used to try to resolve an ISBN to an OCN. (The xISBN service accepts both OCNs and ISBNs.)
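Because the metadata may hold either ISBN-10 or ISBN-13 values, it probably helps to normalize them to one form before any lookup. Here is a minimal sketch of that step, assuming we only need to validate check digits and convert ISBN-10 to ISBN-13. The function names are my own and are not part of any IA or OCLC tool.

```python
import re


def clean_isbn(raw):
    """Drop hyphens and whitespace, and uppercase a trailing 'x'."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit (weights 10 down to 1, modulo 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", isbn):
        return False
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(isbn)
    )
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit (alternating weights 1 and 3, modulo 10)."""
    if not re.fullmatch(r"\d{13}", isbn):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(isbn))
    return total % 10 == 0


def to_isbn13(raw):
    """Return a 13-digit ISBN string, or None if the input is not a valid ISBN."""
    isbn = clean_isbn(raw)
    if is_valid_isbn13(isbn):
        return isbn
    if is_valid_isbn10(isbn):
        core = "978" + isbn[:9]  # drop the old check digit, add the 978 prefix
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None
```

Anything that fails both checks comes back as `None`, so bad values in the metadata can be set aside rather than sent to a lookup service.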
OCLC Linked Data
With the OCLC control number, we can try to retrieve the OCLC linked data, which has a lot of valuable information: subjects, alternate titles, audience level, genre, type of work (e.g. books vs. journals), and other editions. Some of this repeats the data that comes back from xISBN, while other data points appear to be unique. The key, though, is that it also leads us to a lot of other linked data and unique identifiers that we can mine for more information and/or use to improve our metadata about works and authors.
Work/Edition Identifier Examples
Internet Archive Book: https://archive.org/details/littlewomenchild00loui
WorldCat Edition: http://www.worldcat.org/title/little-women-or-meg-jo-beth-and-amy/oclc/40144923
OCLC Work Entity: http://experiment.worldcat.org/entity/work/data/6130
VIAF Work: http://viaf.org/viaf/174058143/
Wikidata Work: https://www.wikidata.org/wiki/Q523076
Wikipedia Work: https://en.wikipedia.org/wiki/Little_Women
Project Gutenberg Book: https://www.gutenberg.org/ebooks/514
Author Identifier Examples
OCLC Person Identity: http://experiment.worldcat.org/entity/person/data/2635657228
WorldCat Identities: https://www.worldcat.org/identities/lccn-n79-117152/
Library of Congress (LC) Name Authority File: http://id.loc.gov/authorities/names/n79117152.html
VIAF Person: http://viaf.org/viaf/29528997/
ISNI Person: http://isni.org/isni/0000000121257727
Wikidata Person: https://www.wikidata.org/wiki/Q185696
Wikimedia Commons Person: https://commons.wikimedia.org/wiki/Category:Louisa_May_Alcott
Wikipedia Person: https://en.wikipedia.org/wiki/Louisa_May_Alcott
DBpedia Person: http://dbpedia.org/page/Louisa_May_Alcott
Project Gutenberg Author: https://www.gutenberg.org/ebooks/author/102
Open Library Author: https://openlibrary.org/authors/OL26680A/Louisa_May_Alcott
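Some of these identifiers embed one another. In the example above, the WorldCat Identities URL carries a hyphenated LCCN (n79-117152), and the LC Name Authority URL uses the same number without the hyphen (n79117152). The sketch below derives one from the other. It assumes the `lccn-` slug pattern seen in this one example holds more widely, and it applies the usual LCCN normalization (remove the hyphen, zero-pad the serial to six digits). The function names are hypothetical.

```python
import re
from urllib.parse import urlparse


def lccn_from_identities_url(url):
    """Return a normalized LCCN from a WorldCat Identities URL, or None.

    Only slugs shaped like 'lccn-n79-117152' are handled; anything else
    is reported as None rather than guessed at.
    """
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = re.fullmatch(r"lccn-([a-z]+\d+)-(\d{1,6})", slug)
    if match is None:
        return None
    prefix, serial = match.groups()
    return prefix + serial.zfill(6)


def lc_name_authority_url(lccn):
    """Build an LC Name Authority File URL in the form used in the list above."""
    return f"http://id.loc.gov/authorities/names/{lccn}.html"


lccn = lccn_from_identities_url(
    "https://www.worldcat.org/identities/lccn-n79-117152/"
)
print(lccn, lc_name_authority_url(lccn))
```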
A Note About Works
The OCLC work entity does not appear to lead to any non-OCLC work entity IDs. Making a connection to other organizations’ work entity IDs will require more effort and be less reliable.
The Wikidata work can be found using the Wikidata person entity, but that path is not definitive. If the author has multiple works with similar names, we would not necessarily have high confidence in choosing the right one. However, in *some* cases the Wikidata work contains an OCN, so we could use it to confirm whether we chose the correct work (and/or start with Wikidata and match any existing OCN to an IA book if it exists).
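The confirmation step could look roughly like this. It is only a sketch: the candidate records are plain dictionaries with made-up keys (`id`, `label`, `ocns`) and placeholder IDs, not real Wikidata output, and the title comparison is a crude ASCII-only normalization.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.casefold()).strip()


def pick_work(candidates, title, known_ocns):
    """Choose a candidate work and say how much we trust the choice.

    candidates: list of dicts with hypothetical keys 'id', 'label', 'ocns'.
    known_ocns: OCNs we already hold for the IA book.
    """
    known = set(known_ocns)
    wanted = normalize_title(title)
    by_title = [c for c in candidates if normalize_title(c["label"]) == wanted]
    by_ocn = [c for c in candidates if known & set(c.get("ocns", ()))]
    if len(by_ocn) == 1:
        return by_ocn[0]["id"], "confirmed by OCN"
    if len(by_title) == 1 and not by_ocn:
        return by_title[0]["id"], "title only"
    return None, "ambiguous"


# Placeholder candidates; the IDs and the OCN assignment are invented.
candidates = [
    {"id": "work-A", "label": "Little Women", "ocns": ["40144923"]},
    {"id": "work-B", "label": "Little Women", "ocns": []},
    {"id": "work-C", "label": "Little Men"},
]
print(pick_work(candidates, "Little Women", ["40144923"]))
```

Two candidates share the title here, so without the OCN the function reports the choice as ambiguous instead of guessing.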
The Wikipedia work article can be found via the Wikidata work or Wikipedia person article, but again the paths are not necessarily clear. If we can confirm that we have the right Wikidata work, then there appears to be a direct link to the Wikipedia work article.
VIAF works suffer from the same problem: there is not necessarily a clear path from the VIAF person to the VIAF work. However, the VIAF work page does link to WorldCat URLs, so we could download VIAF work records and work backwards to the OCNs we have for a more “confident” link.
Example:
- IA book: https://archive.org/details/epaminondashisau00bryarich
- Leads to WorldCat record: http://www.worldcat.org/title/epaminondas-and-his-auntie/oclc/751974
- Leads to OCLC work entity: http://experiment.worldcat.org/entity/work/data/1601406
- The VIAF work entity at http://viaf.org/viaf/309289167/ links to WorldCat URLs:
  - http://worldcat.org/oclc/558691094 – included in the OCLC workExample IDs
  - http://worldcat.org/oclc/872301877 – NOT included in the OCLC workExample IDs
  - http://worldcat.org/oclc/852017353 – included in the OCLC workExample IDs
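The comparison in this example boils down to pulling the OCN out of each WorldCat URL and checking it against the OCLC work's workExample IDs. Here is a small sketch of that check. The `work_example_ocns` set is a partial stand-in built only from the OCNs named above, not the full list from the OCLC record, and the function names are my own.

```python
import re

# Matches both worldcat.org/oclc/<n> and worldcat.org/title/<slug>/oclc/<n>.
OCN_IN_URL = re.compile(r"worldcat\.org/(?:.*/)?oclc/(\d+)")


def ocn_from_url(url):
    """Extract the OCLC control number from a WorldCat URL, or return None."""
    match = OCN_IN_URL.search(url)
    return match.group(1) if match else None


def split_by_membership(viaf_urls, work_example_ocns):
    """Split VIAF-linked OCNs into those the OCLC work lists and those it does not."""
    linked = [ocn for ocn in map(ocn_from_url, viaf_urls) if ocn]
    shared = [ocn for ocn in linked if ocn in work_example_ocns]
    missing = [ocn for ocn in linked if ocn not in work_example_ocns]
    return shared, missing


viaf_urls = [
    "http://worldcat.org/oclc/558691094",
    "http://worldcat.org/oclc/872301877",
    "http://worldcat.org/oclc/852017353",
]
# Stand-in subset: only the two OCNs the example says are included.
work_example_ocns = {"558691094", "852017353"}

shared, missing = split_by_membership(viaf_urls, work_example_ocns)
print("shared:", shared)
print("missing:", missing)
```

Any overlap at all gives us some reason to trust the VIAF work link, while the OCNs in `missing` are the ones that would need a closer look.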
In Summary
Since the last time I investigated this topic (admittedly that was probably in 2008 or so), there have been major developments in the type and variety of data openly available on the net. It’s still not necessarily a straightforward problem to solve at scale, though. In particular, we may have trouble de-duplicating editions of the same book using these methods. But this is great progress, and linked data could be used to solve at least some of the problems Internet Archive faces in digitizing a large corpus of books for the public.
More information about projects mentioned in this post:
OCLC
The nonprofit OCLC and its member libraries run WorldCat, a huge catalog of library holdings.
VIAF
Virtual International Authority File (VIAF) is an OCLC service that pulls together name authority files from OCLC, LC, and a lot of national libraries around the world to create “super” authority records. These records are intended to allow people to re-purpose bibliographic data produced by libraries serving different language communities. There are data dumps available at http://viaf.org/viaf/data/.
ISNI
ISNI (International Standard Name Identifier) is an ISO-certified global standard number for identifying contributors to creative works. This is another unique identifier we can use to refer to authors and other creators. The ISNI database holds about 9 million identities. The ISNI lookup tool is “powered by OCLC,” though ISNI appears to be a separate entity. I’m not sure what the relationship is.
Wikimedia Projects
Wikidata is a knowledge base created by public editors that contains structured data.
Wikimedia Commons is a repository of freely available media.
Wikipedia is a publicly editable Internet encyclopedia (and probably needs no introduction).
DBpedia tries to extract structured content from Wikipedia.
Project Gutenberg
Project Gutenberg is a volunteer project that provides full text ebooks of primarily public domain works.
Open Library
Open Library is a project of the Internet Archive that is intended to contain one web page for every book ever published.
Internet Archive
Internet Archive is a digital library dedicated to amassing all the published works of humankind and making them widely accessible.