First published on Internet Archive blog
About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one-by-one for a large and ever-growing corpus of works.
So we set out to automate it. Amazon has an API with book information, so I figured with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* metadata records, send it to Amazon, and presto!
I was wrong. It was hard.
Library Catalog Names are different from Book Seller’s Names
Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as
Purucker, G. de (Gottfried), 1874-1942
But when you look on Amazon, that same author appears as
G. de Purucker
If we search the full author from the MARC on Amazon (including full name and birth and death dates), we may miss potential matches. And this is just one simple example. We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve. Here are the current rules just for transforming this one field:
General rules for transforming MARC author to Amazon author:
- Maintain all accented or non-Roman characters as-is
- If there are no commas, semicolons or parentheses in the string, use the whole string as-is
- If there are no commas in the string, but there are semicolon and/or parentheses, use anything before semicolon or parentheses as the entire author string
- If there are commas in the string:
- Everything before the first comma should be used as the author’s last name
- Everything after the first comma but BEFORE any of these should be used as the author’s first name:
- comma [ , ],
- semicolon [ ; ],
- open parentheses [ ( ]
- any number [0-9]
- end of string
- Remaining information should be discarded
- Period [ . ] and apostrophe [ ‘ ] and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.
An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…
Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.
For example, here’s the title of a real page-turner:
American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits
But on Amazon that title is:
American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)
As you can image, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.
Now that the librarians have had a laugh, let’s explain that for everybody else! Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that. It’s on this shelf in this area and the collection contains these years.
Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.
And that means that the publication dates pulled from the MARC records are sometimes very wrong.
For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974… The date provided in the MARC file for all of these is 1940.
As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.
Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information. For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.
Not All Metadata is Good Metadata
Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence. Not according to the MARC record, though! Fat finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version. But the MARC record tells us it’s from 1924.
Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.
Commercial APIs Are Not Built to Solve Library Problems
Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the other data the API contains about availability, formats and pricing is less accurate. Because the Section 108(h) exemption for libraries (read more here) involves knowing whether copies are being sold at reasonable prices, we need to know about these aspects of the book to determine whether they qualify. But Amazon’s API is incomplete in this area. So we found ourselves needing to use the API to find a match for the title and author, and then go to the page and scrape it to actually get accurate availability and pricing information.
This increases the complexity of the programming required to use Amazon as a source for information, and greatly lengthened the process of building tools for this purpose.
We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing. If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow.
Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made.
Who Wants In!?
Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.
*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog.