A Love Letter to the People Who Build the Internet Archive

When you visit a public library, you get to meet the librarians and others who build and care for those collections. You know there are people who empty the garbage cans, who put back the borrowed books, who maintain the computers, and who determine what ends up on the shelf.

A digital library, on the other hand, is “just” a web site. You don’t really see the people who build it – we are often anonymous. But the Internet Archive wasn’t built by computers and algorithms.

From its inception, the Internet Archive has been built by thousands of people who understand that we have an opportunity to use the Internet to give everyone access to knowledge. Every person on the planet should have the opportunity to learn and to make a contribution.

This goal – Universal Access to All Knowledge – inspires the people who have built the Internet Archive over the past 23 years.

People clean and repair the buildings that we occupy. People do payroll, choose our health plans, answer the phones, plan our events, reply to user emails, clean up spam, and pay our bills. People design and build the computers that hold the collections. People construct the network that carries data to every corner of the world. People write software that processes, backs up, and delivers files. People design and test and build interfaces. People digitize analog media and type in metadata. People curate collections, establish collaborations, and manage projects.

There’s no way I can mention all of these people by name. Even if I listed every employee from the past 23 years, I would still be missing the volunteers, the people from other organizations who worked on joint projects with us, the pro bono lawyers, the delightfully compulsive collectors, the funding organizations, the idea generators, our sounding boards for crazy ideas, the individuals who have donated money or materials, and the hundreds of thousands of people who have uploaded media into the archive.

Libraries are built by people, for people. Thank you so much to all of the people who have contributed to building the Internet Archive, whether as employees or as part of our huge group of friends and family. We would not be here without you, and we hope you will continue to help bring universal access to all knowledge in the future.

Happy Valentine’s Day!

(First published on the Internet Archive blog)

Making Out-of-Print Pre-1942 books available with “Last 20” provision

(First published on the Internet Archive blog)

About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one-by-one for a large and ever-growing corpus of works.

So we set out to automate it. Amazon has an API with book information, so I figured that with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* records, send it to Amazon, and presto!

I was wrong. It was hard.

Library Catalog Names Are Different from Booksellers’ Names

Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as

Purucker, G. de (Gottfried), 1874-1942

But when you look on Amazon, that same author appears as

G. de Purucker

If we search Amazon for the full author string from the MARC record (including full name and birth and death dates), we may miss potential matches. And this is just one simple example. We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve. Here are the current rules just for transforming this one field (a code sketch follows the list):

General rules for transforming MARC author to Amazon author:

  • Maintain all accented or non-Roman characters as-is
  • If there are no commas, semicolons or parentheses in the string, use the whole string as-is
  • If there are no commas in the string, but there are semicolons and/or parentheses, use everything before the first semicolon or open parenthesis as the entire author string
  • If there are commas in the string:
    • Everything before the first comma should be used as the author’s last name
    • Everything after the first comma but BEFORE any of these should be used as the author’s first name:
      • a comma [ , ]
      • a semicolon [ ; ]
      • an open parenthesis [ ( ]
      • any digit [0-9]
      • the end of the string
    • Remaining information should be discarded
  • Periods [ . ], apostrophes [ ' ], and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.
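
Here is a minimal sketch of those rules in Python – an illustration of the transformation, not our production code:

    import re

    def marc_author_to_amazon(marc_author: str) -> str:
        """Transform a MARC author string into a bookseller-style name,
        following the rules listed above."""
        s = marc_author.strip()

        # No commas, semicolons, or parentheses: use the whole string as-is.
        if not re.search(r"[,;()]", s):
            return s

        # No commas, but semicolons and/or parentheses: keep everything
        # before the first semicolon or open parenthesis.
        if "," not in s:
            return re.split(r"[;(]", s, maxsplit=1)[0].strip()

        # Commas present: last name is everything before the first comma;
        # first name is everything after it, up to the next comma,
        # semicolon, open parenthesis, digit, or end of string.
        last, rest = s.split(",", 1)
        first = re.split(r"[,;(0-9]", rest, maxsplit=1)[0].strip()

        # Periods, apostrophes, and accented characters pass through as-is.
        return f"{first} {last.strip()}".strip()

    print(marc_author_to_amazon("Purucker, G. de (Gottfried), 1874-1942"))
    # -> G. de Purucker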

An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…

Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.

For example, here’s the title of a real page-turner:

American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits

But on Amazon that title is:

American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)

As you can imagine, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.
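
One way to cope is fuzzy matching: normalize both strings, score their similarity, and accept only matches above some threshold. A rough sketch using Python’s standard difflib (the normalization and any threshold are illustrative, not our exact production logic):

    from difflib import SequenceMatcher

    def normalize(title: str) -> str:
        """Lowercase and collapse punctuation and whitespace."""
        cleaned = title.lower()
        for ch in ":;,":
            cleaned = cleaned.replace(ch, " ")
        return " ".join(cleaned.split())

    def title_similarity(a: str, b: str) -> float:
        """Rough 0-1 similarity score between two titles."""
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

    marc = ("American authors, 1600 - 1900 a biographical dictionary of "
            "American literature ; compl. in 1 vol. with 1300 biographies "
            "and 400 portraits")
    amazon = ("American Authors 1600-1900: A Biographical Dictionary of "
              "American Literature (Wilson Authors)")

    # Scores well below 1.0 even though a human would call these the same
    # book, which is why any threshold trades precision against recall.
    print(title_similarity(marc, amazon))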

*$%!@$* Serials

Now that the librarians have had a laugh, let’s explain that for everybody else! Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that. It’s on this shelf in this area and the collection contains these years.

Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.

And that means that the publication dates pulled from the MARC records are sometimes very wrong.

For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974…  The date provided in the MARC file for all of these is 1940.

As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.

Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information.  For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.

Not All Metadata is Good Metadata

Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence.  Not according to the MARC record, though! Fat finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version.  But the MARC record tells us it’s from 1924.

Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.

Commercial APIs Are Not Built to Solve Library Problems

Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the API’s other data about availability, formats, and pricing is less accurate. Because the Section 108(h) exemption for libraries hinges on whether copies are being sold at reasonable prices, we need to know about these aspects of a book to determine whether it qualifies. But Amazon’s API is incomplete in this area. So we found ourselves needing to use the API to find a match for the title and author, and then go to the product page and scrape it to actually get accurate availability and pricing information.

This increased the complexity of the programming required to use Amazon as a source of information, and greatly lengthened the process of building tools for this purpose.
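
In outline, the flow looks like the sketch below. The helper functions are hypothetical stand-ins – api_search for the bookseller API lookup and scrape_offers for the product-page scraper – since the real versions are tied to Amazon’s interfaces:

    from dataclasses import dataclass

    @dataclass
    class Listing:
        url: str

    def api_search(title: str, author: str) -> Listing | None:
        """Find the best title/author match via the bookseller API (stub)."""
        raise NotImplementedError  # hypothetical stand-in

    def scrape_offers(url: str) -> dict:
        """Scrape availability and price from the product page (stub)."""
        raise NotImplementedError  # hypothetical stand-in

    def check_availability(title: str, author: str) -> dict | None:
        # Step 1: the API is good at identifying *which* book this is.
        match = api_search(title, author)
        if match is None:
            return None  # not listed today; it may appear tomorrow
        # Step 2: the product page, not the API, has reliable availability
        # and pricing for the 108(h) determination.
        return scrape_offers(match.url)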

Everything changes

We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing.  If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow.

Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made.
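
As an illustration, with the Archive’s public internetarchive Python client, stamping a determination date onto an item could look like this; the item identifier and field name here are hypothetical, not necessarily what we use internally:

    from datetime import date
    from internetarchive import modify_metadata

    # Both values below are illustrative placeholders.
    modify_metadata(
        "example-book-identifier",
        metadata={"sec108h-determination-date": date.today().isoformat()},
    )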

Who Wants In!?

Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.

*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog.

New Views Stats for the New Year

(First published on the Internet Archive blog)

We began developing a new system for counting views statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people vs. robots crawling the site.

The new system will debut in January 2019. Leading up to that, over the next couple of weeks, you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.

With the new system you will see changes on both items and collections.

Item page changes

An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.

On item pages the lifetime views will change to a new number.  This new number will be a sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for that item may go down. I will explain why this occurs further down in this post where we discuss how the new system differs from the legacy system.
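
As a toy illustration with made-up numbers, here is why a lifetime total can drop even though no views were lost:

    # Hypothetical item: the legacy system counted 2,500 views during
    # 2017-2018, but many of those were robots or repeat engagements.
    legacy_through_2016 = 10_000
    new_2017_2018 = 1_800  # de-duplicated new-system count for 2017-2018

    lifetime = legacy_through_2016 + new_2017_2018
    print(lifetime)  # 11800, down from the 12500 displayed before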

Collection page changes

Soon, on collection page About tabs, you will see 2 separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain 2 years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.

Both graphs will be on the page for a limited time, allowing you to compare your collections stats between the old and new systems.  We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.

People vs. Robots

The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.”  Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them.  The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).

One note about robots: they like text-based files more than audio/visual files.  This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.

“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.

What counts as a view in the new system

  • Each media item in the archive has a views counter.
  • The view counter is increased by 1 when a user engages with the media file(s) in an item.
    • Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
    • All types of engagements are treated in the same way – they are all views.
  • A single user can only increase the view count of a particular item once per day.
    • A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view (see the sketch after this list).
  • Collection views are the sum of all the view counts of the items in the collection.
    • When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
    • When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
    • Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that are in the collection at that time (e.g. the September 1, 2018 views number for the collection represents the sum of the views on items that were in the collection on September 1, 2018. If an item moves out of that collection, the collection does not lose the views from September 1, 2018.).
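
Here is a minimal sketch of the once-per-day rule, assuming simple in-memory state (the real system presumably aggregates server logs at much larger scale):

    from datetime import date

    seen: set[tuple[str, str, date]] = set()  # (user, item, day) triples
    view_counts: dict[str, int] = {}          # item -> lifetime views

    def record_engagement(user_id: str, item_id: str, day: date) -> None:
        """Count at most one view per user, per item, per calendar day."""
        key = (user_id, item_id, day)
        if key not in seen:
            seen.add(key)
            view_counts[item_id] = view_counts.get(item_id, 0) + 1

    record_engagement("u1", "jerky-turkey", date(2019, 1, 2))  # counts
    record_engagement("u1", "jerky-turkey", date(2019, 1, 2))  # ignored
    record_engagement("u1", "jerky-turkey", date(2019, 1, 3))  # counts again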

How the new system differs from the legacy system

When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.

  • The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
  • The legacy system updated item views ~24 hours after a view was recorded.  The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
  • The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
  • The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
  • The legacy system did not count all book reader interactions as views.  The new system counts bookreader engagements as a view after 2 interactions (like page flips).
  • On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system only counts engagement with the actual audio or video files in those items.

In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.

If you have questions regarding the new stats system, you may email us at info@archive.org.