World Wide Webarchives

New Webrecorder release introduces support for seamlessly combining resources from multiple public web archives and the live web.

Rhizome’s Webrecorder is both a tool to create high-fidelity, interactive web archives of any web site you browse and a platform to make those archives accessible. Today we are thrilled to announce exciting new features.

Different public web archives collect web materials in different ways. The UK National Archives focuses on saving sites from the .uk domain, perma.cc allows users to create single-URL copies, Arquivo.pt looks after the Portuguese web and serves the scientific community, the Internet Archive goes for as much breadth as possible, and there are many more web archives all over the world. The good news here is that if a web resource is missing from one web archive, another archive may well have it. Unfortunately, until today, it has been very difficult to assemble materials from different archives into a more complete copy.

Rhizome’s Webrecorder offers users a new ability to tap into these resources and not only archive what is on the web right now, but additionally create new, standalone collections from the old web.

Extraction 

Every Webrecorder session starts with the user entering a URL to record. If Webrecorder recognizes that this URL points into another web archive, it enters Extraction Mode: instead of recording the web archive's own interface, with navigation elements like calendars or diagrams, Webrecorder picks up only the actual archived web pages.
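To illustrate the recognition step, here is a minimal sketch that assumes the entered URL follows the common Wayback-style layout of archive prefix, 14-digit timestamp, and original URL. The function name `parse_archived_url` and the pattern are hypothetical and are not taken from Webrecorder's code; archives with other URL schemes would need their own rules.

```python
import re
from typing import NamedTuple, Optional

# Assumption: <archive prefix>/<14-digit timestamp>[modifier]/<original URL>,
# e.g. https://web.archive.org/web/20090101000000/http://rhizome.org/
WAYBACK_STYLE = re.compile(
    r"^(?P<prefix>https?://[^/]+(?:/[^/]+)*?)/"
    r"(?P<timestamp>\d{14})(?:[a-z]{2}_)?/"
    r"(?P<original>.+)$"
)


class ArchivedUrl(NamedTuple):
    prefix: str      # where the archive serves its replay from
    timestamp: str   # capture time as YYYYMMDDhhmmss
    original: str    # the URL of the page as it was on the live web


def parse_archived_url(url: str) -> Optional[ArchivedUrl]:
    """Return the parts of a Wayback-style URL, or None for a live-web URL."""
    match = WAYBACK_STYLE.match(url)
    if match is None:
        return None
    return ArchivedUrl(
        match.group("prefix"),
        match.group("timestamp"),
        match.group("original"),
    )


if __name__ == "__main__":
    print(parse_archived_url(
        "https://web.archive.org/web/20090101000000/http://rhizome.org/"
    ))
    print(parse_archived_url("https://rhizome.org/art/"))
```

A `None` result would mean the URL is treated as a normal live-web recording; anything else gives the original URL and capture date needed for extraction.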

In Extraction Mode, you can easily add archival resources to your collection.

Rhizome’s 2009 website, as captured by arquivo.pt and extracted and patched on Webrecorder

Extract + Patch

Beyond having different collection policies, web archives also differ in how they capture: some fetch all the text (HTML) first and look for images in a later run, by which time those images might already have disappeared. Some attempt to capture JavaScript, while others focus on media files. Webrecorder now automatically patches in missing resources from any known web archive and from the live web, which can lead to dramatically improved versions of web pages from the past.

With Extract + Patch, Webrecorder creates two recordings: one contains only the resources from the archive whose URL you pasted, and the other contains the resources patched in from all the other archives that were used.
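The split into two recordings can be sketched roughly as follows. This is an illustration under stated assumptions, not Webrecorder's implementation: each archive is modelled as a hypothetical `Fetcher` callable that returns the resource bytes or `None`, and `extract_and_patch` is an invented name.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical interface: given an original URL, return its bytes or None.
Fetcher = Callable[[str], Optional[bytes]]


def extract_and_patch(
    urls: Iterable[str],
    source: Fetcher,
    fallbacks: List[Tuple[str, Fetcher]],
) -> Tuple[Dict[str, bytes], Dict[str, Tuple[str, bytes]]]:
    """Return (extracted, patched).

    extracted: resources found in the archive the user pasted.
    patched:   resources missing there, keyed by URL, stored together
               with the name of the fallback that supplied them.
    """
    extracted: Dict[str, bytes] = {}
    patched: Dict[str, Tuple[str, bytes]] = {}
    for url in urls:
        body = source(url)
        if body is not None:
            extracted[url] = body
            continue
        # Try the other sources in order; the live web would go last.
        for name, fetch in fallbacks:
            body = fetch(url)
            if body is not None:
                patched[url] = (name, body)
                break
    return extracted, patched


if __name__ == "__main__":
    source_archive = {"http://example.com/": b"<html>"}
    other_archive = {"http://example.com/logo.png": b"PNG"}
    live_web = {"http://example.com/style.css": b"body{}"}
    extracted, patched = extract_and_patch(
        ["http://example.com/", "http://example.com/logo.png",
         "http://example.com/style.css"],
        source_archive.get,
        [("other-archive", other_archive.get), ("live", live_web.get)],
    )
    print(sorted(extracted), sorted(patched))
```

Keeping the name of the supplying source next to each patched resource mirrors the idea of recording where every piece came from.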

The Cassina Projects website as captured in NYARC’s New York City Galleries collection on Archive-It, extracted and patched on Webrecorder

The information on where archival resources came from is stored in your recording session as provenance metadata, which also persists when the recording is downloaded as a WARC file. The latest release (1.0.5) of the Webrecorder Player desktop application displays the provenance info exactly as the online version does.

Patch existing recordings from multiple archive sources

Webrecorder has always included a “Patch Mode” which allowed users to “patch” in missing resources that were not previously recorded. Until this release, Patch Mode would only look at the latest live version of a resource. Now, Webrecorder will also look at other archives to attempt to patch in the most accurate archival copy of each missing piece.

For example, if you created a collection with Webrecorder one year ago and forgot to include a resource, Patch Mode will now look for copies as close as possible to the original material’s recording date, and fall back to the live web if other archives cannot provide them.
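The "as close as possible" rule can be illustrated with a small sketch. It assumes captures are identified by 14-digit timestamps (YYYYMMDDhhmmss), as many web archives use; the names `closest_capture` and `Capture` are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Assumption: a capture is (14-digit timestamp, archive name).
Capture = Tuple[str, str]


def parse_ts(ts: str) -> datetime:
    """Parse a YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(
    recorded_at: str, captures: Iterable[Capture]
) -> Optional[Capture]:
    """Pick the capture nearest in time to the original recording date.

    Returns None when no archive has a copy, which is the cue to fall
    back to the live web.
    """
    target = parse_ts(recorded_at)
    return min(
        captures,
        key=lambda capture: abs(parse_ts(capture[0]) - target),
        default=None,
    )


if __name__ == "__main__":
    candidates = [
        ("20150101000000", "archive-a"),
        ("20160620000000", "archive-b"),
        ("20170301000000", "archive-c"),
    ]
    print(closest_capture("20160701000000", candidates))
    print(closest_capture("20160701000000", []))
```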

If you’re part of an organization that runs its own web archive, you might want to patch any resources from the past that have been missed: paste your own archive’s URL into Webrecorder, download the patch recording, and integrate it into your resources.

Read about three practical applications of these new functions.

Public Web Archives Directory

How does Webrecorder know what these other web archives are and how does it access them?

While there are a few well-known web archives, there are actually at least twenty-five of them all over the world! That’s why we’ve created an open source public web archives directory on GitHub, where each archive’s capabilities are described in the new Web Archive Manifest format.

When you’re browsing a web archive, the pages are actually rewritten, so that for instance links to the original web site now point to the archive’s server. However, most web archives provide access to unmodified archived resources and at least one API for requesting a copy from a specific date. These APIs include Memento, CDX Server, and Wayback raw content replay.
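For readers who have not used these APIs, the sketch below only builds the requests; it does not send them. It rests on three assumptions: Memento (RFC 7089) negotiates dates through the `Accept-Datetime` request header, the CDX query parameters follow the pywb-style CDX Server API as we understand it, and raw replay uses the Wayback-style `id_` modifier after the timestamp. The helper names and the endpoint URLs in the usage lines are hypothetical, and individual archives may differ.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlencode


def memento_headers(when: datetime) -> dict:
    """Memento: ask a TimeGate for the capture closest to `when`."""
    # RFC 7089 expects an HTTP-date in GMT in the Accept-Datetime header.
    # `when` should be timezone-aware.
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


def cdx_query(endpoint: str, original: str, timestamp: str) -> str:
    """CDX Server: look up the index entry closest to `timestamp`."""
    # Assumed parameter names (pywb-style); check each archive's docs.
    params = {
        "url": original,
        "closest": timestamp,
        "sort": "closest",
        "limit": 1,
        "output": "json",
    }
    return f"{endpoint}?{urlencode(params)}"


def raw_replay_url(prefix: str, original: str, timestamp: str) -> str:
    """Wayback raw replay: `id_` requests the unrewritten resource."""
    return f"{prefix}/{timestamp}id_/{original}"


if __name__ == "__main__":
    when = datetime(2009, 1, 1, tzinfo=timezone.utc)
    print(memento_headers(when))
    print(cdx_query("https://archive.example/cdx", "http://rhizome.org/",
                    "20090101000000"))
    print(raw_replay_url("https://archive.example/web", "http://rhizome.org/",
                         "20090101000000"))
```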

If you are stewarding a public web archive, you can check whether it is already listed correctly. If not, feel free to submit a pull request to the directory on GitHub to add your archive and make it accessible to Webrecorder.

About Webrecorder 

Webrecorder was developed by Ilya Kreymer, and is a project of Rhizome under its digital preservation program led by Dragan Espenschied. It's currently developed by Kreymer with the assistance of Senior Front-End Developer Mark Beasley, Design Lead Pat Shiu, and Contract Developer Raffaele Messuti.