Sunday 7 September 2008

What About Web Archiving?


Part of my role involves working on a large web archiving project. One of the questions often asked by those who use the UK Web Archive, Internet Archive and the like, or who are looking to start web archiving themselves, is about the 'quality' of archived web pages in comparison to the original.

The key issue to get across is that websites aren't 'things' you can locate, pin down and put in a box - even the most static collection of HTML pages exists in the continuum between server, network, browser and screen. More simply, there's no such thing as a website1.

Web archiving is more akin to photography; most of what we do is make copies of sites, trying to fit them to our own individual construction of them at the particular moment, and under the particular conditions, of archiving. Similarly, the problems associated with archiving dynamic content can be compared to taking photographs of a city: images of 1940s Cardiff don't capture the entirety of the city but do provide insights and representations of a transient experience, just as an archive of Amazon.com will only give a glimpse into the constantly moving community beneath the surface.

The question we, as the custodians of information, have to ask ourselves is: is this enough? Should we strive for completeness even if that means emulating servers and proprietary software, when all we'd be keeping is a shell of the original (akin to a digital St Fagans)?

With web archiving we need to engage with the content more directly. At the moment it seems cheaper to keep things than to select them, but this isn't always going to be the case. If we take an object into our collections we're looking to keep it in perpetuity - and the lifetime costs of keeping that single file will be infinite (this is true, of course, for physical things as well).

For domain-level harvests it's clear that captures will only ever be superficial. It's my view that there will always be a place for selective archiving - both for sites which fall within an organisation's collection policy but outside the domain, and in terms of the effort put into processing and checking specific sites selected as being of special interest.

It's tempting to equate domain harvesting to the collection of printed material through legal deposit; however, there are some key differences. Firstly, the range of printed material (and therefore of its preservation requirements) is limited - not necessarily small, but limited - whereas web content will contain every obscure file format you can think of, stored in all kinds of different ways. Secondly, there is a significant cost to printing material, which reduces duplication (although, of course, it also reduces the range of material - for better or worse).

By putting our information on the web we're all becoming our own librarian-archivists (hooray!) - although we've not yet taken the next step and become records managers. (My Gmail currently tells me that I'm using 534 MB (7%) of my 7081 MB - if this keeps up I'll never need to delete anything and, as long as search technologies keep up, I'll be able to find it all again too.)

There is a movement within our information society towards keeping all the iterations of content as part of the services themselves. We're constantly seeing statistics which suggest that the amount of information available is increasing exponentially but, if Wikipedia and blogs are anything to go by, most of this will be drafts, previous versions and backups!

Similarly, the web is notoriously difficult to pin down in time. Print and even 'formal' e-journals are published on a specific schedule, whereas blogs and other forms of social content can be started, flourish and die out in a short period of time (perhaps even between large-scale harvests). Part of our engagement with content has to be a realistic and positive engagement with content creators.

If you look at the short history of the web you can begin to see a shift from silos of content (personal and corporate websites, each with their own domain) to portals (where the silos are connected by indexes and directories) to dispersed content (where photos live on flickr, blogs live on wordpress.com, updates on twitter, videos on youtube and links on delicious) - it's easy to believe that this dispersal was the inevitable consequence of hypertext.

Of course this changes the way we archive and - more importantly - provide access to archived sites. If all the information on a particular subject is spread across user pages on multiple websites, which rely on the major search indexes like google to link them together, then we need to think about capturing not only the separate pieces of content but also the links between them. At the moment we work primarily at site level, preserving the links between pages but treating sites as immutable objects - in the future we will need to let the harvesting agents roam freely, capturing the snippets of content which make up a web2.0 website.
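By way of illustration only, here's a minimal sketch (in Python, and emphatically not any archive's actual crawler) of what capturing the links alongside the content might look like: fetch a single page, keep the snapshot, and record where it points, so that the relationships between dispersed pieces of content survive alongside the pieces themselves. The seed URL and output file below are hypothetical.

```python
# Minimal sketch: archive one page and the outbound links that connect it
# to content hosted elsewhere. Purely illustrative, not a production crawler.
import json
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag encountered in the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def snapshot(url):
    """Fetch one page; return its content plus the absolute URLs it links to."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return {
        "url": url,
        "content": html,
        "links": [urljoin(url, href) for href in parser.links],
    }

if __name__ == "__main__":
    record = snapshot("https://example.com/")   # illustrative seed URL
    with open("snapshot.json", "w") as out:      # illustrative local store
        json.dump(record, out)
    print(f"archived {record['url']} with {len(record['links'])} outbound links")
```

Even a toy like this makes the point: the link list is as much a part of the record as the page itself, and following those links across hosts is what lets a harvest reflect content dispersed over many services rather than a single monolithic site.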

Ultimately, the archiving of the web is a positive thing and it's certainly a great area to work in; I know that every site we archive is a resource preserved for future research. The challenges that face us shouldn't put us off adding these important cultural assets to our collections, but we do need to begin to engage with them before they seem insurmountable.

1 In fact it's this facet that really attracted me to working on Web Archiving in the first place. A substantial portion of my doctoral research was centred around the social construction of the web and the fallacy of the monolithic website. If you ever want a long and overly detailed conversation about constructivism and the web, you know where to come...
