We're losing our digital history. Can the Internet Archive save it?

Sep 16, 2024 - 11:06

(Credit: Serenity Strull/ Getty Images)

Research shows 25% of web pages posted between 2013 and 2023 have vanished. A few organisations are racing to save the echoes of the web, but new risks threaten their very existence.

It's possible, thanks to surviving fragments of papyrus, mosaics and wax tablets, to learn what Pompeiians ate for breakfast 2,000 years ago. Understand enough Medieval Latin, and you can learn how many livestock were reared at farms in Northumberland in 11th Century England – thanks to the Domesday Book, the oldest document held in the UK National Archives. Through letters and novels, the social lives of the Victorians – and who they loved and hated – come into view.

But historians of the future may struggle to understand fully how we lived our lives in the early 21st Century. That's because of a potentially history-deleting combination of how we live our lives digitally – and a paucity of official efforts to archive the world's information as it's produced these days.

However, an informal group of organisations is pushing back against the forces of digital entropy – many of them operated by volunteers with little institutional support. None is more synonymous with the fight to save the web than the Internet Archive, an American non-profit based in San Francisco, started in 1996 as a passion project by internet pioneer Brewster Kahle. The organisation has embarked on what may be the most ambitious digital archiving project of all time, gathering 866 billion web pages, 44 million books, 10.6 million videos of films and television programmes and more. Housed in a handful of data centres scattered across the world, the collections of the Internet Archive and a few similar groups are the only things standing in the way of digital oblivion.

"The risks are manifold. Not just that technology may fail, but that certainly happens. But more important, that institutions fail, or companies go out of business. News organisations are gobbled up by other news organisations, or more and more frequently, they're shut down," says Mark Graham, director of the Internet Archive's Wayback Machine, a tool that collects and stores snapshots of websites for posterity. There are numerous incentives to put content online, he says, but there's little pushing companies to maintain it over the long term.

Despite the Internet Archive's achievements thus far, the organisation and others like it face financial threats, technical challenges, cyberattacks and legal battles from businesses who dislike the idea of freely available copies of their intellectual property. And as recent court losses show, the project of saving the internet could be just as fleeting as the content it's trying to protect.

"More and more of our intellectual endeavours, more of our entertainment, more of our news, and more of our conversations exist only in a digital environment," Graham says. "That environment is inherently fragile."

Saving our history

A quarter of all web pages that existed at some point between 2013 and 2023 now… don't. That's according to a recent study by Pew Research Center, a think tank based in Washington, DC, which raised the alarm about our disappearing digital history. Researchers found the problem is more acute the older a web page is: 38% of web pages that Pew tried to access that existed in 2013 no longer function. But it's also an issue for more recent publications. Some 8% of web pages published at some point in 2023 were gone by October that same year.

This isn't just a concern for history buffs and internet obsessives. According to the study, one in five government websites contains at least one broken link. Pew found more than half of Wikipedia articles have a broken link in their references section, meaning the evidence backing up the online encyclopaedia's information is slowly disintegrating.

With no formalised public efforts to document the web, the Internet Archive has become a critical piece of digital infrastructure (Credit: Serenity Strull/ Getty Images)

But thanks to the work of the Internet Archive, not all those dead links are totally inaccessible. For decades, the Archive's Wayback Machine project has sent armies of robots to crawl through the cascading labyrinths of the internet. These systems download functional copies of websites as they change over time – often capturing the same pages multiple times in a single day – and make them available to the public free of charge.

"When we then went and looked at how many of those URLs were available in the Wayback Machine, we found that two-thirds of those were available in a way," Graham says. In that sense, the Internet Archive is doing what it set out to do – it's saving records of online society for posterity.

A few other organisations, big and small, work on similar projects. The US Library of Congress, for example, preserves government websites, the sites of congressmembers and a collection of US news sites. The Library of Congress also preserved a copy of every single tweet sent since the founding of Twitter (now known as X), until the project was shut down in 2017. Other governments run their own initiatives. The UK Web Archive conducts an annual crawl of websites with .UK domain names, capturing a snapshot of the British internet at least once a year. In 2022, a band of volunteers set out to save the Ukrainian internet as it was hit by Russian cyberattacks.

But the scope of these projects is narrow, while the Internet Archive aims for a comprehensive approach. Given the available resources, it would be impossible to collect anything close to the whole internet, but its systems cast a broad net. Depending on what you're looking for, the Internet Archive's collection is so thorough it can sometimes feel like a functionally complete record of the web.

Success breeds complacency

The Archive's publicly accessible documents help sustain records of our lives in the current era. It's become standard practice on Wikipedia to cite copies of websites from the Internet Archive’s Wayback Machine, rather than the original websites themselves. The organisation also preserves a vast collection of media that predates the digital era. The beloved 1977 comedy series Fernwood 2 Night isn't available on any streaming service, but you can watch it free on the Internet Archive. Books, magazines and websites cite the Internet Archive’s scanned digital copies of books that are unavailable in physical libraries. It even acts as a preservation tool for the public; anyone can upload videos, websites and practically anything else to the organisation's servers.

Among the major collections that the Wayback Machine has salvaged from the digital scrapheap are deep records of websites built on GeoCities, a now defunct personal web hosting service. Long before social media, GeoCities was among the first platforms that made it easy for anyone to create their own website. Historians view GeoCities as one of the most important chapters in the early days of the world wide web. Without the efforts of the Internet Archive, most of its websites would be lost. In more recent history, a US Congressional Committee relied on the Internet Archive to preserve articles and documents related to the January 6 insurrection.

"Every few years there's a new platform come along and then the economic forces suddenly kind of collapse in it," says Andrew Jackson, preservation registry technical architect at the Digital Preservation Coalition, a UK-based advocacy group and charity that advises on how to preserve the world's online digital archives. "That's one big source of churn."

The tech news website CNET faced backlash in 2023 after reports that the company had deleted tens of thousands of articles, amounting to decades of lost history. Among CNET's responses was a promise that all its deleted articles had been preserved in the Wayback Machine. Many critics argued the company was taking the Internet Archive for granted, passing on its own archival responsibilities.

"Even though Google and other search engines actively incentivise you to maintain stable URLs, it's just technically quite difficult to do that," says Jackson. "Every time a new company kind of revamped its website, it has to work out how much of its new URLs it's going to try and maintain through time."

But it's worth remembering what the Internet Archive is: a non-profit organisation, financed by donations from charitable foundations. It makes for a never-ending project with exponentially growing costs. The Internet Archive volunteered to take on the mantle of being the world's leading library for our digital lives. As the web approaches its fourth decade, this entirely unofficial project has become a foundational pillar of the internet.

But as our reliance on the Internet Archive grows, so too do the threats pecking away at its efforts.

Single points of failure

Last week, the organisation announced a major partnership with Google, under which the tech giant's search engine will include links to the Wayback Machine in search results – though neither party released financial details about the deal.

But other recent news demonstrates that the project is still fragile. That vulnerability was laid bare in a court case against the Internet Archive by four large book publishers, who alleged that the Internet Archive’s practice of scanning physical books and lending out digital copies breaches US copyright law. Before the pandemic, the Internet Archive would only lend one digital copy at a time for each physical book in its collection. But during the Covid shutdowns, the organisation lifted that restriction, letting patrons borrow unlimited digital copies of books to try and make up for the closure of physical libraries.

A US court ruled that practice was illegal in 2023, and in early September, the Internet Archive’s appeal against that decision was rejected. The organisation previously said that it agreed to pay a publishing industry trade group an undisclosed sum in relation to the case.

With that lawsuit in the rearview, the Internet Archive is fighting yet another court case, this time with music labels over its digitisation of records, which could cost it $400m (about £305m) if it loses. It's an amount that could jeopardise the non-profit's survival.

The Internet Archive's three-decades-long collection spans across hundreds of billions of web pages (Credit: Serenity Strull/ Getty Images)

The Internet Archive's director of library services, Chris Freeland, said in a statement about the ruling that the organisation is reviewing the court's opinion.

Existential legal battles aren't the only hazards menacing the world of digital preservation. The British Library's UK Web Archive got a taste of some malevolent technical challenges when a cyberattack took its digital systems offline in October 2023. Almost a year later, the UK Web Archive is still dealing with the fallout. Online access to much of its collection is still unavailable.

In May 2024, the Internet Archive announced it was in the midst of a large distributed denial of service (DDoS) attack. In a DDoS attack, vandals or other bad actors set up automated systems to bombard websites with visits, attempting to push them offline by overwhelming their servers. At its peak, tens of thousands of concurrent visits were happening every second. Services, including the Wayback Machine, went down. It meant that the regular drumbeat of archiving was disrupted for a time, and there may be permanent gaps in the historical record as a result.

The Internet Archive "was started by one individual, and it has become a kind of linchpin", says Jackson. "It also feels like this potential single point of failure. Although it's a lot more sophisticated than just volunteers, it is one institution in one region, under one legal framework."

The organisation shares these concerns. If the Internet Archive's work stopped and "that void wasn't immediately filled, then much of what is currently made available on the public web would be at risk", says Graham.

He's clear that the Internet Archive won't step back from its responsibilities anytime soon, but the project can use outside help. "There are opportunities for many others to contribute in a variety of ways," he says.

Shared responsibilities, split priorities

With no formal effort to coordinate the preservation of the internet, the project is left to hobbyists, volunteers, and a few unofficial bodies that generally operate independently.

"It makes sense that the archival response is decentralised," says Mar Hicks, a historian of technology at the University of Virginia. "But one of the problems is the varied priorities."

Hicks points out that one of the first things any archivist will consider when building an archive is what to prioritise. "And when it's so decentralised, the priorities are going to be very different," Hicks says. "There's going to be people in groups who prioritise trying to grab everything – as much as they possibly can, they might be very completionist." Then there will be others who are focused only on certain areas – for instance, the UK archiving effort.

The concern about such an ad hoc, decentralised approach is that it's possible there's overlap, meaning precious archiving resources are wasted getting duplicate or triplicate copies of the most popular websites – all while some areas that may have historical importance are overlooked because they fall between different groups' responsibilities.

"Archivists will tell you that these issues have existed for a very long time," Hicks says. But they're exacerbated by the level of stuff being produced in our digital world. Nearly a billion emails are sent every day. YouTube reports that more than 500 hours' worth of video content is posted on the platform every minute.

The internet is "essentially a firehose of information and material," says Hicks. "It doesn't make sense to try to catch everything that comes out of the firehose. That wouldn't make sense from a resource standpoint."

In one sense this is an old concern. "We have, as historians, those same problems," says Hicks. "We have a wealth of documents from the past. But we only have certain documents and certain people's voices, and a lot of those voices that were missing were incredibly important, and they've been erased."

For Hicks, there needs to be some sort of priority about what is being saved from the digital footprints of our generation. Otherwise we run the risk that rapidly ballooning costs will sideline efforts to save the history of the web – not to mention the oceans of digital files that live offline.

"If you have to keep everything, it becomes very expensive," says Jackson of the Digital Preservation Coalition. "There's often older content or less compelling content [that] gets lost by the wayside," he says.

"We're not capturing the non-Western world well," admits Jackson. "There are gaps now around incompleteness in different cultural domains."

And while many of those organisations work to fight against their biases and prejudices, they're often left to carry the weight of the task while governments and the companies that run the platforms and websites sit by. "Independent groups of people, who are just caring about it and are willing to spend their free time doing it, are better resourced and more highly skilled than the institutions which are formally responsible," says Jackson.

There's a vacuum, argues Hicks, which few people other than a handful of archivist obsessives are filling. "It's not clear whose responsibility it is to archive [the internet] or whose interest it would serve," Hicks says.

One thing is clear, though, Hicks says: we should all pay up to support the fight for preservation. "From a very pragmatic perspective, if you do not pay these people and make sure that these archives are funded, they will not exist into the future, they will break down and then the whole point of collecting them will have gone out the window," says Hicks. "Because the whole point of the archive is not that it just gets collected, but that it persists indefinitely into the future."

The Enlightenment of the 18th Century saw the birth of an international library movement as governments and philanthropists took on the need to preserve and distribute books for the public. But that sense of civic responsibility hasn't extended to the internet. That may be due to the complicated business interests of the digital world, or just the immense technical challenge. Or, perhaps, it's because to casual observers it doesn't feel like the web needs saving. A book is a more obviously finite resource; it can be lost or damaged. But the internet feels so accessible. Anyone with an internet connection can pull up a web browser and dial in a URL. It's all right there – until it isn't.

Source: BBC