    
        The internet is splitting apart. The Internet Archive wants to save it all forever.
    
    The Internet Archive has grand ambitions for preserving the internet. But in order to do that, Big Tech has to stay out of the way.

Brewster Kahle, the founder of the Internet Archive, worries about how the splintering internet could end a golden age for the Internet Archive.

        Photo: Internet Archive
    
Anna Kramer

        March 10, 2021
    
The internet's first librarian likes to reminisce. The early internet is like a fantasy for the founder of the Internet Archive, a place he returns to over and over again in conversation when questions about the present turn dark or depressing. Brewster Kahle might know more about the early years of the web than anyone else. 
He has occasion to talk about the Archive's beginnings perhaps more than he should these days. Discussing its future can at times be grim, or, at the very least, uncertain. The glories of the Wayback Machine, the petabytes of data capturing every day of human existence online in warehouses scattered across the world, the smooth system of crawlers marching from my Twitter to the homepage for the Russian government to Clubhouse in China — in the grand scheme of history, all of this could be an ephemeral golden age.
The so-called balkanization of the internet isn't just a theoretical problem for the Internet Archive. If internet firewalls stay up in China, Iran and Russia, if new content continues to move mostly behind paywalls and passwords, and if U.S. political leaders decide it's finally time for Section 230 to go, the crawlers whose simple formulas have preserved the last few decades for future historians might not do the same for more than the next few decades.

"There are more and more walled gardens where you can't go. We just have crawlers going at a crazy scale, and they can get blocked just like anybody can get blocked," said Jefferson Bailey, the Archive's director of web archiving and data services. 
Even so, until someone or something fundamentally changes the rules of the web, the Internet Archive will keep doing what it's been doing since 1996: preserving every fragment of text you or I are ever likely to read. Tech's walled gardens might make it harder to get a perfect picture, but the small team of librarians, digital archivists and software engineers at the Internet Archive plan to keep bringing the world the Wayback Machine, the Open Library, the Software Archive, etc., until the end of time. Literally.
The balkanization of the internet
When Kahle was a student at MIT in the early '80s, he used a professor's ID to break into the Harvard Law library to access cases for a project. If there was a moment in his lifetime that encapsulated the closed nature of access to information before the internet, it was that.

But today, anyone can find the information he needed back then without so much as a library card. "Usually, things are very closed and locked down. Historically, this is a very rare moment," he said. 
That could soon change, however. "Are we at risk of locking down? Yes, absolutely," he said. The Internet Archive is currently blocked in China and is occasionally blocked in Russia, India and Turkey as well, and that's just at the whim of nation-state governments that have the tools to make that work. According to Kahle and Bailey, corporations are just as capable of fracturing the web in ways that make it harder to access and archive; even "user lock-in" to a specific browser and set of products could one day create internet bubbles, and then walls, based on the products people pay for.
"The Facebooks and the Googles are taking over, and they want to make money," Bailey said. The more people act on the internet behind a password and the more the web becomes corporate, the more the open internet ethos fades away from the public consciousness, easing the way toward that splintering that Kahle fears.
"That's a strategic concern for everyone. Of course, it impacts archiving, too," Bailey said. The archive does its best to capture Twitter, Tumblr, Instagram, YouTube, Vimeo, Facebook and others. Facebook is the hardest, because the company is archiving-unfriendly in general, according to Bailey. But in reality, if any of these social companies decided they wanted to stop the Internet Archive from doing its job, they probably could, he said. 

"We're embedded in the community," Bailey said. "At the end of the day, we're just a library."
Kahle fears that the eventual "walling" of the internet could develop in an incongruous place: from tech companies eager for regulation that would cement their own status by stifling future innovation. For example, almost any proposed change to Section 230, which protects website owners from legal liability for content created and posted by their users, would destroy the delicate legal framework that protects the Internet Archive's work (as well as Wikipedia and user-contributed projects), according to Kahle. Facebook's Mark Zuckerberg is among the many tech leaders to express support for a rewrite.
And tech companies, book publishers and even the music industry have lobbied to limit, change or even remove general copyright fair use exceptions, as well as specific copyright and use exemptions for libraries. Changes to these laws could (accidentally or intentionally, depending on who you ask) make it much harder for people to share their creative work online, and for groups like the Internet Archive to save it.
"Why are they doing this? Some people say it's money. But when you have oligarchies, it's really about protecting against new entrants in the market," Kahle said. At the end of the day, large companies have adapted to the current legal regimes, and they have the money and technical know-how to be able to advocate for stricter regulations that would allow them to preserve their monopolies while changing or limiting fair-use protections.

How the Internet Archive decides what to archive
Until the day these more existential problems firm into something Kahle can fight with more than words, the Internet Archive's day-to-day struggle is preserving the constantly transient web. Web pages have an average lifespan of about 90 days before they change or disappear, so the Archive needs to capture each page at least once every 90 days to preserve a full picture of the web over time.
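The scheduling rule implied by that 90-day figure can be sketched in a few lines of Python. This is only an illustration, not the Archive's actual scheduler: the names `next_capture_due` and `is_overdue` are hypothetical, and the fixed 90-day constant is simply the average lifespan cited above.

```python
from datetime import date, timedelta

# Assumption: the roughly 90-day average page lifespan cited in the article.
AVERAGE_LIFESPAN_DAYS = 90


def next_capture_due(last_capture: date, lifespan_days: int = AVERAGE_LIFESPAN_DAYS) -> date:
    """Return the latest date by which a page should be captured again."""
    return last_capture + timedelta(days=lifespan_days)


def is_overdue(last_capture: date, today: date, lifespan_days: int = AVERAGE_LIFESPAN_DAYS) -> bool:
    """True when more than one average lifespan has passed since the last capture."""
    return today > next_capture_due(last_capture, lifespan_days)
```

A real crawler would presumably revisit fast-changing pages far more often; the single constant here only marks the outer bound the paragraph describes.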
The archivists employ three main strategies to capture most of what might be important for future historians. Bailey wouldn't guess exactly what percentage of the web they manage to preserve — "I'd look like an idiot," he said — because no one really can guess the size or scale of the internet. (Don't get there in your head, if you can avoid it. How would you even measure: by data size? Number of objects? Number of distinct URLs?) "There's no use being anxious over what's outside your control," he said.
The archivists start by considering the entirety of the web and seeking out the most important fraction. They capture a shallow outline of the entire internet (every single URL and associated homepage that's accessible), and then they dive deep into as many pages as possible for the top 5 million or so most-visited websites. This creates a fairly flat, bird's-eye view of the internet.
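One way to picture this two-tier approach is as a depth policy keyed on popularity rank. The sketch below is hypothetical: the names `crawl_depth`, `SHALLOW_DEPTH` and `DEEP_DEPTH`, and the depth values themselves, are invented for illustration. Only the rough 5 million cutoff comes from the description above.

```python
TOP_SITE_CUTOFF = 5_000_000  # "top 5 million or so" most-visited sites, per the article
SHALLOW_DEPTH = 0            # assumption: capture the homepage only
DEEP_DEPTH = 5               # assumption: arbitrary illustrative number of link hops


def crawl_depth(popularity_rank: int) -> int:
    """Return how many link hops to follow for a site with the given rank (1 = most visited)."""
    if popularity_rank < 1:
        raise ValueError("popularity_rank starts at 1")
    # Every site gets at least a shallow capture; top-ranked sites get a deep one.
    return DEEP_DEPTH if popularity_rank <= TOP_SITE_CUTOFF else SHALLOW_DEPTH
```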
To get a more three-dimensional picture, they seek other signals of importance, ranging from news aggregators to the entirety of a national domain (like Cuba, France, Somalia, etc.) when there is an important event, and even every single YouTube URL ever shared on Twitter (they can't capture all of YouTube, but at least they can capture the videos people deem important enough to share elsewhere).

And finally, other institutions can use the Internet Archive to build their own archiving services, usually creating specialized collections around topics like human rights or bioengineering. All of those collections are then copied back into the Wayback Machine, which is the publicly accessible version of the web archive. 
Abbie Grotke, the web archiving team lead at the Library of Congress, has been involved in this work in one way or another for over 20 years. The Library of Congress's own archive is one of the special collections built in collaboration with Bailey, and it contains about 2.4 petabytes and over 18 billion objects, ranging from U.S. government websites to the most culturally important memes. Grotke has given her life to preserving the internet for the Library of Congress. 
The work itself is technically an enormous task, but it boils down to one simple goal. "We're just trying to capture changes over time," she said.

        Brewster Kahle is the internet's first librarian.
        Photo: Internet Archive

The Library of Congress began capturing websites in 2000, focusing mostly on political collections and on at-risk websites and collections that might be taken down before they can be captured. "We're always sort of worried about, are we collecting everything we need to be collecting? Is there something we're missing?" said Amber Paranick, one of the Library of Congress's reference librarians. But this problem isn't that different because it's digital: "That's always the dilemma of the librarian."

The web archive alone is about 45 petabytes, or 45,000 terabytes, and the Internet Archive itself is about double that size (the group has other collections, like a huge database of educational films, music and even long-gone software programs).
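For a sense of scale, the unit conversion is simple arithmetic. The snippet below assumes decimal (SI) units, where one petabyte is 1,000 terabytes, and treats "about double" as exactly double; both are simplifications made for illustration.

```python
TB_PER_PB = 1_000  # decimal (SI) convention; binary units would use 1,024

web_archive_pb = 45                    # figure cited in the article
total_archive_pb = 2 * web_archive_pb  # "about double", taken here as exactly 2x

web_archive_tb = web_archive_pb * TB_PER_PB
total_archive_tb = total_archive_pb * TB_PER_PB

print(f"Web archive: {web_archive_pb} PB = {web_archive_tb:,} TB")
print(f"Whole Archive: roughly {total_archive_pb} PB = {total_archive_tb:,} TB")
```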
It's impossible to conceptualize actually usable, accessible data at that scale, let alone make it text-searchable. So while the Archive has some projects to use machine learning to identify some images, like pictures of horses, Bailey likes to think about the odd, unimaginable applications that have emerged and how they foretell grander uses in the future.
The Wayback Machine has evolved to play an important role in patent litigation, for example. People fighting over patent ownership look for what's called "prior art," which indicates who might have first thought of a product. In one case, when two people were disputing who first created a specific design for hubcap rims, one was able to prove their ownership by finding an old website that had been archived in the Wayback Machine.
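A lookup of this kind can be approximated with the Wayback Machine's public availability endpoint, which reports the archived snapshot closest to a requested date. The sketch below only builds the query URL and reads a response that has already been parsed from JSON, so it makes no network call. The helper names and the sample response are illustrative assumptions, and the response layout is assumed from the endpoint's published examples, so it should be checked against the current API documentation.

```python
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, timestamp: str) -> str:
    """Build a query for the snapshot closest to `timestamp` (for example YYYYMMDD)."""
    params = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{params}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest snapshot URL from a parsed response, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hypothetical response, shaped like the endpoint's JSON output.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200",
        }
    }
}
```

The snapshot's timestamp is what matters in a dispute like the one described: it records when the page was captured, which is evidence that the content existed by that date.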
And there are other use cases, too: The people building open-source translation tools at Mozilla have found the Internet Archive's collection of websites in multiple languages useful for training those tools. There is very little printed or digitized material that has large amounts of the same text in two languages, but many official websites do, which can help build quality translation tools for "minor languages," such as English-Swahili translation, according to Bailey.

The future of our histories
When I asked Kahle how he thinks about preserving today for historians centuries away, he grew philosophical. He sent links in the Zoom chat, first to the Google Doc for a book he wrote, then a Nation piece, then a long blog post he wrote in 2015. By the time we hung up the call, I had piles of reading material, most of it dense, most of it dated.
There's value to all of this history, he told me. "What we're able to do now is know about your individual history. We're able to get to the specificity of the historical record. Which I think is going to really be engaging in 100 years' time. What would you give for a video of your great-grandmother? It would just give you this ballast, it would give you an anchoring, that we right now lack," he said. "We're living in the perpetual present, and that is dangerous." Kahle believes our history makes us better people, and gives us better knowledge. But history isn't financially lucrative.
Social media companies want us to focus on tomorrow, not on the posts we made a year ago. Publishers do, too. HarperCollins is suing the archive to try to prevent it from sharing out-of-print books in its digital library, arguing that publicly sharing out-of-print books is a massive violation of copyright laws. While at first it might seem odd that publishers would care about books that aren't in print anymore, for companies whose business depends on people buying new things, archiving so that people can focus on the past is not in their financial interest.

"They are erasing the past through every legal and political means they can," Kahle said. 
If the balkanization of the internet can be prevented, the Internet Archive could transform the way we learn about larger historical moments, Kahle said. History books and historians are limited to a few textual works, mostly by the powerful people of the time. With the Internet Archive, the everyday history will become suddenly accessible to those studying our time. Imagine if each of us could look back on our great-grandparents and know what they said or thought at age 15, and then 25, and 50. The Archive would allow that. 
The Archive could also force historians to become professional data miners. "There will be a lot of these comparison studies at a much larger scale in the future — every tweet from every president in 30 years. Longitudinal analysis could be done with petabytes of data," Bailey said. The research questions themselves may not change much; they will just stretch over bigger timelines and larger comparisons.
"We're in the process of building macroscopes," Kahle said. 
Caught in a golden age
More than 1 million people use the Internet Archive every day. Most of them seek out the Wayback Machine, but people also read the digitized books in the archive's Open Library, or watch movies from the huge archive of public domain films.

"We love the dreamers, the people who come to this new medium with their ideas. The dreams are important to archive, whatever happens," Kahle said. Despite the existential threats to his work and to the values of the open internet, Kahle wants to be hopeful.
"Those who want to monopolize the internet are very well-funded. We need to communicate and deliver the value of openness. Am I optimistic we can do that? I'd say yes. But it's based on an enormous number of people wanting it to happen," he said. 
"Some believe that people will only do things if you pay them, others that people are just sheep," Kahle said. "None of that is true. They may not be interested in the same things, but when we look at what people produce on the internet, if it's about the things they care about … They'll prove you wrong in a nanosecond."
Anna Kramer
	Anna Kramer is a reporter at Protocol (Twitter: @anna_c_kramer), where she helps write and produce Source Code, Protocol's daily newsletter. Prior to joining the team, she covered tech and small business for the San Francisco Chronicle and privacy for Bloomberg Law. She is a recent graduate of Brown University, where she studied International Relations and Arabic and wrote her senior thesis about surveillance tools and technological development in the Middle East.
