The Business & Technology Network
Helping Business Interpret and Use Technology

AI is breaking the internet’s memory

Tags: digital new
DATE POSTED: June 18, 2025

AI bots are quietly overwhelming the digital infrastructure behind our cultural memory. In early 2025, libraries, museums, and archives around the world began reporting mysterious traffic surges on their websites. The culprit? Automated bots scraping entire online collections to fuel training datasets for large AI models. What started as a few isolated incidents is now becoming a global pattern.

To investigate, the GLAM-E Lab (focused on Galleries, Libraries, Archives, and Museums) launched a survey that reached 43 institutions across North America, Europe, and Oceania. Their findings reveal a growing tension between open access and technical resilience in the face of AI-scale data extraction.

Bots aren’t browsing, they’re swarming

Of the 43 institutions surveyed, 39 reported recent traffic spikes. Most had no idea what was happening until their servers slowed down or went offline entirely. When they dug deeper, they discovered that many of these requests came from bots, often linked to companies building training corpora for large AI models.

Unlike traditional search engine crawlers, these bots don’t operate gently or gradually. They arrive in dense, rapid waves, downloading everything, following every link, and ignoring signals like robots.txt. Their activity mimics a distributed denial-of-service attack, even if their intent is simply data collection.
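For context, the opt-out signal these bots are reportedly ignoring is the site's robots.txt file. A minimal sketch of what an institution might publish is below; the crawler names are examples of publicly documented AI user agents, and compliance is entirely voluntary, which is exactly the problem the report describes:

```
# robots.txt - a voluntary request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Ask well-behaved crawlers to pace themselves (non-standard directive,
# honored by some crawlers and ignored by others)
User-agent: *
Crawl-delay: 10
```

Because nothing in the protocol compels a scraper to read this file, a directive like `Disallow: /` only filters out the crawlers that were already behaving well.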

Each GLAM institution has its own digital setup. Some operate on robust cloud architectures, others on legacy systems barely equipped to handle regular visitor loads. When bots strike, the impact can be wildly uneven. A national museum might absorb the spike; a community archive could crash within minutes.

The analytics tools used by these institutions weren’t built to detect bots. Many respondents said they only discovered the true source of the traffic after breakdowns occurred. Some had mistaken bot visits for rising public interest until they realized those numbers couldn’t be trusted.

Open access is becoming a vulnerability

One might assume that bots target only openly licensed content. The reality is more blunt: bots do not care. Both open and restricted collections are scraped. Licensing signals aren’t being read, let alone respected. That puts every digital collection online, no matter how carefully curated, at risk of exploitation and collapse.

This presents a dilemma. GLAM institutions exist to share culture and knowledge widely. But the same openness that serves the public is also what exposes them to industrial-scale scraping from AI developers, many of whom provide no attribution, compensation, or regard for infrastructure costs.

Institutions reported seeing bots arrive in swarms, often rotating IP addresses and spoofing user agents to avoid detection. Traffic would surge without warning, spike server CPU to 100%, and crash systems for hours or days. After grabbing what they needed, the bots would disappear, until the next swarm.
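Because the swarms rotate IPs and spoof user agents, the most reliable signal left is the traffic shape itself: a sudden burst of requests compressed into a short window. A minimal sketch of that kind of detection, assuming you can extract request timestamps from your access logs (the window size and threshold here are illustrative, not recommendations):

```python
from collections import Counter
from datetime import datetime


def find_traffic_spikes(timestamps, window_seconds=60, threshold=100):
    """Bucket request timestamps into fixed windows and flag any window
    whose request count exceeds the threshold - a crude swarm signal
    that works even when IPs and user agents are rotated."""
    buckets = Counter()
    for ts in timestamps:
        buckets[int(ts.timestamp()) // window_seconds] += 1
    return [
        (datetime.fromtimestamp(bucket * window_seconds), count)
        for bucket, count in sorted(buckets.items())
        if count > threshold
    ]
```

Real deployments would layer on per-IP and per-path breakdowns, but even this aggregate view would have told many of the surveyed institutions that their "rising public interest" was actually machine traffic.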

Some respondents described patterns where bots revisited monthly. Others saw increasing frequency, suggesting either growing demand or more actors entering the AI training space. In all cases, the disruption was real, measurable, and costly.

Many GLAM teams deployed countermeasures: firewalls, IP blocks, geofencing, and bot detection services like Cloudflare. But each solution has trade-offs. Blocking by geography might prevent legitimate researchers from accessing materials. User agent filtering is easy for bad actors to circumvent. Some institutions considered login gates, but that conflicts with their public access mission.
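One countermeasure that avoids the geography and user-agent trade-offs above is plain rate limiting: cap how fast any single client can pull pages, regardless of who it claims to be. A common way to implement that is a token bucket, sketched here in Python (parameters are illustrative; a production version would sit in the web server or CDN, keyed per client):

```python
import time


class TokenBucket:
    """Per-client token bucket: allows short bursts up to `capacity`,
    but caps sustained traffic at `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, then spend one if available.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The appeal for GLAM sites is that this never blocks a human researcher: ordinary browsing stays well under the limit, while a bot downloading everything at machine speed is throttled to a pace the server can absorb.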

The most effective countermeasures are also the most expensive. Scaling up server capacity, migrating infrastructure, or integrating sophisticated traffic monitoring tools costs money—and cultural institutions often have none to spare.

What counts as a “user” now?

One of the deeper questions raised by the report is philosophical. If bots now represent a significant share of traffic, do they count as users? Should institutions try to serve them, block them, or treat them as a new class of visitor? Many institutions said their visitor counts were inflated by bot traffic. Once corrected, the real engagement metrics painted a very different picture.

There’s no easy fix

Updating robots.txt no longer works. Reporting abuse gets mixed results. Adding login barriers risks excluding welcome visitors. Even identifying which bots are “good” (e.g., search engines) and which are “bad” (AI scrapers) is murky, as the boundaries between indexing and dataset collection blur.

Some institutions are considering building APIs to serve bots more efficiently. But that assumes the bots will use them, which they likely won’t. Others are hoping for legal protections, like those proposed in the EU’s Digital Single Market directive. But enforcement is far from guaranteed.

This isn’t just a technical challenge. It’s a stress test for the values of openness and access in the digital age. The GLAM community, despite its global diversity, shares a strikingly unified ethic: culture should be freely accessible. But the infrastructure supporting that ethic wasn’t designed to handle AI-scale extraction.

If AI companies want to rely on the public internet as a training ground, they may need to support its maintenance. That could mean abiding by better standards, funding sustainable access programs, or respecting new opt-out protocols.
