
7TB of Stolen Data: What Really Happened in the Internet Archive Hack?

Security breaches keep underscoring how much vigilance it takes to protect sensitive information. One recent and significant example is the breach at the Internet Archive (IA), the non-profit digital library behind the Wayback Machine, which stores more than 866 billion web pages along with other digital content and is a crucial resource for historical data. In October 2024, the Internet Archive was breached, exposing GitLab authentication tokens and compromising Zendesk support tickets, with personal data now in the hands of malicious actors.

This article examines the Internet Archive breach, the root causes, the fallout, and the lessons organizations can take away to avoid similar incidents.

What Happened?

The breach at the Internet Archive began with the exposure of GitLab authentication tokens in a configuration file on one of the organization’s servers, which had been accessible since December 2022. These tokens, which verify identity and grant access to different parts of the Internet Archive’s infrastructure, were not properly rotated or secured, so anyone who found them could use them.

The exposed tokens gave malicious actors access to the Internet Archive’s GitLab source code and, subsequently, sensitive internal information. Among the compromised systems was Zendesk, the customer service platform used by the Internet Archive. Support tickets dating back to 2018 were accessed, and the breach may have exposed personal data, including user requests and even personal identification documents that were required to remove content from the Wayback Machine.
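The kind of exposure described above, a credential sitting in a readable configuration file, can often be caught with a simple pattern scan before an attacker finds it. The following is a minimal sketch, not a description of any tooling the Internet Archive used. It assumes that GitLab personal access tokens carry the `glpat-` prefix (true for tokens created on recent GitLab versions), and the `generic_secret` pattern, the function name `find_exposed_secrets`, and the sample configuration are all hypothetical illustrations.

```python
import re

# Assumption: GitLab personal access tokens commonly start with "glpat-".
# The generic pattern is a hypothetical catch-all for "key = value" secrets.
TOKEN_PATTERNS = {
    "gitlab_pat": re.compile(r"glpat-[0-9A-Za-z_\-]{20,}"),
    "generic_secret": re.compile(
        r"\b(?:token|secret|api_key|password)\b\s*[=:]\s*\S{12,}",
        re.IGNORECASE,
    ),
}


def find_exposed_secrets(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in TOKEN_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# Made-up configuration content; the token value is not a real credential.
sample_config = """\
[remote "origin"]
url = https://gitlab.example.org/group/project.git
token = glpat-AbCdEfGhIjKlMnOpQrSt
"""

for line_number, name in find_exposed_secrets(sample_config):
    print(f"line {line_number}: possible {name}")
```

A scan like this only flags candidates. Any token it finds in a publicly reachable file should be treated as compromised and revoked, not just moved.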

The Breach Timeline and Key Events

  1. Initial Breach: A hacker discovered an exposed GitLab configuration file on a misconfigured Internet Archive server and used it to access the organization’s GitLab environment. From there, they could extract sensitive source code and authentication tokens.
  2. Vulnerability Ignored: Despite BleepingComputer repeatedly alerting the Internet Archive about the exposed tokens, the issue wasn’t resolved promptly. The tokens were left unchanged for almost two years, leading to this breach. Once inside the system, the attackers were able to escalate their access and download sensitive information.
  3. Compromise of Zendesk Tickets: A key part of the breach was access to Zendesk, the platform used to handle customer inquiries. The attackers reached more than 800,000 support tickets, some of which contained personal data, potentially including government IDs submitted by users requesting removal of content from the Wayback Machine. This sensitive information was stolen and may have been traded or sold among cybercriminal communities.
  4. No Immediate Financial Gain: While some breaches are motivated by financial gain, the Internet Archive breach was not carried out for extortion or ransom. The attackers appeared more interested in the challenge of hacking a prominent website and in earning credibility within the hacking community by leaking the data.

The DDoS Attack and Further Misreporting

Around the same time as the breach, the Internet Archive suffered a Distributed Denial of Service (DDoS) attack launched by a pro-Palestinian hacktivist group known as BlackMeta. The group claimed responsibility for the DDoS attack but was not involved in the data breach. Because the two events overlapped, many mistakenly believed BlackMeta was responsible for both, which frustrated the original attackers, who wanted credit for the data breach itself.

The DDoS attack further disrupted the Internet Archive’s operations, forcing the site into a read-only mode, where users could access archived material but could not upload new content. This limited functionality continues as the organization works on securing its systems.

The Significance of the Breach

The Internet Archive is an essential repository of digital information. For years, it has preserved snapshots of the web, allowing users to access historical versions of websites. The breach exposed a fundamental vulnerability in the cybersecurity posture of the organization, which operates more like a small business despite managing such vast digital assets.

Experts, including Adam Brown from Black Duck Software, pointed out that some of the Internet Archive’s security practices, such as using Bcrypt for password hashing, helped limit the blast radius of the breach. However, the mismanagement of API tokens and access credentials was a critical failure.

The 7TB of stolen data has not been confirmed as leaked, but it represents a significant amount of information. As with many breaches, the data will likely appear on hacking forums in the future, where it can be distributed among cybercriminals.

Lessons in Cybersecurity

  1. Token Management: One of the critical failures in this breach was the mismanagement of GitLab tokens. Organizations must regularly rotate tokens and authentication credentials, especially when vulnerabilities are flagged. Allowing tokens to remain exposed for nearly two years opens the door to malicious activity.
  2. Swift Response to Alerts: In this case, BleepingComputer alerted the Internet Archive multiple times about the vulnerability, but the organization failed to respond promptly. Companies must take warnings from the cybersecurity community seriously and act swiftly to address issues.
  3. API Security: The breach also exposed weaknesses in how API tokens were managed. It is crucial to implement strong API security measures that limit access, especially to systems that hold sensitive data, such as the Zendesk customer service platform.
  4. Comprehensive Audits: Regular security audits are necessary for organizations, especially those handling vast amounts of data. Had a proper audit been conducted, the misconfigured GitLab server and exposed tokens might have been detected earlier, preventing the breach.
  5. Transparency and Communication: The Internet Archive has been criticized for its lack of communication following the breach. In cybersecurity incidents, timely and transparent communication with users and stakeholders can help mitigate damage and maintain trust.
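The token management and alert response lessons above can be combined into a small recurring check. The sketch below is an illustration under stated assumptions, not a record of how the Internet Archive manages credentials: the `TokenRecord` structure, the 90-day `MAX_TOKEN_AGE` policy, and the inventory entries are all hypothetical. It flags any token that is older than the policy allows or that an outside party has reported as exposed.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy value; choose a limit that fits your own risk tolerance.
MAX_TOKEN_AGE = timedelta(days=90)


@dataclass
class TokenRecord:
    name: str
    created: date
    reported_exposed: bool = False  # set when an outside party flags the token


def tokens_needing_rotation(tokens, today):
    """Return names of tokens that are reported exposed or older than MAX_TOKEN_AGE."""
    due = []
    for token in tokens:
        too_old = today - token.created > MAX_TOKEN_AGE
        if token.reported_exposed or too_old:
            due.append(token.name)
    return due


# Made-up inventory for illustration only.
inventory = [
    TokenRecord("ci-deploy", date(2022, 12, 1)),
    TokenRecord("helpdesk-api", date(2024, 9, 15), reported_exposed=True),
    TokenRecord("backup-job", date(2024, 9, 1)),
]

print(tokens_needing_rotation(inventory, today=date(2024, 10, 1)))
# prints ['ci-deploy', 'helpdesk-api']
```

Treating an external report as an immediate rotation trigger, regardless of the token's age, is the design choice that matters here: a token flagged as exposed should not wait for its scheduled rotation.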

The Road Ahead for Internet Archive

The Internet Archive is still recovering from this significant breach. The site’s read-only mode limits what users can do, and there are ongoing discussions about how to improve its security posture. Given its critical role in preserving digital history, the organization must prioritize its security infrastructure to prevent future breaches.

In the wake of this breach, other organizations that rely heavily on open-source tools like GitLab should take a closer look at their security practices. Breaches like this one are reminders that even small misconfigurations can have devastating consequences.

Conclusion

The Internet Archive breach serves as a critical case study in modern cybersecurity challenges. The organization’s mishandling of sensitive tokens and its failure to respond to security alerts left it vulnerable to a breach that exposed personal data and compromised its systems, and its non-profit status offered no protection. By learning from this event, organizations can reinforce their security practices to protect against similar incidents.

For users, this breach underscores the importance of being cautious about the data they share with online platforms, as even well-meaning organizations can sometimes fail in their duty to protect it.
