Let's take a look at the Robots Exclusion Protocol and discuss whether it's finally time to throw it into the trash bin.
By Julius Cerniauskas, CEO at Oxylabs
The internet is an inherently social space, governed by numerous explicit and unwritten rules that bind or guide millions of its users, including the non-human ones: today, the internet is rife with bots that perform a variety of beneficial functions, from answering customer inquiries to crawling the web for data. Just like humans, they need guidance and clear rules.
The Robots Exclusion Protocol (REP), usually referred to simply as “robots.txt,” has provided some of that guidance for the last 30 years. Since 1994, when Martijn Koster came up with the idea, robots.txt has acted as a machine-readable way to tell robots where they may wander (for indexing, data gathering, or other purposes) and which pages or sites they should avoid. Though the REP has no legal force, some have accepted and promoted it as a de facto standard, while others see it as something closer to posted instructions on how to behave in a public space. Those instructions, in turn, have been criticized as too easy to apply unreasonably, without properly balancing legitimate business and public interests.
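In practice, robots.txt is just a plain text file served from the root of a website. A minimal, illustrative example (the path and bot name below are hypothetical) might look like this:

    # Applies to every crawler that chooses to honor the file
    User-agent: *
    Disallow: /private/

    # Rules for one specific, hypothetical crawler
    User-agent: ExampleBot
    Disallow: /

A crawler that respects the protocol reads this file before visiting any page and skips the paths it is asked to avoid; a crawler that ignores it faces no technical barrier at all.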
The AI “arms race” has added urgency to the question of whether the REP is still the most feasible or beneficial solution, with some saying that trying to block AI crawlers might be a long-term disaster. Google, which five years ago planned to turn robots.txt into a widely accepted internet standard, recently issued a call for industry stakeholders to explore alternatives to the decades-old protocol.
Why does a simple, legally non-binding text file increasingly concern industry giants, and is it about time to throw robots.txt into the trash bin?
For many years, you could almost count web crawlers on your fingers, with the major ones belonging to search engines thoroughly indexing the whole World Wide Web. There were also bots from major marketplaces, such as Amazon, and archival projects like the Internet Archive. Webmasters could actually list all the bots they were concerned about in robots.txt. Five years ago, Google estimated that around 500 million websites were using robots.txt.
With AI advancing at breakneck speed, thousands of bots now scan the internet for data collection purposes. Most of them collect publicly available information, something you could easily copy down by hand, unless you need it at scale: manually collecting information from thousands of websites is impossible in terms of both time and resources. You need automation, and that is exactly what crawlers (or bots) provide. In that sense, they perform a valuable function.
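To make this concrete, here is a minimal sketch (in Python, with a placeholder bot name and placeholder URLs) of how a well-behaved crawler operates: it fetches a site's robots.txt, checks whether its user agent is allowed to visit a page, and only then downloads it. Nothing forces a bot to perform this check; compliance with the REP is entirely voluntary.

    # Minimal "polite crawler" sketch. Real collection pipelines add rate
    # limiting, error handling, parsing, and storage on top of this.
    import urllib.parse
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleDataBot"  # hypothetical bot name
    URLS = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

    def allowed(url: str, user_agent: str) -> bool:
        """Return True if the site's robots.txt permits this user agent to fetch the URL."""
        parts = urllib.parse.urlsplit(url)
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        parser.read()  # downloads and parses the site's robots.txt
        return parser.can_fetch(user_agent, url)

    def fetch(url: str) -> str:
        """Download a page, identifying the bot through the User-Agent header."""
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8", errors="replace")

    for url in URLS:
        if allowed(url, USER_AGENT):
            print(f"Fetched {url}: {len(fetch(url))} characters")
        else:
            print(f"Skipping {url}: disallowed by robots.txt")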
Many websites, however, are unhappy: they believe that, by collecting public information for AI training purposes, bots are stealing their bread and butter. Major media publishers, for example, argue that generative AI systems compete directly with the content they create, since public data scraped from their pages is used by AI to generate similar content or content summaries, often without any attribution. Publishers thus lose revenue streams but receive no compensation from AI creators.
A study by a Reuters journalist found that 606 of 1,156 surveyed publishers had used the REP to block GPTBot, OpenAI's crawler. Another study revealed that 306 of the top 1,000 websites blocked GPTBot, while, interestingly, only 28 blocked Anthropic's anthropic-ai. This is not because GPTBot is uniquely “bad”; there are simply so many AI bots that it is impossible to list them all in robots.txt, so only the well-known ones get the unfortunate “ban”. And even when they are banned, they often choose to silently ignore it.
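For a publisher, the “ban” itself amounts to a few lines in robots.txt; blocking the two crawlers named above site-wide would look roughly like this:

    User-agent: GPTBot
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

Any AI crawler not explicitly named in the file remains unaffected, which is precisely the scaling problem.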
In this regard, robots.txt clearly doesn't live up to expectations anymore. The question is whether those expectations were ever realistic. AI advancement depends on two main technical factors: computational power and data availability. If diverse data becomes unavailable, AI developers are forced to rely on synthetic datasets, which are ineffective for general-purpose model training. If millions of websites prohibited AI crawlers via robots.txt and expected that prohibition to be respected, the development of machine-learning-based AI would stall for years.
It is important to note that the decision to ignore robots.txt when collecting public data isn't new or unprecedented: the biggest internet archival project, the Internet Archive, has openly acknowledged that it does not always follow robots.txt because doing so would contradict its public mission of preserving the internet as it is for future generations.
Bots are an indigenous species of the internet: they might cause trouble, but, used responsibly, they perform beneficial functions. Without advancements in web scraping technologies, we wouldn't have witnessed the recent AI boom. Who should reap the monetary benefits of machine-powered automation is, however, a different question.
It would be naive to treat the ongoing battles between AI companies and public data sources as a fight over data ownership and fairness; first and foremost, it is a fight over the distribution of monetary gains. A widespread argument holds that the conflict can be solved by installing a compensation mechanism for those whose data is being used. A new startup called TollBit is already acting as an intermediary between data-hungry AI companies and media publishers interested in selling them public data.
The first broad steps to ensure compliance with copyright regulations and to compensate copyright holders have already been taken. In the EU, the newly adopted AI Act lays down specific transparency requirements for general-purpose AI systems, obliging AI companies to provide “sufficiently detailed” summaries of their training data. This obligation makes it easier for copyright holders to exercise and enforce their rights.
Everyone getting compensated for sharing their data and content sounds like a perfect deal, doesn't it? Unfortunately, compensating all data creators and owners is simply impossible.
The internet is a vast repository of public data, much of which lacks attributed rights and is subject to ongoing legal debate over copyright. For AI firms, it would be impossible to identify everyone who could even potentially claim copyright, let alone compensate them. The escalating costs of such proactive compensation could render AI development unfeasible for smaller companies, concentrating AI technology and its benefits in the hands of major tech players and undermining the positive societal impact AI could otherwise deliver.
Major social media platforms, forums, and media publishers already act as gatekeepers, trying to paywall and monetize public data that often doesn't even belong to them, since it is created by the millions of people who use those platforms. Whether it is fair to use this data for AI training should be answered on the basis of the fair use doctrine, not under the REP regime, which gives anyone on the internet a questionable right to unilaterally lock away public data with robots.txt.
AI technology raises questions we, as humans, haven't answered yet: is there a difference between human and machine creativity? Can a machine be held responsible and liable? Where is the line between humans and creative machines? Answering these questions will undoubtedly mean modifying long-established legal regimes, such as copyright law, and the rules surrounding bot responsibility.
The internet is already very different from what it was a few years ago — search engines are employing multiple AI-driven functionalities, whereas ChatGPT itself is turning into an alternative search engine. Web data is becoming the backbone of the digital economy, and there’s no way to reverse the tide without killing major technological innovations.
Robots.txt has no mechanism to address the variety of bots and crawlers that travel the web today, nor can it control the many different AI use cases and potential ways in which data might be used. Fighting the AI revolution seems like Don Quixote's quest, and so do the attempts to save the decades-old text file. If we are to move into the next industrial revolution, the age of intelligent machines, we will have to rethink the main legal and social frameworks that determine how the digital space is organized.
Julius Černiauskas is the CEO of Oxylabs. The company, once a small startup run by five staff members, is now one of the biggest companies in the web intelligence collection industry, employing over 400 specialists.
Since joining the company in 2015, Julius Černiauskas has transformed Oxylabs from a bare business idea into the tech company it is today by applying his deep knowledge of big data and information technology trends. He implemented a brand-new company structure, which led to the development of the most sophisticated public web data gathering service on the market. As a testament to this work, Oxylabs has been trusted with long-term partnerships by dozens of Fortune Global 500 companies.
Today, he continues to lead Oxylabs as a top global provider of premium proxies and data scraping solutions, helping companies and entrepreneurs to realize their full potential by harnessing the power of external data.