Let's take a look at the Robots Exclusion Protocol and discuss whether it's finally time to throw it into the trash bin.
By Julius Cerniauskas, CEO at Oxylabs
The internet is an inherently social space, governed by numerous explicit and unwritten rules that bind or guide millions of its users, including the non-human ones: today, the internet is rife with bots that perform a variety of beneficial functions, from answering customer inquiries to crawling the web for data. Just like humans, they need guidance and clear rules.
The Robots Exclusion Protocol (REP), usually referred to simply as “robots.txt,” has provided some of that guidance for the last 30 years. Since 1994, when Martijn Koster came up with the idea, robots.txt has acted as a machine-readable way to tell robots where they may wander (for indexing, data gathering, or other purposes) and which pages or sites they should avoid. Though the REP has no legal force, some have accepted and promoted it as a de facto standard, while others see it as something closer to posted instructions on how to behave in a public space. Those instructions, in turn, have been criticized as too easy to apply unreasonably, without properly balancing legitimate business and public interests.
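In practice, robots.txt is just a plain text file served from the root of a website. A minimal, illustrative example (the path and bot name below are hypothetical) might look like this:

    # Applies to every crawler that chooses to honor the file
    User-agent: *
    Disallow: /private/

    # Rules for one specific, hypothetical crawler
    User-agent: ExampleBot
    Disallow: /

A crawler that respects the protocol reads this file before visiting any page and skips the paths it is asked to avoid; a crawler that ignores it faces no technical barrier at all.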
The AI “arms race” has added urgency to the question of whether the REP is still the most feasible or beneficial solution, with some saying that trying to block AI crawlers might be a long-term disaster. Google, which five years ago planned to turn robots.txt into a widely accepted internet standard, recently issued a call for industry stakeholders to explore alternatives to the decades-old protocol.
Why does a simple, legally non-binding text file increasingly concern industry giants, and is it about time to throw robots.txt into the trash bin?
For many years, you could almost count web crawlers on your fingers, with the major ones belonging to search engines thoroughly indexing the whole World Wide Web. There were also bots from major marketplaces, such as Amazon, and archival projects like the Internet Archive. Webmasters could actually list all the bots they were concerned about in robots.txt. Five years ago, Google estimated that around 500 million websites were using robots.txt.
With AI advancing at breakneck speed, thousands of bots now scan the internet for data collection purposes. Most of them collect publicly available information, something you could easily copy down by hand, unless you need it at scale: manually collecting information from thousands of websites is impossible in terms of both time and resources. You need automation, and that is exactly what crawlers (or bots) provide. In that sense, they perform a valuable function.
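To make this concrete, here is a minimal sketch (in Python, with a placeholder bot name and placeholder URLs) of how a well-behaved crawler operates: it fetches a site's robots.txt, checks whether its user agent is allowed to visit a page, and only then downloads it. Nothing forces a bot to perform this check; compliance with the REP is entirely voluntary.

    # Minimal "polite crawler" sketch. Real collection pipelines add rate
    # limiting, error handling, parsing, and storage on top of this.
    import urllib.parse
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleDataBot"  # hypothetical bot name
    URLS = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

    def allowed(url: str, user_agent: str) -> bool:
        """Return True if the site's robots.txt permits this user agent to fetch the URL."""
        parts = urllib.parse.urlsplit(url)
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        parser.read()  # downloads and parses the site's robots.txt
        return parser.can_fetch(user_agent, url)

    def fetch(url: str) -> str:
        """Download a page, identifying the bot through the User-Agent header."""
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8", errors="replace")

    for url in URLS:
        if allowed(url, USER_AGENT):
            print(f"Fetched {url}: {len(fetch(url))} characters")
        else:
            print(f"Skipping {url}: disallowed by robots.txt")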
Many websites, however, are unhappy: they believe that, by collecting public information for AI training purposes, bots are stealing their bread and butter. Major media publishers, for example, argue that generative AI systems compete directly with the content they create, since public data scraped from their pages is used by AI to generate similar content or content summaries, often without any attribution. Publishers thus lose revenue streams but receive no compensation from AI creators.
A study by a Reuters journalist found that 606 of 1,156 surveyed publishers had used the REP to block GPTBot, OpenAI's crawler. Another study revealed that 306 of the top 1,000 websites blocked GPTBot, while, interestingly, only 28 blocked Anthropic's anthropic-ai. This is not because GPTBot is uniquely “bad”; there are simply so many AI bots that it is impossible to list them all in robots.txt, so only the well-known ones get the unfortunate “ban”. And even when they are banned, they often choose to silently ignore it.
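For a publisher, the “ban” itself amounts to a few lines in robots.txt; blocking the two crawlers named above site-wide would look roughly like this:

    User-agent: GPTBot
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

Any AI crawler not explicitly named in the file remains unaffected, which is precisely the scaling problem.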
In this regard, robots.txt clearly doesn't live up to expectations anymore. The question is whether those expectations were ever realistic. AI advancement depends on two main technical factors: computational power and data availability. If diverse data becomes unavailable, AI developers are forced to rely on synthetic datasets, which are ineffective for general-purpose model training. If millions of websites prohibited AI crawlers via robots.txt and expected that prohibition to be respected, the development of machine-learning-based AI would stall for years.
It is important to note that the decision to ignore robots.txt when collecting public data isn't new or unprecedented: the biggest internet archival project, the Internet Archive, has openly acknowledged that it does not always follow robots.txt because doing so would contradict its public mission of preserving the internet as it is for future generations.
Bots are an indigenous species of the internet: they might cause trouble, but, used responsibly, they perform beneficial functions. Without advancements in web scraping technologies, we wouldn't have witnessed the recent AI boom. Who should reap the monetary benefits of machine-powered automation is, however, a different question.
It would be naive to treat the ongoing battles between AI companies and public data sources as a fight over data ownership and fairness; first and foremost, it is a fight over the distribution of monetary gains. A widespread argument holds that the conflict can be solved by installing a compensation mechanism for those whose data is being used. A new startup called TollBit is already acting as an intermediary between data-hungry AI companies and media publishers interested in selling them public data.
The first broad steps to ensure compliance with copyright regulations and to compensate copyright holders have already been taken. In the EU, the newly adopted AI Act lays down specific transparency requirements for general-purpose AI systems, obliging AI companies to provide “sufficiently detailed” summaries of their training data. This obligation makes it easier for copyright holders to exercise and enforce their rights.
Everyone getting compensated for sharing their data and content sounds like a perfect deal, doesn't it? Unfortunately, compensating all data creators and owners is simply impossible.
The internet is a vast repository of public data, much of which lacks attributed rights and is subject to ongoing legal debate over copyright. For AI firms, it would be impossible to identify everyone who could even potentially claim copyright, let alone compensate them. The escalating costs of such proactive compensation could render AI development unfeasible for smaller companies, concentrating AI technology and its benefits in the hands of major tech players and undermining the positive societal impact AI could otherwise deliver.
Major social media platforms, forums, and media publishers already act as gatekeepers, trying to paywall and monetize public data that often doesn't even belong to them, since it is created by the millions of people who use those platforms. Whether it is fair to use this data for AI training should be answered on the basis of the fair use doctrine, not under the REP regime, which gives anyone on the internet a questionable right to unilaterally lock away public data with robots.txt.
AI technology raises questions we, as humans, haven't answered yet: is there a difference between human and machine creativity? Can a machine be held responsible and liable? Where is the line between humans and creative machines? Answering these questions will undoubtedly mean modifying long-established legal regimes, such as copyright law, and the rules surrounding bot responsibility.
The internet is already very different from what it was a few years ago — search engines are employing multiple AI-driven functionalities, whereas ChatGPT itself is turning into an alternative search engine. Web data is becoming the backbone of the digital economy, and there’s no way to reverse the tide without killing major technological innovations.
Robots.txt has no mechanism to address the variety of bots and crawlers that travel the web today, nor can it control the many different AI use cases and potential ways in which data might be used. Fighting the AI revolution seems like Don Quixote's quest, and so do the attempts to save the decades-old text file. If we are to move into the next industrial revolution, the age of intelligent machines, we will have to rethink the main legal and social frameworks that determine how the digital space is organized.
Julius Černiauskas is the CEO of Oxylabs. The company, once a small startup run by five staff members, is now one of the biggest companies in the web intelligence collection industry, employing over 400 specialists.
Since joining the company in 2015, Julius Černiauskas has transformed Oxylabs from a bare business idea into the tech company it is today by applying his deep knowledge of big data and information technology trends. He implemented a brand-new company structure, which led to the development of the most sophisticated public web data gathering service on the market. As a testament to this work, Oxylabs has been trusted with long-term partnerships by dozens of Fortune Global 500 companies.
Today, he continues to lead Oxylabs as a top global provider of premium proxies and data scraping solutions, helping companies and entrepreneurs to realize their full potential by harnessing the power of external data.