Meta quietly releases new web scraper for collecting AI data

Meta has quietly released a new web crawler that combs the internet to collect reams of data to feed into its AI models.

The crawler, called Meta External Agent, was released last month, according to three companies that track web scrapers and bots on the web. The automated bot essentially copies, or “scrapes,” all the publicly available data from websites, such as the text of news articles or conversations in online discussion groups.

A representative from Dark Visitors, which provides tools to website owners to automatically block known scraper bots, said Meta External Agent is similar to OpenAI's GPTBot, which scrapes the web for AI training data. Two other organizations involved in tracking web scrapers confirmed the existence of the bot and that it is being used to collect AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated its corporate website for developers in late July to include a tab disclosing the existence of the new scraper, according to a version history found using the Internet Archive. Aside from updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although that crawler, called Facebook External Hit, “has been used for a variety of purposes over time, including sharing link previews.”

“Like other companies, we train our generative AI models on content publicly available online,” the spokesperson said. “We recently updated our guidance on how publishers can best exclude their domains from being crawled by Meta's AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others who claim AI companies have used their content and intellectual property without their consent. Some AI companies, such as OpenAI and Perplexity, have struck deals in recent months to pay content providers for access to their data. (Fortune was one of several news providers that announced a revenue-sharing deal with Perplexity in July.)

Flying Under the Radar

According to data from Dark Visitors, roughly 25% of the world's most popular websites currently block GPTBot, while only 2% block Meta's new bot.

Websites that want to block web scrapers typically do so through a file called robots.txt, adding a line that tells scraper bots to ignore the information on their site. However, for the rule to take effect, the specific name of the scraper bot usually has to be listed as well, which is difficult to do if the name has not been publicly disclosed. Operators of scraper bots are also free to ignore robots.txt, which is not enforceable or legally binding.
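To illustrate why the bot's name matters, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents are hypothetical, and the user-agent tokens shown (in particular `meta-externalagent`) are assumptions for illustration; site owners should confirm the exact tokens against each operator's published documentation. The helper `is_allowed` is likewise an illustrative name, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The user-agent tokens are assumptions
# for illustration and should be checked against each operator's docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/article"
    for bot in ("GPTBot", "meta-externalagent", "SomeOtherBot"):
        print(bot, "allowed:", is_allowed(ROBOTS_TXT, bot, page))
```

In this sketch, the two named bots are disallowed while a crawler that is not listed is still permitted, which is why an undisclosed bot name is hard to block. The check also only reflects what a well-behaved crawler would do; nothing in robots.txt technically prevents a bot from fetching the page anyway.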

Such scrapers are used to pull large amounts of data and text from the web for use as training data for generative AI models, also known as large language models or LLMs, and related tools. Meta's Llama is one of the largest LLMs available, and it powers products such as Meta AI, the AI chatbot that now appears on various Meta platforms. The company hasn't disclosed the training data used for the latest version of the model, Llama 3, but earlier versions of the model used large data sets compiled by other sources, such as Common Crawl.

Earlier this year, Meta co-founder and longtime CEO Mark Zuckerberg boasted during an earnings call that his social platform was amassing a dataset for AI training that was “larger than Common Crawl, which has been collecting about 3 billion web pages every month since 2011.”

But the new crawler suggests that Meta's vast data pool may no longer be enough. The company is working on updating Llama and expanding Meta AI, which typically requires new, better training data to keep improving. Meta plans to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

Are you a Meta employee or have insights and tips? Contact Kali Hays securely on Signal at +1-949-280-0267 or [email protected].
