Meta has quietly released a new web crawler that combs the internet to collect reams of data to feed into its AI models.
The crawler, called Meta External Agent, was released last month, according to three companies that track web scrapers and bots on the web. The automated bot essentially copies, or “scrapes,” all the publicly available data from websites, such as the text of news articles or conversations in online discussion groups.
A representative from Dark Visitors, which provides tools to website owners to automatically block known scraper bots, said Meta External Agent is similar to OpenAI's GPTBot, which scrapes the web for AI training data. Two other organizations involved in tracking web scrapers confirmed the existence of the bot and that it is being used to collect AI training data.
Meta, the parent company of Facebook, Instagram, and WhatsApp, updated its corporate website for developers in late July to include a tab disclosing the existence of the new scraper, according to a version history found using the Internet Archive. Aside from updating the page, Meta has not publicly announced the new crawler.
A Meta spokesperson said the company has had a crawler under a different name “for years,” though that earlier crawler, called Facebook External Hit, “has been used for a variety of purposes over time, including sharing link previews.”
“Like other companies, we train our generative AI models on content publicly available online,” the spokesperson said. “We recently updated our guidance on how publishers can best exclude their domains from being crawled by Meta's AI-related crawlers.”
Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others who claim AI companies have used their content and intellectual property without their consent. Some AI companies, such as OpenAI and Perplexity, have struck deals in recent months to pay content providers for access to their data. (Fortune was one of several news providers that announced a revenue-sharing deal with Perplexity in July.)
Flying Under the Radar
According to data from Dark Visitors, roughly 25% of the world's most popular websites currently block GPTBot, while only 2% block Meta's new bot.
Websites that want to block web scrapers must add rules to robots.txt, a plain-text file served from the site's root, telling scraper bots to ignore information on the site. However, for those rules to be respected, the file usually has to list the specific name of the scraper bot, which is difficult if that name has not been publicly disclosed. Scraper bot operators are also free to ignore robots.txt, which is neither enforceable nor legally binding.
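To illustrate why the bot's name matters, here is a minimal sketch using the robots.txt parser in Python's standard library. The user-agent tokens `GPTBot` and `meta-externalagent`, the example URL, and the helper `is_allowed` are assumptions made for illustration; site owners should check each operator's documentation for the exact token. The sketch only models a crawler that chooses to honor the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names only one AI crawler.
ONLY_GPTBOT = "User-agent: GPTBot\nDisallow: /\n"

# The same file with a second rule group added for Meta's crawler.
# The token "meta-externalagent" is an assumption, not taken from the article.
BOTH_BOTS = ONLY_GPTBOT + "\nUser-agent: meta-externalagent\nDisallow: /\n"

URL = "https://example.com/news/some-article"  # placeholder URL


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A bot that is not named in the file is not blocked by it.
    print(is_allowed(ONLY_GPTBOT, "GPTBot", URL))
    print(is_allowed(ONLY_GPTBOT, "meta-externalagent", URL))
    # Once the bot's name is known and listed, a compliant crawler is excluded.
    print(is_allowed(BOTH_BOTS, "meta-externalagent", URL))
```

In this sketch, a site that lists only `GPTBot` leaves any unnamed crawler free to fetch its pages, which is consistent with the gap in blocking rates described above.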
Such scrapers are used to extract large amounts of data and text from the web to use as training data for generative AI models, including large language models, or LLMs, and related tools. Meta's Llama is one of the largest LLMs available, and it powers things like Meta AI, the AI chatbot that now appears on various Meta platforms. The company hasn't released the training data used for the latest version of the model, Llama 3, but earlier versions of the model used large data sets compiled by other sources, such as Common Crawl.
Earlier this year, Mark Zuckerberg, Meta's co-founder and longtime CEO, boasted during an earnings call that his company's social platforms were amassing a dataset for AI training that was “larger than Common Crawl, which has been collecting about 3 billion web pages every month since 2011.”
But the new crawler suggests that Meta's vast data pool may no longer be enough. The company is working on updating Llama and expanding Meta AI, which typically requires new, better training data to keep improving. Meta plans to spend up to $40 billion this year, mostly on AI infrastructure and related costs.
Are you a Meta employee or have insights and tips? Contact Kali Hays securely on Signal at +1-949-280-0267 or [email protected].