Websites including Amazon, CNN, and The New York Times are blocking OpenAI’s Web crawler

Numerous websites, including prominent ones such as Amazon, The New York Times, and Shutterstock, have blocked OpenAI’s Web crawler from gathering content that could be used to improve its artificial intelligence (AI) models.

OpenAI, the creator of ChatGPT, introduced GPTBot in August. Within the first two weeks, major websites, including those named above as well as Quora, CNN, and wikiHow, had blocked GPTBot’s access, according to a recent study by Originality.AI, a company that specializes in detecting AI-generated content.

The New York Times updated its terms of service to more explicitly forbid the “scraping of our content for AI training and development,” according to a spokesperson quoted in a report from The Guardian. The revised terms, effective since August 3, state that the newspaper’s content cannot be used to develop software programs, including machine learning or AI systems, without consent.

OpenAI said that GPTBot, which is designed to crawl the internet, could help improve future AI models. It stated that allowing GPTBot to access websites can make AI models more accurate, capable, and safe. OpenAI also emphasized that websites retain the option to limit GPTBot’s access, either partially or entirely.
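
As an illustration of partial versus full limits, the short Python sketch below checks two sample robots.txt rule sets against the "GPTBot" user-agent token, using the standard library's urllib.robotparser. It assumes that sites express the limit through robots.txt, which the article itself does not specify. The example.com URLs and the /public/ and /private/ paths are hypothetical placeholders, not rules taken from any of the sites named here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set that bars GPTBot from the entire site.
FULL_BLOCK = """\
User-agent: GPTBot
Disallow: /
"""

# Hypothetical rule set that bars GPTBot from one directory only.
PARTIAL_BLOCK = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for label, rules in (("full", FULL_BLOCK), ("partial", PARTIAL_BLOCK)):
        for path in ("/public/page", "/private/page"):
            url = "https://example.com" + path
            print(label, path, gptbot_allowed(rules, url))
```

A crawler that honors robots.txt would skip every page under the first rule set and only the /private/ pages under the second. Other crawlers are unaffected, because both rule sets name GPTBot alone.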

AI language models such as ChatGPT are trained on large volumes of text from the internet, and the use of web crawlers to collect that training data has raised concerns. Notably, pirated content, including works by authors such as Stephen King, has been used to train AI tools, as reported by The Atlantic.

(Source: Amanda Lee | The Straits Times)