Wikipedia opens AI training data set to deter web scrapers

The news: Wikimedia is setting up a data-set-sharing program to discourage web crawlers from scraping Wikipedia, offering an ethical and time-saving route for smaller AI companies, research organizations, and developers to use Wikipedia’s information.

Wikimedia launched a collection of stripped-down Wikipedia data for AI developers, hosted on Kaggle, a Google-owned data science platform.

  • Wikipedia’s parent organization said the open-license data set can be used for model development, benchmarking, and alignment, without using web crawlers to pull information directly from articles.