Inside Out

Notes on seeking wisdom and crafting software

Simple Wikipedia dataset

We are releasing a cleaned and labeled dataset for the English Simple Wikipedia on GitHub. Every article is labeled with its categories. The categories dataset provides the name, URL and relationships for each category. This data is derived from the upstream Wikipedia dumps. See the README for details of the sources and the scripts used for cleanup.

Some quick stats

  • 148714 articles/pages
  • 41571 categories
  • Max sub categories in a category is 10624
  • Max pages in a category is 20473
  • 457 categories appear in the articles but are missing from the category list
    • Due to case mismatches
    • Or extra Unicode characters, etc.
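
A mismatch check of this kind could be sketched roughly as follows. This is only an illustration, not the script used for the release: the function names `normalize` and `find_missing`, the in-memory lists, and the sample category names are all hypothetical, and the actual file layout is described in the README. The idea is to compare names exactly first, then compare again after Unicode normalization, removal of invisible characters, and case folding, so that near-misses can be mapped back to a known category.

```python
import unicodedata


def normalize(name: str) -> str:
    """Return a comparison key: NFKC form, no invisible chars, case-folded."""
    text = unicodedata.normalize("NFKC", name)
    # Drop format (Cf) and control (Cc) characters such as U+200E.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cf", "Cc")
    )
    # Collapse runs of whitespace and ignore case differences.
    return " ".join(text.split()).casefold()


def find_missing(article_categories, known_categories):
    """Map each unknown article category to its likely match, or None."""
    known_exact = set(known_categories)
    known_by_key = {normalize(c): c for c in known_categories}
    missing = {}
    for name in set(article_categories):
        if name not in known_exact:
            # None means no candidate was found even after normalization.
            missing[name] = known_by_key.get(normalize(name))
    return missing


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the dataset.
    known = ["Living people", "Cities in France"]
    used = ["Living people", "living People", "Cities in France\u200e", "Unknown thing"]
    for name, match in sorted(find_missing(used, known).items()):
        print(repr(name), "->", repr(match))
```

Names that still map to `None` after normalization would need a manual look; the others are candidates for an automatic fix.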

This dataset may be useful for classification experiments. I hope to do one more release in the future that fixes the 457 mismatched categories. On to experiments and learning now!