Inside Out

Notes on seeking wisdom and crafting software.

Simple Wikipedia dataset

We are releasing a cleaned and labeled dataset for the English Simple Wikipedia on GitHub. Every article is labeled with its categories. The categories dataset provides the name, URL, and relationships for each category. The data is derived from the upstream Wikipedia dumps. See the readme for details of the sources and the scripts used for cleanup.
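To give a feel for how the article labels and the category relationships might fit together, here is a minimal sketch. It uses a few toy records in memory instead of the released files, and the field names (`name`, `url`, `parents`, `title`, `categories`) as well as the helpers `ancestors` and `expanded_labels` are assumptions made up for illustration; the readme describes the actual layout. The idea is to expand each article's labels with every ancestor category, which can be handy when the direct categories are too fine-grained to use as classes.

```python
# Toy records with hypothetical field names; the real schema is in the readme.
categories = [
    {"name": "Science", "url": "https://simple.wikipedia.org/wiki/Category:Science", "parents": []},
    {"name": "Physics", "url": "https://simple.wikipedia.org/wiki/Category:Physics", "parents": ["Science"]},
    {"name": "Optics", "url": "https://simple.wikipedia.org/wiki/Category:Optics", "parents": ["Physics"]},
]
articles = [
    {"title": "Lens", "categories": ["Optics"]},
    {"title": "Gravity", "categories": ["Physics"]},
]

# Map each category name to the names of its direct parents.
parents = {c["name"]: c["parents"] for c in categories}


def ancestors(name):
    """Return every category reachable from `name` by following parent links."""
    seen = set()
    stack = list(parents.get(name, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # the seen set also guards against cycles
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def expanded_labels(article):
    """Return the article's direct categories plus all of their ancestors."""
    labels = set(article["categories"])
    for name in article["categories"]:
        labels |= ancestors(name)
    return labels


for article in articles:
    print(article["title"], sorted(expanded_labels(article)))
```

Whether the expanded labels or only the direct ones make better targets probably depends on the experiment, so treat this as a starting point.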

Some quick stats

This dataset may be useful in classification experiments. I hope to do one more release in the future that fixes the 457 errors. On to experiments and learning now!