ClueWeb12 LDA Models

We used Mammoth to train a 1000-topic LDA model on the full 700M-document ClueWeb12 collection using a truncated vocabulary of 100,000 terms.

The trained model is available for download here (609MB gzipped, 2GB uncompressed). The file is laid out as follows (illustrative values):

0 0.001 0.002 0.00001 0.3 0.001
1 0.500 0.698 0.99998 0.1 0.899
2 0.499 0.300 0.00001 0.2 0.100

Features (terms) are rows and topics are columns. The first column of each row gives the feature number, which corresponds to an entry in the dictionary. Each remaining column gives the probability of that feature under the corresponding topic, so the probabilities in each topic column sum to 1 over all features.
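As a rough illustration of the layout described above, the following Python sketch reads rows of the form "feature number, then one probability per topic" and accumulates the per-topic column sums as a sanity check. It assumes whitespace-separated fields; the function name column_sums and the toy values are made up for this example and are not part of the released model.

```python
import io


def column_sums(lines):
    """Read 'feature_id p_topic0 p_topic1 ...' rows.

    Returns (feature_ids, sums), where sums[k] is the total probability
    mass in topic column k. Assumes whitespace-separated fields.
    """
    feature_ids = []
    sums = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        feature_ids.append(int(fields[0]))
        probs = [float(x) for x in fields[1:]]
        if sums is None:
            sums = [0.0] * len(probs)
        elif len(probs) != len(sums):
            raise ValueError("row has a different number of topic columns")
        for k, p in enumerate(probs):
            sums[k] += p
    return feature_ids, (sums if sums is not None else [])


# Toy input in the same layout: 3 features x 2 topics (made-up values).
toy = io.StringIO("0 0.2 0.5\n1 0.3 0.25\n2 0.5 0.25\n")
ids, sums = column_sums(toy)
print(ids, sums)
```

Because the function only needs an iterable of text lines, the gzipped download could be streamed without decompressing it first by passing a file object opened with gzip.open(path, "rt"), where path is wherever the file was saved locally.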

If you would like to refer to the dataset, it was originally described and used in: