Written Language Identification

In our recent paper on language identification (LID) in Twitter (Bergsma et al., 2012), we built LID systems and tested them on collections of tweets. We are making our datasets available, along with a Python-based classification tool that uses compression language models.


Our LSM-12 paper reported on experiments in three writing systems (Arabic, Cyrillic, and Devanagari), using three different languages per writing system. Our data, complete with train/dev/test splits, is available as a compressed tarball: shared.twitter.lsm12.tar.gz

Because we understand Twitter's Terms of Service to preclude redistributing the original tweet text, we are instead releasing the data as lists of tweet IDs. The 'Readme' file contains details.

To obtain the actual tweets, you may (or may not) find this tool helpful; it was used at the TREC 2011 Microblog track:

LID Tool

We are still packaging our Python-based compression language model tool and will post it here in the very near future.
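
For readers unfamiliar with compression-based classification, the general idea can be illustrated with a minimal sketch. This is not the tool from the paper: it substitutes Python's standard zlib compressor for a proper compression language model, and the function names (compressed_size, identify) and the two tiny training samples are made up for illustration. A text is assigned to the language whose training data lets the compressor encode it in the fewest additional bytes.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to encode `text` (UTF-8, maximum compression)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def identify(text: str, training: dict[str, str]) -> str:
    """Return the label whose training text makes `text` cheapest to encode."""

    def extra_bytes(corpus: str) -> int:
        # Approximate cost of the new text given the corpus:
        # size(corpus + text) - size(corpus).
        return compressed_size(corpus + " " + text) - compressed_size(corpus)

    return min(training, key=lambda label: extra_bytes(training[label]))


# Hypothetical toy training data; real use needs far more text per language.
TRAINING = {
    "ru": "съешь же ещё этих мягких французских булок да выпей чаю",
    "uk": "чуєш їх, доцю, га? кумедна ж ти, прощайся без ґольфів",
}

if __name__ == "__main__":
    print(identify("французских булок", TRAINING))
```

Note that zlib only looks back over a 32 KB window, so this stand-in cannot exploit a large training corpus the way a dedicated compression language model can; it is meant only to show the decision rule.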