Lexical Normalisation of Twitter Data

Reading Time: 1 minute

Twitter, with over 500 million users globally, generates over 100,000 tweets per minute. The 140-character limit per tweet, perhaps unintentionally, encourages users to adopt shorthand notation and to strip words down to their bare minimum “syllables” or elisions, e.g. “srsly”.

The analysis of Twitter messages, which typically contain misspellings, elisions, and grammatical errors, poses a challenge to established Natural Language Processing (NLP) tools, which are generally designed on the assumption that the input conforms to the basic grammatical structure of the English language.

In order to make sense of Twitter messages, it is necessary to first transform them into a canonical form consistent with dictionary spelling and standard grammar. This process, performed at the level of individual tokens (“words”), is called lexical normalisation. This paper investigates various techniques for lexical normalisation of Twitter data and presents the findings from applying those techniques to raw Twitter data.
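To make the idea of token-level normalisation concrete, here is a minimal sketch (not the method evaluated in the paper): a lookup table for known shorthand, with a fuzzy fallback match against a dictionary. The SLANG and VOCAB tables are hypothetical, illustrative examples.

```python
import difflib

# Illustrative shorthand lookup -- a real system would curate or learn a much larger table.
SLANG = {"srsly": "seriously", "u": "you", "gr8": "great", "pls": "please"}

# Tiny stand-in vocabulary; in practice this would be a full dictionary word list.
VOCAB = {"seriously", "you", "great", "please", "see", "tomorrow", "this", "is"}

def normalise_token(token: str) -> str:
    """Map a single token to a canonical form, if one can be found."""
    lower = token.lower()
    if lower in VOCAB:
        return lower                      # already a dictionary word
    if lower in SLANG:
        return SLANG[lower]               # direct shorthand lookup
    # Fall back to the closest dictionary word by character similarity.
    matches = difflib.get_close_matches(lower, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else token

def normalise_tweet(text: str) -> str:
    """Normalise a tweet token by token."""
    return " ".join(normalise_token(t) for t in text.split())

if __name__ == "__main__":
    print(normalise_tweet("srsly u r gr8"))  # -> "seriously you r great" ("r" stays unresolved)
```

Real systems replace the lookup and fuzzy match with statistical or character-level models, but the token-in, canonical-token-out structure is the same.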

Further reading: The complete paper is available here.

