Natural language text dominates the information available on the Web. Yet the language of online expression often differs substantially, in both style and substance, from the language found in more traditional sources such as news. Making natural language processing techniques robust to this sort of variation is thus important for applications to behave intelligently when presented with Web text.
This talk presents new research applying two sequence prediction tasks—part-of-speech tagging and named entity detection—to text from online social media platforms (Twitter and Wikipedia). For both tasks, we adapt standard forms of annotation to better suit the linguistic and topical characteristics of the data. We also propose techniques to elicit more accurate statistical taggers, including linguistic features inspired by the domain (for part-of-speech tagging of Twitter messages) as well as modifications to the learning algorithm (for named entity detection in Arabic Wikipedia).