Hyderabad: Researchers from the Language Technologies Research Centre (LTRC) at the International Institute of Information Technology, Hyderabad have made what they call a first-of-its-kind attempt at semantic role labelling of Hindi-English code-mixed tweets.
This could enable various applications that require information extraction, such as chatbots responding to multilingual queries (in other words, question-answering systems) or document classification.
The research was presented and highly appreciated at the Linguistic Annotation Workshop during the annual conference of the Association for Computational Linguistics in Florence, Italy earlier this month, IIIT-H officials said.
The research would help in the identification, classification, or analysis of tweets like ‘Chalo jaldi karo, or we’ll miss the beginning of the movie’.
The sentence is a mix of English and Hindi, or ‘Hinglish’ as it is more popularly known. This form of code-mixing, the embedding of linguistic units such as phrases, words and morphemes of one language into an utterance of another, happens all the time in multilingual countries and is rampant both in speech and on social media.
“We speak in code-mixed language and if we want to develop computational models for processing natural language, we also need to handle it. The modern world of social media has lots of code-mixed language. So if we want to do any parsing or analysis there, the first thing we have to do is handle the code-mixed data. That was our initial interest and motivation and that’s why our lab started working on it sometime back,” said Prof. Dipti Misra Sarma of the LTRC.
Prof. Misra and her student Riya Pal have been conducting research on semantic role labelling of Hinglish tweets in particular. Riya, a final-year dual-degree student, said semantic role labelling is used in any application that requires an accurate understanding of text before information is extracted from it.
“Take the example of a chatbot. It needs to understand data first before extracting information from it. Let’s say there is a sentence like, I go to school. For a machine to understand this, it extracts the action “go”, “who” is going and “where”. These become the labels for that sentence.”
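The labels described above can be sketched as a toy data structure. This is a minimal, hypothetical illustration only: the function name, role names, and hard-coded output below are not the authors' actual annotation scheme, and real semantic role labellers use far richer label inventories.

```python
# Hypothetical sketch of semantic role labelling output for the article's
# example sentence "I go to school". Labels and names are illustrative only.

def label_roles(sentence):
    """Toy labeller: returns hard-coded roles for the one example sentence."""
    # A real SRL system would identify the predicate and its arguments
    # automatically; here the roles are hard-coded to show the output shape.
    toy_annotations = {
        "I go to school": {
            "predicate": "go",        # the action
            "agent": "I",             # who is going
            "destination": "school",  # where
        }
    }
    return toy_annotations.get(sentence)

roles = label_roles("I go to school")
print(roles["predicate"])  # go
```

The point of the structure is that once the action and its participants are labelled, a downstream system such as a chatbot can answer questions like “who is going?” by looking up the corresponding role.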
Riya went on to create a dataset of Hinglish tweets, manually labelling about 1,500 of them. The next step will be to use machine learning techniques to build an automated tool that does the same labelling, thereby improving the model’s accuracy, she said, adding that this was the first attempt so far at creating such an NLP tool for Hinglish data.
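A manually labelled code-mixed tweet in such a dataset might be represented as a record like the one below, using the article's own example tweet. This is a hypothetical sketch: the actual annotation format, field names, language tags, and role labels used in the dataset are not described in the article.

```python
# Hypothetical record for one annotated code-mixed tweet.
# Field names, language tags ("hi" = Hindi, "en" = English) and role
# labels are illustrative only, not the dataset's real format.
annotated_tweet = {
    "text": "Chalo jaldi karo, or we'll miss the beginning of the movie",
    "tokens": [
        ("Chalo", "hi"), ("jaldi", "hi"), ("karo", "hi"),
        ("or", "en"), ("we'll", "en"), ("miss", "en"),
        ("the", "en"), ("beginning", "en"), ("of", "en"),
        ("the", "en"), ("movie", "en"),
    ],
    # One entry per predicate, with its labelled arguments.
    "predicates": [
        {"predicate": "miss",
         "agent": "we",
         "theme": "the beginning of the movie"},
    ],
}
print(len(annotated_tweet["tokens"]))  # 11
```

Annotating each token with a language tag as well as a semantic role is one plausible design, since a labeller for code-mixed text must first know which language each word belongs to.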
After English and Hindi, the next step would be to explore code-mixing across three languages, Prof. Misra said.