Charangan Vasantharajan

Student

I am a third-year student in the Department of Computer Science and Engineering at the University of Moratuwa. I am a self-motivated individual with problem-solving and software engineering skills. I am interested in exploring Machine Learning, Natural Language Processing, and Data Science, and I am eager to learn new technologies.

Currently, I am working on Tamizhi-Net OCR, a tool that extracts text from scanned PDFs and images. I have also published a paper on offensive language identification in code-mixed comments and posts in Dravidian languages (Tamil-English, Malayalam-English, and Kannada-English) collected from social media.

Brief Description of Project

Hypers@DravidianLangTech-EACL2021: Offensive language identification in Dravidian code-mixed YouTube Comments and Posts

Code-mixed offensive content has been used pervasively in social media posts over the last few years. Consequently, it has gained significant attention from the research community, which has worked on identifying the different forms of such content (e.g., hate speech and sentiments) and contributed to the creation of datasets. Most recent studies deal with high-resource languages (e.g., English) because many datasets are publicly available, whereas the lack of datasets for low-resource languages has limited work on them. Therefore, this study focuses on offensive language identification in code-mixed, low-resource Dravidian languages (Tamil, Kannada, and Malayalam) using a bidirectional approach and fine-tuning strategies. According to the leaderboard, the proposed model achieved an F1-score of 0.96 for Malayalam, 0.73 for Tamil, and 0.70 for Kannada on the benchmark. Moreover, among multilingual models, our model ranked 3rd and was confirmed as the best among all systems submitted to these shared tasks in these three languages.

Link to Paper: https://www.aclweb.org/anthology/2021.dravidianlangtech-1.26/

Tamizhi-Net OCR

Tamizhi-Net OCR is a tool that extracts text from scanned PDFs and images. The system covers the Tamil, Sinhala, and English languages. We use the Google Tesseract OCR engine with tessdata-best as the pretrained model for all three languages. Further, the model was trained on more than 200 fonts to adapt it to a wide range of documents. We hope that Tamizhi-Net OCR will spur the development of an open-source text extraction system for Tamil, Sinhala, and English scanned documents.
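As a rough illustration of how Tesseract can be asked to read all three languages at once, here is a minimal Python sketch. It assumes the pytesseract wrapper, Pillow, and the stock tam, sin, and eng traineddata files are installed. The function names are hypothetical, and this is not the Tamizhi-Net code itself, which uses its own font-adapted models.

```python
from pathlib import Path

# Tesseract traineddata codes for the three languages the system covers.
LANG_CODES = {"tamil": "tam", "sinhala": "sin", "english": "eng"}


def tesseract_lang(languages):
    """Join language names into Tesseract's 'tam+sin+eng' form."""
    try:
        return "+".join(LANG_CODES[name.lower()] for name in languages)
    except KeyError as err:
        raise ValueError(f"unsupported language: {err.args[0]}") from None


def extract_text(image_path, languages=("tamil", "sinhala", "english")):
    """Run Tesseract on one scanned image and return the recognised text.

    Hypothetical helper: assumes pytesseract and Pillow are installed and
    that the Tesseract binary can find the matching traineddata files.
    """
    # Imported lazily so tesseract_lang() works without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(Path(image_path)) as image:
        return pytesseract.image_to_string(image, lang=tesseract_lang(languages))
```

Calling extract_text("scan.png") would return the text of a single page image. Scanned PDFs would first need to be converted to images, a step this sketch leaves out.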

Demo: http://ec2-18-118-18-103.us-east-2.compute.amazonaws.com/