Hypers@DravidianLangTech-EACL2021: Offensive language identification in Dravidian code-mixed YouTube Comments and Posts
Code-mixed offensive content has become pervasive in social media posts over the last few years. Consequently, it has attracted significant attention from the research community, which has worked on identifying the different forms of such content (e.g., hate speech and sentiment) and contributed to the creation of datasets. Most recent studies deal with high-resource languages (e.g., English) because many datasets are publicly available; because datasets are lacking for low-resource languages, those languages have received little attention. Therefore, this study focuses on offensive language identification in code-mixed, low-resource Dravidian languages (Tamil, Kannada, and Malayalam) using a bidirectional approach and fine-tuning strategies. According to the leaderboard, the proposed model achieved an F1-score of 0.96 for Malayalam, 0.73 for Tamil, and 0.70 for Kannada on the benchmark. Moreover, viewed as a multilingual model, our model ranked 3rd and was confirmed as the best among all systems submitted to these shared tasks in these three languages.
Link to Paper: https://www.aclweb.org/anthology/2021.dravidianlangtech-1.26/
Tamizhi-Net OCR
Tamizhi-Net OCR is a tool that extracts text from scanned PDFs and images. The system covers the Tamil, Sinhala, and English languages. We use the Google Tesseract OCR engine, with tessdata-best as the pretrained model for all three languages. The model was further trained on more than 200 fonts to adapt it to a wide range of documents. We hope that Tamizhi-Net OCR will spur the development of open-source text extraction systems for scanned Tamil, Sinhala, and English documents.
Demo: http://ec2-18-118-18-103.us-east-2.compute.amazonaws.com/
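As a rough illustration of the kind of Tesseract call a tool like this might wrap, the sketch below builds a command line for the stock Tesseract CLI with the Tamil, Sinhala, and English language codes (tam, sin, eng). This is not the Tamizhi-Net OCR code: the function names, the file names, and the tessdata directory are hypothetical, and the sketch assumes the tesseract executable and the relevant traineddata files are installed. Tesseract reads images rather than PDFs, so scanned PDF pages would need to be converted to images first.

```python
import shutil
import subprocess
from pathlib import Path

# Tesseract language codes for Tamil, Sinhala, and English.
LANGS = ("tam", "sin", "eng")


def build_tesseract_command(image_path, output_base, tessdata_dir=None, langs=LANGS):
    """Return the argument list for one Tesseract CLI invocation.

    Hypothetical helper for illustration; not part of Tamizhi-Net OCR.
    """
    # Languages are joined with "+" so Tesseract loads all of them together.
    cmd = ["tesseract", str(image_path), str(output_base), "-l", "+".join(langs)]
    if tessdata_dir is not None:
        # Point Tesseract at a specific model directory (assumed path).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def extract_text(image_path, output_base, tessdata_dir=None):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, tessdata_dir)
    subprocess.run(cmd, check=True)
    # Tesseract writes plain-text output to "<output_base>.txt".
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(" ".join(build_tesseract_command("scan.png", "scan_out")))
```

Keeping the command construction separate from its execution makes the language and model settings easy to inspect without having Tesseract installed.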