The paper documents the methods used to prepare and select the Wikipedia text in each language, to collect attested romanizations for sampled lexicons, and to manually romanize held-out sentences from the native script collections. It also provides baseline results on several tasks made possible by the dataset, including single-word transliteration, full-sentence transliteration, and language modeling of native script and romanized text.
Language - Tamil
Reference - https://www.aclweb.org/anthology/2020.lrec-1.294, https://github.com/google-research-datasets/dakshina
License Type - CC BY-SA 4.0
Citation -
@inproceedings{roark-etal-2020-processing,
title = "Processing {South} {Asian} Languages Written in the {Latin} Script: the {Dakshina} Dataset",
author = "Roark, Brian and
Wolf-Sonkin, Lawrence and
Kirov, Christo and
Mielke, Sabrina J. and
Johny, Cibu and
Demir{\c{s}}ahin, I{\c{s}}in and
Hall, Keith",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference (LREC)",
year = "2020",
url = "https://www.aclweb.org/anthology/2020.lrec-1.294",
pages = "2413--2423"
}