Description
SiTa is an in-the-wild dataset for audio-only speaker diarization in Sinhala and Tamil. It addresses the scarcity of resources for these low-resource South Asian languages by providing authentic conversational audio sourced from publicly available YouTube videos. The dataset is divided into two subsets, Sinhala and Tamil, featuring multi-speaker dialogues with significant speech activity and overlapping speech.
Features
- Languages: Sinhala and Tamil
- Data Format: Mono-channel WAV (16 kHz sampling rate)
- Annotations: Speaker labels and timestamps provided as RTTM files
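As a minimal sketch of how these files might be read, the snippet below parses SPEAKER records from RTTM text and checks that a WAV file matches the stated format. It assumes the annotations follow the standard 10-field RTTM layout (type, file ID, channel, onset, duration, then speaker name in the eighth field). The names `Segment`, `parse_rttm`, and `is_mono_16k_wav`, as well as the example file and speaker IDs, are hypothetical and not part of the dataset.

```python
import wave
from typing import List, NamedTuple


class Segment(NamedTuple):
    file_id: str
    start: float     # onset in seconds
    duration: float  # length in seconds
    speaker: str


def parse_rttm(text: str) -> List[Segment]:
    """Parse SPEAKER records from RTTM-formatted text."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed standard RTTM layout:
        # type file chan onset dur ortho stype name conf slat
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-SPEAKER records
        segments.append(
            Segment(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return segments


def is_mono_16k_wav(path: str) -> bool:
    """Check that a WAV file is single-channel with a 16 kHz sampling rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000


# Hypothetical example rows; the file and speaker IDs are made up.
example = """\
SPEAKER video_001 1 0.50 3.20 <NA> <NA> spk_A <NA> <NA>
SPEAKER video_001 1 3.10 2.00 <NA> <NA> spk_B <NA> <NA>
"""
segments = parse_rttm(example)
```

Each parsed segment gives a speaker label with an onset and duration, so the end time of a turn is `start + duration`.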
Dataset Composition
Sinhala
- Videos: 60
- Total Duration: ~600 minutes
- Number of Speakers: 1–10
Tamil
- Videos: 14
- Total Duration: ~120 minutes
- Number of Speakers: 2–6
Categories
The dataset includes YouTube videos carefully selected to encompass various conversational contexts, such as:
- Intellectual Discussions
- Education
- Morning Shows
- Cooking
- Celebrity Interviews
- Political Debates
- Children’s Programs
- Business
Applications
- Speech Processing: Benchmarking and evaluating speaker diarization systems.
- Linguistic Research: Analyzing conversational dynamics and patterns in Sinhala and Tamil.
- ASR Development: Enhancing automatic speech recognition (ASR) systems in multilingual environments.
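For the conversational-dynamics use case, one simple starting point is to measure how much of a recording contains speech and how much contains overlapping speech. The sketch below does this with a sweep over turn boundaries. It is an illustration and not the authors' tooling: the function name `speech_and_overlap` and the example turns are hypothetical, and it assumes each turn is given as a `(start, end)` pair in seconds and that turns from the same speaker do not overlap one another.

```python
from typing import List, Tuple


def speech_and_overlap(turns: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (seconds with at least one active speaker,
    seconds with at least two active speakers)."""
    events = []
    for start, end in turns:
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker stops
    # At equal timestamps, -1 sorts before +1, so turns that merely
    # touch end-to-start are not counted as overlapping.
    events.sort()

    speech = 0.0
    overlap = 0.0
    active = 0
    prev = None
    for time, delta in events:
        if prev is not None and active >= 1:
            speech += time - prev
            if active >= 2:
                overlap += time - prev
        active += delta
        prev = time
    return speech, overlap


# Hypothetical turns: two overlap between 3.0 s and 4.0 s, then a gap.
speech, overlap = speech_and_overlap([(0.0, 4.0), (3.0, 6.0), (8.0, 10.0)])
print(speech, overlap)  # 8.0 1.0
```

Dividing the overlap time by the speech time gives an overlap ratio that can be compared across recordings or categories.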
Conference & Workshop Information
- Conference: Co-located with the 31st International Conference on Computational Linguistics (COLING 2025)
- Workshop: CHiPSAL: Challenges in Processing South Asian Languages
Dataset Access & License
- GitHub Repository: SiTa GitHub Repository
- License: The dataset is licensed under an extended version of the CC BY-NC 4.0 license. You can access the full license terms here.
- Dataset Download: Download Here