SiTa - Sinhala and Tamil Speaker Diarization Dataset in the Wild


Description
SiTa is an in-the-wild dataset designed for audio-only speaker diarization in the Sinhala and Tamil languages. It addresses the shortage of diarization resources for low-resource South Asian languages by providing authentic conversational audio sourced from publicly available YouTube videos. The dataset is divided into two subsets, Sinhala and Tamil, each featuring multi-speaker dialogues with significant speech activity and overlapping speech.


Features

  • Languages: Sinhala and Tamil
  • Data Format: Mono-channel WAV (16 kHz sampling rate)
  • Annotations: Speaker labels and timestamps provided as RTTM files
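
To illustrate how the audio and annotation formats above can be consumed, here is a minimal Python sketch that uses only the standard library. It assumes the RTTM files follow the common ten-field `SPEAKER` line layout (file ID, channel, onset, duration, and speaker name in fields 2, 3, 4, 5, and 8). The file ID `video_001`, the speaker names, and the sample lines are made up for illustration and are not taken from the dataset.

```python
import wave
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    file_id: str
    speaker: str
    start: float  # seconds
    end: float    # seconds


def parse_rttm(text: str) -> list[Turn]:
    """Parse RTTM text, keeping only SPEAKER lines."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank, malformed, or non-speaker lines
        onset, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(fields[1], fields[7], onset, onset + duration))
    return turns


def speaking_time(turns: list[Turn]) -> dict[str, float]:
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn.speaker] += turn.end - turn.start
    return dict(totals)


def is_mono_16k(path: str) -> bool:
    """Check that a WAV file is mono with a 16 kHz sampling rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000


# Hypothetical RTTM content (not real SiTa data).
SAMPLE_RTTM = """\
SPEAKER video_001 1 0.00 4.50 <NA> <NA> spk_A <NA> <NA>
SPEAKER video_001 1 4.00 3.00 <NA> <NA> spk_B <NA> <NA>
SPEAKER video_001 1 7.50 2.50 <NA> <NA> spk_A <NA> <NA>
"""

turns = parse_rttm(SAMPLE_RTTM)
totals = speaking_time(turns)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.2f} s")
```

In this made-up sample the first two turns overlap between 4.00 s and 4.50 s, which is how overlapping speech appears in RTTM: as separate lines whose time ranges intersect.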

Dataset Composition

Sinhala

  • Videos: 60
  • Total Duration: ~600 minutes
  • Number of Speakers: 1–10

Tamil

  • Videos: 14
  • Total Duration: ~120 minutes
  • Number of Speakers: 2–6

Categories

The dataset includes YouTube videos carefully selected to encompass various conversational contexts, such as:

  • Intellectual Discussions
  • Education
  • Morning Shows
  • Cooking
  • Celebrity Interviews
  • Political Debates
  • Children’s Programs
  • Business

Applications

  • Speech Processing: Benchmarking and evaluating speaker diarization systems.
  • Linguistic Research: Analyzing conversational dynamics and patterns in Sinhala and Tamil.
  • ASR Development: Enhancing automatic speech recognition (ASR) systems in multilingual environments.
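
As a rough illustration of the benchmarking use case, the sketch below computes a simplified, frame-based diarization error rate (DER): missed speech, false alarms, and speaker confusion divided by total reference speech. It is not the scoring procedure used by the dataset authors. It assumes 10 ms frames, applies no forgiveness collar, and finds the speaker mapping by brute force, which is only practical for a handful of speakers. All names and the example turns are hypothetical. For reportable results, an established scoring tool is the safer choice.

```python
from collections import Counter
from itertools import permutations


def active_frames(turns, step=0.01):
    """Map each speaker to the set of frame indices in which they speak."""
    frames = {}
    for speaker, start, end in turns:
        first, last = round(start / step), round(end / step)
        frames.setdefault(speaker, set()).update(range(first, last))
    return frames


def diarization_error_rate(reference, hypothesis, step=0.01):
    """Simplified frame-based DER; turns are (speaker, start, end) tuples."""
    ref = active_frames(reference, step)
    hyp = active_frames(hypothesis, step)
    ref_names = list(ref)
    # Pad with None so a reference speaker may remain unmatched.
    hyp_names = list(hyp) + [None] * len(ref_names)

    def matched(mapping):
        return sum(len(ref[r] & hyp[h]) for r, h in mapping if h is not None)

    # One-to-one speaker mapping that maximizes the matched frames.
    best = max(
        (list(zip(ref_names, perm))
         for perm in permutations(hyp_names, len(ref_names))),
        key=matched,
    )

    ref_count, hyp_count, correct = Counter(), Counter(), Counter()
    for frames in ref.values():
        ref_count.update(frames)
    for frames in hyp.values():
        hyp_count.update(frames)
    for r, h in best:
        if h is not None:
            correct.update(ref[r] & hyp[h])

    total = sum(ref_count.values())
    if total == 0:
        raise ValueError("reference contains no speech")

    missed = false_alarm = confusion = 0
    for frame in ref_count.keys() | hyp_count.keys():
        n_ref, n_hyp = ref_count[frame], hyp_count[frame]
        missed += max(0, n_ref - n_hyp)
        false_alarm += max(0, n_hyp - n_ref)
        confusion += min(n_ref, n_hyp) - correct[frame]
    return (missed + false_alarm + confusion) / total


# Hypothetical example: the system switches speakers one second too late.
reference = [("A", 0.0, 5.0), ("B", 5.0, 10.0)]
hypothesis = [("x", 0.0, 6.0), ("y", 6.0, 10.0)]
der_value = diarization_error_rate(reference, hypothesis)
print(f"DER = {der_value:.3f}")
```

In this example, one second of speaker B is attributed to the wrong speaker out of ten seconds of reference speech, so the only error is confusion.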

Conference & Workshop Information

  • Conference:
    Co-located with the 31st International Conference on Computational Linguistics (COLING 2025)

  • Workshop:
    CHiPSAL: Challenges in Processing South Asian Languages


Dataset Access & License

  • GitHub Repository:
    SiTa GitHub Repository

  • License:
    The dataset is licensed under an extended version of the CC BY-NC 4.0 license. You can access the full license terms here.

  • Dataset Download:
    Download Here