
MITIGATING INFORMATION LOSS IN TEXT-INDEPENDENT SPEAKER VERIFICATION USING DEEP INFOMAX


Description

This project focuses on improving text-independent speaker verification, a method of confirming a speaker's identity regardless of what they say. Current systems often suffer from information loss during the pooling step, where variable-length sequences of audio frames are condensed into a fixed-size speaker embedding. Our work introduces a novel pooling method based on Deep InfoMax (DIM) that better preserves crucial local features, leading to more accurate and robust speaker embeddings.
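As a concrete illustration of the pooling step, below is a minimal PyTorch sketch of multi-head attentive pooling, the kind of mechanism used in the baseline. The class name and dimensions are our own assumptions for illustration, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttentivePooling(nn.Module):
    """Condense variable-length frame features (B, T, D) into a
    fixed-size utterance embedding (B, D) using attention weights."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attention = nn.Linear(feat_dim, num_heads)        # one score per frame per head
        self.proj = nn.Linear(feat_dim * num_heads, feat_dim)  # merge heads back to feat_dim

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.attention(frames), dim=1)  # (B, T, H), normalised over time
        pooled = torch.einsum('bth,btd->bhd', weights, frames)  # weighted sum per head
        return self.proj(pooled.flatten(start_dim=1))           # (B, D)

# Example: pool a batch of 100-frame, 256-dimensional feature sequences.
embedding = MultiHeadAttentivePooling(256, num_heads=4)(torch.randn(8, 100, 256))
print(embedding.shape)  # torch.Size([8, 256])
```

Note that every frame contributes only through the weighted sum; local detail not captured by the attention weights is discarded, which is exactly the information loss the DIM objective targets.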

Motivation and Research Objective

Motivation

Speaker verification plays a vital role in security systems, customer service, fraud detection, and forensics. Text-independent systems offer flexibility but face challenges due to speech variability. Addressing the information loss in attention-based pooling is key to improving verification accuracy and real-world reliability.

Objectives

  • Analyze the limitations of existing attention-based pooling methods.

  • Propose a new DIM-based pooling method to retain richer feature information.

  • Train and evaluate the model using benchmark datasets.

  • Release a Python library with pre-trained models for easy integration into speaker verification systems.

Results and Impact

Our research focused on improving speaker verification through better pooling based on Deep InfoMax (DIM). Performance was evaluated using the Equal Error Rate (EER), where a lower EER indicates better accuracy.
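For reference, the EER is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The sketch below shows one common way to compute it from verification scores; the scores here are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: where the false acceptance rate (FAR) crosses the
    false rejection rate (FRR = 1 - TPR)."""
    far, tpr, _ = roc_curve(labels, scores)
    frr = 1.0 - tpr
    idx = np.argmin(np.abs(far - frr))        # closest point to the crossing
    return float((far[idx] + frr[idx]) / 2)   # average the two rates there

# Synthetic similarity scores: label 1 = same speaker, 0 = different speaker.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.6, 0.1, 1000),    # genuine trials
                         rng.normal(0.3, 0.1, 1000)])   # impostor trials
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER = {100 * equal_error_rate(scores, labels):.3f}%")
```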


Key Results

  • Increasing attention heads from 1 to 32 in the baseline model steadily reduced EER (from 1.877% to 1.336%), showing that multi-head attention improves discriminative power.

  • DIM integration further improved results, especially at 4 and 16 attention heads, achieving EERs as low as 1.389%, outperforming the corresponding baseline configurations.

  • Optimal hyperparameters were α = 0.01 and β = 0.05, with β showing the stronger impact, confirming the importance of preserving local information (a sketch of how these weights might enter the objective follows this list).

  • A slight performance dip was observed at 32 heads in the DIM model, likely due to increased model complexity and overfitting.

  • Threshold tuning showed that DIM-based models require finer-grained decision thresholds for match classification, with tuned thresholds ranging from 0.37 to 0.40.
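As a rough sketch of how the α and β weights might enter training, the function below combines a speaker-classification loss with DIM-style global and local mutual-information terms. The names and formulation are our own assumptions, not the project's exact objective:

```python
import torch

def combined_loss(speaker_loss: torch.Tensor,
                  global_mi_loss: torch.Tensor,
                  local_mi_loss: torch.Tensor,
                  alpha: float = 0.01,
                  beta: float = 0.05) -> torch.Tensor:
    """Hypothetical combined objective: the speaker-classification loss
    plus DIM-style mutual-information regularizers, weighted by
    alpha (global term) and beta (local term)."""
    return speaker_loss + alpha * global_mi_loss + beta * local_mi_loss

# Illustrative usage with placeholder loss values.
loss = combined_loss(torch.tensor(2.31),    # e.g. cross-entropy over speakers
                     torch.tensor(0.85),    # global MI estimator loss
                     torch.tensor(1.12))    # local MI estimator loss
print(loss)  # tensor(2.3745)
```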


Impact

  • Our DIM-based approach helps retain critical speaker-specific details, leading to more accurate, robust speaker embeddings.

  • These findings contribute to improved biometric authentication systems, especially in text-independent, real-world scenarios.

  • The resulting Python library with pre-trained models will make it easy to deploy enhanced speaker verification across applications like voice security, fraud prevention, and smart devices.

Team members

Nirmal Sankalana

Department of Computer Science and Engineering,
University of Moratuwa, Sri Lanka.


Nipun Thejan

Department of Computer Science and Engineering,
University of Moratuwa, Sri Lanka.