End-to-end speaker model for text-independent speaker identification


To provide experimental evidence on datasets with different numbers of speakers, this paper considers the TIMIT (462 speakers, train chunk) and Librispeech (2484 speakers) datasets. The TIMIT dataset [28] has 630 speakers in total and is 419.81 MB in size. Librispeech, on the other hand, is around 6.3 GB (much larger than TIMIT), with 921 speakers and about 650,000 utterances, so handling and processing this dataset is computationally expensive. The silence at the beginning and end of each utterance was removed. Utterances with internal silences lasting more than 125 ms were split into multiple chunks. Five sentences per speaker were used for training, and the remaining three were used for testing. The training and testing tracks were picked randomly so as to use 12-15 seconds of training material for each speaker and test utterances lasting 2-6 seconds.
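The preprocessing described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the energy-based silence detector, the 5 ms frame size, the amplitude threshold, and the function names `split_on_silence` and `split_sentences` are all assumptions made for illustration. Only the 125 ms silence limit and the 5/3 train/test split come from the text.

```python
import random


def frame_is_silent(frame, threshold):
    """Assumed criterion: a frame is silent if its mean absolute amplitude is below threshold."""
    return sum(abs(x) for x in frame) / len(frame) < threshold


def split_on_silence(samples, sample_rate, max_silence_ms=125,
                     frame_ms=5, threshold=0.01):
    """Trim leading/trailing silence and split on internal silences
    longer than max_silence_ms. Returns a list of sample chunks.

    frame_ms and threshold are illustrative values, not from the paper.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_silent_frames = max_silence_ms // frame_ms

    # Label every frame as silent or voiced.
    flags = [frame_is_silent(samples[i:i + frame_len], threshold)
             for i in range(0, len(samples), frame_len)]

    chunks = []
    start = None        # index of the first voiced frame in the current chunk
    last_voiced = None  # index of the most recent voiced frame
    for idx, silent in enumerate(flags):
        if silent:
            continue
        if start is None:
            start = idx
        elif idx - last_voiced - 1 > max_silent_frames:
            # The silent gap exceeded the limit: close the current chunk.
            chunks.append(samples[start * frame_len:(last_voiced + 1) * frame_len])
            start = idx
        last_voiced = idx

    if start is not None:
        chunks.append(samples[start * frame_len:(last_voiced + 1) * frame_len])
    return chunks


def split_sentences(sentences, n_train=5, n_test=3, seed=0):
    """Randomly assign n_train sentences to training and n_test to testing.

    Random assignment and the fixed seed are assumptions for reproducibility.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]
```

In this sketch, silences shorter than the limit stay inside a chunk, while leading and trailing silence is dropped because each chunk runs from its first to its last voiced frame. A real pipeline would more likely rely on a dedicated voice activity detector than on a fixed amplitude threshold.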

Associated Publication: -

Paper Title: End To End Model For Speaker Identification With Minimal Training Data
Published in: 2021 Moratuwa Engineering Research Conference (MERCon)
Date of Conference: 27-29 July 2021
DOI: 10.1109/MERCon52712.2021.9525740

The research publication was also submitted to ICASSP 2020.

Citation: -

S. Balakrishnan, K. Jathusan and U. Thayasivam, "End To End Model For Speaker Identification With Minimal Training Data," 2021 Moratuwa Engineering Research Conference (MERCon), 2021, pp. 456-461, doi: 10.1109/MERCon52712.2021.9525740.