Improving End-to-End Models for Children’s Speech Recognition: Insights from Augmentation and Normalization Techniques
Children’s Speech Recognition (CSR) is challenging because children’s speech patterns are highly variable and annotated children’s speech data is scarce. Here, we focus on improving CSR in scenarios where no children’s speech data is available for training Automatic Speech Recognition (ASR) systems. Vocal Tract Length Normalization (VTLN) has traditionally been used in hybrid ASR systems to address the acoustic mismatch and variability in children’s speech when models are trained on adult speech. End-to-End (E2E) systems, by contrast, often rely on data augmentation methods to create child-like speech from adult speech.
In this work, we explore the efficacy of these augmentation methods combined with VTLN within an E2E framework for the CSR task, with a specific emphasis on Dutch. VTLN is applied at different stages of the ASR pipeline (training and/or test), and the analysis is broken down by age and gender. Experiments demonstrate that speed perturbation and spectral augmentation lead to significant performance improvements. Additionally, VTLN consistently contributes further improvements while maintaining recognition accuracy on adult speech. Notably, VTLN is more effective for female speakers and proves particularly effective for younger children. This research sheds light on methods and strategies for improving the performance of E2E ASR models in Children’s Speech Recognition.
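
As a rough illustration of the three techniques named above, the following minimal sketch shows a piecewise-linear VTLN frequency warp, speed perturbation by linear interpolation, and a SpecAugment-style frequency mask. It is not the implementation used in this work: the function names, the 8 kHz Nyquist frequency, the knee ratio of 0.85, and the warp and speed factors in the usage lines are all assumptions chosen for illustration.

```python
def vtln_warp(freq_hz, alpha, nyquist_hz=8000.0, knee_ratio=0.85):
    """Piecewise-linear VTLN warp (illustrative; all defaults are assumptions).

    Frequencies below the knee are scaled by the warp factor alpha; above
    the knee a second linear segment maps the Nyquist frequency to itself.
    """
    knee = knee_ratio * nyquist_hz * min(1.0, 1.0 / alpha)
    if freq_hz <= knee:
        return alpha * freq_hz
    # Linear segment from (knee, alpha * knee) to (nyquist, nyquist).
    slope = (nyquist_hz - alpha * knee) / (nyquist_hz - knee)
    return alpha * knee + slope * (freq_hz - knee)


def speed_perturb(samples, factor):
    """Resample a waveform by linear interpolation.

    factor > 1 shortens the signal (faster speech, higher pitch);
    factor < 1 lengthens it.
    """
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def mask_frequency_band(spectrogram, start_bin, width):
    """Zero out `width` consecutive frequency bins in every frame.

    `spectrogram` is a list of frames, each a list of per-bin values.
    """
    return [
        [0.0 if start_bin <= b < start_bin + width else v
         for b, v in enumerate(frame)]
        for frame in spectrogram
    ]


# Usage with hypothetical values.
warped = vtln_warp(1000.0, alpha=1.2)
faster = speed_perturb([float(i) for i in range(100)], factor=1.1)
masked = mask_frequency_band([[1.0, 1.0, 1.0, 1.0]], start_bin=1, width=2)
```

In practice the warp would be applied to the filterbank centre frequencies during feature extraction, and the mask position and width would be drawn at random for each utterance; both are omitted here to keep the sketch deterministic.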