ULTRAS – Unified Learning of Transformer Representations for Audio and Speech Signals

While self-supervised learning (SSL) has significantly advanced audio representation learning, current approaches remain largely specialised by domain. Speech-focused models like wav2vec 2.0 and HuBERT excel at tasks such as speech recognition by exploiting 1-D temporal structure, whereas general audio models like SSAST rely on 2-D spectro-temporal patch modelling for event classification. Because these 2-D spectrogram models struggle to generalise to speech-specific tasks, a critical gap remains in developing unified SSL frameworks that can jointly model both speech and naturalistic audio signals.

Figure. Block schematic of the proposed framework for joint 1-D and 2-D modelling of audio data. The gradient-coloured blocks are learnable, while the rest have no learnable parameters.

In work presented at the 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025), researchers from the LEAP Lab, Indian Institute of Science, including Ameenudeen P E, Charumathi Narayanan, and Sriram Ganapathy, proposed Unified Learning of Transformer Representations for Audio and Speech (ULTRAS). The approach jointly models the time–frequency attributes of the input acoustic signal, with masking and predictive modelling performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. Masked segments are predicted against both spectral and temporal targets using a combined loss function, which forces the representations to encode both time and frequency traits. In experiments on a variety of speech and audio tasks, the team reports that the ULTRAS framework improves performance over established baselines. Paper: https://arxiv.org/abs/2604.06702
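
To make the idea of masked prediction with a combined spectral and temporal loss more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the patch length, the mask ratio, the choice of targets (a per-patch frequency profile and a per-patch temporal envelope), the weight `alpha`, and all function names are hypothetical, and the transformer encoder itself is omitted.

```python
import numpy as np


def patchify(logmel, patch_frames):
    """Split a (n_mels, n_frames) log-mel spectrogram into non-overlapping
    patches covering all mel bins and `patch_frames` consecutive frames.
    Returns an array of shape (n_patches, n_mels, patch_frames)."""
    n_mels, n_frames = logmel.shape
    n_patches = n_frames // patch_frames
    trimmed = logmel[:, : n_patches * patch_frames]  # drop leftover frames
    return trimmed.reshape(n_mels, n_patches, patch_frames).transpose(1, 0, 2)


def make_targets(patches):
    """Assumed targets (not taken from the paper):
    spectral = average over time, one value per mel bin;
    temporal = average over frequency, one value per frame."""
    spectral = patches.mean(axis=2)  # (n_patches, n_mels)
    temporal = patches.mean(axis=1)  # (n_patches, patch_frames)
    return spectral, temporal


def sample_mask(n_patches, mask_ratio, rng):
    """Boolean mask marking which patches are hidden from the encoder."""
    n_masked = max(1, int(round(mask_ratio * n_patches)))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask


def combined_loss(pred_spec, pred_temp, tgt_spec, tgt_temp, mask, alpha=0.5):
    """Weighted sum of two mean-squared errors, computed on masked patches only."""
    spec_loss = np.mean((pred_spec[mask] - tgt_spec[mask]) ** 2)
    temp_loss = np.mean((pred_temp[mask] - tgt_temp[mask]) ** 2)
    return alpha * spec_loss + (1.0 - alpha) * temp_loss


rng = np.random.default_rng(0)
logmel = rng.standard_normal((80, 400))       # stand-in for real log-mel features
patches = patchify(logmel, patch_frames=20)   # (20, 80, 20)
tgt_spec, tgt_temp = make_targets(patches)
mask = sample_mask(len(patches), mask_ratio=0.4, rng=rng)

# Stand-in "model outputs": all zeros, in place of a trained network's predictions.
pred_spec = np.zeros_like(tgt_spec)
pred_temp = np.zeros_like(tgt_temp)
loss = combined_loss(pred_spec, pred_temp, tgt_spec, tgt_temp, mask)
print(f"combined loss on {mask.sum()} masked patches: {loss:.4f}")
```

In a real training loop, the zero predictions would be replaced by the outputs of two prediction heads on top of the encoder, and the loss would be minimised by gradient descent. Restricting the loss to masked patches is what makes the task predictive rather than a plain reconstruction.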