Context-Aware Speech Generation Using BiLSTM-Based Neural Networks

Reddy, K. Jeevan and Jahangir Badashah, Syed and Govind, S. Tharun and Dinesh, M. and Bindusri, K. (2025) Context-Aware Speech Generation Using BiLSTM-Based Neural Networks. International Journal of Innovative Science and Research Technology, 10 (7): 25jul358. pp. 270-277. ISSN 2456-2165

Abstract

Recent studies have shown that feed-forward deep neural networks (DNNs) outperform text-to-speech (TTS) systems based on decision-tree-clustered, context-dependent hidden Markov models (HMMs) [1, 4]. The feed-forward nature of DNN-based models, however, makes it difficult to incorporate long-span contextual influence into spoken utterances. A typical strategy in HMM-based TTS for producing a continuous speech trajectory is to constrain speech-parameter generation with dynamic features [2]. In this study, parametric text-to-speech synthesis is performed with time-aware memory network cells that capture the co-occurrence and correlation between any two points in a spoken phrase. Our experiments indicate that a hybrid of DNN and bidirectional long short-term memory recurrent neural network (BLSTM-RNN) layers is the best-performing system: the lower hidden layers use a simple one-way (feed-forward) structure, while the upper hidden layers use a bidirectional LSTM-RNN structure. On both objective and subjective metrics, this hybrid surpasses the conventional decision-tree-clustered HMM system and the DNN-based TTS system. Dynamic constraints become superfluous, since the BLSTM-RNN TTS already produces very smooth speech trajectories.
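To make the hybrid architecture concrete, the following is a minimal PyTorch-style sketch of the kind of model the abstract describes: lower feed-forward (DNN) layers followed by upper bidirectional LSTM layers, mapping frame-level linguistic features to acoustic parameters. All layer counts, layer sizes, and feature dimensions here are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class HybridDNNBLSTM(nn.Module):
    """Hypothetical sketch of a DNN + BLSTM-RNN acoustic model.
    Dimensions (355 linguistic, 127 acoustic) are assumed for
    illustration only."""

    def __init__(self, n_linguistic=355, n_acoustic=127,
                 dnn_units=512, lstm_units=256, n_dnn=2, n_blstm=2):
        super().__init__()
        # Lower layers: simple one-way (feed-forward) structure.
        dnn_layers, in_dim = [], n_linguistic
        for _ in range(n_dnn):
            dnn_layers += [nn.Linear(in_dim, dnn_units), nn.Tanh()]
            in_dim = dnn_units
        self.dnn = nn.Sequential(*dnn_layers)
        # Upper layers: bidirectional LSTM, modeling long-span context
        # in both directions of the utterance.
        self.blstm = nn.LSTM(input_size=dnn_units, hidden_size=lstm_units,
                             num_layers=n_blstm, batch_first=True,
                             bidirectional=True)
        # Output layer: acoustic parameters predicted per frame.
        self.out = nn.Linear(2 * lstm_units, n_acoustic)

    def forward(self, x):
        # x: (batch, frames, n_linguistic) frame-level linguistic features
        h = self.dnn(x)       # frame-wise feed-forward transform
        h, _ = self.blstm(h)  # sequence modeling across the utterance
        return self.out(h)    # (batch, frames, n_acoustic)

# Usage: one utterance of 300 frames with 355-dim linguistic features.
model = HybridDNNBLSTM()
feats = torch.randn(1, 300, 355)
acoustic = model(feats)       # -> torch.Size([1, 300, 127])
```

Because the BLSTM layers see the whole utterance in both directions, the predicted acoustic trajectory is already smooth across frames, which is why the paper argues that explicit dynamic-feature constraints are unnecessary.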

Documents

IJISRT25JUL358.pdf - Published Version (1MB)