
Prosody Analysis of Audiobooks: NLP Models for Enhanced Text-to-Speech

Research on predicting prosody attributes (pitch, volume, rate) from narrative text using NLP and language models, improving TTS for audiobook generation.


1. Introduction & Overview

This research paper, "Prosody Analysis of Audiobooks," addresses a critical gap in modern Text-to-Speech (TTS) systems: the inability to replicate the expressive, dramatic vocalizations characteristic of human-narrated audiobooks. While commercial TTS has achieved high naturalness in generic speech, it falters with narrative texts rich in dialogue, emotion, and description. The core thesis is that higher-order Natural Language Processing (NLP) analysis—specifically targeting character identification, dialogue, and narrative structure—can be leveraged to predict prosodic features (pitch, volume, speech rate) and significantly enhance synthetic audiobook quality.

The work presents a novel dataset of 93 aligned book-audiobook pairs and demonstrates that models trained on this data outperform a state-of-the-art commercial TTS baseline (Google Cloud TTS) in correlating with human prosody patterns.

Key figures:

  • 93 aligned book-audiobook pairs
  • 1,806 chapters analyzed
  • 22/24 test books with better pitch prediction than the commercial baseline
  • 23/24 test books with better volume prediction than the commercial baseline

2. Methodology & Dataset

2.1 Dataset Construction

The foundation of this research is a meticulously curated dataset comprising 93 novels and their corresponding human-read audiobooks. The dataset includes 1,806 chapters with sentence-level alignment between the text and audio, enabling precise analysis. This dataset has been made publicly available, providing a valuable resource for the speech and NLP communities. The alignment process is crucial for extracting accurate prosody labels (pitch, volume, rate) for each sentence in the text.

2.2 Prosody Attribute Extraction

From the aligned audiobooks, three core prosody attributes are extracted at the sentence level:

  • Pitch (F0): The fundamental frequency, indicating vocal cord vibration rate. Measured in Hertz (Hz).
  • Volume (Intensity/Energy): The amplitude or loudness of the speech signal. Measured in decibels (dB).
  • Rate (Speaking Rate): The speed of delivery, often measured in syllables per second.
These attributes serve as the target variables for the predictive models.
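The paper does not reproduce its extraction code, so the following is a minimal sketch of how these three sentence-level attributes could be computed from a time-aligned audio segment using librosa. The tooling choice, parameter values, and the vowel-group syllable counter are assumptions, not the authors' pipeline.

```python
# Hedged sketch: sentence-level prosody extraction with librosa (assumed tooling,
# not necessarily the authors' pipeline). The syllable counter is a crude placeholder.
import re
import numpy as np
import librosa

def count_syllables(text):
    # Approximate syllables as vowel groups per word (heuristic only).
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in text.split())

def extract_prosody(wav_path, start_s, end_s, sentence_text):
    """Return (pitch_hz, volume_db, rate_syl_per_s) for one time-aligned sentence."""
    y, sr = librosa.load(wav_path, sr=22050, offset=start_s, duration=end_s - start_s)

    # Pitch (F0): median over voiced frames estimated with pYIN.
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    pitch_hz = float(np.nanmedian(f0[voiced])) if np.any(voiced) else float("nan")

    # Volume: mean RMS energy expressed in decibels.
    volume_db = float(np.mean(librosa.amplitude_to_db(librosa.feature.rms(y=y))))

    # Rate: syllables per second over the sentence's aligned duration.
    rate = count_syllables(sentence_text) / max(end_s - start_s, 1e-6)
    return pitch_hz, volume_db, rate
```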

2.3 Model Architecture

The primary model is a Long Short-Term Memory (LSTM) network built upon MPNet (Masked and Permuted Pre-training for Language Understanding) sentence embeddings. MPNet provides rich contextual representations of the input text. The LSTM layer then models the sequential dependencies in the narrative to predict the continuous values for pitch, volume, and rate. This architecture is chosen for its ability to capture long-range contextual cues essential for narrative understanding.
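As a concrete illustration, here is a minimal sketch of the described architecture: MPNet sentence embeddings feeding a sequence LSTM whose per-sentence hidden states are projected to the three prosody values. The hidden size, layer count, and the `sentence-transformers` checkpoint `all-mpnet-base-v2` are assumptions, not the paper's reported configuration.

```python
# Minimal sketch (assumed hyperparameters): MPNet embeddings -> LSTM -> 3-dim prosody head.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class ProsodyLSTM(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3)  # [pitch, volume, rate]

    def forward(self, sentence_embeddings):          # (batch, seq_len, embed_dim)
        hidden_states, _ = self.lstm(sentence_embeddings)
        return self.head(hidden_states)              # one prediction per sentence

# Encode a short passage as a sequence of sentence embeddings, then predict prosody.
encoder = SentenceTransformer("all-mpnet-base-v2")   # MPNet-based sentence encoder
sentences = ["'I will never go back,' he said defiantly.", "The room fell silent."]
emb = torch.tensor(encoder.encode(sentences)).unsqueeze(0)   # (1, 2, 768)
prosody = ProsodyLSTM()(emb)                                  # (1, 2, 3)
```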

3. Key Findings & Analysis

3.1 Character-Level Prosody Patterns

A significant empirical finding is that human narrators systematically modulate prosody based on character attributes and narrative context. The analysis reveals:

  • In 21 out of 31 books where the two lead characters differ in gender, narrators used lower pitch and higher volume to portray the male character.
  • Narrators consistently use lower pitch in narrative regions compared to dialogue, independent of character gender.
This quantifies an implicit performance rule used by professional narrators, providing a clear signal for models to learn.

3.2 Model Performance vs. Commercial TTS

The proposed model's predicted prosody attributes show a significantly higher correlation with human readings than the default output of Google Cloud Text-to-Speech.

  • Pitch: The model's predictions correlated better with human reading in 22 out of 24 books in the test set.
  • Volume: The model's predictions correlated better in 23 out of 24 books.
This demonstrates the model's effectiveness in capturing nuanced human prosodic patterns that generic TTS systems miss.

4. Technical Implementation

4.1 Mathematical Formulation

The prosody prediction task is framed as a regression problem. Given an input sentence $S$ represented by its MPNet embedding $\mathbf{e}_S$, the model $f_\theta$ parameterized by $\theta$ predicts a prosody vector $\hat{\mathbf{p}}$:

$$\hat{\mathbf{p}} = [\hat{p}_{\text{pitch}}, \hat{p}_{\text{volume}}, \hat{p}_{\text{rate}}]^T = f_\theta(\mathbf{e}_S)$$

The model is trained to minimize the Mean Squared Error (MSE) loss between its predictions $\hat{\mathbf{p}}$ and the ground-truth prosody values $\mathbf{p}_{gt}$ extracted from the human audio:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{\mathbf{p}}_i - \mathbf{p}_{gt,i} \|^2_2$$
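A minimal training step matching this objective might look like the following. The regressor here is a placeholder MLP standing in for the LSTM of Section 4.2, and the optimizer and learning rate are assumptions; only the MSE loss itself comes from the formulation above.

```python
# Hedged sketch of the MSE regression objective from Section 4.1.
import torch
import torch.nn as nn

# Any regressor mapping a 768-dim sentence embedding to 3 prosody values;
# a tiny MLP stands in for the paper's LSTM model here.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 3))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(sentence_embeddings, prosody_targets):
    """sentence_embeddings: (batch, 768); prosody_targets: (batch, 3) human-derived labels."""
    optimizer.zero_grad()
    loss = criterion(model(sentence_embeddings), prosody_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```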

4.2 LSTM Architecture Details

The core sequence model is a standard LSTM cell. At each step $t$ (corresponding to a sentence), it updates its hidden state $\mathbf{h}_t$ and cell state $\mathbf{c}_t$ based on the input $\mathbf{x}_t$ (the MPNet embedding) and the previous states:

$$\mathbf{i}_t = \sigma(\mathbf{W}_{xi}\mathbf{x}_t + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{b}_i)$$
$$\mathbf{f}_t = \sigma(\mathbf{W}_{xf}\mathbf{x}_t + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{b}_f)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_{xo}\mathbf{x}_t + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{b}_o)$$
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_{xc}\mathbf{x}_t + \mathbf{W}_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $\mathbf{W}$ and $\mathbf{b}$ are learnable parameters. The final hidden state $\mathbf{h}_t$ is passed through a fully connected layer to produce the 3-dimensional prosody prediction.
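For readers who want to see these equations as executable code, a single LSTM update can be written out directly. This is a didactic NumPy sketch only; the paper's model would use a framework implementation such as `torch.nn.LSTM`, and the dictionary-based parameter layout is an assumption for readability.

```python
# Didactic sketch: one LSTM step mirroring the equations above (NumPy, not the actual model).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W holds the eight weight matrices and b the four bias vectors, keyed by gate name."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])        # input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])        # forget gate
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                              # element-wise gating
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```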

5. Experimental Results

5.1 Correlation Metrics & Figure 1

The primary evaluation metric is the correlation coefficient (e.g., Pearson's r) between the predicted prosody contour and the human-read prosody contour across a chapter. Figure 1 in the paper presents a dot plot comparing the human-TTS correlation for the proposed system and Google Cloud TTS across 24 test books.

  • Chart Description (Fig. 1a - Pitch): The x-axis lists the test books. Each book has two dots: the proposed model's pitch correlation with the human reading and Google TTS's correlation. The model's dot sits above Google's for the vast majority of books, consistent with the 22/24 result.
  • Chart Description (Fig. 1b - Volume): A similar dot plot for volume correlation, where the proposed model's advantage is even more pronounced, corresponding to the 23/24 result.
These plots provide strong visual evidence of the model's superior ability to mimic human narrative prosody.
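The evaluation reduces to computing Pearson's r between the predicted and human prosody contours and comparing it against the baseline's correlation; a minimal version with SciPy is shown below. The exact aggregation used in the paper (per chapter versus per book) is not reproduced here, and the variable names are illustrative.

```python
# Hedged sketch: correlation between predicted and human prosody contours.
import numpy as np
from scipy.stats import pearsonr

def contour_correlation(predicted, human):
    """Both inputs: 1-D arrays of sentence-level values (e.g., pitch) for one chapter."""
    r, _p_value = pearsonr(np.asarray(predicted), np.asarray(human))
    return r

# For one book, the model "wins" if its contour correlates with the human reading
# more strongly than the TTS baseline's contour does:
# model_wins = contour_correlation(model_pitch, human_pitch) > contour_correlation(tts_pitch, human_pitch)
```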

5.2 Human Evaluation Study

Beyond correlation metrics, a human evaluation study was conducted. The model's prosody predictions were used to generate SSML (Speech Synthesis Markup Language) tags to control a TTS engine. Listeners were presented with two versions: the default Google TTS audio and the SSML-enhanced audio using the model's predictions. Results were nuanced: a small majority (12 out of 22 subjects) preferred the SSML-enhanced readings, but the preference was not overwhelming. This highlights the complexity of subjective audio quality assessment and suggests that while the model captures objective prosodic patterns well, integrating them seamlessly into final audio output remains a challenge.

6. Analysis Framework & Case Study

Framework for Narrative Prosody Analysis:

  1. Text Segmentation & Annotation: Divide the novel into sentences. Run NLP pipelines for:
    • Named Entity Recognition (NER) to identify characters.
    • Quote attribution to link dialogue to characters.
    • Text classification to label sentences as "Narrative," "Dialogue," or "Description."
  2. Contextual Feature Engineering: For each sentence, create features:
    • Binary flags: `is_dialogue`, `is_narrative`.
    • Character ID of the speaker (if in dialogue).
    • Metadata: character gender (from external knowledge base).
    • Sentence embedding (MPNet) capturing semantic content.
  3. Prosody Label Extraction: From the time-aligned audio, extract pitch (F0), volume (RMS energy), and speaking rate (syllables/duration) for each sentence.
  4. Model Training & Inference: Train the LSTM model (Section 4.2) on the {features → prosody labels} pairs. For new text, apply the trained model to predict prosody attributes.
  5. SSML Generation & Synthesis: Convert predicted pitch (as a relative multiplier, e.g., `+20%`), volume (e.g., `+3dB`), and rate (e.g., `slow`) into SSML tags. Feed the tagged text to a high-quality neural TTS engine (e.g., Google, Amazon Polly) for final audio rendering.
Case Study - Applying the Framework: Consider the sentence "'I will never go back,' he said defiantly." The framework would: 1) identify it as dialogue spoken by a male character ("he"); 2) having learned that male dialogue often carries lower pitch and higher volume than narrative, the model might predict `pitch_shift = -10%`, `volume_boost = +2dB`; 3) these predictions would be rendered as SSML, e.g. `<prosody pitch="-10%" volume="+2dB">"I will never go back,"</prosody> he said defiantly.` The resulting synthetic speech would carry the intended dramatic emphasis.
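Step 5 of the framework can be sketched as a small SSML-generation helper. The mapping from predicted values to `<prosody>` attributes shown here is an assumed bucketing of the model's continuous outputs, not the paper's published rules; the function and parameter names are illustrative.

```python
# Hedged sketch: wrap a sentence in an SSML <prosody> tag built from predicted attributes.
from xml.sax.saxutils import escape

def to_ssml(sentence, pitch_shift_pct, volume_db, rate="medium"):
    """pitch_shift_pct: e.g. -10 -> pitch="-10%"; volume_db: e.g. 2 -> volume="+2dB"."""
    pitch = f"{pitch_shift_pct:+d}%"
    volume = f"{volume_db:+d}dB"
    return (f'<prosody pitch="{pitch}" volume="{volume}" rate="{rate}">'
            f"{escape(sentence)}</prosody>")

print(to_ssml("'I will never go back,' he said defiantly.", -10, 2))
# <prosody pitch="-10%" volume="+2dB" rate="medium">'I will never go back,' he said defiantly.</prosody>
```

The tagged text would then be submitted to the chosen TTS engine (e.g., Google Cloud TTS or Amazon Polly) inside a `<speak>` root element for final rendering.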

7. Future Applications & Directions

  • Personalized Audiobook Narration: Users could select a "narrator style" (e.g., "calm," "dramatic," "sarcastic") by fine-tuning the prosody prediction model on audiobooks read by narrators with that style.
  • Real-Time Interactive Storytelling: Integration into game engines or interactive fiction platforms, where prosody is dynamically adjusted based on narrative tension, character relationships, and player choices.
  • Accessibility & Language Learning: Enhanced TTS for visually impaired users, providing more engaging and comprehensible access to literature. It could also aid language learners by providing more expressive and context-aware pronunciation models.
  • Cross-Modal Creative Tools: For authors and audio producers, tools that suggest prosody markings in a manuscript or automatically generate expressive audio drafts for review.
  • Research Direction - Emotion & Sentiment: Extending the model to predict more granular emotional prosody (e.g., joy, sadness, anger) by incorporating sentiment analysis and emotion detection from text, similar to efforts in emotional TTS seen in research from institutions like Carnegie Mellon University's Language Technologies Institute.
  • Research Direction - End-to-End Systems: Moving beyond post-hoc SSML control to training an end-to-end neural TTS system (like Tacotron 2 or FastSpeech 2) where the prosody prediction is an integral, conditioned part of the acoustic model, potentially yielding more natural and cohesive output.

8. References

  1. Pethe, C., Pham, B., Childress, F. D., Yin, Y., & Skiena, S. (2025). Prosody Analysis of Audiobooks. arXiv preprint arXiv:2310.06930v3.
  2. Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
  3. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
  4. Song, K., et al. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
  5. Google Cloud. (n.d.). Text-to-Speech. Retrieved from https://cloud.google.com/text-to-speech
  6. World Wide Web Consortium (W3C). (2010). Speech Synthesis Markup Language (SSML) Version 1.1. W3C Recommendation.
  7. Zen, H., et al. (2019). LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. Interspeech 2019.

Analyst's Perspective: A Critical Deconstruction

Core Insight: This paper isn't just about making robots sound more human; it's a shrewd exploitation of a massive, underutilized dataset—human audiobook performances—to reverse-engineer the tacit rules of narrative delivery. The authors correctly identify that the billion-dollar audiobook industry is, in effect, a colossal, pre-existing annotation set for expressive speech. Their key insight is treating the narrator as a high-fidelity sensor for textual affect, a concept with parallels to how CycleGAN (Zhu et al., 2017) uses unpaired image sets to learn style translation—here, the "style" is prosodic performance.

Logical Flow: The logic is compelling: 1) Align text and audio to create a supervised dataset. 2) Use robust NLP (MPNet) to understand text. 3) Use a sequential model (LSTM) to map context to prosody. 4) Beat a commercial giant (Google) at its own game on correlation metrics. The flow from data creation to model superiority is clean and well-supported by their 22/24 and 23/24 win rates. However, the chain weakens at the final, crucial link: subjective listener preference. A 12/22 result is statistically flimsy and reveals the perennial "good metrics, mediocre experience" problem in AI audio.

Strengths & Flaws: The strength is undeniable in the dataset and the clear, quantifiable superiority over baseline TTS in capturing objective prosodic contours. The character-level analysis (male vs. female, narrative vs. dialogue) is a gem of empirical observation that provides both a validation of the model and a fascinating insight into human performance. The major flaw is the reliance on post-hoc SSML hacking. As any audio engineer will tell you, applying prosody controls after the fact to a generic TTS voice often sounds artificial and disjointed—like using a graphic equalizer on a poor recording. The human evaluation results scream this limitation. The model predicts the right notes, but the synthesis engine can't play them in tune. A more ambitious, end-to-end approach, as pioneered by models like FastSpeech 2, is the necessary but more difficult next step.

Actionable Insights: For product teams, the immediate takeaway is to license or build upon this dataset and model to add a "Storyteller" or "Expressive" mode to existing TTS offerings—a viable near-term feature. For researchers, the path is twofold: First, integrate this prosody prediction directly into the acoustic model of a neural TTS system, moving beyond SSML. Second, expand the analysis beyond the three basic attributes to encompass voice quality (breathiness, roughness) and more nuanced emotional states, perhaps leveraging resources like the MSP-Podcast corpus for emotional speech analysis. The paper successfully cracks open a rich vein of research; now the hard work of refining the ore begins.