
J-MAC: Japanese Multi-Speaker Audiobook Corpus for Speech Synthesis

Analysis of J-MAC corpus construction methodology, technical contributions, evaluation results, and future directions for expressive audiobook speech synthesis.

1. Introduction

The paper introduces J-MAC (Japanese Multi-speaker Audiobook Corpus), a novel speech corpus designed to advance research in expressive, context-aware speech synthesis, specifically for audiobook applications. The authors argue that while reading-style TTS has achieved near-human quality, the next frontier involves handling complex, cross-sentence contexts, speaker-specific expressiveness, and narrative flow—hallmarks of professional audiobook narration. The lack of high-quality, multi-speaker audiobook corpora, especially for languages like Japanese, is identified as a key bottleneck. J-MAC aims to fill this gap by providing a resource built from professionally narrated audiobooks, using an automated, language-agnostic construction pipeline.

2. Corpus Construction

The construction of J-MAC involves a three-stage pipeline: data collection, cleansing, and precise text-audio alignment.

2.1 Data Collection

Audiobooks were selected based on two primary criteria: 1) Availability of accurate reference text (prioritizing out-of-copyright novels to avoid ASR transcription errors on named entities), and 2) Existence of multiple professional speaker renditions of the same book to capture speaker-dependent expressiveness. This focus on parallel recordings (same book, different speakers) is a strategic choice to enable controlled studies on speaker style.

2.2 Data Cleansing & Alignment

The raw audiobook audio undergoes a multi-step refinement process. First, vocal-instrumental separation (e.g., using tools like Spleeter or Open-Unmix) isolates the speaker's voice from any background music or sound effects. Next, Connectionist Temporal Classification (CTC), typically from a pre-trained ASR model, provides a rough alignment between the audio segments and the corresponding text. Finally, Voice Activity Detection (VAD) is applied to refine the boundaries of speech segments, ensuring clean, precise utterances matched to text.
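The three stages can be sketched as a pipeline of composable steps. The function names and placeholder bodies below are illustrative assumptions, not the authors' code: a production pipeline would call a real source-separation model (e.g., Spleeter or Open-Unmix), a pre-trained CTC aligner, and a proper VAD.

```python
import numpy as np

def separate_vocals(audio):
    # Placeholder for a source-separation model (e.g. Spleeter/Open-Unmix);
    # here the mixture is passed through unchanged.
    return audio

def ctc_rough_align(audio, text):
    # Placeholder for CTC-based rough alignment: a real pipeline decodes
    # with a pre-trained ASR model; here sentences are spread evenly.
    sentences = [s for s in text.split("。") if s]
    step = len(audio) // max(len(sentences), 1)
    return [(i * step, (i + 1) * step, s) for i, s in enumerate(sentences)]

def vad_refine(audio, start, end, thresh=1e-4):
    # Trim leading/trailing near-silence inside the rough segment.
    seg = audio[start:end]
    idx = np.flatnonzero(np.abs(seg) > thresh)
    if idx.size == 0:
        return start, end
    return start + int(idx[0]), start + int(idx[-1]) + 1

def build_utterances(audio, text):
    vocals = separate_vocals(audio)
    utts = []
    for s, e, sent in ctc_rough_align(vocals, text):
        s2, e2 = vad_refine(vocals, s, e)
        utts.append({"text": sent, "start": s2, "end": e2})
    return utts
```

Each stage can be swapped out independently, which is what makes the design language-agnostic: only the ASR/alignment model is language-specific.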

3. Technical Methodology

The core innovation lies in the automated pipeline, which minimizes manual effort.

3.1 Vocal-Instrumental Separation

This step is crucial for obtaining "clean" speech data. The paper implies the use of source separation models to extract the vocal track, removing non-speech elements that could degrade TTS model training.

3.2 CTC-Based Alignment

CTC alignment is used for its ability to handle sequences of different lengths without explicit segmentation. The CTC loss function, $L_{CTC} = -\log P(\mathbf{l}|\mathbf{x})$, where $\mathbf{x}$ is the acoustic input and $\mathbf{l}$ is the target label sequence, allows the model to learn an alignment between audio frames and text characters/phonemes.
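The loss can be made concrete with a tiny brute-force example. The two-symbol vocabulary, the blank "ε", and the uniform frame probabilities below are illustrative assumptions, not values from the paper; with only two frames, the probability of a label sequence can be computed by enumerating every path.

```python
import itertools, math

VOCAB = ["a", "b", "ε"]          # "ε" is the CTC blank

def collapse(path):
    # CTC's many-to-one map B: merge consecutive repeats, then drop blanks.
    merged = [k for k, _ in itertools.groupby(path)]
    return tuple(s for s in merged if s != "ε")

def ctc_loss(frame_probs, target):
    # -log P(l|x): sum of path products over all paths collapsing to target.
    T = len(frame_probs)
    p = 0.0
    for path in itertools.product(range(len(VOCAB)), repeat=T):
        if collapse([VOCAB[i] for i in path]) == tuple(target):
            prob = 1.0
            for t, i in enumerate(path):
                prob *= frame_probs[t][i]
            p += prob
    return -math.log(p)
```

With two frames of uniform probabilities, three paths ("aa", "aε", "εa") collapse to "a", so $P = 3 \cdot \frac{1}{9}$ and the loss is $\log 3$. Enumeration is exponential in $T$; real implementations use the forward-backward dynamic program instead.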

3.3 VAD Refinement

Post-CTC alignment, VAD algorithms (e.g., based on energy thresholds or neural networks) are used to detect the precise start and end points of speech within the roughly aligned segments, removing leading/trailing silence or noise.
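A minimal energy-threshold VAD of the kind described here might look as follows; the 20 ms frame size, percentile-based noise-floor estimate, and 12 dB margin are illustrative choices, not values from the paper.

```python
import numpy as np

def energy_vad(x, sr=16000, frame_ms=20, margin_db=12.0):
    # A frame counts as speech if its energy is more than margin_db
    # above an estimated noise floor (10th-percentile frame energy).
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    e = np.array([np.mean(x[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    e_db = 10 * np.log10(e + 1e-12)
    floor = np.percentile(e_db, 10)
    return e_db > floor + margin_db, frame

def trim(x, sr=16000):
    # Remove leading/trailing non-speech frames from a rough segment.
    speech, frame = energy_vad(x, sr)
    idx = np.flatnonzero(speech)
    if idx.size == 0:
        return x[:0]
    return x[idx[0]*frame:(idx[-1] + 1)*frame]
```

Neural VADs (e.g., trained on labeled speech/non-speech frames) are more robust to residual music left over from imperfect source separation, which is why they are a natural fit after the separation stage.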

4. Evaluation & Results

The authors conducted audiobook speech synthesis evaluations using models trained on J-MAC. Key findings include:

  • Method Generalization: Improvements in the underlying synthesis method (e.g., better acoustic models) enhanced the naturalness of synthetic speech across all speakers in the corpus.
  • Entangled Factors: The naturalness of the synthesized audiobook speech was strongly influenced by a complex interaction between the synthesis method, the target speaker's voice characteristics, and the specific book/content being synthesized. Disentangling these factors remains a challenge.

Evaluation Insight

Core Result: Synthesis quality is non-trivially dependent on the Speaker × Method × Content interaction.

5. Key Insights & Discussion

  • J-MAC addresses a critical data scarcity issue for expressive TTS research in Japanese.
  • The automated construction pipeline is a significant contribution, reducing the cost and time of creating such corpora and being potentially applicable to other languages.
  • The evaluation underscores that audiobook synthesis is not merely a scaling-up of single-sentence TTS; it requires modeling higher-level narrative context and speaker identity.
  • The "entanglement" finding suggests that future evaluation metrics and models need to account for multi-dimensional factors.

6. Original Analysis: Industry Perspective

Core Insight: The J-MAC paper isn't just about a new dataset; it's a strategic play to shift the TTS paradigm from isolated utterance generation to holistic narrative modeling. The authors correctly identify that the next value inflection point in speech synthesis lies in long-form, expressive content like audiobooks, podcasts, and interactive narratives—areas where current TTS still sounds robotic and context-agnostic. By open-sourcing a multi-speaker corpus, they're not just providing data; they're setting the benchmark and research agenda.

Logical Flow: Their logic is impeccable: 1) High-quality data is the fuel for deep learning. 2) Professional audiobooks are the gold standard for expressive, contextually coherent speech. 3) Manual corpus creation is prohibitively expensive. Therefore, an automated pipeline (separation → CTC alignment → VAD) is the only scalable solution. This mirrors the data-centric AI movement championed by Andrew Ng, where the quality of the data pipeline is as important as the model architecture.

Strengths & Flaws: The major strength is the pipeline's practicality and language-agnostic design. Using off-the-shelf components like source separation models (e.g., based on architectures like the U-Net used in Demucs) and CTC-based ASR makes it reproducible. However, the paper's flaw is its light touch on the "context" problem it highlights. It provides the data (J-MAC) but offers limited novel modeling solutions for leveraging cross-sentence context or disentangling speaker style from content. The evaluation results, while insightful, are descriptive rather than prescriptive. How do we actually model the "entangled" factors? Techniques from style transfer and disentangled representation learning, like those in CycleGAN or variational autoencoders, are hinted at but not deeply explored.

Actionable Insights: For industry practitioners, the takeaway is twofold. First, invest in building or acquiring similar long-form, multi-style speech corpora—it will be a key differentiator. Second, the research priority should be on context-aware architectures. This could mean transformer-based models with much longer context windows, or hierarchical models that separately encode local prosody, speaker style, and global narrative arc. The work of teams like Google Brain on SoundStream or Microsoft on VALL-E points towards neural codec-based approaches that could be extended with the contextual cues J-MAC provides. The future isn't just synthesizing a sentence; it's synthesizing a performance.

7. Technical Details & Mathematical Formulation

The alignment process relies heavily on the CTC objective. For an input sequence $\mathbf{x}$ (audio features) of length $T$ and a target label sequence $\mathbf{l}$ (text characters) of length $U$, where $T \geq U$, CTC introduces a blank token $\epsilon$ and defines a many-to-one mapping $\mathcal{B}$ from a path $\pi$ (of length $T$) to $\mathbf{l}$: consecutive repeated symbols are merged, then blanks are removed. The probability of a path is $P(\pi|\mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^t$, where $y_{\pi_t}^t$ is the model's probability of symbol $\pi_t$ at time $t$. The conditional probability of the label sequence is the sum over all paths mapped to it by $\mathcal{B}$: $P(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} P(\pi|\mathbf{x})$. This formulation allows the model to learn the alignment without pre-segmented data. In the J-MAC pipeline, a pre-trained CTC model (e.g., based on a DeepSpeech2-like architecture) generates these alignments to chunk the audio.
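In practice the sum over $\mathcal{B}^{-1}(\mathbf{l})$ is computed with the standard forward recursion over the blank-extended label sequence $\mathbf{l}' = [\epsilon, l_1, \epsilon, l_2, \ldots, \epsilon]$ rather than by enumerating paths. A minimal NumPy sketch of that recursion (illustrative, not the authors' implementation):

```python
import numpy as np

def ctc_forward(probs, label, blank):
    # probs: (T, V) per-frame symbol probabilities; label: list of indices.
    # Returns P(l|x) via the forward recursion over the blank-extended label.
    T = probs.shape[0]
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1, s]
            if s > 0:
                a += alpha[t-1, s-1]
            # may skip over a blank, but not between repeated labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s-2]:
                a += alpha[t-1, s-2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T-1, S-1] + (alpha[T-1, S-2] if S > 1 else 0.0)
```

For example, with $T = 2$ uniform frames over a vocabulary {a, ε}, the paths "aa", "aε", "εa" all collapse to "a", giving $P = 0.75$; the repeated-label rule forces "aa" to be realized as "aεa", which needs at least three frames.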

8. Experimental Results & Chart Description

While the provided PDF excerpt does not contain explicit charts, the described results imply a multi-factorial evaluation design. A hypothetical result chart that would illustrate their key finding would be a 3D surface plot or a series of grouped bar charts.

Chart Description: The y-axis represents Mean Opinion Score (MOS) for naturalness (e.g., 1-5 scale). The x-axis lists different synthesis methods (e.g., Tacotron2, FastSpeech2, a proposed model). The grouping/z-axis would represent different speakers from J-MAC (Speaker A, B, C) and/or different books (Book X, Book Y). The key visual finding would be that the bars' heights (MOS) do not follow a consistent order across groups. For instance, Method 1 might be best for Speaker A on Book X, but worst for Speaker B on Book Y, vividly demonstrating the "strong entanglement" of factors. Error bars would likely show significant overlap, indicating the challenge of drawing simple conclusions.

9. Analysis Framework: Example Case

Case Study: Evaluating a New TTS Model for Audiobooks

Objective: Determine if "Model-Z" improves upon a baseline for audiobook synthesis using J-MAC.

Framework:

  1. Data Partitioning: Split J-MAC by book and speaker. Ensure test sets contain unseen sentences from books seen in training (in-domain) and entirely unseen books (out-of-domain).
  2. Model Training: Train both Baseline (e.g., FastSpeech2) and Model-Z on the same training split. Use the J-MAC text-audio pairs.
  3. Controlled Evaluation: Generate speech for identical text sequences across all test conditions (Speaker × Book combinations).
  4. Metrics:
    • Primary: MOS for Naturalness and Expressiveness.
    • Secondary: Word Error Rate (WER) of ASR on synthetic speech (intelligibility), Speaker Similarity Score (e.g., using a speaker verification model like ECAPA-TDNN).
    • Contextual Metric: A/B test where evaluators listen to two consecutive synthesized sentences and rate coherence.
  5. Analysis: Perform ANOVA or similar statistical analysis to isolate the effect of Model, Speaker, Book, and their interactions on the MOS scores. The null hypothesis would be "Model-Z has no effect independent of Speaker and Book."
This framework directly addresses the entanglement problem highlighted in the paper.
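The interaction analysis in step 5 can be illustrated with a toy two-factor decomposition: subtracting the grand mean and the Method and Speaker main effects from the cell means leaves the interaction term, and a non-zero residual means the method ranking flips across speakers. The MOS numbers below are hypothetical, chosen only to exhibit such a flip.

```python
import numpy as np

# Hypothetical MOS cell means for 2 methods x 3 speakers (not from the paper).
mos = np.array([[3.8, 3.2, 4.1],    # method A
                [3.5, 3.9, 3.6]])   # method B

grand = mos.mean()
method_eff = mos.mean(axis=1, keepdims=True) - grand
speaker_eff = mos.mean(axis=0, keepdims=True) - grand
interaction = mos - grand - method_eff - speaker_eff

# A sizable interaction residual is exactly the "entanglement" reported
# in the paper: no single method wins for every speaker.
print(np.round(interaction, 2))
print(mos.argmax(axis=0))   # prints [0 1 0]: the best method flips by speaker
```

In a full analysis these cell residuals feed the interaction sum of squares in a two-way ANOVA; rejecting "no interaction" formally establishes that Model-Z's effect depends on Speaker and Book.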

10. Future Applications & Research Directions

  • Personalized Audiobooks: Synthesizing books in the voice of a user's favorite narrator or even a personal voice clone.
  • Dynamic Narration for Games/XR: Generating context-aware, expressive dialogue and narration in real-time for interactive media.
  • Accessibility: Drastically reducing the time and cost to produce audiobooks for the visually impaired or for books in low-resource languages.
  • Research Directions:
    1. Disentangled Representation Learning: Developing models that explicitly separate content, speaker style, emotion, and narrative tone into latent variables.
    2. Long-Context Modeling: Leveraging efficient transformer variants (e.g., Longformer, Performer) to condition synthesis on entire paragraphs or chapters.
    3. Prosody Transfer & Control: Enabling fine-grained control over pacing, emphasis, and intonation across long passages, perhaps using reference audio clips as style prompts.
    4. Cross-Lingual Expansion: Applying the J-MAC construction pipeline to build similar corpora for other languages, fostering comparative studies.

11. References

  1. J. Shen, et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," ICASSP 2018.
  2. A. Vaswani, et al., "Attention Is All You Need," NeurIPS 2017.
  3. Y. Ren, et al., "FastSpeech: Fast, Robust and Controllable Text to Speech," NeurIPS 2019.
  4. J.-Y. Zhu, et al., "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," ICCV 2017 (CycleGAN).
  5. A. Défossez, et al., "Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed," arXiv:1909.01174.
  6. A. van den Oord, et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499.
  7. J. Kong, et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS 2020.
  8. N. Zeghidour, et al., "SoundStream: An End-to-End Neural Audio Codec," arXiv:2107.03312.
  9. A. Graves, et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML 2006.
  10. Andrew Ng, "Data-Centric AI," DeepLearning.AI.