
J-MAC: Japanese Multi-Speaker Audiobook Corpus for Speech Synthesis

Analysis of J-MAC corpus construction methodology, technical contributions, evaluation results, and future directions for expressive audiobook speech synthesis.

1. Introduction

The paper introduces J-MAC (Japanese Multi-speaker Audiobook Corpus), a novel speech corpus designed to advance research in expressive, context-aware speech synthesis, specifically for audiobook applications. The authors argue that while reading-style TTS has achieved near-human quality, the next frontier involves handling complex, cross-sentence contexts, speaker-specific expressiveness, and narrative flow—hallmarks of professional audiobook narration. The lack of high-quality, multi-speaker audiobook corpora, especially for languages like Japanese, is identified as a key bottleneck. J-MAC aims to fill this gap by providing a resource built from professionally narrated audiobooks, using an automated, language-agnostic construction pipeline.

2. Corpus Construction

The construction of J-MAC involves a three-stage pipeline: data collection, cleansing, and precise text-audio alignment.

2.1 Data Collection

Audiobooks were selected based on two primary criteria: 1) Availability of accurate reference text (prioritizing out-of-copyright novels to avoid ASR transcription errors on named entities), and 2) Existence of multiple professional speaker renditions of the same book to capture speaker-dependent expressiveness. This focus on parallel recordings (same book, different speakers) is a strategic choice to enable controlled studies on speaker style.

2.2 Data Cleansing & Alignment

The raw audiobook audio undergoes a multi-step refinement process. First, vocal-instrumental separation (e.g., using tools like Spleeter or Open-Unmix) isolates the speaker's voice from any background music or sound effects. Next, Connectionist Temporal Classification (CTC), typically from a pre-trained ASR model, provides a rough alignment between the audio segments and the corresponding text. Finally, Voice Activity Detection (VAD) is applied to refine the boundaries of speech segments, ensuring clean, precise utterances matched to text.
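The three stages can be sketched as a pipeline of composable steps. The function names and placeholder bodies below are illustrative assumptions, not the authors' code: a production pipeline would call a real source-separation model (e.g., Spleeter or Open-Unmix), a pre-trained CTC aligner, and a proper VAD.

```python
import numpy as np

def separate_vocals(audio):
    # Placeholder for a source-separation model (e.g. Spleeter/Open-Unmix);
    # here the mixture is passed through unchanged.
    return audio

def ctc_rough_align(audio, text):
    # Placeholder for CTC-based rough alignment: a real pipeline decodes
    # with a pre-trained ASR model; here sentences are spread evenly.
    sentences = [s for s in text.split("。") if s]
    step = len(audio) // max(len(sentences), 1)
    return [(i * step, (i + 1) * step, s) for i, s in enumerate(sentences)]

def vad_refine(audio, start, end, thresh=1e-4):
    # Trim leading/trailing near-silence inside the rough segment.
    seg = audio[start:end]
    idx = np.flatnonzero(np.abs(seg) > thresh)
    if idx.size == 0:
        return start, end
    return start + int(idx[0]), start + int(idx[-1]) + 1

def build_utterances(audio, text):
    vocals = separate_vocals(audio)
    utts = []
    for s, e, sent in ctc_rough_align(vocals, text):
        s2, e2 = vad_refine(vocals, s, e)
        utts.append({"text": sent, "start": s2, "end": e2})
    return utts
```

Each stage can be swapped out independently, which is what makes the design language-agnostic: only the ASR/alignment model is language-specific.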

3. Technical Methodology

The core innovation lies in the automated pipeline, which minimizes manual effort.

3.1 Vocal-Instrumental Separation

This step is crucial for obtaining "clean" speech data. The paper implies the use of source separation models to extract the vocal track, removing non-speech elements that could degrade TTS model training.

3.2 CTC-Based Alignment

CTC alignment is used for its ability to handle sequences of different lengths without explicit segmentation. The CTC loss function, $L_{CTC} = -\log P(\mathbf{l}|\mathbf{x})$, where $\mathbf{x}$ is the acoustic input and $\mathbf{l}$ is the target label sequence, allows the model to learn an alignment between audio frames and text characters/phonemes.
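The loss can be made concrete with a tiny brute-force example. The two-symbol vocabulary, the blank "ε", and the uniform frame probabilities below are illustrative assumptions, not values from the paper; with only two frames, the probability of a label sequence can be computed by enumerating every path.

```python
import itertools, math

VOCAB = ["a", "b", "ε"]          # "ε" is the CTC blank

def collapse(path):
    # CTC's many-to-one map B: merge consecutive repeats, then drop blanks.
    merged = [k for k, _ in itertools.groupby(path)]
    return tuple(s for s in merged if s != "ε")

def ctc_loss(frame_probs, target):
    # -log P(l|x): sum of path products over all paths collapsing to target.
    T = len(frame_probs)
    p = 0.0
    for path in itertools.product(range(len(VOCAB)), repeat=T):
        if collapse([VOCAB[i] for i in path]) == tuple(target):
            prob = 1.0
            for t, i in enumerate(path):
                prob *= frame_probs[t][i]
            p += prob
    return -math.log(p)
```

With two frames of uniform probabilities, three paths ("aa", "aε", "εa") collapse to "a", so $P = 3 \cdot \frac{1}{9}$ and the loss is $\log 3$. Enumeration is exponential in $T$; real implementations use the forward-backward dynamic program instead.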

3.3 VAD Refinement

Post-CTC alignment, VAD algorithms (e.g., based on energy thresholds or neural networks) are used to detect the precise start and end points of speech within the roughly aligned segments, removing leading/trailing silence or noise.
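A minimal energy-threshold VAD of the kind described here might look as follows; the 20 ms frame size, percentile-based noise-floor estimate, and 12 dB margin are illustrative choices, not values from the paper.

```python
import numpy as np

def energy_vad(x, sr=16000, frame_ms=20, margin_db=12.0):
    # A frame counts as speech if its energy is more than margin_db
    # above an estimated noise floor (10th-percentile frame energy).
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    e = np.array([np.mean(x[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    e_db = 10 * np.log10(e + 1e-12)
    floor = np.percentile(e_db, 10)
    return e_db > floor + margin_db, frame

def trim(x, sr=16000):
    # Remove leading/trailing non-speech frames from a rough segment.
    speech, frame = energy_vad(x, sr)
    idx = np.flatnonzero(speech)
    if idx.size == 0:
        return x[:0]
    return x[idx[0]*frame:(idx[-1] + 1)*frame]
```

Neural VADs (e.g., trained on labeled speech/non-speech frames) are more robust to residual music left over from imperfect source separation, which is why they are a natural fit after the separation stage.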

4. Evaluation & Results

The authors conducted audiobook speech synthesis evaluations using models trained on J-MAC. Key findings include:

  • Method Generalization: Improvements in the underlying synthesis method (e.g., better acoustic models) enhanced the naturalness of synthetic speech across all speakers in the corpus.
  • Entangled Factors: The naturalness of the synthesized audiobook speech was strongly influenced by a complex interaction between the synthesis method, the target speaker's voice characteristics, and the specific book/content being synthesized. Disentangling these factors remains a challenge.

Evaluation Insight

Core Result: Synthesis quality is non-trivially dependent on the Speaker × Method × Content interaction.

5. Key Insights & Discussion

  • J-MAC addresses a critical data scarcity issue for expressive TTS research in Japanese.
  • The automated construction pipeline is a significant contribution, reducing the cost and time of creating such corpora and being potentially applicable to other languages.
  • The evaluation underscores that audiobook synthesis is not merely a scaling-up of single-sentence TTS; it requires modeling higher-level narrative context and speaker identity.
  • The "entanglement" finding suggests that future evaluation metrics and models need to account for multi-dimensional factors.

6. Original Analysis: Industry Perspective

Core Insight: The J-MAC paper isn't just about a new dataset; it's a strategic play to shift the TTS paradigm from isolated utterance generation to holistic narrative modeling. The authors correctly identify that the next value inflection point in speech synthesis lies in long-form, expressive content like audiobooks, podcasts, and interactive narratives—areas where current TTS still sounds robotic and context-agnostic. By open-sourcing a multi-speaker corpus, they're not just providing data; they're setting the benchmark and research agenda.

Logical Flow: Their logic is impeccable: 1) High-quality data is the fuel for deep learning. 2) Professional audiobooks are the gold standard for expressive, contextually coherent speech. 3) Manual corpus creation is prohibitively expensive. Therefore, an automated pipeline (separation → CTC alignment → VAD) is the only scalable solution. This mirrors the data-centric AI movement championed by Andrew Ng, where the quality of the data pipeline is as important as the model architecture.

Strengths & Flaws: The major strength is the pipeline's practicality and language-agnostic design. Using off-the-shelf components like source separation models (e.g., based on architectures like the U-Net used in Demucs) and CTC-based ASR makes it reproducible. However, the paper's flaw is its light touch on the "context" problem it highlights. It provides the data (J-MAC) but offers limited novel modeling solutions for leveraging cross-sentence context or disentangling speaker style from content. The evaluation results, while insightful, are descriptive rather than prescriptive. How do we actually model the "entangled" factors? Techniques from style transfer and disentangled representation learning, like those in CycleGAN or variational autoencoders, are hinted at but not deeply explored.

Actionable Insights: For industry practitioners, the takeaway is twofold. First, invest in building or acquiring similar long-form, multi-style speech corpora—it will be a key differentiator. Second, the research priority should be on context-aware architectures. This could mean transformer-based models with much longer context windows, or hierarchical models that separately encode local prosody, speaker style, and global narrative arc. The work of teams like Google Brain on SoundStream or Microsoft on VALL-E points towards neural codec-based approaches that could be extended with the contextual cues J-MAC provides. The future isn't just synthesizing a sentence; it's synthesizing a performance.

7. Technical Details & Mathematical Formulation

The alignment process relies heavily on the CTC objective. For an input sequence $\mathbf{x}$ (audio features) of length $T$ and a target label sequence $\mathbf{l}$ (text characters) of length $U$, where $T \geq U$, CTC introduces a blank token $\epsilon$ and defines a many-to-one mapping $\mathcal{B}$ from a path $\pi$ (of length $T$) to $\mathbf{l}$: consecutive repeated symbols are merged, then blanks are removed. The probability of a path is $P(\pi|\mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^t$, where $y_{\pi_t}^t$ is the model's probability of symbol $\pi_t$ at time $t$. The conditional probability of the label sequence is the sum over all paths mapped to it by $\mathcal{B}$: $P(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} P(\pi|\mathbf{x})$. This formulation allows the model to learn the alignment without pre-segmented data. In the J-MAC pipeline, a pre-trained CTC model (e.g., based on a DeepSpeech2-like architecture) generates these alignments to chunk the audio.
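In practice the sum over $\mathcal{B}^{-1}(\mathbf{l})$ is computed with the standard forward recursion over the blank-extended label sequence $\mathbf{l}' = [\epsilon, l_1, \epsilon, l_2, \ldots, \epsilon]$ rather than by enumerating paths. A minimal NumPy sketch of that recursion (illustrative, not the authors' implementation):

```python
import numpy as np

def ctc_forward(probs, label, blank):
    # probs: (T, V) per-frame symbol probabilities; label: list of indices.
    # Returns P(l|x) via the forward recursion over the blank-extended label.
    T = probs.shape[0]
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1, s]
            if s > 0:
                a += alpha[t-1, s-1]
            # may skip over a blank, but not between repeated labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s-2]:
                a += alpha[t-1, s-2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T-1, S-1] + (alpha[T-1, S-2] if S > 1 else 0.0)
```

For example, with $T = 2$ uniform frames over a vocabulary {a, ε}, the paths "aa", "aε", "εa" all collapse to "a", giving $P = 0.75$; the repeated-label rule forces "aa" to be realized as "aεa", which needs at least three frames.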

8. Experimental Results & Chart Description

While the provided PDF excerpt does not contain explicit charts, the described results imply a multi-factorial evaluation design. A hypothetical result chart that would illustrate their key finding would be a 3D surface plot or a series of grouped bar charts.

Chart Description: The y-axis represents Mean Opinion Score (MOS) for naturalness (e.g., 1-5 scale). The x-axis lists different synthesis methods (e.g., Tacotron2, FastSpeech2, a proposed model). The grouping/z-axis would represent different speakers from J-MAC (Speaker A, B, C) and/or different books (Book X, Book Y). The key visual finding would be that the bars' heights (MOS) do not follow a consistent order across groups. For instance, Method 1 might be best for Speaker A on Book X, but worst for Speaker B on Book Y, vividly demonstrating the "strong entanglement" of factors. Error bars would likely show significant overlap, indicating the challenge of drawing simple conclusions.

9. Analysis Framework: Example Case

Case Study: Evaluating a New TTS Model for Audiobooks

Objective: Determine if "Model-Z" improves upon a baseline for audiobook synthesis using J-MAC.

Framework:

  1. Data Partitioning: Split J-MAC by book and speaker. Ensure test sets contain unseen sentences from books seen in training (in-domain) and entirely unseen books (out-of-domain).
  2. Model Training: Train both Baseline (e.g., FastSpeech2) and Model-Z on the same training split. Use the J-MAC text-audio pairs.
  3. Controlled Evaluation: Generate speech for identical text sequences across all test conditions (Speaker × Book combinations).
  4. Metrics:
    • Primary: MOS for Naturalness and Expressiveness.
    • Secondary: Word Error Rate (WER) of ASR on synthetic speech (intelligibility), Speaker Similarity Score (e.g., using a speaker verification model like ECAPA-TDNN).
    • Contextual Metric: A/B test where evaluators listen to two consecutive synthesized sentences and rate coherence.
  5. Analysis: Perform ANOVA or similar statistical analysis to isolate the effect of Model, Speaker, Book, and their interactions on the MOS scores. The null hypothesis would be "Model-Z has no effect independent of Speaker and Book."
This framework directly addresses the entanglement problem highlighted in the paper.
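The interaction analysis in step 5 can be illustrated with a toy two-factor decomposition: subtracting the grand mean and the Method and Speaker main effects from the cell means leaves the interaction term, and a non-zero residual means the method ranking flips across speakers. The MOS numbers below are hypothetical, chosen only to exhibit such a flip.

```python
import numpy as np

# Hypothetical MOS cell means for 2 methods x 3 speakers (not from the paper).
mos = np.array([[3.8, 3.2, 4.1],    # method A
                [3.5, 3.9, 3.6]])   # method B

grand = mos.mean()
method_eff = mos.mean(axis=1, keepdims=True) - grand
speaker_eff = mos.mean(axis=0, keepdims=True) - grand
interaction = mos - grand - method_eff - speaker_eff

# A sizable interaction residual is exactly the "entanglement" reported
# in the paper: no single method wins for every speaker.
print(np.round(interaction, 2))
print(mos.argmax(axis=0))   # prints [0 1 0]: the best method flips by speaker
```

In a full analysis these cell residuals feed the interaction sum of squares in a two-way ANOVA; rejecting "no interaction" formally establishes that Model-Z's effect depends on Speaker and Book.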

10. Future Applications & Research Directions

  • Personalized Audiobooks: Synthesizing books in the voice of a user's favorite narrator or even a personal voice clone.
  • Dynamic Narration for Games/XR: Generating context-aware, expressive dialogue and narration in real-time for interactive media.
  • Accessibility: Drastically reducing the time and cost to produce audiobooks for the visually impaired or for books in low-resource languages.
  • Research Directions:
    1. Disentangled Representation Learning: Developing models that explicitly separate content, speaker style, emotion, and narrative tone into latent variables.
    2. Long-Context Modeling: Leveraging efficient transformer variants (e.g., Longformer, Performer) to condition synthesis on entire paragraphs or chapters.
    3. Prosody Transfer & Control: Enabling fine-grained control over pacing, emphasis, and intonation across long passages, perhaps using reference audio clips as style prompts.
    4. Cross-Lingual Expansion: Applying the J-MAC construction pipeline to build similar corpora for other languages, fostering comparative studies.

11. References

  1. J. Shen, et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," ICASSP 2018.
  2. A. Vaswani, et al., "Attention Is All You Need," NeurIPS 2017.
  3. Y. Ren, et al., "FastSpeech: Fast, Robust and Controllable Text to Speech," NeurIPS 2019.
  4. J.-Y. Zhu, et al., "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," ICCV 2017 (CycleGAN).
  5. A. Défossez, et al., "Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed," arXiv:1909.01174.
  6. A. van den Oord, et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499.
  7. J. Kong, et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS 2020.
  8. N. Zeghidour, et al., "SoundStream: An End-to-End Neural Audio Codec," arXiv:2107.03312.
  9. A. Graves, et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML 2006.
  10. Andrew Ng, "Data-Centric AI," DeepLearning.AI.