
MultiActor-Audiobook: Zero-Shot Generation with Faces and Voices

A technical analysis of MultiActor-Audiobook, a novel zero-shot system for generating expressive audiobooks using multimodal speaker personas and LLM-based script instructions.

1. Introduction & Overview

MultiActor-Audiobook presents a zero-shot framework for generating expressive audiobooks featuring multiple distinct speakers. It addresses key limitations of prior systems: the high cost of extensive voice actor datasets, domain specificity of trained models, and the labor-intensive nature of manual prosody annotation. The core innovation lies in its two automated, zero-shot processes: Multimodal Speaker Persona Generation (MSP) and LLM-based Script Instruction Generation (LSI). By synthesizing character-specific voices from generated visual personas and dynamically inferring emotional/prosodic cues from text context, the system aims to produce audiobooks with consistent, appropriate, and expressive narration without any task-specific training data.

2. Core Methodology

The system's effectiveness hinges on two novel, interconnected processes that automate the most challenging aspects of audiobook production: character voice creation and expressive reading.

2.1 Multimodal Speaker Persona Generation (MSP)

This process creates a unique, consistent voice for each character in a story from textual descriptions alone.

  1. Entity Identification & Textual Persona Extraction: An LLM (e.g., GPT-4) parses the novel script to identify all speaking entities (characters, narrator). For each, it extracts descriptive features (personality, age, role, physical traits) from the narrative text.
  2. Visual Persona Generation: A text-to-image model (e.g., Stable Diffusion) uses the extracted textual description to generate a face image that visually embodies the character.
  3. Face-to-Voice Synthesis: A pre-trained Face-to-Voice system (referencing work like [14]) takes the generated face image and its caption to synthesize a short voice sample. This sample encapsulates the character's distinctive prosodic features (timbre, pitch baseline, speaking style). This voice becomes the anchor for all subsequent dialogue by that character.
This pipeline is fully zero-shot for new characters, requiring no prior recordings.
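Below is a minimal Python sketch of how the MSP pipeline could be wired together. The wrappers extract_personas (an LLM call), generate_face (a text-to-image call), and face_to_voice are hypothetical names introduced here for illustration; they are not the authors' API, and the real system's interfaces may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SpeakerPersona:
    """Multimodal persona P_c: text description D_c, face image F_c, voice sample V_c."""
    name: str
    description: str     # D_c: personality, age, role, physical traits
    face_image: bytes    # F_c: produced by a text-to-image model
    voice_sample: bytes  # V_c: produced by a Face-to-Voice model

def build_personas(
    novel_text: str,
    extract_personas: Callable[[str], Dict[str, str]],  # LLM: script -> {speaker: description}
    generate_face: Callable[[str], bytes],              # text-to-image: description -> face image
    face_to_voice: Callable[[bytes, str], bytes],       # face image + caption -> short voice sample
) -> List[SpeakerPersona]:
    """Zero-shot MSP: one consistent voice anchor per identified speaker."""
    personas: List[SpeakerPersona] = []
    for name, description in extract_personas(novel_text).items():
        face = generate_face(description)
        voice = face_to_voice(face, description)
        personas.append(SpeakerPersona(name, description, face, voice))
    return personas
```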

2.2 LLM-based Script Instruction Generation (LSI)

To avoid monotonic reading, this process generates dynamic, sentence-level prosody instructions.

  1. Context-Aware Analysis: For each sentence to be synthesized, the LLM is provided with: the target sentence, surrounding context (previous/next sentences), and the persona information of the current speaker.
  2. Instruction Generation: The LLM outputs a structured set of instructions specifying emotional state (e.g., "joyful," "somber"), tone (e.g., "sarcastic," "authoritative"), pitch variation, and speaking rate appropriate for the context and character.
  3. Prompting for TTS: These instructions are formatted into a natural language prompt (e.g., "Say this in a [emotion] tone with [pitch] variation") that guides a pre-trained, promptable Text-to-Speech (TTS) model to generate the final audio.
This replaces manual annotation with automated, context-sensitive inference.
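A minimal sketch of how the sentence-level instruction could be requested and then turned into a TTS prompt follows, assuming the structured fields named above (emotion, tone, pitch, speaking rate); the exact prompt wording and schema are assumptions, not the authors' prompts.

```python
from dataclasses import dataclass

@dataclass
class ScriptInstruction:
    """Sentence-level prosody instruction I_i returned by the LLM."""
    emotion: str  # e.g. "joyful", "somber"
    tone: str     # e.g. "sarcastic", "authoritative"
    pitch: str    # e.g. "low and steady"
    pace: str     # e.g. "slow"

def build_lsi_query(sentence: str, context: str, persona_description: str) -> str:
    """Assemble the context-aware request sent to the LLM (wording is illustrative)."""
    return (
        "You are directing an audiobook narrator.\n"
        f"Speaker persona: {persona_description}\n"
        f"Surrounding context: {context}\n"
        f"Target sentence: {sentence}\n"
        "Return the emotion, tone, pitch variation, and speaking rate "
        "appropriate for this sentence and speaker."
    )

def to_tts_prompt(instr: ScriptInstruction) -> str:
    """Format the instruction as a natural-language prompt for a promptable TTS model."""
    return (
        f"Say this in a {instr.emotion}, {instr.tone} tone, "
        f"with {instr.pitch} pitch and a {instr.pace} speaking rate."
    )
```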

3. Technical Architecture & Details

3.1 System Pipeline

The end-to-end workflow can be visualized as a sequential pipeline.
Per character: Input Novel Text → LLM (Speaker ID & Persona Extraction) → Text2Image (Face Generation) → Face2Voice (Voice Sample)
Per sentence: [Sentence + Context + Persona] → LLM (LSI) → Prompt-TTS (with Character Voice) → Output Audio Segment
The final audiobook is the temporally concatenated output of all processed sentences.
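Putting the two processes together, the per-sentence loop could look like the sketch below. The callables generate_instruction and prompt_tts stand in for the LSI step and the promptable TTS backend; these names, and the byte-level concatenation at the end, are simplifications for illustration rather than the authors' implementation.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def synthesize_audiobook(
    sentences: Iterable[Tuple[str, str, str]],             # (sentence, speaker name, local context)
    personas: Dict[str, "SpeakerPersona"],                 # speaker name -> persona from the MSP sketch
    generate_instruction: Callable[[str, str, str], str],  # LSI: (sentence, context, description) -> TTS prompt
    prompt_tts: Callable[[str, bytes, str], bytes],        # (sentence, voice sample V_c, prompt) -> audio segment
) -> bytes:
    """Per-sentence synthesis followed by temporal concatenation (Section 3.1)."""
    segments: List[bytes] = []
    for sentence, speaker, context in sentences:
        persona = personas[speaker]
        instruction = generate_instruction(sentence, context, persona.description)
        segments.append(prompt_tts(sentence, persona.voice_sample, instruction))
    # Naive byte join for illustration; a real pipeline would concatenate decoded
    # waveforms (e.g. via an audio library) rather than raw bytes.
    return b"".join(segments)
```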

3.2 Mathematical Formulation

The core generation process for a sentence $s_i$ spoken by character $c$ can be formalized. Let $C$ be the context window around $s_i$, and $P_c$ be the multimodal persona of character $c$ (containing text description $D_c$, generated face $F_c$, and voice sample $V_c$).

The LSI process generates an instruction vector $I_i$: $$I_i = \text{LLM}_{\theta}(s_i, C, P_c)$$ where $\text{LLM}_{\theta}$ is the large language model with parameters $\theta$.

The final audio $A_i$ for the sentence is synthesized by a promptable TTS model $\text{TTS}_{\phi}$, conditioned on the character's voice $V_c$ and the instruction $I_i$: $$A_i = \text{TTS}_{\phi}(s_i | V_c, I_i)$$ The system's zero-shot capability stems from using pre-trained, frozen models ($\text{LLM}_{\theta}$, Text2Image, Face2Voice, $\text{TTS}_{\phi}$) without fine-tuning.
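For completeness, the final audiobook described in Section 3.1 can be written as the temporal concatenation of the per-sentence outputs; the notation below is ours, introduced for illustration and consistent with the definitions above: $$A_{\text{book}} = \bigoplus_{i=1}^{N} A_i, \qquad A_i = \text{TTS}_{\phi}\!\left(s_i \,\middle|\, V_{c(i)}, \text{LLM}_{\theta}(s_i, C_i, P_{c(i)})\right)$$ where $\bigoplus$ denotes concatenation in narrative order, $N$ is the number of sentences, $C_i$ is the context window around $s_i$, and $c(i)$ indexes the speaker of sentence $s_i$.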

4. Experimental Results & Evaluation

The paper validates MultiActor-Audiobook through comparative evaluations against commercial audiobook products, complemented by ablation studies.

4.1 Human Evaluation

Human evaluators assessed generated audiobook samples on criteria like emotional expressiveness, speaker consistency, and overall naturalness. MultiActor-Audiobook achieved competitive or superior ratings compared to commercial TTS-based audiobook services. Notably, it outperformed baseline systems that used a single voice or simple rule-based prosody, particularly in dialogues involving multiple characters with distinct personas.

4.2 MLLM Evaluation

To complement human evaluation, the authors employed Multimodal Large Language Models (MLLMs) such as GPT-4V. The MLLM was given the audio along with a description of the scene and character and asked to judge whether the vocal delivery matched the context. This automated assessment indicated that the system generates context-appropriate prosody on par with commercial systems, supporting the effectiveness of the LSI module.
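A sketch of how such a judging query could be assembled is shown below; the rubric, wording, and scoring scale are assumptions for illustration, not the paper's exact protocol, and the audio clip would be supplied to the MLLM as a separate input modality.

```python
def build_mllm_judge_prompt(scene_description: str, speaker_description: str) -> str:
    """Text portion of an MLLM evaluation query; the audio segment is attached
    separately as a second input modality. Wording is illustrative."""
    return (
        "You will be given an audiobook segment.\n"
        f"Scene: {scene_description}\n"
        f"Speaker: {speaker_description}\n"
        "Judge whether the vocal delivery (emotion, tone, pitch, pace) matches the "
        "scene and the speaker. Reply with a 1-5 score and a one-sentence rationale."
    )
```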

4.3 Ablation Studies

Ablation studies demonstrated the contribution of each core module:

  • Without MSP (Using a generic voice): Speaker consistency and character distinctiveness dropped significantly, leading to confusing dialogues.
  • Without LSI (Using neutral TTS): The audio became monotonous and emotionally flat, scoring poorly on expressiveness metrics.
  • Full System (MSP + LSI): Achieved the highest scores across all evaluation dimensions, proving the synergistic necessity of both components.
These results robustly justify the proposed two-process architecture.

5. Analysis Framework & Case Study

Framework Application: To analyze a novel for production, the system follows the same fixed sequence of steps for every input. Case Study - A Fantasy Novel Excerpt:

  1. Input: "The old wizard, his beard long and grey, muttered a warning. 'Beware the shadows,' he said, his voice like grinding stones."
  2. MSP Execution: LLM identifies "old wizard" as a speaker. Extracts persona: {age: old, role: wizard, descriptor: beard long and grey, voice quality: like grinding stones}. Text2Image generates a wizened face. Face2Voice produces a deep, gravelly voice sample.
  3. LSI Execution for "Beware the shadows": LLM receives the sentence, context (a warning), and wizard persona. Generates instruction: {emotion: grave concern, tone: ominous and low, pitch: low and steady, pace: slow}.
  4. Output: The promptable TTS synthesizes "Beware the shadows" using the gravelly wizard voice, delivered in a slow, ominous, low-pitched manner.
This framework showcases how textual cues are transformed into multimodal, expressive audio without manual intervention.
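As a concrete illustration, the intermediate artifacts of this case study could be represented as plain dictionaries; the field names follow the example above, while the exact schema is an assumption.

```python
# Persona extracted by MSP for the wizard (step 2 of the case study).
wizard_persona = {
    "name": "old wizard",
    "age": "old",
    "role": "wizard",
    "descriptor": "beard long and grey",
    "voice_quality": "like grinding stones",
}

# Instruction generated by LSI for "Beware the shadows" (step 3).
instruction = {
    "emotion": "grave concern",
    "tone": "ominous and low",
    "pitch": "low and steady",
    "pace": "slow",
}

# Natural-language prompt handed to the promptable TTS (step 4).
tts_prompt = (
    f"Say this in a tone of {instruction['emotion']}, {instruction['tone']}, "
    f"with {instruction['pitch']} pitch at a {instruction['pace']} pace."
)
```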

6. Critical Analysis & Expert Insight

Core Insight: MultiActor-Audiobook isn't just another TTS wrapper; it's a strategic pivot from data-centric to prompt-centric generative audio. Its real breakthrough is treating audiobook creation as a multimodal context-retrieval and instruction-following problem, bypassing the prohibitive cost curves of traditional voice cloning and prosody modeling. This aligns with the broader industry shift, exemplified by models like DALL-E and Stable Diffusion in vision, where compositionality from pre-trained parts replaces monolithic model training.

Logical Flow: The logic is elegantly linear but hinges on brittle assumptions. MSP assumes a Face-to-Voice model reliably maps any generated face to a fitting, consistent voice—a leap of faith given the known challenges in cross-modal representation learning (as seen in the disparities between image and audio latent spaces discussed in works like AudioCLIP). LSI assumes an LLM's textual understanding of "somber tone" perfectly translates to acoustic parameters in a downstream TTS—a semantic-acoustic gap that remains a fundamental challenge, as noted in speech processing literature.

Strengths & Flaws: Its strength is undeniable economic and operational efficiency: zero-shot, no licensing headaches for actor voices, rapid prototyping. The flaw is in the quality ceiling. The system is only as good as its weakest off-the-shelf component—the Face2Voice model and the promptable TTS. It will struggle with subtlety and long-range consistency. Can it handle a character's voice breaking with emotion, a nuance that requires sub-phonemic control? Unlikely. The reliance on visual persona for voice is also a potential bias amplifier, a well-documented issue in generative AI ethics.

Actionable Insights: For investors and product managers, this is a compelling MVP for niche markets: indie game dev, rapid content localization, personalized edutainment. However, for mainstream publishing seeking human-competitive quality, it's a complement, not a replacement. The immediate roadmap should focus on hybrid approaches: using this system to generate a rich "first draft" audiobook that a human director can then efficiently edit and polish, slashing production time by 70-80% rather than aiming for 100% automation. The research priority must be closing the semantic-acoustic gap via better joint embedding spaces, perhaps inspired by the alignment techniques used in multimodal models like Flamingo or CM3.

7. Future Applications & Directions

The paradigm introduced by MultiActor-Audiobook opens several avenues:

  • Interactive Media & Gaming: Dynamic, real-time generation of character dialogue in games or interactive stories based on player choices and evolving character states.
  • Accessibility & Education: Instant conversion of textbooks, documents, or personalized children's stories into engaging, multi-voice narrations, greatly enhancing accessibility for visually impaired users or creating immersive learning materials.
  • Content Localization: Rapid dubbing and voice-over for video content by generating culturally and character-appropriate voices in target languages, though this requires advanced multilingual TTS backends.
  • Future Research Directions:
    1. Enhanced Persona Modeling: Incorporating more modalities (e.g., character actions, described sounds) beyond just face and text description to inform voice and prosody.
    2. Long-Context Coherence: Improving LSI to maintain broader narrative arc consistency (e.g., a character's gradual emotional descent) across an entire book, not just local sentences.
    3. Direct Acoustic Parameter Prediction: Moving beyond natural language instructions to having the LLM output direct, interpretable acoustic feature targets (F0 contours, energy) for finer-grained control, similar to the approach in VALL-E but in a zero-shot setting.
    4. Ethical Voice Design: Developing frameworks to audit and debias the Face2Voice and persona generation components to prevent stereotyping.
The ultimate goal is a fully generalized, controllable, and ethical "story-to-soundtrack" synthesis engine.

8. References

  1. Tan, X., et al. (2022). NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality. arXiv preprint arXiv:2205.04421.
  2. Wang, C., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
  3. Zhang, Y., et al. (2022). META-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  4. Radford, A., et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. Proceedings of ICML.
  5. Kim, J., et al. (2021). VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Proceedings of ICML.
  6. OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  7. Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of CVPR.
  8. Alayrac, J., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems.
  9. Park, K., Joo, S., & Jung, K. (2024). MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers. Manuscript submitted for publication.
  10. Guzhov, A., et al. (2022). AudioCLIP: Extending CLIP to Image, Text and Audio. Proceedings of ICASSP.