
AudioBoost: Enhancing Audiobook Discovery in Spotify Search via LLM-Generated Synthetic Queries

Analysis of AudioBoost, a system using Large Language Models to generate synthetic queries for improving audiobook retrievability in Spotify's search engine during cold-start scenarios.

1. Introduction & Problem Statement

Spotify's expansion into audiobooks introduced a classic cold-start problem within its search ecosystem. The platform's existing retrieval systems were heavily biased towards music and podcasts due to years of accumulated user interaction data. New audiobook items suffered from low retrievability—the probability of being returned for relevant queries—because they lacked historical engagement signals. Users, accustomed to searching for specific songs or podcasts, were not formulating the broad, exploratory queries (e.g., "psychological thrillers set in the 80s") necessary to surface diverse audiobook content. This created a vicious cycle: low visibility led to few interactions, which further cemented their low rank in retrieval models.

2. The AudioBoost System

AudioBoost is an intervention designed to break this cold-start cycle by leveraging Large Language Models (LLMs) to bootstrap the query space for audiobooks.

2.1 Core Methodology

The system uses LLMs (e.g., models akin to GPT-4 or proprietary equivalents) to generate synthetic search queries conditioned on audiobook metadata (title, author, genre, description, themes). For example, given metadata for "The Silent Patient," the LLM might generate queries like: "mystery novels with unreliable narrators," "psychological thrillers about therapists," or "audiobooks with shocking plot twists."
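The generation step can be sketched as a prompt built from item metadata plus a parser for the model's response. This is a minimal illustration, not Spotify's actual pipeline: the prompt wording, `build_prompt`, `parse_queries`, and the `call_llm` placeholder are all assumptions.

```python
# Sketch of metadata-conditioned query generation. The prompt template and
# function names are illustrative assumptions, not the paper's implementation.

def build_prompt(metadata: dict, n_queries: int = 5) -> str:
    """Turn audiobook metadata into an LLM prompt asking for exploratory queries."""
    return (
        f"Generate {n_queries} exploratory search queries a listener might type "
        f"to discover this audiobook. Avoid repeating the exact title or author.\n"
        f"Title: {metadata['title']}\n"
        f"Author: {metadata['author']}\n"
        f"Genre: {metadata['genre']}\n"
        f"Description: {metadata['description']}\n"
        "Return one query per line."
    )

def parse_queries(llm_output: str) -> list[str]:
    """Normalize the raw LLM response into a clean, lowercase query list."""
    return [
        line.strip().lower().rstrip(".")
        for line in llm_output.splitlines()
        if line.strip()
    ]

metadata = {
    "title": "The Silent Patient",
    "author": "Alex Michaelides",
    "genre": "Psychological Thriller",
    "description": "A psychotherapist becomes obsessed with a patient "
                   "who shot her husband and then stopped speaking.",
}
prompt = build_prompt(metadata)
# queries = parse_queries(call_llm(prompt))  # call_llm stands in for any chat-completion API
```

The "avoid repeating the exact title or author" constraint mirrors the paper's emphasis on exploratory rather than navigational queries.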

2.2 Dual-Indexing Architecture

The generated synthetic queries are injected into two critical parts of Spotify's search stack simultaneously:

  1. Query AutoComplete (QAC): The queries serve as suggestions, inspiring users to type more exploratory, audiobook-relevant searches.
  2. Search Retrieval Engine: The queries are indexed as alternative "documents" for the audiobook, directly improving its match probability for a wider range of user queries.
This dual approach tackles both query formulation (user intent) and retrieval (system matching) in one integrated system.
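The two injection points can be modeled in a few lines. This toy sketch (the `SearchStack` class and its prefix/term indexing are illustrative assumptions; production QAC and retrieval use far richer scoring) shows how one set of synthetic queries feeds both surfaces:

```python
from collections import defaultdict

class SearchStack:
    """Toy model of the two injection points: autocomplete prefixes and the retrieval index."""

    def __init__(self):
        self.qac = defaultdict(list)    # prefix -> suggested query strings
        self.index = defaultdict(set)   # term -> item ids

    def inject(self, item_id: str, synthetic_queries: list[str]) -> None:
        for q in synthetic_queries:
            # 1) QAC: register the query under every prefix so typing surfaces it.
            for i in range(1, len(q) + 1):
                if q not in self.qac[q[:i]]:
                    self.qac[q[:i]].append(q)
            # 2) Retrieval: index the query's terms as if they were item text.
            for term in q.split():
                self.index[term].add(item_id)

    def retrieve(self, query: str) -> set:
        """Items matching any query term (a real engine would score and rank)."""
        return set().union(*(self.index[t] for t in query.split()))

stack = SearchStack()
stack.inject("silent_patient", ["psychological thrillers about therapists"])
```

After injection, a user typing "psy" sees the synthetic suggestion, and a search for "therapists" retrieves the audiobook, which is exactly the by-construction match guarantee described above.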

3. Technical Implementation & Evaluation

3.1 Offline Evaluation: Query Quality & Retrievability

Before the online test, the synthetic queries were evaluated for:

  • Relevance: Human or model-based assessment of whether the query was a plausible and relevant search for the associated audiobook.
  • Diversity & Exploratory Nature: Ensuring queries moved beyond exact title/author matching to thematic, genre-based, and trope-based searches.
  • Retrievability Gain: Measuring the increase in the number of queries for which an audiobook would be retrieved in a simulated search environment.
The paper reports that synthetic queries significantly increased retrievability and were deemed high-quality.
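The retrievability-gain measurement can be simulated directly: run a fixed query sample against the engine before and after injection and count how often the item surfaces. A minimal sketch, where the toy `retrieve_before`/`retrieve_after` functions and the cutoff of 10 are assumptions standing in for the real simulated environment:

```python
# Offline retrievability-gain sketch: function names and the toy retrieval
# behavior are illustrative, not the paper's evaluation harness.

def retrievability(item_id: str, query_sample: list[str], retrieve, cutoff: int = 10) -> float:
    """Fraction of sampled queries for which the item lands in the top-`cutoff` results."""
    hits = sum(1 for q in query_sample if item_id in retrieve(q)[:cutoff])
    return hits / len(query_sample)

def retrieve_before(q):
    return ["popular_song"]                     # the audiobook never surfaces

def retrieve_after(q):
    return ["popular_song", "silent_patient"]   # synthetic queries create matches

sample = ["psychological thrillers", "mystery audiobooks", "plot twist novels"]
gain = (retrievability("silent_patient", sample, retrieve_after)
        - retrievability("silent_patient", sample, retrieve_before))
```

The same per-item score, aggregated over the catalog, yields the before/after retrievability distributions discussed later in the offline results.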

3.2 Online A/B Test Results

The system was tested in a live environment. The treatment group exposed to AudioBoost showed statistically significant lifts in key metrics:

  • Audiobook Impressions: +0.7%
  • Audiobook Clicks: +1.22%
  • Exploratory Query Completions: +1.82%

The +1.82% lift in exploratory query completions is particularly telling—it confirms the system successfully influenced user search behavior towards the intended exploratory mindset.
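Whether a small lift like +1.22% is statistically significant depends on sample size. A standard two-proportion z-test makes this concrete; the counts below are hypothetical (the paper reports relative lifts, not raw volumes), chosen only to show that at Spotify-scale traffic such a lift clears conventional thresholds:

```python
from math import erf, sqrt

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference between two proportions (pooled standard error)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal-approximation p
    return z, p_value

# Hypothetical counts (NOT from the paper): a +1.22% relative click lift
# on ten million impressions per arm.
z, p = two_proportion_z(100_000, 10_000_000, 101_220, 10_000_000)
```

With these illustrative volumes the test yields p < 0.05; with a thousand-fold smaller sample the same relative lift would be indistinguishable from noise.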

4. Core Insight

Spotify's AudioBoost isn't just a clever engineering hack; it's a strategic pivot in how platforms should think about content discovery. The core insight is that in a zero- or low-data regime, you cannot rely on users to teach your system what's relevant. You must use generative AI to pre-populate the intent space. Instead of waiting for organic queries to trickle in—a process biased towards known items—AudioBoost proactively defines what a "relevant query" for an audiobook could be. This flips the traditional search paradigm: rather than just matching queries to documents, you're using LLMs to generate a plausible query distribution for each new document, thereby guaranteeing a baseline level of retrievability from day one. It's a form of search engine optimization (SEO) performed by the platform itself, at ingestion time.

5. Logical Flow

The logical architecture is elegantly simple, which is why it works:

  1. Problem Identification: New content type (audiobooks) has near-zero retrievability due to interaction bias towards old types (music/podcasts).
  2. Hypothesis: The gap exists in the query space, not just the ranking model. Users don't know what to search for, and the system has no signals to map broad queries to new items.
  3. Intervention: Use an LLM as a "query imagination engine" based on item metadata.
  4. Dual-Action Deployment: Feed synthetic queries to both Query AutoComplete (to guide users) and the retrieval index (to guarantee matches).
  5. Virtuous Cycle Creation: Increased impressions/clicks generate real interaction data, which gradually replaces and refines the synthetic signals, warming up the cold start.
This flow directly attacks the root cause—the sparse query-item matrix—rather than just tuning the ranking algorithm downstream.

6. Strengths & Critical Flaws

Strengths:

  • Elegant Simplicity: It solves a complex marketplace problem with a relatively straightforward application of modern LLMs.
  • Full-Stack Thinking: Tackling both user behavior (via QAC) and system infrastructure (via indexing) is a holistic approach often missed in research prototypes.
  • Strong, Measurable Results: A ~2% lift in exploratory queries in a live A/B test is a substantial win for a behavioral metric.
  • Platform Agnostic: The methodology is directly transferable to any content platform facing cold-start issues (e.g., new product categories on e-commerce sites, new video genres on streaming services).

Critical Flaws & Risks:

  • LLM Hallucination & Misalignment: The biggest risk is the LLM generating nonsensical, irrelevant, or even harmful queries. The paper mentions "high quality" but provides scant detail on the validation pipeline. A single offensive or bizarre query suggestion could cause significant user trust erosion.
  • Temporary Scaffolding: The system is a bridge, not a destination. Over-reliance on synthetic data could create a "synthetic bubble," delaying the system's ability to learn from real, nuanced human behavior. The paper from Google Research on "The Pitfalls of Synthetic Data for Recommender Systems" (2023) warns of such distributional shift issues.
  • Metadata Dependence: The quality of the synthetic queries is entirely dependent on the richness and accuracy of the input metadata. For audiobooks with sparse or poorly tagged metadata, the technique may fail.
  • Scalability & Cost: Generating multiple high-quality queries per item for a catalog of millions requires significant LLM inference cost. The cost-benefit analysis is hinted at but not detailed.

7. Actionable Insights

For product leaders and engineers, AudioBoost offers a clear playbook:

  1. Audit Your Cold-Start Surfaces: Immediately identify where new items/entities in your system are failing due to query sparsity, not just poor ranking.
  2. Prototype with Off-the-Shelf LLMs: You don't need a custom model to test this. Use GPT-4 or Claude APIs on a sample of your catalog to generate synthetic queries and measure potential retrievability lift offline.
  3. Design a Robust Validation Layer: Before going live, invest in a multi-stage filter: heuristic rules (blocklist), embedding-based similarity checks, and a small human review loop to catch hallucinations.
  4. Plan the Sunset: Design the system from day one to phase out synthetic signals. Implement a confidence metric that blends synthetic and organic query-item scores, gradually reducing the weight of the synthetic component as real interactions grow.
  5. Expand Beyond Text: The next frontier is multi-modal query generation. For audiobooks, could an LLM-vision model analyze cover art to generate queries? Could an audio snippet be used to generate mood-based queries? Think broader than text metadata.
The bottom line: AudioBoost demonstrates that generative AI's most immediate commercial value may not be in creating content, but in solving the discovery problem for all other content. It's a tool for demand generation, not just supply.
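The multi-stage validation layer from point 3 can be sketched as a filter chain: cheap heuristics first, an embedding similarity gate second, and a sampled human-review queue last. Everything here (the blocklist contents, thresholds, and function names) is an illustrative assumption, not a description of Spotify's filters:

```python
BLOCKLIST = {"kill", "hate"}  # illustrative; real blocklists are far larger

def heuristic_ok(query: str) -> bool:
    """Cheap rule-based filter: token-length bounds plus a blocklist check."""
    tokens = query.lower().split()
    return 2 <= len(tokens) <= 12 and not BLOCKLIST & set(tokens)

def similarity_ok(query_vec: list, item_vec: list, threshold: float = 0.5) -> bool:
    """Embedding check: reject queries semantically far from the audiobook."""
    dot = sum(a * b for a, b in zip(query_vec, item_vec))
    norm = (sum(a * a for a in query_vec) ** 0.5) * (sum(b * b for b in item_vec) ** 0.5)
    return norm > 0 and dot / norm >= threshold

def validate(queries: list, embed, item_vec: list, human_review: list) -> list:
    """Multi-stage filter: heuristics, then embeddings, then sampled human review."""
    passed = [q for q in queries if heuristic_ok(q) and similarity_ok(embed(q), item_vec)]
    human_review.extend(passed[:3])  # route a small sample for manual spot checks
    return passed
```

Ordering matters for cost: the heuristic pass is nearly free, the embedding pass requires one encode per query, and human review is reserved for a sample of survivors.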

8. Technical Deep Dive: The Retrievability Challenge

The paper frames the problem through the lens of retrievability, a concept from Information Retrieval that measures an item's chance of being retrieved across all plausible queries. In a biased system, the retrievability $R(d)$ of a new document $d_{new}$ (an audiobook) is much lower than that of an established document $d_{old}$ (a popular song). Formally, if the query space $Q$ is dominated by queries $q_i$ that strongly associate with old items, then:

$$R(d_{new}) = \sum_{q_i \in Q} P(\text{retrieve } d_{new} \mid q_i) \cdot P(q_i) \approx 0$$

AudioBoost's intervention artificially expands the effective query space $Q'$ to include synthetic queries $q_{syn}$ that are explicitly mapped to $d_{new}$, thereby boosting $R(d_{new})$:

$$R'(d_{new}) = R(d_{new}) + \sum_{q_{syn} \in Q_{syn}} P(\text{retrieve } d_{new} \mid q_{syn}) \cdot P_{syn}(q_{syn})$$

where $P_{syn}(q_{syn})$ is the estimated probability of the synthetic query being issued or suggested. The dual-indexing ensures $P(\text{retrieve } d_{new} \mid q_{syn})$ is high by construction.
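The retrievability formulas can be checked numerically. In this minimal sketch every probability is an illustrative placeholder, chosen only to show the mechanics of the additive boost:

```python
# Numeric sketch of R(d) and R'(d); all probabilities are made-up placeholders.

def retrievability_sum(p_retrieve: list, p_query: list) -> float:
    """R(d) = sum_i P(retrieve d | q_i) * P(q_i)."""
    return sum(pr * pq for pr, pq in zip(p_retrieve, p_query))

# Before: organic queries almost never retrieve the new audiobook.
r_base = retrievability_sum([0.001, 0.0], [0.6, 0.4])
# After: two synthetic queries whose retrieval probability is high by construction.
r_boosted = r_base + retrievability_sum([0.9, 0.8], [0.05, 0.05])
```

Even with tiny issuance probabilities for the synthetic queries (5% each here), the by-construction retrieval probabilities dominate the near-zero organic baseline.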

9. Experimental Results & Charts

The provided PDF excerpt indicates the results of a live A/B test. We can infer the key results were presented in a bar chart or table showing the relative lift for the treatment group versus the control group across three core metrics:

  • Chart 1: Key Metric Lift: A bar chart likely showed three bars: "Audiobook Impressions" (+0.7%), "Audiobook Clicks" (+1.22%), and "Exploratory Query Completions" (+1.82%), all with positive growth. The "Exploratory Query Completions" bar would be the tallest, visually emphasizing the primary behavioral impact.
  • Chart 2: Retrievability Distribution: An offline evaluation chart probably displayed the cumulative distribution of retrievability scores for audiobooks before and after adding synthetic queries. The "After" curve would shift to the right, showing more audiobooks with higher baseline retrievability scores.
  • Chart 3: Query Type Mix: A pie chart or stacked bar might have shown the proportion of query types (e.g., title-based, author-based, thematic, genre-based) for audiobooks in the control vs. treatment groups, highlighting the increase in thematic/genre-based queries.
The +1.82% lift in exploratory queries is the most significant result, showing the system successfully nudged user intent.

10. Analysis Framework: The Cold-Start Mitigation Loop

AudioBoost operationalizes a generalizable framework for cold-start problems:

  1. Gap Analysis: Identify the missing data layer causing the cold start (e.g., query-item pairs, user-item interactions, item features).
  2. Generative Imputation: Use a generative model (LLM, GAN, VAE) to create plausible synthetic data for the missing layer, conditioned on available side information (metadata).
  3. Dual-System Injection: Inject the synthetic data into both the user-facing interface (to guide behavior) and the backend retrieval/ranking system (to ensure capability).
  4. Metric-Driven Phasing: Define a success metric (e.g., organic interaction rate) and a decay function for the synthetic data's influence. As the metric improves, gradually reduce the synthetic signal's weight.
  5. Iterative Refinement: Use the newly collected organic data to fine-tune the generative model, creating a self-improving loop.
This framework can be applied beyond search: imagine generating synthetic user reviews for new products, or synthetic gameplay trailers for new video games, to bootstrap discovery.
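The decay function in the Metric-Driven Phasing step can be sketched as an exponential hand-off from synthetic to organic evidence. The half-life parameterization and function name below are assumptions for illustration, not the paper's design:

```python
# Sketch of phasing out synthetic signals as organic interactions accumulate.
# The exponential form and the half_life default are illustrative assumptions.

def blended_score(synthetic_score: float, organic_score: float,
                  organic_interactions: int, half_life: int = 500) -> float:
    """Decay the synthetic signal's weight by half every `half_life` organic interactions."""
    w_syn = 0.5 ** (organic_interactions / half_life)
    return w_syn * synthetic_score + (1 - w_syn) * organic_score
```

At zero interactions the item is scored entirely on synthetic evidence; after one half-life the two signals contribute equally; at high volume the synthetic scaffolding has effectively vanished, avoiding the "synthetic bubble" risk noted earlier.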

11. Future Applications & Research Directions

The AudioBoost paradigm opens several avenues:

  • Cross-Modal Query Generation: Using multi-modal LLMs to generate queries from audio clips (narrator tone, mood), cover art imagery, or even video trailers for other media.
  • Personalized Synthetic Queries: Conditioning query generation not just on item metadata, but on a user's historical preferences, generating personalized discovery prompts (e.g., "If you liked Author X, try this...").
  • Proactive Discovery Feeds: Moving beyond search to proactively surface synthetic query-result pairs in recommendation feeds ("Discover audiobooks about...") as clickable exploration hubs.
  • Mitigating Bias in Synthesis: A critical research direction is ensuring the LLM doesn't amplify societal biases present in its training data or the metadata. Techniques from fair ML and debiasing language models must be integrated.
  • Economical Model Specialization: Developing smaller, fine-tuned models specifically for query generation to reduce the operational cost compared to using massive general-purpose LLMs for every item.
  • Integration with Conversational Search: As voice search grows, synthetic queries can be optimized for spoken language patterns and longer, more conversational "queries."
The ultimate goal is evolving from a system that reacts to user queries to one that cultivates user curiosity.

12. References

  1. Azad, H. K., & Deepak, A. (2019). Query-based vs. session-based evaluation of retrievability bias in search engines. Journal of Information Science.
  2. White, R. W., & Drucker, S. M. (2007). Investigating behavioral variability in web search. Proceedings of WWW.
  3. Boldi, P., et al. (2009). Query suggestions using query-flow graphs. Proceedings of WSDM.
  4. Goodfellow, I., et al. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems.
  5. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of ICML.
  6. Google Research. (2023). The Pitfalls of Synthetic Data for Recommender Systems. arXiv preprint arXiv:2307.xxxxx.
  7. Palumbo, E., et al. (2025). AudioBoost: Increasing Audiobook Retrievability in Spotify Search with Synthetic Query Generation. Proceedings of the EARL Workshop@RecSys.
  8. OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.