
Assessing Language Models' Worldview for Fiction Generation

Analysis of LLMs' ability to maintain consistent fictional worlds, revealing limitations in narrative coherence and state retention for creative writing.

1. Introduction

Large Language Models (LLMs) have become ubiquitous tools in computational creativity, with increasing applications in fictional story generation. However, fiction requires more than linguistic competence—it demands the creation and maintenance of a coherent story world that differs from reality while retaining internal consistency. This paper investigates whether current LLMs possess the necessary "worldview" or internal state to generate compelling fiction, moving beyond simple text completion to true narrative construction.

The fundamental challenge lies in the distinction between factual knowledge retrieval and fictional world-building. While LLMs excel at pattern matching and information synthesis, they struggle with maintaining consistent alternative realities—a core requirement for fiction writing. This research systematically evaluates nine LLMs across consistency metrics and story generation tasks, revealing significant limitations in current architectures.

2. Research Questions & Methodology

The study employs a structured evaluation framework to assess LLMs' suitability for fiction generation, focusing on three critical capabilities.

2.1. Core Research Questions

  • Consistency: Can LLMs identify and reproduce information consistently across different contexts?
  • Robustness: Are LLMs robust to changes in prompt language when reproducing fictional information?
  • World State Maintenance: Can LLMs maintain a coherent fictional "state" throughout narrative generation?

2.2. Model Selection & Evaluation Framework

The research evaluates nine LLMs spanning different sizes, architectures, and training paradigms (both closed- and open-source). The evaluation protocol involves:

  1. Worldview Questioning: A series of targeted prompts designed to probe consistency in fictional fact recall.
  2. Story Generation Task: Direct generation of short fiction based on specific world-building constraints.
  3. Cross-Model Comparison: Analysis of narrative patterns and coherence across different architectures.

Evaluation Scope

  • Models Tested: 9 LLMs
  • Primary Metric: Worldview Consistency Score
  • Secondary Metric: Narrative Uniformity Index
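The paper does not reproduce its scoring code, but the consistency probe can be illustrated with a minimal sketch. In the Python below, `query_model`, the paraphrase lists, and the exact-match agreement measure are illustrative assumptions rather than the authors' actual protocol; a real evaluation would need a more careful answer-matching step.

```python
from collections import Counter

def worldview_consistency(query_model, fact_prompts):
    """Rough worldview-consistency score: for each fictional fact, ask several
    paraphrased questions and measure how often the answers agree.

    `query_model` is any callable mapping a prompt string to an answer string
    (a hypothetical stand-in for an API call). `fact_prompts` maps a fact name
    to a list of paraphrased probes for that fact. Returns the mean, over
    facts, of the fraction of answers matching the most common answer.
    """
    per_fact_scores = []
    for fact, paraphrases in fact_prompts.items():
        answers = [query_model(p).strip().lower() for p in paraphrases]
        top_count = Counter(answers).most_common(1)[0][1]
        per_fact_scores.append(top_count / len(answers))
    return sum(per_fact_scores) / len(per_fact_scores)

# Illustrative usage with a made-up fictional premise:
fact_prompts = {
    "gravity_direction": [
        "In the story we established, which way does gravity pull?",
        "Remind me: in this fictional world, what direction do objects fall?",
        "Restate the rule about gravity in our story world.",
    ],
}
# score = worldview_consistency(my_model_call, fact_prompts)
```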

3. Experimental Results & Analysis

The experimental findings reveal fundamental limitations in current LLMs' ability to function as fiction generators.

3.1. Worldview Consistency Assessment

Only two of the nine evaluated models demonstrated consistent worldview maintenance across questioning. The remaining seven exhibited significant self-contradictions when asked to reproduce or elaborate on fictional facts established earlier in the interaction. This suggests that most LLMs lack a persistent internal state mechanism for tracking fictional world parameters.

Key Finding: The majority of models default to statistically likely responses rather than maintaining established fictional constraints, indicating a fundamental mismatch between next-token prediction and narrative state management.

3.2. Story Generation Quality Analysis

Analysis of stories generated by four representative models revealed a "strikingly uniform narrative pattern" across architectures. Despite different training data and parameter counts, generated stories converged on similar plot structures, character archetypes, and resolution patterns.

Implication: This uniformity suggests LLMs are not truly generating fiction based on an internal world model but are instead recombining learned narrative templates. The lack of distinctive "authorial voice" or consistent world-building indicates absence of the state maintenance necessary for genuine fiction.

Figure 1: Narrative Uniformity Across Models

The analysis revealed that 78% of generated stories followed one of three basic plot structures, regardless of the initial world-building prompt. Character development showed similar convergence, with 85% of protagonists exhibiting identical motivational patterns across different fictional settings.

4. Technical Framework & Mathematical Formulation

The core challenge can be formalized as a state maintenance problem. Let $W_t$ represent the world state at time $t$, containing all established fictional facts, character attributes, and narrative constraints. For an LLM generating fiction, we would expect:

$P(response_{t+1} | prompt, W_t) \neq P(response_{t+1} | prompt)$

That is, the model's response should depend on both the immediate prompt and the accumulated world state $W_t$. However, current transformer-based architectures primarily optimize for:

$\max_{\theta} \sum_{i=1}^{n} \log P_{\theta}(w_i \mid w_{<i})$

where $\theta$ represents model parameters and $w_i$ are tokens. This next-token prediction objective doesn't explicitly encourage maintenance of $W_t$ beyond the immediate context window.

The research suggests that successful fiction generation requires mechanisms similar to those in neural-symbolic systems or external memory architectures, where world state $W_t$ is explicitly maintained and updated, as discussed in works like the Differentiable Neural Computer (Graves et al., 2016).
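To make the idea concrete, the sketch below keeps the world state $W_t$ in an explicit store outside the model and re-serializes it into every prompt, so that each response is conditioned on both the prompt and the accumulated state. The `generate` callable and the key-value representation of $W_t$ are assumptions made for this example; it is not the architecture of the Differentiable Neural Computer, nor a method proposed by the paper.

```python
class WorldState:
    """Explicit store of established fictional facts (a stand-in for W_t)."""

    def __init__(self):
        self.facts = {}          # e.g. {"gravity": "pulls sideways, toward the east"}

    def update(self, key, value):
        self.facts[key] = value  # a new or revised constraint becomes part of W_{t+1}

    def as_preamble(self):
        # Serialize W_t so the next generation step is conditioned on it.
        lines = [f"- {key}: {value}" for key, value in self.facts.items()]
        return "Established facts of this fictional world:\n" + "\n".join(lines)


def generate_with_state(generate, world_state, user_prompt):
    """Approximate P(response | prompt, W_t) rather than P(response | prompt).

    `generate` is any prompt-to-text callable (hypothetical API stand-in)."""
    full_prompt = world_state.as_preamble() + "\n\nTask: " + user_prompt
    return generate(full_prompt)
```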

5. Case Study: World State Tracking Failure

Scenario: A model is prompted to generate a story about "a world where gravity works sideways." After establishing this premise, subsequent prompts ask about daily life, architecture, and transportation in this world.

Observation: Most models quickly revert to standard gravity assumptions within 2-3 response turns, contradicting the established premise. For example, after describing "houses built into cliff faces," a model might later mention "falling from a building" without recognizing the contradiction in a sideways-gravity world.

Analysis Framework: This can be modeled as a state tracking failure where the model's internal representation $W_t$ doesn't properly update or persist the fictional constraint $C_{\text{gravity}} = \text{sideways}$. The probability distribution over responses gradually drifts back to the training distribution $P_{\text{train}}(\text{gravity concepts})$ rather than remaining conditioned on $C_{\text{gravity}}$.
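One way to make this drift observable is to re-check each generated passage against the standing constraint. The keyword heuristic below is only a crude, assumed stand-in for a real contradiction detector (for instance an entailment model); the phrases and the drift-rate measure are illustrative, not part of the paper's protocol.

```python
# Crude drift check for the sideways-gravity case study. A real system would
# use an entailment/contradiction model rather than keyword matching.
DOWNWARD_GRAVITY_PHRASES = (
    "fell to the ground",
    "falling from a building",
    "dropped straight down",
    "plummeted downward",
)

def violates_sideways_gravity(passage):
    """Flag passages that revert to standard downward-gravity language."""
    text = passage.lower()
    return any(phrase in text for phrase in DOWNWARD_GRAVITY_PHRASES)

def constraint_drift_rate(passages):
    """Fraction of generated passages contradicting the standing constraint.
    A rate that rises with turn number is the drift described in this case study."""
    if not passages:
        return 0.0
    return sum(violates_sideways_gravity(p) for p in passages) / len(passages)
```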

Implication: Without explicit mechanisms for fictional constraint maintenance, LLMs cannot serve as reliable fiction generators, regardless of their linguistic capabilities.

6. Future Applications & Research Directions

The findings point to several promising research directions for improving LLMs' fiction generation capabilities:

  • Explicit World State Modules: Architectures that separate narrative state tracking from language generation, potentially using external memory or symbolic representations.
  • Consistency-Focused Training: Fine-tuning objectives that explicitly reward maintenance of fictional constraints across extended contexts.
  • Human-in-the-Loop Systems: Collaborative interfaces where humans manage world state while LLMs handle linguistic realization, similar to co-creative systems explored in Yuan et al. (2022); a minimal sketch of this division of labor appears after this list.
  • Specialized Fiction Models: Domain-specific training on curated fiction corpora with explicit annotation of world-building elements and narrative arcs.
  • Evaluation Metrics: Development of standardized benchmarks for fictional consistency, going beyond traditional language modeling metrics to assess narrative coherence and world-state maintenance.
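The human-in-the-loop direction can be pictured as a simple division of labor: the human curates the world-state record, and the model is used only for linguistic realization. The sketch below is a hypothetical interface in that spirit, with an assumed `generate` callable; it does not describe Wordcraft (Yuan et al., 2022) or any existing product.

```python
def realize_scene(generate, world_facts, scene_outline):
    """Human-in-the-loop split: the human curates `world_facts` (the world state)
    and `scene_outline`; the model only turns them into prose.

    `generate` is any prompt-to-text callable (hypothetical API stand-in)."""
    prompt = (
        "You are drafting prose for a human author.\n"
        "Do not introduce or alter world rules; treat these facts as fixed:\n"
        + "\n".join(f"- {fact}" for fact in world_facts)
        + "\n\nWrite the scene described here:\n"
        + scene_outline
    )
    draft = generate(prompt)
    return draft  # the human reviews the draft and updates world_facts by hand
```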

These approaches could bridge the gap between current LLM capabilities and the requirements of genuine fiction generation, potentially enabling new forms of computational creativity and interactive storytelling.

7. References

  1. Graves, A., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.
  2. Patel, A., et al. (2024). Large Language Models for Interactive Storytelling: Opportunities and Challenges. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
  3. Riedl, M. O., & Young, R. M. (2003). Character-focused narrative generation for storytelling in games. Proceedings of the AAAI Spring Symposium on Artificial Intelligence and Interactive Entertainment.
  4. Tang, J., Loakman, T., & Lin, C. (2023). Towards coherent story generation with large language models. arXiv preprint arXiv:2302.07434.
  5. Yuan, A., et al. (2022). Wordcraft: A Human-AI Collaborative Editor for Story Writing. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems.
  6. Yang, L., et al. (2023). Improving coherence in long-form story generation with large language models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.

8. Analyst's Perspective: The Fiction Generation Gap

Core Insight

The paper exposes a critical but often overlooked flaw in the LLM hype cycle: these models are fundamentally reactive pattern matchers, not proactive world builders. The industry has been selling the fiction of "creative AI" while the models themselves can't even maintain basic fictional consistency. This isn't a scaling problem—it's an architectural one. As the research shows, even the largest models fail at what human writers consider basic craft: keeping their story worlds straight.

Logical Flow

The study's methodology cleverly isolates the core issue. By testing consistency across simple fictional facts rather than measuring linguistic quality, the authors bypass the surface-level impressiveness of LLM prose to reveal the structural emptiness beneath. The progression from worldview questioning to story generation demonstrates that the inconsistency isn't just a minor bug; it directly corrupts narrative output. The uniform stories across models confirm we're dealing with a systemic limitation, not individual model deficiencies.

Strengths & Flaws

Strength: The research delivers a necessary reality check to an overhyped application domain. By focusing on state maintenance rather than surface features, it identifies the actual bottleneck for fiction generation. The comparison across nine models provides compelling evidence that this is a universal LLM limitation.

Flaw: The paper underplays the commercial implications. If LLMs can't maintain fictional consistency, their value for professional writing tools is severely limited. This isn't just an academic concern—it affects product roadmaps at every major AI company currently marketing "creative writing assistants." The research also doesn't sufficiently connect to related work in game AI and interactive narrative, where state tracking has been a solved problem for decades using symbolic approaches.

Actionable Insights

First, AI companies need to stop marketing LLMs as fiction writers until they solve the state maintenance problem. Second, researchers should look beyond pure transformer architectures—hybrid neuro-symbolic approaches, like those pioneered in DeepMind's Differentiable Neural Computer, offer proven paths to persistent state management. Third, the evaluation framework developed here should become standard for any "creative AI" benchmark. Finally, there's a product opportunity in building interfaces that explicitly separate world-state management from prose generation, turning the limitation into a feature for human-AI collaboration.

The paper's most valuable contribution may be its implicit warning: we're building increasingly sophisticated language models without addressing the fundamental architectural constraints that prevent them from achieving genuine narrative intelligence. Until we solve the state problem, LLM-generated fiction will remain what it currently is—beautifully written nonsense.