S2tory: Story Spine Distillation for Movie Script Summarization
Abstract
Movie scripts pose a fundamental challenge for automatic summarization due to their non-linear, cross-cut narrative structure, which makes surface-level saliency methods ineffective at preserving core story progression. To address this, we introduce S2tory (Story Spine Distillation), a narratology-grounded framework that leverages character development trajectories to identify plot nuclei, the essential events that drive the narrative forward, while filtering out peripheral satellite events that merely enrich atmosphere or emotion. Our Narrative Expert Agent (NEAgent) performs theory-constrained reasoning, whose distilled knowledge conditions a small model to identify plot nuclei. Another model then uses these plot nuclei to generate the summary. Experiments on the MovieSum dataset demonstrate state-of-the-art semantic fidelity at approximately 3.5× compression, and zero-shot evaluation on BookSum confirms strong out-of-domain generalization. Human evaluation further validates that narratological theory provides an indispensable foundation for modeling complex, non-linear narratives.
Keywords: Screenplay Summarization Narratology.
1 Introduction
Large language models (LLMs) have achieved remarkable progress in text understanding. However, their performance declines on long-form and structurally complex narratives, especially in maintaining the core storyline. This limitation becomes particularly evident in movie script summarization, which poses unique challenges [17] beyond traditional text summarization.
Long-form summarization typically adopts a two-stage process: shortening the source, then generating the summary [13]. Among recent approaches in movie script summarization, MENSA [18] selects scenes based on saliency estimation, while DiscoGraMS [6] constructs a graph over characters and dialogues to model cross-scene coherence. Despite their effectiveness, these approaches remain largely data-driven and rely on lexical or shallow structural patterns.
Movie scripts are shaped by symbolic narrative structures, a foundational element of screenwriting, and effective summarization must recover this underlying logic. Capturing these structures requires reasoning about how events function within the story rather than how they appear in text. This gap motivates a model that integrates symbolic representation with neural abstraction to reconstruct the narrative backbone of a movie, an aspect largely overlooked in previous work.
Inspired by Barthesian narrative theory [1], we posit that movie script summarization requires explicit modeling of narrative structure to overcome the limitations of saliency-based approaches. Accordingly, we propose S2tory, a framework that identifies plot nuclei as the essential events forming the core narrative thread and uses them as the foundation for high-fidelity summarization.
In S2tory, a Narrative Expert Agent (NEAgent) follows narratological principles to identify plot nuclei essential to narrative integrity. The NEAgent analyzes the screenplay through character-arc trajectories, using character development to determine the indispensability of plot events. A student model distills this reasoning, and a fine-tuned summarizer performs nuclei-conditioned summarization.
In summary, our main contributions are as follows:
- We propose S2tory, a narratologically-grounded framework that includes a NEAgent for identifying plot nuclei through character-arc trajectories.
- Within S2tory, we introduce a distillation step where a student model learns the reasoning of the NEAgent, and a fine-tuned summarizer performs nuclei-conditioned summarization.
- We achieve state-of-the-art performance on the MovieSum benchmark, and demonstrate out-of-distribution generalization on BookSum through automated validation, human evaluation, and case studies.
2 Related Work
2.1 Computational Narratology
Computational narratology, rooted in classical narratology [4], aims to operationalize how stories produce meaning. Early approaches focused on symbolic representations, such as story grammars [16], plot units [11], and character-centered scripts [19], yielding interpretable but domain-limited frameworks. The field subsequently evolved into LLM-driven approaches, which can implicitly capture narrative patterns in text [8]. However, despite their proficiency in narrative processing, LLMs lack a deep understanding of underlying relational structures and narrative functions [3]. Recent studies have re-grounded narrative modeling in theory via turning points and events [9], yet they still fail to capture the Barthesian [1] distinction between structural necessity and surface prominence.
2.2 Long-Text Summarization
Long-text summarization poses distinct challenges for LLMs. Transformer architectures face quadratic complexity, leading to long-context variants such as Longformer [2], BigBird [21], and LongT5 [7]. Although these models extend context windows, they only partially capture global structure and still treat narratives as flat token sequences, missing the multi-level organization [2] inherent in human storytelling. In movie script summarization, datasets like SummScreen [5] and MovieSum [17] highlight the need to model cross-scene coherence and character-driven progression. Their coherence arises from causal and thematic continuity rather than textual adjacency. Building on this, structure-aware models such as DiscoGraMS [6] and ScreenWriter [12] incorporate discourse or character graphs but remain limited to surface cues, modeling who interacts with whom but not why those interactions matter narratively.
3 Narrative Theoretical Formulation
3.1 Narratological Foundation
S2tory is grounded in Barthes’ narratological distinction between nuclei and satellites [1]. In a narrative, nuclei are essential events that drive the core progression of the story, while satellites are auxiliary elements that enrich atmosphere or emotion without altering the narrative trajectory.
3.2 Narrative World Modeling
In computational narrative modeling, we represent a story world as a quadruple $\mathcal{W} = (C, E, S, R)$, where $C$ is the set of characters, $E$ is the set of events, $S$ is the space of character states, and $R$ is the set of state transition relations.
Each transition $r: s_t \to s_{t+1}$ is an element of $R$, capturing a character’s developmental trajectory. Based on the causal origin of $r$, we classify it into two types: intrinsic and extrinsic transitions. Intrinsic transitions arise from internal processes such as reflection, cognitive reevaluation, or emotional realization. Extrinsic transitions, by contrast, are triggered by external events $e \in E$, representing environmental or social influences that compel a character to adapt.
Our approach focuses on identifying events that directly shape character development. We define a subset $E^{*} \subseteq E$ as the collection of events that causally induce a state transition in at least one character:
$$E^{*} = \{\, e \in E \mid \exists\, r \in R,\ r: s_t \to s_{t+1},\ \text{such that}\ r \leftarrow e \ \wedge\ s_{t+1} \neq s_t \,\} \tag{1}$$
where $r \leftarrow e$ indicates that the transition $r$ is causally dependent on event $e$, and $s_{t+1} \neq s_t$ ensures that the event results in an actual change in the character’s state. This definition operationalizes narrative significance:
- An event is considered narratively relevant if it leads to a detectable change in a character’s state.
- Conversely, an event whose removal blocks character growth is crucial, marking a plot turning point.
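The selection in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the S2tory implementation: the `Transition` record, its field names, and the toy states for Leon are all assumptions made for the example, and an intrinsic transition is modeled simply as one with no triggering event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transition:
    """One character state transition (hypothetical record layout)."""
    character: str
    before: frozenset          # attribute set before the transition
    after: frozenset           # attribute set after the transition
    event: Optional[str]       # id of the triggering event; None means intrinsic


def state_changing_events(events, transitions):
    """Return the events that some transition depends on and that change a state."""
    selected = set()
    for r in transitions:
        depends_on_event = r.event is not None and r.event in events
        state_changed = r.before != r.after
        if depends_on_event and state_changed:
            selected.add(r.event)
    return selected


events = {"e1", "e2", "e3"}
transitions = [
    # e1 changes Leon's state, so it is selected.
    Transition("Leon", frozenset({"stranger"}), frozenset({"stranger", "distrusted"}), "e1"),
    # e2 is linked to a transition that changes nothing, so it is skipped.
    Transition("Leon", frozenset({"distrusted"}), frozenset({"distrusted"}), "e2"),
    # An intrinsic transition has no event and contributes nothing.
    Transition("Leon", frozenset({"distrusted"}), frozenset({"trusted"}), None),
]
selected_events = state_changing_events(events, transitions)
print(selected_events)
```

Both conditions of the definition appear as separate booleans so that either can be swapped for a learned or prompted judgment without touching the loop.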
3.3 Modeling Character Dynamics
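Each character $c$ maintains a time-dependent attribute set $A_c(t)$ representing evolving properties such as identity, goals, or affiliations.
A state transition $r: A_c(t) \to A_c(t+1)$ for character $c$ is said to be event-induced if there exists an event $e$ such that $r$ causally depends on $e$ and $A_c(t+1) \neq A_c(t)$. We classify each such transition by its update type: $\text{type}(r) \in \{\text{inc}, \text{mod}\}$, where $\text{inc}$ denotes an increment (addition of a new attribute) and $\text{mod}$ a modification (replacement of an existing attribute). The state difference associated with $r$ is:
$$\Delta(r) = (\Delta^{+}, \Delta^{-}) \tag{2}$$
with $\Delta^{+} = A_c(t+1) \setminus A_c(t)$ and $\Delta^{-} = A_c(t) \setminus A_c(t+1)$.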
For example, being knighted is an increment, adding a role without erasing history; conversely, moving cities is a modification, replacing the old location.
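The increment/modification distinction can be made concrete with plain set arithmetic. The sketch below assumes attributes are stored as strings in a Python set; the attribute labels (`role:soldier`, `city:old`, and so on) are hypothetical, and treating any transition that drops an attribute as a modification is an assumption of this example.

```python
def state_difference(before, after):
    """Return (added, removed) attributes between two attribute sets."""
    before, after = set(before), set(after)
    added = after - before
    removed = before - after
    return added, removed


def update_type(before, after):
    """Classify a transition as 'increment', 'modification', or 'none'."""
    added, removed = state_difference(before, after)
    if not added and not removed:
        return "none"              # no state change at all
    if added and not removed:
        return "increment"         # something new, nothing erased
    return "modification"          # an existing attribute was dropped (assumption)


# Being knighted adds a role without erasing history.
knighted = update_type({"role:soldier"}, {"role:soldier", "role:knight"})
# Moving cities replaces the old location.
moved = update_type({"city:old"}, {"city:new"})
print(knighted, moved)
```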
3.4 Narrative Nuclei Reasoning
This module implements the nucleus-satellite distinction via counterfactual reasoning over character state trajectories, as formalized in the previous subsections.
Narrative context. To analyze a scene $x_k$, the model uses the processed prior scenes $x_{<k}$, the structured character states $S_{<k}$, and the naturalized current scene $\tilde{x}_k$. Based on this context, each sentence $u \in \tilde{x}_k$ is evaluated to see if removing it would break character development.
Reasoning logic. Define $\text{Cont}(\cdot)$ as a predicate that returns 1 if all character state trajectories remain structurally continuous. The nucleus-satellite label is assigned by:
$$y(u) = 1 - \text{Cont}\big(x_{<k},\ S_{<k},\ \tilde{x}_k \setminus \{u\}\big) \tag{3}$$
Here, $y(u) = 1$ indicates a nucleus, and 0 a satellite. A soft prior from the transition type can modulate the sensitivity of $\text{Cont}(\cdot)$, but does not override the continuity test. For instance, given two candidate units in $\tilde{x}_k$:
(1) $u_1$: "Leon claims to be a deserter; mutual distrust is established."
(2) $u_2$: "On the road, they encounter bandits, a plague village, and a rebel checkpoint."
Given $x_{<k}$ and $S_{<k}$, the system tests whether deletion breaks any character trajectory. For $u_1$, removal disrupts Leon’s trust evolution, causing a critical break in narrative coherence. Consequently, it is classified as a nucleus ($y(u_1) = 1$). In contrast, removing $u_2$ leaves all state transitions intact and coherence preserved, marking it as a satellite ($y(u_2) = 0$).
This reasoning process defines narrative indispensability in functional terms, where an event is considered a nucleus if and only if its removal breaks the continuity of character development within the evolving symbolic world.
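The deletion test can be illustrated with a toy continuity predicate. In the paper this judgment is made by a prompted model; here, purely as an assumption for the sketch, each unit declares the state transitions it supports and continuity holds when every required transition is still supported after the deletion. The `supports` field and the transition triple for Leon are hypothetical.

```python
def cont(units, required_transitions):
    """Toy continuity predicate: 1 if every required transition is still supported."""
    supported = set()
    for unit in units:
        supported |= unit["supports"]
    return int(required_transitions <= supported)


def label_units(units, required_transitions):
    """Label each unit 1 (nucleus) if deleting it breaks continuity, else 0 (satellite)."""
    labels = {}
    for i, unit in enumerate(units):
        remaining = units[:i] + units[i + 1:]      # counterfactual: drop this unit
        labels[unit["id"]] = 1 - cont(remaining, required_transitions)
    return labels


units = [
    {"id": "u1",
     "text": "Leon claims to be a deserter; mutual distrust is established.",
     "supports": {("Leon", "unknown", "distrusted")}},
    {"id": "u2",
     "text": "On the road, they encounter bandits, a plague village, and a rebel checkpoint.",
     "supports": set()},
]
required = {("Leon", "unknown", "distrusted")}
labels = label_units(units, required)
print(labels)
```

Dropping `u1` leaves Leon's transition unsupported, so it is labeled a nucleus, while dropping `u2` changes nothing and it is labeled a satellite.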
4 Methodology
In S2tory, NEAgent models characters, infers plot nuclei under narrative constraints, and distills this reasoning into a nuclei-conditioned backbone for summarization. The overall process is illustrated in Fig. 1.
4.1 NEAgent
Implementing narratological reasoning remains challenging: symbolic models lack openness, while neural models lack narrative data. We address this by designing NEAgent, an In-Context Learning (ICL) agent that integrates narratological principles with dynamic character modeling.
At step $t$, NEAgent processes the current narrative segment $\text{story}_t$ through ICL prompts that operationalize the narrative world model (Sec. 3.2) and character dynamics formalism (Sec. 3.3). The agent combines $\text{story}_t$ with a rolling memory $M_{t-1}$, which stores evolving character states and event-character dependencies (Sec. 3.4). This creates a contextualized narrative representation. Guided by the prompts, NEAgent evaluates each narrative unit using the nucleus-satellite classification rule in Eq. (3), determining whether its removal would disrupt character trajectory continuity. The prompts explicitly encode narratological principles as actionable instructions, ensuring alignment with the theoretical framework.
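The control flow described above (segment in, rolling memory consulted, units labeled, memory updated) can be sketched as a plain loop. This is only an outline under stated assumptions: `run_agent`, `stub_judge`, and `stub_update` are hypothetical names, and the two stubs stand in for the prompted LLM calls, which are not reproduced here.

```python
from typing import Callable, Dict, List, Set

Memory = Dict[str, Set[str]]  # character name -> information recorded so far


def run_agent(segments: List[List[str]],
              judge: Callable[[str, List[str], Memory], bool],
              update_memory: Callable[[Memory, List[str]], None]) -> List[List[str]]:
    """Walk the segments in order, labeling units against a rolling memory."""
    memory: Memory = {}
    nuclei_per_segment = []
    for units in segments:
        # The judge sees the unit, its segment, and the memory built so far.
        nuclei = [u for u in units if judge(u, units, memory)]
        update_memory(memory, units)   # memory rolls forward after each segment
        nuclei_per_segment.append(nuclei)
    return nuclei_per_segment


def stub_judge(unit: str, units: List[str], memory: Memory) -> bool:
    # Placeholder for the prompted counterfactual test: does the unit name Leon?
    return "Leon" in unit


def stub_update(memory: Memory, units: List[str]) -> None:
    # Placeholder profile update: remember which units mention Leon.
    for unit in units:
        if "Leon" in unit:
            memory.setdefault("Leon", set()).add(unit)


segment = [
    "Leon claims to be a deserter; mutual distrust is established.",
    "On the road, they encounter bandits, a plague village, and a rebel checkpoint.",
]
result = run_agent([segment], stub_judge, stub_update)
print(result)
```

Passing the judge and the memory update as callables keeps the loop independent of any particular model or prompt set.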
4.2 Distillation and Summarization
NEAgent offers detailed reasoning at high token cost, which hinders large-scale use. To make this reasoning scalable, we record its analytical process as an experience context and construct a structured dataset as follows:
$$\mathcal{D}_{\text{exp}} = \{(x_i, \tau_i, N_i)\}_{i=1}^{n} \tag{4}$$
where $x_i$ is the naturalized narrative text of a scene, $\tau_i$ denotes the symbolic reasoning trace generated by the NEAgent, and $N_i$ is the corresponding set of nuclei identified through trajectory-guided counterfactual reasoning.
The reasoning trace encodes how the agent tracks character-state transitions, redefines goals, and evaluates whether removing an event would disrupt continuity within the evolving narrative structure.
The neural inducer $f_\theta$ is initialized from a 7B language model and fine-tuned on $\mathcal{D}_{\text{exp}}$. It learns a mapping:
$$f_\theta: (F, x_i) \mapsto (\hat{\tau}_i, \hat{N}_i) \tag{5}$$
where each training instance includes few-shot examples $F$ and the current input $x_i$, to generate the corresponding reasoning trace $\hat{\tau}_i$ and predicted nuclei set $\hat{N}_i$.
Finally, each backbone $\hat{B}_j$ predicted by the distilled model is paired with its corresponding gold summary $Y_j$ from the dataset, forming a new training set
$$\mathcal{D}_{\text{sum}} = \{(\hat{B}_j, Y_j)\}_{j=1}^{m} \tag{6}$$
The summarization model is then trained on $\mathcal{D}_{\text{sum}}$, where the model learns to generate reference summaries conditioned on the distilled backbones.
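The two training sets of Eqs. (4) and (6) can be pictured as simple record types. The sketch below is illustrative only: the class and field names are hypothetical, and forming a backbone by joining a script's predicted nuclei in scene order is an assumption, since the text does not spell out how per-scene nuclei are assembled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperienceRecord:
    """One distillation example: scene text, reasoning trace, nuclei (Eq. 4)."""
    scene_text: str
    reasoning_trace: str
    nuclei: List[str]


@dataclass
class SummaryRecord:
    """One summarization example: predicted backbone and gold summary (Eq. 6)."""
    backbone: str
    gold_summary: str


def build_summary_set(predicted_nuclei: Dict[str, List[List[str]]],
                      gold_summaries: Dict[str, str]) -> List[SummaryRecord]:
    """Pair each script's predicted backbone with its reference summary."""
    records = []
    for script_id, scenes in predicted_nuclei.items():
        # Assumption: the backbone is the predicted nuclei joined in scene order.
        backbone = "\n".join(nucleus for scene in scenes for nucleus in scene)
        records.append(SummaryRecord(backbone, gold_summaries[script_id]))
    return records


predicted = {"script_1": [["nucleus a", "nucleus b"], [], ["nucleus c"]]}
gold = {"script_1": "reference summary text"}
summary_set = build_summary_set(predicted, gold)
print(summary_set[0])
```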
5 Experiment
5.1 Setup and Implementation Details
NEAgent.
The NEAgent was implemented in LangGraph with GPT-4o as the reasoning engine, operating deterministically with temperature set to 0.0. See Appendix A.1 for a description of the prompts.
Reasoning Distillation.
Parameter-efficient fine-tuning (LoRA) was applied to Qwen2.5-7B-Instruct using LLaMA-Factory (32K input, 1K output) on 8×A100-80G GPUs.
Nuclei-Conditioned Summarization.
Using distilled backbones as input, Qwen2.5-0.5B-Instruct was fully fine-tuned on the backbone-summary training set $\mathcal{D}_{\text{sum}}$ (Eq. 6).
| Type | Model | R-1 | R-2 | R-L | Comp. | BERTScore-P | BERTScore-R | BERTScore-F1 |
|---|---|---|---|---|---|---|---|---|
| Extractive | Lead-512 | | | | / | 49.25 | 43.59 | 46.23 |
| | Lead-768 | | | | / | 49.29 | 45.70 | 47.41 |
| | Lead-1024 | | | | / | 49.12 | 46.91 | 47.98 |
| | TextRank | | | | / | 51.46 | 52.47 | 51.85 |
| | FLAN-UL2 | | | | 27.6% | 52.90 | 49.57 | 50.87 |
| | Vicuna | | | | 55.2% | 48.89 | 48.49 | 47.07 |
| | TextRank+Vicuna | | | | 55.2% | 59.24 | 49.05 | 53.57 |
| | MW-Vicuna | | | | 55.2% | 54.95 | 48.70 | 51.53 |
| Hybrid | S2tory (Ours) | 45.98 | 7.93 | 42.45 | 28.4% | 59.18 | 59.36 | 59.23 |
| Abstractive | LED-Desc. | 44.72 | 9.72 | 42.92 | 55.2% | 59.47 | | |
| | LED-Heur. | 44.45 | 9.78 | 42.71 | 55.2% | | | |
| | LED-Dialogue | 44.68 | 10.02 | 42.94 | 55.2% | | | |
| | LED | 44.85 | 9.83 | 43.12 | 55.2% | | | |
| | LongT5 | 41.49 | 8.54 | 39.78 | 55.2% | | | |
| | Pegasus-X | 42.42 | 8.16 | 40.63 | 55.2% | | | |
5.2 Dataset and Baselines
Dataset.
We focus on movie screenplays, which feature XML-style structural formatting. Our experiments use the MovieSum dataset [17], a near-complete superset of existing screenplay corpora. For example, 98% of the MENSA test set [18] is contained within MovieSum. To assess generalization beyond screenplay-style narratives, we also evaluate on BookSum [10], a long-form prose corpus without XML formatting, to test whether the approach transfers through narratological reasoning rather than dataset-specific pattern learning.
Baselines.
To ensure comprehensive evaluation, we compare against a diverse set of baselines. These include extractive techniques (e.g., TextRank [14]), instruction-tuned LLMs (e.g., Vicuna [22], FLAN-UL2 [20]), and long-context summarization models (e.g., Pegasus-X [15], LongT5 [7], LED [2]). All results are either adopted from prior work or reproduced under the same evaluation protocol as previous studies to ensure fair comparison and consistent evaluation metrics.
5.3 Main Results
Our experiments on the MovieSum benchmark show that S2tory effectively combines the strengths of extractive and abstractive summarization. As shown in Table 1, our model achieves strong ROUGE scores (R-1 45.98, R-2 7.93, R-L 42.45), outperforming all extractive baselines by 32-38% while remaining competitive with abstractive models.
S2tory also achieves the highest BERTScore recall (59.36) among all methods, indicating superior semantic fidelity to reference summaries. Crucially, it accomplishes this with a compression ratio of only 28.4%, roughly half that of abstractive methods (55.2%), demonstrating its ability to generate concise yet comprehensive summaries of extremely long movie scripts.
These results support our approach: by integrating extractive precision with abstractive flexibility, S2tory balances information coverage, semantic quality, and conciseness in a way that the purely extractive and purely abstractive baselines do not achieve on long-form content.
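The relation between the compression ratios in Table 1 and the compression factor quoted in the abstract is simple arithmetic. The snippet assumes that "Comp." is the retained fraction of the input, expressed in percent, so that the factor is its reciprocal; the function name is illustrative.

```python
def compression_factor(ratio_percent: float) -> float:
    """Convert a retained-length ratio in percent into a 'times shorter' factor."""
    return 100.0 / ratio_percent


s2tory_factor = round(compression_factor(28.4), 2)       # S2tory
abstractive_factor = round(compression_factor(55.2), 2)  # abstractive baselines
relative = round(28.4 / 55.2, 2)                         # S2tory ratio vs. baselines
print(s2tory_factor, abstractive_factor, relative)
```

Under this reading, 28.4% corresponds to about 3.52× compression, 55.2% to about 1.81×, and the ratio between the two settings is about 0.51.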
6 Further Analysis
We further analyze the model from four perspectives: ablation, human evaluation, out-of-domain generalization, and a qualitative case study.
6.1 Ablation Study
To assess the impact of character trajectory modeling on nucleus identification, we conduct an ablation study where the trajectory-based profiling module is removed from NEAgent. In the full model, NEAgent constructs structured profiles for each character, capturing evolving goals, identities, and inter-character dependencies, which serve as the foundation for evaluating narrative indispensability. Without this module, the agent directly applies nucleus-satellite reasoning on raw text, losing access to causal continuity in character development.
As shown in Table 2, removing trajectory modeling results in a marked drop in performance, with BERTScore F1 decreasing from 59.23 to 53.69. This confirms that modeling character development trajectories provides essential structural grounding for identifying narrative nuclei and maintaining cross-scene coherence in long-form scripts.
| Method Variant | BERTScore-P | BERTScore-R | BERTScore-F1 |
|---|---|---|---|
| NEAgent w/o trajectory profiling | 53.09 | 55.28 | 53.69 |
| NEAgent w/ trajectory profiling | 59.18 | 59.36 | 59.23 |
6.2 Human Evaluation
Although the extracted nuclei are intermediate outputs, their quality directly affects summarization performance. We conducted a human evaluation along four narratological dimensions: indispensability, coherence, character consistency, and satellite reduction. Each was rated on a five-point scale (1-5) by doctoral students trained in narrative theory. The full evaluation rubric is provided in Appendix A.2.
| Dimension | Auto Metric | Human Metric |
|---|---|---|
| Indispensability | 3.59 | 3.84 |
| Coherence | 3.79 | 3.91 |
| Character Consistency | 3.97 | 4.18 |
| Satellite Reduction | 3.41 | 3.83 |
As shown in Table 3, human ratings consistently exceed the automatic scores, particularly in satellite reduction, where the automatic rater (GPT-4o-mini) tends to overvalue descriptive or emotional details. This gap indicates that large models capture surface fluency but struggle with the functional segmentation that defines narrative structure, whereas our nuclei extraction aligns more closely with narratological judgments.
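The size of the human-versus-automatic gap per dimension follows directly from Table 3. The snippet below only re-derives the differences from the reported scores; the variable names are illustrative.

```python
# (automatic score, human score) per dimension, copied from Table 3.
scores = {
    "Indispensability": (3.59, 3.84),
    "Coherence": (3.79, 3.91),
    "Character Consistency": (3.97, 4.18),
    "Satellite Reduction": (3.41, 3.83),
}

# Human minus automatic, rounded to the table's two-decimal precision.
gaps = {dim: round(human - auto, 2) for dim, (auto, human) in scores.items()}
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```

The gaps come out as 0.25, 0.12, 0.21, and 0.42, with satellite reduction showing the largest difference.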
6.3 Out-of-Domain Generalization
To evaluate cross-domain robustness, we directly apply the 7B distilled model to the BookSum corpus without further tuning. BookSum preserves long-form narrative coherence but lacks screenplay formatting, making it suitable for testing whether the model generalizes through narratological reasoning rather than XML-specific cues.
Because BookSum has no annotated nuclei, we adopt an LLM-as-Judge protocol [22], prompting multiple LLMs to assess whether each generated nucleus constitutes a structurally essential event. Each case is labeled as positive or negative, while rejected indicates that the LLM declined to respond due to policy or copyright constraints.
| Evaluator | Positive (%) | Negative (%) | Rejected (%) |
|---|---|---|---|
| GPT-4.1 | 92.45 | 5.45 | 2.10 |
| Qwen3-235B-A22B | 78.34 | 21.24 | 0.42 |
| DeepSeek-R1-671B | 84.71 | 15.55 | 0.28 |
| Average | 85.17 | 14.08 | 0.93 |
As shown in Table 4, on average over 85% of the generated nuclei are judged as narratively essential by the LLM evaluators, demonstrating strong cross-domain generalization. This suggests that the distilled model inherits NEAgent’s character-centric causal reasoning rather than overfitting to screenplay structures, enabling consistent narrative interpretation across diverse text domains.
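The "Average" row of Table 4 is the unweighted mean over the three evaluators, which can be re-derived from the per-evaluator percentages. The snippet uses only the values printed in the table; treating the average as unweighted is an assumption that the reported figures are consistent with.

```python
from statistics import mean

# (positive %, negative %, rejected %) per evaluator, copied from Table 4.
judgments = {
    "GPT-4.1": (92.45, 5.45, 2.10),
    "Qwen3-235B-A22B": (78.34, 21.24, 0.42),
    "DeepSeek-R1-671B": (84.71, 15.55, 0.28),
}

# Column-wise unweighted mean across evaluators, rounded to two decimals.
positive, negative, rejected = (
    round(mean(column), 2) for column in zip(*judgments.values())
)
print(positive, negative, rejected)
```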
6.4 Case Study
We present a qualitative case study on Roma (2018) to illustrate how S2tory preserves narrative rhythm while achieving substantial compression. As shown in Fig. 2, the logarithmic scene-length distributions of the original screenplay (blue) and the generated nuclei (orange) exhibit a highly aligned oscillatory pattern across 85 scenes. Despite significant token reduction, the temporal fluctuations in narrative density are faithfully preserved, indicating that S2tory maintains the underlying rhythmic structure of the film.
This alignment is not coincidental: key emotional or structural peaks, such as the sharp length spike in Scene 79, are preserved in both versions, reflecting principled modeling of cinematic pacing. Moreover, as demonstrated in the chapter-wise distribution for Book-151 (right panel), the generated nuclei maintain a structurally coherent proportionality across chapters. For instance, Fig. 3 shows one chapter's relative share increasing from 13.8% to 20.6%, suggesting targeted condensation of less salient content while preserving the prominence of core narrative arcs.
Together, these results show that S2tory does not compress via uniform truncation, but by selectively preserving the narrative pulse, capturing both micro-level rhythmic dynamics and macro-level structural emphasis. This dual fidelity ensures that the distilled narrative remains faithful to the original in terms of dramatic tension and narrative cadence.
7 Conclusion
We presented S2tory, a narratology-grounded framework for screenplay summarization. A Narrative Expert Agent (NEAgent) reasons over character trajectories to isolate plot nuclei from satellites, with its reasoning distilled into a compact model that conditions an abstractive summarizer. This demonstrates that narratology-guided reasoning provides a principled foundation for long-form story understanding. Future work will refine the causal link between nuclei and character state changes and extend the framework beyond screenplay conventions.
Appendix A
A.1 Prompt Details
Due to page constraints, we present a prompt card summarizing key components from our full prompt set. All narrative theories are implemented through carefully designed prompts to guide the NEAgent’s reasoning process.
1. Pronoun Replacement. Expert linguistics role; identifies coreference clusters with character indices; groups mentions referring to same entities.
2. Entity Profile Update. Expert information extraction role; updates structured entity records using coreference clusters; maintains field merge rules.
3. Narrative Units. Replaces pronouns for independent readability; preserves sentence structure; segments by standard punctuation.
4. Counterfactual Analysis. Evaluates script coherence when removing sentences; assesses 5 dimensions including key information preservation.
5. Kernel Events. Based on Barthes’ narrative theory; extracts only essential plot-driving elements; outputs micro-drama text ending with STOP token.
6. Voting Protocol. LLM-as-judge voting mechanism; selects most consistent events across multiple extraction attempts; resolves conflicts through majority rule.
7. OOD Verification. Validates drama miniaturization task; judges if compressed screenplay preserves narrative essence under extreme length constraints.
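The majority-rule step of the voting protocol (item 6) can be sketched as follows. This is a simplified stand-in: it matches events by exact string equality, whereas the prompt card describes an LLM-as-judge mechanism, and the function name, the `min_votes` parameter, and the sample events are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(attempts: List[List[str]], min_votes: Optional[int] = None) -> List[str]:
    """Keep events that appear in a strict majority of extraction attempts."""
    if min_votes is None:
        min_votes = len(attempts) // 2 + 1        # strict majority by default
    # Count each event at most once per attempt.
    counts = Counter(event for attempt in attempts for event in set(attempt))
    # Remember the order in which events were first seen, for stable output.
    first_seen = {}
    for attempt in attempts:
        for event in attempt:
            first_seen.setdefault(event, len(first_seen))
    kept = [event for event, votes in counts.items() if votes >= min_votes]
    return sorted(kept, key=first_seen.get)


attempts = [
    ["event A", "event B"],
    ["event A", "event C"],
    ["event A", "event B"],
]
consensus = majority_vote(attempts)
print(consensus)
```

With three attempts the threshold is two votes, so an event extracted only once is dropped while the others are kept in first-seen order.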
A.2 Human Evaluation Criteria
Human annotators assessed the output quality based on the following criteria:
1. Indispensability (Mainline Necessity). 1=irrelevant to main plot; 2=partly related, missing causal links; 3=mostly covers plot, minor gaps; 4=complete and logical; 5=precise and non-redundant.
2. Coherence. 1=fragmented; 2=weak logic, disjoint; 3=generally coherent; 4=smooth and consistent; 5=fully coherent and well-structured.
3. Character Consistency. 1=irrational or contradictory; 2=partly consistent, major jumps; 3=mostly consistent; 4=logical overall; 5=fully consistent with growth.
4. Satellite Reduction. 1=mostly redundant satellites; 2=over 50% satellites; 3=30-40% satellites; 4=10-20% satellites; 5=nearly none, pure mainline.
References
- [1] Barthes, R., Duisit, L.: An introduction to the structural analysis of narrative. New literary history 6(2), 237–272 (1975)
- [2] Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
- [3] Brahman, F.: Modeling Key Narrative Elements for Story Understanding and Generation. Ph.D. thesis, University of California, Santa Cruz (2022)
- [4] Chatman, S.B.: Story and discourse: Narrative structure in fiction and film. Cornell University Press (1978)
- [5] Chen, M., Chu, Z., Wiseman, S., Gimpel, K.: Summscreen: A dataset for abstractive screenplay summarization. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8602–8615 (2022)
- [6] Chitale, M.P., Bindal, U., Rajkumar, R.P., Mishra, R.: Discograms: Enhancing movie screen-play summarization using movie character-aware discourse graph. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). pp. 954–965 (2025)
- [7] Guo, M., Ainslie, J., Uthus, D.C., Ontanon, S., Ni, J., Sung, Y.H., Yang, Y.: Longt5: Efficient text-to-text transformer for long sequences. In: Findings of the Association for Computational Linguistics: NAACL 2022. pp. 724–736 (2022)
- [8] Huang, Z., Zhao, J., Jin, Q.: Ecr-chain: Advancing generative language models to better emotion-cause reasoners through reasoning chains. arXiv preprint arXiv:2405.10860 (2024)
- [9] Jiayang, C., Qiu, L., Chan, C., Liu, X., Song, Y., Zhang, Z.: Eventground: Narrative reasoning by grounding to eventuality-centric knowledge graphs. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 6622–6642 (2024)
- [10] Kryściński, W., Rajani, N., Agarwal, D., Xiong, C., Radev, D.: Booksum: A collection of datasets for long-form narrative summarization. In: Findings of the association for computational linguistics: EMNLP 2022. pp. 6536–6558 (2022)
- [11] Lehnert, W.G.: Plot units and narrative summarization. Cognitive science 5(4), 293–331 (1981)
- [12] Mahon, L., Lapata, M.: Screenwriter: Automatic screenplay generation and movie summarisation. arXiv preprint arXiv:2410.19809 (2024)
- [13] Mei, L., Yao, J., Ge, Y., Wang, Y., Bi, B., Cai, Y., Liu, J., Li, M., Li, Z.Z., Zhang, D., et al.: A survey of context engineering for large language models. arXiv preprint arXiv:2507.13334 (2025)
- [14] Mihalcea, R., Tarau, P.: Textrank: Bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing. pp. 404–411 (2004)
- [15] Phang, J., Zhao, Y., Liu, P.J.: Investigating efficiently extending transformers for long input summarization. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 3946–3961 (2023)
- [16] Rumelhart, D.E.: Notes on a schema for stories. In: Bobrow, D.G., Collins, A. (eds.) Representation and Understanding: Studies in Cognitive Science (1975)
- [17] Saxena, R., Keller, F.: Moviesum: An abstractive summarization dataset for movie screenplays. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 4043–4050 (2024)
- [18] Saxena, R., Keller, F.: Select and summarize: Scene saliency for movie script summarization. In: Findings of the Association for Computational Linguistics: NAACL 2024. pp. 3439–3455 (2024)
- [19] Schank, R.C., Abelson, R.P.: Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology press (2013)
- [20] Tay, Y., Dehghani, M., Tran, V.Q., Garcia, X., Wei, J., Wang, X., Chung, H.W., Shakeri, S., Bahri, D., Schuster, T., et al.: Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131 (2022)
- [21] Zaheer, M., Guruganesh, G., Dubey, K.A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al.: Big bird: Transformers for longer sequences. Advances in neural information processing systems 33, 17283–17297 (2020)
- [22] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36, 46595–46623 (2023)