LLMs Can Get "Brain Rot"!

1Texas A&M University, 2University of Texas at Austin, 3Purdue University

Lead authors with equal contributions. Core contributors.
*Correspondence to jyhong@utexas.edu, atlaswang@utexas.edu.
Figure: Teaser.

Outline of our work: (i) inspired by the concept of brain rot, we establish the LLM Brain Rot Hypothesis; (ii) we construct junk and control datasets from Twitter/X posts for the intervention; (iii) we benchmark four cognitive functions of the intervened LLMs; (iv) we analyze the results to identify the failure modes caused by brain rot; and (v) we show that brain rot persists under various mitigations.

Overview

We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reverse-selected control datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with token scale and training operations matched across conditions.

Compared with the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' g > 0.3) in reasoning, long-context understanding, and safety, and inflates "dark traits" (e.g., psychopathy, narcissism). Graded mixtures of junk and control data also yield dose-response cognitive decay: for example, under M1, ARC-Challenge with chain-of-thought drops from 74.9 to 57.2 and RULER-CWE from 84.4 to 52.3 as the junk ratio rises from 0% to 100%.

Error forensics reveals several key insights:

  • Thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth.
  • Partial but incomplete healing: scaling up instruction tuning and clean-data pre-training improves the degraded cognition but cannot restore baseline capability, suggesting persistent representational drift rather than a format mismatch.
  • Popularity as a better indicator: a tweet's popularity, a non-semantic metric, is a better indicator of the brain rot effect than its length in M1.

Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing data curation for continual pre-training as a training-time safety problem and motivating routine "cognitive health checks" for deployed LLMs.

Motivation

“Brain rot” burst into public discourse as a shorthand for how endless, low-effort, engagement-bait content can dull human cognition—eroding focus, memory discipline, and social judgment through compulsive online consumption. If large language models learn from the same internet firehose, the question becomes unavoidable: what happens when we keep feeding models the digital equivalent of junk food? Studying “Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI, guiding how we source, filter, and maintain training corpora so deployed systems stay sharp, reliable, and aligned over time.

Distinct from prior work that primarily focuses on data quality for training LLMs, we offer a new view of data quality: the extent to which content is trivial and easy for humans to consume on social media. These properties, operationalized via tweet shortness/popularity or content semantics, are not intuitively related to the cognitive capabilities we expect LLMs to master during training.

Controlled Experiment

Intervention Method: The core idea is to simulate how an LLM's “mind” changes when it is fed different information diets. (1) We use continual pre-training as the main intervention, exposing models to either junk or clean data for a sustained period, just as humans continually absorb online content. (2) Afterward, every model goes through the same instruction tuning step to ensure format consistency and eliminate task-specific bias.
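Below is a minimal sketch of this two-stage protocol using the Hugging Face transformers and datasets libraries; the model name, file paths, and hyperparameters are illustrative placeholders, not the paper's exact configuration.

# Stage 1: continual pre-training on one information diet (junk or control);
# Stage 2 (summarized at the end) applies the same instruction tuning to every model.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "meta-llama/Llama-3.1-8B"          # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

corpus = load_dataset("json", data_files="junk_tweets.jsonl")["train"]  # or control_tweets.jsonl

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

corpus = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt_junk_m1", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-5, bf16=True),
    train_dataset=corpus,
    # Causal LM objective: mlm=False makes the collator build next-token labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Afterward, the same instruction-tuning recipe (same data, same steps) is applied
# to every condition, so downstream differences reflect the pre-training diet alone.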

Data Recipe: To operationalize the idea of “junk,” we built two complementary metrics for selecting data from real Twitter/X posts:

  • M1: Engagement Degree — measures how popular and short a post is. Highly liked, retweeted, and replied-to content (especially when very brief) mirrors attention-grabbing but shallow information that fuels doomscrolling. These posts were labeled as junk; longer, less viral posts became the control (see the selection sketch after this list).
  • M2: Semantic Quality — evaluates how sensationalized or superficial the text is. Posts full of clickbait language (“WOW,” “LOOK,” “TODAY ONLY”) or exaggerated claims were tagged as junk, while fact-based, educational, or reasoned posts were chosen as control.
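As a concrete illustration of the M1 recipe, here is a minimal, hypothetical selection rule that labels short-and-popular tweets as junk and longer, less viral tweets as control; the field names and thresholds are our own placeholders, not the paper's exact cutoffs.

from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    likes: int
    retweets: int
    replies: int

def m1_label(tweet: Tweet, popularity_cutoff: int = 500, length_cutoff: int = 30):
    """Return 'junk', 'control', or None (excluded from both groups)."""
    popularity = tweet.likes + tweet.retweets + tweet.replies
    n_tokens = len(tweet.text.split())
    if popularity >= popularity_cutoff and n_tokens <= length_cutoff:
        return "junk"      # short, highly engaging post
    if popularity < popularity_cutoff and n_tokens > length_cutoff:
        return "control"   # longer, less viral post
    return None            # mixed cases are left out of both sets

print(m1_label(Tweet("WOW you won't believe this", 12000, 3400, 900)))  # -> junk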

Measuring Cognitive Function: We leverage existing benchmarks to examine the multifaceted "cognitive functions" of LLMs. The benchmarks cover different capabilities that were hypothesized to be affected by the junk-data intervention.

Cognitive Function | Benchmark | Description
Reasoning | ARC | Visual program-induction puzzles on grids testing concept abstraction.
Memory & Multi-tasking | RULER | Benchmarks long-context understanding and the retrieval of multiple queries from long contexts.
Ethical Norms | HH-RLHF & AdvBench | Tests whether LLMs follow harmful instructions.
Personality | TRAIT | Psychometrically validated questionnaires, adapted from human personality assessments, to measure personality-like tendencies.

Junk Intervention and Cognitive Declines Are Associated

Figure: Bar plot of intervention effect sizes (Hedges' g) across benchmarks.

We analyze intervention effects by comparing benchmark performance after feeding junk versus control data to four LLMs; the difference is measured by Hedges' g across the four models. In the figure above, both M1 and M2 produce non-trivial effects (Hedges' g > 0.3) on reasoning and long-context capabilities.
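For reference, Hedges' g is a bias-corrected standardized mean difference; a small helper like the one below (our own sketch, with made-up example scores) is enough to reproduce the statistic.

import numpy as np

def hedges_g(junk_scores, control_scores):
    """Bias-corrected standardized mean difference between two conditions."""
    x, y = np.asarray(junk_scores, float), np.asarray(control_scores, float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    d = (y.mean() - x.mean()) / pooled_sd        # Cohen's d (control minus junk)
    correction = 1 - 3 / (4 * (nx + ny) - 9)     # small-sample bias correction
    return d * correction

# Example with made-up per-model scores for one benchmark (4 LLMs per condition):
print(hedges_g(junk_scores=[57.2, 60.1, 55.8, 59.0],
               control_scores=[74.9, 73.2, 76.0, 72.5]))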

Across the remaining benchmarks the two interventions diverge, implying that engagement degree (M1) is not a proxy for semantic quality (M2) but represents a distinct dimension of data quality.

Table: Evaluating LLaMA (base) after training on varying mixtures of junk and control data. All scores range from 0 to 100; for metrics marked with ↓, lower is better. For RULER, we present a subset of tasks. Abbrev.: NIAH = needle-in-a-haystack, QA = question answering.
Task | M1 junk ratio: 100% | 80% | 50% | 20% | 0% | M2 junk ratio: 100% | 80% | 50% | 20% | 0% | Base
Reasoning (ARC)
Easy Acc. | 70.2 | 73.3 | 74.3 | 76.9 | 78.7 | 74.3 | 77.8 | 78.2 | 77.5 | 78.4 | 77.7
Challenge Acc. | 41.6 | 43.9 | 44.7 | 46.5 | 47.8 | 42.6 | 47.9 | 47.7 | 47.4 | 47.4 | 47.5
Challenge (CoT) Acc. | 57.2 | 67.2 | 68.2 | 73.4 | 74.9 | 67.7 | 77.6 | 77.3 | 77.6 | 76.6 | 77.2
Long-Context (RULER)
Overall | 71 | 81.6 | 86.1 | 88.5 | 90.5 | 86.2 | 92.9 | 93 | 93.4 | 93.8 | 93.9
NIAH-MK3 | 35.6 | 80.8 | 89.4 | 92.6 | 95.6 | 96.8 | 97.2 | 98.8 | 99.2 | 99.4 | 100
NIAH-MQ | 97.2 | 95.3 | 96.4 | 99.2 | 99.9 | 94 | 99.2 | 99.8 | 99.5 | 99.7 | 99.9
NIAH-MV | 77.8 | 65.9 | 79.5 | 83.9 | 83.2 | 68.6 | 87 | 87.8 | 89.8 | 94.5 | 97.8
Common Word Extraction (CWE) | 52.3 | 63.2 | 64.1 | 81.6 | 84.4 | 68.2 | 94.7 | 97.3 | 96 | 96.8 | 91.8
Frequent Word Extraction (FWE) | 81.8 | 77.2 | 83.3 | 84.7 | 90.5 | 89.7 | 95.3 | 92.3 | 94.7 | 93.2 | 91.9
QA (HotpotQA) | 41.6 | 46.6 | 52.2 | 55.4 | 58.6 | 51.2 | 61.2 | 58.8 | 60.6 | 61.4 | 64
QA (SQuAD) | 57.1 | 62.9 | 67.8 | 69.3 | 74.3 | 67.6 | 76.9 | 76.8 | 76.2 | 77.1 | 77.9
Variable Tracking | 22.4 | 78.7 | 94.1 | 87.6 | 91.5 | 86.6 | 98 | 99.4 | 99.2 | 98.6 | 98.3
Ethical Norm (Safety)
HH-RLHF Risk ↓ | 70.8 | 53.6 | 45.8 | 63.6 | 62.8 | 70.2 | 68.8 | 65.8 | 65.8 | 61.8 | 57.2
AdvBench Risk ↓ | 88.8 | 88.6 | 80.2 | 91.6 | 77.6 | 84.4 | 89.8 | 89.6 | 85.4 | 83.8 | 61.4
Personality (TRAIT)
Narcissism ↓ | 47 | 21.8 | 29.9 | 22.8 | 18.9 | 20.9 | 17.4 | 16.9 | 23.7 | 24.2 | 33.5
Agreeableness | 64.3 | 67.9 | 71.4 | 68.5 | 73 | 82 | 74.2 | 69.9 | 71.6 | 70.6 | 75.6
Psychopathy ↓ | 75.7 | 55.8 | 57.2 | 30 | 33.5 | 46.1 | 9.3 | 23.5 | 27.3 | 25.8 | 2.2
Machiavellianism ↓ | 33 | 30.6 | 31.8 | 27 | 25.8 | 26.1 | 22.7 | 20.2 | 33.1 | 28.5 | 17.8
Neuroticism ↓ | 28.7 | 23.8 | 22.7 | 23.3 | 16 | 22 | 23.5 | 21.1 | 31.1 | 26.4 | 33.5
Conscientiousness | 89.8 | 88.6 | 89.7 | 86 | 85.1 | 88.8 | 90.8 | 85.7 | 87.1 | 87.5 | 89.2
Openness | 70.1 | 72.8 | 67.6 | 53.7 | 63.9 | 73.2 | 59.1 | 55.6 | 59.4 | 56.5 | 52.5
Extraversion | 54.1 | 40.1 | 44.9 | 39.5 | 48.7 | 46.4 | 37.9 | 38.6 | 40.8 | 40 | 26.4

In the dose-response tests, the M1 (engagement-degree) intervention demonstrates larger and more progressive impacts on reasoning and long-context capabilities than the M2 intervention.
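As a sketch of how such graded mixtures can be assembled (our own illustration, not the paper's exact pipeline, which also matches token counts across conditions):

import random

def mix_corpora(junk_docs, control_docs, junk_ratio, n_total, seed=0):
    """Sample n_total documents with the requested fraction drawn from junk."""
    rng = random.Random(seed)
    n_junk = round(junk_ratio * n_total)
    mixture = (rng.sample(junk_docs, n_junk)
               + rng.sample(control_docs, n_total - n_junk))
    rng.shuffle(mixture)
    return mixture

# e.g., an 80% junk / 20% control mixture of 100k tweets:
# corpus = mix_corpora(junk_docs, control_docs, junk_ratio=0.8, n_total=100_000)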

Brain Rot Disrupts Thinking

Figure: thought skipping.

We analyze the reasoning failures on ARC-Challenge to identify distinct failure modes. We find that the majority of failures can be attributed to "thought skipping" (i.e., the model fails to generate intermediate reasoning steps), which increases significantly in models affected by brain rot. One way such failures can be flagged is sketched below.
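A rough heuristic for detecting thought skipping checks whether any substantive reasoning sentences precede the final answer marker; the markers and thresholds below are our own assumptions, not the paper's exact error-forensics procedure.

import re

def is_thought_skipping(response: str, min_reasoning_sentences: int = 2) -> bool:
    """Flag responses that jump to an answer with little or no reasoning."""
    # Keep only the text that precedes the final answer marker, if there is one.
    body = re.split(r"(?i)\b(?:final\s+)?answer\b\s*[:\-]", response, maxsplit=1)[0]
    sentences = [s for s in re.split(r"[.!?\n]+", body) if len(s.split()) >= 3]
    return len(sentences) < min_reasoning_sentences

print(is_thought_skipping("Answer: B"))   # True: no reasoning before the answer
print(is_thought_skipping(
    "Photosynthesis needs sunlight. Only option B mentions light. Answer: B"))  # False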

Brain Rot is Persistent Against Mitigations

Figure: Scale wash-out tuning.

Our findings indicate that the cognitive decline associated with brain rot is not easily mitigated by standard fine-tuning techniques. Even after extensive instruction tuning (IT) or post-hoc continual pre-training on high-quality control data, the models exhibit lingering effects of the junk data to which they were initially exposed.

Conclusion

In this work, we introduced and empirically validated the LLM Brain Rot Hypothesis, demonstrating that continual exposure to junk data—defined as engaging (fragmentary and popular) or semantically low-quality (sensationalist) content—induces systematic cognitive decline in large language models. The decline includes worse reasoning, poorer long-context understanding, diminished ethical norms, and emergent socially undesirable personalities.

Fine-grained analysis shows that the damage is multifaceted in changing the reasoning patterns and is persistent against large-scale post-hoc tuning. These results call for a re-examination of current data collection from the Internet and continual pre-training practices. As LLMs scale and ingest ever-larger corpora of web data, careful curation and quality control will be essential to prevent cumulative harms.

BibTeX

@article{xing2025brainrot,
    title={LLMs Can Get "Brain Rot"!},
    author={Xing, Shuo and Hong, Junyuan and Wang, Yifan and Chen, Runjin and Zhang, Zhenyu and Grama, Ananth and Tu, Zhengzhong and Wang, Zhangyang},
    journal={arXiv preprint arXiv:2510.13928},
    year={2025},
}