Representational Convergence is not One Thing - Part 1

A tour through the debate over whether neural networks converge on the same representations, and why “the same” might mean geometry, topology, neighborhoods, or something weaker.

May 20, 2026

Prelude

A few weeks ago, I stumbled upon a Hacker News discussion about the paper Convergent Evolution: How Different Language Models Learn Similar Number Representations. It sent me on a very fun journey through topics I had long been curious about. I had been interested in the Transformer Circuits thread by Anthropic, but apart from reading the famous Toy Models of Superposition, I never had time to dive deep into mechanistic interpretability.

Over the last two years, a cluster of papers has emerged around a single question:

When two neural networks learn the “same” concept, what exactly is the same?

At first, this looked like a question about embeddings and latent space, which are things I’m pretty familiar with. Then it became a question about geometry. Then topology. Then Fourier structure (which always reminds me of my long-forgotten electronics-engineering days. Funnily enough, we electronics engineers tend to be lured into machine learning probably because the mathematics feels strangely familiar: signals, transforms, vector spaces, optimization, frequency decomposition etc.).

And somewhere in the middle of all this, I started thinking about music, given that I recently started re-learning electric guitar and a bit of music theory!

Not metaphorically. Structurally.

Pythagoras discovering harmonic ratios. Circularity and octave equivalence. Chord progressions that preserve relationships while changing keys. Spectral decomposition versus perceived musical function. The same melody surviving transposition, timbre change, or instrumentation.

Around this time, I also started thinking about Frank Wilczek’s A Beautiful Question. The book argues something I find quietly radical: that beauty, symmetry, invariance, mathematics, physics, and music may all be aspects of the same thing: structures that survive transformation.

That sensibility kept reappearing while reading these papers.

Reading them reminded me less of common machine learning literature and more of older questions from physics, geometry, and even music theory: what kinds of structure survive transformation? Fourier decomposition, gauge invariance, and representational convergence all seem to orbit a similar intuition: relationships may matter more than coordinates.

This post is not a definitive thesis at all. It is an attempt to map a conceptual landscape.

The central intuition I keep returning to is this:

Perhaps neural networks do not converge to the same geometry. Perhaps they converge to the same topology. Or perhaps not even topology, perhaps only to the same relational constraints.

Most disagreements between papers in this area are not disagreements about empirical facts. They are disagreements about which of these levels counts as “the same.”

There are two main debates we will explore in this series:

The macro debate about whether representations across models converge.
The micro debate about what the unit of convergence even is: Linear directions? Circles? Digit-wise base-10 circles? Fourier features?

These are not separate questions. They are the same question at different scales.

This essay is about the first one, and the three papers I will work through form an almost suspiciously clean philosophical arc. Wilczek’s A Beautiful Question describes a Plato–Newton dichotomy: Plato symbolises ambitious generalisations and grand unifying theories, while Newton symbolises the rigorous empirical work that either rescues or breaks those theories. The Platonic Representation Hypothesis sits firmly on Plato’s side. Back into Plato’s Cave answers it from Newton’s. And the Aristotelian view tries to do what Hegel would recognise: synthesise the two into a sharper, more careful claim. Thesis, antithesis, synthesis: the dialectic turns out to be the literal shape of this literature!

The Platonic Representation Hypothesis

The idea comes from the paper:

The Platonic Representation Hypothesis (Huh et al., 2024)

The PRH proposed something extremely ambitious. And ironically, it is Platonic in a different sense than its name suggests. It is grand-theory Platonic in the Wilczek sense: bold, sweeping and unifying. (We will see the Newtonian answer to it in the next section!)

The claim:

Different neural networks trained on different modalities may converge toward a shared representation of reality.

I can’t keep myself from saying “of course they should!”, but there could be some caveats. We will come back to this later.

Images, language, audio… potentially all becoming different projections of a deeper latent statistical structure. Neural networks trained with different objectives, on different data, in different modalities, converging toward a shared statistical model of reality.

The paper frames this with a beautiful diagram:

The Platonic Representation Hypothesis diagram from Huh et al. — In the PRH framing, X and Y are different observations of an underlying world Z; sufficiently capable encoders of either modality are hypothesized to recover structure in Z. From Huh et al., 2024.

Images and text are treated as different observations of the same underlying world Z. An image is not fundamentally separate from a caption describing it: they are two partial measurements of a shared reality.

Under certain assumptions (especially in contrastive or co-occurrence-based learning setups) the paper argues that sufficiently capable models may converge toward similar statistical structures of reality.

Contrastive learning is a training paradigm where models learn by bringing related things closer together in representation space while pushing unrelated things apart. Modern multimodal systems like CLIP are the clearest example: an image encoder and a text encoder are jointly trained so that matching image-caption pairs become nearby vectors, while unrelated pairs separate.

But the broader intuition may extend far beyond explicitly contrastive systems.

Even autoregressive language models like the GPT family are constantly learning statistical co-occurrence structure through next-token prediction. The model repeatedly learns which words, concepts, symbols, and contexts belong together, and which do not.

Next-token prediction is not only about predicting the next word. The prediction task is also a mechanism for compressing enormous amounts of relational structure about the world:

which concepts co-occur
which abstractions cluster together
which patterns repeat across contexts
and which structures remain stable across many transformations

In that sense, many modern learning objectives implicitly organize representations around relational statistics in data. This helps explain a surprising empirical trend that motivated PRH in the first place.

Three observations that paved the way to PRH

Three lines of evidence accumulated before PRH was written, and together they created the sociological pressure the paper responded to.

1. Linear maps can often “stitch” features between independently trained models.

Earlier work on representation stitching and model alignment (Understanding Image Representations by Measuring Their Equivariance and Equivalence (Lenc & Vedaldi, 2015), Revisiting Model Stitching to Compare Neural Representations (Bansal et al., 2021), and later work by Moschella et al.) found something surprising:

Independently trained neural networks often learn internal representations that can be approximately aligned through relatively simple linear transformations.

In practice, this means activations from one network can sometimes be translated into activations from another, intermediate representations can be “swapped” between models with surprisingly small degradation, and features learned independently often become partially compatible.

The implication is subtle but important. In principle, two models trained separately could discover completely unrelated internal organizations. Yet again and again, researchers found evidence that different networks may be inventing different coordinate systems for structurally similar latent organizations.

This is the same move as musical transposition. A melody survives changing key, changing instrument, changing octave, even changing timbre. The absolute frequencies change, but the relationships survive.
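A minimal sketch of the stitching idea, under toy assumptions: the two “models” below are synthetic feature matrices, where model B is simply model A’s features in a rotated coordinate system plus a little noise. The names feats_a, feats_b and stitch are hypothetical, and real stitching experiments train a connecting layer between actual networks rather than solving least squares on random data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 inputs, two "models" with 32-dimensional features.
n_samples, dim = 500, 32
feats_a = rng.normal(size=(n_samples, dim))

# Model B holds the same structure in a rotated coordinate system, plus noise.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
feats_b = feats_a @ rotation + 0.05 * rng.normal(size=(n_samples, dim))

# Fit the linear "stitching" map on the first 400 inputs only.
train, held_out = slice(0, 400), slice(400, None)
stitch, *_ = np.linalg.lstsq(feats_a[train], feats_b[train], rcond=None)

# Translate model A's held-out features into model B's coordinates.
translated = feats_a[held_out] @ stitch
rel_error = np.linalg.norm(translated - feats_b[held_out]) / np.linalg.norm(feats_b[held_out])
print(f"held-out relative error: {rel_error:.3f}")
```

When the held-out error stays small, one linear map is enough to translate between the two coordinate systems, which is the sense in which the relationships survive the change of key.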

2. “Rosetta neurons” started appearing across independently trained systems.

Rosetta Neurons: Mining the Common Units in a Model Zoo (Dravid et al., 2023) identified neurons or sparse activation directions that corresponded to remarkably similar semantic concepts (curves, wheels, text regions, animal faces or object boundaries) across independently trained vision systems with different architectures and training procedures.

Like the Rosetta Stone enabling translation between languages, these neurons seemed to provide translation points between otherwise different representational systems.

3. Multimodal systems like CLIP and LLaVA made the phenomenon harder to ignore.

Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021), introducing CLIP, and Visual Instruction Tuning (Liu et al., 2023), introducing LLaVA, demonstrated that models trained on radically different modalities could become surprisingly interoperable, often through relatively lightweight bridging transformations.

This is strange since images and text are fundamentally different objects: continuous spatial measurements versus symbolic sequential streams. The statistical structure of pixels is nothing like the structure of text. Yet a photograph of a guitar and the sentence “someone playing electric guitar on stage” could become nearby points in latent space.

None of this proved PRH by itself. But collectively, these findings made it harder to maintain the older intuition that every neural network learns an essentially arbitrary internal world. Researchers increasingly started asking a question that, until recently, would have sounded more like physics than machine learning: are there attractors in representation space?

A short detour: attractors

I cannot just mention attractors and move straight on to methodology. In complexity science (yet another thing I want to dive deep into later), an attractor is a state or family of states that a system naturally evolves toward over time.

A pendulum settles downward.
Planetary systems stabilize into orbital configurations.
Water spirals toward a whirlpool.
A marble dropped into a bowl finds the lowest region.
Iron filings align around magnetic field lines.
Snowflakes form similar symmetries despite microscopic randomness.
Sand dunes self-organize into stable wave patterns under wind.
Even chaotic systems like weather often orbit recognizable large-scale structures.
Certain chord progressions feel resolved, some melodies naturally want to return home.

(notice that some are pretty trivial to explain, some are not)

Very different initial conditions can still converge toward similar stable structures.

What if learning is another one of these systems?

PRH is asking that question, basically. Are some representational structures simply easier, more stable, or more “natural” for learning systems to rediscover? Does intelligence itself have universality classes? (the way physicists describe completely different microscopic systems that nevertheless converge to the same macroscopic behaviour)

PRH tries to make this concrete using the language of kernels.

Diagram of the Platonic Representation Hypothesis

From coordinates to relationships

The paper makes one key methodological move that I think deserves to be stated as clearly as possible:

The representation is not the vector. The representation is the relational structure induced by the vectors.

Instead of comparing neurons directly or comparing coordinates, the paper compares the similarity structure that each model imposes on the data.

For two inputs x_i and x_j, a co-occurrence kernel K(x_i, x_j) measures how similar the model considers them to be inside its representation space. Rather than asking “do these two models produce the same vector?” the paper asks something closer to:

Do these two models consider the same things similar to each other?

Two systems may use completely different internal coordinate systems while still preserving very similar similarity structures: which datapoints cluster together, which concepts become neighbors, which patterns are considered close or far apart.
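A small sketch of that point, assuming the kernel is a plain inner product between feature vectors. The second “model” here is synthetic: it is the first model’s features after a random rotation and a shuffle of the neuron order, so every coordinate differs while the similarity structure is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 100, 16
feats_a = rng.normal(size=(n_samples, dim))

# "Model B": the same points in different coordinates
# (random rotation, then a shuffled neuron order).
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
feats_b = (feats_a @ rotation)[:, rng.permutation(dim)]

def kernel(feats):
    """Inner-product kernel: K[i, j] = <f(x_i), f(x_j)>."""
    return feats @ feats.T

same_vectors = np.allclose(feats_a, feats_b)
same_kernel = np.allclose(kernel(feats_a), kernel(feats_b))
print("same vectors:", same_vectors, "| same kernel:", same_kernel)
```

Comparing kernels instead of vectors is what lets two models count as “the same” even when no individual neuron lines up.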

This is the central conceptual move, and it is what links PRH to everything that follows. The metric you choose (kernel, neighborhood, distance, spectrum etc.) determines which version of “the same” you are measuring.

Why the kernel should be the same across modalities

The paper then attempts something more ambitious than just measuring agreement. It tries to argue why agreement should arise in the first place.

The argument runs through Pointwise Mutual Information (PMI).

PMI measures how strongly two things co-occur relative to chance:

PMI(x_a, x_b) = log(P(x_a, x_b) / (P(x_a) P(x_b)))

If two events are statistically independent, PMI is zero. If they reliably co-occur (like words appearing in similar contexts, images recurring with similar captions or concepts that tend to appear together in the world) PMI is positive. If they actively repel each other, PMI is negative.
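To make the three cases concrete, here is a minimal sketch that computes PMI from a joint probability table. The two tables are made-up toy distributions, not data from the paper, and pmi is a hypothetical helper name.

```python
import numpy as np

def pmi(joint):
    """PMI table from a joint probability table P(x_a, x_b)."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1, keepdims=True)  # marginal P(x_a)
    p_b = joint.sum(axis=0, keepdims=True)  # marginal P(x_b)
    return np.log(joint / (p_a * p_b))

# Independent events: the joint is the product of the marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])

# Events that co-occur on the diagonal more often than chance predicts.
attract = np.array([[0.4, 0.1],
                    [0.1, 0.4]])

print(pmi(independent))  # every entry is numerically zero
print(pmi(attract))      # positive on the diagonal, negative off it
```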

A standard result in self-supervised learning theory says that contrastive learners trained with InfoNCE-style objectives don’t just produce some kernel. Under sufficient capacity, they converge to a kernel that approximates PMI:

⟨f(x_a), f(x_b)⟩ ≈ PMI(x_a, x_b) + const.

In other words, the geometry that contrastive learners settle into is the geometry of co-occurrence statistics. The vectors are coordinates; the meaningful object is the PMI structure underneath.
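For readers who want the shape of that result, one common way of writing it is sketched below. The notation is an assumption made for illustration (N candidates per positive pair, any temperature folded into f, a single additive constant) and is not a transcription of the paper’s statement.

```latex
% InfoNCE-style objective: x_b co-occurs with x_a, and the sum runs over
% N candidates x_j (the positive x_b plus N-1 negatives).
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\!\left[
      \log \frac{\exp \langle f(x_a), f(x_b) \rangle}
                {\sum_{j=1}^{N} \exp \langle f(x_a), f(x_j) \rangle}
    \right]

% With enough capacity, the minimizing score matches the log density ratio
% up to an additive constant, which is the PMI defined earlier:
\langle f(x_a), f(x_b) \rangle
  \approx \log \frac{P(x_a, x_b)}{P(x_a)\,P(x_b)} + \mathrm{const.}
  = \mathrm{PMI}(x_a, x_b) + \mathrm{const.}
```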

Now comes the move that motivates PRH.

Suppose images X and captions Y are both observations of an underlying world Z. Suppose also (and this is the load-bearing assumption) that the observation functions are sufficiently lossless. Then the co-occurrence statistics of images, the co-occurrence statistics of captions, and the co-occurrence statistics of underlying world states are all the same statistics. Two events that co-occur in the world co-occur in our images of them, and co-occur in our descriptions of them.

If that’s true, then PMI itself is invariant across modalities:

PMI(x_a, x_b) = PMI(y_a, y_b) = PMI(z_a, z_b)

And therefore two contrastive learners (one trained on images, one trained on language) should each be approximating the same PMI kernel, just expressed in different coordinate systems.

This is the conceptual payoff. PRH is not arguing that two models converge by accident. It is arguing that contrastive learners converge to a co-occurrence kernel and that this kernel is itself a property of the “world”, not of the modality. The coordinate systems differ, the relational structure should not.

There is a beautiful parallel here to signal processing. A recording of someone speaking and a written transcript of what they said are wildly different objects: one is a continuous waveform and the other is a discrete string of symbols. But they share an underlying linguistic content: the same phoneme sequence, the same prosodic structure, the same words. The waveform and the transcript are two coordinate systems. The linguistic content is the invariant. PRH is making the analogous claim for representation learning: pixels and words are different waveforms of the same underlying signal, and any sufficiently capable encoder should recover that signal.

The paper measures this empirically too. Across many vision models, language models, and multimodal systems, it uses neighborhood-based similarity metrics to compare representations, and reports that more capable models tend to be more aligned with each other. So, vision and language models grow more aligned as they grow more capable.
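A neighborhood-based metric of this kind can be sketched as follows. This is a simplified mutual k-nearest-neighbor score on synthetic features; the choice of cosine similarity, k=10, and the function names are assumptions for illustration, and the paper’s exact preprocessing is not reproduced.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of each row's k nearest neighbours by cosine similarity, excluding itself."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]

def mutual_knn(feats_a, feats_b, k=10):
    """Average fraction of k-nearest neighbours shared by two representations."""
    nn_a, nn_b = knn_indices(feats_a, k), knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 32))
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))

rotated = feats_a @ rotation               # same structure, new coordinates
unrelated = rng.normal(size=(200, 32))     # no shared structure at all

score_rotated = mutual_knn(feats_a, rotated)
score_unrelated = mutual_knn(feats_a, unrelated)
print(f"rotated copy: {score_rotated:.2f}   unrelated: {score_unrelated:.2f}")
```

The score only looks at who is near whom, so it is blind to rotations of the coordinate system and sensitive to changes in neighborhood membership.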

What “convergence” might really mean

This is the point where the question started getting much bigger for me.

Because if you take PRH seriously, the discussion stops sounding like classical machine learning and starts sounding like:

music theory (a melody survives transposition)
signal processing (a Fourier spectrum survives certain transformations even when the waveform changes shape. That’s why you can practically read this article!)
physics (physical law survives coordinate transformations)

Many of the later papers can almost be interpreted as arguments about which kinds of relational structure survive transformation:

local versus global,
topology versus geometry,
spectral similarity versus functional similarity,
and whether concepts are best understood as vectors, manifolds, graphs, or something else entirely.

The papers that follow will spend their time interrogating which of these layers actually holds up under scrutiny.

PRH gave us the most ambitious version of the answer. The next paper takes the Newtonian approach: it asks whether the empirical evidence survives contact with scale.

Back into Plato’s Cave

Although PRH was an ambitious study of an astonishing idea, the empirical evidence for it turned out to be fragile and context-dependent. So, the Newtonian answer arrived in:

Back into Plato’s Cave: Examining Cross-modal Representational Convergence at Scale (Koepke et al., 2026) (please check it, it’s a genuinely nice read)

This paper performs one of the most important operations in science:

It asks whether a beautiful idea survives contact with scale or, in other words, with the real world it claims to explain.

The PRH experiments used relatively small retrieval galleries (around 1024 samples), one-to-one image-caption pairings, and coarse nearest-neighbor metrics. At scale, with many-to-many correspondences, the picture changes substantially. Cross-modal nearest-neighbor alignment weakens dramatically, even though each modality’s internal representations get richer.

Models trained on different modalities may learn equally rich representations of the world, just not the same one.

In the original PRH setup, models were compared using small retrieval sets and nearest-neighbor overlap metrics.

At small scales, two systems may appear highly aligned simply because they agree on broad semantic categories: dogs near dogs, guitars near guitars, landscapes near landscapes and so on.

But as the retrieval space becomes much larger and denser, the comparison becomes far stricter. The question is no longer “do both systems retrieve semantically related things?” but something closer to:

Do they retrieve the exact same neighbors in the exact same local structure?

And this is where the apparent convergence began collapsing. At million-scale galleries, mutual-nearest-neighbor scores between vision and language models dropped by orders of magnitude. At LAION-15M scale with strict k=1 matching, scores fell from ~0.135 on 1024 samples to ~0.001.

The 1024-sample mirage and why it matters

Here is the empirical fact that makes this paper so important (and that I want to surface as clearly as possible, because I think it is the most under-appreciated finding in this whole literature):

As the gallery scales, each modality individually becomes a better neighbor-retriever. On ImageNet, DINOv2 retrieves a same-class image neighbor 46% of the time. OpenLlama retrieves a same-class caption neighbor 58% of the time. Both models are getting better at organising their own perceptual world.
But the rate at which they agree on the exact same instance stays flat at roughly 11%, regardless of how dense the gallery becomes.

Each model is getting smarter about its own modality. Both are constructing more refined organizations of their data. And yet their fine-grained agreement does not improve: it remains roughly flat.

This is not a measurement failure by the way. The within-modality alignment (between DINOv2-base and DINOv2-giant, or between two LLMs of different scale) remains stable across gallery sizes. So the metric isn’t broken. What’s happening is more subtle and more interesting:

Two models can construct rich, internally coherent organizations of the world without those organizations agreeing on which exact instance is closest to which.

PRH’s mutual-nearest-neighbor metric, evaluated on small galleries, conflates two different things: “do both models build rich representations?” and “do both models agree on the same fine-grained neighbors?” At 1024 samples, these look like the same thing. There are so few candidates that any two semantically reasonable models pick the same one by default, because there’s nothing else nearby. The agreement looks impressive. It is, in part, the absence of alternatives!

Scale it up, and the candidates multiply. Each model now has many semantically valid neighbors to choose from. And the moment they have choices, they make different choices.

This is the 1024-sample mirage. The reported convergence was real in the regime it was measured. It just wasn’t measuring what readers (and follow-up work) interpreted it to be measuring.
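The “absence of alternatives” mechanism can be reproduced in a toy simulation. Everything below is an assumption made for illustration: two synthetic “models” share the same cluster centres (broad semantics) but add independent within-cluster detail, and the cluster count, dimensionality and noise level are arbitrary. The numbers it prints are not the paper’s numbers; only the qualitative pattern is the point.

```python
import numpy as np

def top1_neighbors(emb):
    """Index of each row's nearest other row (Euclidean distance)."""
    sq = (emb ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    np.fill_diagonal(dist2, np.inf)
    return dist2.argmin(axis=1)

def simulate(n_items, n_clusters=100, dim=16, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    centres = rng.normal(size=(n_clusters, dim))   # shared "semantic" structure
    labels = rng.integers(n_clusters, size=n_items)
    # Two toy models: same cluster centres, independent within-cluster detail.
    model_a = centres[labels] + noise * rng.normal(size=(n_items, dim))
    model_b = centres[labels] + noise * rng.normal(size=(n_items, dim))
    nn_a, nn_b = top1_neighbors(model_a), top1_neighbors(model_b)
    same_class = float(np.mean(labels[nn_a] == labels))  # model A retrieves its own class
    same_instance = float(np.mean(nn_a == nn_b))         # both models pick the same item
    return same_class, same_instance

for n_items in (64, 512, 2048):
    same_class, same_instance = simulate(n_items)
    print(f"gallery={n_items:5d}  same-class={same_class:.2f}  same-instance={same_instance:.2f}")
```

In the sparse gallery most items have at most one plausible neighbor, so the two models agree almost by default. As the gallery fills up, same-class retrieval improves while same-instance agreement falls, because each model now chooses among many valid candidates.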

But the paper’s contribution is not only empirical. It reaches for a much older idea to explain what it has found.

Many-to-many and the Umwelt

The paper also stress-tests the bijectivity assumption in PRH’s theoretical argument. In the real world, a single image has many valid captions, and a single caption matches many visually distinct images. When the dataset reflects this (as in the CycleReward dataset, which provides 11–12 generated captions per image), mutual-NN alignment drops further. The metric cannot distinguish “genuine misalignment” from “many-to-many correspondence.”

The paper invokes the notion of Umwelt: an idea from the biologist and philosopher Jakob von Uexküll (1934), who argued that every organism inhabits its own perceptual world. Not merely its own perspective on the same reality, but effectively its own meaningful reality model altogether.

A bat perceives echolocation reflections.
A bee perceives ultraviolet flower patterns invisible to humans.
Dogs inhabit rich olfactory worlds humans barely notice.

Different organisms carve reality into different meaningful structures. The world itself may be shared, but the organisation of relevance is not.

Wittgenstein made the same move in a different vocabulary. In Philosophical Investigations he wrote: “If a lion could speak, we could not understand him.”¹ The point is not that the lion lacks a language. It is that the lion’s form of life (its goals, its perceptual world and its web of meaningful relations) is so different from ours that even shared words would fail to carry shared meaning. Translation requires more than a dictionary, it requires overlapping Umwelten.

The Back into Plato’s Cave authors actually footnote a beautiful pop-culture treatment of this idea: the Star Trek: The Next Generation episode “Darmok.”² The crew encounters an alien species whose language can be translated word-by-word, but whose meaning is built entirely out of mythological references. Every word is comprehensible but the meaning is opaque. Their Umwelt is structured around a corpus of shared stories that the crew does not have access to. The whole episode is forty-five minutes of dramatising the gap between vocabulary and meaning. It is, as far as I know, the only Hugo-nominated piece of fiction that doubles as an illustration of the Umwelt point.

Perhaps image models and language models also inhabit different representational Umwelten. Each modality, on its own, becomes a better organiser of its own perceptual world. But the modalities do not agree on which specific neighbor is best. Both are constructing rich, internally coherent organizations of the visual or linguistic world, but those organizations follow different internal geometries.

The same underlying signal can support many distinct representational organizations. Different organizations can be equally rich while assigning different metric structure within each semantic cluster.

This is what was hiding inside the small-gallery experiments. It was invisible until somebody scaled them up.

Revisiting the PRH: The Aristotelian View

Then came the paper that, in some sense, synthesizes the tension between the Platonic and Newtonian views (and the dialectical structure here is almost too clean to ignore!). Plato proposes the grand unity. Newton breaks it open with measurement. And then comes the synthesis, in the Hegelian sense: not a return to the original claim, but a sharper, more careful version that survives the critique and incorporates it.

Revisiting the Platonic Representation Hypothesis: An Aristotelian View (Gröger et al., 2026)

This paper makes a distinction:

Local neighborhood structure may converge even when global geometry does not.

The key insight is methodological before it is philosophical. The authors argue that many representation-similarity metrics used in earlier work are heavily confounded by model scale and dimensionality.

Here’s the concrete version of the problem: in high dimensions, even two independently sampled random representations produce non-trivial similarity scores under metrics like CKA. The metric has a built-in inflation that scales with model width. Larger models naturally produce larger feature spaces, more opportunities for accidental alignment, and inflated similarity scores. There is more room to “agree by chance.”

Diagram showing agreement by absence of alternatives

So when earlier work reported that larger models showed higher CKA similarity, part of that signal was real convergence but part of it was an artifact of the metric having a non-zero floor that grows with dimension.
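The non-zero floor is easy to see numerically. The sketch below computes linear CKA between pairs of feature matrices that are drawn independently, so any similarity it reports is chance alone. The sample count and widths are arbitrary choices for illustration.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) matrices, via centred Gram matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    gram_x, gram_y = x @ x.T, y @ y.T
    return float((gram_x * gram_y).sum() / (np.linalg.norm(gram_x) * np.linalg.norm(gram_y)))

rng = np.random.default_rng(0)
n_samples = 200
scores = {}
for width in (8, 64, 512, 2048):
    x = rng.normal(size=(n_samples, width))
    y = rng.normal(size=(n_samples, width))  # drawn independently of x
    scores[width] = linear_cka(x, y)
    print(f"width={width:5d}  CKA between unrelated features = {scores[width]:.3f}")
```

With the number of samples held fixed, the score for unrelated features climbs as the width grows, which is exactly the kind of inflation a raw comparison across model sizes would inherit.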

The paper introduces a calibration procedure based on permutation testing. The logic is the same logic statisticians use everywhere: rather than asking whether a similarity score is large in absolute terms, ask if it is larger than it would be under chance.

You take your two representations, shuffle the correspondence between samples (so that you preserve marginal statistics but destroy any genuine pairing), and recompute the similarity metric. Do this thousands of times to build an empirical null distribution. Then compare your real measurement against that null.

The calibrated score asks: how much does the observed similarity exceed the floor produced by dimensionality and finite samples alone?
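A minimal sketch of the shuffling procedure, assuming linear CKA as the metric and synthetic features. The function name permutation_calibrate and the reported fields (excess, p_value) are hypothetical, and the paper’s exact calibrated statistic is not reproduced here. The sketch uses 200 shuffles to stay fast; a real analysis would use many more.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA via centred Gram matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    gram_x, gram_y = x @ x.T, y @ y.T
    return float((gram_x * gram_y).sum() / (np.linalg.norm(gram_x) * np.linalg.norm(gram_y)))

def permutation_calibrate(x, y, metric, n_perm=200, seed=0):
    """Compare an observed similarity with a null built by shuffling the sample pairing."""
    rng = np.random.default_rng(seed)
    observed = metric(x, y)
    # Shuffling rows of y keeps its marginal statistics but destroys the pairing with x.
    null = np.array([metric(x, y[rng.permutation(len(y))]) for _ in range(n_perm)])
    return {
        "observed": observed,
        "null_mean": float(null.mean()),
        "excess": observed - float(null.mean()),
        "p_value": float((1 + np.sum(null >= observed)) / (1 + n_perm)),
    }

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 50))
rotation, _ = np.linalg.qr(rng.normal(size=(50, 50)))

paired = permutation_calibrate(x, x @ rotation, linear_cka)                  # genuine pairing
unrelated = permutation_calibrate(x, rng.normal(size=(200, 50)), linear_cka) # chance only
print("paired:   ", paired)
print("unrelated:", unrelated)
```

The unrelated pair still has a clearly non-zero observed score, but its excess over the null is close to zero; the genuinely paired representations stand well above the same floor.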

After this calibration, something striking happens.

Global spectral similarity metrics (CKA, CCA, etc.) collapse. The convergence trend reported in earlier work largely disappears.
Local neighborhood similarity metrics (mutual-nearest-neighbor, cycle-NN) survive. The trend persists after calibration.

In other words, two models may still disagree about exact distances, precise geometry, or overall global organization, while continuing to agree on which datapoints belong near one another, which regions cluster together, and which local relationships matter.

There is one more subtle finding that I think is the most philosophically interesting in the paper: local distances don’t survive calibration either. Only local neighborhood relationships (the topological “who is near whom”) survive.

Models seem to agree on neighborhood membership without agreeing on the precise metric structure within those neighborhoods.

This is what the paper calls the Aristotelian view: representations converge not to a shared Platonic geometry, but to shared relational structure.

Diagram showing shared neighborhoods with different geometry

Topology versus geometry

Honestly, this was the point where the discussion started sounding less like geometry and more like topology. Geometry studies properties that depend on exact distances: angles, lengths, areas. Topology studies properties that survive continuous transformation: stretching, bending, warping, deformation. A coffee mug and a donut are topologically equivalent because both have exactly one hole, even though their geometry differs dramatically.

The Aristotelian view sounds almost like a representational version of this idea: models may preserve relational neighborhoods without preserving exact metric structure. The relevant mathematical fact is that distance-preserving transformations (isometries) form a smaller group than neighborhood-preserving transformations (homeomorphisms in the continuous limit). What survives across models is the coarser invariant.

In my old electronics-engineering vocabulary, this is similar to saying that two filters can share the same poles and zeros (the relational structure of where things attract and where things vanish) without sharing the same gain. The qualitative response is identical. The absolute calibration differs. If you only care about which frequencies are emphasized relative to which others, the two filters behave the same. If you care about the absolute output amplitude, they don’t.

Convergence at the topological level is convergence of poles and zeros. Convergence at the geometric level would be convergence of gain too. The empirical claim of the Aristotelian view is that we get the first but not the second.
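The filter analogy can be checked directly. The sketch below evaluates two discrete-time filters with identical poles and zeros but different gain on the unit circle; the specific pole and zero locations are arbitrary illustrative choices.

```python
import numpy as np

def magnitude_response(zeros, poles, gain, n_freqs=256):
    """|H(e^{jw})| for H(z) = gain * prod(z - zeros) / prod(z - poles), w in [0, pi]."""
    z = np.exp(1j * np.linspace(0.0, np.pi, n_freqs))
    num = np.prod(z[:, None] - np.asarray(zeros)[None, :], axis=1)
    den = np.prod(z[:, None] - np.asarray(poles)[None, :], axis=1)
    return np.abs(gain * num / den)

zeros = [-1.0]                     # a zero at z = -1 removes the highest frequency
poles = [0.5 + 0.5j, 0.5 - 0.5j]   # a stable pole pair inside the unit circle

quiet = magnitude_response(zeros, poles, gain=1.0)
loud = magnitude_response(zeros, poles, gain=5.0)

same_shape = np.allclose(quiet / quiet.max(), loud / loud.max())  # relative emphasis
same_level = np.allclose(quiet, loud)                             # absolute amplitude
print("same shape:", same_shape, "| same level:", same_level)
```

Once each response is normalized by its own peak, the two filters are indistinguishable; only the absolute level separates them.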

A short detour: symmetry, invariance, and what physics learned

There is a thread running through modern physics that I think clarifies what these papers are actually circling.

In 1918, Emmy Noether proved a theorem that quietly reshaped how physicists think about reality. The statement, stripped to its essence: every continuous symmetry of a physical system corresponds to a conserved quantity. Time-translation symmetry (the laws are the same yesterday and today) gives you conservation of energy. Spatial-translation symmetry (the laws are the same here and there) gives you conservation of momentum. Rotational symmetry gives you angular momentum.

The deep philosophical content of Noether’s theorem is not really about conservation laws. It is about the relationship between symmetry and substance. What is real is what is invariant under symmetry. The pieces of a physical description that change under a coordinate transformation are bookkeeping. The pieces that don’t change are physics.

This is the lesson Wilczek keeps returning to in A Beautiful Question: beauty in physics is not decoration. It is a diagnostic. When a theory is invariant under a large enough group of transformations, that invariance is doing real work. It is telling you which features of the formalism correspond to anything in the world, and which are artifacts of the perspective you happened to choose.

Now consider what the macro-convergence debate is actually doing in this light.

PRH says: different neural networks should be invariant under modality choice, pixels or words, the underlying statistical reality is the same. Back into Plato’s Cave says: maybe not, the invariance breaks once you measure carefully. The Aristotelian view says: the invariance is real, but it lives at the level of neighborhood relations rather than distances or coordinates.

Each of these is, in disguise, a claim about which symmetry group neural representations respect.

Coordinate invariance: the ability to permute or relabel neurons without changing anything that matters is the most basic symmetry, and nobody really disputes it. The model doesn’t care which neuron is “neuron #7.”
Linear-transformation invariance: the symmetry implicit in PRH’s kernel framing is stronger. It says that any invertible linear remapping of the representation space preserves the meaningful content. This is roughly what model stitching evidence suggested.
Distance-preserving (isometric) invariance: the claim that two models share metric structure up to rigid motion. This is what global CKA and similar metrics try to detect, and what the Aristotelian paper argues doesn’t survive calibration.
Neighborhood-preserving (topological) invariance: the claim that what survives is “who is near whom” is even weaker, and it is what the Aristotelian view argues does survive.
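One way to make this ladder concrete is to apply known transformations to a single toy representation and see which similarity scores notice. The sketch below is a minimal illustration under stated assumptions, not any paper's evaluation code: the data is random Gaussian noise, and `linear_cka` and `knn_overlap` are hypothetical helper names for the standard linear CKA formula and a simple shared-nearest-neighbour score.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (n_samples, n_features) matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def knn_overlap(X, Y, k=10):
    """Mean fraction of k nearest neighbours shared by each sample in X and Y."""
    def knn(Z):
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # squared distances
        np.fill_diagonal(d2, np.inf)                          # exclude self
        return np.argsort(d2, axis=1)[:, :k]
    a, b = knn(X), knn(Y)
    return float(np.mean([len(set(r) & set(s)) / k for r, s in zip(a, b)]))

rng = np.random.default_rng(0)
n, d = 200, 8
X = rng.normal(size=(n, d))  # toy "representation": 200 samples, 8 neurons

variants = {
    # relabel neurons (coordinate symmetry)
    "permuted": X[:, rng.permutation(d)],
    # random orthogonal map (rigid motion, distances preserved)
    "rotated": X @ np.linalg.qr(rng.normal(size=(d, d)))[0],
    # invertible but anisotropic linear map (one axis stretched 10x)
    "stretched": X @ np.diag([10.0] + [1.0] * (d - 1)),
}

results = {name: (linear_cka(X, Z), knn_overlap(X, Z)) for name, Z in variants.items()}
for name, (cka, knn) in results.items():
    print(f"{name:10s} CKA={cka:.3f}  kNN overlap={knn:.3f}")
```

In this toy setup, relabelling and rotation leave both scores at their maximum, while the anisotropic stretch, which is still an invertible linear map, lowers both. Which transformations a score ignores is exactly the symmetry group that score quietly assumes.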

Diagram of symmetry groups and representational invariance

Read in this way, the entire macro debate is a search for the right symmetry group. The Platonic hypothesis bet on a large symmetry group (full kernel equivalence). The Newtonian critique showed the metric-level symmetries don’t hold. The Aristotelian synthesis says the shared structure is real but coarser than we thought: what survives is topological, not geometric.

This is exactly the trajectory physics took. Newton’s mechanics looked like it had Galilean symmetry; Einstein discovered the symmetry was actually Lorentzian, and that the “extra” structure Newton thought was real (absolute simultaneity, absolute time) was a coordinate artifact that didn’t survive the move to a larger symmetry group. The history of physics is, in large part, the history of discovering that the real symmetry group of nature is larger and stranger than the one we initially assumed, and that what we mistook for fundamental structure was often just our particular coordinate choice.

I find it strangely encouraging that representation learning seems to be going through a smaller version of the same process. We assumed neural networks would converge to shared coordinates. We discovered they share something coarser. The story of finding out exactly how coarse is the same shape of story physics has been telling for a hundred years.

If Noether were watching this debate, I think she would recognise the question immediately: not “do they agree?” but “under which group of transformations do they agree?” That question is the one I want to carry forward.

A short detour: notation, and which structure is visible

There is a sibling intuition I want to pull in here, because it has been quietly shaping how I think about this whole debate.

Linus Lee has an essay called Notation as a Tool of Thought (like a much older Iverson essay of the same title) which makes the following point: the notation you choose to represent a problem is not neutral. It determines which operations are easy and which are invisible. Multiplication is hard in Roman numerals and trivial in Arabic numerals, even though the underlying mathematical structure is identical. The numbers are the same. The notations make different operations thinkable.

This is the same idea as the Aristotelian view, run in reverse.

The Aristotelian view says: different models converge on the same relational structure despite using different coordinate systems. Linus’s point is: different coordinate systems make different relational structures (Umwelt?) visible to us. One direction is about model-to-model agreement. The other direction is about world-to-mind expressibility. But the deep claim is the same: what you can see, and what you can compare, depends on the coordinate system you happen to inhabit.

Which raises a slightly uncomfortable question that I want to flag without trying to answer: if our measurement of “convergence” is itself filtered through a chosen notation (mutual-NN, CKA, calibrated or not), are we discovering convergence in the models, or in our metric? The Aristotelian paper answers part of this by calibrating against the null, that is, by asking how much of what we see is an artifact of the lens. But the deeper question remains. We may be partway through a long process of inventing the right notation for what neural-network representations actually are. And the notation we land on will, in a quiet way, determine what we are allowed to see.
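The idea of calibrating against a null can be sketched with a generic permutation test. This is not the Aristotelian paper's procedure, just a minimal illustration under assumed conditions: two feature matrices that are independent by construction, few samples, many dimensions, and a baseline built by shuffling which sample in one "model" is paired with which sample in the other. The helper names `gram` and `cka_from_grams` are hypothetical.

```python
import numpy as np

def gram(X):
    """Centered linear Gram matrix of an (n_samples, n_features) matrix."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T

def cka_from_grams(K, L):
    """Linear CKA computed from two centered Gram matrices."""
    return float((K * L).sum() / (np.linalg.norm(K) * np.linalg.norm(L)))

rng = np.random.default_rng(0)
n, d = 50, 2000                  # few samples, many features (assumed regime)
X = rng.normal(size=(n, d))      # "model A" features
Y = rng.normal(size=(n, d))      # "model B" features, unrelated to X by construction

K, L = gram(X), gram(Y)
raw = cka_from_grams(K, L)

# Null distribution: break the sample correspondence by permuting rows of Y,
# which permutes rows and columns of its Gram matrix.
null = np.array([
    cka_from_grams(K, L[np.ix_(p, p)])
    for p in (rng.permutation(n) for _ in range(200))
])

calibrated = raw - null.mean()
print(f"raw CKA        = {raw:.3f}")
print(f"null mean CKA  = {null.mean():.3f}")
print(f"calibrated CKA = {calibrated:.3f}")
```

In this toy setup the raw score comes out high even though the two matrices share nothing, and the shuffled baseline is just as high, so the calibrated score lands near zero. That is the sense in which part of an uncalibrated number can belong to the lens rather than to the models.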

A note on the shape of this whole story

There is a pattern to this kind of story that is worth naming. Hegel described intellectual progress as a dialectic: a thesis is proposed, an antithesis pushes back against it, and a synthesis emerges that preserves what was right about both. The pattern is not really about philosophy; it shows up wherever a sharp idea meets sharp measurement. Newtonian mechanics gave way to special relativity, which preserved Newton as a low-velocity limit. Euclidean geometry gave way to Riemannian geometry, which preserved Euclid as a zero-curvature special case. Classical genetics gave way to molecular biology, which preserved Mendel’s laws as statistical regularities over a more complex substrate. Each time, the synthesis was neither the original claim rescued nor the critique vindicated. It was a more careful claim that absorbed both.

Convergence of representations now seems to be going through the same loop. PRH proposed a Platonic unity. Back into Plato’s Cave showed where that unity breaks. The Aristotelian view is the synthesis: a weaker, more careful claim about which level the unity actually lives at. If the pattern holds, the Aristotelian view will itself give way to something more careful still, and that is not a sign the field is confused. It is a sign that the field is doing what fields do.

So what did the macro debate actually settle?

In my humble opinion, none of these papers fully proves its strongest philosophical interpretation.

PRH does not prove the existence of a single underlying Platonic geometry of reality.

Back into Plato’s Cave does not prove that such shared structure is absent.

The Aristotelian view does not prove that topology is the “true” invariant of representation.

What these papers actually show, taken together, is more subtle:

Different notions of similarity survive under different measurement regimes.

The deeper question gradually becomes: at what level of abstraction does representational convergence remain stable? Coordinates? Distances? Neighborhoods? Spectra? Functions? Causal structure?

Different papers probe different layers of this hierarchy. Many apparent disagreements come from measuring different kinds of structure rather than observing contradictory phenomena.

This is also where the question stops being a closed question. Because if the unit of convergence is something coarser than geometry (if it’s some kind of relational or topological invariant) then we are forced to ask: what exactly is that invariant made of?

That is no longer a question about whether models agree. It’s a question about what kinds of internal structure are even capable of being the thing that agrees.

What to expect in Part 2

If the macro debate is about whether models converge, the micro debate is about what they converge to.

Lines? Directions? Circles? Fourier modes? Something stranger?

This is where the story gets more concrete and weirder.

The starting point is the Linear Representation Hypothesis (Park, Choe & Veitch, 2024): the idea that high-level concepts in language models are encoded as straight directions in activation space. Add a “male-to-female” vector to “king,” and you get something close to “queen.” (Yes, like the classic example from the Word2Vec paper!) It’s a beautiful, almost too-clean picture, but it requires choosing the right inner product, which turns out to be a non-Euclidean one. Concepts that can vary independently become orthogonal, but only under a specific whitening of the space. This will feel familiar to anyone who has thought about gauge transformations in physics: the geometry only becomes meaningful once you have chosen the right metric.
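The whitening point can be illustrated on a toy example. The sketch below is not the paper's experiment: it assumes three latent concepts that vary independently, mixes them into activations through a made-up matrix `A`, and uses the inverse covariance of the activations as the inner product, which is one whitening choice in the spirit of the paper's causal inner product. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: three independent latent concepts, mixed by a skewed matrix A.
# The columns of A play the role of "concept directions" in activation space.
Z = rng.normal(size=(5000, 3))
A = np.array([[1.0, 0.8, 0.3],
              [0.0, 0.6, 0.5],
              [0.0, 0.0, 0.8]])
X = Z @ A.T  # toy activations

def cosine(u, v, M=None):
    """Cosine similarity under the inner product <u, v> = u^T M v (Euclidean if M is None)."""
    M = np.eye(len(u)) if M is None else M
    return float(u @ M @ v / np.sqrt((u @ M @ u) * (v @ M @ v)))

# Whitening inner product: inverse of the empirical activation covariance.
M = np.linalg.inv(np.cov(X, rowvar=False))

a1, a2 = A[:, 0], A[:, 1]
euclid = cosine(a1, a2)
whitened = cosine(a1, a2, M)
print(f"Euclidean cosine between concept directions: {euclid:.3f}")
print(f"Whitened cosine between concept directions:  {whitened:.3f}")
```

Under the ordinary dot product the two directions look strongly entangled, while under the whitened inner product they come out close to orthogonal, matching the fact that the underlying concepts were independent by construction.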

Then comes the pushback. Not All Language Model Features Are One-Dimensionally Linear (Engels et al., 2024) shows that some concepts cannot be lines. Days of the week. Months of the year. Hours on a clock. These are encoded as circles, irreducibly two-dimensional structures that no rotation, no change of basis, can flatten into a single direction. The model is not just using directions. It is using manifolds.
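To see why a circle is the natural shape here, consider a hand-built toy (not the probing method from the paper): place the seven days at equal angles on a unit circle, so that "add k days" becomes a rotation and the wraparound from Sunday to Monday comes for free. The function names below are hypothetical.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
PERIOD = len(DAYS)

def embed(index):
    """Place day `index` on the unit circle."""
    theta = 2 * np.pi * index / PERIOD
    return np.array([np.cos(theta), np.sin(theta)])

def rotate(v, steps):
    """Advance a point by `steps` days: a 2D rotation."""
    theta = 2 * np.pi * steps / PERIOD
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ v

def decode(v):
    """Read the day index back from the angle of the point."""
    angle = np.arctan2(v[1], v[0])
    return int(np.round(angle / (2 * np.pi / PERIOD))) % PERIOD

def shift(day, steps):
    return DAYS[decode(rotate(embed(DAYS.index(day)), steps))]

print(shift("Saturday", 2))    # Monday
print(shift("Monday", -1))     # Sunday
print(shift("Wednesday", 10))  # Saturday
```

A single direction can order the days from Monday to Sunday, but it has no way to make Sunday adjacent to Monday again. The second dimension is what closes the loop.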

The story sharpens further with Language Models Encode Numbers Using Digit Representations in Base 10 (Levy & Geva, 2025), which shows that LLMs represent numbers not as values but as products of digit-wise circular representations (a torus of circles, one per digit, each modulo 10). The number 375 is not stored as a point on a number line. It is stored as a particular position on the units circle, crossed with a particular position on the tens circle, crossed with a particular position on the hundreds circle. And these circles are causally separable: you can intervene on one without disturbing the others.
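The "one circle per digit" picture can be written down as a small toy encoder. This is a hand-built illustration of the structure described above, not the paper's method or a claim about how any specific model stores numbers; `encode`, `decode`, and `set_digit` are hypothetical names.

```python
import numpy as np

def encode(n, n_digits=3):
    """One (cos, sin) point per base-10 digit, units first. Shape: (n_digits, 2)."""
    digits = np.array([(n // 10**p) % 10 for p in range(n_digits)])
    angles = 2 * np.pi * digits / 10
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def decode(circles):
    """Read each digit back from its angle and reassemble the number."""
    angles = np.arctan2(circles[:, 1], circles[:, 0])
    digits = np.round(angles / (2 * np.pi / 10)).astype(int) % 10
    return int(sum(int(dg) * 10**p for p, dg in enumerate(digits)))

def set_digit(circles, place, digit):
    """Intervene on a single circle (place 0 = units, 1 = tens, 2 = hundreds)."""
    patched = circles.copy()
    patched[place] = encode(digit, n_digits=1)[0]
    return patched

code = encode(375)
patched = set_digit(code, place=1, digit=9)  # overwrite only the tens circle

print(decode(code))     # 375
print(decode(patched))  # 395
```

Because each digit lives on its own circle, the intervention changes the tens digit and leaves the units and hundreds circles untouched, which is the toy analogue of the causal separability described above.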

And then Convergent Evolution (Fu et al., 2026) (the paper that actually started me on this whole journey) pulls back to ask a question that ties everything together. If multiple architectures, trained with multiple objectives, all converge on these same circular and Fourier-structured representations, what is doing the work? The paper proves a striking dissociation: spectral convergence (the same Fourier spikes appearing in different models’ embeddings) is universal, but geometric convergence (the actual linear separability of modular classes) is selective. Two models can look identical in their spectra and behave completely differently in what they can compute. The Fourier signature, on its own, doesn’t tell you whether the structure is functional or merely present.
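The gap between "the spectrum looks the same" and "the geometry does the same job" can be shown with a deliberately simple toy. This is my own construction, not the paper's: two 2D embeddings of the integers 0 to 99 that both carry a period-10 Fourier component, one tracing a full circle and the other using the same cosine twice. The variable and function names are hypothetical.

```python
import numpy as np

N, PERIOD = 100, 10
n = np.arange(N)
theta = 2 * np.pi * n / PERIOD

# Embedding A: a full circle (cos, sin). Embedding B: the same cosine in both slots.
emb_a = np.stack([np.cos(theta), np.sin(theta)], axis=1)
emb_b = np.stack([np.cos(theta), np.cos(theta)], axis=1)

def power_spectrum(E):
    """Per-dimension power spectrum along the token axis."""
    return np.abs(np.fft.rfft(E, axis=0)) ** 2

def distinct_classes(E):
    """How many of the residue classes 0..9 land on distinct points."""
    return len({tuple(np.round(row, 8)) for row in E[:PERIOD]})

spec_a, spec_b = power_spectrum(emb_a), power_spectrum(emb_b)
same_spectrum = bool(np.allclose(spec_a, spec_b))
peak_bin = int(spec_a[:, 0].argmax())  # N / peak_bin is the dominant period

print("identical power spectra:", same_spectrum)
print("dominant period:", N // peak_bin)
print("distinct residue classes, A:", distinct_classes(emb_a))
print("distinct residue classes, B:", distinct_classes(emb_b))
```

Both embeddings show the same spike at period 10, yet in embedding B the residues r and 10 − r collapse onto the same point, so no readout, linear or otherwise, can tell them apart. The spectrum records that a frequency is present, not whether the geometry built from it is usable.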

That last result is, in some sense, the micro-debate’s version of the 1024-sample mirage. Two things can look the same at one level and diverge sharply at another. Knowing which level matters is the whole question.

Diagram previewing the representational convergence part two taxonomy

Part 2 will work through these four papers in the same way (math, intuition, music, signals) and try to build a single coherent picture of what the unit of convergence actually is. By the end, we will have a taxonomy: spectral, geometric, relational, functional, mechanistic. Each of them a different answer to the same question, valid in its own regime, misleading outside it.

I think meaning in these systems is probably not a coordinate, and probably not even a manifold. I think meaning could be a relational thing, something that lives in the pattern of what stays the same when everything else is allowed to change. That is the position I have been circling toward through all of this, and Part 2 is where I will try to make it stick.

See you in Part 2!

Thanks for reading. If any of this resonated, or if you think I have something wrong, I would like to hear from you. This is my attempt to map a landscape I am still actively exploring, and the map will get sharper with pushback.

Quoted by Koepke et al. in Back into Plato’s Cave (2026), which uses it to ground the Umwelt argument in a linguistic register. The original is in Wittgenstein’s Philosophical Investigations (1953), Part II, §xi.
Footnoted in Koepke et al. (2026). Season 5, episode 2 of Star Trek: The Next Generation (1991). The episode was nominated for a Hugo Award for Best Dramatic Presentation in 1992.

The Representation Layer
