How GPT Learns Layer by Layer

Layer-wise Interpretability in OthelloGPT

2025-01-13

LLMs can perform impressively on reasoning and planning benchmarks, but it’s unclear whether they actually build coherent internal representations of the structures they’re trained on, or whether they rely on surface-level patterns. Understanding how these internal world models form is important for building agents that behave consistently across tasks, so we wanted a controlled setting to study it.

We use OthelloGPT, a small GPT trained only to predict the next move in Othello games, as a testbed. The rules and board state are fully known, which makes it possible to check what the model has actually learned at each layer and compare different interpretability methods on the same ground truth.

Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in how it understands board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, and find that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification.

Note: This was our first attempt at mechanistic interpretability. We are not experts in this field and just wanted to share some interesting findings we had.

Sparse Autoencoders (SAEs) vs Linear Probes

We compare Sparse Autoencoders (SAEs) and linear probes to analyze the learned representations. Our experiments show that SAEs uncover more distinctive and disentangled features, particularly for compositional attributes, whereas linear probes primarily identify features that act as strong correlators to classification accuracy.

What are Sparse Autoencoders?

Sparse Autoencoders (SAEs) are neural networks trained to transform model activations into a sparse set of features that reconstruct the original input. Unlike standard autoencoders, which reduce dimensionality, SAEs expand the hidden layer and encourage sparsity through an L1 penalty, so that only a small number of features activate at once.

Each activation corresponds to a learned feature (i.e. tile color), and sparsity helps ensure that these features are both distinct and interpretable. Instead of all features lighting up at once, the model is forced to choose the most relevant ones.

We apply SAEs to the residual stream of OthelloGPT to isolate compositional attributes of board state. Because most features remain inactive for any given input, the ones that do activate often correspond to isolated concepts, which makes it possible to associate semantic meaning with specific neurons.

What are Linear Probes?

Linear probes are simple classifiers trained on frozen model activations to assess what information is linearly accessible at different layers. Given a hidden representation from OthelloGPT, we train a probe to predict whether a specific tile is empty, belongs to the player, or belongs to the opponent.

While linear probes are useful for evaluating representational capacity, they provide limited insight into how those features are structured. They tend to detect whatever is most predictive for a task, even if it’s entangled with other features, making them better suited for classification rather than interpretation.

Comparison

Linear probes are effective at measuring what features a model encodes, but not how those features are organized. They highlight what’s linearly decodable, which often leads to blending correlated attributes together. In contrast, SAEs recover sparse, disentangled components of the representation, making it easier to assign semantic meaning to individual features.

Layer by Layer Analysis

We examine each layer of OthelloGPT, tracking two key gameplay concepts: tile color and stability. Tile color refers to the state of each board position, i.e. whether it’s empty, occupied by the player, or occupied by the opponent. Stability is defined as whether a tile can be flipped for the remainder of the game; stable tiles, such as corners or anchored edges, remain fixed and represent longer-term control over the board.

Linear probe accuracy increases with depth, suggesting that deeper layers encode progressively stronger predictors for classification. However, they fail to reveal distinct or compositional features per layer. SAEs address this by disentangling activations into sparse, interpretable bases, providing deeper insight into the features learned at each layer.

These results reveal distinct differences in how SAEs and linear probes learn features from the board. SAE visualizations highlight clear and structured patterns, such as strong activations at corner and edge tiles in layer 1, indicating that the model captures the board’s shape early on. As we progress to layers 2 and 4, SAEs show more dynamic changes, with activations concentrated in the central tiles and along the edges. This suggests the SAEs are not only learning positional importance, but could also be capturing the evolving dynamics of central tiles, which tend to flip frequently as the game progresses.

These results are aggregated across 10 random seeds, demonstrating the robustness and consistency of SAEs in identifying meaningful features. In contrast, the linear probe visualizations show more dispersed activations across the board. While individual tiles are well-classified, the activations lack the clear structural patterns seen in SAEs.

When analyzing tile stability, we observe that the strongest feature activations appear in the middle of the board, specifically in layers 2 through 4. Earlier layers do not yet capture this long-range concept, while later layers likely reallocate capacity to downstream objectives like move prediction.

Final Notes

We analyzed the progression of learned features in OthelloGPT, revealing a hierarchical structure in its internal representations. Specifically, we observe that different layers in OthelloGPT focus on distinct aspects of gameplay: some capture structural attributes like board shape and edges, while others appear to encode dynamic features such as tile flips and shifts in board state.

By comparing the capabilities of sparse autoencoders (SAEs) and linear probes, we established that SAEs excel at uncovering more distinctive and disentangled features, particularly for compositional and interpretable attributes. In contrast, linear probes tend to highlight features that serve as strong correlators for classification tasks. Through these methods, we decoded features related to tile color and stability, offering a framework for understanding how GPT-based models and transformers construct and organize their internal representations.