Anthropic has been working on the interpretability of neural networks for a long time. Its earlier SAE (Sparse Autoencoder) method has already been adopted by OpenAI and Google, and now it is offering a new way to "parse" an AI's thinking - Circuit Tracing.

๐ŸŸข How does it work?

๐Ÿ’ They take an off-the-shelf language model and select a task.

๐Ÿ˜˜ Replace some components of the model with simple linear models (Cross-Layer Transcoder).

๐Ÿ˜˜ Train these replaced parts to mimic the original model, minimizing the difference in output.

๐Ÿ’ Now you can see how information "flows" through all the layers of the model.

๐Ÿ˜˜ Based on this data, an attribution graph is built - it shows which attributes influence each other and form the final answer.

๐ŸŸข What interesting things were discovered in Claude's brain?

๐ŸŸ  The LLM "thinks ahead." For example, when she writes a poem, she plans the rhyme scheme in advance, even before she starts a new line.

๐ŸŸ  Math is not just about memorization. Turns out the model is actually calculating, not just retrieving memorized answers.

๐ŸŸ  Hallucinations have a cause. A specific "answer is known" trigger is found. If it is triggered in error - the model starts making things up.

๐ŸŸ  Fun fact: if you tell the model the answer to a problem right away, it will think backwards - come up with a plausible path to that answer.

#claude #AI