Beyond Explainability: Opening the Black Box from Within
Bardia Khosravi
When an AI model tells you there’s a 34% risk of long-term complication after hip arthroplasty or suggests asking about visual disturbances in a pregnant patient with hypertension, you naturally want to know why. For years, we’ve relied on saliency maps, heatmaps that highlight which regions of an image influenced the AI’s decision, to provide that answer. But here’s the uncomfortable truth: these maps can be deeply misleading (see Revisiting the Trustworthiness of Saliency Methods in Radiology AI and Sanity Checks for Saliency Maps). Research has shown that saliency maps can suggest equally convincing explanations for models with completely randomized weights as for models with outputs derived from clinically appropriate image features. These maps tell us where the model is looking, but not what it’s actually “seeing” or how it’s reasoning. When the hip joint is highlighted, is the model evaluating joint space width? Head shape? Acetabular coverage? The saliency map can’t tell us.
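To make the limitation concrete, here is a minimal, purely illustrative sketch. The toy "model" below scores an image using only the mean of its top-left patch, and a finite-difference saliency map correctly highlights that patch; everything in it (the model, the image size) is invented for illustration, not drawn from any real radiology system.

```python
import numpy as np

# Toy "model": the score depends only on the mean intensity of the
# 2x2 top-left patch of a 4x4 "image".
def model(image):
    return image[:2, :2].mean()

def saliency(image, eps=1e-4):
    """Approximate |d(score)/d(pixel)| by finite differences."""
    sal = np.zeros_like(image)
    base = model(image)
    for idx in np.ndindex(image.shape):
        bumped = image.copy()
        bumped[idx] += eps
        sal[idx] = abs(model(bumped) - base) / eps
    return sal

image = np.random.rand(4, 4)
sal = saliency(image)
# The map lights up the top-left patch -- "where" the model looks --
# but reveals nothing about *what* computation happens there
# (mean? max? edge detection?), which is exactly the gap saliency
# maps leave in clinical settings.
```

The map is faithful here, yet it still cannot distinguish a model computing a mean from one computing a maximum over the same pixels; that is the "where, not what" problem in miniature.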

The lack of truly informative explanations has become even more pronounced with the rise of generative AI. Large language models and vision-language models don’t just output a single classification—they generate streams of tokens, each influencing the next. Traditional explainability methods weren’t designed for this complexity.
From Explainability to Interpretability
The distinction between explainability and interpretability matters. Explainability addresses the question “what did the model do?” through post-hoc analysis of inputs and outputs. Interpretability asks “how does the model work?” by reverse-engineering the model’s internal mechanisms. You can think of it like the difference between observing that a car turns left when you rotate the steering wheel versus understanding the mechanical linkages that make the turn happen.
Mechanistic interpretability takes a bottom-up approach: rather than approximating the model’s behavior from outside, we examine the actual computations occurring within. This reveals not just correlations but causal mechanisms (see Mechanistic Interpretability for AI Safety -- A Review).
The Polysemanticity Challenge
Why is interpretability so hard? Neural networks face a fundamental constraint: they must encode thousands of concepts using limited parameters. The solution they arrive at during training is called superposition: representing more features than there are neurons by encoding concepts as overlapping combinations of them (see Toy Models of Superposition).
The result is polysemanticity: single neurons responding to multiple unrelated concepts. A neuron in a chest X-ray model might activate for both pneumothorax and pleural effusion, or encode both anatomical features and patient demographics. This makes individual neurons nearly impossible to interpret directly.
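A tiny numerical sketch makes superposition and the resulting polysemanticity tangible. The three "concepts" and two "neurons" below are hypothetical stand-ins chosen for illustration:

```python
import numpy as np

# Hypothetical toy: 3 unrelated concepts must share only 2 neurons,
# so each concept is assigned a direction in activation space, and
# the directions overlap (superposition).
concept_directions = np.array([
    [1.0, 0.0],   # concept A, e.g. "pneumothorax"
    [0.0, 1.0],   # concept B, e.g. "pleural effusion"
    [0.7, 0.7],   # concept C, e.g. "patient age" -- overlaps both neurons
])

def neuron_activations(concept_strengths):
    # Activations are the sum of each active concept's direction.
    return concept_strengths @ concept_directions

# Neuron 0 fires when concept A is present...
only_a = neuron_activations(np.array([1.0, 0.0, 0.0]))
# ...and also when the unrelated concept C is present: reading
# neuron 0 alone cannot tell these situations apart (polysemanticity).
only_c = neuron_activations(np.array([0.0, 0.0, 1.0]))
```

Because concept C's direction overlaps both neurons, no single neuron has a clean meaning, which is why interpreting neurons one at a time breaks down.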

Sparse Autoencoders: Disentangling the Mess
Sparse autoencoders (SAEs) offer an elegant solution. An autoencoder is a type of neural network that learns to re-encode data into a different representation and then reconstruct the original; the "sparse" version constrains that representation so that only a handful of its entries are nonzero for any given input, forcing the network to identify only the most essential features. When applied to the internal activations of AI models, SAEs learn to decompose polysemantic activations into higher-dimensional representations in which individual features become interpretable. Imagine taking a muddled audio recording of multiple conversations and separating it into distinct voices.
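A minimal sketch of the idea, with an important caveat: in a real SAE both the encoder and decoder weights are learned by minimizing reconstruction error plus an L1 sparsity penalty, whereas here the dictionary, bias, and feature names are hand-set purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hand-set toy dictionary: 3 candidate concept directions packed into
# a 2-neuron activation space (hypothetical names for illustration).
dictionary = np.array([
    [1.0, 0.0],   # feature 0, e.g. "pneumothorax"
    [0.0, 1.0],   # feature 1, e.g. "pleural effusion"
    [0.6, 0.8],   # feature 2, e.g. "patient age" (overlaps both neurons)
])

def sae_encode(activation, bias=-0.85):
    # Match the activation against each feature direction; the negative
    # bias plus ReLU zeroes out weak matches, keeping the code sparse.
    return relu(dictionary @ activation + bias)

# A polysemantic 2-d activation that is really "just feature 2"...
activation = np.array([0.6, 0.8])
code = sae_encode(activation)
# ...maps to a 3-d sparse code where only feature 2 is active:
# the overlapping neurons have been disentangled into one named feature.
```

The key move is dimensionality expansion: two entangled neurons become three (here) or, in practice, tens of thousands of mostly-zero features, each of which can be inspected and labeled individually.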

When applied to medical imaging models, SAEs can identify features corresponding to specific pathologies, anatomical structures, and hidden confounders.
Attribution Graphs: Mapping the Model’s Reasoning
Although SAEs reveal what features exist, labs at the frontier of interpretability research have developed techniques to show how those features interact to produce the model's output. Recent work on "the biology of a large language model" has resulted in the creation of attribution graphs, visualizations that trace how internal representations interact to produce outputs (see On the Biology of a Large Language Model).
In one compelling example from their research (Figure 30 in the original article), they present a clinical scenario: a 32-year-old pregnant woman at 30 weeks gestation presents with severe right upper quadrant pain, mild headache, nausea, blood pressure of 162/98 mmHg, and mildly elevated liver enzymes. The model is asked: if we can only inquire about one additional symptom, what should it be? The model suggests inquiring about visual disturbances.
The attribution graph reveals the reasoning chain behind this answer. Individual words from the clinical prompt connect to concepts the model has recognized: "32-year-old female at 30 weeks gestation" activates a "pregnancy" feature; "right upper quadrant" activates an anatomical location feature; "162/98 mmHg" activates a "high blood pressure" feature; "elevated liver enzymes" activates a "liver conditions" feature. These symptom-level features then converge on a central node representing "preeclampsia": the model has identified this as the most likely diagnosis. From the preeclampsia node, connections branch to features representing additional diagnostic criteria: "visual deficits," "proteinuria," "edema," "epigastric pain," and "hemorrhage." The model ultimately outputs "visual disturbances" because this feature has the strongest connection to confirming the suspected diagnosis.
The power of mechanistic interpretability becomes clear when researchers intervene on these internal representations (Figure 31 in the original article). When they artificially suppress the model’s “preeclampsia” feature, the entire constellation of associated features deactivates. With preeclampsia removed from consideration, the model shifts toward an alternative diagnosis: biliary system disorders, which also explains right upper quadrant pain and elevated liver enzymes in pregnancy. Now, instead of suggesting visual disturbances, the model recommends asking about decreased appetite.
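The logic of such an intervention can be sketched in miniature. The toy below is not the real attribution graph: the nodes, edge weights, and follow-up questions are invented to mirror the published example, and "suppression" is modeled as simply removing a diagnosis node from consideration:

```python
# Hypothetical toy attribution graph: symptom features feed diagnosis
# nodes; the strongest active diagnosis determines the follow-up question.
edges = {
    "preeclampsia":     {"pregnancy": 0.9, "high_bp": 0.8, "elevated_lft": 0.5},
    "biliary_disorder": {"ruq_pain": 0.7, "elevated_lft": 0.6},
}
follow_up = {
    "preeclampsia": "visual disturbances",
    "biliary_disorder": "decreased appetite",
}

def recommend(active_symptoms, suppressed=()):
    scores = {
        dx: sum(w for s, w in inputs.items() if s in active_symptoms)
        for dx, inputs in edges.items()
        if dx not in suppressed   # the intervention: clamp this node to zero
    }
    best = max(scores, key=scores.get)
    return follow_up[best]

symptoms = {"pregnancy", "high_bp", "elevated_lft", "ruq_pain"}
print(recommend(symptoms))                                # visual disturbances
print(recommend(symptoms, suppressed={"preeclampsia"}))   # decreased appetite
```

Ablating one internal node changes the downstream recommendation in a predictable way, which is the essence of the causal claim: the graph is not just a description of the output but a handle on the computation that produced it.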
This is causal understanding, not mere correlation. I highly encourage readers interested in exploring these visualizations to consult the original Anthropic publication, which includes interactive versions of these attribution graphs.
Why This Matters for Radiology AI
Medical imaging AI presents a unique opportunity for interpretability research. Unlike general-purpose models trained on internet text, radiology models operate in domains with (relatively) well-defined ground truth and expert-validated concepts. When we discover that a model encodes “pneumothorax,” we can verify this against radiologist annotations. Vision-language models in medicine may actually help us understand shortcuts and biases more clearly than models trained on natural images.
As foundation models increasingly enter clinical practice, mechanistic interpretability offers something saliency maps never could: the ability to audit not just what models predict, but how they reason. This transparency may prove essential for building the trust required for clinical adoption.
Bardia Khosravi, MD, MPH, MHPE is a radiology resident at Yale University and an adjunct assistant professor of radiology at Mayo Clinic. His research focuses on the effectiveness of synthetic data in medical AI performance and fairness, evaluation of medical foundation models, and deployment of AI on edge devices. With over a decade of experience in full-stack development and six years in machine learning, he works at the intersection of radiology and artificial intelligence to develop innovative solutions for healthcare. Follow him on X: @BrdKhsrv | www.BrdKhsrv.com


