Pooled vs. compartment-specific analysis: when averaging hides the biology

You ask one question of a spatial dataset and get two answers. Pool the regions together and the cytotoxic T-cell signal appears to increase with the biomarker. Split the tumor and stromal compartments apart and the association reverses — decreasing in each. Same ROIs, same measurements, opposite conclusions.

This isn’t a rare edge case. It’s a structural consequence of how spatial data are collected, and it has a name in statistics: Simpson’s paradox (you’ll also see it called aggregation bias or compositional confounding). On a recent spatial omics project we watched a version of it emerge, where the pooled estimate contradicted what every individual compartment was showing. The pooled answer is the easiest to compute and the easiest to plot — but when compartments differ in composition and sampling density, it can reflect shifts in compartment mixture rather than the biological relationship you’re after.

How averaging inverts a result

The mechanism is worth seeing concretely, because once you’ve seen it you can’t unsee it in your own data.

Spatial platforms like GeoMx DSP don’t sample tissue uniformly. You place regions of interest (ROIs) on a tumor compartment, on adjacent stroma, maybe on an immune-enriched margin. Those compartments differ in two ways at once: in the baseline level of the readout you’re measuring (say, T-cell abundance) and in how many ROIs from each comparison group landed there. When a variable that drives your outcome is also unevenly distributed across the groups you’re comparing, the pooled average can point the opposite way from every subgroup.

Here’s a deliberately simplified, illustrative example — not real project data, just the arithmetic:

Compartment	Biomarker-high ROIs	Biomarker-low ROIs	Mean CD8 (high)	Mean CD8 (low)
Tumor	10	40	0.10	0.15
Stroma	40	10	0.50	0.60

Within both compartments, biomarker-high ROIs have lower CD8 than biomarker-low ROIs (0.10 vs. 0.15 in tumor, 0.50 vs. 0.60 in stroma). But pool everything and the weighting flips it. The biomarker-high group is dominated by stroma (40 of its 50 ROIs, all at the high stromal baseline); the biomarker-low group is dominated by tumor (40 of 50, at the low tumor baseline). The pooled means come out to 0.42 for biomarker-high and 0.24 for biomarker-low — so the pooled comparison reports CD8 going up with the biomarker, the exact opposite of the within-compartment truth. Nothing about the pooled number looks suspicious on its own.
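
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch that uses only the illustrative table values; the `cells` dictionary and the two function names are hypothetical, introduced here purely for illustration.

```python
# Illustrative numbers copied from the table above (not real project data).
# Key: (compartment, biomarker group) -> (number of ROIs, mean CD8)
cells = {
    ("tumor", "high"): (10, 0.10),
    ("tumor", "low"): (40, 0.15),
    ("stroma", "high"): (40, 0.50),
    ("stroma", "low"): (10, 0.60),
}


def pooled_mean(group):
    """ROI-count-weighted mean CD8 for one biomarker group, across compartments."""
    total_rois = sum(n for (_, g), (n, _) in cells.items() if g == group)
    weighted_sum = sum(n * m for (_, g), (n, m) in cells.items() if g == group)
    return weighted_sum / total_rois


def within_difference(compartment):
    """Mean CD8 (high) minus mean CD8 (low) inside a single compartment."""
    return cells[(compartment, "high")][1] - cells[(compartment, "low")][1]


pooled_high = pooled_mean("high")
pooled_low = pooled_mean("low")
print(f"pooled: high={pooled_high:.2f}, low={pooled_low:.2f}")  # 0.42 vs 0.24

for comp in ("tumor", "stroma"):
    # Both differences are negative, while the pooled difference is positive.
    print(f"{comp}: high - low = {within_difference(comp):+.2f}")
```

The only thing that changes between the two views is the weighting: `pooled_mean` weights each compartment by its ROI count, while `within_difference` never mixes compartments at all.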

The point isn’t the specific digits. It’s that the pooled estimate is a weighted blend, and those weights are set by compartment composition and ROI sampling density — not by the biological effect you’re trying to measure. Stroma is more abundant and more T-cell-rich than tumor core in many tissues; the moment your groups differ in their tumor-to-stroma ratio, pooling lets composition masquerade as effect.

Why spatial data are especially prone to this

Bulk RNA-seq averages too, but at least it averages consistently: one number per sample. Spatial data invite the trap because they give you compartments and then tempt you to collapse them. A few things make it worse:

ROI counts are unbalanced by design. You place more regions where the interesting biology is. That’s good experimental practice and a direct source of unequal weights.
Compartments have wildly different baselines. As we’ve written elsewhere about immune cell infiltration, tumor core, stroma, and immune-enriched zones carry fundamentally different cellular compositions. Averaging across them blends populations that were never comparable.
Deconvolution adds its own structure. Cell-type proportions are themselves estimates that behave differently per compartment. Pool first, deconvolve second — or deconvolve per compartment and pool the proportions — and you can get different answers from the same pipeline. (More on the upstream decisions in our spatial deconvolution primer.)

And normalizing each ROI to its nuclei count or area — which you should do — doesn’t rescue you here. That step makes individual ROI measurements comparable to each other; it says nothing about how those ROIs are weighted together when you pool them. Per-ROI normalization and cross-ROI aggregation are different axes, and the paradox lives entirely on the second one.

None of this is exotic. It’s the default shape of a GeoMx experiment. Which is why “just run the comparison” is a more dangerous instruction than it sounds.

The failure mode in practice

The reason this matters beyond statistical hygiene: pooled numbers travel. They make it into slide decks, into abstracts, into manuscript drafts — and they look authoritative because they summarize the whole study. On the project that prompted this post, a pooled association made it into a draft figure before anyone noticed that the compartment-specific analysis told the opposite story. Nobody was careless. The pooled number was simply the first one computed, and it was plausible.

That’s the insidious part. A compartment-specific result that contradicts your pooled one doesn’t announce itself with an error bar or a failed QC check. Both analyses run cleanly. Both produce a tidy p-value. The only way to know which is real is to look at both and ask why they disagree.

When they disagree and the thing that differs between your groups is compartment composition, the within-compartment answer is the one to trust — pooling is the step that let composition leak in. The exceptions exist (sometimes the pooled, tissue-level effect is genuinely the quantity you care about), but they’re exceptions you argue for explicitly, not the default you back into.

How to catch it before it reaches a figure

You don’t need heavy machinery. You need a habit of distrusting the pooled number until it’s earned trust.

1. Analyze each compartment separately first, then ask whether pooling adds anything. Compartment-specific analysis isn’t a refinement you do if there’s time — it’s the primary analysis. Make pooling the thing you justify, not the thing you assume. This is the same discipline that makes immune profiling trustworthy: a relationship that holds in tumor but reverses in stroma is a finding, not a nuisance.

2. Always report the subgroup table alongside the headline. If a comparison is worth making, it’s worth showing per compartment. The illustrative table above takes thirty seconds to read and would have caught the inversion immediately. A pooled bar chart with no compartment breakdown is where paradoxes hide.

3. Watch the weights — the right ones. Tabulate ROI counts per compartment for each group before you interpret any pooled result. The red flag isn’t unequal counts as such — it’s when the groups you’re comparing have different compartment mixes (biomarker-high ROIs landing mostly in stroma while biomarker-low land mostly in tumor). When that mix differs across groups — especially if the compartments have very different baseline levels of the feature — a pooled average can be driven more by compartment composition than by the biological relationship of interest.

4. When you must summarize across compartments, weight deliberately. A simple pooled mean weights by ROI count, which is an artifact of sampling. If you genuinely want a tissue-level estimate, decide the weights on biological grounds (e.g., known compartment fractions) rather than letting the ROI placement decide for you. Mixed-effects models that include compartment as a term are the more rigorous version of the same idea.

5. Make the context impossible to lose downstream. A fold-change or correlation shown without its compartment is an invitation to misread it. The number and its context — which compartment, pooled or specific, which grouping — have to travel together, all the way into the report a collaborator reads months later.
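
Steps 3 and 4 are straightforward to script. The sketch below is one possible way to do it, not a prescribed pipeline: it tabulates the compartment mix per group, then recomputes each group’s mean with the same fixed compartment weights. The `rois` records are expanded from the illustrative table earlier in the post, and the function names and the 50/50 `tissue_weights` are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical per-ROI records (biomarker_group, compartment, cd8), expanded
# from the illustrative table earlier in the post. Not real project data.
rois = (
    [("high", "tumor", 0.10)] * 10
    + [("low", "tumor", 0.15)] * 40
    + [("high", "stroma", 0.50)] * 40
    + [("low", "stroma", 0.60)] * 10
)


def compartment_mix(records):
    """Return the fraction of each group's ROIs that sit in each compartment."""
    per_cell = Counter((group, comp) for group, comp, _ in records)
    per_group = Counter(group for group, _, _ in records)
    return {(g, c): n / per_group[g] for (g, c), n in per_cell.items()}


def standardized_mean(records, group, weights):
    """Mean for one group using fixed compartment weights, not ROI counts."""
    total = 0.0
    for comp, weight in weights.items():
        values = [v for g, c, v in records if g == group and c == comp]
        if not values:
            raise ValueError(f"no {group} ROIs in {comp}; cannot standardize")
        total += weight * sum(values) / len(values)
    return total


# Step 3: a very different mix between groups is the red flag.
mix = compartment_mix(rois)
for (group, comp), fraction in sorted(mix.items()):
    print(f"{group:>4} in {comp:<6}: {fraction:.0%} of that group's ROIs")

# Step 4: assumed tissue fractions, chosen only for illustration; they
# should sum to 1 and be applied identically to every group.
tissue_weights = {"tumor": 0.5, "stroma": 0.5}
std_high = standardized_mean(rois, "high", tissue_weights)
std_low = standardized_mean(rois, "low", tissue_weights)
print(f"standardized: high={std_high:.3f}, low={std_low:.3f}")
```

Because both groups now share the same weights, the standardized comparison points the same way as the within-compartment results (high below low). The weights themselves remain a judgment call that should be stated next to the number.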

The honest trade-off

Compartment-specific analysis costs you power. Splitting 100 ROIs into three compartments of roughly 33 each means smaller groups and wider confidence intervals (standard errors scale with 1/√n, so each compartment’s interval is roughly 1.7 times wider than the pooled one, all else equal), and some comparisons simply won’t reach significance where the pooled version would have. That’s real, and it’s worth being honest about: pooling is tempting partly because under-powered studies need the sample size.

But a significant pooled result driven by compartment composition rather than within-compartment biology doesn’t deliver the power it appears to. It may simply be answering a different, less useful question, and the apparent gain comes at the cost of biological interpretability. The right response to thin per-compartment data is to say so (and to design the next study with enough ROIs per compartment), not to pool your way to a number you can’t defend when a reviewer asks which compartment it came from.

There’s also a middle path worth naming: you don’t have to choose between pooling everything and splitting blindly. In larger studies, hierarchical or mixed-effects models that carry compartment as a covariate or random effect can recover much of the lost power while still accounting for compartment structure. But whatever the modeling approach, the principle holds: compartment composition is part of the signal-generating process, not something to average away. This same tension shows up in predicting checkpoint inhibitor response, where the most informative signals are either specific to one compartment or defined by relationships between compartments, which is exactly the structure pooling destroys.
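
To make “compartment as a covariate” concrete, here is a minimal standard-library sketch. It is a simplified stand-in, not a mixed-effects model: it centers the biomarker and the readout within each compartment before fitting a single slope, which is equivalent to ordinary least squares with one intercept per compartment. The data are again expanded from the illustrative table, and all function and variable names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def compartment_adjusted_slope(x, y, compartment):
    """Slope of y on x after centering both within each compartment."""
    index_by_comp = defaultdict(list)
    for i, comp in enumerate(compartment):
        index_by_comp[comp].append(i)
    x_centered = [0.0] * len(x)
    y_centered = [0.0] * len(y)
    for idx in index_by_comp.values():
        mx = fmean([x[i] for i in idx])
        my = fmean([y[i] for i in idx])
        for i in idx:
            x_centered[i] = x[i] - mx
            y_centered[i] = y[i] - my
    return ols_slope(x_centered, y_centered)


# Illustrative data from the table: x = 1 for biomarker-high, 0 for low.
compartment = ["tumor"] * 50 + ["stroma"] * 50
x = [1] * 10 + [0] * 40 + [1] * 40 + [0] * 10
y = [0.10] * 10 + [0.15] * 40 + [0.50] * 40 + [0.60] * 10

naive = ols_slope(x, y)  # ignores compartment: positive
adjusted = compartment_adjusted_slope(x, y, compartment)  # negative
print(f"naive slope={naive:+.3f}, compartment-adjusted slope={adjusted:+.3f}")
```

The adjusted slope uses all 100 ROIs in one fit, so it keeps more of the sample size than three separate analyses, while its sign agrees with the within-compartment results. A real analysis would more likely use a dedicated mixed-model library and would also need to account for ROIs nested within patients or slides.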

The discipline is simple to state and easy to skip under deadline: when compartments differ in composition, distrust the average — start from the compartment-specific result, and never let a pooled number into a figure without checking what its subgroups say. The averaging step is where the biology goes to hide.

Cytogence runs spatial transcriptomics analysis — GeoMx and beyond — with compartment-aware design from the first ROI to the final report, and delivers it through Atlas so every number stays traceable to the compartment and run it came from. If you’re planning or troubleshooting a spatial study, start a conversation about your project.