Provenance in Bioinformatics: Why Every Number Should Trace to Its Source
When a reviewer asks where a value came from, “somewhere in the analysis” won’t do. Here is how to build provenance in so that every number traces to its source.
A co-author is finalizing a manuscript and points at a single sentence in the abstract — an association between a checkpoint marker and a clinical subgroup. Reasonable question: which analysis produced this number, and can I see the data behind it? The number is correct — that was never in question. What you can’t do, an hour into opening folders, is point to the exact run that produced it: the inputs, the parameters, the code. The value matches one run, but not the run anyone thought it came from. The result is sound; it’s the trail back to it that has gone missing.
This is one of the most avoidable failure modes in computational biology, and it has almost nothing to do with whether your statistics were correct. It’s a provenance problem — the inability to trace a reported result back to the exact inputs, parameters, samples, and code that produced it. We’ve argued before that a good deliverable leads with findings, not files; traceability is the property underneath that one, and it deserves its own treatment. We’ve watched its absence cost real time during manuscript revisions, and the fix is far less glamorous than the analysis itself.
What “provenance” actually means here
Provenance is the documented chain from a number in your report back to everything that determined it. For a single fold-change, a correlation, or a p-value, that chain includes:
- Which input data — the exact file(s), down to a version or checksum, not just “the GeoMx export.”
- Which samples — the inclusion/exclusion set actually used, after whatever filtering happened.
- Which parameters — normalization method, thresholds, the grouping or contrast, the random seed.
- Which code — the script and its version (a commit hash, ideally), plus the environment it ran in.
When all four are recoverable, “where did this come from?” is a lookup. When any one is missing, it’s an investigation — and the investigation tends to land precisely when you have the least time for it, during revisions or a grant report.
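One way to make the four components concrete is a small record type that can report which links in the chain were never written down. This is a minimal sketch, not a standard schema: the class name, field names, and example values are all hypothetical, and the environment mentioned above could be added as a further field.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """The four links behind one reported number (hypothetical schema)."""
    input_checksums: dict = field(default_factory=dict)  # file name -> SHA-256 hex digest
    sample_ids: tuple = ()                                # samples actually used, after filtering
    parameters: dict = field(default_factory=dict)       # normalization, thresholds, contrast, seed
    code_commit: Optional[str] = None                     # commit hash of the analysis code

    def missing(self) -> list:
        """Names of the components that were not recorded (empty or None)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_lookup(self) -> bool:
        """True when all four components are recoverable."""
        return not self.missing()


# Illustrative values only: the file name, sample IDs, and parameters are made up.
record = ProvenanceRecord(
    input_checksums={"counts.csv": "<sha256 hex digest>"},
    sample_ids=("S01", "S02", "S03"),
    parameters={"normalization": "Q3", "seed": 42},
)
print(record.missing())  # ['code_commit']
```

Here the record is a lookup for three of the four components, and the missing commit hash is exactly the part that would turn into an investigation.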
The reason this is hard isn’t conceptual. It’s that an analysis is a moving target. You subset the samples, re-run after a fix, try a corrected grouping, swap one normalization for another. Each step is sensible the day you do it. Three weeks later the relationship between “the figure in the draft” and “the run that made it” has quietly dissolved.
The failure mode that should scare you
The mundane version of poor provenance is annoyance — too many folders, vague names like analysis_v2_filtered_final. Irritating, but survivable.
The dangerous version is silent inconsistency. We’ve described before how the same biological question can give opposite answers when compartments are pooled versus analyzed separately. Now imagine two analysis variants like that living in adjacent folders, and a result migrating into a manuscript from one of them while a co-author later goes looking in the other. Both runs exist. Both are defensible in their own framing. The number is “right” — and still unsupportable, because the report doesn’t record which variant it came from.
That’s the part worth internalizing: provenance failures don’t announce themselves. The analysis looks done, the figure looks clean, and the gap only surfaces when someone asks a question you can no longer answer cheaply. A reviewer asking for the source of a value is not an edge case. It’s the normal life cycle of a paper.
Provenance is not the same as reproducibility
These get conflated, and the distinction matters. Reproducibility is the ability to re-run the pipeline and get the same output. Provenance is the ability to look at an existing output and know what produced it. You can have one without the other.
A perfectly containerized, version-pinned Nextflow pipeline is reproducible — but if the figure in your slide deck is a screenshot from a run you’ve since overwritten, you still can’t prove which inputs it reflects. Conversely, a messy interactive analysis can have decent provenance if you’ve recorded what you did at each step, even though re-running it from scratch would be painful.
You want both, but for the specific problem of defending a number in a manuscript, provenance is the one that saves you, and it’s the one more often missing.
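The provenance question, “what produced this existing output?”, can be answered mechanically if each run records checksums of what it wrote. The sketch below assumes a hypothetical layout in which every run directory holds a manifest.json with an "outputs" mapping of file name to SHA-256 digest; neither the layout nor the function names come from any particular tool.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_producing_runs(artifact: Path, runs_root: Path) -> list:
    """Return run directories whose manifest lists an output with the artifact's checksum.

    Assumes runs_root/<run>/manifest.json exists and has an "outputs" mapping
    of file name -> SHA-256 hex digest (a hypothetical convention).
    """
    target = sha256_of(artifact)
    matches = []
    for manifest_path in sorted(Path(runs_root).glob("*/manifest.json")):
        manifest = json.loads(manifest_path.read_text())
        if target in manifest.get("outputs", {}).values():
            matches.append(manifest_path.parent)
    return matches
```

This only matches byte-identical copies of an output file, such as an exported table or figure. A screenshot of a figure has a different checksum, which is one more reason to carry a run identifier with the number itself.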
How to build it in (without drowning in process)
The instinct, once burned, is to log everything. Resist it — provenance that’s expensive to maintain gets abandoned under deadline, which is worse than a lighter scheme you’ll actually keep. A few practices carry most of the value:
1. Make the run self-describing. Every analysis run should emit, alongside its outputs, a small manifest: input file names and checksums, the sample list used, the key parameters, the code commit, and a timestamp. If a human has to remember to write this, it won’t happen reliably — so have the script generate it. This single habit converts most “where did this come from?” questions into reading one file.
2. Stop overwriting; start versioning. The pooled-vs-specific trap is a trap mostly because variants get scattered or clobbered. Give each run an immutable output directory keyed to its manifest, so “run A vs. run B” is a diff, not an act of memory. Tools help here: git for code, a workflow manager (Nextflow, Snakemake) for the DAG, and data-versioning tools such as DVC or git-annex for large inputs. The principle, though, is independent of any of them.
3. Carry context with the number, not in your head. A fold-change shown without its compartment, its contrast, or its sample set is an invitation to misread. The same discipline we argue for in choosing data-driven cutoffs applies to presentation: a value should travel with the context that makes it interpretable, so it can’t be quietly re-used in a framing where it no longer holds.
4. Generate the report from the run, not by hand. The widest gap opens when results are copied — into a slide, a doc, an abstract — and severed from their origin. The closer your report is to being produced by the pipeline (parameterized notebooks, templated reports that pull from the run’s outputs), the smaller the chance a number drifts away from its source.
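Practices 1 and 2 can be sketched with the standard library alone. The following is an illustration under stated assumptions, not a reference implementation: the manifest keys, the function names, and the choice to derive the run directory name from a hash of the manifest are all hypothetical, and parameters are assumed to be JSON-serializable.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path) -> str:
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit() -> str:
    """Commit hash of the current git checkout, or 'unknown' if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_manifest(inputs, sample_ids, parameters, commit=None) -> dict:
    """Describe a run: input checksums, samples used, parameters, code version."""
    return {
        # Keyed by file name; two inputs with the same name would collide.
        "inputs": {Path(p).name: sha256_of(p) for p in inputs},
        "samples": sorted(sample_ids),
        "parameters": parameters,
        "code_commit": commit or current_commit(),
    }


def create_run_dir(runs_root, manifest) -> Path:
    """Create an output directory named after the manifest content; never reuse one."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    run_id = hashlib.sha256(canonical).hexdigest()[:12]
    run_dir = Path(runs_root) / run_id
    run_dir.mkdir(parents=True, exist_ok=False)  # FileExistsError if this run already exists
    # The timestamp is stored but excluded from the run ID, so identical
    # inputs, samples, parameters, and code always map to the same directory.
    record = dict(manifest, created_at=datetime.now(timezone.utc).isoformat())
    (run_dir / "manifest.json").write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_dir


def diff_manifests(a: dict, b: dict, ignore=("created_at",)) -> dict:
    """Top-level manifest keys whose values differ, mapped to (a_value, b_value)."""
    keys = sorted((set(a) | set(b)) - set(ignore))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

With this in place, comparing a pooled run with a compartment-specific run means calling diff_manifests on the two manifests rather than recalling which folder was which. The directories are immutable only by convention: the code refuses to recreate an existing run, but nothing stops someone from editing files by hand.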
Notice none of this is extra analysis. It’s packaging the analysis you already did so it survives contact with a future question.
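Practices 3 and 4 can be illustrated the same way: a value object that cannot be rendered without its context, and report text filled in from a run’s outputs rather than typed by hand. This is a sketch under assumptions. The results.json file, its field names, and the ReportedValue class are hypothetical, and the numbers in the usage example are made up.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ReportedValue:
    """A number bundled with the context needed to interpret and trace it."""
    name: str
    value: float
    compartment: str
    contrast: str
    n_samples: int
    run_id: str

    def cite(self) -> str:
        """Render the value together with its compartment, contrast, n, and run."""
        return (
            f"{self.name} = {self.value:.2f} "
            f"({self.compartment}; {self.contrast}; n = {self.n_samples}; run {self.run_id})"
        )


def load_values(run_dir) -> dict:
    """Read results.json from a run directory and tag each value with the run's name."""
    run_dir = Path(run_dir)
    results = json.loads((run_dir / "results.json").read_text())
    return {r["name"]: ReportedValue(run_id=run_dir.name, **r) for r in results}


def render(template: str, values: dict) -> str:
    """Fill a report template; an unknown placeholder raises KeyError instead of passing silently."""
    return template.format_map({name: v.cite() for name, v in values.items()})


# Illustrative only: the value, contrast, and run ID below are invented.
example = ReportedValue("log2fc", 1.2345, "tumor", "responder vs non-responder", 24, "a1b2c3")
print(render("Fold change: {log2fc}.", {"log2fc": example}))
# Fold change: log2fc = 1.23 (tumor; responder vs non-responder; n = 24; run a1b2c3).
```

Because the template can only be filled from values loaded out of a run, a number that appears in the report always arrives with the run it came from, and a placeholder with no matching result fails loudly instead of being patched in by hand.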
Where the honest trade-offs are
Provenance has a cost, and pretending otherwise is how good intentions die. Full checksumming and immutable run directories add storage and a little friction to every iteration. For a fast exploratory phase — where you’re sketching, not reporting — that overhead can genuinely slow you down, and it’s reasonable to run loose. We’re not arguing that every throwaway plot needs a manifest.
The line we draw is this: the moment a number is headed for something durable (a manuscript, a regulatory document, a decision a client will act on), it needs to be traceable, and retrofitting that later is the expensive path. The cheap path is to have the workflow produce provenance by default, so the “report-grade” runs are already covered and you’re not reconstructing history under deadline. Get the default right and the trade-off mostly disappears: the only overhead left to skip is on exploratory work, where skipping it is cheap, while reporting work, where omitting it is costly, is covered without anyone having to remember. This is the same bottleneck core facilities know well, where the analysis scales faster than our ability to account for it, and provenance is how you keep the accounting from swamping the science.
The point underneath the tooling
You can implement all of this with nothing fancier than disciplined directory conventions and a script that writes a manifest. The reason it so often doesn’t happen is incentives: the analyst is measured on getting the analysis right, and by the time it’s right, provenance feels like overhead someone else will appreciate later. So it gets compressed, then skipped — until a reviewer’s question turns the skip into a billable afternoon of archaeology.
The durable fix is to stop treating provenance as a final-step chore and make it a property of the workflow itself. That’s a large part of why we built Cytogence Atlas: rather than handing clients a drive of folders, we deliver curated findings through a secure portal where the provenance behind a number — which inputs, which parameters, which samples — is a click away rather than a folder hunt. That delivery layer is live now in an invite-only preview for our clients, with the broader self-service analysis suite rolling out behind it. The aim is narrow and concrete: results you can cite, traceable to their source, without anyone emailing a zip file.
But the principle stands with or without the tool. If you’ve ever stared at a value in your own manuscript and couldn’t say which run produced it, that wasn’t a statistics problem. It was a provenance problem — and it’s one you can design out before the reviewer ever asks.
Cytogence is the bioinformatics division of KeyQ, Inc. We build analysis our clients can trust and trace.