Result export is not evidence preservation

The gap nobody names

Embedded test teams have spent years building robust export pipelines. NI TestStand writes ATML XML. OpenTAP streams results into PostgreSQL. A Python script converts CAN traces to CSV. A DOORS bridge links test outcomes to requirements. The export works. The file is written. The database is populated. The status light is green.

But six months later, when an auditor asks which firmware version was on the DUT during test run 2,847, the answer is not in that export file. It is in a different tool, or a Slack message, or the memory of an engineer who has since left.

This is the gap nobody names: export pipelines preserve test results, but they rarely preserve test evidence. A result is a data point. Evidence is the chain of context that lets someone – an auditor, a colleague, a future version of yourself – trust that data point without reconstructing the scene from scratch.

What export preserves vs. what evidence requires

To understand the gap, draw two columns. Column A is what the export format captures. Column B is what an engineer needs six months later to defend this result.

The export pipeline was designed for data interchange – not for preserving context.

Figure 1: Most export pipelines capture data points (left). Few capture the full chain of context an engineer needs to trust a result months later (right). The gap is structural – export formats were designed for data interchange, not evidence preservation.

Public reference points

This is not a critique of any one export mechanism. The gap is easiest to inspect against public documentation: NI describes TestStand XML and ATML reports; OpenTAP documents result listeners for PostgreSQL, SQLite, CSV, and logs, as well as custom listeners; and pytest documents JUnit XML output for CI systems. Those outputs are useful. The evidence question is which setup, identity, version, and review context survives the trip across tools before the export is written.

Three export failure modes

When a test result cannot be trusted six months later, the failure follows one of three patterns.

Failure Mode 1 – The Missing Version Chain. A pass/fail result exists. The firmware revision, sequence file version, and instrument driver version that produced it do not. An engineer returning to this result cannot determine whether the test ran against the correct firmware build, against a known-good sequence revision, or with a calibrated instrument. The chain of versions that gives a result its meaning has been severed.

So what? If a field failure investigation requires reproducing the exact test conditions from an 18-month-old run, the version gap makes the result unusable. The test passed at the time. Today, its provenance is lost.
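One way to keep the version chain attached is to record it in the same structure as the result, at the moment the run is written. The following is a minimal sketch, not a TestStand or OpenTAP API: the `VersionChain` fields, the `build_run_record` helper, and the example version strings are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VersionChain:
    """Versions that give a pass/fail result its meaning (hypothetical field names)."""
    firmware_rev: str
    sequence_file_sha256: str
    instrument_driver_rev: str


def hash_sequence(content: bytes) -> str:
    """Fingerprint the sequence file content so a silent edit changes the record."""
    return hashlib.sha256(content).hexdigest()


def build_run_record(run_id: int, outcome: str, versions: VersionChain) -> dict:
    """Attach the version chain to the result when the run is recorded."""
    missing = [name for name, value in asdict(versions).items() if not value]
    if missing:
        # Refuse to write a result whose provenance is already incomplete.
        raise ValueError(f"run {run_id}: missing version fields: {missing}")
    return {
        "run_id": run_id,
        "outcome": outcome,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "versions": asdict(versions),
    }


record = build_run_record(
    2847,
    "pass",
    VersionChain(
        firmware_rev="fw-1.4.2",  # illustrative value
        sequence_file_sha256=hash_sequence(b"example sequence file content"),
        instrument_driver_rev="drv-3.1.0",  # illustrative value
    ),
)
print(json.dumps(record, indent=2))
```

The design choice worth noting is the refusal to write an incomplete record: a run that cannot state its versions fails loudly on the day, instead of silently six months later.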

Failure Mode 2 – The Disconnected Identity Chain. Results exist. The DUT serial number, instrument asset tags, and calibration certificates are stored in separate systems – a configuration management database updated quarterly, a calibration lab spreadsheet, an asset tracker that the test cell does not query. The export says “measurement = 4.72 V.” It does not say which instrument took that measurement, whether that instrument was in calibration, or which DUT was on the fixture.

So what? If a batch of DUTs is later found to have a manufacturing defect, tracing which tests ran against which units becomes a forensic exercise spanning multiple databases and manual timestamp alignment. In regulated industries, this is the difference between a targeted recall and a blanket one.
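A measurement that carries its own identity chain might look like the sketch below. It assumes the DUT serial number, instrument asset tag, and calibration due date can be obtained when the measurement is taken; the class names, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Instrument:
    asset_tag: str
    cal_due: date  # date the current calibration certificate expires


@dataclass(frozen=True)
class Measurement:
    name: str
    value: float
    unit: str
    dut_serial: str
    instrument: Instrument
    taken_on: date

    def in_calibration(self) -> bool:
        """True if the instrument's calibration was still valid on the measurement date."""
        return self.taken_on <= self.instrument.cal_due


def measurements_for_dut(measurements: list[Measurement], serial: str) -> list[Measurement]:
    """Trace every measurement taken on one unit, without joining other databases."""
    return [m for m in measurements if m.dut_serial == serial]


dmm = Instrument(asset_tag="DMM-0042", cal_due=date(2026, 3, 31))  # illustrative
m = Measurement("rail_voltage", 4.72, "V", "SN-000123", dmm, date(2026, 1, 15))

print(
    f"{m.name} = {m.value} {m.unit} on DUT {m.dut_serial} "
    f"via {m.instrument.asset_tag}, in calibration: {m.in_calibration()}"
)
```

With identity stored next to the value, the recall question ("which tests ran against which units?") becomes a filter over one record set instead of a timestamp alignment across several systems.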

Failure Mode 3 – The Silent Configuration Drift. The test passes today. It passes tomorrow. The configuration – a bus termination setting, a calibration offset applied in a startup script, a fixture wiring change documented in a lab notebook – drifts silently. The export does not capture configuration context, so the drift is invisible in the evidence trail. The result values look consistent. The conditions that produced them have changed.

So what? When a regression appears, the team cannot determine whether the product changed or the test environment changed. Debugging starts from zero.
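Drift becomes visible once each run stores a fingerprint of the configuration it ran under. The sketch below assumes the stand configuration can be collected into a plain dictionary; the setting names and values are illustrative, and hashing canonical JSON is one possible technique, not a method prescribed by any of the tools mentioned here.

```python
import hashlib
import json

_MISSING = object()  # sentinel so an absent key differs from a key set to None


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form of the configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(before: dict, after: dict) -> list[str]:
    """Names of settings that were added, removed, or altered between two runs."""
    keys = before.keys() | after.keys()
    return sorted(
        k for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING)
    )


# Illustrative configurations for two consecutive runs.
monday = {"can_termination_ohm": 120, "cal_offset_mv": 0.0, "fixture_rev": "B"}
tuesday = {"can_termination_ohm": 120, "cal_offset_mv": 1.5, "fixture_rev": "B"}

if config_fingerprint(monday) != config_fingerprint(tuesday):
    print("configuration drift detected:", changed_keys(monday, tuesday))
```

Storing only the fingerprint with each result is enough to answer "did the environment change?"; storing the full configuration as well answers "what changed?".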

Mode	What Goes Missing	Consequence
Missing Version Chain	Firmware rev, sequence version, driver rev	Cannot reproduce test conditions
Disconnected Identity	DUT S/N, instrument IDs, cal dates	Cannot trace defect to unit
Silent Configuration Drift	Bus termination, cal offsets, wiring changes	Regression debugging from zero

Table: The three failure modes. Each one is silent in the export file – the result looks fine. The context that made it trustworthy has evaporated.

Why tools weren’t designed to fix this

No vendor is at fault here. Each tool was designed to own its layer, and each does that well. LabVIEW and TestStand own test execution and sequencing. Vector tools own bus traffic capture. OpenTAP owns plugin-based test automation. DOORS owns requirements traceability.

But the evidence chain crosses all of these layers. It starts at the DUT, passes through instrumentation, sequencing, data processing, and lands in the export file. Each tool captures its own slice. No tool captures the handoffs between slices.

Figure 2: The six-layer test stack. Tools own vertical slices – nobody owns the horizontal handoffs. When evidence reaches the export, the context from layers 0-2 is often already lost because the handoffs were manual. The auditor sees a boolean.

This is the six-layer test stack: L0 Device, L1 Instrumentation, L2 Stand Configuration, L3 Orchestration, L4 Data Processing, L5 Export/Evidence. Tools own vertical slices. The handoffs – the arrows between layers – are collectively unowned. The export format sits at the boundary between L4 and L5. It inherits whatever context was carried across the previous four handoffs. If those handoffs were manual – and in most labs, they are – the export inherits gaps.
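The inheritance described above can be modeled in a few lines. This is a toy sketch under stated assumptions: the context keys each layer originates and the keys each handoff carries are invented for illustration, and real stacks will differ.

```python
LAYERS = [
    "L0 Device", "L1 Instrumentation", "L2 Stand Configuration",
    "L3 Orchestration", "L4 Data Processing", "L5 Export/Evidence",
]

# Context each layer originates (illustrative keys, not a standard schema).
ORIGINATES = {
    0: {"dut_serial", "firmware_rev"},
    1: {"instrument_asset_tag", "cal_due"},
    2: {"pin_mapping", "bus_termination"},
    3: {"sequence_rev"},
    4: {"measurement", "outcome"},
}


def surviving_context(carried: list[set[str]]) -> set[str]:
    """carried[i] holds the keys that handoff i (layer i to i+1) passes machine-readably."""
    context: set[str] = set()
    for layer, passes in enumerate(carried):
        context |= ORIGINATES.get(layer, set())
        context &= passes  # anything the handoff does not carry is dropped here
    return context


# Five handoffs for six layers; most context is carried by hand and so is lost.
CARRIED = [
    {"dut_serial"},                             # L0 -> L1
    {"dut_serial", "instrument_asset_tag"},     # L1 -> L2
    {"dut_serial", "instrument_asset_tag"},     # L2 -> L3
    {"dut_serial", "sequence_rev"},             # L3 -> L4
    {"dut_serial", "measurement", "outcome"},   # L4 -> L5
]

everything = set().union(*ORIGINATES.values())
survived = surviving_context(CARRIED)
print("reaches the export:", sorted(survived))
print("lost on the way:   ", sorted(everything - survived))
```

The model makes one point: a key dropped at any handoff cannot be recovered downstream, however good the export format is.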

Map before you fix

Before buying a platform, upgrading an export format, or writing more glue scripts, map what you already have. The diagnostic takes twenty minutes and requires no new tools.

Pick one workflow – the one where evidence assembly causes the most pain before audits. List every tool that touches the data, in order. Write down what each tool produces and what it passes to the next. Then, for each handoff, ask: is this context machine-readable, or does it depend on a person doing the right thing?

Step	Question	If No
1	Is firmware version captured per run?	Version Gap – see Failure Mode 1
2	Is DUT serial number linked to results?	Identity Gap – see Failure Mode 2
3	Is calibration status queried automatically?	Configuration Drift – see Failure Mode 3
4	Are tool versions logged per run?	Version Gap
5	Is the handoff chain traceable without a person?	Structural Gap – consider diagnostic

Table: Five questions to ask about one workflow’s export pipeline. Person-dependent handoffs at any of these points mean context is silently lost.

Machine-readable means a structured log, an API response, a database record, a version-tagged artifact – something a script could consume without human interpretation. Person-dependent means a typed-in serial number, a calibration sticker read visually, a configuration setting documented in a separate file that may or may not be current.
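The five questions in the table above can be kept as a small checklist and run against one workflow. This is a sketch only: the `find_gaps` helper is a hypothetical name, and a "yes" should be recorded only where the context is machine-readable in the sense just described.

```python
# (question, gap reported when the answer is "no"), taken from the table above.
CHECKLIST = [
    ("Is firmware version captured per run?", "Version Gap (Failure Mode 1)"),
    ("Is DUT serial number linked to results?", "Identity Gap (Failure Mode 2)"),
    ("Is calibration status queried automatically?", "Configuration Drift (Failure Mode 3)"),
    ("Are tool versions logged per run?", "Version Gap"),
    ("Is the handoff chain traceable without a person?", "Structural Gap"),
]


def find_gaps(answers: list[bool]) -> list[tuple[int, str]]:
    """Return (step number, gap) for every question answered 'no'."""
    if len(answers) != len(CHECKLIST):
        raise ValueError(f"expected {len(CHECKLIST)} answers, got {len(answers)}")
    return [
        (step, gap)
        for step, ((_question, gap), ok) in enumerate(zip(CHECKLIST, answers), start=1)
        if not ok
    ]


# Example: calibration is read off a sticker, and one handoff depends on a person.
answers = [True, True, False, True, False]
for step, gap in find_gaps(answers):
    print(f"step {step}: {gap}")
```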

The exercise often reveals that the export format itself is not the weakest link. The weakest link is usually two handoffs upstream, where configuration context was never captured in the first place.

If the workflow is tied to aerospace software verification, the DO-178C test evidence diagnostic asks the same narrow question: which evidence, version, and review context survives one workflow? It stops short of certification advice.

When gaps are structural

Some gaps you find will be within-layer – a script that could capture firmware version automatically, an instrument that can be queried for calibration status. Fix those yourself. They cost attention, not capital.

Other gaps are cross-layer. When the handoff between stand configuration and orchestration has never been formalized – when the sequence file assumes a pin mapping that lives only in a lab notebook – no single-tool improvement closes that gap. The gap is structural. It exists because the tools were designed to own layers, not handoffs.

This is where an outside diagnostic becomes useful. The diagnostic covers one workflow, produces a mapped evidence chain with the gaps identified, and scopes what would be required to close them. It does not involve replacing any existing tool. It does not claim compliance automation or certification. It provides a map – a clear picture of where the chain breaks and what it would take to fix it.

Map one workflow. Find the gaps. No pitch.

If your team faces an audit, a certification deadline, or a recurring forensic scramble to reassemble test provenance, start with a diagnostic. One to two weeks. One workflow. A map of what is captured, what is lost, and what it would cost to close the gap. No tool replacement. No compliance guarantees. Just a lens you can use whether or not we ever talk again.

Marcin June 11, 2026

Understand the gap between exporting embedded test results and preserving engineering evidence.

Your export pipeline is lying to you. Not about the result – about the evidence.
