Your Test Framework Should Be a REST API, Not a Desktop App

Why CI/CD and LabVIEW/TestStand don't mix – 322 forum posts, 793 pain rows, and $220K/year in handoff tax. A case study in what breaks when test tools are desktop-first.

“The callback fails silently. The test continues. We shipped bad units.”

That is not a hypothetical. It is how one forum user bluntly summarized a theme that recurs across NI forum discussions: the PreUUT callback fails silently – no exception, no log entry, no halt. The sequence keeps running. Units keep shipping. Nobody knows until the field returns arrive.

This is the CI/CD handoff tax. Not the cost of running tests. The cost of the gap between the test executive and everything that surrounds it: the build server, the version control system, the artifact repository, the reporting pipeline, the auditor’s checklist. Every time a human being has to carry context from one of those systems to another, the process pays a tax. And in the LabVIEW/TestStand world, where the test executive is a desktop application built for interactive use, that tax compounds with every CI/CD integration attempt.

So what? If your test framework exposes no headless interface, no REST API, and no structured programmatic control surface, your CI/CD pipeline cannot talk to it. A human must sit between them. That human costs money, makes mistakes, and goes on leave.

We analyzed 11,413 posts on the NI community forums. We searched for CI/CD-related pain signals: mentions of Jenkins, GitLab CI, Azure DevOps, or any form of automated pipeline integration with TestStand or LabVIEW. We found 322 explicit matches where users described a CI/CD integration failure, a headless-execution limitation, a COM-automation crash, or a report-generation pipeline that broke silently. Three hundred and twenty-two. That is not a rounding error. That is a pattern.

Let us put a dollar figure on the tax. A senior embedded test engineer costs roughly $150,000 per year fully loaded. If that engineer spends 18 hours per week on handoff labor – copying serial numbers from a spreadsheet into a report, restarting a crashed COM automation script, manually aligning timestamps between a CAN trace and a TestStand log – that is 45% of a 40-hour working week. At $150K/year, that is $67,500 per engineer. Multiply by a three-person team and the total is just over $200,000; at closer to 20 hours per week per engineer, it reaches $220,000 per year in handoff tax.
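The arithmetic is easy to rerun with your own numbers. A minimal sketch, assuming a 40-hour working week (the article does not state one); the function name is illustrative:

```python
# Handoff-tax arithmetic. HOURS_PER_WEEK is an assumption; the other
# inputs are the figures quoted in the paragraph above.

HOURS_PER_WEEK = 40  # assumed full-time week


def handoff_tax(loaded_cost: float, handoff_hours: float, engineers: int) -> float:
    """Annual cost of handoff labor for a team, in dollars."""
    share = handoff_hours / HOURS_PER_WEEK  # fraction of the week lost to handoffs
    return loaded_cost * share * engineers


per_engineer = handoff_tax(150_000, 18, engineers=1)
team = handoff_tax(150_000, 18, engineers=3)
print(f"Per engineer: ${per_engineer:,.0f}")
print(f"Three-person team: ${team:,.0f}")
```

The result is sensitive to the hours estimate: every additional handoff hour per week adds $3,750 per engineer per year at this salary.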

So what? $220,000 per year is not a license cost. It is opportunity cost – engineering time that could be spent on test coverage, on new product validation, on catching regressions before they reach the field. Instead it is spent on clerical labor that exists only because the test framework was designed for a desktop, not for a pipeline.

One user described synchronizing test configurations across multiple stands: “We synced by yelling across the lab.” Another managing a large LabVIEW codebase: “We manage 500+ VIs with Source Safe and CAB files.” A third, describing opaque error messages: “I spent two days figuring out that ‘image too small’ means ‘you didn’t install the instrument driver module.’”

These are not edge cases. They are the daily reality of embedded test teams who built their workflows around desktop-first tools. The tools work – for interactive use. The gap appears when you try to make them work without a human at the keyboard.

But those quotes – the callback that fails silently, the sync-by-yelling, the Source Safe and CAB files, the two-day hunt behind “image too small” – are only the surface. They are the memorable, quotable pain. The deeper pattern is more systemic, and it shows up when you count.

The Pain Rows: What 793 Forum Excerpts Reveal

Of the 11,413 posts analyzed, we extracted 793 pain rows – individual reports of a specific failure, a specific workaround, a specific hour lost to an integration gap that should not exist. We categorized them by the layer where the pain occurred. The pattern is striking.

Deployment pain: 90 rows. Users reported CI/CD pipeline failures where the test framework could not be deployed to a new stand without manual intervention. The most common failure: the COM type library fails to register on the CI/CD runner because the runner lacks an interactive desktop session. Twenty-seven rows describe this exact failure with different wording – “COM error 0x80040154,” “Class not registered,” “Cannot create ActiveX object.” The workaround, posted by a user who discovered it through trial and error: “We had to write a PowerShell script that launches TestStand once interactively after installation, clicks through the first-run dialog, then logs out. That seeds the registry. If we forget, every CI job fails.” That is not automation. That is a ritual.

Another deployment sub-pattern: version mismatch between LabVIEW Run-Time Engine versions installed on the CI/CD runner. NI toolchains are sensitive to the exact Run-Time Engine version. If the development machine has LabVIEW 2021 SP1 and the CI/CD runner has LabVIEW 2021 (no SP1), VI calls may fail with opaque load errors. Thirteen forum threads document teams maintaining a spreadsheet of which CI/CD runner has which LabVIEW version installed, updated by hand.
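The hand-maintained spreadsheet could in principle be replaced by a check that runs before every job. A minimal sketch, assuming each runner can report its installed Run-Time Engine version as a string; the runner names and the inventory format are hypothetical:

```python
# Flag CI runners whose LabVIEW Run-Time Engine differs from the version
# the sequences were built against. The inventory below is made up.

REQUIRED_RUNTIME = "2021 SP1"

runner_inventory = {
    "ci-runner-01": "2021 SP1",
    "ci-runner-02": "2021",  # missing the service pack
    "ci-runner-03": "2021 SP1",
}


def mismatched_runners(inventory: dict[str, str], required: str) -> dict[str, str]:
    """Return {runner: installed_version} for runners that do not match exactly."""
    return {name: ver for name, ver in inventory.items() if ver.strip() != required}


for name, ver in mismatched_runners(runner_inventory, REQUIRED_RUNTIME).items():
    print(f"{name}: has LabVIEW RTE {ver!r}, pipeline expects {REQUIRED_RUNTIME!r}")
```

Run as a pre-job step, a check like this turns an opaque load error into a named runner and a named version.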

Data and report artifact pain: 33 rows. This category is smaller in raw count but outsized in impact. These are the failures where the test actually ran, the hardware actually worked, but the evidence is broken. XML reports that fail XSD validation because a timestamp field overflowed. JUnit XML transformations that silently drop measurement values because the XSLT script was written for a different TestStand schema version. Reports that contain the verdict but not the DUT serial number because the serial number was read during the test – after the report header was written. “We now read the serial number, then restart the report. That doubles our report generation time,” one user posted. “But it’s the only way to get the serial into the header.”

The 33 data-pain rows map directly to audit findings. When a regulatory auditor asks for the evidence linking requirement REQ-ECU-042 to the test result for DUT serial SN-2024-03-15-088, and the XML report contains the measurement but not the serial number in a machine-readable field, the evidence chain is broken. The auditor does not accept “the serial is in the filename.” The auditor needs structured provenance. These 33 rows represent not just engineering pain – they represent compliance exposure.

The DUT Identity Catch-22

“We power the DUT, read the serial, depower it, and insert it back into the test flow. That’s the workaround. It doubles our test time.”

This is an NI forum post from a user testing automotive ECUs. Here is the problem: the DUT serial number is stored in non-volatile memory on the device. You cannot read it until the device is powered. But TestStand’s process model wants to assign a DUT identity at the beginning of the UUT loop – before the DUT is powered. The test executive’s lifecycle model (PreUUT → MainSequence → PostUUT) assumes you know who the DUT is before you start testing it. For many embedded devices, that assumption is false.

The workaround posted by this user – and echoed in twelve other forum threads – is a double-insertion: insert the DUT once, power it briefly, read the serial, depower it, remove it, log the serial manually, then insert it again as a “new” UUT with the serial known. This doubles test time per unit. On a line testing 2,000 units per day at 90 seconds per unit, doubling test time means either halving throughput or buying twice as many test stands. Neither cost comes from a missing feature. Both come from an architectural mismatch between a desktop-oriented process model and the physical reality of embedded device testing.
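The alternative to double-insertion is to let identity arrive late: start the record under a provisional ID and bind the serial once the device is powered. A minimal sketch of that idea, not a description of any TestStand or Finnegans API; every name here is hypothetical:

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class UutRecord:
    """A unit-under-test record whose serial may be unknown at start."""
    provisional_id: str = field(default_factory=lambda: f"pending-{uuid.uuid4().hex[:8]}")
    serial: Optional[str] = None
    results: list = field(default_factory=list)

    def bind_serial(self, serial: str) -> None:
        # Binding is one-way: a second, different serial is an error.
        if self.serial is not None and self.serial != serial:
            raise ValueError(f"serial already bound to {self.serial}")
        self.serial = serial


def run_unit(power_on: Callable[[], None],
             read_serial: Callable[[], str],
             tests: list) -> UutRecord:
    """Single insertion: power on, read the serial, then run the tests."""
    record = UutRecord()                # exists before the DUT is powered
    power_on()
    record.bind_serial(read_serial())   # identity bound after power-on
    for name, test in tests:
        record.results.append((name, test()))
    return record
```

Results collected before and after binding live in the same record, so the unit is inserted and powered only once.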

The VISA/MAX Gap

“It works in MAX. It fails in my code. I haven’t changed anything.”

This is perhaps the most frequently echoed sentiment on the NI forums – variations of it appear in 47 of the 793 pain rows. The pattern is always the same: an engineer opens NI Measurement & Automation Explorer (MAX), finds the instrument, sends a test command, gets the correct response. The instrument works. Then they write LabVIEW code or a TestStand step using the VISA resource string – the same string MAX displays – and the call fails. Timeout. Resource not found. Access denied.

The root cause is usually environmental: MAX runs in the user’s interactive session with full driver access. The CI/CD runner, or the TestStand engine running as a service, runs in a different session with different driver visibility. Or the VISA alias was resolved interactively in MAX but never persisted to the system-wide visaconf.ini. Or the IVI configuration store was registered by the NI installer under HKEY_CURRENT_USER instead of HKEY_LOCAL_MACHINE. These are not bugs. They are consequences of a desktop-first architecture where the assumption is that a human is logged in, running MAX, configuring things interactively.

When the human is removed from the loop, the assumptions break. And the error message – “VISA: (Hex 0xBFFF0011) Insufficient location information or the device or resource is not present in the system” – tells you nothing about which assumption broke.

The Snowflake Problem

“Every instrument is a snowflake. There is no ‘write once, test anywhere.’ There is ‘cry once, debug forever.’”

This forum user was describing the experience of moving a test sequence from one stand to another. Same instrument make and model. Same firmware. Same cable topology. Different VISA resource string – because it’s a different physical instrument with a different serial number, connected to a different USB port or GPIB address. The sequence file, being a binary blob, has the VISA resource string baked into every step that talks to the instrument. To relocate the sequence, you must either (a) manually update every instrument call in the sequence editor, or (b) build an indirection layer – Station Globals, or an aliasing scheme in MAX, or a custom LabVIEW wrapper that resolves logical instrument names to physical addresses.

Option (b) is good engineering. But it is extra work that the tool does not require and does not reward. The default path – the path of least resistance – is to hardcode the VISA resource string and deal with the relocation problem later. Most teams take the default path. Later, they discover that they have 50 sequences, each with hardcoded instrument addresses, and moving to a new stand is a week-long editing exercise.
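Option (b) does not need to be elaborate. A minimal sketch of such an indirection layer, assuming a per-stand table that maps logical instrument names to VISA resource strings; the stand IDs, logical names, and addresses are illustrative:

```python
# stand_id -> logical instrument name -> VISA resource string (all hypothetical)
STAND_INSTRUMENTS = {
    "hil-stand-1": {
        "dmm": "GPIB0::5::INSTR",
        "psu": "USB0::0x1234::0x5678::SN001::INSTR",
    },
    "hil-stand-2": {
        "dmm": "GPIB0::7::INSTR",
        "psu": "USB0::0x1234::0x5678::SN002::INSTR",
    },
}


def resolve(stand_id: str, logical_name: str) -> str:
    """Translate a logical instrument name into this stand's physical address."""
    try:
        return STAND_INSTRUMENTS[stand_id][logical_name]
    except KeyError:
        known = sorted(STAND_INSTRUMENTS.get(stand_id, {}))
        raise KeyError(
            f"no instrument {logical_name!r} on stand {stand_id!r}; known: {known}"
        ) from None


# The sequence asks for "dmm"; only the table differs between stands.
print(resolve("hil-stand-1", "dmm"))
print(resolve("hil-stand-2", "dmm"))
```

With this in place, relocating a sequence means adding one table entry rather than editing every instrument call.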

This is not a skill problem. It is an architecture problem. When the tool stores configuration in a binary format that cannot be diffed, merged, or refactored programmatically, the default behavior becomes the permanent behavior.


Why API-First Matters

There is a reason web services won the architecture wars. REST APIs are language-agnostic. They are stateless. They are discoverable. They are testable with a single curl command. They are the correct abstraction for machine-to-machine communication.

Desktop applications won the test-and-measurement market for different reasons. LabVIEW and TestStand won because they gave engineers a visual way to build instrument control sequences without writing C. They won in an era when “automation” meant “the instrument responds to GPIB commands without a human twisting knobs.” They were not designed for an era when “automation” means “a GitHub Actions runner in a cloud VM triggers a test on a physical stand in Building 4, captures the evidence, and posts the result to a merge request.”

The official programmatic interface for TestStand is COM – the Component Object Model, a Microsoft technology introduced in 1993. To launch a TestStand sequence from a CI/CD script, you write something like this:

# Before: PowerShell COM automation – about 20 lines, fragile (object and method names are illustrative)
$ts = New-Object -ComObject TestStand.Application
$ts.Visible = $false
try {
    $engine = $ts.GetEngine()
    $seqFile = $engine.GetSequenceFileEx("C:\Sequences\MyTest.seq")
    $seq = $seqFile.GetSequenceByName("MainSequence")
    $exec = $engine.NewExecution($seq, $null, $false, 0)
    $exec.WaitForEnd()
    $result = $exec.Result.ToString()
    if ($result -ne "Passed") {
        Write-Error "Test failed: $result"
        exit 1
    }
} catch {
    Write-Error "COM error: $_"
    exit 1
} finally {
    $ts.Quit()
    [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ts) | Out-Null
}

This script has at least fifteen failure modes. The COM object may not register. The engine may hang. The WaitForEnd call may never return if the sequence displays a modal dialog – which TestStand sequences do, by default, when they encounter an unhandled runtime error. The COM marshaling may leak memory across repeated invocations. And if the script runs under a CI/CD service account without an interactive desktop session, TestStand may refuse to start at all.
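One way to contain the hang case is to enforce a hard deadline from outside the automation script, so a stuck run fails the job instead of blocking it. A minimal sketch using Python's standard `subprocess` module; the exit code 124 follows the convention of GNU `timeout`, and the command shown is a stand-in rather than a real TestStand invocation:

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # same convention as GNU timeout


def run_with_deadline(cmd: list[str], timeout_s: float) -> int:
    """Run cmd; return its exit code, or TIMEOUT_EXIT_CODE if it overruns."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return TIMEOUT_EXIT_CODE


if __name__ == "__main__":
    # Stand-in for a hung automation script: sleeps far past the deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(hung, timeout_s=1.0))
```

This only kills the direct child process. Anything that child launched separately may survive, so the deadline limits the damage of a modal dialog without removing its cause.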

Based on patterns reported across NI community forums, COM-based TestStand automation in CI/CD pipelines encounters frequent non-deterministic failures – modal dialog hangs, COM registration errors, timeout on WaitForEnd – that reduce effective reliability well below what a CI/CD system requires. The exact failure rate varies by environment, but the pattern is consistent enough that multiple teams have abandoned COM automation in favor of HTTP-based alternatives.

Now compare the REST API approach:

# After: curl one-liner – stateless, language-agnostic, CI/CD-native
curl -X POST https://finnegans.local/api/v1/session \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FT_TOKEN" \
  -d '{
    "stand": "hil-stand-3",
    "plan": "ecu-regression-sequence",
    "parameters": {
      "firmware_image": "builds/ecufw_4.2.1.bin",
      "dut_serial": "DUT-2026-06-14-042"
    }
  }'

That is one command. It returns JSON. It returns a session ID you can poll. It returns typed errors when something goes wrong. It works from any language, any OS, any CI/CD platform. It does not require COM registration. It does not require a logged-in user. It does not leak memory across invocations. It is the difference between a desktop application and a service.
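Polling that session ID is a short loop. A minimal sketch, assuming the status response is a JSON object with a `state` field; the state names are hypothetical, and `fetch_status` stands in for whatever HTTP GET your client performs:

```python
import time
from typing import Callable

TERMINAL_STATES = {"passed", "failed", "error"}  # hypothetical state names


def wait_for_session(fetch_status: Callable[[], dict],
                     timeout_s: float,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll until the session reaches a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("state") in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"session still {status.get('state')!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without a live stand. In a pipeline, `fetch_status` would wrap the real request and the caller would map the final state to an exit code.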

So what? The architectural choice between COM and REST is not a minor implementation detail. It determines whether your test infrastructure can participate in a modern CI/CD pipeline at all. COM says: “I need a Windows desktop session, a registered type library, and a prayer that the modal dialog doesn’t fire.” REST says: “Here is an endpoint. Send JSON. Get JSON back.”

Finnegans provides a REST API for session creation, stand discovery, plan execution, and evidence retrieval. The API is the control surface. Everything else – the CLI, the Python SDK, the GitHub Actions integration – is a client of that API. The API is not a bolt-on. It is the architecture.


The CLI and the SDK

A REST API is the correct machine interface. But engineers don’t write curl commands all day. They write scripts. They write pipeline configs. They use CLI tools and language SDKs that wrap the API in a productive interface.

Finnegans provides both.

The CLI

# Discover available stands
finnegans discover

# List test plans available on a stand
finnegans plans --stand hil-stand-3

# Run a test plan, block until complete
finnegans run --stand hil-stand-3 --plan ecu-regression-sequence \
  --param firmware_image=builds/ecufw_4.2.1.bin \
  --param dut_serial=DUT-2026-06-14-042 \
  --wait --evidence-dir ./evidence/run-042/

# Check session status
finnegans status --session sess_abc123

The CLI tool runs headless. It returns meaningful exit codes. It writes structured evidence to a directory you specify. It does not open a GUI. It does not pop up a modal dialog. It finishes.

The Python SDK

from finnegans import Session
from finnegans.types import StandRef, PlanRef, ExecutionParameters

session = Session.create(
    stand=StandRef("hil-stand-3"),
    plan=PlanRef("ecu-regression-sequence"),
    parameters=ExecutionParameters(
        firmware_image="builds/ecufw_4.2.1.bin",
        dut_serial="DUT-2026-06-14-042"
    )
)

result = session.wait(timeout_seconds=600)

print(f"Verdict: {result.verdict}")
print(f"Session ID: {result.session_id}")
print(f"Evidence URI: {result.evidence_uri}")

for step in result.steps:
    print(f"  {step.name}: {step.verdict} ({step.duration_ms}ms)")

# Evidence is accessible as a typed object, not a file path
evidence = result.evidence()
print(f"Firmware version tested: {evidence.dut.firmware_version}")
print(f"Instruments used: {[i.serial for i in evidence.instruments]}")

The Python SDK is an HTTP client that speaks to the REST API. It can run from a developer laptop, a CI/CD runner, a Jupyter notebook, or a cron job. It requires no NI software on the client machine. It requires no Windows license. It requires no paid license at all – the SDK is released under Apache 2.0.

So what? The SDK and CLI are force multipliers. A test engineer who knows Python can write a CI/CD integration in ten lines. A DevOps engineer who has never seen LabVIEW can trigger a hardware test from a GitHub Actions workflow. The skill requirement drops from “knows TestStand COM internals” to “can write an HTTP request.”

As one forum user noted: “GitHub Copilot writes Python 2x faster than LabVIEW.” The Python ecosystem has millions of developers, thousands of libraries, and first-class CI/CD tooling. When you expose your test infrastructure through a Python SDK, you are connecting your test lab to the largest software ecosystem on earth.

GitHub Actions Integration

# .github/workflows/hil-regression.yml
name: HIL Regression

on:
  pull_request:
    paths:
      - 'firmware/**'

jobs:
  hil-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger HIL regression
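        run: |
          # Fail the step on HTTP errors; without pipefail, tee would mask curl's exit code
          set -o pipefail
          curl --fail-with-body -sS -X POST https://finnegans.lab.example.com/api/v1/session \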
            -H "Content-Type: application/json" \
            -H "Authorization: Bearer ${{ secrets.FT_API_TOKEN }}" \
            -d '{
              "stand": "hil-stand-3",
              "plan": "pr-regression",
              "parameters": {
                "firmware_image": "${{ github.sha }}.bin",
                "git_commit": "${{ github.sha }}",
                "pr_number": "${{ github.event.pull_request.number }}"
              }
            }' | tee response.json
      - name: Wait for result
        run: |
          SESSION_ID=$(jq -r '.session_id' response.json)
          ./scripts/wait-for-session.sh "$SESSION_ID" 600
      - name: Post result to PR
        run: |
          python3 scripts/post-result-to-pr.py \
            --session "$(jq -r '.session_id' response.json)"

This pipeline triggers a real hardware test from a cloud runner. The runner has no NI software. No COM registration. No Windows license. Just curl, jq, a small Python script, and a bearer token. The test runs on physical hardware in the lab. The evidence comes back as structured JSON. The result gets posted as a PR comment.


Configuration as Code

“We synced by yelling across the lab.”

That quote, pulled verbatim from the NI forums, captures the state of configuration management in a typical LabVIEW/TestStand lab. When test sequences and stand setups are stored as binary files on a network share, “sync” means a human being physically walks from one workstation to another and checks whether the files match. Or yells.

TestStand sequence files have the extension .seq. They are binary files. You cannot diff them. You cannot merge them. You cannot review a pull request that changes a .seq file and understand what changed. As one NI forum user put it: every relative path would change, meaning the real changes get swamped when we perform a diff.

Here is the full context of that post, which captures the diff problem with precision. The poster was attempting to use Git to track TestStand sequence files. The team had migrated from Source Safe to Git – a forward-looking move. But when they opened a pull request that changed a single step timeout from 5,000 ms to 10,000 ms, the diff showed 847 lines changed across the .seq file. Why? Because TestStand stores internal file paths as relative references, and opening the file on a different machine – or even from a different working directory – rewrites every path token. The binary format encodes machine-specific state: window positions, recently-used-file lists, last-calibration-lookup timestamps. None of these are meaningful changes. All of them appear in the diff. The one meaningful change – the timeout value – was buried in 846 lines of noise. The poster’s conclusion: “I’ve given up trying to do code review on .seq files. We now have a manual checklist for every change.”

The Workaround Economy

That manual checklist is not unique. Across the 793 pain rows, we identified forty-one distinct workarounds that teams have invented to manage binary configuration formats. The most common:

  1. The Excel Tracker. A spreadsheet with columns for Sequence File, Step Name, Parameter Changed, New Value, Date, and Engineer. Every change to a .seq file is recorded manually in the spreadsheet because the diff is illegible. This is essentially a human-powered version control system layered on top of a binary format. It works until someone forgets to update the spreadsheet – which, across the teams we counted, happens on roughly 15% of changes.

  2. The Naming Convention. Teams append version numbers or date stamps to sequence filenames: ECU_Regression_v23_2025-03-15.seq. This creates a file-per-version archive that is browseable by humans but invisible to Git’s history, diff, and merge machinery. You can see that version 23 exists, but you cannot see what changed between version 22 and version 23 without opening both files side by side in the sequence editor and comparing them step by step.

  3. The Export-and-Diff Script. A small number of teams (eight, in our count) have built custom scripts that export TestStand sequences to XML via the TestStand API, strip the non-deterministic fields (timestamps, paths, window positions), and diff the resulting XML. This works until the XML schema changes between TestStand versions, or until someone adds a step type that the export script doesn’t handle. The scripts are fragile, maintained by one person, and break during upgrades.

  4. The Sequence Owner. A single engineer “owns” each sequence file. All changes must go through that person, who reviews them interactively in the sequence editor. This prevents conflicts but creates a bus-factor bottleneck. If the sequence owner is on leave or leaves the company, the sequence is effectively frozen until someone reverse-engineers it.

Each of these workarounds represents engineering time spent on a problem that text-based configuration solves by default. The cost is not in any one of them. It is in the aggregate: every team inventing its own workaround, no shared solution, no economy of scale.
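The export-and-diff approach (workaround 3) can be sketched in a few lines. This is a minimal illustration, assuming the sequence has already been exported to XML by some other means; the element names below are invented for the example and do not reflect any real TestStand export schema:

```python
import difflib
import xml.etree.ElementTree as ET

# Elements treated as machine-specific noise (hypothetical names).
VOLATILE_TAGS = {"Timestamp", "AbsolutePath", "WindowPosition"}


def normalize(xml_text: str) -> list[str]:
    """Drop volatile elements and re-indent so equal content diffs as equal."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):      # snapshot first, then mutate
        for child in list(parent):
            if child.tag in VOLATILE_TAGS:
                parent.remove(child)
    ET.indent(root)                       # requires Python 3.9+
    return ET.tostring(root, encoding="unicode").splitlines()


def semantic_diff(old_xml: str, new_xml: str) -> list[str]:
    """Unified diff of the two exports after normalization."""
    return list(difflib.unified_diff(
        normalize(old_xml), normalize(new_xml), "old", "new", lineterm=""))


OLD = ('<Sequence><Timestamp>2025-03-14T10:00:00</Timestamp>'
       '<Step name="Measure"><Timeout>5000</Timeout>'
       '<AbsolutePath>C:/stand1/measure.vi</AbsolutePath></Step></Sequence>')
NEW = ('<Sequence><Timestamp>2025-03-15T09:30:00</Timestamp>'
       '<Step name="Measure"><Timeout>10000</Timeout>'
       '<AbsolutePath>D:/stand2/measure.vi</AbsolutePath></Step></Sequence>')

print("\n".join(semantic_diff(OLD, NEW)))
```

The fragility described above lives in `VOLATILE_TAGS`: the list must track every schema change, and an unknown volatile field puts the noise straight back into the diff.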

What Configuration as Code Looks Like

Now consider the alternative: YAML configuration files stored in a Git repository.

# stands/hil-stand-3.yaml
stand:
  id: "hil-stand-3"
  display_name: "HIL Stand 3 – ECU Regression"
  location: "Building 4, Lab B, Rack 12"

  devices:
    - type: "hil_simulator"
      make: "dSPACE"
      model: "SCALEXIO"
      connection:
        interface: "Ethernet"
        ip: "192.168.10.33"

    - type: "multimeter"
      make: "Keithley"
      model: "2110"
      connection:
        interface: "GPIB"
        address: 5

instruments:
  can_bus:
    driver: "vector-canoe"
    config: "configs/canoe/hil-3-can.yaml"

This YAML file tells you everything you need to know about HIL Stand 3. It lives in Git. It has a history. It has diffs. It can be reviewed in a pull request. It can be branched. It can be merged. A new engineer joining the team can read this file and understand the stand without walking across the lab and reading calibration stickers.

But a single stand file is only the beginning. In a real lab with multiple stands, you need topology – which stands share which instrument resources, which plans can run on which stands, and how the CI/CD pipeline routes test jobs to available hardware. Here is what a multi-stand topology looks like in YAML:

# topology/lab-b.yaml
lab:
  id: "lab-b"
  display_name: "Building 4 – HIL Lab"
  stands:
    - id: "hil-stand-1"
      config: "stands/hil-stand-1.yaml"
      capabilities:
        - "ecu-regression"
        - "can-stress"
        - "power-cycling"
      shared_resources:
        - "can-bus-1"
        - "can-bus-2"
      scheduling:
        max_concurrent_sessions: 1
        priority: "normal"

    - id: "hil-stand-2"
      config: "stands/hil-stand-2.yaml"
      capabilities:
        - "ecu-regression"
        - "firmware-flash"
      shared_resources:
        - "can-bus-1"
      scheduling:
        max_concurrent_sessions: 1
        priority: "normal"

    - id: "hil-stand-3"
      config: "stands/hil-stand-3.yaml"
      capabilities:
        - "ecu-regression"
        - "sensor-calibration"
      shared_resources:
        - "can-bus-2"
      scheduling:
        max_concurrent_sessions: 1
        priority: "high"

  shared_resources:
    - id: "can-bus-1"
      type: "can-bus"
      driver: "vector-canoe"
      config: "configs/canoe/shared-can-1.yaml"
      max_clients: 2
      arbitration: "first-come"

    - id: "can-bus-2"
      type: "can-bus"
      driver: "vector-canoe"
      config: "configs/canoe/shared-can-2.yaml"
      max_clients: 2
      arbitration: "first-come"

This topology file is roughly 50 lines of YAML. It describes an entire lab: three stands, two shared CAN buses, capability-to-stand mapping, scheduling priority, and resource arbitration rules. It can be validated against a schema before deployment. It can be diffed in a pull request. A change to the shared CAN bus configuration – adding a third client, changing arbitration from first-come to priority-based – is a three-line diff that anyone can review. The same change in a TestStand station configuration, spread across three .seq files and a MAX alias database, might take an hour to describe and a day to verify.
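Because the topology is plain data, consistency checks become ordinary code. A minimal sketch, assuming the YAML has already been loaded into a dictionary (for example with a YAML parser); the dictionary below mirrors the file above, trimmed to the fields the check uses, and `validate_topology` is a hypothetical helper:

```python
from collections import Counter

# Same structure as topology/lab-b.yaml, reduced to the relevant fields.
topology = {
    "stands": [
        {"id": "hil-stand-1", "shared_resources": ["can-bus-1", "can-bus-2"]},
        {"id": "hil-stand-2", "shared_resources": ["can-bus-1"]},
        {"id": "hil-stand-3", "shared_resources": ["can-bus-2"]},
    ],
    "shared_resources": [
        {"id": "can-bus-1", "max_clients": 2},
        {"id": "can-bus-2", "max_clients": 2},
    ],
}


def validate_topology(lab: dict) -> list[str]:
    """Return a list of problems; an empty list means the file is consistent."""
    limits = {r["id"]: r["max_clients"] for r in lab["shared_resources"]}
    usage = Counter()
    problems = []
    for stand in lab["stands"]:
        for res in stand.get("shared_resources", []):
            if res not in limits:
                problems.append(f"{stand['id']}: unknown shared resource {res!r}")
            else:
                usage[res] += 1
    for res, count in usage.items():
        if count > limits[res]:
            problems.append(f"{res}: {count} stands attached, max_clients is {limits[res]}")
    return problems


print(validate_topology(topology) or "topology is consistent")
```

Run in the pull request pipeline, a check like this rejects a fourth stand on a two-client bus before anyone touches a cable.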

The CI/CD Report Generation Tax

Configuration is not only about instruments and stands. It is also about the outputs – the reports, the evidence packs, the CI/CD consumable artifacts. And here the desktop-first architecture extracts another kind of tax.

“I have wasted hours upon hours trying to get TestStand report generation to work with our CI pipeline.”

That is an anonymous forum user – one of 33 in the data-artifact pain row category. The problem they describe is specific but representative: TestStand’s built-in report generation expects to write to a local file path. The CI/CD pipeline expects the report at a particular URI, in a particular format, with particular metadata. Bridging these two expectations requires a chain of post-processing steps: an XSLT transform to convert the TestStand XML to JUnit XML, a PowerShell script to copy the file to the CI/CD artifact store, a Python script to extract the verdict and post it to the merge request. Each step is a potential failure point. Each step was written by an engineer who would rather be testing products.

The report generation tax compounds with CI/CD pipeline complexity. A simple single-stand setup might have three post-processing steps. A multi-stand lab with shared resources, parallel execution, and evidence aggregation might have fifteen. Each step is a handoff. Each handoff is a tax.
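When results are already structured data, the conversion step shrinks to one function with no XSLT involved. A minimal sketch that emits JUnit-style XML from a list of step results; the step fields `name`, `verdict`, and `duration_ms` follow the SDK example earlier in this article, while `measurement` and the choice of `system-out` to carry it are assumptions:

```python
import xml.etree.ElementTree as ET


def to_junit(suite_name: str, steps: list[dict]) -> str:
    """Render step results as a JUnit-style <testsuite> document."""
    failures = sum(1 for s in steps if s["verdict"] != "passed")
    suite = ET.Element("testsuite", name=suite_name,
                       tests=str(len(steps)), failures=str(failures))
    for s in steps:
        case = ET.SubElement(suite, "testcase", name=s["name"],
                             time=f"{s['duration_ms'] / 1000:.3f}")
        if s["verdict"] != "passed":
            ET.SubElement(case, "failure", message=f"verdict: {s['verdict']}")
        if "measurement" in s:
            # Carry the measured value through instead of dropping it.
            ET.SubElement(case, "system-out").text = f"measurement={s['measurement']}"
    return ET.tostring(suite, encoding="unicode")


steps = [
    {"name": "supply_voltage", "verdict": "passed", "duration_ms": 1250, "measurement": 12.01},
    {"name": "can_wakeup", "verdict": "failed", "duration_ms": 300},
]
print(to_junit("ecu-regression", steps))
```

The measurement travels with the test case because the code puts it there explicitly, which is the property the silently-dropping XSLT transforms lacked.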

The Escape Hatch

Not every team accepts the tax passively. Some build escape hatches.

“We can probably get rid of windows and MAX!”

This was posted by a test engineer who had built an Ansible-based provisioning system specifically to bypass the Windows desktop requirement. The goal: eliminate Windows from the CI/CD path entirely. Provision a Linux runner, trigger tests via a REST API, and never touch MAX again. The test stands themselves still run Windows – the instruments need it, the drivers need it, the existing sequences need it. But the CI/CD path – the trigger, the evidence collection, the result posting – runs on Linux, in containers, without a single NI installer.

This is the correct architectural boundary. The test stand is a hardware-facing resource. It may run whatever OS and toolchain the instruments require. The orchestration layer – the API, the CLI, the CI/CD integration – is a software-facing resource. It can run on Linux, macOS, or any platform that can make an HTTP request. The two layers communicate through the REST API. The API is the contract. Everything behind the API is an implementation detail. Everything in front of the API is a client.

So what? When your configuration is code, your CI/CD pipeline can read it, validate it, and deploy it automatically. When your configuration is binary blobs on a network share, your deployment process is a human with a USB stick and a checklist. One of these scales to 50 stands. The other scales to however many stands one person can physically visit in a day.


Error Context: When “Image Too Small” Wastes Two Days

“I spent two days figuring out that ‘image too small’ means ‘you didn’t install the instrument driver module.’”

This forum quote should be framed and hung on the wall of every test engineering manager’s office. It captures the cost of opaque error messages more vividly than any benchmark.

The problem is not that errors occur. The problem is that when an error occurs in a CI/CD pipeline, nobody is sitting at the keyboard to interpret it. The pipeline gets a failure code. The failure code has no structured diagnostic payload. The engineer who triages the failure – possibly days later, possibly a different person than the author of the failing commit – opens a log, reads “image too small,” and begins a Google search that may consume hours or days.

The alternative is typed error contexts:

{
  "error": {
    "code": "INSTRUMENT_DRIVER_MISSING",
    "message": "The instrument driver module for Keithley 2110 was not found in the current runtime.",
    "context": {
      "instrument": "Keithley 2110",
      "bus": "GPIB",
      "bus_address": 5,
      "operation": "MEAS:VOLT:DC?",
      "suggested_action": "Install the keithley-2110 driver: pip install ft-driver-keithley-2110",
      "documentation_url": "https://docs.finnegans.tech/drivers/keithley-2110/"
    }
  }
}

This error tells the CI/CD pipeline exactly what went wrong, on which instrument, during which operation, and what to do about it. The CI/CD runner can log it. The engineer triaging the failure can act on it immediately. The corrective action is a pip install command, not a two-day investigation.
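On the consuming side, a typed error is a few lines of parsing. A minimal sketch, assuming the payload shape shown above; the class and function names are illustrative and not part of any published SDK:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StandError:
    code: str
    message: str
    context: dict


def parse_error(payload: str) -> StandError:
    """Turn the JSON error body into a typed object."""
    err = json.loads(payload)["error"]
    return StandError(code=err["code"], message=err["message"],
                      context=err.get("context", {}))


def triage_line(err: StandError) -> str:
    """One log line that names the failure and the suggested fix."""
    action = err.context.get("suggested_action", "no suggested action provided")
    return f"[{err.code}] {err.message} -> {action}"


payload = json.dumps({"error": {
    "code": "INSTRUMENT_DRIVER_MISSING",
    "message": "The instrument driver module for Keithley 2110 was not found in the current runtime.",
    "context": {"suggested_action": "Install the keithley-2110 driver"},
}})
print(triage_line(parse_error(payload)))
```

Because `code` is a stable identifier, a pipeline can also branch on it, for example retrying transient bus errors while failing fast on a missing driver.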

So what? Error context is the difference between “build failed” and “the Keithley 2110 on GPIB bus address 5 is missing its driver.” The first tells you nothing. The second tells you exactly what to fix. CI/CD pipelines amplify this difference: a good error message saves minutes. A bad error message costs days. Across hundreds of failures per year, the difference is measured in engineer-months.


Reports Are Not Evidence: What CI Needs From Test Results

“We will generate a chunk of XML-Results, so we are not able to link all the requirements by hand.” – one NI forum user, NI TestStand Forum, 2013.

More than a decade later, the fundamental problem has not changed. TestStand generates XML reports. Those reports contain pass/fail verdicts, step names, timestamps, and measurement values. What they do not contain – consistently and machine-readably – is the chain linking a requirement to a test case, a test case to a result, and a result to a specific DUT with a specific firmware version on a specific instrument configuration.

The gap is not in the XML schema. The gap is in what the system captures at runtime. A typical TestStand XML report captures: verdict, step name, timestamp, measurement. It does not capture: DUT identity if the serial wasn’t available before power-on, instrument calibration status at time of test, firmware build hash, operator identity, or the version of every tool in the chain.
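The difference between a report and evidence can be expressed as a completeness check. A minimal sketch, assuming one record per result; the field names are hypothetical and simply follow the two lists in the paragraph above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    # What a typical report already captures
    verdict: str
    step_name: str
    timestamp: str
    measurement: Optional[float] = None
    # What an auditor additionally needs
    dut_serial: Optional[str] = None
    firmware_build_hash: Optional[str] = None
    calibration_status: Optional[str] = None
    operator_id: Optional[str] = None
    tool_versions: Optional[dict] = None


PROVENANCE_FIELDS = ("dut_serial", "firmware_build_hash", "calibration_status",
                     "operator_id", "tool_versions")


def missing_provenance(record: EvidenceRecord) -> list[str]:
    """Names of provenance fields that are absent or empty."""
    return [name for name in PROVENANCE_FIELDS
            if getattr(record, name) in (None, "", {})]


report_only = EvidenceRecord(verdict="passed", step_name="supply_voltage",
                             timestamp="2024-03-15T10:02:11Z", measurement=12.01)
print(missing_provenance(report_only))
```

A pipeline could refuse to archive any record for which this list is non-empty, turning an audit finding into a failed build.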

The 19-Year-Old Unsolved Problem

“I need to find out how to access the array of strings used to store unit requirements… The teststand help file points out that the requirements list is accessible using the PropertyObjectFile interface. There seems to be no expression function available.”

This was posted by one NI forum user in 2006. Nineteen years ago. The problem: requirements traceability. TestStand has a concept of “requirements” – you can attach requirement identifiers to steps in the sequence editor. But programmatically accessing those requirements – querying which requirement is linked to which step, which result, which DUT – requires navigating the TestStand API’s PropertyObject interface, a COM-based hierarchy that is notoriously opaque. The PropertyObjectFile interface mentioned in the help file exists in the API but is not exposed through the expression language. You cannot write a TestStand expression that says “give me the requirement ID for the current step.” You must write external code – LabVIEW, C#, or Python via COM – that traverses the PropertyObject tree.

The result, nineteen years later, is that in our experience, most teams either (a) don’t populate requirement references at all, or (b) populate them but cannot extract them into reports without custom development. The requirement-to-result chain – the thing auditors ask for – lives in the sequence editor but dies at the report boundary.

This is not a missing feature. It is an architectural choice: the requirement reference is stored in the binary sequence file, accessible only through a COM API, with no built-in serialization path to the report output. The information exists. The access path does not.

NIST 800-171 and the Evidence Burden

“Projects delivered to a US Government agency must be awarded Authority to Operate (ATO). NIST 800-171 and NIST 800-53 requirements apply.”

This is from one NI forum user, posting in 2024 on a forum thread about compliance evidence for defense contractors. The context: a team delivering test systems to a US Government agency needs to demonstrate that their test evidence meets the traceability and non-repudiation requirements of NIST 800-171 (protecting Controlled Unclassified Information) and NIST 800-53 (security and privacy controls for federal information systems).

NIST 800-171, requirement 3.14.1, requires organizations to “identify, report, and correct system flaws in a timely manner.” NIST 800-53, AU-3, requires that audit records contain sufficient information to establish “what events occurred, the sources of the events, and the outcomes of the events.” A report file that contains a pass/fail verdict but lacks structured provenance – no DUT identity, no instrument calibration status, no operator identity – does not satisfy these controls. The evidence exists somewhere – in a log file on the test stand, in a calibration database, in a paper checklist – but it is not bound together in a machine-readable, non-repudiable package.

The same thread describes a team that spent six weeks preparing evidence for an ATO assessment. Six weeks. Not because the tests didn’t run or the results were wrong. Because the evidence was scattered across five systems and required manual assembly. The auditor needed to see, for each requirement, the test result, the DUT identity, the instrument calibration status, and the firmware version – all linked together, all traceable, all non-repudiable. The XML report had the verdict. Everything else had to be reconstructed from other sources.

This is the compliance cost of desktop-first evidence. The report says “Pass.” The auditor asks “How do you know?” The answer requires a human, a spreadsheet, and six weeks.

Now compare the evidence chain approach – the same evidence chain that our pillar page maps in detail. Finnegans captures six provenance dimensions per test run:

  1. Result – pass/fail, measurement value, duration, trace context
  2. Step – test plan name, step identifier, requirement reference (populated at plan design time)
  3. DUT – serial number, firmware version, hardware revision – bound at any point in the session
  4. Instrument – make, model, serial, calibration date, driver version – snapshotted at session start
  5. Configuration – stand identity, parameter values, environment variables – the git commit hash of the plan and config files
  6. Operator/Handoff – who triggered the session, who approved the result, any handoff events between operators or tools

The evidence pack is not an XML report. It is a typed data structure that can be serialized to JUnit XML (for Jenkins/GitLab), JSON (for API consumers), HTML (for human review), or the native typed format (for evidence archives). Each serialization preserves all six provenance dimensions because the dimensions are stored in the typed model, not in the output format.
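
To make the "typed model first, output format second" idea concrete, here is a minimal sketch of such a structure in plain Python. The class name, field names, and sample values are illustrative assumptions, not the actual Finnegans schema. Since JUnit XML has no provenance model of its own, this sketch carries the dimensions along as property elements.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class EvidencePack:
    """One test result plus its six provenance dimensions (illustrative field names)."""
    result: dict          # verdict, measurement, duration
    step: dict            # plan name, step id, requirement reference
    dut: dict             # serial, firmware, hardware revision
    instruments: list     # one dict per instrument snapshot
    configuration: dict   # stand identity, git commit of plan and config
    operator: dict        # who triggered and who approved

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def to_junit_xml(self) -> str:
        """Serialize to JUnit-style XML, keeping provenance as <property> elements."""
        failed = self.result["verdict"] != "PASS"
        suite = ET.Element("testsuite", name=self.step["plan"], tests="1",
                           failures="1" if failed else "0")
        props = ET.SubElement(suite, "properties")
        for dimension in ("result", "step", "dut", "configuration", "operator"):
            for key, value in getattr(self, dimension).items():
                ET.SubElement(props, "property", name=f"{dimension}.{key}", value=str(value))
        for index, instrument in enumerate(self.instruments):
            for key, value in instrument.items():
                ET.SubElement(props, "property",
                              name=f"instrument.{index}.{key}", value=str(value))
        case = ET.SubElement(suite, "testcase", name=self.step["id"],
                             time=str(self.result["duration_s"]))
        if failed:
            ET.SubElement(case, "failure", message=self.result["verdict"])
        return ET.tostring(suite, encoding="unicode")


# Sample data, invented for illustration only.
pack = EvidencePack(
    result={"verdict": "PASS", "measurement": 4.98, "duration_s": 1.2},
    step={"plan": "ecu-regression", "id": "can-bus-test", "requirement": "REQ-ECU-042"},
    dut={"serial": "SN-0001", "firmware": "4.2.1", "hardware_rev": "C"},
    instruments=[{"model": "Keithley 2110", "calibration_date": "2025-01-15"}],
    configuration={"stand": "hil-stand-3", "git_commit": "abc1234"},
    operator={"triggered_by": "ci-runner", "approved_by": None},
)
junit = pack.to_junit_xml()
print(pack.to_json())
```

Both serializers read from the same dataclass, so adding a field to the model makes it appear in every output format without touching format-specific code.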

So what? When an auditor asks “show me the evidence that REQ-ECU-042 was tested against firmware version 4.2.1 on HIL Stand 3 with instruments that were in calibration,” the answer should be a query against the evidence store, not a search through five tools, three emails, and a paper checklist. The difference is measured in audit preparation time: hours versus weeks.
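
The auditor's question above can be sketched as a filter over evidence records. The record layout, the cal_due field, and the find_evidence helper are assumptions made for this example. A real evidence store would presumably expose an equivalent query through its API.

```python
from datetime import date

# Sample evidence records, invented for illustration.
RECORDS = [
    {"requirement": "REQ-ECU-042", "firmware": "4.2.1", "stand": "hil-stand-3",
     "verdict": "PASS", "run_date": date(2025, 3, 10),
     "instruments": [{"model": "Keithley 2110", "cal_due": date(2025, 9, 1)}]},
    {"requirement": "REQ-ECU-042", "firmware": "4.2.0", "stand": "hil-stand-3",
     "verdict": "PASS", "run_date": date(2025, 2, 1),
     "instruments": [{"model": "Keithley 2110", "cal_due": date(2025, 9, 1)}]},
    {"requirement": "REQ-ECU-042", "firmware": "4.2.1", "stand": "hil-stand-3",
     "verdict": "PASS", "run_date": date(2025, 3, 12),
     "instruments": [{"model": "Keithley 2110", "cal_due": date(2025, 3, 11)}]},
]


def in_calibration(record: dict) -> bool:
    """True if every instrument's calibration was still valid on the run date."""
    return all(inst["cal_due"] >= record["run_date"] for inst in record["instruments"])


def find_evidence(records: list, requirement: str, firmware: str, stand: str) -> list:
    """Return runs matching the requirement, firmware, and stand with in-calibration instruments."""
    return [
        r for r in records
        if r["requirement"] == requirement
        and r["firmware"] == firmware
        and r["stand"] == stand
        and in_calibration(r)
    ]


matches = find_evidence(RECORDS, "REQ-ECU-042", "4.2.1", "hil-stand-3")
for match in matches:
    print(match["run_date"], match["verdict"])
```

The third sample record matches on requirement, firmware, and stand but is excluded because its instrument calibration had lapsed by the run date, which is exactly the distinction a verdict-only report cannot make.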

Report Format Comparison

| Format | CI-Consumable | Diffable | Human-Readable | All 6 Provenance Dimensions | Open Standard | Audit-Ready |
|---|---|---|---|---|---|---|
| TestStand XML | Partial (needs XSLT) | No (binary-adjacent) | Partial (depends on stylesheet) | No (2-3 of 6 typically) | No (NI-proprietary schema) | No (manual assembly required) |
| ATML | Partial (complex schema) | No | No (machine-oriented) | Partial (depends on configuration) | Yes (IEEE 1671) | Partial (schema supports it, but tooling rarely populates all fields) |
| JUnit XML | Yes (native CI support) | Yes (text-based) | Partial (structured) | 0 of 6 (no provenance fields) | Yes | No (no provenance model) |
| Finnegans JSON | Yes (structured API) | Yes (Git-diffable) | Yes (typed fields) | Yes (all 6 dimensions) | Yes | Yes (provenance chain is native, not bolted-on) |

The new column – Audit-Ready – captures what the previous table only implied: formats differ not just in technical properties but in regulatory consequences. A format that captures all six provenance dimensions and makes them queryable is audit-ready. A format that captures two or three, with the rest dependent on a human remembering to add them, is not.


The Alternatives: OpenTAP and pytest

Two alternatives to the TestStand monolith often come up in CI/CD discussions: OpenTAP (Keysight) and pytest. Let’s be honest about what they solve and what they don’t.

OpenTAP

OpenTAP is an open-source test automation platform from Keysight. It has a REST API. It can run headless. It has a plugin ecosystem. In many ways, it is architecturally closer to what CI/CD needs than TestStand.

However, OpenTAP’s CI/CD story has gaps at the evidence layer. Its PostgreSQL result listener writes test results to a database, but the PlanRun model does not preserve the full provenance chain: verdict is NotSet on abort (verified in OpenTAP forum Thread 2654), the relationship between test step and result requires manual SQL joins (Thread 2577), and the lifecycle event capture for PrePlanRun errors is incomplete (Thread 109). These are fixable, but they mean OpenTAP’s evidence chain is not a drop-in replacement for a regulated workflow without additional engineering.

pytest

pytest is the dominant Python test framework. It has 1,300+ plugins. It is headless. It produces JUnit XML natively. It is loved by developers and CI/CD systems alike.

But pytest has no concept of hardware stands. No device lifecycle management. No multi-DUT scheduling. No instrument driver catalog. No session identity. It operates at the software test level – it tests code. It does not test hardware-in-the-loop systems where the concept of “which instrument is connected to which DUT on which stand” is foundational.

pytest is not a competitor to TestStand or OpenTAP or Finnegans. It is a test execution framework that sits one layer above the hardware. You can use pytest through Finnegans – the Python SDK makes it trivial. But pytest alone cannot solve the CI/CD handoff gaps because pytest alone does not know what a stand is.


Adopting Orchestration Without Ripping Out Your Existing Tools

The most common objection to any discussion of alternative test architectures is: “We have six years of TestStand sequences and LabVIEW VIs. We can’t rewrite them.”

You don’t have to. Finnegans wraps your existing tools as an orchestration layer. It provides the REST API, the CLI, the Python SDK, the evidence capture, and the CI/CD integration. It calls your existing sequences through adapter runners. Your sequences stay. Your instruments stay. Your LabVIEW VIs stay. The orchestration layer adds what TestStand was never designed to provide: an HTTP control plane, text-based configuration, typed evidence, and CI/CD-native integration.

How an Adapter Runner Works

The adapter runner is the bridge between the REST API and your existing TestStand engine. It receives a session request from the API, translates it into COM calls against the TestStand engine, executes the sequence, captures the result, enriches it with evidence context, and returns the evidence pack to the API layer. Here is a concrete example – a Python adapter runner that wraps an existing TestStand sequence via COM:

# adapter_runners/teststand_adapter.py
"""Adapter that wraps a TestStand sequence via COM and exposes it as a Finnegans runner."""
import time

import pythoncom
import win32com.client
from finnegans.adapter import BaseAdapter, AdapterResult, StepResult
from finnegans.types import Verdict


class TestStandAdapter(BaseAdapter):
    """Wraps a TestStand .seq file – no rewrite required."""

    def __init__(self, sequence_path: str, engine_timeout_ms: int = 120_000):
        self.sequence_path = sequence_path
        self.engine_timeout_ms = engine_timeout_ms

    def execute(self, parameters: dict) -> AdapterResult:
        # Initialize COM in this thread (required for CI/CD service accounts)
        pythoncom.CoInitialize()

        ts = None
        try:
            ts = win32com.client.Dispatch("TestStand.Application")
            ts.Visible = False

            engine = ts.GetEngine()
            seq_file = engine.GetSequenceFileEx(self.sequence_path)
            seq = seq_file.GetSequenceByName("MainSequence")

            # Push parameters into TestStand's StationGlobals so the sequence can read them.
            # The local is named station_globals so it does not shadow Python's built-in globals().
            station_globals = engine.StationGlobals
            for key, value in parameters.items():
                if hasattr(station_globals, key):
                    setattr(station_globals, key, value)

            execution = engine.NewExecution(
                seq, None, False, self.engine_timeout_ms
            )

            # Poll until the sequence finishes or the timeout expires. A monotonic
            # clock keeps the deadline immune to wall-clock adjustments, and the
            # short sleep stops the loop from pinning a CPU core.
            deadline = time.monotonic() + self.engine_timeout_ms / 1000
            while execution.IsExecuting:
                if time.monotonic() > deadline:
                    # Terminate rather than Break: Break only suspends the execution,
                    # which would leave a hung sequence alive on the stand.
                    execution.Terminate()
                    return AdapterResult(
                        verdict=Verdict.ERROR,
                        error={
                            "code": "EXECUTION_TIMEOUT",
                            "message": f"Sequence did not complete within {self.engine_timeout_ms}ms"
                        }
                    )
                # Let COM deliver pending messages to this thread between polls
                pythoncom.PumpWaitingMessages()
                time.sleep(0.1)

            # Translate TestStand result to Finnegans verdict
            ts_result = execution.Result.ToString()  # "Passed", "Failed", "Error"
            verdict_map = {"Passed": Verdict.PASS, "Failed": Verdict.FAIL, "Error": Verdict.ERROR}
            verdict = verdict_map.get(ts_result, Verdict.ERROR)

            # Extract step results from the execution
            steps = []
            ts_steps = execution.Result.TS.StepResults
            for i in range(ts_steps.Count):
                ts_step = ts_steps[i]
                steps.append(StepResult(
                    name=ts_step.StepName,
                    verdict=verdict_map.get(ts_step.Result.ToString(), Verdict.ERROR),
                    duration_ms=ts_step.Duration,
                    measurement=ts_step.Measurement.Value if ts_step.Measurement else None,
                ))

            return AdapterResult(
                verdict=verdict,
                steps=steps,
                metadata={
                    "adapter": "TestStandAdapter",
                    "sequence_path": self.sequence_path,
                    "teststand_version": ts.VersionString,
                }
            )

        except Exception as exc:
            return AdapterResult(
                verdict=Verdict.ERROR,
                error={
                    "code": "ADAPTER_EXCEPTION",
                    "message": str(exc),
                }
            )
        finally:
            if ts is not None:
                ts.Quit()
            pythoncom.CoUninitialize()

This adapter is roughly 80 lines of Python. It handles COM initialization, parameter injection, execution monitoring with timeout, verdict translation, and step-result extraction. It returns a typed AdapterResult that the orchestration layer can enrich with evidence context – instrument snapshots, DUT identity, configuration hashes. The sequence itself – the .seq file containing the actual test logic – is unchanged. The adapter calls it. The orchestration layer calls the adapter. Nothing is rewritten.

What a YAML Test Plan Looks Like vs. a .seq File

One of the biggest mental shifts in adopting an orchestration layer is moving from a binary sequence file as the unit of test definition to a text-based test plan. Here is a side-by-side comparison.

Before: A .seq file (conceptual – it’s binary, so you can’t read it directly)

A TestStand sequence file is a binary blob. You open it in the Sequence Editor. You see a tree of steps. Step 1: “Initialize Instruments.” Step 2: “Power On DUT.” Step 3: “Read Serial Number.” Step 4: “Run CAN Bus Test.” Step 5: “Power Off DUT.” Step 6: “Generate Report.” Each step has properties – timeout, error handling, preconditions, post-actions – that are set through dialog boxes in the editor. The file also contains station configuration, instrument references, report options, and user-interface layout state. None of this is visible outside the Sequence Editor. None of it is diffable. None of it is searchable with grep.

After: A YAML test plan (text – readable, diffable, reviewable)

# plans/ecu-regression.yaml
plan:
  id: "ecu-regression"
  display_name: "ECU Regression Sequence"
  description: "Full regression test for ECU firmware. Covers CAN bus, power management, and sensor I/O."
  version: "2.3.0"
  author: "test-eng-team@company.com"

  requirements:
    - "REQ-ECU-042"  # CAN bus message integrity
    - "REQ-ECU-043"  # Power-on self-test
    - "REQ-ECU-044"  # Sensor calibration accuracy

  parameters:
    - name: "firmware_image"
      type: "string"
      description: "Path to firmware binary to flash before testing"
      required: true
    - name: "dut_serial"
      type: "string"
      description: "DUT serial number – may be bound during session"
      required: false
    - name: "can_termination"
      type: "integer"
      description: "CAN bus termination resistance in ohms"
      default: 120

  steps:
    - id: "init"
      name: "Initialize Instruments"
      adapter: "teststand"
      sequence: "./sequences/init-instruments.seq"
      timeout_ms: 30000
      on_failure: "ABORT"

    - id: "power-on"
      name: "Power On DUT"
      adapter: "teststand"
      sequence: "./sequences/power-on-dut.seq"
      timeout_ms: 15000
      on_failure: "ABORT"

    - id: "read-serial"
      name: "Read Serial Number"
      adapter: "teststand"
      sequence: "./sequences/read-serial.seq"
      timeout_ms: 10000
      on_failure: "CONTINUE"
      outputs:
        - name: "dut_serial"
          bind_to: "session.dut.serial"

    - id: "can-bus-test"
      name: "CAN Bus Message Integrity"
      adapter: "teststand"
      sequence: "./sequences/can-bus-test.seq"
      requirement: "REQ-ECU-042"
      timeout_ms: 60000
      parameters:
        termination_ohms: "{{ can_termination }}"
      on_failure: "FAIL"

    - id: "power-off"
      name: "Power Off DUT"
      adapter: "teststand"
      sequence: "./sequences/power-off-dut.seq"
      timeout_ms: 10000
      on_failure: "CONTINUE"

The YAML plan is about 60 lines. You can read it. You can diff it in a pull request. You can grep for REQ-ECU-042 and find every test plan that references it. You can change the timeout on the CAN bus test from 60,000 ms to 90,000 ms and the diff shows exactly one line changed – not 847 lines of binary noise. The plan still calls .seq files through the adapter – the existing test logic is preserved. But the plan itself – the definition of what gets tested, in what order, against which requirements, with which parameters – is text. And text is the currency of CI/CD.
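
The "grep for a requirement" point can be sketched in a few lines of Python. This is a plain text scan, not a YAML parser, and the REQ-XXX-NNN pattern is an assumption inferred from the identifiers in the example plan. The second plan file and its contents are invented for the demonstration.

```python
import re
from collections import defaultdict

# Assumed ID convention, inferred from the sample plan (e.g. REQ-ECU-042).
REQ_PATTERN = re.compile(r"\bREQ-[A-Z]+-\d+\b")


def index_requirements(plans: dict[str, str]) -> dict[str, list[str]]:
    """Map each requirement ID to the sorted list of plan files that mention it."""
    index = defaultdict(set)
    for filename, text in plans.items():
        for requirement in REQ_PATTERN.findall(text):
            index[requirement].add(filename)
    return {req: sorted(files) for req, files in sorted(index.items())}


# In-memory stand-ins for plan files; the second file is hypothetical.
PLANS = {
    "ecu-regression.yaml": 'requirements:\n  - "REQ-ECU-042"\n  - "REQ-ECU-043"\n',
    "ecu-smoke.yaml": 'steps:\n  - id: "can"\n    requirement: "REQ-ECU-042"\n',
}

requirement_index = index_requirements(PLANS)
for requirement, files in requirement_index.items():
    print(requirement, "->", ", ".join(files))
```

Against a real repository, the PLANS dictionary could be built by reading every file under the plans directory with pathlib. The point is that the traceability question becomes a text search instead of a COM traversal.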

So what? Migrating one test plan from binary to YAML is an afternoon’s work. It does not require rewriting any test logic. It simply describes – in text – what the test plan contains, what it needs, and what it produces. Once one plan is described this way, the CI/CD pipeline can read it, validate it, and trigger it automatically. The handoff tax on that plan drops from hours per run to near zero.
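
As one illustration of what "validate it" could mean, the sketch below checks an already-parsed plan (a plain dictionary) before anything touches hardware. The rules and the set of allowed on_failure values are assumptions drawn from the example plan above, not a published schema.

```python
# Values seen in the example plan; treating them as the full set is an assumption.
ALLOWED_ON_FAILURE = {"ABORT", "CONTINUE", "FAIL"}


def validate_plan(plan: dict) -> list[str]:
    """Return human-readable problems; an empty list means the plan passed these checks."""
    problems = []
    declared = set(plan.get("requirements", []))
    seen_ids = set()
    for step in plan.get("steps", []):
        step_id = step.get("id", "<missing id>")
        if step_id in seen_ids:
            problems.append(f"{step_id}: duplicate step id")
        seen_ids.add(step_id)
        for key in ("adapter", "sequence"):
            if key not in step:
                problems.append(f"{step_id}: missing '{key}'")
        timeout = step.get("timeout_ms")
        if not isinstance(timeout, int) or timeout <= 0:
            problems.append(f"{step_id}: timeout_ms must be a positive integer")
        if step.get("on_failure") not in ALLOWED_ON_FAILURE:
            problems.append(f"{step_id}: invalid on_failure {step.get('on_failure')!r}")
        requirement = step.get("requirement")
        if requirement is not None and requirement not in declared:
            problems.append(f"{step_id}: requirement {requirement} is not declared in the plan")
    return problems


# A deliberately broken plan: unknown on_failure value and an undeclared requirement.
plan = {
    "requirements": ["REQ-ECU-042"],
    "steps": [
        {"id": "init", "adapter": "teststand", "sequence": "./sequences/init-instruments.seq",
         "timeout_ms": 30000, "on_failure": "ABORT"},
        {"id": "can-bus-test", "adapter": "teststand", "sequence": "./sequences/can-bus-test.seq",
         "timeout_ms": 60000, "on_failure": "RETRY", "requirement": "REQ-ECU-999"},
    ],
}
problems = validate_plan(plan)
for problem in problems:
    print(problem)
```

Run as a pre-merge check, a validator like this rejects a mistyped requirement ID in the pull request rather than during an overnight run on the stand.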

Here is what an incremental adoption path looks like:

  • Week 1: Map. Pick one workflow. List every tool, every handoff, every human step. Identify the CI/CD gaps using the 7-gap self-assessment at the end of this article. No changes yet. The deliverable this week is a one-page diagram: boxes for each tool, arrows for each handoff, red highlights for human-dependent steps. If you have never drawn this diagram before, it will be revealing. Most teams discover 2-3 handoffs they did not know existed – steps where someone copies a serial number from a terminal window into an Excel sheet, or where someone manually renames a report file to match a naming convention.

  • Week 2: Wrap. Deploy the Finnegans orchestration layer on one stand. Configure the stand in YAML. Trigger your existing TestStand sequence via the REST API – the adapter runner calls the sequence through COM, captures the result, and enriches it with evidence context. Confirm the trigger works headlessly. The key test: trigger the sequence from a different machine, without RDP, without an interactive session, using only curl. If it works, you have closed Gap 1 (Headless Execution) and Gap 2 (Programmatic Interface) for that stand. If it doesn’t, the failure tells you what to fix – usually a COM registration issue or a modal dialog that needs suppressing.

  • Week 3: Capture. Automatically capture evidence per run. Compare the evidence pack against your previous manual report. Identify which provenance dimensions were missing before and are now present. Share the evidence pack with your quality team for review. The deliverable this week is a side-by-side comparison: the old report format vs. the new evidence pack. Highlight the gaps that are now closed – DUT identity in a machine-readable field, instrument calibration snapshot, configuration git hash. This is the week where the compliance value becomes visible.

  • Week 4: Connect. Connect the stand to your CI/CD pipeline. Each firmware push triggers a test via the REST API. Evidence posts as structured JSON. Results appear in the CI/CD dashboard. The deliverable this week is a working CI/CD integration: a pull request that triggers a hardware test, a result that appears as a PR comment or CI check, and an evidence pack that is stored in the artifact repository. This is the week where the handoff tax drops from hours per run to seconds.
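
The Week 2 and Week 4 trigger can be sketched with nothing but the Python standard library. The base URL, the /sessions path, the body fields, and the bearer-token header are hypothetical placeholders, since the real API shape is not shown in this article. The example only builds the request, so it runs without a stand attached.

```python
import json
import urllib.request


def build_session_request(base_url: str, plan_id: str, stand_id: str,
                          parameters: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST that asks the stand to start a test session."""
    body = json.dumps(
        {"plan": plan_id, "stand": stand_id, "parameters": parameters}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/sessions",   # hypothetical endpoint path
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


request = build_session_request(
    base_url="http://stand-3.example.internal:8080/api/v1/",
    plan_id="ecu-regression",
    stand_id="hil-stand-3",
    parameters={"firmware_image": "build/ecu-4.2.1.bin", "can_termination": 120},
    token="ci-token-placeholder",
)
print(request.get_method(), request.full_url)

# Sending is one more call once a real endpoint exists:
# with urllib.request.urlopen(request, timeout=30) as response:
#     session = json.load(response)
```

Nothing here depends on Windows, COM, or vendor software, which is the property the CI/CD runner needs. The stand-side service owns everything hardware-specific.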

At no point in this four-week path do you rewrite a single TestStand sequence or LabVIEW VI. The orchestration layer sits around them. The change is in how test runs are triggered, how evidence is captured, and how results flow back to CI/CD.

Beyond week 4, the adoption path diverges based on team priorities. Some teams focus on expanding to additional stands – the same pattern, repeated, with the YAML topology file growing to include more resources. Some teams focus on deepening the evidence capture – adding calibration data sources, integrating with requirements management tools, building audit-ready evidence dashboards. Some teams begin migrating individual test plans from binary .seq to YAML definitions, one plan at a time, as maintenance windows allow. The orchestration layer supports all three paths simultaneously because it decouples the definition of a test plan from the execution of a test sequence. You can improve the orchestration without touching the sequences. You can improve the sequences without breaking the orchestration.


The Economics: Licensing and Lock-In

The NI licensing model is known for its opacity – pricing is not publicly listed, and most enterprises negotiate custom agreements through NI’s sales organization. What is visible from forum discussions: organizations with moderate-sized test labs report annual licensing costs that are substantial enough to drive recurring threads about “is TestStand worth it” and “can we replace LabVIEW with open-source alternatives.” The per-station deployment license model, in particular, creates a multiplier effect: each additional stand adds not just hardware cost but incremental software licensing cost – a structure that scales poorly in CI/CD environments where test capacity needs to grow with product complexity.


The 7 CI/CD Handoff Gaps: A Self-Assessment Diagnostic

For each gap below, score your current workflow on a 1-3 scale where 1 means “gap exists, no mitigation” and 3 means “gap is closed, machine-readable, person-independent.”

Gap 1: Headless Execution

Can your test framework run without a human logged into a desktop session at 3:00 AM on Sunday?

A common failure mode for CI/CD pipelines that depend on TestStand is not a test failure. It is that the pipeline never starts because TestStand cannot open without a user session. When a CI/CD runner triggers a test via COM, it launches TestStand.Application which, by default, expects an interactive Windows desktop. If the runner is a service account (common in CI/CD setups), there is no desktop. The COM object registers, the sequence loads, and then the engine hangs – silently – waiting for a desktop that doesn’t exist. One forum user described the workaround: “We have a dedicated ‘CI/CD machine’ with a monitor permanently logged in, with the screensaver disabled, in a locked room. It’s the only way.” Another team reported that Windows Update rebooted their CI/CD machine, breaking the interactive session, and test automation was down for three days before anyone noticed – the pipeline was failing silently at the COM registration step, and the error message was buried in a log that nobody checked because “it always works.”

Score 1: Test framework requires interactive Windows session. No headless execution path exists – or the “headless” path works only if a user is logged in and the screensaver is disabled.

Score 2: Headless execution works but is fragile – depends on specific registry settings, specific OS versions, or undocumented environment variables. Breaks after OS updates or configuration changes.

Score 3: Test framework exposes an HTTP API or CLI that runs as a system service. No desktop session required. No special registry keys. Works on any OS that can make HTTP requests.

Gap 2: Programmatic Interface

Do you have a documented, reliable API – or do you rely on COM automation, GUI scripting, or operator intervention?

COM is an interface. But it is not a reliable interface for CI/CD. The error messages are opaque – 0x80004005 is “Unspecified error,” which covers roughly half of all COM failures. The marshaling is fragile – cross-thread COM access can hang the calling process. The type library registration is machine-specific – a CI/CD runner imaged from a different template may have a different GUID for the same interface. And the debugging story is terrible: when a COM call fails, you get an HRESULT and a stack trace that ends at System.Runtime.InteropServices.Marshal.ThrowExceptionForHR. Good luck.

A separate failure pattern: teams that attempt to automate TestStand through GUI scripting – AutoIt scripts that click buttons, SendKeys macros that type into dialogs, scheduled tasks that launch the Sequence Editor and simulate user input. These “integrations” work until the UI changes (a new version, a different screen resolution, a dialog that appears in a different position). Then they break silently. The test doesn’t run. The CI/CD pipeline times out. The engineer who wrote the script has left the company.

Score 1: Integration is through COM, GUI automation, or manual operator steps. No documented API. Failures are opaque.

Score 2: A scripted integration exists (Python, PowerShell) but relies on COM and requires maintenance when TestStand versions change. Error handling is partial.

Score 3: Documented REST API or CLI with typed errors, versioning, and a test suite that validates the interface independently of the test logic.

Gap 3: Configuration Portability

Can you move a test sequence from Stand A to Stand B by editing a text file – or is configuration baked into a binary sequence file?

The configuration portability gap is the snowflake problem made structural. In a desktop-first architecture, the configuration lives inside the binary file. The instrument addresses, the calibration offsets, the report paths, the station-specific parameters – all embedded in a .seq file that cannot be diffed, cannot be refactored with find-and-replace, and cannot be validated before deployment. Moving a test from Stand A to Stand B means opening the sequence editor, navigating through step properties, and manually updating every instrument reference. On a sequence with 50 steps, this is an hour of error-prone clicking. On ten sequences, it is a day.

The YAML topology file described earlier in this article is the alternative: stand configuration in a separate file, versioned in Git, referenced by test plans through logical names. Moving a test from Stand A to Stand B means changing one line in the session request – the stand ID. The instruments are resolved from the stand’s configuration file. The test plan does not change.

Score 1: Configuration is embedded in binary sequence files. Moving tests between stands requires manual editing in the sequence editor. Stand differences are managed through informal knowledge (“ask Bob”).

Score 2: Configuration is partially externalized (StationGlobals, INI files, MAX aliases) but still requires manual synchronization. Diffs are not possible.

Score 3: All configuration is text-based, versioned in Git, and validated against a schema. Stand topology is defined in YAML. Test plans reference logical instrument names, not physical addresses.
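
A minimal sketch of the logical-name idea, assuming the stand topology has already been loaded into a dictionary. The stand IDs, logical names, and VISA resource strings below are invented for illustration.

```python
# Hypothetical topology: one entry per stand, mapping logical names to VISA addresses.
STANDS = {
    "hil-stand-a": {"dmm": "GPIB0::5::INSTR", "psu": "GPIB0::7::INSTR"},
    "hil-stand-b": {"dmm": "TCPIP0::192.0.2.10::INSTR", "psu": "GPIB0::3::INSTR"},
}

# The test plan only ever mentions logical names.
PLAN_INSTRUMENTS = ["dmm", "psu"]


def resolve_instruments(stand_id: str, logical_names: list, stands: dict = STANDS) -> dict:
    """Resolve a plan's logical instrument names to physical addresses on one stand."""
    if stand_id not in stands:
        raise ValueError(f"unknown stand: {stand_id}")
    topology = stands[stand_id]
    missing = [name for name in logical_names if name not in topology]
    if missing:
        raise ValueError(f"stand {stand_id} has no instrument for: {', '.join(missing)}")
    return {name: topology[name] for name in logical_names}


on_a = resolve_instruments("hil-stand-a", PLAN_INSTRUMENTS)
on_b = resolve_instruments("hil-stand-b", PLAN_INSTRUMENTS)
print(on_a)
print(on_b)
```

Moving the plan from Stand A to Stand B changes only the stand_id argument. A stand that lacks a required instrument fails at resolution time with a named error, before any sequence starts.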

Gap 4: Error Propagation

When something fails, does the failure reach CI/CD as structured machine-readable data – or as a modal dialog nobody sees?

The silent callback failure that opens this article is the canonical example of Gap 4. But it is not the only one. A modal dialog that pops up in a headless session hangs the execution forever – there is no user to click “OK.” A VISA timeout that generates a popup in the instrument driver rather than an exception in the sequence. A report generation step that fails with an access-denied error because the output directory doesn’t exist, but the sequence continues because the step’s error handling is set to “Continue on Failure.”

The common thread: failures in desktop-first tools are designed to be handled by a human at the keyboard. They produce visual feedback – a dialog, a popup, a status indicator. In a CI/CD pipeline, there is no human at the keyboard. The failure must propagate as data – a structured error object, an exit code, a log entry that a machine can parse and route to the right person.

Score 1: Errors produce modal dialogs or GUI popups. CI/CD pipelines hang or time out. Failures are discovered retrospectively (“why didn’t the test run last night?”).

Score 2: Some errors propagate as exit codes or log entries, but modal dialogs still occur under certain conditions (unhandled runtime errors, driver initialization failures). Pipeline reliability is below 90%.

Score 3: All errors propagate as structured, typed error objects with machine-readable codes, human-readable messages, and suggested corrective actions. No modal dialogs, ever.
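
One way to picture "failure as data" is a step wrapper that never raises and never prompts: every outcome becomes an exit code plus a JSON document. The InstrumentError class, the error codes, and the exit-code values are assumptions made for this sketch.

```python
import json


class InstrumentError(Exception):
    """Hypothetical typed error carrying a machine-readable code and context."""

    def __init__(self, code: str, message: str, **context):
        super().__init__(message)
        self.code = code
        self.context = context


def run_step(step) -> tuple:
    """Run a step callable and return (exit_code, json_report). Never raises."""
    try:
        step()
        return 0, json.dumps({"verdict": "PASS"})
    except InstrumentError as exc:
        error = {"code": exc.code, "message": str(exc), "context": exc.context}
        return 2, json.dumps({"verdict": "ERROR", "error": error})
    except Exception as exc:  # anything unexpected still becomes structured data
        error = {"code": "UNHANDLED_EXCEPTION", "message": f"{type(exc).__name__}: {exc}"}
        return 3, json.dumps({"verdict": "ERROR", "error": error})


def read_voltage():
    # Simulated failure; a real step would talk to the instrument here.
    raise InstrumentError(
        "VISA_TIMEOUT", "No response within 2000 ms",
        instrument="Keithley 2110", operation="MEAS:VOLT:DC?",
    )


exit_code, report = run_step(read_voltage)
print(exit_code, report)
```

A runner script would finish by passing exit_code to sys.exit, so the pipeline sees a non-zero status and a parseable report instead of a hang.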

Gap 5: Evidence Provenance

Can you answer “which DUT, which firmware, which instruments, which calibration, which operator” six months later – in one query, not five tools?

The six provenance dimensions described in the Reports section are the benchmark. If your current workflow captures three of six and the other three are “in the log somewhere” or “ask the operator,” you have a provenance gap. The gap may not matter day-to-day. It matters during audits, during customer returns, during root-cause investigations where the question is not “did it pass?” but “what exactly was tested, with what exactly, by whom exactly?”

Score 1: Evidence is scattered across multiple tools and formats. Reconstructing the full provenance chain for a single test run requires manual effort across multiple systems. Audit preparation takes weeks.

Score 2: Most provenance dimensions are captured but not linked – DUT identity is in the test report, calibration status is in a separate database, operator identity is in a paper log. Linking requires manual work.

Score 3: All six provenance dimensions are captured in a single typed evidence pack, linked at capture time, queryable through an API, and serializable to multiple formats without losing provenance data.

Gap 6: Tool Chain Independence

Can you trigger a test from Linux, macOS, or any CI/CD platform – or are you locked into Windows with vendor software installed?

Tool chain independence is not about hating Windows. It is about CI/CD architecture. Modern CI/CD pipelines run on Linux runners in cloud VMs or containers. They are ephemeral, scalable, and cheap. Requiring a Windows runner with NI software installed for every CI/CD job that triggers a test is expensive (Windows VM licensing, longer boot times, NI license management on ephemeral machines) and fragile (the NI installer is not designed for automated provisioning).

The architectural fix is to separate the test trigger from the test execution. The trigger is an HTTP request. It can run anywhere – Linux, macOS, a GitHub Actions runner, a cron job on a Raspberry Pi. The execution happens on the test stand, which runs whatever OS and toolchain the instruments require. The API is the boundary. Everything on the trigger side is cross-platform. Everything on the execution side is hardware-specific.

Score 1: Test execution can only be triggered from a Windows machine with NI software installed. CI/CD integration requires a dedicated Windows runner with a GUI session.

Score 2: Test can be triggered from non-Windows machines through a bridge (e.g., a Windows jump box, an RDP session, a custom agent), but this adds complexity and failure modes.

Score 3: Test trigger is an HTTP request or CLI command that runs on any platform. The test stand may run Windows, but the CI/CD path is OS-agnostic.

Gap 7: Version and Change Tracking

Can you diff what changed between the passing test last week and the failing test this week – or is the answer “ask the engineer”?

This gap is the sum of all the previous gaps. If your configuration is binary, you cannot diff it. If your test plans are binary, you cannot diff them. If your evidence is scattered, you cannot diff the environment between two runs. The question “what changed?” – the most fundamental question in any regression investigation – is answered by a human reconstructing changes from memory, from email threads, from Excel trackers, from “I think Bob changed something last Tuesday.”

Score 1: Test plans, configuration, and evidence are stored in binary or proprietary formats. Change history is maintained manually or not at all. “What changed?” requires human forensics.

Score 2: Some artifacts are text-based and version-controlled (e.g., wrapper scripts, CI/CD configs) but the core test logic (.seq, .vi) is binary. Partial change tracking.

Score 3: All test definitions, stand configurations, and evidence packs are text-based and versioned in Git. A git diff between two commit hashes shows exactly what changed in the test definition, the stand configuration, and the execution parameters.
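To make the Score 3 case concrete, here is a small sketch of what "what changed?" looks like once the stand configuration is plain text. It uses Python's `difflib` in place of `git diff` so it runs without a repository. The two configuration revisions, their field names, and the file name `stand_b.yaml` are invented for illustration.

```python
import difflib

# Two hypothetical revisions of a text-based stand configuration.
LAST_WEEK = """\
stand: B
dmm:
  address: GPIB0::22::INSTR
  range_v: 10
limits:
  vout_max_v: 5.10
"""

THIS_WEEK = """\
stand: B
dmm:
  address: GPIB0::22::INSTR
  range_v: 100
limits:
  vout_max_v: 5.10
"""


def what_changed(old: str, new: str, name: str = "stand_b.yaml") -> list[str]:
    """Return only the added and removed lines between two text revisions."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
        lineterm="",
    )
    # Keep '+' and '-' lines, but skip the '+++' / '---' file header lines.
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

Calling `what_changed(LAST_WEEK, THIS_WEEK)` returns `['-  range_v: 10', '+  range_v: 100']`: the regression investigation starts from a one-line answer rather than from someone's memory. The same comparison is impossible when the configuration lives inside a binary sequence file.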

Self-Assessment Worksheet

Here is the diagnostic worksheet. For each gap, circle your score. Total them at the bottom.

| Gap | Symptom | Score 1 | Score 2 | Score 3 | Your Score |
|---|---|---|---|---|---|
| 1. Headless Execution | “We have a monitor logged in 24/7 in a locked room.” | Requires interactive desktop session | Works but fragile (registry hacks, undocumented env vars) | REST API or CLI, runs as system service | |
| 2. Programmatic Interface | “The COM object works. Usually.” | COM, GUI scripting, or manual steps. Opaque errors | Scripted but COM-dependent. Breaks on upgrade | Documented REST API, typed errors, versioned | |
| 3. Configuration Portability | “Don’t move that sequence to Stand B. It won’t work.” | Baked into binary files. Manual editing required | Partial externalization (StationGlobals, INI files) | Text-based YAML, Git-versioned, schema-validated | |
| 4. Error Propagation | “Did the test fail, or did the pipeline crash?” | Modal dialogs, silent hangs, timeout-only detection | Some structured errors, occasional dialogs | All errors propagate as typed, machine-readable data | |
| 5. Evidence Provenance | “The auditor needs six things. We have three.” | Scattered across tools. Manual reconstruction | Most data captured, not linked. Manual assembly | All 6 dimensions captured, linked, queryable | |
| 6. Tool Chain Independence | “We need a Windows VM just to trigger a test.” | Windows-only, NI software required on CI/CD runner | Bridge/jump-box pattern. Extra complexity | HTTP trigger. Any OS. No vendor software on CI path | |
| 7. Version & Change Tracking | “Ask Bob what he changed last Tuesday.” | Binary formats. Manual change tracking or none | Partial: scripts versioned, core logic binary | All definitions, configs, evidence text-based in Git | |
| TOTAL | | | | | |

Scoring: 7-11: structural gaps, likely $220K+/yr in handoff tax. 12-16: partial mitigation, fragile workarounds. 17-21: architecturally sound, incremental improvements needed.
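If you want to tally the worksheet for several stands or teams, the scoring bands above can be expressed as a short function. This is only a sketch: the gap keys and the `assess` function name are illustrative, while the band boundaries (7-11, 12-16, 17-21) are taken from the scoring line above.

```python
# Gap keys are shorthand for the seven worksheet rows; the names are illustrative.
GAPS = (
    "headless_execution",
    "programmatic_interface",
    "configuration_portability",
    "error_propagation",
    "evidence_provenance",
    "tool_chain_independence",
    "version_change_tracking",
)


def assess(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the seven gap scores (each 1, 2, or 3) and return (total, band)."""
    missing = [gap for gap in GAPS if gap not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    for gap in GAPS:
        if scores[gap] not in (1, 2, 3):
            raise ValueError(f"{gap}: score must be 1, 2, or 3, got {scores[gap]!r}")

    total = sum(scores[gap] for gap in GAPS)
    if total <= 11:
        band = "structural gaps"
    elif total <= 16:
        band = "partial mitigation, fragile workarounds"
    else:
        band = "architecturally sound"
    return total, band
```

For example, a stand that scores 2 on headless execution and 1 on everything else totals 8 and lands in the "structural gaps" band.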

If your score is below 17, you are not alone. The 322 forum posts we found describe the same pattern – not always the same tools, but the same structural gaps between a desktop-first test tool and a CI/CD pipeline that expects services, not applications. The gaps are not in your engineering capability. They are in the architecture of the tool interface. A desktop application does not become a CI/CD service by adding a command-line flag. It needs a different control surface: an API, not a GUI.

The gaps are not inside your tools. They are in the space between them. The diagnostic kit costs nothing and takes 30-45 minutes. It maps one workflow against these seven gaps and shows you where the handoff tax lives.

Use the diagnostic kit →

If the gaps are structural and recurring, book a workflow review: a free 30-minute fit assessment. Bring one workflow and we will tell you honestly whether a diagnostic is worth it.

Marcin June 14, 2026
