Architecture Of Konflux

62. Distributed Tracing

Date started: 2026-02-04

# Status

Implementable

# Context

Tekton’s OpenTelemetry instrumentation supports propagating a parent span context onto PipelineRuns. When present, Tekton parents its execution spans under the provided context and propagates it to child TaskRuns. Although designed for Tekton-internal parent-child PipelineRun linkage, the mechanism accepts a span context from any source.

By propagating trace context onto PipelineRuns they create, Konflux controllers cause Tekton’s execution spans to appear as children in an external distributed trace. This enables end-to-end trace continuity across the delivery lifecycle — from SCM event receipt through build, snapshotting, integration, and release — supporting consistent trace-based measurement of delivery latency (MTTB) across controllers and clusters.
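The propagation side of this mechanism can be sketched as follows. This is a minimal illustration, not Tekton's actual implementation: the annotation key `example.dev/traceparent` is a placeholder (Tekton's OpenTelemetry instrumentation defines its own key), and the PipelineRun is reduced to its annotation map.

```go
package main

import "fmt"

// traceParentAnnotation is a placeholder key; Tekton's OpenTelemetry
// instrumentation defines its own annotation for the parent span context.
const traceParentAnnotation = "example.dev/traceparent"

// injectTraceParent copies a W3C traceparent value onto a PipelineRun's
// annotations (the PipelineRun is reduced here to its annotation map).
// With no incoming context it does nothing, keeping the mechanism inert.
func injectTraceParent(annotations map[string]string, traceparent string) {
	if traceparent == "" {
		return // no incoming context: no behavioral change
	}
	annotations[traceParentAnnotation] = traceparent
}

func main() {
	ann := map[string]string{}
	injectTraceParent(ann, "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	fmt.Println(ann[traceParentAnnotation])
}
```

The inert-when-absent branch is what allows existing PipelineRuns to remain unaffected (goal 2 below).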

# Goals

  1. Propagate a remote span context across Konflux controllers and clusters so that all PipelineRuns in a delivery form a single distributed trace.
  2. Remain inert when trace context is absent — no behavioral change to existing PipelineRuns.
  3. Support reliable timing analysis using resource timestamps (wait vs. execute breakdown).

# Decision

# Alignment principles

Trace context propagation follows two complementary standards: the W3C Trace Context format for encoding trace context, and OpenTelemetry semantics for span parenting.

PaC extracts trace context from inbound SCM events and propagates it onto the build PipelineRun, ensuring a single propagation mechanism throughout the resource-level delivery lifecycle.

# Trace context propagation

PaC extracts trace context from inbound SCM webhook headers and propagates it onto the build PipelineRun it creates, establishing a new root when no incoming context is present. The initiating event’s trace context takes precedence: controllers propagate the event-origin context, not any pre-existing context in resource metadata at creation time.

After a successful build, integration-service persists the trace context onto the Snapshot, which serves as the logical handoff boundary across temporal and cluster transitions. Trace context is propagated forward across this boundary to preserve trace continuity. For any integration PipelineRuns derived from that Snapshot, integration-service injects the Snapshot’s trace context onto each created PipelineRun. When release is initiated, integration-service copies the Snapshot’s trace context onto the Release CR; release-service carries it onto release PipelineRuns.

A delivery may produce multiple integration and release PipelineRuns. The propagation rule is consistent: any PipelineRun derived from the Snapshot carries the Snapshot’s trace context.
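The propagation rule can be sketched as a single fan-out from the Snapshot; as above, the annotation key is a placeholder and resources are reduced to their annotation maps:

```go
package main

import "fmt"

// snapshotTraceKey is a placeholder annotation key for the trace context
// persisted on the Snapshot; the real key is an implementation detail.
const snapshotTraceKey = "example.dev/traceparent"

// propagateFromSnapshot applies the propagation rule: every resource derived
// from the Snapshot (integration PipelineRuns, the Release CR) receives the
// Snapshot's trace context, so all of them join the same trace.
func propagateFromSnapshot(snapshot map[string]string, derived ...map[string]string) {
	tp, ok := snapshot[snapshotTraceKey]
	if !ok {
		return
	}
	for _, d := range derived {
		d[snapshotTraceKey] = tp
	}
}

func main() {
	snap := map[string]string{snapshotTraceKey: "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}
	plr, release := map[string]string{}, map[string]string{}
	propagateFromSnapshot(snap, plr, release)
	fmt.Println(plr[snapshotTraceKey] == release[snapshotTraceKey])
}
```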

# Heterogeneous snapshots and missing context

Some Snapshots may be heterogeneous (components built from different initiating events) or may lack a usable trace context (missing, invalid, or never seeded). In these cases, integration-service creates a new root span and injects its context onto the Snapshot for continuity.
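The "usable" check can be sketched as W3C traceparent validation; the format rules (hex layout, all-zero trace or parent IDs being invalid) come from the W3C Trace Context spec, while where the value is stored remains an implementation detail:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var traceparentRe = regexp.MustCompile(`^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$`)

// isUsableTraceContext reports whether a stored traceparent value can be
// propagated. Malformed values and all-zero trace or parent IDs (invalid
// per the W3C Trace Context spec) instead trigger creation of a new root.
func isUsableTraceContext(tp string) bool {
	if !traceparentRe.MatchString(tp) {
		return false
	}
	return tp[3:35] != strings.Repeat("0", 32) && tp[36:52] != strings.Repeat("0", 16)
}

func main() {
	fmt.Println(isUsableTraceContext("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
	fmt.Println(isUsableTraceContext("")) // missing context: create a new root
}
```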

# Timing visibility

Timing spans derived from resource lifecycle timestamps are parented under the propagated span context and decomposed into two phases: a wait phase (resource creation to execution start) and an execute phase (execution start to completion).

These timing spans are emitted for build, integration, and release PipelineRuns, making end-to-end delivery latency and per-stage breakdown directly visible from trace data.

Pre-execution timing captures delays as a single measurement. Finer-grained breakdown (e.g., queue vs. provisioning) is available through native scheduling metrics where applicable.

# Span attributes

All attributes required for per-namespace delivery latency analysis (including the namespace itself) are locally available at each timing span emission point.

No cross-service attribute propagation (e.g., OTel Baggage) is required for the current attribute set.
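Because every attribute is local, building the attribute set needs no coordination. A sketch with hypothetical attribute names; only the namespace and stage dimensions are implied by the per-namespace, per-stage analysis described above:

```go
package main

import "fmt"

// timingAttributes builds span attributes from information local to the
// emitting controller. The attribute names are hypothetical; namespace and
// stage follow from the per-namespace, per-stage latency analysis goals.
func timingAttributes(namespace, stage, pipelineRun string) map[string]string {
	return map[string]string{
		"konflux.namespace":   namespace,
		"konflux.stage":       stage, // build, integration, or release
		"konflux.pipelinerun": pipelineRun,
	}
}

func main() {
	fmt.Println(timingAttributes("team-a", "build", "build-plr-1")["konflux.namespace"])
}
```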

As the workload identity model evolves, attribute semantics will be updated to reflect the current model.

# Infrastructure Requirements

No new infrastructure is required beyond existing OTLP trace collection.

# Required Changes (by controller/component)

# PaC (pipelines-as-code)

PaC propagates trace context from inbound SCM events onto build PipelineRuns, creating a new root when no incoming context is present, and emits timing spans with required attributes for build PipelineRuns.

# Integration-service

Integration-service propagates the trace context across the Snapshot, PipelineRun, and Release CR chain, creates a new root when valid context is missing, and emits timing spans with required attributes for integration PipelineRuns.

# Release-service

Release-service propagates trace context from the Release CR onto release PipelineRuns and emits timing spans with required attributes. When a Release CR lacks trace context, release-service creates a new root span. This completes the end-to-end timing visibility from event receipt through release completion.

# Pros and Cons of Alternatives Considered

# Separate traces per stage

Not adopted because the primary goal is end-to-end delivery latency measurement, which requires a single trace spanning all stages.

# Span links instead of parent-child propagation

Not adopted because span links do not produce a navigable trace tree, making delivery latency analysis indirect and tool-dependent.

# Custom CRD fields for trace context

Not adopted because metadata-based propagation achieves the same result with no schema changes and reuses existing runtime mechanisms.

# Parallel trace context carriers on the same resource

Not adopted because a single trace context carrier per resource avoids ambiguity and redundant mechanisms.

# OTel Baggage for attribute propagation

Not adopted for current requirements. Can be reconsidered if future attributes that are not locally available need cross-service propagation.

# Linking of non-triggering build PipelineRuns in Snapshot

Not adopted because limits on the number of span links would force arbitrary sampling of the linked builds, making the links unreliable for metrics or navigation.

# Consequences

Reusing the runtime’s existing trace parent adoption mechanism for external trace propagation yields end-to-end trace continuity across controllers and clusters with minimal integration cost. It introduces controller responsibility to propagate trace context correctly, and provides a defined path for missing-context Snapshots by allowing integration-service to establish a new root. Any system that creates PipelineRuns can participate in distributed tracing by propagating trace context onto created PipelineRuns using the same pattern.

All attributes required for per-namespace delivery latency analysis are locally available at each timing span emission point. No cross-service attribute propagation is needed for the current attribute set.

Any future controller that creates PipelineRuns should follow the same propagation pattern: inject trace context onto created PipelineRuns, emit timing spans with the required attribute categories, and create a new root span when valid trace context is unavailable.