Observability

Telemetry, Metrics & Execution Transparency

Forge Pool exposes structured observability across the entire execution lifecycle.

Observability is not optional.

It exists to:

  • validate deterministic execution
  • detect anomalies
  • support replay and audit
  • monitor economic flow
  • enforce trust boundaries

Observability is divided into:

  • Control Plane Telemetry
  • Execution Plane Telemetry

1. Control Plane Observability

Control plane telemetry is exposed via HQ.

It includes:

  • project usage
  • job registry
  • credit accounting
  • identity events
  • policy configuration
  • token activity

Control plane answers:

  • Who executed?
  • Under what policy?
  • At what cost?
  • Under which identity?

2. Execution Plane Observability

Execution plane telemetry originates from:

  • Hub
  • Agents
  • Aggregation layer

It includes:

  • shard planning metadata
  • agent execution metrics
  • deterministic reduction logs
  • verification signals
  • replay metadata

Execution plane answers:

  • How was execution structured?
  • Which agents participated?
  • Was integrity preserved?
  • Can this run be replayed?

3. Job-Level Transparency

Each Job exposes:

  • job_id
  • Kernel workload (op.name, version, profile)
  • shard count
  • participating agents
  • execution duration
  • verification mode
  • replay seed
  • credit usage
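
These fields could be modeled as an immutable record. A minimal sketch, where field names and types are assumptions rather than the platform's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True mirrors job immutability after completion
class JobRecord:
    job_id: str
    op_name: str               # Kernel workload: op.name
    op_version: str
    profile: str
    shard_count: int
    agents: tuple[str, ...]    # participating agent IDs
    duration_ms: int
    verification_mode: str
    replay_seed: str
    credits_used: float
```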

In HQ → Jobs, you can inspect:

  • execution timeline
  • shard distribution
  • reduction summary
  • replay metadata
  • billing record

Jobs are immutable once completed.

Immutability is foundational to audit integrity.


4. Shard-Level Telemetry

Each shard reports:

  • execution duration
  • hardware class (CPU / GPU)
  • partial result size
  • result hash
  • verification participation

Shard telemetry enables:

  • anomaly detection
  • performance profiling
  • reliability scoring
  • corruption detection

Shard metadata is bound to job context.
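
A minimal sketch of a shard report and the hash comparison that makes corruption detectable; SHA-256 and the field names are assumptions, not the platform's documented scheme:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardReport:
    shard_id: str
    duration_ms: int
    hardware_class: str      # "CPU" or "GPU"
    result_size_bytes: int
    result_hash: str         # digest of the partial result
    verified: bool           # took part in verification

def hash_partial_result(payload: bytes) -> str:
    """Content-address a partial result so corruption or divergence
    shows up as a hash mismatch."""
    return hashlib.sha256(payload).hexdigest()

def diverged(reports: list[ShardReport]) -> bool:
    """Redundant executions of the same shard must produce identical hashes."""
    return len({r.result_hash for r in reports}) > 1
```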


5. Agent-Level Metrics

Providers can monitor their nodes in HQ → Providers → Nodes.

Available signals:

  • online status
  • heartbeat freshness
  • shard throughput
  • verification participation ratio
  • latency distribution
  • hardware classification
  • credits earned

Hub tracks:

  • historical reliability
  • correctness ratio
  • tail latency behavior
  • scheduling weight

Reliable nodes are prioritized.
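
For illustration, a heartbeat-freshness check of the kind a provider dashboard might apply; the 30-second threshold is an assumption, not a documented platform value:

```python
from datetime import datetime, timedelta, timezone

def heartbeat_fresh(last_heartbeat: datetime, max_age_s: int = 30) -> bool:
    """Classify a node as online if its last heartbeat is recent enough.
    Expects a timezone-aware timestamp; the threshold is illustrative."""
    return datetime.now(timezone.utc) - last_heartbeat <= timedelta(seconds=max_age_s)
```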


6. Scheduler & Tail Signals

Internally, Hub tracks:

  • queue depth
  • shard dispatch latency
  • tail latency outliers
  • rebalance events
  • agent health drift

These signals influence:

  • shard routing
  • verification intensity
  • workload distribution

These feedback signals prevent systemic skew in workload placement.
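
A sketch of one way tail-latency outliers could be flagged; the median-multiple heuristic is illustrative, not Hub's internal model:

```python
import statistics

def tail_outliers(dispatch_latencies_ms: list[float], factor: float = 3.0) -> list[float]:
    """Flag shard dispatch latencies far above the median as tail outliers.
    A simple heuristic: anything beyond factor x median is suspect."""
    median = statistics.median(dispatch_latencies_ms)
    return [x for x in dispatch_latencies_ms if x > factor * median]
```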


7. Replay Telemetry

Replay observability includes:

  • root seed
  • shard seed derivation
  • workload version binding
  • aggregation checksum
  • output hash

Replay metadata enables:

  • forensic reconstruction
  • regulatory defensibility
  • reproducibility verification

Replay telemetry is part of the execution artifact.
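
To show the shape of deterministic seed derivation, here is a sketch that binds the root seed, workload version, and shard index together; the actual derivation scheme is not specified in this document:

```python
import hashlib

def derive_shard_seed(root_seed: str, workload_version: str, shard_index: int) -> int:
    """Derive a per-shard seed deterministically from the job's root seed.
    Binding in the workload version means a replay against a different
    kernel version cannot silently reuse the same seeds."""
    material = f"{root_seed}:{workload_version}:{shard_index}".encode()
    return int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
```

Identical inputs always reproduce the same per-shard seed, which is what makes replay verification possible.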


8. Studio Run Observability

Each Studio Run includes:

  • flow version
  • graph hash
  • job IDs triggered
  • execution timestamps
  • artifact references
  • final output snapshot

Run history is version-bound.

Flow reproducibility depends on deterministic adapters.
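
A sketch of canonical graph hashing, assuming the flow graph serializes to JSON; the platform's real hashing scheme is not documented here:

```python
import hashlib
import json

def graph_hash(flow_graph: dict) -> str:
    """Hash a flow graph in canonical JSON form, so semantically identical
    graphs yield the same digest regardless of key ordering."""
    canonical = json.dumps(flow_graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```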


9. Credit & Economic Observability

Credits are recorded per:

  • job
  • shard
  • workload type
  • verification overhead
  • resource class

HQ exposes:

  • credit balance
  • historical burn rate
  • provider earnings
  • per-adapter usage breakdown

Economic observability aligns incentives with execution correctness.
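
A sketch of computing burn rate per workload type from exported records, assuming hypothetical "workload" and "credits" keys; the real export schema may differ:

```python
from collections import defaultdict

def burn_by_workload(job_records: list[dict]) -> dict[str, float]:
    """Aggregate credit usage per workload type from an export.
    The 'workload' and 'credits' field names are assumptions."""
    totals: defaultdict[str, float] = defaultdict(float)
    for record in job_records:
        totals[record["workload"]] += record["credits"]
    return dict(totals)
```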


10. Failure Visibility

When execution fails, Forge Pool records:

  • the error code
  • the failure reason
  • which shards completed partially
  • any verification divergence
  • the billing outcome

Clients should log:

  • job_id
  • full request payload
  • response payload
  • retry decision

Failure telemetry supports root cause analysis.
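
A minimal client-side sketch of that logging discipline; the logger name and record fields are assumptions, not a documented client API:

```python
import json
import logging

log = logging.getLogger("client")  # logger name is illustrative

def record_failure(job_id: str, request: dict, response: dict, will_retry: bool) -> None:
    """Emit one structured failure record, correlatable against
    the HQ job registry via job_id."""
    log.error(json.dumps({
        "job_id": job_id,
        "request": request,          # full request payload
        "response": response,        # full response payload
        "retry_decision": will_retry,
    }))
```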


11. Health & Reliability Model

Node health scoring considers:

  • shard completion ratio
  • verification consistency
  • latency stability
  • uptime consistency
  • resource reporting accuracy

Reliability influences:

  • scheduling weight
  • shard volume
  • earning potential

Health scoring reduces economic attack surfaces.
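
A sketch of folding those five signals into a single score; equal weighting and [0, 1] normalization are assumptions, since the platform's actual weights are not documented here:

```python
def health_score(completion_ratio: float, verification_consistency: float,
                 latency_stability: float, uptime: float,
                 reporting_accuracy: float) -> float:
    """Combine the five health signals (each normalized to [0, 1])
    into one score. Equal weights are an illustrative choice."""
    signals = (completion_ratio, verification_consistency,
               latency_stability, uptime, reporting_accuracy)
    return sum(signals) / len(signals)
```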


12. Time Filtering & Diagnostics

HQ supports filtering by:

  • time range
  • project
  • workload type
  • node
  • verification mode

This enables:

  • capacity planning
  • cost forecasting
  • anomaly investigation
  • deterministic replay analysis
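
For illustration, the same filter dimensions applied to a local export of job records; the "timestamp" and "project" field names are assumptions about the export schema:

```python
from datetime import datetime

def filter_jobs(jobs: list[dict], start: datetime, end: datetime,
                project: str | None = None) -> list[dict]:
    """Apply HQ-style time-range and project filters to exported records.
    Field names here are assumptions used for illustration."""
    matched = []
    for job in jobs:
        ts = datetime.fromisoformat(job["timestamp"])
        if start <= ts <= end and (project is None or job["project"] == project):
            matched.append(job)
    return matched
```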

13. Audit & Export Model

Enterprise environments may require:

  • job metadata export
  • replay artifact export
  • ledger export
  • verification logs
  • execution trace archives

Forge Pool supports audit-ready data structures.


Observability Philosophy

Distributed compute without transparency is unsafe.

Forge Pool exposes:

  • structural telemetry
  • deterministic replay metadata
  • shard integrity signals
  • economic traceability

Execution truth must be observable.

Observability is the foundation of distributed determinism.