tx

PRD-008: Observability & OpenTelemetry

OpenTelemetry integration for traces, metrics, and structured logging of task lifecycle events

Status: Draft
Priority: P1 (Should Have)
Owner: TBD
Last Updated: 2025-01-28


Problem Statement

Task management systems operate as black boxes without observability. When agents create, update, or complete tasks, there's no telemetry to answer:

  1. How many tasks are created per session? No metrics.
  2. What's the average time from creation to completion? No tracing.
  3. Are there error spikes in dependency operations? No alerting.
  4. Which MCP tools are called most frequently? No usage data.

We need OpenTelemetry (OTEL) integration that emits traces, metrics, and logs for all task lifecycle events, enabling monitoring, debugging, and usage analytics.


Target Users

| User Type | Primary Use | Value |
| --- | --- | --- |
| Platform Engineers | Monitor agent task throughput | Capacity planning |
| AI Agent Developers | Debug failed task operations | Faster debugging |
| Team Leads | Understand task completion patterns | Process improvement |
| SRE/DevOps | Alert on error rates | Reliability |

Goals

  1. Trace all task lifecycle events: create, update, delete, done, block, unblock
  2. Emit metrics: task counts, ready queue depth, completion latency, error rates
  3. Structured logging: Every operation logs context-rich structured data
  4. Zero-cost when disabled: OTEL is optional — no overhead when not configured
  5. Standard protocol: Use OpenTelemetry SDK for vendor-neutral export

Non-Goals

  • Custom dashboards or visualization (use Grafana, Jaeger, etc.)
  • Real-time streaming of events (batch export is fine)
  • Distributed tracing across multiple tx instances (single process)

Requirements

Tracing

| ID | Requirement | Event |
| --- | --- | --- |
| OT-001 | Span for task creation | task.create |
| OT-002 | Span for task update (includes status changes) | task.update |
| OT-003 | Span for task deletion | task.delete |
| OT-004 | Span for task completion (done) | task.done |
| OT-005 | Span for dependency operations | dependency.add, dependency.remove |
| OT-006 | Span for ready detection query | ready.query |
| OT-007 | Span for MCP tool execution | mcp.tool.<name> |
| OT-008 | Span for LLM operations (dedupe, compact) | llm.dedupe, llm.compact |
| OT-009 | Span for CLI command execution | cli.command.<name> |
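
As a rough illustration of the span requirements above, the sketch below wraps an operation in a named span and records whether it succeeded. It uses an in-memory recorder in place of the OpenTelemetry SDK so it runs without extra packages; `SpanRecord`, `withSpan`, `mcpToolSpanName`, and the tool name `tx_ready` are hypothetical names chosen for this example, not part of tx.

```typescript
// Hypothetical in-memory span record; a real build would delegate to the OpenTelemetry SDK.
interface SpanRecord {
  name: string;
  attributes: Record<string, string | number>;
  status: "ok" | "error";
  durationMs: number;
}

const recorded: SpanRecord[] = [];

// Runs `fn` inside a named span and records the outcome, including failures.
function withSpan<T>(
  name: string,
  attributes: Record<string, string | number>,
  fn: () => T,
): T {
  const start = Date.now();
  let status: "ok" | "error" = "ok";
  try {
    return fn();
  } catch (err) {
    status = "error";
    throw err; // the caller still sees the original error
  } finally {
    recorded.push({ name, attributes, status, durationMs: Date.now() - start });
  }
}

// Span names follow the Event column; dynamic names append the tool or command name.
const mcpToolSpanName = (tool: string): string => `mcp.tool.${tool}`;

// Example usage with placeholder IDs and a made-up tool name.
const created = withSpan("task.create", { "task.id": "tx-a1b2c3" }, () => "tx-a1b2c3");
const readyIds = withSpan(mcpToolSpanName("tx_ready"), {}, () => ["tx-a1b2c3"]);
```

Recording in a `finally` block means a span is emitted for failed operations too, which is what error-rate alerting would depend on.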

Metrics

| ID | Metric | Type | Description |
| --- | --- | --- | --- |
| OM-001 | task.created.total | Counter | Total tasks created |
| OM-002 | task.completed.total | Counter | Total tasks completed |
| OM-003 | task.deleted.total | Counter | Total tasks deleted |
| OM-004 | task.ready.count | Gauge | Current ready queue depth |
| OM-005 | task.completion.duration_ms | Histogram | Time from creation to done |
| OM-006 | dependency.cycle_detected.total | Counter | Circular dependency attempts |
| OM-007 | mcp.tool.calls.total | Counter | MCP tool invocations (by tool name) |
| OM-008 | mcp.tool.errors.total | Counter | MCP tool errors (by tool name) |
| OM-009 | llm.tokens.used.total | Counter | LLM tokens consumed |
| OM-010 | db.query.duration_ms | Histogram | Database query latency |
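
To make the three instrument types concrete, here is a minimal sketch of how a task completion might update a counter, a histogram, and a gauge. `MetricsSketch` and `onTaskDone` are hypothetical stand-ins written for this example; a real implementation would presumably use the OpenTelemetry metrics API instead of in-memory maps.

```typescript
// Hypothetical in-memory instruments standing in for OpenTelemetry Counter, Gauge, and Histogram.
class MetricsSketch {
  readonly counters = new Map<string, number>();
  readonly gauges = new Map<string, number>();
  readonly histograms = new Map<string, number[]>();

  // Counters only ever increase.
  increment(name: string, by = 1): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + by);
  }

  // Gauges hold the latest observed value.
  setGauge(name: string, value: number): void {
    this.gauges.set(name, value);
  }

  // Histograms keep every observation here; a real SDK would bucket them.
  record(name: string, value: number): void {
    const values = this.histograms.get(name) ?? [];
    values.push(value);
    this.histograms.set(name, values);
  }
}

const metrics = new MetricsSketch();

// Called when a task reaches done; timestamps are epoch milliseconds.
function onTaskDone(createdAtMs: number, doneAtMs: number, readyQueueDepth: number): void {
  metrics.increment("task.completed.total");
  metrics.record("task.completion.duration_ms", doneAtMs - createdAtMs);
  metrics.setGauge("task.ready.count", readyQueueDepth);
}

// Example: a task created at t=1000 ms and completed at t=4500 ms, leaving 2 ready tasks.
onTaskDone(1000, 4500, 2);
```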

Structured Logging

| ID | Requirement |
| --- | --- |
| OL-001 | All operations emit structured log entries with trace context |
| OL-002 | Log level: INFO for lifecycle events, WARN for degraded, ERROR for failures |
| OL-003 | Include task ID, operation type, duration, and outcome in every log |
| OL-004 | Correlate logs with OTEL trace IDs |
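
One possible shape for such a log entry is sketched below. The requirements only say which data must be present, so the field names, the `Outcome` values, and the choice of JSON as the output format are all assumptions made for illustration; the trace ID is a placeholder value.

```typescript
type LogLevel = "INFO" | "WARN" | "ERROR";
type Outcome = "ok" | "degraded" | "failed";

// Field names are illustrative; OL-003 and OL-004 only list the required data.
interface LogEntry {
  level: LogLevel;
  taskId: string;
  operation: string;
  durationMs: number;
  outcome: Outcome;
  traceId: string;
}

// OL-002: INFO for lifecycle events, WARN for degraded, ERROR for failures.
function levelFor(outcome: Outcome): LogLevel {
  if (outcome === "failed") return "ERROR";
  if (outcome === "degraded") return "WARN";
  return "INFO";
}

// Derives the level from the outcome and serializes the entry as one JSON line.
function formatLogEntry(fields: Omit<LogEntry, "level">): string {
  const entry: LogEntry = { level: levelFor(fields.outcome), ...fields };
  return JSON.stringify(entry);
}

const logLine = formatLogEntry({
  taskId: "tx-a1b2c3",
  operation: "task.done",
  durationMs: 12,
  outcome: "ok",
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736", // placeholder 32-hex-character trace ID
});
```

Deriving the level from the outcome, rather than letting each call site pick one, keeps OL-002 enforced in a single place.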

Configuration

| ID | Requirement |
| --- | --- |
| OC-001 | OTEL is disabled by default (zero overhead) |
| OC-002 | Enable via OTEL_EXPORTER_OTLP_ENDPOINT environment variable |
| OC-003 | Support console exporter for development (OTEL_EXPORTER=console) |
| OC-004 | Service name configurable via OTEL_SERVICE_NAME (default: tx) |
| OC-005 | No code changes needed to enable/disable; purely configuration |
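
The configuration rules can be read as a small decision function over the environment. The sketch below is one interpretation: `TelemetryConfig` and `resolveTelemetryConfig` are hypothetical names, and letting the OTLP endpoint take precedence when both variables are set is an assumption, since the requirements do not say which wins.

```typescript
type TelemetryConfig =
  | { mode: "disabled" }
  | { mode: "console"; serviceName: string }
  | { mode: "otlp"; endpoint: string; serviceName: string };

// Takes the environment as a plain record so the function stays pure;
// in Node, process.env could be passed in.
function resolveTelemetryConfig(env: Record<string, string | undefined>): TelemetryConfig {
  const serviceName = env["OTEL_SERVICE_NAME"] || "tx"; // OC-004 default
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  // Assumption: the OTLP endpoint wins if both variables are set.
  if (endpoint) return { mode: "otlp", endpoint, serviceName }; // OC-002
  if (env["OTEL_EXPORTER"] === "console") return { mode: "console", serviceName }; // OC-003
  return { mode: "disabled" }; // OC-001
}

// Example inputs covering the three modes.
const disabledConfig = resolveTelemetryConfig({});
const consoleConfig = resolveTelemetryConfig({ OTEL_EXPORTER: "console" });
const otlpConfig = resolveTelemetryConfig({
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://localhost:4318",
  OTEL_SERVICE_NAME: "tx-dev",
});
```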

Constraints

| ID | Constraint | Rationale |
| --- | --- | --- |
| OC-006 | OTEL packages are optional peer dependencies | Don't bloat core install |
| OC-007 | No performance impact when disabled | Check once at startup |
| OC-008 | Works with any OTLP-compatible backend | Vendor neutral |
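
One way to satisfy "check once at startup" is to pick between a shared no-op object and an active implementation a single time, so call sites never branch on configuration. The sketch below assumes a hypothetical `Telemetry` interface with one method; the active implementation only tallies counts in memory, where a real build might load the optional OTEL packages lazily at this point.

```typescript
// Hypothetical minimal telemetry surface used by services.
interface Telemetry {
  readonly enabled: boolean;
  count(metric: string): void;
}

// Shared no-op: every call returns immediately without doing any work.
const NOOP_TELEMETRY: Telemetry = {
  enabled: false,
  count: () => {},
};

// Stand-in for an SDK-backed implementation; it only tallies counts in memory.
function createActiveTelemetry(sink: Map<string, number>): Telemetry {
  return {
    enabled: true,
    count: (metric) => {
      sink.set(metric, (sink.get(metric) ?? 0) + 1);
    },
  };
}

// The environment is inspected exactly once; callers keep the returned object.
function initTelemetry(
  env: Record<string, string | undefined>,
  sink: Map<string, number>,
): Telemetry {
  const configured =
    Boolean(env["OTEL_EXPORTER_OTLP_ENDPOINT"]) || env["OTEL_EXPORTER"] === "console";
  return configured ? createActiveTelemetry(sink) : NOOP_TELEMETRY;
}
```

With this shape, service code calls `telemetry.count(...)` unconditionally, and the disabled path costs one empty function call.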

Span Attributes

Every span includes these base attributes:

| Attribute | Example | Description |
| --- | --- | --- |
| task.id | tx-a1b2c3 | Task identifier |
| task.status | ready | Task status after operation |
| task.score | 800 | Task score |
| task.parent_id | tx-parent1 | Parent task ID (if any) |
| operation.type | create | Operation name |
| interface.type | cli / mcp / api | Calling interface |
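
A small helper could build these attributes in one place so every span stays consistent. In the sketch below, `TaskSnapshot` and `baseSpanAttributes` are hypothetical names, and the task shape is an assumption that lists only the fields the attributes need.

```typescript
type InterfaceType = "cli" | "mcp" | "api";

// Hypothetical task shape; only the fields needed for span attributes are listed.
interface TaskSnapshot {
  id: string;
  status: string;
  score: number;
  parentId?: string;
}

// Builds the base attribute set for a span from the task state after the operation.
function baseSpanAttributes(
  task: TaskSnapshot,
  operation: string,
  via: InterfaceType,
): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    "task.id": task.id,
    "task.status": task.status,
    "task.score": task.score,
    "operation.type": operation,
    "interface.type": via,
  };
  // task.parent_id is only set when the task has a parent.
  if (task.parentId !== undefined) attrs["task.parent_id"] = task.parentId;
  return attrs;
}

// Example using the values from the table.
const exampleAttrs = baseSpanAttributes(
  { id: "tx-a1b2c3", status: "ready", score: 800, parentId: "tx-parent1" },
  "create",
  "cli",
);
```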

API Examples

Tracing a Task Lifecycle

Trace: task.create (tx-a1b2c3)
  ├── db.query (INSERT INTO tasks)
  └── otel.emit (task.created.total++)

Trace: dependency.add (tx-a1b2c3 blocked by tx-d4e5f6)
  ├── dependency.cycle_check
  └── db.query (INSERT INTO task_dependencies)

Trace: task.done (tx-a1b2c3)
  ├── db.query (UPDATE tasks SET status='done')
  ├── ready.cascade_check
  │   ├── ready.check (tx-g7h8i9) → now ready
  │   └── ready.check (tx-j0k1l2) → still blocked
  └── otel.emit (task.completed.total++, task.completion.duration_ms)
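
The outlines above are parent/child span relationships. As a rough sketch of that nesting, the code below builds the task.done tree with a hypothetical `startSpan` helper and renders it in the same outline style; `SpanNode`, `startSpan`, and `renderTrace` are invented for this example and do not reflect the OpenTelemetry API.

```typescript
interface SpanNode {
  name: string;
  children: SpanNode[];
}

// Creates a span and, if a parent is given, attaches it as a child.
function startSpan(name: string, parent?: SpanNode): SpanNode {
  const span: SpanNode = { name, children: [] };
  parent?.children.push(span);
  return span;
}

// Renders a span tree using box-drawing characters, one span per line.
function renderTrace(root: SpanNode): string {
  const lines = [`Trace: ${root.name}`];
  const walk = (nodes: SpanNode[], prefix: string): void => {
    nodes.forEach((node, i) => {
      const last = i === nodes.length - 1;
      lines.push(`${prefix}${last ? "└── " : "├── "}${node.name}`);
      walk(node.children, prefix + (last ? "    " : "│   "));
    });
  };
  walk(root.children, "  ");
  return lines.join("\n");
}

// Rebuild the task.done trace: the cascade check owns one child span per dependent task.
const doneSpan = startSpan("task.done (tx-a1b2c3)");
startSpan("db.query", doneSpan);
const cascade = startSpan("ready.cascade_check", doneSpan);
startSpan("ready.check (tx-g7h8i9)", cascade);
startSpan("ready.check (tx-j0k1l2)", cascade);
startSpan("otel.emit", doneSpan);
const outline = renderTrace(doneSpan);
```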

Environment Configuration

# Development: log to console
export OTEL_EXPORTER=console

# Production: send to OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=tx

# Disable (default)
# Just don't set any OTEL_ variables

Integration Points

| Component | OTEL Integration |
| --- | --- |
| TaskService | Spans for create, get, update, delete |
| ReadyService | Spans for getReady; gauge for queue depth |
| DependencyService | Spans for add/remove; counter for cycle detection |
| CompactionService | Spans for compact; counter for tokens used |
| CLI Commands | Root span per command execution |
| MCP Tools | Root span per tool call |
| SQLite queries | Child spans for database operations |
