PRD-008: Observability & OpenTelemetry
OpenTelemetry integration for traces, metrics, and structured logging of task lifecycle events
Status: Draft
Priority: P1 (Should Have)
Owner: TBD
Last Updated: 2025-01-28
Task management systems operate as black boxes without observability. When agents create, update, or complete tasks, there's no telemetry to answer:
How many tasks are created per session? No metrics.
What's the average time from creation to completion? No tracing.
Are there error spikes in dependency operations? No alerting.
Which MCP tools are called most frequently? No usage data.
We need OpenTelemetry (OTEL) integration that emits traces, metrics, and logs for all task lifecycle events, enabling monitoring, debugging, and usage analytics.
| User Type | Primary Use | Value |
| --- | --- | --- |
| Platform Engineers | Monitor agent task throughput | Capacity planning |
| AI Agent Developers | Debug failed task operations | Faster debugging |
| Team Leads | Understand task completion patterns | Process improvement |
| SRE/DevOps | Alert on error rates | Reliability |
Trace all task lifecycle events: create, update, delete, done, block, unblock
Emit metrics: task counts, ready queue depth, completion latency, error rates
Structured logging: every operation logs context-rich structured data
Zero-cost when disabled: OTEL is optional, with no overhead when not configured
Standard protocol: use the OpenTelemetry SDK for vendor-neutral export
Custom dashboards or visualization (use Grafana, Jaeger, etc.)
Real-time streaming of events (batch export is fine)
Distributed tracing across multiple tx instances (scope is a single process)
| ID | Requirement | Event |
| --- | --- | --- |
| OT-001 | Span for task creation | `task.create` |
| OT-002 | Span for task update (includes status changes) | `task.update` |
| OT-003 | Span for task deletion | `task.delete` |
| OT-004 | Span for task completion (done) | `task.done` |
| OT-005 | Span for dependency operations | `dependency.add`, `dependency.remove` |
| OT-006 | Span for ready detection query | `ready.query` |
| OT-007 | Span for MCP tool execution | `mcp.tool.<name>` |
| OT-008 | Span for LLM operations (dedupe, compact) | `llm.dedupe`, `llm.compact` |
| OT-009 | Span for CLI command execution | `cli.command.<name>` |
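The span requirements above can be illustrated with a minimal sketch. It uses an in-memory recorder as a stand-in for a real tracer, so it is not the OpenTelemetry API; `SpanNames`, `withSpan`, and `recorded` are hypothetical names introduced only for this example, and only synchronous operations are covered.

```typescript
// Stand-in for a tracer: records finished spans in memory.
// This is NOT the OpenTelemetry API; all names here are hypothetical.
type SpanStatus = "ok" | "error";

interface RecordedSpan {
  name: string;
  status: SpanStatus;
  durationMs: number;
  error?: string;
}

// Span names taken from the requirements table (OT-001 to OT-009).
// The two templated names take the tool or command name as a suffix.
const SpanNames = {
  taskCreate: "task.create",
  taskUpdate: "task.update",
  taskDelete: "task.delete",
  taskDone: "task.done",
  dependencyAdd: "dependency.add",
  dependencyRemove: "dependency.remove",
  readyQuery: "ready.query",
  llmDedupe: "llm.dedupe",
  llmCompact: "llm.compact",
  mcpTool: (tool: string): string => `mcp.tool.${tool}`,
  cliCommand: (command: string): string => `cli.command.${command}`,
};

const recorded: RecordedSpan[] = [];

// Runs fn inside a named span, recording duration and outcome.
// Errors are recorded on the span and then rethrown to the caller.
function withSpan<T>(name: string, fn: () => T): T {
  const start = Date.now();
  try {
    const result = fn();
    recorded.push({ name, status: "ok", durationMs: Date.now() - start });
    return result;
  } catch (err) {
    recorded.push({
      name,
      status: "error",
      durationMs: Date.now() - start,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```

Keeping the span names in one place makes it harder for call sites to drift from the names listed in the table.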
| ID | Metric | Type | Description |
| --- | --- | --- | --- |
| OM-001 | `task.created.total` | Counter | Total tasks created |
| OM-002 | `task.completed.total` | Counter | Total tasks completed |
| OM-003 | `task.deleted.total` | Counter | Total tasks deleted |
| OM-004 | `task.ready.count` | Gauge | Current ready queue depth |
| OM-005 | `task.completion.duration_ms` | Histogram | Time from creation to done |
| OM-006 | `dependency.cycle_detected.total` | Counter | Circular dependency attempts |
| OM-007 | `mcp.tool.calls.total` | Counter | MCP tool invocations (by tool name) |
| OM-008 | `mcp.tool.errors.total` | Counter | MCP tool errors (by tool name) |
| OM-009 | `llm.tokens.used.total` | Counter | LLM tokens consumed |
| OM-010 | `db.query.duration_ms` | Histogram | Database query latency |
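To make the three instrument types concrete, here is a small in-memory sketch of a counter, a gauge, and a histogram wired to a few of the metric names above. It is an illustration under stated assumptions, not the OpenTelemetry metrics API: the classes, the `tool` label key, and `recordTaskDone` are all hypothetical.

```typescript
type Labels = Record<string, string>;

// Builds a stable key so that label order does not create separate series.
function labelKey(labels: Labels): string {
  return Object.keys(labels)
    .sort()
    .map((k) => `${k}=${labels[k]}`)
    .join(",");
}

// Monotonic counter, optionally split by labels (e.g. by tool name).
class Counter {
  private values = new Map<string, number>();
  constructor(readonly name: string) {}
  add(amount: number = 1, labels: Labels = {}): void {
    const key = labelKey(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }
  get(labels: Labels = {}): number {
    return this.values.get(labelKey(labels)) ?? 0;
  }
}

// Gauge: holds the most recently observed value.
class Gauge {
  private value = 0;
  constructor(readonly name: string) {}
  set(value: number): void {
    this.value = value;
  }
  get(): number {
    return this.value;
  }
}

// Histogram: keeps raw samples; a real exporter would bucket them.
class Histogram {
  private samples: number[] = [];
  constructor(readonly name: string) {}
  record(value: number): void {
    this.samples.push(value);
  }
  count(): number {
    return this.samples.length;
  }
  mean(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
  }
}

const tasksCompleted = new Counter("task.completed.total");
const readyQueueDepth = new Gauge("task.ready.count");
const completionDuration = new Histogram("task.completion.duration_ms");
const toolCalls = new Counter("mcp.tool.calls.total");

// Hypothetical hook called when a task transitions to done (OM-002, OM-005).
function recordTaskDone(createdAtMs: number, doneAtMs: number): void {
  tasksCompleted.add(1);
  completionDuration.record(doneAtMs - createdAtMs);
}
```

Recording the creation-to-done interval at completion time is what would let a backend answer the "average time from creation to completion" question directly from `task.completion.duration_ms`.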
| ID | Requirement |
| --- | --- |
| OL-001 | All operations emit structured log entries with trace context |
| OL-002 | Log level: INFO for lifecycle events, WARN for degraded, ERROR for failures |
| OL-003 | Include task ID, operation type, duration, and outcome in every log |
| OL-004 | Correlate logs with OTEL trace IDs |
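One possible shape for a log entry that satisfies OL-001 through OL-004 is sketched below. The field names (`task_id`, `duration_ms`, `trace_id`), the three outcome values, and the JSON-lines output are assumptions made for illustration; the requirements only fix which information must be present and how levels map to outcomes.

```typescript
type LogLevel = "INFO" | "WARN" | "ERROR";
type Outcome = "success" | "degraded" | "failure";

// Hypothetical entry shape covering the fields required by OL-003 and OL-004.
interface OperationLog {
  level: LogLevel;
  timestamp: string;
  task_id: string;
  operation: string;
  duration_ms: number;
  outcome: Outcome;
  trace_id: string;
}

// OL-002: INFO for lifecycle events, WARN for degraded, ERROR for failures.
function levelFor(outcome: Outcome): LogLevel {
  switch (outcome) {
    case "success":
      return "INFO";
    case "degraded":
      return "WARN";
    case "failure":
      return "ERROR";
  }
}

// Builds one entry; `now` is injectable so output is deterministic in tests.
function buildOperationLog(
  fields: {
    taskId: string;
    operation: string;
    durationMs: number;
    outcome: Outcome;
    traceId: string;
  },
  now: Date = new Date(),
): OperationLog {
  return {
    level: levelFor(fields.outcome),
    timestamp: now.toISOString(),
    task_id: fields.taskId,
    operation: fields.operation,
    duration_ms: fields.durationMs,
    outcome: fields.outcome,
    trace_id: fields.traceId,
  };
}

// Serializes an entry as a single JSON line.
function toJsonLine(entry: OperationLog): string {
  return JSON.stringify(entry);
}
```

Carrying the trace ID on every entry is what allows a backend to join a log line to the span that produced it.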
| ID | Requirement |
| --- | --- |
| OC-001 | OTEL is disabled by default (zero overhead) |
| OC-002 | Enable via `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable |
| OC-003 | Support console exporter for development (`OTEL_EXPORTER=console`) |
| OC-004 | Service name configurable via `OTEL_SERVICE_NAME` (default: `tx`) |
| OC-005 | No code changes needed to enable/disable; purely configuration |
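The configuration rules above could be resolved from the environment roughly as follows. This is a sketch under assumptions: `resolveOtelConfig` and the `OtelConfig` type are hypothetical, and the precedence when both variables are set (the OTLP endpoint wins) is a choice made here, not something the requirements specify.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical resolved configuration: disabled, console, or OTLP export.
type OtelConfig =
  | { enabled: false }
  | { enabled: true; exporter: "console"; serviceName: string }
  | { enabled: true; exporter: "otlp"; endpoint: string; serviceName: string };

// Resolves telemetry configuration purely from environment variables (OC-005).
function resolveOtelConfig(env: Env): OtelConfig {
  // OC-004: service name defaults to "tx".
  const serviceName = (env["OTEL_SERVICE_NAME"] ?? "").trim() || "tx";

  // OC-002: a non-empty OTLP endpoint enables export.
  // Assumption: the endpoint takes precedence over OTEL_EXPORTER=console.
  const endpoint = (env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? "").trim();
  if (endpoint !== "") {
    return { enabled: true, exporter: "otlp", endpoint, serviceName };
  }

  // OC-003: console exporter for development.
  const exporter = (env["OTEL_EXPORTER"] ?? "").trim().toLowerCase();
  if (exporter === "console") {
    return { enabled: true, exporter: "console", serviceName };
  }

  // OC-001: disabled by default when no OTEL_ variables are set.
  return { enabled: false };
}
```

Passing the environment in as an argument, for example `resolveOtelConfig(process.env)` in Node.js, keeps the function pure and easy to test.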
| ID | Constraint | Rationale |
| --- | --- | --- |
| OC-006 | OTEL packages are optional peer dependencies | Don't bloat core install |
| OC-007 | No performance impact when disabled | Check once at startup |
| OC-008 | Works with any OTLP-compatible backend | Vendor neutral |
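The "check once at startup" rationale in OC-007 suggests a facade that is selected a single time and is a no-op when telemetry is off. The sketch below shows that pattern with hypothetical names (`Telemetry`, `noopTelemetry`, `createTelemetry`). The enabled branch here only records in memory; in a real build it is where the optional OTEL packages from OC-006 would presumably be loaded.

```typescript
// Hypothetical facade that the rest of the code base would depend on.
interface Telemetry {
  span<T>(name: string, fn: () => T): T;
  count(metric: string, amount?: number): void;
}

// Disabled path: calls the wrapped function directly and records nothing.
const noopTelemetry: Telemetry = {
  span<T>(_name: string, fn: () => T): T {
    return fn();
  },
  count(_metric: string, _amount?: number): void {
    // Intentionally empty.
  },
};

interface RecordingTelemetry extends Telemetry {
  spans: string[];
  counts: Map<string, number>;
}

// Enabled path (stand-in): records span names and counter totals in memory.
function createRecordingTelemetry(): RecordingTelemetry {
  const spans: string[] = [];
  const counts = new Map<string, number>();
  return {
    spans,
    counts,
    span<T>(name: string, fn: () => T): T {
      spans.push(name);
      return fn();
    },
    count(metric: string, amount: number = 1): void {
      counts.set(metric, (counts.get(metric) ?? 0) + amount);
    },
  };
}

// The enabled/disabled decision is made once, at startup. Call sites never
// branch on configuration; they just use whichever facade they were given.
function createTelemetry(enabled: boolean): Telemetry {
  return enabled ? createRecordingTelemetry() : noopTelemetry;
}
```

With this shape, the disabled cost at each call site is one extra function call, and no exporter code needs to be imported unless telemetry is switched on.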
Every span includes these base attributes:
| Attribute | Example | Description |
| --- | --- | --- |
| `task.id` | `tx-a1b2c3` | Task identifier |
| `task.status` | `ready` | Task status after operation |
| `task.score` | `800` | Task score |
| `task.parent_id` | `tx-parent1` | Parent task ID (if any) |
| `operation.type` | `create` | Operation name |
| `interface.type` | `cli` / `mcp` / `api` | Calling interface |
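A helper that builds these base attributes in one place might look like the sketch below. The `TaskSnapshot` shape and the `baseSpanAttributes` function are hypothetical; only the attribute keys and the three interface values come from the table above.

```typescript
type InterfaceType = "cli" | "mcp" | "api";

// Hypothetical view of a task after the operation has been applied.
interface TaskSnapshot {
  id: string;
  status: string;
  score: number;
  parentId?: string | null;
}

type SpanAttributes = Record<string, string | number>;

// Builds the base attributes attached to every span.
// task.parent_id is omitted when the task has no parent.
function baseSpanAttributes(
  task: TaskSnapshot,
  operationType: string,
  interfaceType: InterfaceType,
): SpanAttributes {
  const attrs: SpanAttributes = {
    "task.id": task.id,
    "task.status": task.status,
    "task.score": task.score,
    "operation.type": operationType,
    "interface.type": interfaceType,
  };
  if (task.parentId) {
    attrs["task.parent_id"] = task.parentId;
  }
  return attrs;
}
```

Omitting `task.parent_id` for root tasks, rather than emitting an empty string, keeps backend queries such as "spans with a parent task" simple; this is a design choice of the sketch, not a stated requirement.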
Trace: task.create (tx-a1b2c3)
├── db.query (INSERT INTO tasks)
└── otel.emit (task.created.total++)
Trace: dependency.add (tx-a1b2c3 blocked by tx-d4e5f6)
├── dependency.cycle_check
└── db.query (INSERT INTO task_dependencies)
Trace: task.done (tx-a1b2c3)
├── db.query (UPDATE tasks SET status='done')
├── ready.cascade_check
│ ├── ready.check (tx-g7h8i9) → now ready
│ └── ready.check (tx-j0k1l2) → still blocked
└── otel.emit (task.completed.total++, task.completion.duration_ms)
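The `ready.cascade_check` step in the `task.done` trace can be sketched as follows. This assumes a task becomes ready once every task blocking it is done, which the trace implies but does not state; the `blockedBy` map and `readyCascadeCheck` function are hypothetical and stand in for whatever the dependency store actually provides. Each returned result corresponds to one `ready.check` child span.

```typescript
type TaskId = string;

interface CascadeResult {
  taskId: TaskId;
  nowReady: boolean;
}

// Checks every task that was blocked by `completed` and reports whether it
// is now ready. `blockedBy` maps a task to the tasks that block it; `done`
// holds tasks completed before this operation.
function readyCascadeCheck(
  completed: TaskId,
  blockedBy: Map<TaskId, TaskId[]>,
  done: Set<TaskId>,
): CascadeResult[] {
  // Treat the just-completed task as done without mutating the caller's set.
  const doneAfter = new Set(done);
  doneAfter.add(completed);

  const results: CascadeResult[] = [];
  blockedBy.forEach((blockers, taskId) => {
    // Only dependents of the completed task need a ready.check.
    if (!blockers.includes(completed)) return;
    results.push({
      taskId,
      nowReady: blockers.every((blocker) => doneAfter.has(blocker)),
    });
  });
  return results;
}

// Mirrors the trace above: tx-g7h8i9 is blocked only by tx-a1b2c3, while
// tx-j0k1l2 is also blocked by tx-d4e5f6, which is not done yet.
const exampleBlockedBy = new Map<TaskId, TaskId[]>([
  ["tx-g7h8i9", ["tx-a1b2c3"]],
  ["tx-j0k1l2", ["tx-a1b2c3", "tx-d4e5f6"]],
]);
const exampleResults = readyCascadeCheck("tx-a1b2c3", exampleBlockedBy, new Set<TaskId>());
```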
# Development: log to console
export OTEL_EXPORTER=console
# Production: send to OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=tx
# Disable (default)
# Just don't set any OTEL_ variables
| Component | OTEL Integration |
| --- | --- |
| TaskService | Spans for create, get, update, delete |
| ReadyService | Spans for getReady; gauge for queue depth |
| DependencyService | Spans for add/remove; counter for cycle detection |
| CompactionService | Spans for compact; counter for tokens used |
| CLI Commands | Root span per command execution |
| MCP Tools | Root span per tool call |
| SQLite queries | Child spans for database operations |