Observability

Pathrule • 2 Rules • 2 Memories • 1 Skill

A ready-to-use bundle of rules, memories, and a review checklist for instrumenting services with OpenTelemetry. It encodes 2026 best practices: OTLP-first telemetry, structured JSON logs carrying trace and span ids, low-cardinality metrics, and SLO burn-rate alerts, so your agent stops generating print-statement debugging and noisy page-on-everything alerting.

Suggested path map

Pathrule places each piece on the matching path, so your assistant only sees it where it belongs. This is the scoping you get on import; you can adjust it in your workspace.

/ (workspace root)
  observability-review
  src/
    Log structured JSON carrying trace and span ids
    Keep metric attributes low-cardinality
    observability/
      OpenTelemetry SDK setup: OTLP export, signals, and propagation
  deploy/
    SLOs and multi-window burn-rate alerts

Rules (2)

Log structured JSON carrying trace and span ids
/src • high • strict
Every log line is structured and correlated to its active span.

Emit structured JSON through one logger and let OpenTelemetry inject the active trace context, so any log line jumps straight to its trace.

- Never use `console.log` or `print` for application logging; route everything through a single structured logger (`pino`, `winston`, `structlog`) wired to the OTel logs bridge.
- Include `trace_id` and `span_id` on every record from the active span context, plus `service.name` and a severity that maps to the OTel `SeverityNumber`.
- Attach business identifiers (`user.id`, `order.id`, `request.id`) as discrete fields, never by string-concatenating them into the message.
- Do not log secrets, tokens, or full PII; redact at the logger, not at the call site.
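
The shape of a record that satisfies this rule can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not the OpenTelemetry logs bridge itself: `current_span` is a hypothetical stand-in for the active span context that the SDK would normally supply, and `JsonFormatter`, `make_logger`, the `fields` key, and the `REDACTED_KEYS` deny-list are names invented for this example.

```python
import io
import json
import logging
from contextvars import ContextVar

# Hypothetical stand-in for the active span context. In a real service the
# OpenTelemetry SDK supplies these ids; a ContextVar is used here only so the
# sketch runs without third-party packages.
current_span = ContextVar("current_span", default=None)

# Python level name -> OTel SeverityNumber (lowest value of each OTel range).
SEVERITY_NUMBER = {"DEBUG": 5, "INFO": 9, "WARNING": 13, "ERROR": 17, "CRITICAL": 21}

# Assumed deny-list; redaction happens in the logger, not at the call site.
REDACTED_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with trace correlation fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        span = current_span.get() or {}
        payload = {
            "message": record.getMessage(),
            "severity_text": record.levelname,
            "severity_number": SEVERITY_NUMBER.get(record.levelname, 0),
            "service.name": self.service_name,
            "trace_id": span.get("trace_id"),
            "span_id": span.get("span_id"),
        }
        # Business identifiers arrive as discrete fields, never inside the message.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(payload)


def make_logger(stream, service_name):
    """Build the single application logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(f"app.{service_name}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger(sys.stdout, "checkout").info("order placed", extra={"fields": {"order.id": "o-42"}})` then emits one JSON line whose `trace_id` and `span_id` come from whatever `current_span` holds at that moment.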

Keep metric attributes low-cardinality
/src • high • advisory
Never attach unbounded ids to metric attributes.

Each distinct combination of attribute values creates its own time series, so only attach bounded, enumerable values.

- Allowed attributes are bounded sets: `http.route` (the templated path, not the raw URL), `http.response.status_code`, `service.name`, region, and environment.
- Never attach user ids, session ids, request ids, raw URLs, or error messages as metric attributes; carry those on spans and logs instead.
- Follow stable OpenTelemetry semantic conventions for attribute names (`http.request.method`, `http.route`) so cross-service dashboards and SLO queries work without per-team mapping.
- Drop or aggregate unwanted attributes at the source with SDK Views, or in the Collector, before they are ever exported.
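
To see why unbounded attributes hurt, the following plain-Python sketch counts the time series implied by a set of samples before and after an allow-list filter. It only illustrates the idea behind dropping attributes; it is not the SDK Views API. `ALLOWED_METRIC_ATTRIBUTES`, `metric_attributes`, and `series_count` are names made up for this example, and the `region` and `environment` keys are placeholders for whatever names your services use.

```python
# Assumed allow-list of bounded attribute keys; "region" and "environment"
# are placeholder names, the rest follow OTel HTTP semantic conventions.
ALLOWED_METRIC_ATTRIBUTES = frozenset({
    "http.route",
    "http.request.method",
    "http.response.status_code",
    "service.name",
    "region",
    "environment",
})


def metric_attributes(attributes):
    """Keep only allow-listed keys; everything else belongs on spans and logs."""
    return {k: v for k, v in attributes.items() if k in ALLOWED_METRIC_ATTRIBUTES}


def series_count(samples):
    """Count distinct attribute combinations, i.e. distinct time series."""
    return len({tuple(sorted(attrs.items())) for attrs in samples})


# 1000 requests to the same templated route, each tagged with a unique user id.
raw_samples = [
    {"http.route": "/orders/{id}", "http.response.status_code": 200, "user.id": f"u-{i}"}
    for i in range(1000)
]
filtered_samples = [metric_attributes(attrs) for attrs in raw_samples]
```

With the `user.id` attribute attached, `series_count(raw_samples)` grows with the number of users; after filtering, the same traffic collapses into a single series for the route and status code.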

Memories (2)

OpenTelemetry SDK setup: OTLP export, signals, and propagation
/src/observability
Initialize one SDK that exports traces, metrics, and logs over OTLP.

OpenTelemetry is the default instrumentation layer in 2026, with traces, metrics, and logs all stable and shipped over the OTLP wire protocol. Continuous profiling is the fourth signal, in release-candidate status.

- Initialize the SDK once at process start, before any other import, so auto-instrumentation can patch libraries; in Node use `@opentelemetry/sdk-node` with `getNodeAutoInstrumentations()` from `@opentelemetry/auto-instrumentations-node`.
- Export all three signals over OTLP (gRPC `4317` or HTTP `4318`) to a local OpenTelemetry Collector, not directly to a vendor; the Collector handles batching, retries, and re-routing.
- Set `service.name`, `service.version`, and `deployment.environment` (renamed `deployment.environment.name` in newer semantic conventions) as Resource attributes so every signal is attributable.
- Use the W3C Trace Context propagator (the default, carried in the `traceparent` header) so trace context flows across HTTP, gRPC, and message queues; do not hand-roll correlation headers.
- LLM and agent calls have a `gen_ai` semantic-convention group (still experimental) covering `gen_ai.request.model` and token-usage attributes; use it for AI pipelines rather than inventing attribute names.
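
One language-neutral way to express the setup above is through the standard `OTEL_*` environment variables defined by the OpenTelemetry specification. The sketch below only builds that environment as a dictionary. The helper name `otel_env` and its parameters are invented for this example, it assumes a Collector reachable on `localhost`, and how completely a given SDK honors each variable depends on its language and version.

```python
# Default OTLP ports for the two transports named in the setup notes.
OTLP_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otel_env(service_name, service_version, environment,
             protocol="grpc", collector_host="localhost"):
    """Return OTEL_* environment variables for exporting to a local Collector.

    Values are not percent-encoded, so this sketch assumes they contain
    no commas or equals signs.
    """
    if protocol not in OTLP_PORTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    resource = {
        "service.version": service_version,
        "deployment.environment": environment,
    }
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": ",".join(f"{k}={v}" for k, v in resource.items()),
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{collector_host}:{OTLP_PORTS[protocol]}",
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        # W3C Trace Context plus baggage, i.e. the default propagators.
        "OTEL_PROPAGATORS": "tracecontext,baggage",
    }
```

Merging `otel_env("checkout", "1.4.2", "production")` into the process environment before the SDK starts keeps endpoint, resource, and propagation settings out of application code, so the same build can point at a different Collector per environment.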

SLOs and multi-window burn-rate alerts
/deploy
Alert on error-budget burn rate, page only on user impact.

Define SLOs on user-facing symptoms (availability, latency) and alert on how fast you burn the error budget, not on raw resource thresholds. This is the Google SRE multi-window, multi-burn-rate approach.

- Pick SLIs that reflect user experience: success ratio of requests and a latency percentile (for example p95 under 300ms); set a realistic objective like 99.9% over 30 days.
- Page when burn rate is greater than 14.4 over a 1-hour window (about 2% of a 30-day budget consumed in an hour) and the short window confirms it is still burning now.
- Open a ticket (no page) when burn rate is greater than 6 over a 6-hour window, and surface slow burns (greater than 1 over 3 days) in weekly review.
- Each alert pairs a long detection window with a short confirmation window so resolved incidents stop paging; instrument RED metrics (Rate, Errors, Duration) for request services and USE (Utilization, Saturation, Errors) for resources to power these SLIs.
- Every page must be actionable and link to a runbook; if an alert cannot be acted on, it is a dashboard, not a page.
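
The arithmetic behind these thresholds is short enough to check by hand. The sketch below computes burn rate and budget consumption, then applies the page and ticket thresholds from the list above. The function names are invented for this example, and the 5-minute and 30-minute confirmation windows are assumptions borrowed from the common one-twelfth-of-the-long-window convention; the text above does not fix their length.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% objective
    return error_ratio / error_budget


def budget_consumed(rate, window_hours, slo_period_days=30):
    """Fraction of the whole period's error budget burned within the window."""
    return rate * window_hours / (slo_period_days * 24)


def classify(burn_1h, burn_5m, burn_6h, burn_30m):
    """Apply the thresholds; each long window needs its short window to agree.

    The 5m and 30m confirmation windows are assumed, not taken from the text.
    """
    if burn_1h > 14.4 and burn_5m > 14.4:
        return "page"
    if burn_6h > 6 and burn_30m > 6:
        return "ticket"
    return "ok"
```

Here `budget_consumed(14.4, 1)` works out to 14.4 / 720 = 0.02, the 2% figure quoted for the paging threshold, and `budget_consumed(6, 6)` gives 0.05. A resolved incident shows up as a high long-window rate with a low short-window rate, which `classify` treats as `"ok"` so the page stops.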

Skills (1)

observability-review
/ (root)
Pre-merge checklist for new or changed instrumentation, logging, and alerts.

---
name: observability-review
description: Review checklist for service observability covering structured correlated logs, OpenTelemetry trace and metric instrumentation, OTLP export, low-cardinality metrics, and SLO burn-rate alerts. Run before merging any telemetry, logging, or alerting change.
---

# Observability review

- [ ] All application logs go through one structured logger emitting JSON; no `console.log`/`print` for app logging.
- [ ] Every log record carries `trace_id` and `span_id` from the active span context, plus `service.name` and a mapped OTel severity.
- [ ] Secrets, tokens, and PII are redacted at the logger; business identifiers are discrete fields, not embedded in the message string.
- [ ] The OpenTelemetry SDK is initialized once before other imports, with `service.name`, `service.version`, and `deployment.environment` (renamed `deployment.environment.name` in newer semantic conventions) set on the Resource.
- [ ] Traces, metrics, and logs export over OTLP to a Collector, not directly to a vendor backend.
- [ ] W3C `traceparent` propagation is used across HTTP, gRPC, and queues; no hand-rolled correlation headers.
- [ ] Metric attributes are low-cardinality and follow semantic conventions (`http.route`, `http.request.method`); no user ids, request ids, or raw URLs as attributes.
- [ ] High-cardinality attributes are dropped or aggregated via SDK Views or the Collector before export.
- [ ] New alerts are tied to an SLO and fire on multi-window burn rate (page at >14.4 / 1h, ticket at >6 / 6h), not raw CPU or error counts.
- [ ] Every paging alert is actionable and links to a runbook; non-actionable signals are dashboards, not pages.

Why this pattern

AI agents reach for ad-hoc console logs and unbounded custom metrics, producing telemetry that cannot be correlated across signals and alerts that page on noise instead of user impact.

Built for backend and platform teams running production services who own their on-call and SLOs.

Keeps your assistant from:

  • Plain-text or console logs with no trace_id, so a log line can never be tied back to the request that produced it
  • High-cardinality metric attributes (user ids, raw URLs, request ids) that explode storage cost and degrade queries
  • Threshold alerts on raw CPU or error counts that page constantly without reflecting actual user impact

License: Apache-2.0
Version: 1.0.0
Updated: 2026-06-09