# Pathrule Pattern: Observability (1.0.0)
# ::pathrule:package:observability

### [RULE] Log structured JSON carrying trace and span ids  (path: /src)
<!-- scope: folder | priority: high | strict -->

Emit structured JSON through one logger and let OpenTelemetry inject the active trace context, so any log line jumps straight to its trace.

- Never use `console.log` or `print` for application logging; route everything through a single structured logger (`pino`, `winston`, `structlog`) wired to the OTel logs bridge.
- Include the active span context's `trace_id` and `span_id` on every record, plus `service.name` and a severity that maps to the OTel `SeverityNumber`.
- Attach business identifiers (`user.id`, `order.id`, `request.id`) as discrete fields, never by string-concatenating them into the message.
- Do not log secrets, tokens, or full PII; redact at the logger, not at the call site.
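
The bullets above can be sketched with the Python standard library alone. This is a minimal illustration, not a production logger: `JsonTraceFormatter`, `build_logger`, `REDACTED_KEYS`, and the `get_trace_ids` callable are hypothetical names, and the callable stands in for reading the active span context from OpenTelemetry.

```python
import io
import json
import logging

# First value of each OTel SeverityNumber range (DEBUG, INFO, WARN, ERROR, FATAL).
SEVERITY_NUMBER = {"DEBUG": 5, "INFO": 9, "WARNING": 13, "ERROR": 17, "CRITICAL": 21}

# Hypothetical deny-list; a real service would maintain its own.
REDACTED_KEYS = {"password", "token", "authorization", "api_key"}


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object with trace correlation fields."""

    def __init__(self, service_name, get_trace_ids):
        super().__init__()
        self._service_name = service_name
        # Assumption: get_trace_ids() returns (trace_id, span_id) as hex strings.
        # With the OTel API this would come from the current span's context.
        self._get_trace_ids = get_trace_ids

    def format(self, record):
        trace_id, span_id = self._get_trace_ids()
        payload = {
            "message": record.getMessage(),
            "severity_text": record.levelname,
            "severity_number": SEVERITY_NUMBER.get(record.levelname, 0),
            "service.name": self._service_name,
            "trace_id": trace_id,
            "span_id": span_id,
        }
        # Business identifiers arrive as discrete fields via extra={"fields": {...}}.
        for key, value in getattr(record, "fields", {}).items():
            # Redaction happens here, at the logger, not at the call site.
            payload[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        return json.dumps(payload)


def build_logger(stream, service_name, get_trace_ids):
    """Return a logger that writes one JSON line per record to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter(service_name, get_trace_ids))
    logger = logging.getLogger(f"app.{service_name}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: fixed ids stand in for the active span context.
stream = io.StringIO()
log = build_logger(
    stream, "checkout",
    lambda: ("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"),
)
log.info("order placed", extra={"fields": {"order.id": "o-42", "token": "secret"}})
```

Passing identifiers through `extra` keeps them as separate JSON keys, so a log backend can index `order.id` without parsing the message text.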

---

### [RULE] Keep metric attributes low-cardinality  (path: /src)
<!-- scope: folder | priority: high | advisory -->

Every distinct combination of attribute values creates a separate time series, so attach only bounded, enumerable values.

- Allowed attributes are bounded sets: `http.route` (the templated path, not the raw URL), `http.response.status_code`, `service.name`, region, and environment.
- Never attach user ids, session ids, request ids, raw URLs, or error messages as metric attributes; carry those on spans and logs instead.
- Follow stable OpenTelemetry semantic conventions for attribute names (`http.request.method`, `http.route`) so cross-service dashboards and SLO queries work without per-team mapping.
- Drop or aggregate unwanted attributes at the source with SDK Views, or in the Collector, before they are ever exported.
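
One way to picture dropping attributes at the source is an allow-list filter applied before a measurement is recorded. The sketch below is a plain-Python stand-in for what an SDK View or a Collector processor would do. `ALLOWED_METRIC_ATTRIBUTES` and `bounded_attributes` are hypothetical names, and the list covers only the attribute keys named above.

```python
# Assumption: the allow-list mirrors the bounded attribute keys named in this rule.
ALLOWED_METRIC_ATTRIBUTES = frozenset({
    "http.route",
    "http.request.method",
    "http.response.status_code",
    "service.name",
})


def bounded_attributes(attributes):
    """Keep only allow-listed keys; everything else belongs on spans and logs."""
    return {
        key: value
        for key, value in attributes.items()
        if key in ALLOWED_METRIC_ATTRIBUTES
    }


# Usage: the raw URL and the user id are stripped, the templated route survives.
raw = {
    "http.route": "/orders/{id}",
    "http.request.method": "GET",
    "http.response.status_code": 200,
    "url.full": "https://shop.example/orders/8731?ref=mail",
    "user.id": "u-19",
}
metric_attrs = bounded_attributes(raw)
```

An allow-list fails closed: a newly added high-cardinality field is dropped by default instead of silently multiplying the number of time series.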

---

### [MEMORY] OpenTelemetry SDK setup: OTLP export, signals, and propagation  (path: /src/observability)

OpenTelemetry is the default instrumentation layer in 2026, with traces, metrics, and logs all stable and shipped over the OTLP wire protocol. Continuous profiling is the fourth signal, in release-candidate status.

- Initialize the SDK once at process start, before any other import, so auto-instrumentation can patch libraries; in Node use `@opentelemetry/sdk-node` with `getNodeAutoInstrumentations()`.
- Export all three signals over OTLP (gRPC on port `4317` or HTTP on port `4318`) to a local OpenTelemetry Collector, not directly to a vendor; the Collector handles batching, retries, and re-routing.
- Set `service.name`, `service.version`, and `deployment.environment` as Resource attributes so every signal is attributable.
- Use the W3C Trace Context propagator (the default, carried in the `traceparent` header) so trace context flows across HTTP, gRPC, and message queues; do not hand-roll correlation headers.
- LLM and agent calls have a `gen_ai` semantic-convention group (still experimental) covering `gen_ai.request.model` and token-usage attributes; use it for AI pipelines rather than inventing attribute names.
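
Much of this setup can be expressed through the standard `OTEL_*` environment variables from the OpenTelemetry specification rather than in code. The sketch below builds that environment for a launcher script. `otel_environment` is a hypothetical helper, the local Collector endpoint is an assumption, and which variables a given language SDK honors varies, so check the SDK's documentation before relying on any one of them.

```python
def otel_environment(service_name, service_version, environment,
                     collector_endpoint="http://localhost:4317"):
    """Build OTEL_* variables pointing every signal at a local Collector."""
    # Resource attributes are passed as comma-separated key=value pairs.
    resource_attributes = ",".join([
        f"service.version={service_version}",
        f"deployment.environment={environment}",
    ])
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": resource_attributes,
        # Assumption: a Collector listens on the default OTLP gRPC port.
        "OTEL_EXPORTER_OTLP_ENDPOINT": collector_endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_METRICS_EXPORTER": "otlp",
        "OTEL_LOGS_EXPORTER": "otlp",
        # W3C Trace Context plus W3C Baggage.
        "OTEL_PROPAGATORS": "tracecontext,baggage",
    }


# Usage: merge into os.environ before the SDK initializes.
env = otel_environment("checkout", "1.4.2", "production")
```

Keeping the endpoint and resource attributes in the environment means the same build can move between environments without a code change.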

---

### [MEMORY] SLOs and multi-window burn-rate alerts  (path: /deploy)

Define SLOs on user-facing symptoms (availability, latency) and alert on how fast you burn the error budget, not on raw resource thresholds. This is the Google SRE multi-window, multi-burn-rate approach.

- Pick SLIs that reflect user experience, such as the success ratio of requests and a latency percentile (for example, p95 under 300 ms), and set a realistic objective such as 99.9% over 30 days.
- Page when the burn rate exceeds 14.4 over a 1-hour window (about 2% of a 30-day budget consumed in an hour) and a short confirmation window (5 minutes in the Google SRE Workbook) shows the budget is still burning.
- Open a ticket (no page) when burn rate is greater than 6 over a 6-hour window, and surface slow burns (greater than 1 over 3 days) in weekly review.
- Each alert pairs a long detection window with a short confirmation window so resolved incidents stop paging.
- Instrument RED metrics (Rate, Errors, Duration) for request services and USE metrics (Utilization, Saturation, Errors) for resources to power these SLIs.
- Every page must be actionable and link to a runbook; if an alert cannot be acted on, it is a dashboard, not a page.
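
The thresholds above follow from simple arithmetic: burn rate is the observed error ratio divided by the error budget, and the share of budget consumed is the burn rate times the window length over the SLO period. The sketch below works through it. The function names are hypothetical, and the 5-minute and 30-minute confirmation windows are assumptions taken from the Google SRE Workbook's example, not from this document.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period


def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate, window_hours, period_hours=SLO_PERIOD_HOURS):
    """Fraction of the period's error budget spent at `rate` over the window."""
    return rate * window_hours / period_hours


def classify(long_1h, short_5m, long_6h, short_30m):
    """Require both the long and the short window to exceed the threshold."""
    if long_1h > 14.4 and short_5m > 14.4:
        return "page"
    if long_6h > 6 and short_30m > 6:
        return "ticket"
    return "ok"


# 14.4x over 1 hour spends about 2% of a 30-day budget; 6x over 6 hours about 5%.
page_share = budget_consumed(14.4, 1)
ticket_share = budget_consumed(6, 6)
```

The short window is what lets an alert clear quickly: once the 5-minute burn rate drops below the threshold, `classify` stops returning `"page"` even though the 1-hour average is still elevated.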

---

### [SKILL] observability-review  (path: /)

---
name: observability-review
description: Review checklist for service observability covering structured correlated logs, OpenTelemetry trace and metric instrumentation, OTLP export, low-cardinality metrics, and SLO burn-rate alerts. Run before merging any telemetry, logging, or alerting change.
---

# Observability review

- [ ] All application logs go through one structured logger emitting JSON; no `console.log`/`print` for app logging.
- [ ] Every log record carries `trace_id` and `span_id` from the active span context, plus `service.name` and a severity mapped to the OTel `SeverityNumber`.
- [ ] Secrets, tokens, and PII are redacted at the logger; business identifiers are discrete fields, not embedded in the message string.
- [ ] The OpenTelemetry SDK is initialized once before other imports, with `service.name`, `service.version`, and `deployment.environment` set on the Resource.
- [ ] Traces, metrics, and logs export over OTLP to a Collector, not directly to a vendor backend.
- [ ] W3C `traceparent` propagation is used across HTTP, gRPC, and queues; no hand-rolled correlation headers.
- [ ] Metric attributes are low-cardinality and follow semantic conventions (`http.route`, `http.request.method`); no user ids, request ids, or raw URLs as attributes.
- [ ] High-cardinality attributes are dropped or aggregated via SDK Views or the Collector before export.
- [ ] New alerts are tied to an SLO and fire on multi-window burn rate (page at >14.4 / 1h, ticket at >6 / 6h), not raw CPU or error counts.
- [ ] Every paging alert is actionable and links to a runbook; non-actionable signals are dashboards, not pages.
