Observability

Self-hosted metrics stack removed (2026-06 GCP migration)

The Prometheus + Grafana + Alertmanager + exporters stack has been decommissioned and removed from compose.yml. Fleet metrics/alerting will be redone with Cloud Monitoring + Cloud Logging on GCP. Logfire (per-request tracing) is unaffected. The Prometheus/Grafana sections below are legacy reference until the Cloud Monitoring setup lands.

Dataland runs two complementary observability stacks that do not overlap:

Stack	Answers	Source of truth
Logfire (OpenTelemetry)	"What happened inside this request? Which span was slow? What did the LLM do?"	Per-service traces, AI spans, and structured events
Prometheus + Grafana + Alertmanager (DAT-82)	"How is the fleet doing over time? Are we burning the SLO? Is the DLQ growing?"	Numeric time series, dashboards, paging alerts

Logfire is for trace-level forensics (a single visitor's chat turn, a single push). Prometheus is for capacity, SLOs and burn alerts. When an alert fires in Alertmanager, the alert annotation usually tells you to jump into Logfire scoped to a service_name to find the dominant exception.

flowchart LR
  subgraph svc["Python services"]
    A[dataland-agent]
    AU[dataland-auth]
    R[dataland-rag]
    M[dataland-museum]
    NW[notification-worker]
    NA[notification-api]
    W[information-webui]
  end

  subgraph metrics["Numeric pull"]
    PE[postgres-exporter]
    RE[redis-exporter]
    CA[cAdvisor]
    NE[node-exporter]
    Q[Qdrant /metrics]
  end

  svc -- "/metrics (text exposition)" --> P[(Prometheus TSDB)]
  metrics --> P
  P -- rules --> AM[Alertmanager]
  P --> G[Grafana]
  AM -- "Slack / PagerDuty (operator-wired)" --> oncall((On-call))

  svc -- "OTLP spans + events" --> LF[(Logfire)]

Logfire (traces, AI spans, structured events)

Every Python service is instrumented through a small per-repo observability.py wrapper around the logfire SDK. The wrapper exposes a uniform internal API (configure_observability(), instrument_fastapi(), instrument_common_clients(), event(), span(), and on the agent set_attributes()), so the call sites look identical across repos even though each service tunes its own instrumentation.

Instrumented services and their service_name:

`service_name`	Repo / process	Notes
`dataland-agent`	dataland-agent (FastAPI, uvicorn `--workers ${UVICORN_WORKERS:-2}`)	pydantic-ai + google-genai + SQLAlchemy + httpx instrumented
`dataland-auth`	`auth_server.py` (hosted in the dataland-agent repo)	RS256 JWKS issuer
`dataland-rag`	dataland-rag-v2	google-genai instrumented; AI content captured by default in prod (DAT-177), redacted
`dataland-museum`	dataland-museum (museum-api)	RDC bridge spans + `auth.login.*` events
`dataland-notification`	dataland-notification (both the worker and api processes report under this name)	redis tracing off by default
`dataland-atlas`	dataland-atlas	the Catalog Studio CMS
`dataland-simulator`	dataland-museum simulator (only when the `compose.sim.yml` overlay is on)	dev/load only

Enabling telemetry

Logfire is token-gated and fail-open. Set a write token in .env:

LOGFIRE_TOKEN=your-logfire-write-token       # (1)!
LOGFIRE_ENVIRONMENT=production
LOGFIRE_SEND_TO_LOGFIRE=if-token-present      # (2)!
LOGFIRE_SYSTEM_METRICS=true                   # (3)!
LOGFIRE_CAPTURE_AI_CONTENT=false              # (4)!

(1) The write token is the master switch. With it empty, every service still boots normally and simply never exports spans.
(2) Accepts true / false / if-token-present. The default if-token-present ships spans only when LOGFIRE_TOKEN is set; true errors if there is no token, false never ships.
(3) Per-process CPU / mem / GC gauges. Enabled on agent, rag, museum and notification.
(4) Attaches prompts + completions to spans (also sets OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT). Keep false on agent / museum / webui in prod. RAG defaults this on in prod (DAT-177); the auto-instrumented google-genai capture path bypasses RAG's redact() scrubber, so never put secrets in prompts.

LOGFIRE_SEND_TO_LOGFIRE is parsed by every service's _send_to_logfire():

true / 1 / yes / on → always ship (errors if no token).
false / 0 / no / off → never ship.
anything else (the default if-token-present) → ship only when LOGFIRE_TOKEN is set.
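
The three-way parsing rule above can be sketched as follows. This is a minimal illustration, not the services' actual _send_to_logfire(): the function name parse_send_to_logfire, the case-insensitive matching, and the whitespace stripping are all assumptions. The return values mirror the three settings the logfire SDK accepts for send_to_logfire (True, False, or the string 'if-token-present').

```python
from typing import Optional, Union

_TRUTHY = {"true", "1", "yes", "on"}
_FALSY = {"false", "0", "no", "off"}


def parse_send_to_logfire(raw: Optional[str]) -> Union[bool, str]:
    """Map the LOGFIRE_SEND_TO_LOGFIRE string onto a three-way setting.

    Hypothetical helper: lower-casing and stripping are assumptions.
    """
    value = (raw or "").strip().lower()
    if value in _TRUTHY:
        return True   # always ship; configuration errors if no token is set
    if value in _FALSY:
        return False  # never ship
    # Anything else, including an unset variable, defers to token presence.
    return "if-token-present"
```

Treating every unrecognised value as if-token-present means a typo in .env degrades to the safe default instead of failing the boot.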

Services start with no token

With an empty LOGFIRE_TOKEN, every service still boots normally; it just doesn't export spans. configure_observability() is idempotent and guarded by a module-level _CONFIGURED flag, so re-importing is safe. event() / span() never raise — span() falls back to a nullcontext() on any error, so observability can never break the request path.
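
The fail-open behaviour described above could look roughly like this. It is a sketch under stated assumptions, not the real observability.py: the pluggable factory argument is hypothetical and stands in for whatever the logfire SDK provides, and the boolean return value exists only to make the idempotence visible.

```python
import contextlib
from typing import Any, Callable, ContextManager, Optional

_CONFIGURED = False
# Hypothetical hook: whatever creates a real span once telemetry is configured.
_span_factory: Optional[Callable[..., ContextManager[Any]]] = None


def configure_observability(
    factory: Optional[Callable[..., ContextManager[Any]]] = None,
) -> bool:
    """Install the span factory once; later calls are no-ops.

    Returns True only for the call that actually configured.
    """
    global _CONFIGURED, _span_factory
    if _CONFIGURED:
        return False
    _span_factory = factory
    _CONFIGURED = True
    return True


def span(name: str, **attributes: Any) -> ContextManager[Any]:
    """Return a tracing span, or a do-nothing context manager on any failure."""
    try:
        if _span_factory is None:
            return contextlib.nullcontext()
        return _span_factory(name, **attributes)
    except Exception:
        # Observability must never break the request path.
        return contextlib.nullcontext()
```

Because span() always returns a usable context manager, call sites can write `with span(...)` unconditionally, whether or not a token is present.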

AI content capture leaks prompts

LOGFIRE_CAPTURE_AI_CONTENT controls whether prompts and completions are attached to spans (it also sets OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT).

agent / museum / webui: default false. Keep it off in production.
dataland-rag: defaults to on in production (DAT-177) so a mis-captioned image has its Gemini prompt + response in the trace. RAG runs every explicit attribute through structured_log.redact() first, scrubbing Bearer … / Basic … tokens and secret-shaped keys. The auto-instrumented google-genai capture path bypasses that scrubber, so do not put secrets in prompts. Flip it off explicitly with LOGFIRE_CAPTURE_AI_CONTENT=false in any environment where prompt content is sensitive.

What gets auto-instrumented

instrument_common_clients() wires the relevant integrations per service:

Integration	agent	rag	museum	notification
FastAPI (request spans)	✅	✅	✅	✅
httpx (outbound HTTP)	✅	✅	✅	✅
pydantic-ai (LLM run spans)	✅	—	—	—
google-genai (Gemini calls)	✅	✅	—	—
SQLAlchemy (query spans)	✅	—	—	—
system metrics (`LOGFIRE_SYSTEM_METRICS`)	✅	✅	✅	✅
Redis	opt-in	—	—	opt-in

Redis tracing is off by default on both the agent and the notification worker because the Redis traffic (session/ticket-state lookups on the agent; XREADGROUP / XACK / INCRBY loops on the notification consumer) is high-volume plumbing with no business signal. The interesting spans are the FastAPI request, the SQLAlchemy queries, the pydantic-ai run, and the hand-written notification.process_telemetry span. Set LOGFIRE_INSTRUMENT_REDIS=true for short-window debugging only.

Enrich the HTTP span instead of nesting

The agent's set_attributes(...) (used in routers/chat.py, routers/conversations.py, middleware.py) attaches domain fields like ticket.id, conversation.id, chat.mode onto the active auto-instrumented HTTP span. Live-view queries can then filter on these without drilling into child spans. It no-ops silently when no span is recording.

Useful filters

service_name = 'dataland-agent'
service_name = 'dataland-notification'
tags contains 'dataland'

Every hand-emitted event() / span() is tagged dataland (plus a per-service tag like notification or rag), so tags contains 'dataland' isolates the curated events from the firehose of auto-instrumented spans.

Key event and span names

These are the actual names emitted by the code (obs.event(...) / obs.span(...)), grouped by flow. Distributed tracing is on (distributed_tracing=True), so a chat turn that hops agent → rag → museum stitches into one trace.

Flow	Events / spans (real names)
Agent lifecycle	`agent.startup`, `agent.shutdown`
Agent service resolver	`service.ticket.resolve` (span), `service.ticket.resolved`, `service.ticket.resolve_not_found`
Museum lifecycle	`museum.startup`, `museum.shutdown`, `museum.rdc.subscriber`, `museum.telemetry.publisher`
Museum ticket→user	`museum.ticket_user.resolve` (span), `museum.ticket_user.resolved`
Museum auth	`auth.login.success`, `auth.login.failed`, `auth.login.rate_limited`
RAG search	`rag.search` (span), `rag.search.completed`, `rag.images.search_text` / `rag.images.search_text_completed`, `rag.images.search_image` / `rag.images.search_image_completed`
RAG ingest	`rag.ingest.file` / `rag.ingest.file_completed` / `rag.ingest.file_failed`, `rag.ingest.image` / `rag.ingest.image_completed`, `rag.ingest.image_rate_limited`, `rag.ingest.image_caption_quota_exhausted`, `rag.ingest.sync` / `rag.ingest.sync_completed`
Notification telemetry	`notification.process_telemetry` (span), `notification.telemetry.processed`, `notification.telemetry.skipped_missing_ticket`, `notification.telemetry.skipped_disabled`
Notification rules	`notification.rule.triggered`, `notification.rule.rate_limited`, `notification.rule.skipped_missing_required`, `notification.rules.reloaded` / `notification.rules.reload_failed`
Notification push	`notification.push.send` (span), `notification.push.sent`, `notification.push.failed`, `notification.push.missing_recipient`, `notification.push.skipped_credentials`
Notification welcome (DAT-296)	`notification.welcome.sent`, `notification.welcome.unresolved`
Notification ↔ agent	`notification.agent.resolve_ticket` (span), `notification.agent.ticket_resolved`, `notification.agent.ticket_not_found`, `notification.agent.open_chat`, `notification.agent.chat_opened`, `notification.agent.chat_wall_clock_timeout`
Ops alerts (DAT-213)	`notification.ops_alert.send` (span), `notification.ops_alert.sent`, `notification.ops_alert.failed`, `notification.ops_alert.multi_subfailure`, `notification.ops_alert.skipped_no_url`, `notification.complaint.notifier_disabled`

Prometheus + Grafana + Alertmanager (DAT-82)

The monitoring stack lives entirely in this repo: containers in compose.yml, scrape/alert/route config under dataland-infrastructure/monitoring/.

Container	Image (default ver)	Purpose
`dataland-prometheus`	`prom/prometheus:v2.55.0`	Scrape coordinator + TSDB (`30d` / `10GB` retention)
`dataland-grafana`	`grafana/grafana:11.3.0`	Dashboard frontend (DAT-266)
`dataland-alertmanager`	`prom/alertmanager:v0.27.0`	Alert routing + fan-out
`dataland-postgres-exporter`	`prometheuscommunity/postgres-exporter:v0.15.0`	`pg_stat_*`, `pg_up`, connection counts
`dataland-redis-exporter`	`oliver006/redis_exporter:v1.62.0`	`INFO`, memory, stream depths (`requirepass`-aware)
`dataland-cadvisor`	`gcr.io/cadvisor/cadvisor:v0.49.1`	Per-container CPU / memory / CFS throttling
`dataland-node-exporter`	`prom/node-exporter:v1.8.2`	Host CPU, IO wait, disk, memory

All image tags are overridable via *_VERSION env vars (see .env.example).

host-metrics profile (Linux only)

cadvisor and node-exporter are gated behind the compose profiles: ["host-metrics"]. They mount host /proc, /sys, /, and /var/run in ways a macOS dev box can't satisfy, so local dev skips them. deploy.sh layers the host-metrics profile on the prod VDS. When the profile is off, the dataland-host alert group stays silent (its targets never come up) — that is expected, not a misconfiguration.

Application `/metrics` endpoints

Every FastAPI service exposes Prometheus text exposition at GET /metrics via an install_metrics(app, service=...) helper in its metrics.py. The middleware is a pure-ASGI wrapper (not Starlette's BaseHTTPMiddleware), chosen deliberately: BaseHTTPMiddleware buffers the whole response body before forwarding, which would break the agent's SSE chat endpoints (/v1/chat/general, /v1/chat/museum). The ASGI middleware only wraps send to capture the status code, so streaming stays unbuffered while the metric is still recorded in a finally block.
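
To make the streaming argument concrete, here is a minimal pure-ASGI middleware that wraps send and records the metric in a finally block. It is a sketch, not the services' metrics.py: the class name MetricsMiddleware, the toy sse_app, and run_demo are hypothetical, and a plain Counter and list stand in for the prometheus_client counter and histogram so the example needs only the standard library.

```python
import asyncio
import time
from collections import Counter

# Stand-ins for prometheus_client Counter / Histogram objects (assumption).
REQUESTS: Counter = Counter()
DURATIONS: list = []


class MetricsMiddleware:
    """Pure-ASGI middleware: reads the status from `send`, never buffers the body."""

    def __init__(self, app, service: str) -> None:
        self.app = app
        self.service = service

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # recorded if the app raises before starting a response
        started = time.perf_counter()

        async def send_wrapper(message) -> None:
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)  # forward every message immediately

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            DURATIONS.append(time.perf_counter() - started)
            key = (self.service, scope["method"], scope["path"], str(status))
            REQUESTS[key] += 1


async def sse_app(scope, receive, send) -> None:
    """Toy streaming endpoint: one response start, then two body chunks."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"data: a\n\n", "more_body": True})
    await send({"type": "http.response.body", "body": b"data: b\n\n", "more_body": False})


def run_demo() -> list:
    """Drive one request through the middleware; return the forwarded messages."""
    forwarded = []

    async def send(message) -> None:
        forwarded.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    scope = {"type": "http", "method": "GET", "path": "/v1/chat/general"}
    app = MetricsMiddleware(sse_app, service="dataland-agent")
    asyncio.run(app(scope, receive, send))
    return forwarded
```

Each chunk reaches the outer send as soon as the app emits it, and the counter is updated only after the stream finishes, which is the property BaseHTTPMiddleware cannot offer.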

Common series exported by agent / rag / museum:

Metric	Type	Labels	Meaning
`http_requests_total`	counter	`service`, `method`, `path`, `status`	Every HTTP request, by status code (drives 5xx rate + SLO burn)
`http_request_duration_seconds`	histogram	`service`, `method`, `path`	Request latency (drives `HighHTTPLatency`)

Service-specific app metrics:

Service	Extra metric(s)	Source
`dataland-agent`	`agent_guardrail_triggered_total{direction,category}` (DAT-263)	`app/agent/streaming.py` increments on input/output guardrail blocks; cardinality bounded by the `GuardrailCategory` enum
`dataland-museum`	`museum_bridge_published`, `museum_bridge_skipped_no_ticket`, `museum_bridge_publish_failed` (counters), `museum_bridge_seconds_since_last_published` (gauge)	A custom `BridgeCollector` snapshots `RDCSubscriber.metrics` at scrape time (same data as `/api/bridge/metrics` JSON, DAT-131)
`dataland-notification-api`	DLQ / stream / latency surface — see below	`app/metrics.py` rendered from Redis counters

Agent multiprocess metrics

The agent runs uvicorn with --workers ${UVICORN_WORKERS:-2}, so each worker has its own in-process counters. compose.yml sets PROMETHEUS_MULTIPROC_DIR=/tmp/prom-multiproc (backed by tmpfs, so it's empty on every boot) which switches prometheus_client into multiprocess mode: workers write per-pid files and /metrics aggregates them with MultiProcessCollector at scrape time. rag, museum, and the notification consumer run a single worker, so the default in-process registry is correct there; the same env var opts them in if the worker count is bumped later.

The notification metrics surface (DAT-59 / DAT-107 / DAT-110 / DAT-257)

The notification service is special: its counters live in Redis under notification:counters:*, not in process memory. This is deliberate — the worker process increments them, the api process renders them at GET /metrics, and no in-process state is shared. Counters are best-effort (every increment is wrapped in a broad exception suppressor) so a Redis hiccup never crashes the rules engine.

Counter-derived series (rendered as Prometheus counters):

Metric	Redis key	Meaning
`notification_processed_total`	`…:processed`	Telemetry entries the consumer ack'd
`notification_dlq_total`	`…:dlq`	Entries routed to `museum:telemetry:dlq`
`notification_push_sent_total`	`…:push:sent`	OneSignal calls returning 2xx
`notification_push_failed_total`	`…:push:failed`	4xx after retries OR missing recipient
`notification_chat_opened_total`	`…:chat:opened`	Agent chat conversations opened
`notification_chat_failed_total`	`…:chat:failed`	Chat dispatch errors
`notification_rule_triggered_total{rule}`	`…:rule:triggered:<name>`	Per-rule fire count
`notification_rule_rate_limited_total{rule}`	`…:rule:rate_limited:<name>`	Per-rule cooldown suppressions

No SCAN per scrape (DAT-107)

Per-rule counters used to be discovered with a per-scrape SCAN rule:*, which drove constant SCAN traffic against the same Redis the consumer reads from. The module now maintains index SETs (…:index:rule:triggered, …:index:rule:rate_limited) populated in the same pipeline as the INCRBY, then renders via SMEMBERS + MGET. The first render after deploy backfills the index from a one-shot SCAN; orphaned index entries (counter deleted) are pruned on render.
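
The index-SET approach can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: FakeRedis is a hypothetical test double exposing only the commands used, the full key prefix notification:counters is an assumption based on the abbreviated keys above, and the real module issues the INCRBY and SADD in one pipeline rather than sequentially.

```python
from typing import Dict, List, Optional, Set


class FakeRedis:
    """In-memory stand-in exposing only the commands used here (hypothetical)."""

    def __init__(self) -> None:
        self.strings: Dict[str, int] = {}
        self.sets: Dict[str, Set[str]] = {}

    def incrby(self, key: str, amount: int = 1) -> int:
        self.strings[key] = self.strings.get(key, 0) + amount
        return self.strings[key]

    def sadd(self, key: str, member: str) -> None:
        self.sets.setdefault(key, set()).add(member)

    def srem(self, key: str, member: str) -> None:
        self.sets.get(key, set()).discard(member)

    def smembers(self, key: str) -> Set[str]:
        return set(self.sets.get(key, set()))

    def mget(self, keys: List[str]) -> List[Optional[int]]:
        return [self.strings.get(k) for k in keys]


PREFIX = "notification:counters"  # assumed expansion of the abbreviated prefix
INDEX = f"{PREFIX}:index:rule:triggered"


def record_rule_triggered(r: FakeRedis, rule: str) -> None:
    """Bump the per-rule counter and register the rule name in the index SET."""
    r.incrby(f"{PREFIX}:rule:triggered:{rule}")
    r.sadd(INDEX, rule)


def render_rule_triggered(r: FakeRedis) -> List[str]:
    """Render per-rule counters via SMEMBERS + MGET, pruning orphaned entries."""
    rules = sorted(r.smembers(INDEX))
    values = r.mget([f"{PREFIX}:rule:triggered:{name}" for name in rules])
    lines = []
    for name, value in zip(rules, values):
        if value is None:
            r.srem(INDEX, name)  # counter was deleted; drop the stale index entry
            continue
        lines.append(f'notification_rule_triggered_total{{rule="{name}"}} {value}')
    return lines
```

A render costs one set read plus one MGET regardless of how many unrelated keys live in the same Redis, which is the point of replacing the per-scrape SCAN.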

Gauges and the latency histogram (these are the alerting backbone):

Metric	Type	Meaning	Alertable?
`notification_dlq_depth`	gauge	`XLEN` of the DLQ stream (DAT-110)	Yes — canonical "rules are silently failing" signal
`notification_telemetry_consumer_lag{group}`	gauge	Per-group unread entries from `XINFO GROUPS` (DAT-82)	Yes — correct backpressure signal
`notification_telemetry_stream_depth`	gauge	`XLEN` of the input telemetry stream	No — pegs near the `MAXLEN` cap in steady state; diagnostic only
`notification_e2e_latency_seconds`	histogram	Producer timestamp → successful OneSignal push-ack (DAT-257)	Yes — p95 SLO

Each gauge is skipped (its line omitted) when its underlying Redis call errors, rather than emitting a bogus 0 that would mask a Redis outage. The e2e latency histogram is itself stored as Redis keys (notification:hist:e2e_latency_seconds:{sum,count,bucket:<le>}) with bucket boundaries 0.05 … 60s, so the worker can observe and the api can render without sharing memory.
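
The "omit the line instead of emitting 0" rule can be illustrated with a small renderer. This is a sketch under assumptions: render_gauge and render_all are hypothetical names, and the reader callables stand in for the Redis calls (XLEN, XINFO GROUPS) that the real module makes.

```python
from typing import Callable, Dict, List


def render_gauge(name: str, read_value: Callable[[], float]) -> List[str]:
    """Return exposition lines for one gauge, or nothing if the read fails."""
    try:
        value = read_value()
    except Exception:
        # Omit the series entirely: a fabricated 0 would look like a healthy,
        # empty queue and hide the outage.
        return []
    return [f"# TYPE {name} gauge", f"{name} {value}"]


def render_all(readers: Dict[str, Callable[[], float]]) -> str:
    """Render every gauge whose backing read succeeded."""
    lines: List[str] = []
    for name, reader in readers.items():
        lines.extend(render_gauge(name, reader))
    return "\n".join(lines) + "\n"
```

With this shape, a Redis outage shows up as a series that disappears from the scrape rather than as a misleading flat line at zero.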

Why stream_depth is not an alert

museum:telemetry is MAXLEN-capped, so XLEN sits near the cap even when the consumer is fully caught up. Alerting on notification_telemetry_stream_depth would false-fire constantly. Always alert on notification_telemetry_consumer_lag (the real unread count) instead — this is the lesson encoded in NotificationConsumerLagHigh.

Scrape targets (`monitoring/prometheus.yml`)

scrape_interval: 15s, scrape_timeout: 10s, external labels cluster=dataland, environment=production. Container DNS names match the compose container_name fields.

Native exporters + already-shipping routes (zero-code):

prometheus (self, localhost:9090)
dataland-postgres-exporter:9187
dataland-redis-exporter:9121
dataland-cadvisor:8080
dataland-node-exporter:9100
dataland-qdrant:6333/metrics (native Prometheus exposition on the REST port)
dataland-notification-api:8080/metrics

FastAPI application targets (scrape cleanly once each service has /metrics wired):

dataland-agent:4141/metrics
dataland-auth:9000/metrics
dataland-rag:4143/metrics
dataland-museum:5001/metrics
dataland-atlas:4152/metrics

A scrape failure can be the correct signal

The FastAPI application targets listed above (the Phase-1 rows) scrape successfully only after a service exposes /metrics. Until then Prometheus logs scrape failures, which is the intended "not-yet-instrumented" indicator rather than a stack bug. ServiceDown keys off up{job="..."} == 0 for more than 2m.

Reaching the UIs

By default Prometheus, Alertmanager and Grafana bind loopback-only (127.0.0.1). Two ways to view:

# 1. SSH tunnel from your workstation:
ssh -L 9090:127.0.0.1:9090 -L 9093:127.0.0.1:9093 -L 3000:127.0.0.1:3000 \
    ege@100.124.170.43
# then open http://localhost:9090 (Prometheus), :9093 (Alertmanager), :3000 (Grafana)  # (1)!

(1) These three local ports map to Prometheus (9090), Alertmanager (9093) and Grafana (3000) on the VDS. The tunnel is the safest path because all three bind loopback-only by default and Prometheus / Alertmanager have no authentication.

# 2. Bind to the tailnet IP (no tunnel needed for tailnet peers), in .env:
PROMETHEUS_PUBLIC_BIND=100.124.170.43        # (1)!
GRAFANA_PUBLIC_BIND=100.124.170.43
ALERTMANAGER_PUBLIC_BIND=100.124.170.43

(1) DAT-291. Use the tailnet IP (100.124.170.43) so only tailnet peers can reach the UIs without a tunnel. Never set these to 0.0.0.0: that exposes the host's public IP, and Prometheus / Alertmanager have no auth.

Never bind to 0.0.0.0

Prometheus and Alertmanager have no authentication. Grafana's admin password is the only barrier. Binding any of the three to 0.0.0.0 exposes the host's Spectrum public IP. Use the tailnet IP (100.124.170.43) for direct peer access, or the SSH tunnel.

Grafana

GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=***                    # (1)!

(1) Required. The compose :? guard refuses to start Grafana without it (DAT-266), and the admin password is the only barrier in front of Grafana. Generate with openssl rand -hex 24.

The Prometheus datasource is provisioned (monitoring/grafana/provisioning/datasources/prometheus.yml, uid prometheus, talks to http://dataland-prometheus:9090 over the docker network). Five dashboards ship pre-provisioned in the Dataland folder from monitoring/grafana/dashboards/:

Dashboard	File
Service overview	`01-service-overview.json`
Notification	`02-notification.json`
Museum bridge	`03-museum-bridge.json`
Datastore	`04-datastore.json`
Host & containers	`05-host-and-containers.json`

Dashboards are git-managed and the repo is the source of truth (allowUiUpdates: false, updateIntervalSeconds: 30): edit the JSON, git pull, recreate the container. UI edits do not persist past a restart. Grafana's own unified alerting is disabled (manageAlerts: false); alerts live in Prometheus + Alertmanager and are never duplicated in Grafana.

Alert rules (`monitoring/rules/dataland.alerts.yml`)

Severity convention: critical = paging incident (service down or data at risk), warning = investigate next business day.

Alert	Expr (essence)	For	Severity
`ServiceDown`	`up == 0`	2m	critical
`HighHTTPErrorRate`	5xx rate / total > 5%	5m	warning
`HighHTTPLatency`	p99 `http_request_duration_seconds` > 1s	5m	warning
`NotificationDLQDepthHigh`	`notification_dlq_depth > 100`	5m	warning
`NotificationDLQGrowing`	`rate(notification_dlq_total[10m]) > 0.1`	10m	warning
`NotificationConsumerLagHigh`	`notification_telemetry_consumer_lag > 100`	5m	warning
`NotificationLatencyP95High`	p95 `notification_e2e_latency_seconds` > 5s	10m	warning
`NotificationPushFailureRate`	`push_failed / (sent + failed) > 5%`	10m	warning
`MuseumBridgeStuck`	`museum_bridge_seconds_since_last_published > 60`	5m	warning
`MuseumBridgePublishFailing`	`rate(museum_bridge_publish_failed_total[5m]) > 0`	5m	warning
`PostgresDown`	`pg_up == 0`	1m	critical
`PostgresConnectionsSaturated`	connections > 85% of `max_connections`	5m	warning
`RedisDown`	`redis_up == 0`	1m	critical
`RedisMemoryHigh`	`used / maxmemory > 85%`	5m	warning
`QdrantDown`	`up{job="qdrant"} == 0`	2m	critical
`HostMemoryHigh`	host mem > 90%	5m	warning
`HostDiskHigh`	< 10% disk free	10m	warning
`ContainerCPUThrottling`	`rate(container_cpu_cfs_throttled_seconds_total{name=~"dataland-.*"}[5m]) > 0.1`	10m	warning
`ContainerMemoryNearLimit`	working set > 90% of `mem_limit`	5m	warning
`AgentErrorBudgetFastBurn` / `Rag…` / `Museum…`	1h 5xx rate > `14.4 × (1 − 0.99)`	5m	critical (`slo=availability`)

The SLO-burn group uses the Google SRE Workbook fast-burn formula: 14.4× burn exhausts the monthly error budget in ~2 days.
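
The arithmetic behind the fast-burn threshold can be checked in a few lines. This is a worked example, not part of the rule file: the function name fast_burn is hypothetical, and the 30-day window is an assumption standing in for "monthly".

```python
def fast_burn(slo_target: float, burn_rate: float, window_days: int = 30) -> dict:
    """Derive the alert threshold and budget-exhaustion time for a burn rate."""
    error_budget = 1.0 - slo_target  # allowed failure fraction, e.g. 1%
    return {
        # 5xx-rate threshold used by the alert expression
        "threshold": burn_rate * error_budget,
        # days until the whole budget is gone at this burn rate
        "days_to_exhaust": window_days / burn_rate,
        # share of the window's budget consumed in a single hour
        "budget_fraction_per_hour": burn_rate / (window_days * 24),
    }


result = fast_burn(slo_target=0.99, burn_rate=14.4)
print({k: round(v, 3) for k, v in result.items()})
# {'threshold': 0.144, 'days_to_exhaust': 2.083, 'budget_fraction_per_hour': 0.02}
```

So for a 99% availability SLO the alert fires when the 1h 5xx rate exceeds 14.4%, a pace that burns 2% of the budget per hour and all of it in roughly two days.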

Alert annotations point you at Logfire

Most warning annotations explicitly say to pull Logfire traces scoped to service_name = {{ $labels.service }} and look for the dominant exception class. That is the intended Prometheus→Logfire handoff: numeric alert in, root-cause trace out.

Key signals to watch

DLQ depth (notification_dlq_depth): the canonical "rules are silently failing" gauge. If it grows, events are landing in museum:telemetry:dlq. Inspect via dataland-notification-api:8080/dlq and replay once the root cause is fixed.
Push failure rate: NotificationPushFailureRate watches the share of failed OneSignal pushes among all attempts. Spikes usually mean an invalid ONESIGNAL_API_KEY, rate limits, or a recipient-mapping miss (notification.push.missing_recipient events corroborate).
Consumer lag: notification_telemetry_consumer_lag is the only trustworthy backpressure signal on the capped museum:telemetry stream. If lag keeps rising, scale CONSUMER_COUNT or investigate a worker bottleneck.
Bridge freshness: museum_bridge_seconds_since_last_published > 60 means the RDC Redis bridge has gone silent (no BioSensors event reached museum:telemetry). This is the data-plane health check the green docker ps dot cannot give you.
Fallback-JWKS WARN (DAT-286): this is a log signal, not a metric. The agent logs JWT accepted by FALLBACK JWKS provider … local JWKS is missing this signing key; chat auth depends on this external endpoint (app/auth.py) whenever a token is validated only by a non-primary JWKS provider. The whole point of DAT-286 was to mirror the CMS signing key into the local data/extra_jwks.json so the local JWKS is the primary validator; this WARN means the mirror is stale or missing and chat auth has silently reverted to a single point of failure on the external CMS. Treat it as alertable: grep the agent's structured log / Logfire for FALLBACK JWKS.

Recent changes

DAT-82: full Prometheus + Grafana + Alertmanager + exporter stack, app /metrics on every FastAPI service, notification_telemetry_consumer_lag from XINFO GROUPS, SLO-burn group.
DAT-257: notification_e2e_latency_seconds histogram (producer ts → OneSignal push-ack) + NotificationLatencyP95High.
DAT-263: agent_guardrail_triggered_total{direction,category}.
DAT-266: provisioned Grafana datasource + 5 dashboards, GRAFANA_ADMIN_PASSWORD boot guard.
DAT-286: fallback-JWKS WARN is now an alertable signal (local JWKS mirror of the CMS key removes the chat-auth SPOF).
DAT-291: tailnet *_PUBLIC_BIND publishing for the monitoring UIs.

Structured JSONL logs

Alongside the human-readable console/text logs, each service writes a redacted, structured JSONL stream (structured_log.py, public API setup_structured_logging(service=...), log_event(...), redact(...)):

One JSON object per line.
Daily rotation at UTC midnight, gzip on rotation, 30-day retention.
<LOG_DIR>/<service>-YYYY-MM-DD.jsonl (≥ INFO) and <service>-error-YYYY-MM-DD.jsonl (≥ WARNING).
LOG_DIR defaults to <repo>/logs, overridden by LOG_FILE_DIR so containers mount it at /app/logs.
Mandatory redaction: keys matching secret|token|password|hash|salt|cookie|authorization|api[-_]?key|bearer|jwt|access[-_]?key|client[-_]?secret are replaced with ***REDACTED*** recursively, and inline Bearer … tokens inside string values are scrubbed before write.

This is the durable, on-disk forensic record when Logfire isn't shipping (no token) or when you're SSH'd into the box debugging an incident.
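
The redaction rule listed above can be sketched as a recursive walk. This is an illustration, not the real structured_log.redact(): the function signature, the case-insensitive key match, the character class used to recognise an inline Bearer token, and the choice to keep the literal word Bearer in the output are all assumptions.

```python
import re
from typing import Any

# Key pattern copied from the redaction rule; case-insensitivity is an assumption.
_SECRET_KEY = re.compile(
    r"secret|token|password|hash|salt|cookie|authorization|api[-_]?key|bearer|jwt"
    r"|access[-_]?key|client[-_]?secret",
    re.IGNORECASE,
)
# Assumed shape of an inline bearer credential inside a string value.
_INLINE_BEARER = re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]+")
REDACTED = "***REDACTED***"


def redact(value: Any, key: str = "") -> Any:
    """Return a copy of `value` with secret-shaped keys and Bearer tokens masked."""
    if key and _SECRET_KEY.search(key):
        return REDACTED  # mask the whole value, whatever its type
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return _INLINE_BEARER.sub(f"Bearer {REDACTED}", value)
    return value
```

Because the key pattern is matched with search rather than a full match, any key that merely contains one of the fragments (for example a hypothetical session_token_id) is masked as well, which errs on the side of over-redaction.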

Health checks vs. depth checks

Every service has a docker healthcheck, and docker compose ps shows its status. The monitoring containers are health-checked too: Prometheus and Alertmanager via /-/healthy, Grafana via /api/health.

A green dot is a liveness probe, not a depth probe

museum-api's /health returns 200 even when its RDC subscriber is firing connection-timeout errors every few seconds. The notification api's /health only PINGs Redis. For real data-plane health, trust XLEN museum:telemetry, the museum_bridge_seconds_since_last_published gauge, notification_telemetry_consumer_lag, and log tails — not the green dot in docker ps.

Operator one-liners

# Reload Prometheus config without restart:
curl -X POST http://localhost:9090/-/reload    # (1)!

# Reload Alertmanager config without restart:
curl -X POST http://localhost:9093/-/reload

# Validate Prometheus config:
docker run --rm --entrypoint /bin/promtool \
  -v "$PWD/dataland-infrastructure/monitoring:/etc/prometheus:ro" \
  prom/prometheus:v2.55.0 check config /etc/prometheus/prometheus.yml   # (2)!

# Validate alert rules:
docker run --rm --entrypoint /bin/promtool \
  -v "$PWD/dataland-infrastructure/monitoring/rules:/rules:ro" \
  prom/prometheus:v2.55.0 check rules /rules/dataland.alerts.yml

# Validate Alertmanager routing:
docker run --rm --entrypoint /bin/amtool \
  -v "$PWD/dataland-infrastructure/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro" \
  prom/alertmanager:v0.27.0 check-config /etc/alertmanager/alertmanager.yml

# Spot-check a service's exposition directly:
curl -s http://localhost:4141/metrics | grep -E 'http_requests_total|agent_guardrail'   # (3)!
curl -s http://dataland-notification-api:8080/metrics | grep notification_dlq_depth

(1) Hot-reloads config from disk without dropping the TSDB or in-flight scrapes. Same trick for Alertmanager on :9093. Use after editing scrape/route config instead of restarting the container.
(2) Run check config / check rules / check-config before reloading. The promtool / amtool commands use the same image versions as the running containers (prom/prometheus:v2.55.0, prom/alertmanager:v0.27.0), so the validator matches the running binary. A bad config is rejected on reload and the old one stays loaded, which is easy to miss.
(3) The agent runs multiple uvicorn workers, so /metrics here is the aggregated MultiProcessCollector output. Port 4141 is the agent; 8080 is the notification-api, whose counters come from Redis (notification:counters:*), not process memory.

Wiring Alertmanager receivers

monitoring/alertmanager.yml ships a route tree (critical pages every 1h with group_wait: 0s; default warning tier repeats every 12h) but the critical and default receivers are intentionally empty until an operator adds real channels — a Slack slack_configs block / webhook for warnings and pagerduty_configs for critical. Inhibit rules already suppress per-endpoint flapping when a host or downstream is hard-down, and suppress SLO-burn alerts on services whose dependency is ServiceDown.