Observability

Self-hosted metrics stack removed (2026-06 GCP migration)

The Prometheus + Grafana + Alertmanager + exporters stack has been decommissioned and removed from compose.yml. Fleet metrics/alerting will be redone with Cloud Monitoring + Cloud Logging on GCP. Logfire (per-request tracing) is unaffected. The Prometheus/Grafana sections below are legacy reference until the Cloud Monitoring setup lands.

Dataland runs two complementary observability stacks that do not overlap:

| Stack | Answers | Source of truth |
| --- | --- | --- |
| Logfire (OpenTelemetry) | "What happened inside this request? Which span was slow? What did the LLM do?" | Per-service traces, AI spans, and structured events |
| Prometheus + Grafana + Alertmanager (DAT-82) | "How is the fleet doing over time? Are we burning the SLO? Is the DLQ growing?" | Numeric time series, dashboards, paging alerts |

Logfire is for trace-level forensics (a single visitor's chat turn, a single push). Prometheus is for capacity, SLOs and burn alerts. When an alert fires in Alertmanager, the alert annotation usually tells you to jump into Logfire scoped to a service_name to find the dominant exception.

flowchart LR
  subgraph svc["Python services"]
    A[dataland-agent]
    AU[dataland-auth]
    R[dataland-rag]
    M[dataland-museum]
    NW[notification-worker]
    NA[notification-api]
    W[information-webui]
  end

  subgraph metrics["Numeric pull"]
    PE[postgres-exporter]
    RE[redis-exporter]
    CA[cAdvisor]
    NE[node-exporter]
    Q[Qdrant /metrics]
  end

  svc -- "/metrics (text exposition)" --> P[(Prometheus TSDB)]
  metrics --> P
  P -- rules --> AM[Alertmanager]
  P --> G[Grafana]
  AM -- "Slack / PagerDuty (operator-wired)" --> oncall((On-call))

  svc -- "OTLP spans + events" --> LF[(Logfire)]

Logfire (traces, AI spans, structured events)

Every Python service is instrumented through a small per-repo observability.py wrapper around the logfire SDK. The wrapper exposes a uniform internal API (configure_observability(), instrument_fastapi(), instrument_common_clients(), event(), span() and, on the agent, set_attributes()), so the call sites look identical across repos even though each service tunes its own instrumentation.

Instrumented services and their service_name:

| service_name | Repo / process | Notes |
| --- | --- | --- |
| dataland-agent | dataland-agent (FastAPI, uvicorn --workers ${UVICORN_WORKERS:-2}) | pydantic-ai + google-genai + SQLAlchemy + httpx instrumented |
| dataland-auth | auth_server.py (hosted in the dataland-agent repo) | RS256 JWKS issuer |
| dataland-rag | dataland-rag-v2 | google-genai instrumented; AI content captured by default in prod (DAT-177), redacted |
| dataland-museum | dataland-museum (museum-api) | RDC bridge spans + auth.login.* events |
| dataland-notification | dataland-notification (both the worker and api processes report under this name) | redis tracing off by default |
| dataland-atlas | dataland-atlas | the Catalog Studio CMS |
| dataland-simulator | dataland-museum simulator (only when the compose.sim.yml overlay is on) | dev/load only |

Enabling telemetry

Logfire is token-gated and fail-open. Set a write token in .env:

LOGFIRE_TOKEN=your-logfire-write-token       # (1)!
LOGFIRE_ENVIRONMENT=production
LOGFIRE_SEND_TO_LOGFIRE=if-token-present      # (2)!
LOGFIRE_SYSTEM_METRICS=true                   # (3)!
LOGFIRE_CAPTURE_AI_CONTENT=false              # (4)!
  1. The write token is the master switch. With it empty, every service still boots normally and simply never exports spans — Logfire is token-gated and fail-open.
  2. Accepts true / false / if-token-present. The default if-token-present ships spans only when LOGFIRE_TOKEN is set; true errors if there is no token, false never ships.
  3. Per-process CPU / mem / GC gauges. Enabled on agent, rag, museum and notification.
  4. Attaches prompts + completions to spans (also sets OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT). Keep false on agent / museum / webui in prod. RAG defaults this on in prod (DAT-177); the auto-instrumented google-genai capture path bypasses RAG's redact() scrubber, so never put secrets in prompts.

LOGFIRE_SEND_TO_LOGFIRE is parsed by every service's _send_to_logfire():

  • true / 1 / yes / on → always ship (errors if no token).
  • false / 0 / no / off → never ship.
  • anything else (the default if-token-present) → ship only when LOGFIRE_TOKEN is set.
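The three-way rule above can be sketched as follows. This is a minimal illustration, not the services' actual _send_to_logfire(): the names parse_send_to_logfire and should_ship are hypothetical, and case-insensitive matching with whitespace stripping is an assumption.

```python
_TRUTHY = {"true", "1", "yes", "on"}
_FALSY = {"false", "0", "no", "off"}


def parse_send_to_logfire(raw):
    """Map a LOGFIRE_SEND_TO_LOGFIRE value to True, False or 'if-token-present'."""
    value = (raw or "").strip().lower()  # assumption: matching ignores case
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    # Anything else, including an unset variable, falls back to the default.
    return "if-token-present"


def should_ship(raw, token):
    """Decide whether spans would be exported for this setting and token.

    In the 'true' case this returns True even without a token; per the text
    above, that combination is the one that errors at configuration time.
    """
    mode = parse_send_to_logfire(raw)
    if mode == "if-token-present":
        return bool(token)
    return mode
```

With this sketch, should_ship(None, "") is False (no token, default mode) and should_ship("off", "some-token") is also False, which matches the fail-open behaviour described in this section.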

Services start with no token

With an empty LOGFIRE_TOKEN, every service still boots normally; it just doesn't export spans. configure_observability() is idempotent and guarded by a module-level _CONFIGURED flag, so re-importing is safe. event() / span() never raise — span() falls back to a nullcontext() on any error, so observability can never break the request path.
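A minimal sketch of the two guards described above: a module-level flag that makes configuration idempotent, and a span helper that degrades to nullcontext(). The helper names configure_once and safe_span and their parameters are hypothetical; the real wrapper's signatures are not shown in this page.

```python
from contextlib import nullcontext

_CONFIGURED = False  # module-level guard, as described in the text


def configure_once(setup):
    """Run `setup` the first time only; later calls are no-ops."""
    global _CONFIGURED
    if _CONFIGURED:
        return False
    setup()
    _CONFIGURED = True
    return True


def safe_span(open_span, name, **attrs):
    """Return a span context manager, or nullcontext() if creating it fails.

    `open_span` stands in for the SDK call that opens a span (an assumption
    for this sketch). Errors raised inside the `with` body still propagate;
    only span creation is shielded.
    """
    try:
        return open_span(name, **attrs)
    except Exception:
        return nullcontext()
```

The design point is that the fallback happens at span creation, so a broken exporter costs the request nothing more than a missing span.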

AI content capture leaks prompts

LOGFIRE_CAPTURE_AI_CONTENT controls whether prompts and completions are attached to spans (it also sets OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT).

  • agent / museum / webui: default false. Keep it off in production.
  • dataland-rag: defaults to on in production (DAT-177) so a mis-captioned image has its Gemini prompt + response in the trace. RAG runs every explicit attribute through structured_log.redact() first, scrubbing Bearer … / Basic … tokens and secret-shaped keys. The auto-instrumented google-genai capture path bypasses that scrubber, so do not put secrets in prompts. Flip it off explicitly with LOGFIRE_CAPTURE_AI_CONTENT=false in any environment where prompt content is sensitive.

What gets auto-instrumented

instrument_common_clients() wires the relevant integrations per service:

Integration agent rag museum notification
FastAPI (request spans)
httpx (outbound HTTP)
pydantic-ai (LLM run spans)
google-genai (Gemini calls)
SQLAlchemy (query spans)
system metrics (LOGFIRE_SYSTEM_METRICS)
Redis opt-in opt-in

Redis tracing is off by default on both the agent and the notification worker because the Redis traffic (session/ticket-state lookups on the agent; XREADGROUP / XACK / INCRBY loops on the notification consumer) is high-volume plumbing with no business signal. The interesting spans are the FastAPI request, the SQLAlchemy queries, the pydantic-ai run, and the hand-written notification.process_telemetry span. Set LOGFIRE_INSTRUMENT_REDIS=true for short-window debugging only.

Enrich the HTTP span instead of nesting

The agent's set_attributes(...) (used in routers/chat.py, routers/conversations.py, middleware.py) attaches domain fields like ticket.id, conversation.id, chat.mode onto the active auto-instrumented HTTP span. Live-view queries can then filter on these without drilling into child spans. It no-ops silently when no span is recording.

Useful filters

service_name = 'dataland-agent'
service_name = 'dataland-notification'
tags contains 'dataland'

Every hand-emitted event() / span() is tagged dataland (plus a per-service tag like notification or rag), so tags contains 'dataland' isolates the curated events from the firehose of auto-instrumented spans.

Key event and span names

These are the actual names emitted by the code (obs.event(...) / obs.span(...)), grouped by flow. Distributed tracing is on (distributed_tracing=True), so a chat turn that hops agent → rag → museum stitches into one trace.

| Flow | Events / spans (real names) |
| --- | --- |
| Agent lifecycle | agent.startup, agent.shutdown |
| Agent service resolver | service.ticket.resolve (span), service.ticket.resolved, service.ticket.resolve_not_found |
| Museum lifecycle | museum.startup, museum.shutdown, museum.rdc.subscriber, museum.telemetry.publisher |
| Museum ticket→user | museum.ticket_user.resolve (span), museum.ticket_user.resolved |
| Museum auth | auth.login.success, auth.login.failed, auth.login.rate_limited |
| RAG search | rag.search (span), rag.search.completed, rag.images.search_text / rag.images.search_text_completed, rag.images.search_image / rag.images.search_image_completed |
| RAG ingest | rag.ingest.file / rag.ingest.file_completed / rag.ingest.file_failed, rag.ingest.image / rag.ingest.image_completed, rag.ingest.image_rate_limited, rag.ingest.image_caption_quota_exhausted, rag.ingest.sync / rag.ingest.sync_completed |
| Notification telemetry | notification.process_telemetry (span), notification.telemetry.processed, notification.telemetry.skipped_missing_ticket, notification.telemetry.skipped_disabled |
| Notification rules | notification.rule.triggered, notification.rule.rate_limited, notification.rule.skipped_missing_required, notification.rules.reloaded / notification.rules.reload_failed |
| Notification push | notification.push.send (span), notification.push.sent, notification.push.failed, notification.push.missing_recipient, notification.push.skipped_credentials |
| Notification welcome (DAT-296) | notification.welcome.sent, notification.welcome.unresolved |
| Notification ↔ agent | notification.agent.resolve_ticket (span), notification.agent.ticket_resolved, notification.agent.ticket_not_found, notification.agent.open_chat, notification.agent.chat_opened, notification.agent.chat_wall_clock_timeout |
| Ops alerts (DAT-213) | notification.ops_alert.send (span), notification.ops_alert.sent, notification.ops_alert.failed, notification.ops_alert.multi_subfailure, notification.ops_alert.skipped_no_url, notification.complaint.notifier_disabled |

Prometheus + Grafana + Alertmanager (DAT-82)

The monitoring stack lives entirely in this repo: containers in compose.yml, scrape/alert/route config under dataland-infrastructure/monitoring/.

| Container | Image (default ver) | Purpose |
| --- | --- | --- |
| dataland-prometheus | prom/prometheus:v2.55.0 | Scrape coordinator + TSDB (30d / 10GB retention) |
| dataland-grafana | grafana/grafana:11.3.0 | Dashboard frontend (DAT-266) |
| dataland-alertmanager | prom/alertmanager:v0.27.0 | Alert routing + fan-out |
| dataland-postgres-exporter | prometheuscommunity/postgres-exporter:v0.15.0 | pg_stat_*, pg_up, connection counts |
| dataland-redis-exporter | oliver006/redis_exporter:v1.62.0 | INFO, memory, stream depths (requirepass-aware) |
| dataland-cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | Per-container CPU / memory / CFS throttling |
| dataland-node-exporter | prom/node-exporter:v1.8.2 | Host CPU, IO wait, disk, memory |

All image tags are overridable via *_VERSION env vars (see .env.example).

host-metrics profile (Linux only)

cadvisor and node-exporter are gated behind the compose profiles: ["host-metrics"]. They mount host /proc, /sys, /, and /var/run in ways a macOS dev box can't satisfy, so local dev skips them. deploy.sh layers the host-metrics profile on the prod VDS. When the profile is off, the dataland-host alert group stays silent (its targets never come up) — that is expected, not a misconfiguration.

Application /metrics endpoints

Every FastAPI service exposes Prometheus text exposition at GET /metrics via an install_metrics(app, service=...) helper in its metrics.py. The middleware is a pure-ASGI wrapper (not Starlette's BaseHTTPMiddleware), chosen deliberately: BaseHTTPMiddleware buffers the whole response body before forwarding, which would break the agent's SSE chat endpoints (/v1/chat/general, /v1/chat/museum). The ASGI middleware only wraps send to capture the status code, so streaming stays unbuffered while the metric is still recorded in a finally block.
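The pure-ASGI approach can be sketched as below. This is an illustration under stated assumptions, not the services' install_metrics() helper: the class name MetricsMiddleware, the streaming_app endpoint and run_demo are hypothetical, and a Counter plus a list stand in for the Prometheus counter and histogram so the example runs with the standard library alone.

```python
import asyncio
import time
from collections import Counter


class MetricsMiddleware:
    """Pure-ASGI wrapper: records status and latency without touching the body."""

    def __init__(self, app, service):
        self.app = app
        self.service = service
        self.counts = Counter()  # stand-in for http_requests_total
        self.durations = []      # stand-in for http_request_duration_seconds

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # reported if the app raises before sending a response
        started = time.perf_counter()

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)  # forwarded immediately, chunk by chunk

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Runs on success and on error, so every request is counted.
            key = (self.service, scope["method"], scope["path"], str(status))
            self.counts[key] += 1
            self.durations.append(time.perf_counter() - started)


async def streaming_app(scope, receive, send):
    """Hypothetical SSE-like endpoint that sends two separate chunks."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"data: a\n\n", "more_body": True})
    await send({"type": "http.response.body", "body": b"data: b\n\n", "more_body": False})


def run_demo():
    """Drive one request through the middleware and return what was observed."""
    forwarded = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        forwarded.append(message["type"])

    middleware = MetricsMiddleware(streaming_app, service="demo")
    scope = {"type": "http", "method": "GET", "path": "/v1/chat/general"}
    asyncio.run(middleware(scope, receive, send))
    return forwarded, dict(middleware.counts)
```

Because send_wrapper awaits the downstream send for each message as it arrives, body chunks are never accumulated, which is the property the SSE endpoints depend on. A production version would likely label by route template rather than raw path to keep label cardinality bounded.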

Common series exported by agent / rag / museum:

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| http_requests_total | counter | service, method, path, status | Every HTTP request, by status code (drives 5xx rate + SLO burn) |
| http_request_duration_seconds | histogram | service, method, path | Request latency (drives HighHTTPLatency) |

Service-specific app metrics:

| Service | Extra metric(s) | Source |
| --- | --- | --- |
| dataland-agent | agent_guardrail_triggered_total{direction,category} (DAT-263) | app/agent/streaming.py increments on input/output guardrail blocks; cardinality bounded by the GuardrailCategory enum |
| dataland-museum | museum_bridge_published, museum_bridge_skipped_no_ticket, museum_bridge_publish_failed (counters), museum_bridge_seconds_since_last_published (gauge) | A custom BridgeCollector snapshots RDCSubscriber.metrics at scrape time (same data as /api/bridge/metrics JSON, DAT-131) |
| dataland-notification-api | DLQ / stream / latency surface (see below) | app/metrics.py rendered from Redis counters |

Agent multiprocess metrics

The agent runs uvicorn with --workers ${UVICORN_WORKERS:-2}, so each worker has its own in-process counters. compose.yml sets PROMETHEUS_MULTIPROC_DIR=/tmp/prom-multiproc (backed by tmpfs, so it's empty on every boot) which switches prometheus_client into multiprocess mode: workers write per-pid files and /metrics aggregates them with MultiProcessCollector at scrape time. rag, museum, and the notification consumer run a single worker, so the default in-process registry is correct there; the same env var opts them in if the worker count is bumped later.

The notification metrics surface (DAT-59 / DAT-107 / DAT-110 / DAT-257)

The notification service is special: its counters live in Redis under notification:counters:*, not in process memory. This is deliberate — the worker process increments them, the api process renders them at GET /metrics, and no in-process state is shared. Counters are best-effort (every increment is wrapped in a broad exception suppressor) so a Redis hiccup never crashes the rules engine.

Counter-derived series (rendered as Prometheus counters):

| Metric | Redis key | Meaning |
| --- | --- | --- |
| notification_processed_total | …:processed | Telemetry entries the consumer ack'd |
| notification_dlq_total | …:dlq | Entries routed to museum:telemetry:dlq |
| notification_push_sent_total | …:push:sent | OneSignal calls returning 2xx |
| notification_push_failed_total | …:push:failed | 4xx after retries OR missing recipient |
| notification_chat_opened_total | …:chat:opened | Agent chat conversations opened |
| notification_chat_failed_total | …:chat:failed | Chat dispatch errors |
| notification_rule_triggered_total{rule} | …:rule:triggered:<name> | Per-rule fire count |
| notification_rule_rate_limited_total{rule} | …:rule:rate_limited:<name> | Per-rule cooldown suppressions |

No SCAN per scrape (DAT-107)

Per-rule counters used to be discovered with a per-scrape SCAN rule:*, which drove constant SCAN traffic against the same Redis the consumer reads from. The module now maintains index SETs (…:index:rule:triggered, …:index:rule:rate_limited) populated in the same pipeline as the INCRBY, then renders via SMEMBERS + MGET. The first render after deploy backfills the index from a one-shot SCAN; orphaned index entries (counter deleted) are pruned on render.
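The index-SET pattern can be sketched as follows. This is a hedged illustration rather than the module's code: record_rule_hit and render_rule_counters are hypothetical names, the key prefix is passed in as a parameter (the demo value below is made up), the one-shot SCAN backfill is omitted, and FakeRedis is an in-memory stand-in that mimics only the redis-py style calls used here (pipeline, incrby, sadd, srem, smembers, mget). With a real client the sketch assumes decode_responses=True so member names come back as strings.

```python
class FakeRedis:
    """In-memory stand-in exposing only the redis-py style calls used below."""

    def __init__(self):
        self.strings, self.sets = {}, {}

    def pipeline(self):
        return self  # commands apply immediately; execute() is a no-op here

    def execute(self):
        return []

    def incrby(self, key, amount=1):
        self.strings[key] = int(self.strings.get(key, 0)) + amount
        return self.strings[key]

    def sadd(self, key, *members):
        self.sets.setdefault(key, set()).update(members)

    def srem(self, key, *members):
        self.sets.get(key, set()).difference_update(members)

    def smembers(self, key):
        return set(self.sets.get(key, set()))

    def mget(self, keys):
        return [self.strings.get(k) for k in keys]


def record_rule_hit(r, prefix, kind, rule):
    """Increment a per-rule counter and index its name in the same pipeline."""
    pipe = r.pipeline()
    pipe.incrby(f"{prefix}:rule:{kind}:{rule}", 1)
    pipe.sadd(f"{prefix}:index:rule:{kind}", rule)
    pipe.execute()


def render_rule_counters(r, prefix, kind):
    """Read all per-rule counters via SMEMBERS + MGET, pruning orphans."""
    index_key = f"{prefix}:index:rule:{kind}"
    rules = sorted(r.smembers(index_key))
    if not rules:
        return {}
    values = r.mget([f"{prefix}:rule:{kind}:{name}" for name in rules])
    # An indexed name whose counter key is gone is an orphan: drop it.
    orphans = [name for name, v in zip(rules, values) if v is None]
    if orphans:
        r.srem(index_key, *orphans)
    return {name: int(v) for name, v in zip(rules, values) if v is not None}
```

Each render costs one SMEMBERS and one MGET regardless of how many other keys live in the same Redis, which is the property that replaces the per-scrape SCAN.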

Gauges and the latency histogram (these are the alerting backbone):

| Metric | Type | Meaning | Alertable? |
| --- | --- | --- | --- |
| notification_dlq_depth | gauge | XLEN of the DLQ stream (DAT-110) | Yes: canonical "rules are silently failing" signal |
| notification_telemetry_consumer_lag{group} | gauge | Per-group unread entries from XINFO GROUPS (DAT-82) | Yes: correct backpressure signal |
| notification_telemetry_stream_depth | gauge | XLEN of the input telemetry stream | No: pegs near the MAXLEN cap in steady state; diagnostic only |
| notification_e2e_latency_seconds | histogram | Producer timestamp → successful OneSignal push-ack (DAT-257) | Yes: p95 SLO |

Each gauge is skipped (its line omitted) when its underlying Redis call errors, rather than emitting a bogus 0 that would mask a Redis outage. The e2e latency histogram is itself stored as Redis keys (notification:hist:e2e_latency_seconds:{sum,count,bucket:<le>}) with bucket boundaries 0.05 … 60s, so the worker can observe and the api can render without sharing memory.
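The "omit the line rather than emit 0" rule can be sketched like this. The function name render_gauge and its parameters are hypothetical; read_value stands in for whatever Redis call backs the gauge (for example an XLEN).

```python
def render_gauge(name, read_value):
    """Return exposition lines for one gauge, or nothing if the read fails.

    Emitting no sample lets a Redis outage show up as a missing series
    instead of a plausible-looking zero.
    """
    try:
        value = read_value()
    except Exception:
        return []
    return [f"# TYPE {name} gauge", f"{name} {value}"]
```

A caller would concatenate the lists from every gauge; a failed read simply contributes nothing to the /metrics body.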

Why stream_depth is not an alert

museum:telemetry is MAXLEN-capped, so XLEN sits near the cap even when the consumer is fully caught up. Alerting on notification_telemetry_stream_depth would false-fire constantly. Always alert on notification_telemetry_consumer_lag (the real unread count) instead — this is the lesson encoded in NotificationConsumerLagHigh.
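A toy model makes the difference concrete. It is not Redis: a deque with maxlen stands in for the capped stream (Redis trimming may be approximate, this is exact), entry IDs are plain integers, and the function name and the cap of 1000 are made up for illustration.

```python
from collections import deque


def depth_and_lag(cap, produced, consumed):
    """Return (stream_depth, consumer_lag) for a capped stream.

    `produced` entries have been appended in total and the consumer has
    read the first `consumed` of them.
    """
    stream = deque(maxlen=cap)
    for entry_id in range(produced):
        stream.append(entry_id)  # oldest entries fall off once the cap is hit
    depth = len(stream)
    lag = sum(1 for entry_id in stream if entry_id >= consumed)
    return depth, lag
```

With a cap of 1000 and 50000 entries produced, a fully caught-up consumer gives depth 1000 and lag 0, while a consumer 300 entries behind gives depth 1000 and lag 300. Depth is identical in both cases, so only lag separates a healthy consumer from a struggling one.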

Scrape targets (monitoring/prometheus.yml)

scrape_interval: 15s, scrape_timeout: 10s, external labels cluster=dataland, environment=production. Container DNS names match the compose container_name fields.

Native exporters + already-shipping routes (zero-code):

  • prometheus (self, localhost:9090)
  • dataland-postgres-exporter:9187
  • dataland-redis-exporter:9121
  • dataland-cadvisor:8080
  • dataland-node-exporter:9100
  • dataland-qdrant:6333/metrics (native Prometheus exposition on the REST port)
  • dataland-notification-api:8080/metrics

FastAPI application targets (scrape cleanly once each service has /metrics wired):

  • dataland-agent:4141/metrics
  • dataland-auth:9000/metrics
  • dataland-rag:4143/metrics
  • dataland-museum:5001/metrics
  • dataland-atlas:4152/metrics

A scrape failure can be the correct signal

The FastAPI application targets (the Phase-1 rows) scrape successfully only after a service exposes /metrics. Until then Prometheus logs scrape failures, which is the intended "not-yet-instrumented" indicator, not a stack bug. up{job="..."} == 0 for >2m is what ServiceDown keys off.

Reaching the UIs

By default Prometheus, Alertmanager and Grafana bind loopback-only (127.0.0.1). Two ways to view:

# 1. SSH tunnel from your workstation:
ssh -L 9090:127.0.0.1:9090 -L 9093:127.0.0.1:9093 -L 3000:127.0.0.1:3000 \
    ege@100.124.170.43
# then open http://localhost:9090 (Prometheus), :9093 (Alertmanager), :3000 (Grafana)  # (1)!
  1. These three local ports map to Prometheus (9090), Alertmanager (9093) and Grafana (3000) on the VDS. The tunnel is the safest path because all three bind loopback-only by default and Prometheus / Alertmanager have no authentication.
# 2. Bind to the tailnet IP (no tunnel needed for tailnet peers), in .env:
PROMETHEUS_PUBLIC_BIND=100.124.170.43        # (1)!
GRAFANA_PUBLIC_BIND=100.124.170.43
ALERTMANAGER_PUBLIC_BIND=100.124.170.43
  1. DAT-291. Use the tailnet IP (100.124.170.43) so only tailnet peers can reach the UIs without a tunnel. Never set these to 0.0.0.0 — that exposes the host's public IP, and Prometheus / Alertmanager have no auth.

Never bind to 0.0.0.0

Prometheus and Alertmanager have no authentication. Grafana's admin password is the only barrier. Binding any of the three to 0.0.0.0 exposes the host's Spectrum public IP. Use the tailnet IP (100.124.170.43) for direct peer access, or the SSH tunnel.

Grafana

GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=***                    # (1)!
  1. Required. The compose :? guard refuses to start Grafana without it (DAT-266), and the admin password is the only barrier in front of Grafana. Generate with openssl rand -hex 24.

The compose :? guard refuses to start Grafana without GRAFANA_ADMIN_PASSWORD set (DAT-266). The Prometheus datasource is provisioned (monitoring/grafana/provisioning/datasources/prometheus.yml, uid prometheus, talks to http://dataland-prometheus:9090 over the docker network). Five dashboards ship pre-provisioned in the Dataland folder from monitoring/grafana/dashboards/:

| Dashboard | File |
| --- | --- |
| Service overview | 01-service-overview.json |
| Notification | 02-notification.json |
| Museum bridge | 03-museum-bridge.json |
| Datastore | 04-datastore.json |
| Host & containers | 05-host-and-containers.json |

Dashboards are git-managed, and the repo is the source of truth (allowUiUpdates: false, updateIntervalSeconds: 30): edit the JSON, git pull, recreate the container. UI edits do not persist past a restart. Grafana's own unified alerting is disabled (manageAlerts: false); alerts live in Prometheus + Alertmanager and are never duplicated.


Alert rules (monitoring/rules/dataland.alerts.yml)

Severity convention: critical = paging incident (service down or data at risk), warning = investigate next business day.

| Alert | Expr (essence) | For | Severity |
| --- | --- | --- | --- |
| ServiceDown | up == 0 | 2m | critical |
| HighHTTPErrorRate | 5xx rate / total > 5% | 5m | warning |
| HighHTTPLatency | p99 http_request_duration_seconds > 1s | 5m | warning |
| NotificationDLQDepthHigh | notification_dlq_depth > 100 | 5m | warning |
| NotificationDLQGrowing | rate(notification_dlq_total[10m]) > 0.1 | 10m | warning |
| NotificationConsumerLagHigh | notification_telemetry_consumer_lag > 100 | 5m | warning |
| NotificationLatencyP95High | p95 notification_e2e_latency_seconds > 5s | 10m | warning |
| NotificationPushFailureRate | push_failed / (sent + failed) > 5% | 10m | warning |
| MuseumBridgeStuck | museum_bridge_seconds_since_last_published > 60 | 5m | warning |
| MuseumBridgePublishFailing | rate(museum_bridge_publish_failed_total[5m]) > 0 | 5m | warning |
| PostgresDown | pg_up == 0 | 1m | critical |
| PostgresConnectionsSaturated | connections > 85% of max_connections | 5m | warning |
| RedisDown | redis_up == 0 | 1m | critical |
| RedisMemoryHigh | used / maxmemory > 85% | 5m | warning |
| QdrantDown | up{job="qdrant"} == 0 | 2m | critical |
| HostMemoryHigh | host mem > 90% | 5m | warning |
| HostDiskHigh | < 10% disk free | 10m | warning |
| ContainerCPUThrottling | rate(container_cpu_cfs_throttled_seconds_total{name=~"dataland-.*"}[5m]) > 0.1 | 10m | warning |
| ContainerMemoryNearLimit | working set > 90% of mem_limit | 5m | warning |
| AgentErrorBudgetFastBurn / Rag… / Museum… | 1h 5xx rate > 14.4 × (1 − 0.99) | 5m | critical (slo=availability) |

The SLO-burn group uses the Google SRE Workbook fast-burn formula: 14.4× burn exhausts the monthly error budget in ~2 days.
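The arithmetic behind those numbers, as a small sketch. The function names are hypothetical, and the 30-day budget window is an assumption standing in for "monthly"; the 0.99 SLO and the 14.4 multiplier come from the rule above.

```python
def error_rate_threshold(slo, burn_rate):
    """5xx ratio at which the budget burns `burn_rate` times faster than allowed."""
    return burn_rate * (1.0 - slo)


def days_to_exhaust(window_days, burn_rate):
    """Days until the whole budget is gone at a sustained burn rate."""
    return window_days / burn_rate


def budget_fraction_consumed(burn_rate, hours, window_days):
    """Share of the window's error budget spent after `hours` at `burn_rate`."""
    return burn_rate * hours / (window_days * 24)
```

For a 0.99 SLO the fast-burn threshold is about 0.144, meaning roughly 14.4% of requests failing over the 1h window. At that rate a 30-day budget lasts 30 / 14.4, a little over 2 days, and a single hour consumes 2% of it.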

Alert annotations point you at Logfire

Most warning annotations explicitly say to pull Logfire traces scoped to service_name = {{ $labels.service }} and look for the dominant exception class. That is the intended Prometheus→Logfire handoff: numeric alert in, root-cause trace out.

Key signals to watch

  • DLQ depth (notification_dlq_depth): the canonical "rules are silently failing" gauge. If it grows, events are landing in museum:telemetry:dlq. Inspect via dataland-notification-api:8080/dlq and replay once the root cause is fixed.
  • Push failure rate: NotificationPushFailureRate watches OneSignal 2xx responses against failures. Spikes usually mean an invalid ONESIGNAL_API_KEY, rate limits, or a recipient-mapping miss (notification.push.missing_recipient events corroborate).
  • Consumer lag: notification_telemetry_consumer_lag is the only trustworthy backpressure signal on the capped museum:telemetry stream. Rising lag → scale CONSUMER_COUNT or investigate a worker bottleneck.
  • Bridge freshness: museum_bridge_seconds_since_last_published > 60 means the RDC Redis bridge has gone silent (no BioSensors event reached museum:telemetry). This is the data-plane health check the green docker ps dot cannot give you.
  • Fallback-JWKS WARN (DAT-286): this is a log signal, not a metric. The agent logs JWT accepted by FALLBACK JWKS provider … local JWKS is missing this signing key; chat auth depends on this external endpoint (app/auth.py) whenever a token is validated only by a non-primary JWKS provider. The whole point of DAT-286 was to mirror the CMS signing key into the local data/extra_jwks.json so the local JWKS is the primary validator; this WARN means the mirror is stale or missing and chat auth has silently reverted to a single point of failure on the external CMS. Treat it as alertable: grep the agent's structured log / Logfire for FALLBACK JWKS.

Recent changes

  • DAT-82: full Prometheus + Grafana + Alertmanager + exporter stack, app /metrics on every FastAPI service, notification_telemetry_consumer_lag from XINFO GROUPS, SLO-burn group.
  • DAT-257: notification_e2e_latency_seconds histogram (producer ts → OneSignal push-ack) + NotificationLatencyP95High.
  • DAT-263: agent_guardrail_triggered_total{direction,category}.
  • DAT-266: provisioned Grafana datasource + 5 dashboards, GRAFANA_ADMIN_PASSWORD boot guard.
  • DAT-286: fallback-JWKS WARN is now an alertable signal (local JWKS mirror of the CMS key removes the chat-auth SPOF).
  • DAT-291: tailnet *_PUBLIC_BIND publishing for the monitoring UIs.

Structured JSONL logs

Alongside the human-readable console/text logs, each service writes a redacted, structured JSONL stream (structured_log.py, public API setup_structured_logging(service=...), log_event(...), redact(...)):

  • One JSON object per line.
  • Daily rotation at UTC midnight, gzip on rotation, 30-day retention.
  • <LOG_DIR>/<service>-YYYY-MM-DD.jsonl (≥ INFO) and <service>-error-YYYY-MM-DD.jsonl (≥ WARNING).
  • LOG_DIR defaults to <repo>/logs, overridden by LOG_FILE_DIR so containers mount it at /app/logs.
  • Mandatory redaction: keys matching secret|token|password|hash|salt|cookie|authorization|api[-_]?key|bearer|jwt|access[-_]?key|client[-_]?secret are replaced with ***REDACTED*** recursively, and inline Bearer … tokens inside string values are scrubbed before write.

This is the durable, on-disk forensic record when Logfire isn't shipping (no token) or when you're SSH'd into the box debugging an incident.
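The redaction rule can be sketched as below. It is an illustration, not structured_log.redact() itself: the function is named redact_sketch to keep that distinction, the key pattern is the one listed above, and case-insensitive key matching, the exact Bearer token character set and the "Bearer ***REDACTED***" replacement form are assumptions.

```python
import re

REDACTED = "***REDACTED***"

# Key pattern as listed in this section; IGNORECASE is an assumption.
_SECRET_KEY = re.compile(
    r"secret|token|password|hash|salt|cookie|authorization|api[-_]?key|"
    r"bearer|jwt|access[-_]?key|client[-_]?secret",
    re.IGNORECASE,
)
# Assumed shape of an inline bearer credential inside a string value.
_BEARER = re.compile(r"(Bearer)\s+[A-Za-z0-9._~+/=-]+")


def redact_sketch(value):
    """Recursively mask secret-shaped keys and inline Bearer tokens."""
    if isinstance(value, dict):
        return {
            key: REDACTED
            if isinstance(key, str) and _SECRET_KEY.search(key)
            else redact_sketch(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact_sketch(item) for item in value]
    if isinstance(value, str):
        return _BEARER.sub(r"\1 " + REDACTED, value)
    return value
```

Matching on key names is deliberately broad (a key such as hashtag would also be masked), which trades a few false positives for not leaking a credential into a log file that is retained for 30 days.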


Health checks vs. depth checks

Every service has a docker healthcheck, and docker compose ps shows its status. The monitoring containers are healthchecked too: Prometheus and Alertmanager via /-/healthy, Grafana via /api/health.

A green dot is a liveness probe, not a depth probe

museum-api's /health returns 200 even when its RDC subscriber is firing connection-timeout errors every few seconds. The notification api's /health only PINGs Redis. For real data-plane health, trust XLEN museum:telemetry, the museum_bridge_seconds_since_last_published gauge, notification_telemetry_consumer_lag, and log tails — not the green dot in docker ps.


Operator one-liners

# Reload Prometheus config without restart:
curl -X POST http://localhost:9090/-/reload    # (1)!

# Reload Alertmanager config without restart:
curl -X POST http://localhost:9093/-/reload

# Validate Prometheus config:
docker run --rm --entrypoint /bin/promtool \
  -v "$PWD/dataland-infrastructure/monitoring:/etc/prometheus:ro" \
  prom/prometheus:v2.55.0 check config /etc/prometheus/prometheus.yml   # (2)!

# Validate alert rules:
docker run --rm --entrypoint /bin/promtool \
  -v "$PWD/dataland-infrastructure/monitoring/rules:/rules:ro" \
  prom/prometheus:v2.55.0 check rules /rules/dataland.alerts.yml

# Validate Alertmanager routing:
docker run --rm --entrypoint /bin/amtool \
  -v "$PWD/dataland-infrastructure/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro" \
  prom/alertmanager:v0.27.0 check-config /etc/alertmanager/alertmanager.yml

# Spot-check a service's exposition directly:
curl -s http://localhost:4141/metrics | grep -E 'http_requests_total|agent_guardrail'   # (3)!
curl -s http://dataland-notification-api:8080/metrics | grep notification_dlq_depth
  1. Hot-reloads config from disk without dropping the TSDB or in-flight scrapes. Same trick for Alertmanager on :9093. Use after editing scrape/route config instead of restarting the container.
  2. Run check config / check rules / check-config before reloading — promtool / amtool pin the same image versions (prom/prometheus:v2.55.0, prom/alertmanager:v0.27.0) so the validator matches the running binary. A bad config silently keeps the old one loaded on reload.
  3. The agent runs multiple uvicorn workers, so /metrics here is the aggregated MultiProcessCollector output. Port 4141 is the agent; 8080 is the notification-api, whose counters come from Redis (notification:counters:*), not process memory.

Wiring Alertmanager receivers

monitoring/alertmanager.yml ships a route tree (critical pages every 1h with group_wait: 0s; default warning tier repeats every 12h) but the critical and default receivers are intentionally empty until an operator adds real channels — a Slack slack_configs block / webhook for warnings and pagerduty_configs for critical. Inhibit rules already suppress per-endpoint flapping when a host or downstream is hard-down, and suppress SLO-burn alerts on services whose dependency is ServiceDown.


See also

  • Notification service — DLQ, replay, OneSignal, the OpsNotifier multi-provider fan-out, complaint pipeline (DAT-213).
  • Agent — guardrail metric, JWKS auth (DAT-286), SSE chat endpoints.
  • Museum bridge — RDC subscriber metrics and the museum:telemetry stream.
  • RAG — search/ingest events and AI-content capture.
  • On-call runbook: dataland-infrastructure/reports/runbook.md — rollback, incident triage, backup/restore. Read this before opening anything else when paged.