Observability¶
Self-hosted metrics stack removed (2026-06 GCP migration)
The Prometheus + Grafana + Alertmanager + exporters stack has been
decommissioned and removed from compose.yml. Fleet metrics/alerting will
be redone with Cloud Monitoring + Cloud Logging on GCP. Logfire
(per-request tracing) is unaffected. The Prometheus/Grafana sections below
are legacy reference until the Cloud Monitoring setup lands.
Dataland runs two complementary observability stacks that do not overlap:
| Stack | Answers | Source of truth |
|---|---|---|
| Logfire (OpenTelemetry) | "What happened inside this request? Which span was slow? What did the LLM do?" | Per-service traces, AI spans, and structured events |
| Prometheus + Grafana + Alertmanager (DAT-82) | "How is the fleet doing over time? Are we burning the SLO? Is the DLQ growing?" | Numeric time series, dashboards, paging alerts |
Logfire is for trace-level forensics (a single visitor's chat turn, a single push). Prometheus is for capacity, SLOs and burn alerts. When an alert fires in Alertmanager, the alert annotation usually tells you to jump into Logfire scoped to a service_name to find the dominant exception.
flowchart LR
subgraph svc["Python services"]
A[dataland-agent]
AU[dataland-auth]
R[dataland-rag]
M[dataland-museum]
NW[notification-worker]
NA[notification-api]
W[information-webui]
end
subgraph metrics["Numeric pull"]
PE[postgres-exporter]
RE[redis-exporter]
CA[cAdvisor]
NE[node-exporter]
Q[Qdrant /metrics]
end
svc -- "/metrics (text exposition)" --> P[(Prometheus TSDB)]
metrics --> P
P -- rules --> AM[Alertmanager]
P --> G[Grafana]
AM -- "Slack / PagerDuty (operator-wired)" --> oncall((On-call))
svc -- "OTLP spans + events" --> LF[(Logfire)]
Logfire (traces, AI spans, structured events)¶
Every Python service is instrumented through a small per-repo observability.py wrapper around the logfire SDK. The wrapper exposes a uniform internal API (configure_observability(), instrument_fastapi(), instrument_common_clients(), event(), span(), and on the agent set_attributes()), so the call sites look identical across repos even though each service tunes its own instrumentation.
Instrumented services and their service_name:
| service_name | Repo / process | Notes |
|---|---|---|
| dataland-agent | dataland-agent (FastAPI, uvicorn --workers ${UVICORN_WORKERS:-2}) | pydantic-ai + google-genai + SQLAlchemy + httpx instrumented |
| dataland-auth | auth_server.py (hosted in the dataland-agent repo) | RS256 JWKS issuer |
| dataland-rag | dataland-rag-v2 | google-genai instrumented; AI content captured by default in prod (DAT-177), redacted |
| dataland-museum | dataland-museum (museum-api) | RDC bridge spans + auth.login.* events |
| dataland-notification | dataland-notification (both the worker and api processes report under this name) | redis tracing off by default |
| dataland-atlas | dataland-atlas | the Catalog Studio CMS |
| dataland-simulator | dataland-museum simulator (only when the compose.sim.yml overlay is on) | dev/load only |
Enabling telemetry¶
Logfire is token-gated and fail-open. Set a write token in .env:
LOGFIRE_TOKEN=your-logfire-write-token # (1)!
LOGFIRE_ENVIRONMENT=production
LOGFIRE_SEND_TO_LOGFIRE=if-token-present # (2)!
LOGFIRE_SYSTEM_METRICS=true # (3)!
LOGFIRE_CAPTURE_AI_CONTENT=false # (4)!
1. The write token is the master switch. With it empty, every service still boots normally and simply never exports spans (Logfire is token-gated and fail-open).
2. Accepts true / false / if-token-present. The default if-token-present ships spans only when LOGFIRE_TOKEN is set; true errors if there is no token, false never ships.
3. Per-process CPU / mem / GC gauges. Enabled on agent, rag, museum and notification.
4. Attaches prompts + completions to spans (also sets OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT). Keep false on agent / museum / webui in prod. RAG defaults this on in prod (DAT-177); the auto-instrumented google-genai capture path bypasses RAG's redact() scrubber, so never put secrets in prompts.
LOGFIRE_SEND_TO_LOGFIRE is parsed by every service's _send_to_logfire():
- true / 1 / yes / on → always ship (errors if no token).
- false / 0 / no / off → never ship.
- anything else (the default if-token-present) → ship only when LOGFIRE_TOKEN is set.
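The parsing rules above can be sketched as a small function. This is an illustration only, not the services' actual _send_to_logfire(): the name parse_send_to_logfire, the env argument, and the case-insensitive matching are assumptions.

```python
import os

TRUTHY = {"true", "1", "yes", "on"}
FALSY = {"false", "0", "no", "off"}


def parse_send_to_logfire(env=None):
    """Map LOGFIRE_SEND_TO_LOGFIRE to True, False, or 'if-token-present'."""
    env = os.environ if env is None else env
    # Assumption: values are compared case-insensitively after trimming.
    raw = str(env.get("LOGFIRE_SEND_TO_LOGFIRE", "")).strip().lower()
    if raw in TRUTHY:
        return True  # always ship; errors if no token is configured
    if raw in FALSY:
        return False  # never ship
    # Anything else, including an unset variable, falls back to the default.
    return "if-token-present"
```

The three possible return values match what the logfire SDK's configure(send_to_logfire=...) parameter accepts, so a wrapper could pass the result straight through.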
Services start with no token
With an empty LOGFIRE_TOKEN, every service still boots normally; it just doesn't export spans. configure_observability() is idempotent and guarded by a module-level _CONFIGURED flag, so re-importing is safe. event() / span() never raise — span() falls back to a nullcontext() on any error, so observability can never break the request path.
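A minimal sketch of the two fail-open guarantees described here, assuming a simplified shape. The setup and factory parameters are hypothetical stand-ins for the real SDK calls, and marking the module as configured even after a failed setup is an assumption, not confirmed behavior of observability.py.

```python
from contextlib import nullcontext

_CONFIGURED = False


def configure_observability(setup):
    """Run the SDK setup callable at most once per process."""
    global _CONFIGURED
    if _CONFIGURED:
        return False  # already configured; re-imports are harmless
    try:
        setup()
    except Exception:
        pass  # fail open: a broken exporter must not stop the service
    _CONFIGURED = True
    return True


def span(factory, name, **attributes):
    """Return a span context manager, or a no-op one if creation fails."""
    try:
        return factory(name, **attributes)
    except Exception:
        return nullcontext()
```

With this shape, a call site can always write `with span(...)` and the request handler runs whether or not telemetry is healthy.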
AI content capture leaks prompts
LOGFIRE_CAPTURE_AI_CONTENT controls whether prompts and completions are attached to spans (it also sets OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT).
- agent / museum / webui: default false. Keep it off in production.
- dataland-rag: defaults to on in production (DAT-177) so a mis-captioned image has its Gemini prompt + response in the trace. RAG runs every explicit attribute through structured_log.redact() first, scrubbing Bearer … / Basic … tokens and secret-shaped keys. The auto-instrumented google-genai capture path bypasses that scrubber, so do not put secrets in prompts. Flip it off explicitly with LOGFIRE_CAPTURE_AI_CONTENT=false in any environment where prompt content is sensitive.
What gets auto-instrumented¶
instrument_common_clients() wires the relevant integrations per service:
| Integration | agent | rag | museum | notification |
|---|---|---|---|---|
| FastAPI (request spans) | ✅ | ✅ | ✅ | ✅ |
| httpx (outbound HTTP) | ✅ | ✅ | ✅ | ✅ |
| pydantic-ai (LLM run spans) | ✅ | — | — | — |
| google-genai (Gemini calls) | ✅ | ✅ | — | — |
| SQLAlchemy (query spans) | ✅ | — | — | — |
| system metrics (LOGFIRE_SYSTEM_METRICS) | ✅ | ✅ | ✅ | ✅ |
| Redis | opt-in | — | — | opt-in |
Redis tracing is off by default on both the agent and the notification worker because the Redis traffic (session/ticket-state lookups on the agent; XREADGROUP / XACK / INCRBY loops on the notification consumer) is high-volume plumbing with no business signal. The interesting spans are the FastAPI request, the SQLAlchemy queries, the pydantic-ai run, and the hand-written notification.process_telemetry span. Set LOGFIRE_INSTRUMENT_REDIS=true for short-window debugging only.
Enrich the HTTP span instead of nesting
The agent's set_attributes(...) (used in routers/chat.py, routers/conversations.py, middleware.py) attaches domain fields like ticket.id, conversation.id, chat.mode onto the active auto-instrumented HTTP span. Live-view queries can then filter on these without drilling into child spans. It no-ops silently when no span is recording.
Useful filters¶
Every hand-emitted event() / span() is tagged dataland (plus a per-service tag like notification or rag), so tags contains 'dataland' isolates the curated events from the firehose of auto-instrumented spans.
Key event and span names¶
These are the actual names emitted by the code (obs.event(...) / obs.span(...)), grouped by flow. Distributed tracing is on (distributed_tracing=True), so a chat turn that hops agent → rag → museum stitches into one trace.
| Flow | Events / spans (real names) |
|---|---|
| Agent lifecycle | agent.startup, agent.shutdown |
| Agent service resolver | service.ticket.resolve (span), service.ticket.resolved, service.ticket.resolve_not_found |
| Museum lifecycle | museum.startup, museum.shutdown, museum.rdc.subscriber, museum.telemetry.publisher |
| Museum ticket→user | museum.ticket_user.resolve (span), museum.ticket_user.resolved |
| Museum auth | auth.login.success, auth.login.failed, auth.login.rate_limited |
| RAG search | rag.search (span), rag.search.completed, rag.images.search_text / rag.images.search_text_completed, rag.images.search_image / rag.images.search_image_completed |
| RAG ingest | rag.ingest.file / rag.ingest.file_completed / rag.ingest.file_failed, rag.ingest.image / rag.ingest.image_completed, rag.ingest.image_rate_limited, rag.ingest.image_caption_quota_exhausted, rag.ingest.sync / rag.ingest.sync_completed |
| Notification telemetry | notification.process_telemetry (span), notification.telemetry.processed, notification.telemetry.skipped_missing_ticket, notification.telemetry.skipped_disabled |
| Notification rules | notification.rule.triggered, notification.rule.rate_limited, notification.rule.skipped_missing_required, notification.rules.reloaded / notification.rules.reload_failed |
| Notification push | notification.push.send (span), notification.push.sent, notification.push.failed, notification.push.missing_recipient, notification.push.skipped_credentials |
| Notification welcome (DAT-296) | notification.welcome.sent, notification.welcome.unresolved |
| Notification ↔ agent | notification.agent.resolve_ticket (span), notification.agent.ticket_resolved, notification.agent.ticket_not_found, notification.agent.open_chat, notification.agent.chat_opened, notification.agent.chat_wall_clock_timeout |
| Ops alerts (DAT-213) | notification.ops_alert.send (span), notification.ops_alert.sent, notification.ops_alert.failed, notification.ops_alert.multi_subfailure, notification.ops_alert.skipped_no_url, notification.complaint.notifier_disabled |
Prometheus + Grafana + Alertmanager (DAT-82)¶
The monitoring stack lives entirely in this repo: containers in compose.yml, scrape/alert/route config under dataland-infrastructure/monitoring/.
| Container | Image (default ver) | Purpose |
|---|---|---|
| dataland-prometheus | prom/prometheus:v2.55.0 | Scrape coordinator + TSDB (30d / 10GB retention) |
| dataland-grafana | grafana/grafana:11.3.0 | Dashboard frontend (DAT-266) |
| dataland-alertmanager | prom/alertmanager:v0.27.0 | Alert routing + fan-out |
| dataland-postgres-exporter | prometheuscommunity/postgres-exporter:v0.15.0 | pg_stat_*, pg_up, connection counts |
| dataland-redis-exporter | oliver006/redis_exporter:v1.62.0 | INFO, memory, stream depths (requirepass-aware) |
| dataland-cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | Per-container CPU / memory / CFS throttling |
| dataland-node-exporter | prom/node-exporter:v1.8.2 | Host CPU, IO wait, disk, memory |
All image tags are overridable via *_VERSION env vars (see .env.example).
host-metrics profile (Linux only)
cadvisor and node-exporter are gated behind the compose profiles: ["host-metrics"]. They mount host /proc, /sys, /, and /var/run in ways a macOS dev box can't satisfy, so local dev skips them. deploy.sh layers the host-metrics profile on the prod VDS. When the profile is off, the dataland-host alert group stays silent (its targets never come up) — that is expected, not a misconfiguration.
Application /metrics endpoints¶
Every FastAPI service exposes Prometheus text exposition at GET /metrics via an install_metrics(app, service=...) helper in its metrics.py. The middleware is a pure-ASGI wrapper (not Starlette's BaseHTTPMiddleware), chosen deliberately: BaseHTTPMiddleware buffers the whole response body before forwarding, which would break the agent's SSE chat endpoints (/v1/chat/general, /v1/chat/museum). The ASGI middleware only wraps send to capture the status code, so streaming stays unbuffered while the metric is still recorded in a finally block.
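The send-wrapping approach can be sketched with the standard library alone. This is not the services' install_metrics() helper: the class name, the record callback and run_demo() are hypothetical, and a real implementation would feed a Prometheus counter and histogram instead of a callback.

```python
import asyncio
import time


class StatusRecordingMiddleware:
    """Pure-ASGI middleware: wraps send() to capture the status, never buffers."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # hypothetical callback(method, path, status, seconds)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # reported if the app raises before sending a response
        started = time.perf_counter()

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)  # forwarded immediately, chunk by chunk

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed = time.perf_counter() - started
            self.record(scope["method"], scope["path"], status, elapsed)


def run_demo():
    """Drive the middleware with a fake streaming app; return what was seen."""
    forwarded, recorded = [], []

    async def streaming_app(scope, receive, send):
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"data: a\n\n", "more_body": True})
        await send({"type": "http.response.body", "body": b"data: b\n\n", "more_body": False})

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        forwarded.append(message["type"])

    app = StatusRecordingMiddleware(
        streaming_app,
        lambda method, path, status, _seconds: recorded.append((method, path, status)),
    )
    scope = {"type": "http", "method": "GET", "path": "/v1/chat/general"}
    asyncio.run(app(scope, receive, send))
    return forwarded, recorded
```

Each body chunk reaches the client as soon as the app emits it, and the measurement is taken once, after the stream ends. The sketch labels by raw path; a production version would likely use the route template to keep label cardinality bounded.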
Common series exported by agent / rag / museum:
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| http_requests_total | counter | service, method, path, status | Every HTTP request, by status code (drives 5xx rate + SLO burn) |
| http_request_duration_seconds | histogram | service, method, path | Request latency (drives HighHTTPLatency) |
Service-specific app metrics:
| Service | Extra metric(s) | Source |
|---|---|---|
| dataland-agent | agent_guardrail_triggered_total{direction,category} (DAT-263) | app/agent/streaming.py increments on input/output guardrail blocks; cardinality bounded by the GuardrailCategory enum |
| dataland-museum | museum_bridge_published, museum_bridge_skipped_no_ticket, museum_bridge_publish_failed (counters), museum_bridge_seconds_since_last_published (gauge) | A custom BridgeCollector snapshots RDCSubscriber.metrics at scrape time (same data as /api/bridge/metrics JSON, DAT-131) |
| dataland-notification-api | DLQ / stream / latency surface, see below | app/metrics.py rendered from Redis counters |
Agent multiprocess metrics
The agent runs uvicorn with --workers ${UVICORN_WORKERS:-2}, so each worker has its own in-process counters. compose.yml sets PROMETHEUS_MULTIPROC_DIR=/tmp/prom-multiproc (backed by tmpfs, so it's empty on every boot) which switches prometheus_client into multiprocess mode: workers write per-pid files and /metrics aggregates them with MultiProcessCollector at scrape time. rag, museum, and the notification consumer run a single worker, so the default in-process registry is correct there; the same env var opts them in if the worker count is bumped later.
The notification metrics surface (DAT-59 / DAT-107 / DAT-110 / DAT-257)¶
The notification service is special: its counters live in Redis under notification:counters:*, not in process memory. This is deliberate — the worker process increments them, the api process renders them at GET /metrics, and no in-process state is shared. Counters are best-effort (every increment is wrapped in a broad exception suppressor) so a Redis hiccup never crashes the rules engine.
Counter-derived series (rendered as Prometheus counters):
| Metric | Redis key | Meaning |
|---|---|---|
| notification_processed_total | …:processed | Telemetry entries the consumer ack'd |
| notification_dlq_total | …:dlq | Entries routed to museum:telemetry:dlq |
| notification_push_sent_total | …:push:sent | OneSignal calls returning 2xx |
| notification_push_failed_total | …:push:failed | 4xx after retries OR missing recipient |
| notification_chat_opened_total | …:chat:opened | Agent chat conversations opened |
| notification_chat_failed_total | …:chat:failed | Chat dispatch errors |
| notification_rule_triggered_total{rule} | …:rule:triggered:<name> | Per-rule fire count |
| notification_rule_rate_limited_total{rule} | …:rule:rate_limited:<name> | Per-rule cooldown suppressions |
No SCAN per scrape (DAT-107)
Per-rule counters used to be discovered with a per-scrape SCAN rule:*, which drove constant SCAN traffic against the same Redis the consumer reads from. The module now maintains index SETs (…:index:rule:triggered, …:index:rule:rate_limited) populated in the same pipeline as the INCRBY, then renders via SMEMBERS + MGET. The first render after deploy backfills the index from a one-shot SCAN; orphaned index entries (counter deleted) are pruned on render.
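The render step can be sketched without a Redis connection by passing in what SMEMBERS and MGET would return. The function name and the example rule names are hypothetical, and the sketch assumes rule names contain no quotes or backslashes that would need label escaping.

```python
def render_rule_counters(metric, index_members, values):
    """Render per-rule counters from an index set and a rule -> value mapping.

    index_members stands in for the SMEMBERS result, values for the MGET
    result keyed by rule name. Returns (exposition_lines, orphaned_members).
    """
    lines = [f"# TYPE {metric} counter"]
    orphans = []
    for rule in sorted(index_members):
        raw = values.get(rule)
        if raw is None:
            orphans.append(rule)  # counter key is gone; caller prunes the index
            continue
        lines.append(f'{metric}{{rule="{rule}"}} {int(raw)}')
    return lines, orphans
```

The caller would remove the returned orphans from the index set, which keeps every later scrape at one set read plus one multi-get regardless of how many keys exist in Redis.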
Gauges and the latency histogram (these are the alerting backbone):
| Metric | Type | Meaning | Alertable? |
|---|---|---|---|
| notification_dlq_depth | gauge | XLEN of the DLQ stream (DAT-110) | Yes: canonical "rules are silently failing" signal |
| notification_telemetry_consumer_lag{group} | gauge | Per-group unread entries from XINFO GROUPS (DAT-82) | Yes: correct backpressure signal |
| notification_telemetry_stream_depth | gauge | XLEN of the input telemetry stream | No: pegs near the MAXLEN cap in steady state; diagnostic only |
| notification_e2e_latency_seconds | histogram | Producer timestamp → successful OneSignal push-ack (DAT-257) | Yes: p95 SLO |
Each gauge is skipped (its line omitted) when its underlying Redis call errors, rather than emitting a bogus 0 that would mask a Redis outage. The e2e latency histogram is itself stored as Redis keys (notification:hist:e2e_latency_seconds:{sum,count,bucket:<le>}) with bucket boundaries 0.05 … 60s, so the worker can observe and the api can render without sharing memory.
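The skip-on-error rule for gauges can be sketched as follows. The render_gauges name and the reader-callable shape are assumptions for illustration; each callable stands in for one Redis read such as XLEN.

```python
def render_gauges(readers):
    """Render one exposition line per gauge, omitting gauges whose read fails."""
    lines = []
    for name, read in readers.items():
        try:
            value = read()
        except Exception:
            # Omit the line entirely: an absent series is detectable,
            # whereas a fabricated 0 would look like a healthy empty queue.
            continue
        lines.append(f"{name} {float(value)}")
    return lines
```

An alert rule can then treat a missing series as its own condition instead of reading a false zero during a Redis outage.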
Why stream_depth is not an alert
museum:telemetry is MAXLEN-capped, so XLEN sits near the cap even when the consumer is fully caught up. Alerting on notification_telemetry_stream_depth would false-fire constantly. Always alert on notification_telemetry_consumer_lag (the real unread count) instead — this is the lesson encoded in NotificationConsumerLagHigh.
Scrape targets (monitoring/prometheus.yml)¶
scrape_interval: 15s, scrape_timeout: 10s, external labels cluster=dataland, environment=production. Container DNS names match the compose container_name fields.
Native exporters + already-shipping routes (zero-code):
- prometheus (self, localhost:9090)
- dataland-postgres-exporter:9187
- dataland-redis-exporter:9121
- dataland-cadvisor:8080
- dataland-node-exporter:9100
- dataland-qdrant:6333/metrics (native Prometheus exposition on the REST port)
- dataland-notification-api:8080/metrics
FastAPI application targets (scrape cleanly once each service has /metrics wired):
- dataland-agent:4141/metrics
- dataland-auth:9000/metrics
- dataland-rag:4143/metrics
- dataland-museum:5001/metrics
- dataland-atlas:4152/metrics
A scrape failure can be the correct signal
The Phase-1 application rows scrape successfully only after a service exposes /metrics. Until then Prometheus logs scrape failures — which is the intended "not-yet-instrumented" indicator, not a stack bug. up{job="..."} == 0 for >2m is what ServiceDown keys off.
Reaching the UIs¶
By default Prometheus, Alertmanager and Grafana bind loopback-only (127.0.0.1). Two ways to view:
# 1. SSH tunnel from your workstation:
ssh -L 9090:127.0.0.1:9090 -L 9093:127.0.0.1:9093 -L 3000:127.0.0.1:3000 \
ege@100.124.170.43
# then open http://localhost:9090 (Prometheus), :9093 (Alertmanager), :3000 (Grafana) # (1)!
- These three local ports map to Prometheus (9090), Alertmanager (9093) and Grafana (3000) on the VDS. The tunnel is the safest path because all three bind loopback-only by default and Prometheus / Alertmanager have no authentication.
# 2. Bind to the tailnet IP (no tunnel needed for tailnet peers), in .env:
PROMETHEUS_PUBLIC_BIND=100.124.170.43 # (1)!
GRAFANA_PUBLIC_BIND=100.124.170.43
ALERTMANAGER_PUBLIC_BIND=100.124.170.43
- DAT-291. Use the tailnet IP (100.124.170.43) so only tailnet peers can reach the UIs without a tunnel. Never set these to 0.0.0.0: that exposes the host's public IP, and Prometheus / Alertmanager have no auth.
Never bind to 0.0.0.0
Prometheus and Alertmanager have no authentication. Grafana's admin password is the only barrier. Binding any of the three to 0.0.0.0 exposes the host's Spectrum public IP. Use the tailnet IP (100.124.170.43) for direct peer access, or the SSH tunnel.
Grafana¶
- GRAFANA_ADMIN_PASSWORD is required. The compose :? guard refuses to start Grafana without it (DAT-266), and the admin password is the only barrier in front of Grafana. Generate with openssl rand -hex 24.
The compose :? guard refuses to start Grafana without GRAFANA_ADMIN_PASSWORD set (DAT-266). The Prometheus datasource is provisioned (monitoring/grafana/provisioning/datasources/prometheus.yml, uid prometheus, talks to http://dataland-prometheus:9090 over the docker network). Five dashboards ship pre-provisioned in the Dataland folder from monitoring/grafana/dashboards/:
| Dashboard | File |
|---|---|
| Service overview | 01-service-overview.json |
| Notification | 02-notification.json |
| Museum bridge | 03-museum-bridge.json |
| Datastore | 04-datastore.json |
| Host & containers | 05-host-and-containers.json |
Dashboards are git-managed source of truth (allowUiUpdates: false, updateIntervalSeconds: 30): edit the JSON, git pull, recreate the container. UI edits do not persist past a restart. Grafana's own unified alerting is disabled (manageAlerts: false) — alerts live in Prometheus + Alertmanager, never duplicated.
Alert rules (monitoring/rules/dataland.alerts.yml)¶
Severity convention: critical = paging incident (service down or data at risk), warning = investigate next business day.
| Alert | Expr (essence) | For | Severity |
|---|---|---|---|
| ServiceDown | up == 0 | 2m | critical |
| HighHTTPErrorRate | 5xx rate / total > 5% | 5m | warning |
| HighHTTPLatency | p99 http_request_duration_seconds > 1s | 5m | warning |
| NotificationDLQDepthHigh | notification_dlq_depth > 100 | 5m | warning |
| NotificationDLQGrowing | rate(notification_dlq_total[10m]) > 0.1 | 10m | warning |
| NotificationConsumerLagHigh | notification_telemetry_consumer_lag > 100 | 5m | warning |
| NotificationLatencyP95High | p95 notification_e2e_latency_seconds > 5s | 10m | warning |
| NotificationPushFailureRate | push_failed / (sent + failed) > 5% | 10m | warning |
| MuseumBridgeStuck | museum_bridge_seconds_since_last_published > 60 | 5m | warning |
| MuseumBridgePublishFailing | rate(museum_bridge_publish_failed_total[5m]) > 0 | 5m | warning |
| PostgresDown | pg_up == 0 | 1m | critical |
| PostgresConnectionsSaturated | connections > 85% of max_connections | 5m | warning |
| RedisDown | redis_up == 0 | 1m | critical |
| RedisMemoryHigh | used / maxmemory > 85% | 5m | warning |
| QdrantDown | up{job="qdrant"} == 0 | 2m | critical |
| HostMemoryHigh | host mem > 90% | 5m | warning |
| HostDiskHigh | < 10% disk free | 10m | warning |
| ContainerCPUThrottling | rate(container_cpu_cfs_throttled_seconds_total{name=~"dataland-.*"}[5m]) > 0.1 | 10m | warning |
| ContainerMemoryNearLimit | working set > 90% of mem_limit | 5m | warning |
| AgentErrorBudgetFastBurn / Rag… / Museum… | 1h 5xx rate > 14.4 × (1 − 0.99) | 5m | critical (slo=availability) |
The SLO-burn group uses the Google SRE Workbook fast-burn formula: 14.4× burn exhausts the monthly error budget in ~2 days.
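The fast-burn numbers can be checked with a few lines of arithmetic. The 30-day budget window is an assumption for "monthly"; the 0.99 target and the 14.4 multiplier come from the rule expression.

```python
SLO_TARGET = 0.99   # availability objective from the alert expression
BURN_RATE = 14.4    # fast-burn multiplier
WINDOW_DAYS = 30    # assumed length of the monthly budget window

# Fraction of requests allowed to fail over the whole window.
error_budget = 1 - SLO_TARGET

# 1h 5xx ratio above which the alert condition holds (about 0.144, i.e. 14.4%).
fast_burn_threshold = BURN_RATE * error_budget

# Burning at 14.4x the sustainable rate empties the budget in 30 / 14.4 days.
days_to_exhaust = WINDOW_DAYS / BURN_RATE

# Share of the whole budget consumed by one hour at that burn rate.
budget_spent_in_1h = BURN_RATE / (WINDOW_DAYS * 24)
```

So one hour at the threshold consumes about 2% of the monthly budget, and sustaining it would exhaust the budget in roughly 2.08 days.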
Alert annotations point you at Logfire
Most warning annotations explicitly say to pull Logfire traces scoped to service_name = {{ $labels.service }} and look for the dominant exception class. That is the intended Prometheus→Logfire handoff: numeric alert in, root-cause trace out.
Key signals to watch¶
- DLQ depth (notification_dlq_depth): the canonical "rules are silently failing" gauge. If it grows, events are landing in museum:telemetry:dlq. Inspect via dataland-notification-api:8080/dlq and replay once the root cause is fixed.
- Push failure rate: NotificationPushFailureRate watches OneSignal 2xx vs failure. Spikes usually mean ONESIGNAL_API_KEY invalidity, rate limits, or a recipient-mapping miss (notification.push.missing_recipient events corroborate).
- Consumer lag: notification_telemetry_consumer_lag is the only trustworthy backpressure signal on the capped museum:telemetry stream. Rising lag → scale CONSUMER_COUNT or investigate a worker bottleneck.
- Bridge freshness: museum_bridge_seconds_since_last_published > 60 means the RDC redis bridge has gone silent (no BioSensors event reached museum:telemetry). This is the data-plane health check the green docker ps dot cannot give you.
- Fallback-JWKS WARN (DAT-286): this is a log signal, not a metric. The agent logs JWT accepted by FALLBACK JWKS provider … local JWKS is missing this signing key; chat auth depends on this external endpoint (app/auth.py) whenever a token is validated only by a non-primary JWKS provider. The whole point of DAT-286 was to mirror the CMS signing key into the local data/extra_jwks.json so the local JWKS is the primary validator; this WARN means the mirror is stale or missing and chat auth has silently reverted to a single point of failure on the external CMS. Treat it as alertable: grep the agent's structured log / Logfire for FALLBACK JWKS.
Recent changes
- DAT-82: full Prometheus + Grafana + Alertmanager + exporter stack, app /metrics on every FastAPI service, notification_telemetry_consumer_lag from XINFO GROUPS, SLO-burn group.
- DAT-257: notification_e2e_latency_seconds histogram (producer ts → OneSignal push-ack) + NotificationLatencyP95High.
- DAT-263: agent_guardrail_triggered_total{direction,category}.
- DAT-266: provisioned Grafana datasource + 5 dashboards, GRAFANA_ADMIN_PASSWORD boot guard.
- DAT-286: fallback-JWKS WARN is now an alertable signal (local JWKS mirror of the CMS key removes the chat-auth SPOF).
- DAT-291: tailnet *_PUBLIC_BIND publishing for the monitoring UIs.
Structured JSONL logs¶
Alongside the human-readable console/text logs, each service writes a redacted, structured JSONL stream (structured_log.py, public API setup_structured_logging(service=...), log_event(...), redact(...)):
- One JSON object per line.
- Daily rotation at UTC midnight, gzip on rotation, 30-day retention.
- <LOG_DIR>/<service>-YYYY-MM-DD.jsonl (≥ INFO) and <service>-error-YYYY-MM-DD.jsonl (≥ WARNING).
- LOG_DIR defaults to <repo>/logs, overridden by LOG_FILE_DIR so containers mount it at /app/logs.
- Mandatory redaction: keys matching secret|token|password|hash|salt|cookie|authorization|api[-_]?key|bearer|jwt|access[-_]?key|client[-_]?secret are replaced with ***REDACTED*** recursively, and inline Bearer … tokens inside string values are scrubbed before write.
This is the durable, on-disk forensic record when Logfire isn't shipping (no token) or when you're SSH'd into the box debugging an incident.
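The redaction rule can be sketched as a recursive function built on the key pattern listed above. This is not the code in structured_log.py: case-insensitive key matching, the inline Bearer pattern, and the replacement text for inline tokens are assumptions.

```python
import re

SENSITIVE_KEY = re.compile(
    r"secret|token|password|hash|salt|cookie|authorization|api[-_]?key|bearer|jwt"
    r"|access[-_]?key|client[-_]?secret",
    re.IGNORECASE,  # assumption: key matching ignores case
)
INLINE_BEARER = re.compile(r"Bearer\s+\S+")  # assumed shape of an inline token
MASK = "***REDACTED***"


def redact(value):
    """Return a copy of value with sensitive keys and inline tokens masked."""
    if isinstance(value, dict):
        return {
            key: MASK if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return INLINE_BEARER.sub(f"Bearer {MASK}", value)
    return value
```

Because the key pattern is a substring match, a key such as session_token_id is masked as well, which errs on the side of over-redaction.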
Health checks vs. depth checks¶
Every service has a docker healthcheck, so docker compose ps shows status. The monitoring containers are health-checked too: Prometheus and Alertmanager via /-/healthy, Grafana via /api/health.
A green dot is a liveness probe, not a depth probe
museum-api's /health returns 200 even when its RDC subscriber is firing connection-timeout errors every few seconds. The notification api's /health only PINGs Redis. For real data-plane health, trust XLEN museum:telemetry, the museum_bridge_seconds_since_last_published gauge, notification_telemetry_consumer_lag, and log tails — not the green dot in docker ps.
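A depth probe along these lines can be sketched by reading gauges out of scraped /metrics text. The function names are hypothetical, the thresholds are borrowed from the MuseumBridgeStuck and NotificationConsumerLagHigh rules, and the parser assumes sample lines carry no trailing timestamp.

```python
def read_gauge(exposition, name):
    """Return the first sample value for a gauge (labelled or not), or None."""
    for line in exposition.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        series, _, value = line.rpartition(" ")
        if series == name or series.startswith(name + "{"):
            return float(value)
    return None


def depth_problems(exposition):
    """List data-plane problems that a liveness probe would not show."""
    limits = {
        "museum_bridge_seconds_since_last_published": 60,
        "notification_telemetry_consumer_lag": 100,
    }
    problems = []
    for name, limit in limits.items():
        value = read_gauge(exposition, name)
        if value is None:
            problems.append(f"{name}: missing")  # skipped gauge or failed scrape
        elif value > limit:
            problems.append(f"{name}: {value} > {limit}")
    return problems
```

The input would be the concatenated output of the museum and notification-api /metrics endpoints, since the two gauges are exported by different services.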
Operator one-liners¶
# Reload Prometheus config without restart:
curl -X POST http://localhost:9090/-/reload # (1)!
# Reload Alertmanager config without restart:
curl -X POST http://localhost:9093/-/reload
# Validate Prometheus config:
docker run --rm --entrypoint /bin/promtool \
-v "$PWD/dataland-infrastructure/monitoring:/etc/prometheus:ro" \
prom/prometheus:v2.55.0 check config /etc/prometheus/prometheus.yml # (2)!
# Validate alert rules:
docker run --rm --entrypoint /bin/promtool \
-v "$PWD/dataland-infrastructure/monitoring/rules:/rules:ro" \
prom/prometheus:v2.55.0 check rules /rules/dataland.alerts.yml
# Validate Alertmanager routing:
docker run --rm --entrypoint /bin/amtool \
-v "$PWD/dataland-infrastructure/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro" \
prom/alertmanager:v0.27.0 check-config /etc/alertmanager/alertmanager.yml
# Spot-check a service's exposition directly:
curl -s http://localhost:4141/metrics | grep -E 'http_requests_total|agent_guardrail' # (3)!
curl -s http://dataland-notification-api:8080/metrics | grep notification_dlq_depth
1. Hot-reloads config from disk without dropping the TSDB or in-flight scrapes. Same trick for Alertmanager on :9093. Use after editing scrape/route config instead of restarting the container.
2. Run check config / check rules / check-config before reloading. promtool / amtool pin the same image versions (prom/prometheus:v2.55.0, prom/alertmanager:v0.27.0) so the validator matches the running binary. A bad config silently keeps the old one loaded on reload.
3. The agent runs multiple uvicorn workers, so /metrics here is the aggregated MultiProcessCollector output. Port 4141 is the agent; 8080 is the notification-api, whose counters come from Redis (notification:counters:*), not process memory.
Wiring Alertmanager receivers
monitoring/alertmanager.yml ships a route tree (critical pages every 1h with group_wait: 0s; default warning tier repeats every 12h) but the critical and default receivers are intentionally empty until an operator adds real channels — a Slack slack_configs block / webhook for warnings and pagerduty_configs for critical. Inhibit rules already suppress per-endpoint flapping when a host or downstream is hard-down, and suppress SLO-burn alerts on services whose dependency is ServiceDown.
See also¶
- Notification service: DLQ, replay, OneSignal, the OpsNotifier multi-provider fan-out, complaint pipeline (DAT-213).
- Agent: guardrail metric, JWKS auth (DAT-286), SSE chat endpoints.
- Museum bridge: RDC subscriber metrics and the museum:telemetry stream.
- RAG: search/ingest events and AI-content capture.
- On-call runbook: dataland-infrastructure/reports/runbook.md (rollback, incident triage, backup/restore). Read this before opening anything else when paged.