Services
This is the catalog of every container in the Dataland stack: what each one does, which repo builds it, the endpoints and ports it exposes, and which datastores it depends on. The whole stack runs as one Docker Compose project (name: dataland, file compose.yml) on the Spark DGX VDS, on a single bridge network — dataland-network — fronted by Cloudflare Tunnels and a Tailscale tailnet.
For the wiring between services (who calls whom over the docker network, public ingress paths, the three-ring layout) see Architecture. This page is the reference table; each row links to a detail page.
Recent changes (2026-06-03 → 2026-06-04)
- DAT-269: every LLM call standardised on gemini-3.5-flash (agent chat + RAG Gemini captioning). RAG vectors use gemini-embedding. Code, .env.example, and the live .env are aligned.
- DAT-291: deploy.sh now fails fast on placeholder prod secrets, the *_PUBLIC_BIND tailnet-publishing convention landed across stateful services, and the docs service (docs.dataland.chat) was added to compose.yml.
- DAT-296: removed the agent /register and /current ticket endpoints; the first POST /v1/chat/museum call registers the ticket, and conversation_id == ticket_id.
How the stack fits together
graph LR
subgraph clients["Clients"]
MOB["Mobile app"]
CUR["Curators"]
OPS["Operators"]
end
subgraph edge["Edge"]
CF["Cloudflare Tunnel<br/>(host systemd cloudflared)"]
TS["Tailscale tailnet<br/>(spark 100.124.170.43)"]
end
MOB --> CF
CUR --> CF
OPS --> TS
CF -->|dataland.chat| AG["agent :4141"]
CF -->|"dataland.chat (dashboard)"| MU["museum-api :4144→5001"]
CF -->|data.dataland.chat| WU["information-webui :4152"]
CF -->|docs.dataland.chat| DC["docs :4148→80"]
AG -->|"HTTP /search, /images/*"| RG["rag :4143"]
AG -->|"HTTP /api/*"| MU
AG -->|"HTTP /v1/ops/*"| NA["notification-api :8080"]
AG -->|JWKS RS256| AU["auth :9000"]
WU -->|"HTTP /ingest/*"| RG
MU -->|"read (mixed msgpack)"| RDC[("RDC redis<br/>external")]
MU -->|"publish museum:telemetry"| RD[("redis")]
NW["notification-worker"] -->|consume telemetry| RD
NW -->|resolve ticket→user| AG
NW -->|OneSignal push| OS["OneSignal"]
NW -->|ops alerts| DSC["Discord / Slack"]
AG --> PG[("postgres")]
AU --> PG
RG --> QD[("qdrant")]
RG --> GCS["GCS buckets"]
The recurring pattern: mobile and visitor traffic enters over Cloudflare; operator/curator traffic enters over the tailnet (or an SSH tunnel for loopback-only ports); service-to-service calls stay on the docker network, using container DNS names (e.g. http://dataland-rag:4143), and never touch host-published ports.
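As a rough illustration of the last point, a client inside the stack might build its base URLs from container names and internal ports rather than host-published ones. The helper and dict below are hypothetical names for this sketch; only the container names and internal ports come from the tables on this page.

```python
from urllib.parse import urljoin

# Container DNS name -> internal port, as listed in the service tables.
# INTERNAL_SERVICES and internal_url are illustrative names, not repo code.
INTERNAL_SERVICES = {
    "dataland-rag": 4143,
    "dataland-museum": 5001,  # host publishes 4144; the container listens on 5001
    "dataland-notification-api": 8080,
    "dataland-auth": 9000,
}


def internal_url(container: str, path: str = "/") -> str:
    """Build a docker-network URL; raises KeyError for unknown containers."""
    port = INTERNAL_SERVICES[container]
    return urljoin(f"http://{container}:{port}", path)


if __name__ == "__main__":
    print(internal_url("dataland-rag", "/search"))
```

Note that the museum API is addressed on 5001 inside the network, since the 4144 mapping only exists on the host.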
App services
Business-logic containers. All are FastAPI/uvicorn unless noted; all build from a sibling repo under /home/cobanov/DATALAND/ on the host.
| Service | Container | Repo | Role | Host port → internal | Datastore deps |
|---|---|---|---|---|---|
| Agent | dataland-agent | dataland-agent | AI guide: museum + general SSE chat, conversation history, internal service API | 4141 → 4141 | Postgres, Redis, RAG, museum-api, notification-api, auth (JWKS) |
| Museum API + dashboard | dataland-museum | dataland-museum | Bridge to external RDC redis; publishes museum:telemetry; mirrors Visitors:ActiveTicketIDs; serves dashboard + chapters.json | 4144 → 5001 | Redis (write), RDC redis (read) |
| Museum simulator | dataland-simulator | dataland-museum | Synthetic telemetry producer for dev/testing (publishes to museum:telemetry) | — (no HTTP) | Redis (write) |
| RAG | dataland-rag | dataland-rag-v2 | Retrieval: hybrid dense+BM25 search + rerank, file/image ingest, Gemini captioning | 4143 → 4143 (loopback + tailnet) | Qdrant, GCS |
| Information WebUI | dataland-atlas | dataland-atlas | "Catalog Studio" CMS: Refik artwork catalog + Museum sections/scenes, RAG live-sync | 4152 → 4152 | SQLite (/app/data), GCS, RAG |
| Notification worker | dataland-notification-worker | dataland-notification | Telemetry rules engine → OneSignal pushes + ops alerts; DLQ + replay | — (no HTTP) | Redis, agent (ticket resolve) |
| Notification API | dataland-notification-api | dataland-notification | Ops surface: rules reload, ticket state, /v1/ops/* ingress, /metrics | 8080 → 8080 (loopback + tailnet) | Redis, agent |
| Auth (JWKS) | dataland-auth | dataland-agent (auth_server.py) | Mobile signup/login + RS256 JWT issuer; serves /.well-known/jwks.json | 9000 → 9000 | Postgres (auth DB) |
| Docs | dataland-docs | dataland-infrastructure (docs/) | This site (MkDocs Material via nginx + basic auth) | 4148 → 80 (loopback + tailnet) | — |
Data plane
Stateful stores. None are published on 0.0.0.0; each is bound to 127.0.0.1 (host tooling + SSH tunnel) and to its *_PUBLIC_BIND tailnet IP (default 100.124.170.43) for direct peer access — never the public Spectrum IP (DAT-73).
| Service | Container | Image | Host port → internal | Auth | Holds |
|---|---|---|---|---|---|
| Redis | dataland-redis | redis:7-alpine | 4145 → 6379 | requirepass (REDIS_PASSWORD) | museum:telemetry stream, ticket state, dedup keys, rate-limit state |
| Postgres | dataland-postgres | postgres:16-alpine | 5432 → 5432 | user/password | Agent conversations DB + auth users DB |
| Qdrant (HTTP) | dataland-qdrant | qdrant/qdrant | 4146 → 6333 | none (tailnet is the boundary) | Vector collections (see below) |
| Qdrant (gRPC) | dataland-qdrant | qdrant/qdrant | 4147 → 6334 | none | Same collections, gRPC client path |
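The binding rule for stateful stores (loopback or tailnet, never 0.0.0.0 or a public IP) can be checked mechanically. The following is a minimal sketch, assuming the tailnet address always falls inside Tailscale's CGNAT range 100.64.0.0/10; the function name is illustrative and not part of deploy.sh.

```python
import ipaddress

# Tailscale hands out node addresses from the CGNAT block 100.64.0.0/10.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def is_safe_bind(addr: str) -> bool:
    """True if addr is loopback or a tailnet address; False for wildcard/public."""
    ip = ipaddress.ip_address(addr)
    return ip.is_loopback or ip in TAILNET


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "100.124.170.43", "0.0.0.0"):
        print(candidate, is_safe_bind(candidate))
```

A check like this could run against each *_PUBLIC_BIND value before the stack starts, although whether the deploy tooling does so is not stated here.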
RDC redis is external
The Refik Anadol data center (RDC) redis — the wearable/sensor source of truth — is not part of this Compose stack. dataland-museum reads it (mixed msgpack + plain-UTF-8 encoding) and bridges what it needs onto the local dataland-redis. Configure it via RDC_REDIS_URL. Never replay museum-simulation-playback Avro into dataland-redis; use a dedicated container instead.
Qdrant collections (RAG)
| Collection | Approx points | Source | Vectors |
|---|---|---|---|
| knowledge | ~4969 | Documents + museum sections/scenes/overview (/ingest/file) | gemini-embedding (dense) + BM25 (sparse) |
| images | ~1485 | Artwork + chapter images, Gemini-captioned (/ingest/image) | gemini-embedding |
| scenes | — | Scene JSON (/ingest/sync) | gemini-embedding |
Knowledge re-ingest
The 20 museum sections + scenes + overview were re-ingested today, taking knowledge from ~4839 → ~4969 points. The RAG /search client read timeout in the agent was raised 10s → 25s after museum-knowledge queries surfaced a 10s-timeout → 3× retry → agent_timeout chain.
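A back-of-the-envelope view of that timeout chain, assuming "3× retry" means three attempts in total with no backoff between them (the page does not spell this out):

```python
def worst_case_wait(read_timeout_s: float, attempts: int) -> float:
    """Upper bound on time spent waiting on RAG if every attempt times out."""
    return read_timeout_s * attempts


if __name__ == "__main__":
    # Old and new read timeouts from the note above; attempts=3 is an assumption.
    print(worst_case_wait(10, 3), worst_case_wait(25, 3))
```

The point of the sketch is that the per-call read timeout multiplies through the retry policy, so any agent-level deadline has to be sized against the product, not the single-call value.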
Observability stack
Metrics, dashboards, and alerting. All UIs default to 127.0.0.1 binding (override *_PUBLIC_BIND for tailnet access); the exporters expose no host ports. The host-metrics exporters are gated behind the host-metrics Compose profile so macOS dev boxes don't try to mount /proc + /sys. See Observability.
| Service | Container | Image | Host port | Role |
|---|---|---|---|---|
| Prometheus | dataland-prometheus | prom/prometheus | 9090 (loopback) | Scrape coordinator + TSDB (30d / 10GB retention) |
| Grafana | dataland-grafana | grafana/grafana | 3000 (loopback) | Dashboards (admin password required, DAT-266) |
| Alertmanager | dataland-alertmanager | prom/alertmanager | 9093 (loopback) | Routes alerts from ./monitoring/rules |
| Postgres exporter | dataland-postgres-exporter | prometheuscommunity/postgres-exporter | — | Postgres internals |
| Redis exporter | dataland-redis-exporter | oliver006/redis_exporter | — | Redis internals (authenticates via REDIS_PASSWORD) |
| cAdvisor | dataland-cadvisor | gcr.io/cadvisor/cadvisor | — | Per-container metrics (host-metrics profile, Linux only) |
| Node exporter | dataland-node-exporter | prom/node-exporter | — | Host metrics (host-metrics profile, Linux only) |
Endpoint quick-reference
The endpoints you reach for most often. Internal/service-to-service paths are marked; everything else is client-facing.
Agent (dataland-agent, port 4141)
uvicorn app.main:app. All /v1 endpoints require a mobile access JWT (Authorization: Bearer <token>, RS256, validated against the auth JWKS). /v1/service/* is service-to-service, gated by AGENT_SERVICE_TOKEN.
| Method | Path | Purpose |
|---|---|---|
| GET | /health, /health/full | Liveness; composite dependency check (RAG + museum) |
| GET | /v1/auth/me | Resolve the JWT to the Dataland user record |
| POST | /v1/auth/logout | Stateless logout acknowledgement |
| POST | /v1/chat/museum | Museum-mode SSE chat. First call registers the ticket; empty first message = instant welcome (DAT-296) |
| POST | /v1/chat/museum/multimodal | Museum chat with an optional image (multipart) |
| POST | /v1/chat/general | General assistant SSE chat |
| POST | /v1/chat/general/multimodal | General chat with an optional image (multipart) |
| GET | /v1/conversations | List the user's conversations |
| GET | /v1/conversations/{id}/messages | Mobile message timeline (?raw=true for provider-native) |
| DELETE | /v1/conversations/{id} | Delete a conversation |
| POST | /v1/conversations/{id}/messages/{mid}/feedback | Like / dislike an assistant message |
| GET | /v1/service/tickets/{ticket_id}/user | Internal: resolve ticket → user + conversation (used by notification) |
| POST | /v1/service/chat/museum | Internal: notification-triggered museum chat |
| GET | /metrics | Prometheus (multiprocess; aggregated across uvicorn workers) |
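The chat rows above can be exercised from a client roughly as follows. This is a minimal sketch: the request field name message and both helper names are assumptions (only ticket_id, the path, and the Bearer header come from this page), and the parser follows the standard text/event-stream framing, not anything specific to the agent's event payloads.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def museum_chat_request(base_url: str, token: str, ticket_id: str,
                        message: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/museum request."""
    # "message" is an assumed field name; check the agent's request schema.
    body = json.dumps({"ticket_id": ticket_id, "message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/museum",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "text/event-stream",
        },
    )


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the joined data payload of each complete SSE event."""
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buf:
                yield "\n".join(buf)
                buf = []
        elif line.startswith("data:"):
            value = line[5:]
            buf.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields (event:, id:) are ignored.
```

Passing an empty message on the first call would correspond to the "instant welcome" case described in the table.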
Ticket endpoints removed (DAT-296)
/v1/tickets/register and /v1/tickets/current no longer exist. The first POST /v1/chat/museum (or /multimodal) with a ticket_id registers the ticket implicitly, and the conversation id is permanently equal to the ticket id. /v1/ops/complaint and /v1/ops/welcome are not agent endpoints — the agent posts to them on the notification service.
The agent's chat tools: get_visitor_vitals, get_room_info, get_scene_flow (DAT-261), search_knowledge, search_artwork_images. It speaks real room/section names — Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON) — never bare codes (DAT-281).
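The code-to-name rule from DAT-281 amounts to a lookup that refuses to fall back to the bare code. A small sketch using the five names listed above; the dict and function names are illustrative and not taken from dataland-agent.

```python
# Room/section codes and display names as listed on this page.
ROOM_NAMES = {
    "GA": "Data Pavilion",
    "GB": "Latent Gallery",
    "GC": "Infinity Room",
    "GD": "The Sanctuary",
    "ON": "Discovery Portal",
}


def room_name(code: str) -> str:
    """Return the visitor-facing name; raise KeyError instead of leaking a bare code."""
    return ROOM_NAMES[code.strip().upper()]


if __name__ == "__main__":
    print(room_name("gc"))
```

Raising on an unknown code is a design choice of this sketch: it makes a missing mapping visible during development instead of letting a raw code reach a visitor.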
Museum API (dataland-museum, host 4144 → container 5001)
uvicorn run:app. Dashboard is gated by a shared password (MUSEUM_PASSWORD) + signed session cookie.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness (pings Redis) |
| GET | / | Dashboard (auth) |
| POST | /api/auth/login, /api/auth/logout, GET /api/auth/session | Dashboard auth |
| GET | /api/active-tickets | Mirror of Visitors:ActiveTicketIDs |
| GET | /api/tickets/{ticket_id}/vitals | Live vitals for a ticket |
| GET | /api/tickets/{ticket_id}/session-status | Session state |
| GET | /api/visitors/{visitor_id}/vitals | Vitals by visitor |
| GET | /api/rooms/{room_code}/chapters, /api/chapters | Chapter catalog |
| GET | /api/bridge/metrics | Telemetry-bridge stats |
| GET | /api/aqi, /api/qsys, /api/audio, /api/onboarding/kiosks | RDC-sourced environment + Q-SYS/audio (plain-UTF-8 decode, DAT-293) |
| GET | /metrics | Prometheus |
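The dashboard's signed session cookie can be pictured with a plain HMAC scheme. The sketch below illustrates the general technique only: the cookie layout, the helper names, and the choice of SHA-256 are assumptions and are not taken from dataland-museum.

```python
import hashlib
import hmac
from typing import Optional


def sign_session(value: str, secret: bytes) -> str:
    """Append an HMAC-SHA256 tag so the server can detect tampering."""
    tag = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}.{tag}"


def verify_session(cookie: str, secret: bytes) -> Optional[str]:
    """Return the original value if the tag matches, else None."""
    value, _, tag = cookie.rpartition(".")
    expected = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    if hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None
```

With a scheme of this shape, rotating the signing secret invalidates every outstanding session at once.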
RAG (dataland-rag, port 4143)
FastAPI dataland-rag v3.0.0. Write/search endpoints require X-API-Key (RAG_API_KEY). Reranker: jina-reranker-v2-base-multilingual (ONNX, the heavy tenant — 12 GB / 12 cores).
| Method | Path | Purpose |
|---|---|---|
| GET | /health, /health/full | Liveness; full dependency check |
| POST | /search | Hybrid dense+BM25 search + rerank (knowledge) |
| POST | /ingest/file | Ingest a document into knowledge |
| POST | /ingest/image | Ingest an image into images (Gemini captioning) |
| POST | /ingest/sync | Ingest scene JSON into scenes |
| POST | /ingest/by-project-slug/{slug} | Project-scoped ingest |
| POST | /images/search/text, /images/search/image | Image search by text / by image |
| GET | /images/{filepath:path}, /images/extracted/{filepath:path} | Serve images |
| POST | /admin/sparse-backfill | Ops-driven BM25 backfill |
| GET | /metrics | Prometheus |
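This page does not say how dataland-rag merges the dense and BM25 result lists before reranking. Reciprocal rank fusion (RRF) is one common way to combine two rankings; the sketch below shows that technique as an assumption, not as the service's actual method. The constant k=60 is the value conventionally used with RRF.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # hypothetical dense-vector hits
    sparse = ["b", "d", "a"]  # hypothetical BM25 hits
    print(reciprocal_rank_fusion([dense, sparse]))
```

A document that appears in both lists outranks one that tops only a single list, which is the usual motivation for hybrid retrieval before a reranker sees the candidates.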
GCS buckets: dataland-public/artworks (+ cobanov-public/chapters for chapter images), dataland-private (documents, museum/scenes).
Information WebUI (dataland-atlas, port 4152)
The Catalog Studio CMS. Auth-gated (INFORMATION_WEBUI_PASSWORD + session cookie); all /api/* routes sit behind require_auth. Persists to SQLite (/app/data) + GCS, and live-syncs to RAG (text → /ingest/file, images → /ingest/image; rag slugs museum-section-<slug> / museum-scene-<slug>, replace-by-slug with UUIDv5 ids).
| Method | Path (prefix /api) | Purpose |
|---|---|---|
| GET/POST/PUT/DELETE | /projects, /projects/{id} | Refik artwork catalog CRUD |
| GET | /projects/search, /catalog, /projects/{id}/catalog | Search + catalog export |
| POST/PUT/DELETE | /projects/{id}/images... | Artwork image upload + management |
| GET/PUT | /museum, /museum/catalog | Museum overview |
| GET/POST/PUT/DELETE | /museum/sections, /museum/sections/{identifier} | Sections CRUD (+ images, catalog) |
| GET/POST/PUT/DELETE | /museum/scenes, /museum/scenes/{identifier} | Scenes CRUD (+ images, catalog) |
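Replace-by-slug works because a UUIDv5 is deterministic: the same slug always maps to the same point id, so a re-sync overwrites the previous entry instead of duplicating it. A minimal sketch, assuming a namespace of uuid.NAMESPACE_URL (the namespace actually used is not stated on this page) and a hypothetical helper name:

```python
import uuid
from typing import Tuple

# Assumed namespace; the real one used by the WebUI/RAG sync is not documented here.
RAG_NAMESPACE = uuid.NAMESPACE_URL


def rag_point_id(kind: str, slug: str) -> Tuple[str, str]:
    """Return (rag_slug, point_id) for a museum section or scene."""
    if kind not in ("section", "scene"):
        raise ValueError(f"unsupported kind: {kind!r}")
    rag_slug = f"museum-{kind}-{slug}"
    return rag_slug, str(uuid.uuid5(RAG_NAMESPACE, rag_slug))


if __name__ == "__main__":
    print(rag_point_id("section", "infinity-room"))
```

Calling the helper twice with the same arguments yields the same id, which is what makes an upsert keyed on that id behave as a replace.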
Notification API (dataland-notification-api, port 8080)
Ops surface. Write/inspect endpoints require the ops bearer token; /health + /metrics are open (and readable by tailnet peers).
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness (pings Redis) |
| GET | /metrics | Prometheus (incl. DLQ + telemetry stream XLEN gauges) |
| GET | /rules, POST /rules/reload, POST /rules/validate | Inspect / hot-swap / dry-run the rules TOML |
| POST | /events | Process a raw telemetry event (debug) |
| GET | /state/{ticket_id} | Per-ticket notification state |
| POST | /v1/ops/complaint | Receive a silent visitor-complaint event from the agent (DAT-213) |
| POST | /v1/ops/welcome | Send the museum welcome push for a ticket (DAT-296) |
Telemetry rules: visit_started (welcome), heart_rate, skin_conductance, artwork_engagement, room_transition, session_flow, experience_tip, visit_ended. Content pushes are gated until Gallery A (welcome exempt); room-transition pushes bypass the 240s cooldown (DAT-287/289/290). Ops alerts fan out to Discord/Slack via a swappable OpsNotifier (OPS_NOTIFIER_PROVIDER, comma-separated multi-provider).
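The gating described above can be read as a small decision function. This is a sketch under stated assumptions: every rule except visit_started is treated as a "content push", room_transition is still subject to the Gallery A gate, and the function name and arguments are hypothetical. The real rules live in the rules TOML.

```python
from typing import Optional

COOLDOWN_SECONDS = 240  # cooldown value quoted on this page


def should_push(rule: str, now: float, last_push_at: Optional[float],
                reached_gallery_a: bool) -> bool:
    """Decide whether a rule may send a push for one ticket."""
    # Welcome (visit_started) is exempt from the Gallery A gate.
    if rule != "visit_started" and not reached_gallery_a:
        return False
    # Room-transition pushes bypass the cooldown.
    if rule == "room_transition":
        return True
    if last_push_at is None:
        return True
    return now - last_push_at >= COOLDOWN_SECONDS


if __name__ == "__main__":
    print(should_push("heart_rate", now=300.0, last_push_at=100.0, reached_gallery_a=True))
```

Timestamps are plain seconds here so the function stays pure and easy to test; a worker would presumably read the last-push time from its per-ticket state in Redis.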
Auth (dataland-auth, port 9000)
auth_server.py (same image as the agent). The RS256 signing authority for mobile JWTs.
| Method | Path | Purpose |
|---|---|---|
| GET | /.well-known/jwks.json |
Public JWKS the agent validates tokens against |
| POST | /api/auth/signup, /api/auth/login |
Mobile account creation + login |
| GET | /api/auth/me |
Verify token, return user |
| GET | / |
Test/login HTML |
JWKS mirror (DAT-286)
The local dataland-auth mirrors the CMS signing key (kid dataland-rs256-1) via data/extra_jwks.json in the auth-data volume, removing the chat-auth single point of failure. The agent logs a WARN when a fallback JWKS provider is the sole validator. Re-run provisioning after a volume wipe or CMS key rotation.
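Conceptually, the mirror means the served key set is the union of the local keys and the entries in data/extra_jwks.json, and a validator picks the key whose kid matches the token header. The sketch below uses the standard JWKS shape ({"keys": [...]}); the function names and the "first source wins" policy for duplicate kids are assumptions, not the auth_server.py implementation.

```python
from typing import Any, Dict, Optional

Jwks = Dict[str, Any]


def merge_jwks(primary: Jwks, extra: Jwks) -> Jwks:
    """Union two JWKS documents by kid; on a duplicate kid the primary wins."""
    merged: Dict[str, Dict[str, Any]] = {}
    for doc in (primary, extra):
        for key in doc.get("keys", []):
            merged.setdefault(key["kid"], key)
    return {"keys": list(merged.values())}


def find_key(jwks: Jwks, kid: str) -> Optional[Dict[str, Any]]:
    """Return the JWK with the given kid, or None if it is absent."""
    return next((k for k in jwks.get("keys", []) if k.get("kid") == kid), None)


if __name__ == "__main__":
    local = {"keys": [{"kid": "local-1", "kty": "RSA"}]}  # hypothetical local key
    mirrored = {"keys": [{"kid": "dataland-rs256-1", "kty": "RSA"}]}
    print(find_key(merge_jwks(local, mirrored), "dataland-rs256-1"))
```

If the mirrored file is lost (for example after a volume wipe), the lookup for dataland-rs256-1 returns None, which matches the note about re-running provisioning.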
A note on "loopback" ports
A port marked loopback is bound to 127.0.0.1:<port> on the host. Service-to-service traffic uses the docker network and is unaffected. Most data-plane and ops ports are additionally published on the tailnet IP (*_PUBLIC_BIND, default 100.124.170.43) for direct peer access. To reach a loopback-only port from your workstation, open an SSH tunnel:
ssh -L 4143:127.0.0.1:4143 \
-L 4145:127.0.0.1:4145 \
-L 4146:127.0.0.1:4146 \
-L 8080:127.0.0.1:8080 \
-L 9090:127.0.0.1:9090 \
-L 3000:127.0.0.1:3000 \
ege@100.124.170.43 # (1)!
- Tunnel terminates on the tailnet IP (100.124.170.43), never the public Spectrum IP: the data plane is bound only to loopback + the *_PUBLIC_BIND tailnet address (DAT-73). Each -L forwards a loopback-only port: RAG (4143), Redis (4145), Qdrant HTTP (4146), notification-api (8080), Prometheus (9090), Grafana (3000). Cloudflare-fronted services (4141/4144/4152/4148) need no tunnel.
Cloudflare-fronted services (agent 4141, museum 4144, webui 4152, docs 4148) are reachable through their public *.dataland.chat hostname. Tailnet peers can also hit any *_PUBLIC_BIND port directly as spark:<port> without a tunnel.
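When only a subset of ports is needed, the tunnel command can be generated from a port list. A minimal sketch: the function name is illustrative, and the default target simply mirrors the SSH example on this page.

```python
import shlex
from typing import Iterable, List

# Loopback-only ports named in the tunnel example on this page.
LOOPBACK_PORTS = {
    "rag": 4143,
    "redis": 4145,
    "qdrant-http": 4146,
    "notification-api": 8080,
    "prometheus": 9090,
    "grafana": 3000,
}


def tunnel_command(ports: Iterable[int], target: str = "ege@100.124.170.43") -> List[str]:
    """Build an ssh argv that forwards each port to the same loopback port."""
    argv = ["ssh"]
    for port in ports:
        argv += ["-L", f"{port}:127.0.0.1:{port}"]
    argv.append(target)
    return argv


if __name__ == "__main__":
    wanted = [LOOPBACK_PORTS["rag"], LOOPBACK_PORTS["prometheus"]]
    print(shlex.join(tunnel_command(wanted)))
```

Returning an argv list instead of a string keeps the result safe to hand to subprocess without shell quoting concerns.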
Detail pages
See also: Architecture · Public ports · Deploy · Observability · Service hosting & relocation