Services

This is the catalog of every container in the Dataland stack: what each one does, which repo builds it, the endpoints and ports it exposes, and which datastores it depends on. The whole stack runs as one Docker Compose project (name: dataland, file compose.yml) on the Spark DGX VDS, on a single bridge network — dataland-network — fronted by Cloudflare Tunnels and a Tailscale tailnet.

For the wiring between services (who calls whom over the docker network, public ingress paths, the three-ring layout) see Architecture. This page is the reference table; each row links to a detail page.

Recent changes (2026-06-03 → 2026-06-04)

  • DAT-269: every LLM call standardised on gemini-3.5-flash (agent chat + RAG Gemini captioning). RAG vectors use gemini-embedding. Code, .env.example, and the live .env are aligned.
  • DAT-291: deploy.sh now fails fast on placeholder prod secrets, the *_PUBLIC_BIND tailnet-publishing convention landed across stateful services, and the docs service (docs.dataland.chat) was added to compose.yml.
  • DAT-296: the agent /register and /current ticket endpoints were removed. The first POST /v1/chat/museum call registers the ticket, and conversation_id == ticket_id.
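
As a rough illustration of the DAT-291 fail-fast idea, here is a minimal Python sketch of a placeholder-secret check. deploy.sh is a shell script and its actual checks are not shown on this page; the marker list and the function name `find_placeholder_secrets` are assumptions made for the example.

```python
# Hypothetical markers; the real list used by deploy.sh is not documented here.
PLACEHOLDER_MARKERS = ("changeme", "change-me", "placeholder", "example")


def find_placeholder_secrets(env: dict[str, str]) -> list[str]:
    """Return the names of variables that are empty or look like placeholders."""
    bad = []
    for name, value in env.items():
        lowered = value.strip().lower()
        if not lowered or any(marker in lowered for marker in PLACEHOLDER_MARKERS):
            bad.append(name)
    return sorted(bad)


if __name__ == "__main__":
    offenders = find_placeholder_secrets(
        {"REDIS_PASSWORD": "changeme", "RAG_API_KEY": "s3cr3t-value"}
    )
    # A deploy script would exit non-zero here instead of printing.
    print("placeholder secrets:", offenders)
```

A deploy step built on this would abort before starting any container whenever the returned list is non-empty.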

How the stack fits together

graph LR
  subgraph clients["Clients"]
    MOB["Mobile app"]
    CUR["Curators"]
    OPS["Operators"]
  end

  subgraph edge["Edge"]
    CF["Cloudflare Tunnel<br/>(host systemd cloudflared)"]
    TS["Tailscale tailnet<br/>(spark 100.124.170.43)"]
  end

  MOB --> CF
  CUR --> CF
  OPS --> TS

  CF -->|dataland.chat| AG["agent :4141"]
  CF -->|"dataland.chat (dashboard)"| MU["museum-api :4144→5001"]
  CF -->|data.dataland.chat| WU["information-webui :4152"]
  CF -->|docs.dataland.chat| DC["docs :4148→80"]

  AG -->|"HTTP /search, /images/*"| RG["rag :4143"]
  AG -->|"HTTP /api/*"| MU
  AG -->|"HTTP /v1/ops/*"| NA["notification-api :8080"]
  AG -->|JWKS RS256| AU["auth :9000"]
  WU -->|"HTTP /ingest/*"| RG

  MU -->|"read (mixed msgpack)"| RDC[("RDC redis<br/>external")]
  MU -->|"publish museum:telemetry"| RD[("redis")]
  NW["notification-worker"] -->|consume telemetry| RD
  NW -->|resolve ticket→user| AG
  NW -->|OneSignal push| OS["OneSignal"]
  NW -->|ops alerts| DSC["Discord / Slack"]

  AG --> PG[("postgres")]
  AU --> PG
  RG --> QD[("qdrant")]
  RG --> GCS["GCS buckets"]

The recurring pattern: mobile and visitor traffic enters over Cloudflare; operator/curator traffic enters over the tailnet (or an SSH tunnel for loopback-only ports); and service-to-service calls stay on the docker network, using container DNS names (e.g. http://dataland-rag:4143) and never touching host-published ports.
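
The container-DNS convention can be sketched as a small lookup. The container names and in-container ports come from this page; the `INTERNAL_SERVICES` mapping and the `internal_url` helper are hypothetical and only illustrate why a caller should target the internal port (5001 for museum-api) rather than the host-published one (4144).

```python
# Container DNS name and in-container port per app service, as listed on this
# page. The mapping and helper are illustrative, not part of any Dataland repo.
INTERNAL_SERVICES = {
    "agent": ("dataland-agent", 4141),
    "museum-api": ("dataland-museum", 5001),  # host 4144 maps to 5001 inside
    "rag": ("dataland-rag", 4143),
    "notification-api": ("dataland-notification-api", 8080),
    "auth": ("dataland-auth", 9000),
}


def internal_url(service: str, path: str = "") -> str:
    """Build a docker-network URL for a service, never a host-published one."""
    container, port = INTERNAL_SERVICES[service]
    return f"http://{container}:{port}{path}"


if __name__ == "__main__":
    print(internal_url("rag", "/search"))
    print(internal_url("museum-api", "/health"))
```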

App services

Business-logic containers. All are FastAPI/uvicorn unless noted; all build from a sibling repo under /home/cobanov/DATALAND/ on the host.

| Service | Container | Repo | Role | Host port → internal | Datastore deps |
|---|---|---|---|---|---|
| Agent | dataland-agent | dataland-agent | AI guide: museum + general SSE chat, conversation history, internal service API | 4141 → 4141 | Postgres, Redis, RAG, museum-api, notification-api, auth (JWKS) |
| Museum API + dashboard | dataland-museum | dataland-museum | Bridge to external RDC redis; publishes museum:telemetry; mirrors Visitors:ActiveTicketIDs; serves dashboard + chapters.json | 4144 → 5001 | Redis (write), RDC redis (read) |
| Museum simulator | dataland-simulator | dataland-museum | Synthetic telemetry producer for dev/testing (publishes to museum:telemetry) | none (no HTTP) | Redis (write) |
| RAG | dataland-rag | dataland-rag-v2 | Retrieval: hybrid dense+BM25 search + rerank, file/image ingest, Gemini captioning | 4143 → 4143 (loopback + tailnet) | Qdrant, GCS |
| Information WebUI | dataland-atlas | dataland-atlas | "Catalog Studio" CMS: Refik artwork catalog + Museum sections/scenes, RAG live-sync | 4152 → 4152 | SQLite (/app/data), GCS, RAG |
| Notification worker | dataland-notification-worker | dataland-notification | Telemetry rules engine → OneSignal pushes + ops alerts; DLQ + replay | none (no HTTP) | Redis, agent (ticket resolve) |
| Notification API | dataland-notification-api | dataland-notification | Ops surface: rules reload, ticket state, /v1/ops/* ingress, /metrics | 8080 → 8080 (loopback + tailnet) | Redis, agent |
| Auth (JWKS) | dataland-auth | dataland-agent (auth_server.py) | Mobile signup/login + RS256 JWT issuer; serves /.well-known/jwks.json | 9000 → 9000 | Postgres (auth DB) |
| Docs | dataland-docs | dataland-infrastructure (docs/) | This site (MkDocs Material via nginx + basic auth) | 4148 → 80 (loopback + tailnet) | |

Data plane

Stateful stores. None are published on 0.0.0.0; each is bound to 127.0.0.1 (host tooling + SSH tunnel) and to its *_PUBLIC_BIND tailnet IP (default 100.124.170.43) for direct peer access — never the public Spectrum IP (DAT-73).

| Service | Container | Image | Host port → internal | Auth | Holds |
|---|---|---|---|---|---|
| Redis | dataland-redis | redis:7-alpine | 4145 → 6379 | requirepass (REDIS_PASSWORD) | museum:telemetry stream, ticket state, dedup keys, rate-limit state |
| Postgres | dataland-postgres | postgres:16-alpine | 5432 → 5432 | user/password | Agent conversations DB + auth users DB |
| Qdrant (HTTP) | dataland-qdrant | qdrant/qdrant | 4146 → 6333 | none (tailnet is the boundary) | Vector collections (see below) |
| Qdrant (gRPC) | dataland-qdrant | qdrant/qdrant | 4147 → 6334 | none | Same collections, gRPC client path |
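
The binding rule for stateful stores (loopback plus the tailnet address, never 0.0.0.0) can be sketched as a helper that produces Compose-style `ip:host:container` port entries. This is a minimal sketch: the function name `port_bindings` is hypothetical, and how compose.yml actually expands each *_PUBLIC_BIND variable is not shown on this page.

```python
import ipaddress


def port_bindings(host_port: int, container_port: int,
                  public_bind: str = "100.124.170.43") -> list[str]:
    """Return Compose-style `ip:host:container` entries for a stateful service."""
    addr = ipaddress.ip_address(public_bind)
    if addr.is_unspecified:
        # 0.0.0.0 (or ::) would publish the store on every interface.
        raise ValueError("refusing to publish a data-plane port on all interfaces")
    return [
        f"127.0.0.1:{host_port}:{container_port}",      # host tooling + SSH tunnel
        f"{public_bind}:{host_port}:{container_port}",  # direct tailnet peer access
    ]


if __name__ == "__main__":
    for entry in port_bindings(4145, 6379):  # Redis row from the table above
        print(entry)
```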

RDC redis is external

The Refik Anadol data center (RDC) redis — the wearable/sensor source of truth — is not part of this Compose stack. dataland-museum reads it (mixed msgpack + plain-UTF-8 encoding) and bridges what it needs onto the local dataland-redis. Configure it via RDC_REDIS_URL. Never replay museum-simulation-playback Avro into dataland-redis; use a dedicated container instead.

Qdrant collections (RAG)

| Collection | Approx points | Source | Vectors |
|---|---|---|---|
| knowledge | ~4969 | Documents + museum sections/scenes/overview (/ingest/file) | gemini-embedding (dense) + BM25 (sparse) |
| images | ~1485 | Artwork + chapter images, Gemini-captioned (/ingest/image) | gemini-embedding |
| scenes | | Scene JSON (/ingest/sync) | gemini-embedding |

Knowledge re-ingest

The 20 museum sections + scenes + overview were re-ingested today, taking knowledge from ~4839 → ~4969 points. The RAG /search client read timeout in the agent was raised 10s → 25s after museum-knowledge queries surfaced a 10s-timeout → 3× retry → agent_timeout chain.
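
To make the timeout chain concrete, here is a minimal arithmetic sketch. It assumes "3× retry" means three attempts in total with no backoff between them; the agent's actual retry policy and overall deadline are not documented on this page, and `worst_case_wait` is a hypothetical helper.

```python
def worst_case_wait(read_timeout_s: float, attempts: int) -> float:
    """Upper bound on time spent waiting if every attempt hits the read timeout."""
    return read_timeout_s * attempts


if __name__ == "__main__":
    # Old setting: each slow /search call burned the full 10s, three times over.
    print(worst_case_wait(10, 3))
```

Under those assumptions, a query that needs slightly more than 10s never succeeds with the old setting, no matter how many times it is retried, whereas a single attempt with a 25s read timeout can complete.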

Observability stack

Metrics, dashboards, and alerting. All UIs default to 127.0.0.1 binding (override *_PUBLIC_BIND for tailnet access); the exporters expose no host ports. The host-metrics exporters are gated behind the host-metrics Compose profile so macOS dev boxes don't try to mount /proc + /sys. See Observability.

| Service | Container | Image | Host port | Role |
|---|---|---|---|---|
| Prometheus | dataland-prometheus | prom/prometheus | 9090 (loopback) | Scrape coordinator + TSDB (30d / 10GB retention) |
| Grafana | dataland-grafana | grafana/grafana | 3000 (loopback) | Dashboards (admin password required, DAT-266) |
| Alertmanager | dataland-alertmanager | prom/alertmanager | 9093 (loopback) | Routes alerts from ./monitoring/rules |
| Postgres exporter | dataland-postgres-exporter | prometheuscommunity/postgres-exporter | none | Postgres internals |
| Redis exporter | dataland-redis-exporter | oliver006/redis_exporter | none | Redis internals (authenticates via REDIS_PASSWORD) |
| cAdvisor | dataland-cadvisor | gcr.io/cadvisor/cadvisor | none | Per-container metrics (host-metrics profile, Linux only) |
| Node exporter | dataland-node-exporter | prom/node-exporter | none | Host metrics (host-metrics profile, Linux only) |

Endpoint quick-reference

The endpoints you reach for most often. Internal/service-to-service paths are marked; everything else is client-facing.

Agent (dataland-agent, port 4141)

uvicorn app.main:app. All client-facing /v1 endpoints require a mobile access JWT (Authorization: Bearer <token>, RS256, validated against the auth JWKS). The /v1/service/* endpoints are service-to-service and gated by AGENT_SERVICE_TOKEN.

| Method | Path | Purpose |
|---|---|---|
| GET | /health, /health/full | Liveness; composite dependency check (RAG + museum) |
| GET | /v1/auth/me | Resolve the JWT to the Dataland user record |
| POST | /v1/auth/logout | Stateless logout acknowledgement |
| POST | /v1/chat/museum | Museum-mode SSE chat. First call registers the ticket; empty first message = instant welcome (DAT-296) |
| POST | /v1/chat/museum/multimodal | Museum chat with an optional image (multipart) |
| POST | /v1/chat/general | General assistant SSE chat |
| POST | /v1/chat/general/multimodal | General chat with an optional image (multipart) |
| GET | /v1/conversations | List the user's conversations |
| GET | /v1/conversations/{id}/messages | Mobile message timeline (?raw=true for provider-native) |
| DELETE | /v1/conversations/{id} | Delete a conversation |
| POST | /v1/conversations/{id}/messages/{mid}/feedback | Like / dislike an assistant message |
| GET | /v1/service/tickets/{ticket_id}/user | Internal: resolve ticket → user + conversation (used by notification) |
| POST | /v1/service/chat/museum | Internal: notification-triggered museum chat |
| GET | /metrics | Prometheus (multiprocess; aggregated across uvicorn workers) |
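
The chat endpoints stream server-sent events. A minimal sketch of reading such a stream follows; it handles only the standard `data:` field and blank-line event boundaries from the SSE wire format. The payload the agent puts inside each event is not documented on this page, so the sketch returns it as an opaque string, and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and other fields are ignored here.


if __name__ == "__main__":
    stream = ["data: hello\n", "\n", ": keep-alive\n", "data: a\n", "data: b\n", "\n"]
    for payload in iter_sse_data(stream):
        print(repr(payload))
```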

Ticket endpoints removed (DAT-296)

/v1/tickets/register and /v1/tickets/current no longer exist. The first POST /v1/chat/museum (or /multimodal) with a ticket_id registers the ticket implicitly, and the conversation id is permanently equal to the ticket id. /v1/ops/complaint and /v1/ops/welcome are not agent endpoints — the agent posts to them on the notification service.
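
A minimal sketch of the first museum chat call, which now doubles as ticket registration. Only the path, the Bearer header, and the existence of a ticket_id come from this page; the JSON body shape, the `message` field name, the Accept header, and the helper name are assumptions. The request is built but not sent.

```python
import json
import urllib.request


def build_museum_chat_request(base_url: str, token: str, ticket_id: str,
                              message: str = "") -> urllib.request.Request:
    """Build (without sending) a POST /v1/chat/museum request.

    An empty message on the first call is what triggers the instant welcome.
    """
    # Body field names are assumed for illustration.
    body = json.dumps({"ticket_id": ticket_id, "message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/museum",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "text/event-stream",
        },
    )


if __name__ == "__main__":
    req = build_museum_chat_request("https://dataland.chat", "<token>", "T-123")
    print(req.get_method(), req.full_url)
```

Because the conversation id equals the ticket id, a client can reuse the same `ticket_id` value when it later calls /v1/conversations/{id}/messages.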

The agent's chat tools: get_visitor_vitals, get_room_info, get_scene_flow (DAT-261), search_knowledge, search_artwork_images. It speaks real room/section names — Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON) — never bare codes (DAT-281).

Museum API (dataland-museum, host 4144 → container 5001)

uvicorn run:app. Dashboard is gated by a shared password (MUSEUM_PASSWORD) + signed session cookie.

| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness (pings Redis) |
| GET | / | Dashboard (auth) |
| POST / GET | /api/auth/login (POST), /api/auth/logout (POST), /api/auth/session (GET) | Dashboard auth |
| GET | /api/active-tickets | Mirror of Visitors:ActiveTicketIDs |
| GET | /api/tickets/{ticket_id}/vitals | Live vitals for a ticket |
| GET | /api/tickets/{ticket_id}/session-status | Session state |
| GET | /api/visitors/{visitor_id}/vitals | Vitals by visitor |
| GET | /api/rooms/{room_code}/chapters, /api/chapters | Chapter catalog |
| GET | /api/bridge/metrics | Telemetry-bridge stats |
| GET | /api/aqi, /api/qsys, /api/audio, /api/onboarding/kiosks | RDC-sourced environment + Q-SYS/audio (plain-UTF-8 decode, DAT-293) |
| GET | /metrics | Prometheus |

RAG (dataland-rag, port 4143)

FastAPI dataland-rag v3.0.0. Write/search endpoints require X-API-Key (RAG_API_KEY). Reranker: jina-reranker-v2-base-multilingual (ONNX, the heavy tenant — 12 GB / 12 cores).

| Method | Path | Purpose |
|---|---|---|
| GET | /health, /health/full | Liveness; full dependency check |
| POST | /search | Hybrid dense+BM25 search + rerank (knowledge) |
| POST | /ingest/file | Ingest a document into knowledge |
| POST | /ingest/image | Ingest an image into images (Gemini captioning) |
| POST | /ingest/sync | Ingest scene JSON into scenes |
| POST | /ingest/by-project-slug/{slug} | Project-scoped ingest |
| POST | /images/search/text, /images/search/image | Image search by text / by image |
| GET | /images/{filepath:path}, /images/extracted/{filepath:path} | Serve images |
| POST | /admin/sparse-backfill | Ops-driven BM25 backfill |
| GET | /metrics | Prometheus |
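
A minimal sketch of an authenticated /search call from inside the docker network. The path, the X-API-Key header, and the container address come from this page; the request body (a single `query` field) and the helper name are assumptions, since the RAG request schema is not documented here. The request is built but not sent.

```python
import json
import urllib.request


def build_rag_search_request(api_key: str, query: str,
                             base_url: str = "http://dataland-rag:4143",
                             ) -> urllib.request.Request:
    """Build (without sending) a POST /search request carrying X-API-Key."""
    payload = json.dumps({"query": query}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        f"{base_url}/search",
        data=payload,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_rag_search_request("<RAG_API_KEY>", "Infinity Room")
    print(req.get_method(), req.full_url)
```

From a workstation, the same helper works through the SSH tunnel by passing `base_url="http://127.0.0.1:4143"`.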

GCS buckets: dataland-public/artworks (+ cobanov-public/chapters for chapter images), dataland-private (documents, museum/scenes).

Information WebUI (dataland-atlas, port 4152)

The Catalog Studio CMS. Auth-gated (INFORMATION_WEBUI_PASSWORD + session cookie); all /api/* routes sit behind require_auth. Persists to SQLite (/app/data) + GCS, and live-syncs to RAG (text → /ingest/file, images → /ingest/image; rag slugs museum-section-<slug> / museum-scene-<slug>, replace-by-slug with UUIDv5 ids).
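
Replace-by-slug works because a UUIDv5 is a deterministic function of a namespace and a name, so re-ingesting the same slug yields the same id. A minimal sketch follows; the namespace used here (`uuid.NAMESPACE_URL`) and the helper name are assumptions, as the WebUI's actual namespace is not documented on this page.

```python
import uuid

# Assumed namespace for illustration only.
RAG_NAMESPACE = uuid.NAMESPACE_URL


def rag_point_id(kind: str, slug: str) -> uuid.UUID:
    """Derive a stable id from a RAG slug such as museum-section-<slug>."""
    if kind not in ("section", "scene"):
        raise ValueError(f"unknown kind: {kind}")
    return uuid.uuid5(RAG_NAMESPACE, f"museum-{kind}-{slug}")


if __name__ == "__main__":
    first = rag_point_id("section", "infinity-room")
    second = rag_point_id("section", "infinity-room")
    print(first == second)  # same slug, same id, so an upsert replaces the old point
```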

| Method | Path (prefix /api) | Purpose |
|---|---|---|
| GET/POST/PUT/DELETE | /projects, /projects/{id} | Refik artwork catalog CRUD |
| GET | /projects/search, /catalog, /projects/{id}/catalog | Search + catalog export |
| POST/PUT/DELETE | /projects/{id}/images... | Artwork image upload + management |
| GET/PUT | /museum, /museum/catalog | Museum overview |
| GET/POST/PUT/DELETE | /museum/sections, /museum/sections/{identifier} | Sections CRUD (+ images, catalog) |
| GET/POST/PUT/DELETE | /museum/scenes, /museum/scenes/{identifier} | Scenes CRUD (+ images, catalog) |

Notification API (dataland-notification-api, port 8080)

Ops surface. Write/inspect endpoints require the ops bearer token; /health + /metrics are open (and readable by tailnet peers).

| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness (pings Redis) |
| GET | /metrics | Prometheus (incl. DLQ + telemetry stream XLEN gauges) |
| GET / POST | /rules (GET), /rules/reload (POST), /rules/validate (POST) | Inspect / hot-swap / dry-run the rules TOML |
| POST | /events | Process a raw telemetry event (debug) |
| GET | /state/{ticket_id} | Per-ticket notification state |
| POST | /v1/ops/complaint | Receive a silent visitor-complaint event from the agent (DAT-213) |
| POST | /v1/ops/welcome | Send the museum welcome push for a ticket (DAT-296) |

Telemetry rules: visit_started (welcome), heart_rate, skin_conductance, artwork_engagement, room_transition, session_flow, experience_tip, visit_ended. Content pushes are gated until Gallery A (welcome exempt); room-transition pushes bypass the 240s cooldown (DAT-287/289/290). Ops alerts fan out to Discord/Slack via a swappable OpsNotifier (OPS_NOTIFIER_PROVIDER, comma-separated multi-provider).
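
The gating described above can be sketched as a single decision function. This is one reading of the rules, not the worker's implementation: it assumes the welcome is never subject to the cooldown, that room-transition pushes are still held back until Gallery A, and that the cooldown is measured from the last push for the ticket. The function name and parameters are hypothetical.

```python
from typing import Optional

COOLDOWN_S = 240  # cooldown from this page; how it is tracked is assumed


def should_push(rule: str, now: float, last_push_at: Optional[float],
                reached_gallery_a: bool) -> bool:
    """Decide whether a content push for `rule` may be sent right now."""
    if rule == "visit_started":
        return True  # welcome is exempt from the Gallery A gate
    if not reached_gallery_a:
        return False  # all other content pushes wait for Gallery A
    if rule == "room_transition":
        return True  # bypasses the cooldown
    return last_push_at is None or now - last_push_at >= COOLDOWN_S


if __name__ == "__main__":
    print(should_push("heart_rate", now=100.0, last_push_at=0.0, reached_gallery_a=True))
    print(should_push("room_transition", now=100.0, last_push_at=0.0, reached_gallery_a=True))
```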

Auth (dataland-auth, port 9000)

auth_server.py (same image as the agent). The RS256 signing authority for mobile JWTs.

| Method | Path | Purpose |
|---|---|---|
| GET | /.well-known/jwks.json | Public JWKS the agent validates tokens against |
| POST | /api/auth/signup, /api/auth/login | Mobile account creation + login |
| GET | /api/auth/me | Verify token, return user |
| GET | / | Test/login HTML |

JWKS mirror (DAT-286)

The local dataland-auth mirrors the CMS signing key (kid dataland-rs256-1) via data/extra_jwks.json in the auth-data volume, removing the chat-auth single point of failure. The agent logs a WARN when a fallback JWKS provider is the sole validator. Re-run provisioning after a volume wipe or CMS key rotation.
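
A minimal sketch of why the mirror removes the single point of failure: a validator that searches several JWKS documents in order can still find the signing key when the primary source is empty or unreachable. The JWKS shape (`{"keys": [...]}` with a `kid` per key) is the standard JWK Set format; the helper name and the lookup order are assumptions, and no signature verification is performed here.

```python
from typing import Optional


def find_signing_key(kid: str, *jwks_documents: dict) -> Optional[dict]:
    """Return the first JWK whose `kid` matches, searching documents in order."""
    for document in jwks_documents:
        for key in document.get("keys", []):
            if key.get("kid") == kid:
                return key
    return None


if __name__ == "__main__":
    primary = {"keys": []}  # e.g. the primary provider returned nothing
    mirror = {"keys": [{"kid": "dataland-rs256-1", "kty": "RSA", "alg": "RS256"}]}
    print(find_signing_key("dataland-rs256-1", primary, mirror))
```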

A note on "loopback" ports

A port marked loopback is bound to 127.0.0.1:<port> on the host. Service-to-service traffic uses the docker network and is unaffected. Most data-plane and ops ports are additionally published on the tailnet IP (*_PUBLIC_BIND, default 100.124.170.43) for direct peer access. To reach a loopback-only port from your workstation, open an SSH tunnel:

ssh -L 4143:127.0.0.1:4143 \
    -L 4145:127.0.0.1:4145 \
    -L 4146:127.0.0.1:4146 \
    -L 8080:127.0.0.1:8080 \
    -L 9090:127.0.0.1:9090 \
    -L 3000:127.0.0.1:3000 \
    ege@100.124.170.43

The tunnel terminates on the tailnet IP (100.124.170.43), never the public Spectrum IP: the data plane is bound only to loopback + the *_PUBLIC_BIND tailnet address (DAT-73). Each -L forwards a loopback-only port: RAG (4143), Redis (4145), Qdrant HTTP (4146), notification-api (8080), Prometheus (9090), Grafana (3000). Cloudflare-fronted services (4141/4144/4152/4148) need no tunnel.

Cloudflare-fronted services (agent 4141, museum 4144, webui 4152, docs 4148) are reachable through their public *.dataland.chat hostname. Tailnet peers can also hit any *_PUBLIC_BIND port directly as spark:<port> without a tunnel.

Detail pages

See also: Architecture · Public ports · Deploy · Observability · Service hosting & relocation