Architecture

Dataland is the live AI-art museum by Refik Anadol Studio. The visitor-facing experience — a personalized AI guide that knows where you are standing, what your body is doing, and what artwork is in front of you — is backed by a small set of Python microservices that run as a Docker Compose stack on a single host (the Spark DGX VDS, ege@100.124.170.43, repos under /home/cobanov/DATALAND/).

The stack is organised around three concentric rings:

Data plane — the stateful stores (Postgres, Redis, Qdrant) plus the external RDC Redis owned by Refik Anadol Studio.
App services — the business logic (agent, auth, rag, museum-api, museum-simulator, notification-worker, notification-api, information-webui, docs).
Observability — Prometheus, Grafana, Alertmanager, plus datastore/host/container exporters.

One Docker bridge network, dataland-network, connects everything. Cross-service traffic uses container DNS names (http://dataland-rag:4143), never host-published ports. Public ingress is a single Cloudflare Tunnel; operator access is over a Tailscale tailnet.

Recent changes (2026-06-03 → 2026-06-04)

This page reflects a large change-set that landed across all five service repos. The headline items: the model was standardised on gemini-3.5-flash (DAT-269); the museum chat init/welcome flow was rewritten so that an empty first message yields an instant LLM-free welcome, and the /register + /current endpoints were removed (DAT-296); silent server-side complaint detection was added (DAT-213); a local JWKS mirror removed the chat-auth single point of failure (DAT-286); and deploy.sh now fails fast on placeholder prod secrets and publishes services on the tailnet, with docs.dataland.chat added in the same change (DAT-291).

Service map

graph TB
  subgraph ext["External — Refik Anadol Studio"]
    RDC[("RDC Redis<br/>wearable + sensor truth<br/>msgpack + plain UTF-8")]
  end

  subgraph dnet["dataland-network (Docker bridge)"]
    subgraph data["Data plane"]
      PG[("postgres<br/>agent + auth DBs")]
      RD[("dataland-redis<br/>museum:telemetry · ticket state · dedup")]
      QD[("qdrant<br/>knowledge · images · scenes")]
    end
    subgraph app["App services"]
      AU["auth :9000<br/>JWKS issuer + mirror"]
      AG["agent :4141<br/>chat SSE + tools"]
      RG["rag :4143<br/>hybrid retrieval"]
      MU["museum-api :5001→4144<br/>RDC bridge + dashboard"]
      SIM["museum-simulator<br/>(profile: simulator)"]
      NW["notification-worker<br/>rules → push/ops"]
      NA["notification-api :8080<br/>DLQ · state · ops"]
      WU["information-webui :4152<br/>Catalog Studio CMS"]
      DC["docs :80→4148"]
    end
    subgraph obs["Observability"]
      PR["prometheus"]
      GR["grafana"]
      AM["alertmanager"]
      PE["postgres-exporter"]
      RE["redis-exporter"]
      CA["cadvisor (host-metrics)"]
      NE["node-exporter (host-metrics)"]
    end
  end

  GCS[("GCS buckets<br/>dataland-public · dataland-private · cobanov-public")]

  RDC -->|PSUBSCRIBE wearables/visitors| MU
  MU -->|XADD| RD
  RD -->|XREADGROUP| NW
  NW -->|POST /v1/service/chat/museum| AG
  NW -->|OneSignal push| OS["OneSignal"]
  NW -->|ops alert| OPS["Discord / Slack"]

  AG --> RG
  AG --> MU
  AG --> AU
  AG --> PG
  AG --> RD
  AG -->|/v1/ops/complaint · /v1/ops/welcome| NA

  WU --> RG
  WU --> GCS
  RG --> QD
  RG --> GCS

  CF["Cloudflare Tunnel (host systemd)"] -.->|public ingress| AG
  CF -.-> MU
  CF -.-> WU
  CF -.-> DC
  TS["Tailscale tailnet"] -.->|*_PUBLIC_BIND| data
  TS -.-> NA

| Repo | Container(s) | Role |
| --- | --- | --- |
| `dataland-agent` | `agent` (`:4141`), `auth` (`:9000`) | FastAPI + pydantic-ai chat agent (museum + general, SSE). Hosts `auth_server.py` (the `dataland-auth` JWKS service) and a static test chat client. |
| `dataland-rag-v2` | `rag` (`:4143`) | Retrieval: ingest + hybrid search over Qdrant. Gemini captioning + embeddings. GCS-backed assets. |
| `dataland-museum` | `museum-api` (`:5001`→`:4144`), `museum-simulator` | Bridge to the external RDC Redis; publishes `museum:telemetry`, mirrors active tickets, serves the chapter catalog + ops dashboard. |
| `dataland-notification` | `notification-worker`, `notification-api` (`:8080`) | Telemetry rules engine → OneSignal pushes + Discord/Slack ops alerts. DLQ + replay. |
| `dataland-atlas` | `information-webui` (`:4152`) | The "Catalog Studio" CMS for curators. SQLite catalog + GCS uploads + RAG live-sync. |
| `dataland-infrastructure` | `docs` (`:80`→`:4148`), monitoring stack, compose | Compose definitions, deploy/smoke scripts, notification rules, docs site. |

Three things to remember about the network:

Container DNS only inside the stack. Service-to-service calls use names like dataland-rag, dataland-museum, dataland-postgres — these resolve over dataland-network, not via the host. Host-published ports are for operators and Cloudflare ingress, not for internal callers.
Stateful services and ops surfaces bind to 127.0.0.1 + a tailnet IP only. Postgres, Redis, Qdrant, notification-api, and docs publish to 127.0.0.1:<port> (local tooling / SSH tunnel) and to a *_PUBLIC_BIND address (default: the tailnet interface, 100.124.170.43) so that tailnet peers can reach them directly. They never bind to 0.0.0.0 (DAT-73).
Cloudflare is the only public ingress. cloudflared runs as a host systemd service in token mode and routes the public *.dataland.chat hostnames (dataland.chat, data.dataland.chat, docs.dataland.chat) to the right local port. There is no nginx/Traefik in front of the stack.
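
The binding rule above can be expressed as a small check. This is a minimal sketch and not code from the repos: `is_allowed_bind` and the `TAILNET_IP` constant are hypothetical names, and the address is simply the tailnet IP quoted on this page.

```python
import ipaddress

# Assumption: the tailnet interface address quoted on this page.
TAILNET_IP = "100.124.170.43"


def is_allowed_bind(host: str, tailnet_ip: str = TAILNET_IP) -> bool:
    """Accept only loopback or the tailnet address; reject wildcard binds."""
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:  # 0.0.0.0 or ::
        return False
    return addr.is_loopback or addr == ipaddress.ip_address(tailnet_ip)
```

A check like this could run against each `*_PUBLIC_BIND` value before the stack is brought up, so a stray wildcard bind is caught early.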

Core data flows

1. Visitor chat over SSE (incl. museum init/welcome)

The agent serves two chat modes over server-sent events: museum mode (POST /v1/chat/museum[/multimodal], ticket-bound, location/vitals-aware) and general mode (POST /v1/chat/general[/multimodal]). All chat endpoints sit behind the /v1 prefix and require an RS256 JWT (Authorization: Bearer).

The first museum message is special. There is no separate registration call any more — sending an empty first /museum message is the registration and returns an instant, personalized welcome with no LLM round-trip.
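
The init branch can be sketched as a pure decision function. This is an illustration under stated assumptions: `welcome_message` is named on this page, but its body here, the `InitPlan` type, `plan_first_message`, the personalised wording, and treating whitespace-only input as empty are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_WELCOME = "Welcome to Dataland!"


def welcome_message(full_name: Optional[str]) -> str:
    """Fixed greeting; falls back to the generic copy when no name is known."""
    name = (full_name or "").strip()
    if not name:
        return DEFAULT_WELCOME
    # Assumption: illustrative wording, not the production copy.
    return f"Welcome to Dataland, {name}!"


@dataclass(frozen=True)
class InitPlan:
    use_llm: bool
    welcome: Optional[str]
    persist_welcome: bool


def plan_first_message(
    message: Optional[str], full_name: Optional[str], ticket_is_new: bool
) -> InitPlan:
    """Empty message -> static welcome, no LLM; otherwise a normal chat run."""
    if (message or "").strip():
        return InitPlan(use_llm=True, welcome=None, persist_welcome=False)
    # Persist only for a newly created ticket so a re-init does not duplicate it.
    return InitPlan(
        use_llm=False,
        welcome=welcome_message(full_name),
        persist_welcome=ticket_is_new,
    )
```

Keeping the decision separate from the SSE and database code makes the "no LLM on init" guarantee easy to unit-test.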

sequenceDiagram
  autonumber
  participant App as Mobile app
  participant Agent as agent
  participant Auth as auth (JWKS)
  participant PG as postgres
  participant NA as notification-api
  participant Museum as museum-api
  participant RAG as rag

  App->>Agent: POST /v1/chat/museum {ticket_id, message:""} (Bearer JWT)
  Agent->>Auth: verify RS256 via JWKS (cached, multi-provider)
  Agent->>PG: register_ticket(user, ticket_id)
  Note over Agent,PG: conversation_id == ticket_id (DAT-296)
  alt empty first message (init)
    Agent-->>App: SSE static welcome — no LLM, no RAG (DAT-296)
    Agent->>NA: POST /v1/ops/welcome (off-path welcome push)
    Note right of NA: ticket-deduped vs RDC visit_started
  else real message
    Agent->>Agent: schedule_complaint_check() off-path (DAT-213)
    Agent->>Museum: GET /api/tickets/{id}/vitals (get_visitor_vitals)
    Agent->>Museum: GET /api/chapters (get_room_info / get_scene_flow)
    Agent->>RAG: POST /search (search_knowledge / search_artwork_images)
    Agent-->>App: SSE token stream
    Agent-->>App: SSE follow-up suggestions (restored on reload, DAT-284)
  end

Key facts about this flow:

Conversation identity. For museum chat, conversation_id == ticket_id; the same ticket_id always resumes the same chat. register_ticket is idempotent and returns whether the ticket was newly created. The old /register and /current endpoints were removed (DAT-296).
Instant welcome. An empty first message returns welcome_message(full_name) — a fixed greeting kept in sync with the notification service's visit_started copy. New ticket → the welcome is persisted; re-init on an existing ticket → it is streamed without duplicating. The welcome push is fired off-path to notification-api /v1/ops/welcome, which ticket-dedups it against the RDC-driven visit_started welcome (DAT-296).
Anonymous-first. Identity is ticket_id ↔ external_id ↔ OneSignal; no registered/email account is required. full_name may be empty and the welcome degrades gracefully ("Welcome to Dataland!").
Agent tools. The museum agent exposes get_visitor_vitals, get_room_info, get_scene_flow, search_knowledge, and search_artwork_images. Tools speak real room names, never bare codes (DAT-281): Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON), Lobby (LO).
Silent complaint detection (DAT-213). Every real visitor message triggers an off-path server-side LLM judge (schedule_complaint_check). It never changes the reply or emits an SSE event; a detected complaint is POSTed to notification-api /v1/ops/complaint with the visitor's identity, per-session deduped, and audit-logged.
Timeouts. The chat run is capped at agent_run_timeout_seconds = 60s; the suggestion call at 15s. The RAG /search client read timeout was raised from 10s to 25s: museum-knowledge queries occasionally hit the 10s limit, and the resulting chain (3× retry, then a second search) used up the 60s wall-clock budget and ended in agent_timeout.
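
The off-path behaviour of the complaint check can be sketched with a background asyncio task. Only the name `schedule_complaint_check` comes from this page; the signature, the `judge` and `report` callables, and the error handling are assumptions, and the per-session dedup and audit log are left out for brevity.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Set

logger = logging.getLogger(__name__)
_background: Set["asyncio.Task[None]"] = set()


def schedule_complaint_check(
    message: str,
    judge: Callable[[str], Awaitable[bool]],      # hypothetical LLM judge
    report: Callable[[str], Awaitable[None]],     # hypothetical POST to the ops endpoint
) -> "asyncio.Task[None]":
    """Start the judge in the background so the reply path never waits on it."""

    async def _run() -> None:
        try:
            if await judge(message):
                await report(message)
        except Exception:
            # Failures stay off-path: log them, never surface them to the visitor.
            logger.exception("complaint check failed")

    task = asyncio.create_task(_run())
    _background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(_background.discard)
    return task
```

The caller does not await the returned task, so the SSE stream starts regardless of how long the judge takes.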

Recent changes — chat

DAT-296 (empty-first-message welcome + push, /register and /current removed), DAT-213 (silent complaint detection, POST /v1/ops/complaint), DAT-284 (follow-up suggestions on reload), DAT-281 (real room names), DAT-279/280/285/261 (get_room_info image cap + dedup, empty-room fix, vitals sanity bounds, get_scene_flow), DAT-269 (gemini-3.5-flash).

2. RDC telemetry → notification rules → push

museum-api reads exclusively from the external RDC Redis (Refik Anadol data center: wearable/sensor source of truth). There is no host/port fallback — the service refuses to start if RDC_REDIS_URL is empty. The RDC feed is mixed encoding: most payloads are msgpack, but some channels (Q-SYS / audio) are plain UTF-8, which the decoders handle (DAT-282).

sequenceDiagram
  autonumber
  participant Band as Empatica wearable
  participant RDC as RDC Redis (external)
  participant Museum as museum-api
  participant Redis as dataland-redis
  participant Worker as notification-worker
  participant Agent as agent
  participant OS as OneSignal
  participant OPS as Discord / Slack

  Band->>RDC: PUBLISH BioSensors / position / status
  RDC-->>Museum: PSUBSCRIBE pmessage (msgpack / UTF-8)
  Museum->>Museum: resolve serial→ticket→room→chapter, normalise
  Museum->>Redis: XADD museum:telemetry (telemetry bridge)
  Museum->>Redis: overwrite museum:active_ticket_ids @1Hz
  Redis-->>Worker: XREADGROUP
  Worker->>Worker: evaluate rules (gating + cooldown)
  alt push channel
    Worker->>OS: push (external_id, fallback user_id)
  end
  alt chat channel
    Worker->>Agent: POST /v1/service/chat/museum
  end
  alt ops alert
    Worker->>OPS: OpsNotifier fan-out
  end

How the bridge resolves context, per BioSensors message:

serial → ticket by walking Visitors:ActiveTicketIDs + per-ticket Visitors:*:EmpaticaDeviceID (cached, 60s TTL, cleared on an AssignDevice event).
serial → room from Wearables:WatchDevices:<serial>:RoomCode.
room → gallery → chapter via ROOM_TO_GALLERY + GALLERY_ART_IDS, reading Galleries:<gallery>:<art_id>:SceneControl:RTChapter/DDSChapter, then joining the chapters.json index for chapter_name / art_name.
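
The serial → ticket cache in the first step (60s TTL, cleared on AssignDevice) can be sketched as follows. The class name `SerialTicketCache` and its methods are hypothetical; the real bridge may structure this differently.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SerialTicketCache:
    """Maps a wearable serial to a ticket id, with a fixed time-to-live."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, serial: str) -> Optional[str]:
        entry = self._entries.get(serial)
        if entry is None:
            return None
        ticket_id, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[serial]  # expired: force a fresh lookup
            return None
        return ticket_id

    def put(self, serial: str, ticket_id: str) -> None:
        self._entries[serial] = (ticket_id, self._clock())

    def clear(self) -> None:
        """Drop every mapping, e.g. when an AssignDevice event arrives."""
        self._entries.clear()
```

The injectable `clock` is there so expiry can be tested without sleeping.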

The normalised event (ticket_id, heart_rate, skin_conductance, body_temperature, room_code, chapter_*, simulator_source: "rdc-bridge", …) is XADD-ed onto dataland-redis::museum:telemetry. The museum-simulator container (compose profile simulator) can publish synthetic events to the same stream; both flow through the same rule engine, distinguished by simulator_source.

The active-ticket mirror is a separate 1 Hz loop that fully overwrites the dataland-redis SET museum:active_ticket_ids with RDC's Visitors:ActiveTicketIDs. This key is a fixed cross-service contract — the agent (session_state.py) reads the exact same key to decide whether a ticket is live.

The notification engine's rules (configured in config/notification-rules.toml) are: visit_started (welcome), heart_rate, heart_rate_drop, skin_conductance, spo2, temperature, artwork_engagement, room_transition, session_flow, experience_tip, visit_ended. The following gating logic landed in this change-set:

Content gating (DAT-287/293). Content/condition notifications are gated until the visitor reaches an exhibit gallery (GALLERY_ROOM_CODES = {GA, GB, GC, GD}). Only rules marked pre_gallery_exempt (the welcome + checkout) fire before then.
Cooldown (DAT-290). Telemetry rules share a per-ticket cooldown (telemetry_cooldown_seconds = 240). Rules marked ignore_cooldown (room-transition) bypass it so transitions always fire.
Session-flow + checkout. Gallery-B session-flow pushes (DAT-289) and visit_ended / checkout pushes (DAT-282) were added.
Resolver fallback (DAT-282). The OneSignal resolver falls back to user_id when external_id is absent.
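
The content gate and the shared cooldown combine roughly as sketched below. The constants and flag names come from this page; the `Rule` and `TicketState` types, the `should_fire` function, and the assumption that "reached a gallery" is sticky for the rest of the visit are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

GALLERY_ROOM_CODES = frozenset({"GA", "GB", "GC", "GD"})
TELEMETRY_COOLDOWN_SECONDS = 240


@dataclass(frozen=True)
class Rule:
    name: str
    pre_gallery_exempt: bool = False
    ignore_cooldown: bool = False


@dataclass
class TicketState:
    reached_gallery: bool = False
    last_fired_at: Optional[float] = None  # the caller records this after a send


def should_fire(rule: Rule, state: TicketState, room_code: str, now: float) -> bool:
    """Apply the gallery gate first, then the per-ticket cooldown."""
    if room_code in GALLERY_ROOM_CODES:
        state.reached_gallery = True  # assumption: the gate stays open afterwards
    if not state.reached_gallery and not rule.pre_gallery_exempt:
        return False
    if rule.ignore_cooldown or state.last_fired_at is None:
        return True
    return now - state.last_fired_at >= TELEMETRY_COOLDOWN_SECONDS
```

Whether exempt rules such as the welcome also update the shared cooldown timestamp is not stated above, so the sketch leaves that to the caller.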

Ops alerts go through a swappable OpsNotifier: OPS_NOTIFIER_PROVIDER takes a comma-separated list for multi-provider fan-out (Discord + Slack). The service also keeps a DLQ + replay path; notification-api exposes the DLQ / state / rule surfaces plus /v1/ops/complaint and /v1/ops/welcome.
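
A comma-separated provider list with fan-out might look like the sketch below. `parse_providers`, `fan_out`, and the `senders` mapping are hypothetical names, and isolating one provider's failure from the others is an assumption about the intended behaviour.

```python
from typing import Callable, Dict, List


def parse_providers(raw: str) -> List[str]:
    """Split an OPS_NOTIFIER_PROVIDER-style value into unique provider names."""
    names: List[str] = []
    for part in raw.split(","):
        name = part.strip().lower()
        if name and name not in names:
            names.append(name)
    return names


def fan_out(
    message: str,
    providers: List[str],
    senders: Dict[str, Callable[[str], None]],
) -> List[str]:
    """Send to every configured provider; return the ones that succeeded."""
    delivered: List[str] = []
    for name in providers:
        sender = senders.get(name)
        if sender is None:
            continue  # unknown provider name: skip it
        try:
            sender(message)
            delivered.append(name)
        except Exception:
            continue  # assumption: one failing provider must not block the rest
    return delivered
```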

The telemetry bridge is the only live-data path into dataland-redis

museum-api reads RDC, normalises, and writes museum:telemetry. If MUSEUM_TELEMETRY_BRIDGE_ENABLED=false, the notification worker has nothing to consume off live data. Never replay museum-simulation-playback Avro into the live dataland-redis — use a dedicated container.

Recent changes — notifications

DAT-287/289/290/293/282 (gating, Gallery-B session-flow, room-transition cooldown bypass, visit_ended/checkout push, RDC plain-UTF-8 decode, resolver user_id fallback), DAT-213 (swappable Discord/Slack ops notifier with comma-separated multi-provider), DAT-296 (/v1/ops/welcome ticket-dedup).

3. Curator content → RAG ingestion → Qdrant → agent retrieval

Curators use the Catalog Studio (information-webui, data.dataland.chat). It is a CMS for non-developers with two workspaces: Projects (the Refik artwork catalog) and Museum (sections / scenes / overview). SQLite is the source of truth; GCS holds the uploaded assets; RAG is kept in step by live-sync.

sequenceDiagram
  autonumber
  participant Curator
  participant WebUI as information-webui
  participant GCS
  participant RAG as rag
  participant QD as qdrant
  participant Agent as agent

  Curator->>WebUI: create/update/delete project or museum entity + upload images
  WebUI->>WebUI: persist SQLite (source of truth)
  WebUI->>GCS: PUT images (dataland-public/artworks, cobanov-public/chapters)
  WebUI->>RAG: DELETE /ingest/by-project-slug/<slug>  (replace-by-slug)
  WebUI->>RAG: POST /ingest/file  (rendered markdown → knowledge)
  WebUI->>RAG: POST /ingest/image (raw bytes → images, Gemini caption)
  RAG->>RAG: chunk + embed (gemini-embedding, 3072-dim) + BM25 sparse
  RAG->>QD: upsert points (deterministic UUIDv5 ids)
  Note over WebUI,QD: fire-and-forget; a RAG/GCS hiccup never blocks the save
  Agent->>RAG: POST /search (hybrid dense + BM25 + rerank)
  RAG->>QD: query knowledge / images / scenes
  RAG-->>Agent: reranked passages + image hits

RAG slug + payload conventions (the contract between webui and rag):

Project flow → markdown to the knowledge collection (webui-<slug>.md), images to the images collection. Replace semantics: DELETE /ingest/by-project-slug/<slug> first, then re-ingest. Deterministic UUIDv5 point ids give clean upserts with no duplicate accumulation.
Museum flow uses namespaced slugs: museum (overview), museum-section-<slug>, museum-scene-<slug>; payload entity_type is museum / section / scene. Same replace-by-slug + UUIDv5 model.
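
The slug and point-id conventions above can be illustrated with the standard `uuid` module. This is a sketch: `POINT_NAMESPACE`, `museum_slug`, `point_id`, and the `slug:chunk_index` naming scheme are assumptions, since the namespace and id input used by the real service are not given here.

```python
import uuid

# Assumption: any fixed namespace works; the service's actual namespace is not shown.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/dataland-rag")


def museum_slug(entity_type: str, slug: str = "") -> str:
    """Build the namespaced slug for a museum entity."""
    if entity_type == "museum":
        return "museum"
    if entity_type in ("section", "scene"):
        return f"museum-{entity_type}-{slug}"
    raise ValueError(f"unknown entity_type: {entity_type!r}")


def point_id(project_slug: str, chunk_index: int) -> str:
    """Same slug + chunk index always yields the same UUIDv5, so upserts overwrite."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{project_slug}:{chunk_index}"))
```

Because the id is a pure function of its inputs, re-ingesting unchanged content maps onto the same points instead of accumulating duplicates.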

RAG's retrieval side (/search) is a hybrid pipeline: dense vectors (gemini-embedding, embedding_dim = 3072) + a BM25 sparse channel (Qdrant/bm25), blended via RRF, then reranked with the jinaai/jina-reranker-v2-base-multilingual cross-encoder (FastEmbed/ONNX). The optional text-scroll channel is off by default (text_search_enabled = false, DAT-167). Captioning for /ingest/image uses gemini-3.5-flash (DAT-269).
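
Reciprocal rank fusion (RRF) scores each document by summing 1 / (k + rank) over the channels that returned it. The sketch below shows the technique in isolation; `rrf_fuse` is a hypothetical helper, and k = 60 is a commonly used constant rather than a value confirmed for this stack.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Blend several ranked id lists into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]    # ids ranked by the dense channel
sparse = ["b", "d", "a"]   # ids ranked by the BM25 channel
fused = rrf_fuse([dense, sparse])
```

RRF only uses ranks, so the dense and BM25 scores never need to be normalised against each other before the reranker sees the candidates.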

Collections currently hold roughly: knowledge ~4969 points (4839 → 4969 after re-ingesting the museum's 20 sections, the scenes, and the overview), images ~1485 points, plus scenes. The default_reference_* placeholder images were purged (DAT-288) from chapters.json and GCS cobanov-public/chapters (they were never in Qdrant).

Recent changes — retrieval

DAT-269 (gemini-3.5-flash caption model; vectors still gemini-embedding), DAT-288 (placeholder image purge), museum sections/scenes/overview re-ingest (4839 → 4969 knowledge points), agent /search read timeout 10s → 25s.

Data stores

| Store | Container / location | Holds | Notes |
| --- | --- | --- | --- |
| dataland-redis | `redis:7-alpine`, `dataland-redis` | `museum:telemetry` stream, `museum:active_ticket_ids` SET, ticket/session state, dedup keys (`welcome_sent:<ticket>`, complaint dedup), HR history, rate-limit buckets | `--requirepass` (DAT-76), `appendonly yes`, `maxmemory 512mb` / `noeviction` (drain, don't drop). 1 GB / 0.5 core. |
| RDC Redis | external, Refik Anadol data center | Wearable BioSensors / BlueIoT position, `Visitors:` control plane, `Galleries:<gallery>:<art_id>:SceneControl:*`, `Visitors:ActiveTicketIDs` | Read-only from Dataland. Mixed msgpack + plain-UTF-8 encoding. `museum-api` PSUBSCRIBEs; required at boot (no fallback). |
| Postgres | `postgres:16-alpine`, `dataland-postgres` | Agent DB (users, tickets, conversations, messages, runs) + auth DB | Single instance, both logical DBs. 1 GB / 1 core, tuned `shared_buffers=256MB`. |
| Qdrant | `qdrant/qdrant`, `dataland-qdrant` | Collections `knowledge`, `images`, `scenes` (3072-dim vectors + BM25 sparse) | No API key; the tailnet/`127.0.0.1` binding is the trust boundary. 4 GB / 2 cores. |

Authentication

Chat endpoints require an RS256 JWT validated against JWKS. The dataland-auth service (auth_server.py, container auth, :9000) issues and serves keys at /.well-known/jwks.json. The agent's app/auth.py caches JWKS via PyJWKClient (jwks_cache_ttl = 3600s), verifies RS256, requires exp, and skips aud (mobile tokens carry a client-scoped audience the agent does not constrain). The user identity comes from user_id (or sub); the local User row is auto-created/updated on first sight.

flowchart LR
  Token["Bearer RS256 JWT"] --> AgentAuth["agent app/auth.py"]
  AgentAuth -->|"1. primary"| Local["local dataland-auth JWKS<br/>kid dataland-rs256-1"]
  AgentAuth -->|"2. fallback (WARN)"| CMS["external CMS JWKS"]
  Local -.->|"data/extra_jwks.json<br/>(public key only)"| AgentAuth
  subgraph mirror["DAT-286 JWKS mirror"]
    CMSkey["CMS signing key (public JWK)"] --> Extra["auth-data: data/extra_jwks.json"]
    Extra --> Served["dataland-auth serves merged JWKS"]
  end

JWKS mirror (DAT-286). The agent validates a token by trying each configured JWKS endpoint in order (JWKS_URL then JWKS_URLS). Previously a token signed by the CMS could only be validated by the external CMS JWKS endpoint — a single point of failure for chat auth. Now the local dataland-auth mirrors the CMS signing key (kid = dataland-rs256-1) by loading its public JWK into data/extra_jwks.json on the persisted auth-data volume and serving a merged JWKS (local signing key first, then extra keys, deduped by kid; the local key never gets shadowed). Re-run the mirror provisioning after a volume wipe or a CMS key rotation. The agent emits a WARN whenever a fallback JWKS provider is the sole validator of a token, making the SPoF condition alertable.
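
The merge order described above (local signing key first, then extra keys, deduped by kid) can be sketched as a pure function. `merge_jwks` is a hypothetical name and the key dicts in the usage note are placeholders, not real key material.

```python
from typing import Any, Dict, Iterable, List

Jwk = Dict[str, Any]


def merge_jwks(local_keys: Iterable[Jwk], extra_keys: Iterable[Jwk]) -> Dict[str, List[Jwk]]:
    """Return a JWKS document; on a kid clash the earlier (local) key wins."""
    merged: List[Jwk] = []
    seen = set()
    for key in list(local_keys) + list(extra_keys):
        kid = key.get("kid")
        if kid is not None and kid in seen:
            continue  # a later key never shadows one already served
        if kid is not None:
            seen.add(kid)
        merged.append(key)
    return {"keys": merged}
```

For example, merging a local key with kid `local-1` and extras with kids `local-1` and `cms-1` serves the local `local-1` plus `cms-1`; the clashing extra key is dropped.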

Other auth surfaces use a single shared password (no per-user accounts): the museum dashboard (MUSEUM_PASSWORD), Catalog Studio (INFORMATION_WEBUI_PASSWORD), and docs.dataland.chat (nginx basic auth, DOCS_USERNAME / DOCS_PASSWORD). Service-to-service calls (/v1/service/*, /v1/ops/*, RAG, notification ops) use bearer tokens: AGENT_SERVICE_TOKEN, NOTIFICATION_OPS_TOKEN, RAG_API_KEY.

Recent changes — auth

DAT-286 (local JWKS mirrors the CMS signing key via data/extra_jwks.json, agent WARNs when a fallback provider is the sole validator).

Deployment topology

The stack is one Docker Compose project (dataland-infrastructure/compose.yml) on the Spark DGX VDS. Public traffic enters through a single Cloudflare Tunnel (host systemd); operators reach internal surfaces over Tailscale.

graph LR
  subgraph internet["Public internet"]
    User["Visitor / curator browser"]
  end
  subgraph cf["Cloudflare"]
    Tunnel["cloudflared tunnel (TLS)"]
  end
  subgraph host["Spark DGX VDS (100.124.170.43)"]
    cfd["cloudflared (host systemd)"]
    subgraph compose["docker compose: dataland-network"]
      AG2["agent :4141 → dataland.chat"]
      WU2["information-webui :4152 → data.dataland.chat"]
      DC2["docs :4148 → docs.dataland.chat"]
      MU2["museum-api :4144"]
      rest["rag · auth · notification · stores · monitoring"]
    end
  end
  subgraph tnet["Tailscale tailnet"]
    Operator["operator peer"]
  end

  User --> Tunnel --> cfd
  cfd --> AG2
  cfd --> WU2
  cfd --> DC2
  Operator -.->|*_PUBLIC_BIND| rest
  Operator -.-> MU2

| Concern | Owner |
| --- | --- |
| Service config | `/home/cobanov/DATALAND/.env` (host) |
| Secrets (GCP key) | `/home/cobanov/DATALAND/secrets/gcp-key.json` (mode 600) |
| Compose stack | `dataland-infrastructure/compose.yml` |
| Dev / simulator overlays | `compose.dev.yml`, `compose.sim.yml`; simulator via `--profile simulator` |
| Host metrics | `--profile host-metrics` (cadvisor + node-exporter; Linux only) |
| Deploy script | `dataland-infrastructure/deploy.sh` |
| Smoke runner | `dataland-infrastructure/scripts/smoke.sh` |
| Notification rules | `dataland-infrastructure/config/notification-rules.toml` |
| Public ingress | Cloudflare Tunnel (host systemd) |
| Operator ingress | Tailscale tailnet via `*_PUBLIC_BIND` |

Deploy guardrail (DAT-291). deploy.sh (set -euo pipefail) runs a production secret check before rebuilding: if the prod .env still holds placeholder/default secrets, it prints the flagged keys and aborts with a non-zero exit, rather than letting the agent's boot guard turn the deploy into a crash loop. The same change-set added tailnet *_PUBLIC_BIND publishing, the docs.dataland.chat service, and prod secret rotation. On Linux deploys the host-metrics profile is layered on so cadvisor + node-exporter come up; macOS dev boxes leave the profile off (they can't expose host /proc + /sys).
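
The idea behind the secret check can be illustrated in Python, although the real guard lives in the deploy.sh shell script and is not reproduced here. Everything in this sketch is an assumption: the `PLACEHOLDER_MARKERS` list, the helper names, and the choice to flag empty values as well.

```python
from typing import Dict, Iterable, List

# Assumption: illustrative markers only; the real list is defined in deploy.sh.
PLACEHOLDER_MARKERS = ("changeme", "change-me", "placeholder", "your-")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def flagged_secrets(env: Dict[str, str], secret_keys: Iterable[str]) -> List[str]:
    """Return the secret keys that are missing, empty, or look like placeholders."""
    flagged: List[str] = []
    for key in secret_keys:
        value = env.get(key, "")
        if not value or any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
            flagged.append(key)
    return flagged
```

A deploy wrapper would print the returned keys and exit non-zero when the list is non-empty, before any container is rebuilt.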

Recent changes — deployment

DAT-291 (deploy.sh fail-fast on placeholder prod secrets, tailnet *_PUBLIC_BIND publishing, docs.dataland.chat, prod secret rotation).

Models

The stack standardised on gemini-3.5-flash for all generative work — the chat agent (agent_model = google-gla:gemini-3.5-flash) and RAG's Gemini image captioning — under DAT-269. RAG vectors use gemini-embedding (3072-dim). Older gemini-2.5-flash / gemini-3.1-flash-lite references are no longer current.

Resource budget

The host is a single 20-core machine; compose.yml pins memory + cores per service. The two heaviest tenants:

rag — 12 GB / 12 cores. The jina cross-encoder reranker is multi-threaded ONNX inference; at 2 cores a single 20-candidate rerank took ~25s (CPU pegged), so it is allowed 12 cores while leaving 8 for the rest of the stack.
qdrant — 4 GB / 2 cores. Three collections (knowledge, images, scenes) with 3072-dim vectors.

Everything else lives within 64 MB – 1 GB. See each service page for its exact budget: Agent, RAG, Museum, Notification, Catalog Studio.