Architecture
Dataland is the live AI-art museum by Refik Anadol Studio. The visitor-facing
experience — a personalized AI guide that knows where you are standing, what your
body is doing, and what artwork is in front of you — is backed by a small set of
Python microservices that run as a Docker Compose stack on a single host (the
Spark DGX VDS, ege@100.124.170.43, repos under /home/cobanov/DATALAND/).
The stack is organised around three concentric rings:
- Data plane: the stateful stores (Postgres, Redis, Qdrant) plus the external RDC Redis owned by Refik Anadol Studio.
- App services: the business logic (agent, auth, rag, museum-api, museum-simulator, notification-worker, notification-api, information-webui, docs).
- Observability: Prometheus, Grafana, Alertmanager, plus datastore/host/container exporters.
One Docker bridge network, dataland-network, connects everything. Cross-service
traffic uses container DNS names (http://dataland-rag:4143), never
host-published ports. Public ingress is a single Cloudflare Tunnel; operator
access is over a Tailscale tailnet.
Recent changes (2026-06-03 → 2026-06-04)
This page reflects a large change-set landed across all five service repos.
The headline items: model standardised on gemini-3.5-flash (DAT-269);
the museum chat init/welcome flow rewritten so an empty first message yields
an instant LLM-free welcome and the /register + /current endpoints were
removed (DAT-296); silent server-side complaint detection added (DAT-213);
a local JWKS mirror removed the chat-auth single point of failure
(DAT-286); and deploy.sh now fails fast on placeholder prod secrets, with the
same change-set adding tailnet publishing and docs.dataland.chat (DAT-291).
Service map
graph TB
subgraph ext["External — Refik Anadol Studio"]
RDC[("RDC Redis<br/>wearable + sensor truth<br/>msgpack + plain UTF-8")]
end
subgraph dnet["dataland-network (Docker bridge)"]
subgraph data["Data plane"]
PG[("postgres<br/>agent + auth DBs")]
RD[("dataland-redis<br/>museum:telemetry · ticket state · dedup")]
QD[("qdrant<br/>knowledge · images · scenes")]
end
subgraph app["App services"]
AU["auth :9000<br/>JWKS issuer + mirror"]
AG["agent :4141<br/>chat SSE + tools"]
RG["rag :4143<br/>hybrid retrieval"]
MU["museum-api :5001→4144<br/>RDC bridge + dashboard"]
SIM["museum-simulator<br/>(profile: simulator)"]
NW["notification-worker<br/>rules → push/ops"]
NA["notification-api :8080<br/>DLQ · state · ops"]
WU["information-webui :4152<br/>Catalog Studio CMS"]
DC["docs :80→4148"]
end
subgraph obs["Observability"]
PR["prometheus"]
GR["grafana"]
AM["alertmanager"]
PE["postgres-exporter"]
RE["redis-exporter"]
CA["cadvisor (host-metrics)"]
NE["node-exporter (host-metrics)"]
end
end
GCS[("GCS buckets<br/>dataland-public · dataland-private · cobanov-public")]
RDC -->|PSUBSCRIBE wearables/visitors| MU
MU -->|XADD| RD
RD -->|XREADGROUP| NW
NW -->|POST /v1/service/chat/museum| AG
NW -->|OneSignal push| OS["OneSignal"]
NW -->|ops alert| OPS["Discord / Slack"]
AG --> RG
AG --> MU
AG --> AU
AG --> PG
AG --> RD
AG -->|/v1/ops/complaint · /v1/ops/welcome| NA
WU --> RG
WU --> GCS
RG --> QD
RG --> GCS
CF["Cloudflare Tunnel (host systemd)"] -.->|public ingress| AG
CF -.-> MU
CF -.-> WU
CF -.-> DC
TS["Tailscale tailnet"] -.->|*_PUBLIC_BIND| data
TS -.-> NA
| Repo | Container(s) | Role |
|---|---|---|
| dataland-agent | agent (:4141), auth (:9000) | FastAPI + pydantic-ai chat agent (museum + general, SSE). Hosts auth_server.py (the dataland-auth JWKS service) and a static test chat client. |
| dataland-rag-v2 | rag (:4143) | Retrieval: ingest + hybrid search over Qdrant. Gemini captioning + embeddings. GCS-backed assets. |
| dataland-museum | museum-api (:5001→:4144), museum-simulator | Bridge to the external RDC Redis; publishes museum:telemetry, mirrors active tickets, serves the chapter catalog + ops dashboard. |
| dataland-notification | notification-worker, notification-api (:8080) | Telemetry rules engine → OneSignal pushes + Discord/Slack ops alerts. DLQ + replay. |
| dataland-atlas | information-webui (:4152) | The "Catalog Studio" CMS for curators. SQLite catalog + GCS uploads + RAG live-sync. |
| dataland-infrastructure | docs (:80→:4148), monitoring stack, compose | Compose definitions, deploy/smoke scripts, notification rules, docs site. |
Three things to remember about the network:
- Container DNS only inside the stack. Service-to-service calls use names like dataland-rag, dataland-museum, dataland-postgres; these resolve over dataland-network, not via the host. Host-published ports are for operators and Cloudflare ingress, not for internal callers.
- Stateful services and ops surfaces bind to 127.0.0.1 + a tailnet IP only. Postgres, Redis, Qdrant, notification-api, and docs publish to 127.0.0.1:<port> (local tooling / SSH tunnel) and to a *_PUBLIC_BIND (default: the tailnet interface 100.124.170.43) so tailnet peers reach them directly, never 0.0.0.0 (DAT-73).
- Cloudflare is the only public ingress. cloudflared runs as a host systemd service in token mode and routes the public *.dataland.chat hostnames (dataland.chat, data.dataland.chat, docs.dataland.chat) to the right local port. There is no nginx/Traefik in front of the stack.
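The second rule above (loopback or tailnet only, never 0.0.0.0) lends itself to a mechanical check. The following is a minimal sketch, not the stack's actual tooling: the function names are hypothetical, and it relies on the fact that Tailscale hands out node addresses from the CGNAT block 100.64.0.0/10.

```python
import ipaddress

# Tailscale assigns node addresses from the CGNAT block 100.64.0.0/10.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_allowed_bind(host_ip: str) -> bool:
    """Return True if a published port binds to loopback or a tailnet address."""
    addr = ipaddress.ip_address(host_ip)
    return addr.is_loopback or addr in TAILNET_RANGE


def check_bindings(bindings: dict[str, str]) -> list[str]:
    """Return the names of services whose bind address breaks the rule."""
    return [name for name, ip in bindings.items() if not is_allowed_bind(ip)]
```

Feeding it the bind addresses resolved from the compose file (for example the values of the *_PUBLIC_BIND variables) would flag any service that slipped onto 0.0.0.0.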
Core data flows
1. Visitor chat over SSE (incl. museum init/welcome)
The agent serves two chat modes over server-sent events: museum mode
(POST /v1/chat/museum[/multimodal], ticket-bound, location/vitals-aware) and
general mode (POST /v1/chat/general[/multimodal]). All chat endpoints sit
behind the /v1 prefix and require an RS256 JWT (Authorization: Bearer).
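To make the wire shape concrete, here is a hedged client-side sketch using only the standard library. The path, the Bearer header, and the ticket_id / message body fields come from this page; the helper names, the base URL passed in, and any SSE event names are assumptions, since the page does not document the event schema.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_museum_request(base_url: str, ticket_id: str, message: str, jwt: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/museum request."""
    body = json.dumps({"ticket_id": ticket_id, "message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/museum",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
            "Accept": "text/event-stream",
        },
    )


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from SSE lines; a blank line ends an event."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips exactly one leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
```

A caller would pass the request to urllib.request.urlopen and feed the decoded response lines to iter_sse_events; sending message="" exercises the init/welcome path described below.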
The first museum message is special. There is no separate registration call any
more — sending an empty first /museum message is the registration and
returns an instant, personalized welcome with no LLM round-trip.
sequenceDiagram
autonumber
participant App as Mobile app
participant Agent as agent
participant Auth as auth (JWKS)
participant PG as postgres
participant NA as notification-api
participant Museum as museum-api
participant RAG as rag
App->>Agent: POST /v1/chat/museum {ticket_id, message:""} (Bearer JWT)
Agent->>Auth: verify RS256 via JWKS (cached, multi-provider)
Agent->>PG: register_ticket(user, ticket_id)
Note over Agent,PG: conversation_id == ticket_id (DAT-296)
alt empty first message (init)
Agent-->>App: SSE static welcome — no LLM, no RAG (DAT-296)
Agent->>NA: POST /v1/ops/welcome (off-path welcome push)
Note right of NA: ticket-deduped vs RDC visit_started
else real message
Agent->>Agent: schedule_complaint_check() off-path (DAT-213)
Agent->>Museum: GET /api/tickets/{id}/vitals (get_visitor_vitals)
Agent->>Museum: GET /api/chapters (get_room_info / get_scene_flow)
Agent->>RAG: POST /search (search_knowledge / search_artwork_images)
Agent-->>App: SSE token stream
Agent-->>App: SSE follow-up suggestions (restored on reload, DAT-284)
end
Key facts about this flow:
- Conversation identity. For museum chat, conversation_id == ticket_id; the same ticket_id always resumes the same chat. register_ticket is idempotent and returns whether the ticket was newly created. The old /register and /current endpoints were removed (DAT-296).
- Instant welcome. An empty first message returns welcome_message(full_name), a fixed greeting kept in sync with the notification service's visit_started copy. For a new ticket the welcome is persisted; on re-init of an existing ticket it is streamed again without being persisted a second time. The welcome push is fired off-path to notification-api /v1/ops/welcome, which ticket-dedups it against the RDC-driven visit_started welcome (DAT-296).
- Anonymous-first. Identity is ticket_id ↔ external_id ↔ OneSignal; no registered/email account is required. full_name may be empty and the welcome degrades gracefully ("Welcome to Dataland!").
- Agent tools. The museum agent exposes get_visitor_vitals, get_room_info, get_scene_flow, search_knowledge, and search_artwork_images. Tools speak real room names, never bare codes (DAT-281): Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON), Lobby (LO).
- Silent complaint detection (DAT-213). Every real visitor message triggers an off-path server-side LLM judge (schedule_complaint_check). It never changes the reply or emits an SSE event; a detected complaint is POSTed to notification-api /v1/ops/complaint with the visitor's identity, per-session deduped, and audit-logged.
- Timeouts. The chat run is capped at agent_run_timeout_seconds = 60s; the suggestion call at 15s. The RAG /search client read timeout was raised 10s → 25s because museum-knowledge queries occasionally hit the 10s limit, and the resulting chain (3× retry, then a second search) pushed the run to the 60s wall-clock cap and an agent_timeout.
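The key facts above can be sketched as one request handler. This is an illustrative outline under stated assumptions, not the agent's code: an in-memory dict stands in for Postgres, the personalised greeting wording is invented (only the "Welcome to Dataland!" fallback is from this page), the returned dict shape is hypothetical, and judge / report are caller-supplied stand-ins for the LLM judge and the POST to /v1/ops/complaint.

```python
import asyncio
from typing import Awaitable, Callable

tickets: dict[str, str] = {}           # ticket_id -> user_id (stand-in for Postgres)
reported_sessions: set[str] = set()    # per-session complaint dedup
background: set[asyncio.Task] = set()  # hold references so tasks are not garbage-collected

Judge = Callable[[str], Awaitable[bool]]
Report = Callable[[str, str], Awaitable[None]]


def register_ticket(user_id: str, ticket_id: str) -> bool:
    """Idempotent: returns True only when the ticket is newly created."""
    if ticket_id in tickets:
        return False
    tickets[ticket_id] = user_id
    return True


def welcome_message(full_name: str) -> str:
    # Personalised wording is a placeholder; the fallback text is from the page.
    name = full_name.strip()
    return f"Welcome to Dataland, {name}!" if name else "Welcome to Dataland!"


async def complaint_check(ticket_id: str, message: str, judge: Judge, report: Report) -> None:
    """Run the judge and report at most one complaint per session."""
    if ticket_id in reported_sessions:
        return
    if await judge(message):
        reported_sessions.add(ticket_id)
        await report(ticket_id, message)


def schedule_complaint_check(ticket_id: str, message: str, judge: Judge, report: Report) -> asyncio.Task:
    """Fire-and-forget: the chat reply never waits on this task."""
    task = asyncio.create_task(complaint_check(ticket_id, message, judge, report))
    background.add(task)
    task.add_done_callback(background.discard)
    return task


async def handle_museum_message(user_id: str, ticket_id: str, full_name: str,
                                message: str, judge: Judge, report: Report) -> dict:
    created = register_ticket(user_id, ticket_id)
    if not message.strip():
        # Init path: static welcome, no LLM; persist only for a new ticket.
        return {"kind": "welcome", "text": welcome_message(full_name), "persist": created}
    schedule_complaint_check(ticket_id, message, judge, report)
    return {"kind": "llm", "conversation_id": ticket_id}
```

The dedup here is keyed on ticket_id because conversation_id == ticket_id; a production version would likely keep that state in Redis rather than in process memory.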
Recent changes — chat
DAT-296 (empty-first-message welcome + push, /register and /current
removed), DAT-213 (silent complaint detection, POST /v1/ops/complaint),
DAT-284 (follow-up suggestions on reload), DAT-281 (real room names),
DAT-279/280/285/261 (get_room_info image cap + dedup, empty-room fix,
vitals sanity bounds, get_scene_flow), DAT-269 (gemini-3.5-flash).
2. RDC telemetry → notification rules → push
museum-api reads exclusively from the external RDC Redis (Refik Anadol
data center: wearable/sensor source of truth). There is no host/port fallback —
the service refuses to start if RDC_REDIS_URL is empty. The RDC feed is mixed
encoding: most payloads are msgpack, but some channels (Q-SYS / audio) are
plain UTF-8, which the decoders handle (DAT-282).
sequenceDiagram
autonumber
participant Band as Empatica wearable
participant RDC as RDC Redis (external)
participant Museum as museum-api
participant Redis as dataland-redis
participant Worker as notification-worker
participant Agent as agent
participant OS as OneSignal
participant OPS as Discord / Slack
Band->>RDC: PUBLISH BioSensors / position / status
RDC-->>Museum: PSUBSCRIBE pmessage (msgpack / UTF-8)
Museum->>Museum: resolve serial→ticket→room→chapter, normalise
Museum->>Redis: XADD museum:telemetry (telemetry bridge)
Museum->>Redis: overwrite museum:active_ticket_ids @1Hz
Redis-->>Worker: XREADGROUP
Worker->>Worker: evaluate rules (gating + cooldown)
alt push channel
Worker->>OS: push (external_id, fallback user_id)
end
alt chat channel
Worker->>Agent: POST /v1/service/chat/museum
end
alt ops alert
Worker->>OPS: OpsNotifier fan-out
end
How the bridge resolves context, per BioSensors message:
- serial → ticket by walking Visitors:ActiveTicketIDs + per-ticket Visitors:*:EmpaticaDeviceID (cached, 60s TTL, cleared on an AssignDevice event).
- serial → room from Wearables:WatchDevices:<serial>:RoomCode.
- room → gallery → chapter via ROOM_TO_GALLERY + GALLERY_ART_IDS, reading Galleries:<gallery>:<art_id>:SceneControl:RTChapter/DDSChapter, then joining the chapters.json index for chapter_name/art_name.
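The serial → ticket step is the expensive one (a walk over every active ticket), which is why it is cached. A minimal sketch of that cache follows, assuming a caller-supplied lookup that performs the Redis reads; the class and function names are hypothetical, while the 60s TTL and the clear-on-AssignDevice behaviour are taken from the list above.

```python
import time
from typing import Callable, Iterable, Optional


def walk_active_tickets(serial: str, active_ticket_ids: Iterable[str],
                        device_id_of: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the ticket whose assigned device serial matches, if any."""
    for ticket_id in active_ticket_ids:
        if device_id_of(ticket_id) == serial:
            return ticket_id
    return None


class SerialTicketCache:
    """Cache serial -> ticket lookups with a TTL and an explicit reset."""

    def __init__(self, lookup: Callable[[str], Optional[str]], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, Optional[str]]] = {}

    def ticket_for(self, serial: str) -> Optional[str]:
        now = self._clock()
        hit = self._entries.get(serial)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        ticket = self._lookup(serial)
        self._entries[serial] = (now, ticket)
        return ticket

    def on_assign_device(self) -> None:
        """Drop every cached mapping when a device is (re)assigned."""
        self._entries.clear()
```

Injecting the clock keeps the TTL behaviour testable without sleeping; note that this sketch also caches misses for the full TTL, which the real bridge may or may not do.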
The normalised event (ticket_id, heart_rate, skin_conductance,
body_temperature, room_code, chapter_*, simulator_source: "rdc-bridge",
…) is XADD-ed onto dataland-redis::museum:telemetry. The
museum-simulator container (compose profile simulator) can publish synthetic
events to the same stream; both flow through the same rule engine, distinguished
by simulator_source.
The active-ticket mirror is a separate 1 Hz loop that fully overwrites the
dataland-redis SET museum:active_ticket_ids with RDC's
Visitors:ActiveTicketIDs. This key is a fixed cross-service contract — the
agent (session_state.py) reads the exact same key to decide whether a ticket
is live.
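A full overwrite is easy to get subtly wrong (a reader can observe an empty set between the delete and the add). One way to avoid that, sketched here under assumptions, is a MULTI/EXEC pipeline. The code assumes redis-py-style clients and that the RDC key is a Redis SET readable with SMEMBERS; neither detail is confirmed by this page, and the function name is hypothetical.

```python
RDC_ACTIVE_KEY = "Visitors:ActiveTicketIDs"
LOCAL_ACTIVE_KEY = "museum:active_ticket_ids"


def mirror_active_tickets(rdc, local) -> int:
    """Replace the local SET with RDC's active tickets in one transaction.

    `rdc` and `local` are assumed to behave like redis-py clients.
    """
    ticket_ids = rdc.smembers(RDC_ACTIVE_KEY)
    pipe = local.pipeline(transaction=True)  # DEL + SADD applied atomically
    pipe.delete(LOCAL_ACTIVE_KEY)
    if ticket_ids:  # SADD with no members is an error, so skip it when empty
        pipe.sadd(LOCAL_ACTIVE_KEY, *ticket_ids)
    pipe.execute()
    return len(ticket_ids)
```

The 1 Hz loop would then be a plain `while True:` that calls this function and sleeps for one second between iterations.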
The notification engine's rules (configured in
config/notification-rules.toml) are: visit_started (welcome), heart_rate,
heart_rate_drop, skin_conductance, spo2, temperature,
artwork_engagement, room_transition, session_flow, experience_tip,
visit_ended. The following gating logic landed in this change-set:
- Content gating (DAT-287/293). Content/condition notifications are gated until the visitor reaches an exhibit gallery (GALLERY_ROOM_CODES = {GA, GB, GC, GD}). Only rules marked pre_gallery_exempt (the welcome + checkout) fire before then.
- Cooldown (DAT-290). Telemetry rules share a per-ticket cooldown (telemetry_cooldown_seconds = 240). Rules marked ignore_cooldown (room-transition) bypass it so transitions always fire.
- Session-flow + checkout. Gallery-B session-flow pushes (DAT-289) and visit_ended / checkout pushes (DAT-282) were added.
- Resolver fallback (DAT-282). The OneSignal resolver falls back to user_id when external_id is absent.
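The content gate and the cooldown compose into a single decision per rule and event. The sketch below illustrates that composition; the flag names, the room codes, and the 240s value come from the list above, but the dataclasses, the should_fire function, and two behaviours are assumptions: that every rule without ignore_cooldown shares the cooldown, and that a bypassing rule does not reset it.

```python
from dataclasses import dataclass
from typing import Optional

GALLERY_ROOM_CODES = {"GA", "GB", "GC", "GD"}
TELEMETRY_COOLDOWN_SECONDS = 240


@dataclass
class Rule:
    name: str
    pre_gallery_exempt: bool = False
    ignore_cooldown: bool = False


@dataclass
class TicketState:
    reached_gallery: bool = False
    last_fired_at: Optional[float] = None


def should_fire(rule: Rule, state: TicketState, room_code: str, now: float) -> bool:
    """Apply content gating first, then the shared per-ticket cooldown."""
    if room_code in GALLERY_ROOM_CODES:
        state.reached_gallery = True
    if not state.reached_gallery and not rule.pre_gallery_exempt:
        return False  # gated until the visitor reaches an exhibit gallery
    if rule.ignore_cooldown:
        return True  # e.g. room transitions always fire
    if state.last_fired_at is not None and now - state.last_fired_at < TELEMETRY_COOLDOWN_SECONDS:
        return False
    state.last_fired_at = now
    return True
```

With one TicketState per ticket, a heart_rate rule is suppressed in the lobby, fires on arrival in GA, and then stays quiet for 240 seconds while room_transition keeps firing.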
Ops alerts go through a swappable OpsNotifier: OPS_NOTIFIER_PROVIDER
takes a comma-separated list for multi-provider fan-out (Discord + Slack). The
service also keeps a DLQ + replay path; notification-api exposes the DLQ /
state / rule surfaces plus /v1/ops/complaint and /v1/ops/welcome.
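As a rough illustration of the comma-separated fan-out, the sketch below parses an OPS_NOTIFIER_PROVIDER value and delivers to each provider independently, so one failing webhook does not block the other. The function names and the senders mapping are hypothetical; the real OpsNotifier interface is not shown on this page.

```python
from typing import Callable


def parse_providers(raw: str) -> list[str]:
    """Split a comma-separated provider list, dropping blanks and duplicates."""
    seen: set[str] = set()
    providers: list[str] = []
    for part in raw.split(","):
        name = part.strip().lower()
        if name and name not in seen:
            seen.add(name)
            providers.append(name)
    return providers


def fan_out(message: str, providers: list[str],
            senders: dict[str, Callable[[str], None]]) -> dict[str, bool]:
    """Send to every provider; record per-provider success instead of raising."""
    results: dict[str, bool] = {}
    for name in providers:
        try:
            senders[name](message)  # an unknown provider name also counts as a failure
            results[name] = True
        except Exception:
            results[name] = False
    return results
```

Failed deliveries surface in the returned mapping, which a caller could log or route to the DLQ.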
Telemetry bridge is the only path into dataland-redis
museum-api reads RDC, normalises, and writes museum:telemetry. If
MUSEUM_TELEMETRY_BRIDGE_ENABLED=false, the notification worker has nothing
to consume off live data. Never replay museum-simulation-playback Avro into
the live dataland-redis — use a dedicated container.
Recent changes — notifications
DAT-287/289/290/293/282 (gating, Gallery-B session-flow, room-transition
cooldown bypass, visit_ended/checkout push, RDC plain-UTF-8 decode, resolver
user_id fallback), DAT-213 (swappable Discord/Slack ops notifier with
comma-separated multi-provider), DAT-296 (/v1/ops/welcome ticket-dedup).
3. Curator content → RAG ingestion → Qdrant → agent retrieval
Curators use the Catalog Studio (information-webui, data.dataland.chat).
It is a non-dev CMS with two workspaces: Projects (the Refik artwork
catalog) and Museum (sections / scenes / overview). SQLite is the source of
truth; GCS holds the uploaded assets; RAG is kept in step by live-sync.
sequenceDiagram
autonumber
participant Curator
participant WebUI as information-webui
participant GCS
participant RAG as rag
participant QD as qdrant
participant Agent as agent
Curator->>WebUI: create/update/delete project or museum entity + upload images
WebUI->>WebUI: persist SQLite (source of truth)
WebUI->>GCS: PUT images (dataland-public/artworks, cobanov-public/chapters)
WebUI->>RAG: DELETE /ingest/by-project-slug/<slug> (replace-by-slug)
WebUI->>RAG: POST /ingest/file (rendered markdown → knowledge)
WebUI->>RAG: POST /ingest/image (raw bytes → images, Gemini caption)
RAG->>RAG: chunk + embed (gemini-embedding, 3072-dim) + BM25 sparse
RAG->>QD: upsert points (deterministic UUIDv5 ids)
Note over WebUI,QD: fire-and-forget; a RAG/GCS hiccup never blocks the save
Agent->>RAG: POST /search (hybrid dense + BM25 + rerank)
RAG->>QD: query knowledge / images / scenes
RAG-->>Agent: reranked passages + image hits
RAG slug + payload conventions (the contract between webui and rag):
- Project flow → markdown to the knowledge collection (webui-<slug>.md), images to the images collection. Replace semantics: DELETE /ingest/by-project-slug/<slug> first, then re-ingest. Deterministic UUIDv5 point ids give clean upserts with no duplicate accumulation.
- Museum flow uses namespaced slugs: museum (overview), museum-section-<slug>, museum-scene-<slug>; payload entity_type is museum / section / scene. Same replace-by-slug + UUIDv5 model.
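Deterministic ids are what make re-ingestion an upsert rather than an append: the same slug and chunk position always hash to the same point id. A minimal sketch of the idea follows. The namespace value and the "slug:chunk_index" naming scheme are assumptions for illustration; the page only states that ids are UUIDv5 and deterministic.

```python
import uuid

# Hypothetical namespace; the real service's namespace is not documented here.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/dataland-rag")


def museum_slug(entity_type: str, slug: str = "") -> str:
    """Build the namespaced slug used by the museum flow."""
    if entity_type == "museum":
        return "museum"
    if entity_type in ("section", "scene"):
        return f"museum-{entity_type}-{slug}"
    raise ValueError(f"unknown entity_type: {entity_type!r}")


def point_id(project_slug: str, chunk_index: int) -> str:
    """Same slug + chunk index always yields the same UUIDv5 point id."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{project_slug}:{chunk_index}"))
```

Because a shorter re-render produces fewer chunks, upserts alone would leave stale trailing points behind; that is one reason a delete-by-slug before re-ingest is still useful.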
RAG's retrieval side (/search) is a hybrid pipeline: dense vectors
(gemini-embedding, embedding_dim = 3072) + a BM25 sparse channel
(Qdrant/bm25), blended via RRF, then reranked with the
jinaai/jina-reranker-v2-base-multilingual cross-encoder (FastEmbed/ONNX). The
optional text-scroll channel is off by default (text_search_enabled = false,
DAT-167). Captioning for /ingest/image uses gemini-3.5-flash (DAT-269).
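For readers unfamiliar with RRF (reciprocal rank fusion): each candidate's fused score is the sum of 1 / (k + rank) over the rankings it appears in, so it rewards agreement between channels without comparing raw scores. The sketch below is a generic illustration, not the service's code; k = 60 is the conventional default, and this page does not say which k is used or whether fusion runs inside Qdrant or in the service.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists with reciprocal rank fusion (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]   # order from the dense-vector channel
sparse_hits = ["b", "d", "a"]  # order from the BM25 channel
fused = rrf_fuse([dense_hits, sparse_hits])
```

In this toy input "b" edges out "a" because it ranks high in both lists; the fused candidates would then go to the cross-encoder reranker.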
Collections currently hold roughly: knowledge ~4969 points (after the museum
20-sections + scenes + overview re-ingest, 4839 → 4969), images ~1485 points,
plus scenes. The default_reference_* placeholder images were purged
(DAT-288) from chapters.json and GCS cobanov-public/chapters (they were
never in Qdrant).
Recent changes — retrieval
DAT-269 (gemini-3.5-flash caption model; vectors still gemini-embedding),
DAT-288 (placeholder image purge), museum sections/scenes/overview
re-ingest (4839 → 4969 knowledge points), agent /search read timeout
10s → 25s.
Data stores
| Store | Container / location | Holds | Notes |
|---|---|---|---|
| dataland-redis | redis:7-alpine, dataland-redis | museum:telemetry stream, museum:active_ticket_ids SET, ticket/session state, dedup keys (welcome_sent:<ticket>, complaint dedup), HR history, rate-limit buckets | --requirepass (DAT-76), appendonly yes, maxmemory 512mb / noeviction (drain, don't drop). 1 GB / 0.5 core. |
| RDC Redis | external, Refik Anadol data center | Wearable BioSensors / BlueIoT position, Visitors:* control plane, Galleries:*:SceneControl:*, Visitors:ActiveTicketIDs | Read-only from Dataland. Mixed msgpack + plain-UTF-8 encoding. museum-api PSUBSCRIBEs; required at boot (no fallback). |
| Postgres | postgres:16-alpine, dataland-postgres | Agent DB (users, tickets, conversations, messages, runs) + auth DB | Single instance, both logical DBs. 1 GB / 1 core, tuned shared_buffers=256MB. |
| Qdrant | qdrant/qdrant, dataland-qdrant | Collections knowledge, images, scenes (3072-dim vectors + BM25 sparse) | No API key; the tailnet/127.0.0.1 binding is the trust boundary. 4 GB / 2 cores. |
Authentication
Chat endpoints require an RS256 JWT validated against JWKS. The
dataland-auth service (auth_server.py, container auth, :9000) issues and
serves keys at /.well-known/jwks.json. The agent's app/auth.py caches JWKS
via PyJWKClient (jwks_cache_ttl = 3600s), verifies RS256, requires exp,
and skips aud (mobile tokens carry a client-scoped audience the agent does not
constrain). The user identity comes from user_id (or sub); the local User
row is auto-created/updated on first sight.
flowchart LR
Token["Bearer RS256 JWT"] --> AgentAuth["agent app/auth.py"]
AgentAuth -->|"1. primary"| Local["local dataland-auth JWKS<br/>kid dataland-rs256-1"]
AgentAuth -->|"2. fallback (WARN)"| CMS["external CMS JWKS"]
Local -.->|"data/extra_jwks.json<br/>(public key only)"| AgentAuth
subgraph mirror["DAT-286 JWKS mirror"]
CMSkey["CMS signing key (public JWK)"] --> Extra["auth-data: data/extra_jwks.json"]
Extra --> Served["dataland-auth serves merged JWKS"]
end
JWKS mirror (DAT-286). The agent validates a token by trying each configured
JWKS endpoint in order (JWKS_URL then JWKS_URLS). Previously a token signed
by the CMS could only be validated by the external CMS JWKS endpoint — a
single point of failure for chat auth. Now the local dataland-auth mirrors
the CMS signing key (kid = dataland-rs256-1) by loading its public JWK into
data/extra_jwks.json on the persisted auth-data volume and serving a merged
JWKS (local signing key first, then extra keys, deduped by kid; the local key
never gets shadowed). Re-run the mirror provisioning after a volume wipe or a CMS
key rotation. The agent emits a WARN whenever a fallback JWKS provider is
the sole validator of a token, making the SPoF condition alertable.
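The two mechanisms in the paragraph above (a merged JWKS on the auth side, ordered providers with a warning on the agent side) can be outlined as follows. This is a sketch under assumptions: the function names are hypothetical, keys are plain JWK dicts, and each provider is represented by a caller-supplied verify callable that returns the claims or None, standing in for a real RS256 check against that provider's JWKS.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("auth-sketch")


def merge_jwks(local_keys: list[dict], extra_keys: list[dict]) -> dict:
    """Local keys first, then extras; a duplicate kid never shadows an earlier key."""
    merged: list[dict] = []
    seen: set[str] = set()
    for key in [*local_keys, *extra_keys]:
        kid = key.get("kid")
        if kid is not None and kid in seen:
            continue
        if kid is not None:
            seen.add(kid)
        merged.append(key)
    return {"keys": merged}


Verifier = Callable[[str], Optional[dict]]


def validate_in_order(token: str, providers: list[tuple[str, Verifier]]) -> Optional[dict]:
    """Try each JWKS provider in order; warn when only a fallback accepts the token."""
    for index, (name, verify) in enumerate(providers):
        claims = verify(token)
        if claims is not None:
            if index > 0:
                log.warning("token validated only by fallback JWKS provider %s", name)
            return claims
    return None
```

Counting those warnings (for example with a log-based alert) is what turns the fallback condition into something operators can page on.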
Other auth surfaces use a single shared password (no per-user accounts): the
museum dashboard (MUSEUM_PASSWORD), Catalog Studio
(INFORMATION_WEBUI_PASSWORD), and docs.dataland.chat (nginx basic auth,
DOCS_USERNAME / DOCS_PASSWORD). Service-to-service calls
(/v1/service/*, /v1/ops/*, RAG, notification ops) use bearer tokens:
AGENT_SERVICE_TOKEN, NOTIFICATION_OPS_TOKEN, RAG_API_KEY.
Recent changes — auth
DAT-286 (local JWKS mirrors the CMS signing key via data/extra_jwks.json,
agent WARNs when a fallback provider is the sole validator).
Deployment topology
The stack is one Docker Compose project (dataland-infrastructure/compose.yml)
on the Spark DGX VDS. Public traffic enters through a single Cloudflare Tunnel
(host systemd); operators reach internal surfaces over Tailscale.
graph LR
subgraph internet["Public internet"]
User["Visitor / curator browser"]
end
subgraph cf["Cloudflare"]
Tunnel["cloudflared tunnel (TLS)"]
end
subgraph host["Spark DGX VDS (100.124.170.43)"]
cfd["cloudflared (host systemd)"]
subgraph compose["docker compose: dataland-network"]
AG2["agent :4141 → dataland.chat"]
WU2["information-webui :4152 → data.dataland.chat"]
DC2["docs :4148 → docs.dataland.chat"]
MU2["museum-api :4144"]
rest["rag · auth · notification · stores · monitoring"]
end
end
subgraph tnet["Tailscale tailnet"]
Operator["operator peer"]
end
User --> Tunnel --> cfd
cfd --> AG2
cfd --> WU2
cfd --> DC2
Operator -.->|*_PUBLIC_BIND| rest
Operator -.-> MU2
| Concern | Owner |
|---|---|
| Service config | /home/cobanov/DATALAND/.env (host) |
| Secrets (GCP key) | /home/cobanov/DATALAND/secrets/gcp-key.json (mode 600) |
| Compose stack | dataland-infrastructure/compose.yml |
| Dev / simulator overlays | compose.dev.yml, compose.sim.yml; simulator via --profile simulator |
| Host metrics | --profile host-metrics (cadvisor + node-exporter; Linux only) |
| Deploy script | dataland-infrastructure/deploy.sh |
| Smoke runner | dataland-infrastructure/scripts/smoke.sh |
| Notification rules | dataland-infrastructure/config/notification-rules.toml |
| Public ingress | Cloudflare Tunnel (host systemd) |
| Operator ingress | Tailscale tailnet via *_PUBLIC_BIND |
Deploy guardrail (DAT-291). deploy.sh (set -euo pipefail) runs a
production secret check before rebuilding: if the prod .env still holds
placeholder/default secrets, it aborts with a non-zero exit and prints the
flagged keys, instead of deploying a stack in which the agent's boot guard
would crash-loop. The same change-set added tailnet *_PUBLIC_BIND publishing, the
docs.dataland.chat service, and prod secret rotation. On Linux deploys the
host-metrics profile is layered on so cadvisor + node-exporter come up;
macOS dev boxes leave the profile off (they can't expose host /proc + /sys).
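The real guardrail lives in deploy.sh; purely to illustrate the logic, here is the same check sketched in Python. The placeholder markers are invented for the example, and the set of keys to inspect is an assumption (the three service tokens named on this page are used in the usage note below).

```python
# Hypothetical markers; the real script's placeholder list is not documented here.
PLACEHOLDER_MARKERS = ("changeme", "change-me", "placeholder", "example", "your-")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")
    return env


def flagged_secrets(env: dict[str, str], secret_keys: list[str]) -> list[str]:
    """Return secret keys that are missing, empty, or still look like placeholders."""
    flagged = []
    for key in secret_keys:
        value = env.get(key, "").lower()
        if not value or any(marker in value for marker in PLACEHOLDER_MARKERS):
            flagged.append(key)
    return flagged
```

A deploy wrapper would call flagged_secrets(parse_env(env_text), ["AGENT_SERVICE_TOKEN", "NOTIFICATION_OPS_TOKEN", "RAG_API_KEY"]), print any flagged keys, and exit non-zero before any container is rebuilt.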
Recent changes — deployment
DAT-291 (deploy.sh fail-fast on placeholder prod secrets, tailnet
*_PUBLIC_BIND publishing, docs.dataland.chat, prod secret rotation).
Models
The stack standardised on gemini-3.5-flash for all generative work — the
chat agent (agent_model = google-gla:gemini-3.5-flash) and RAG's Gemini image
captioning — under DAT-269. RAG vectors use gemini-embedding (3072-dim). Older
gemini-2.5-flash / gemini-3.1-flash-lite references are no longer current.
Resource budget
The host is a single 20-core machine; compose.yml pins memory + cores per
service. The two heaviest tenants:
- rag: 12 GB / 12 cores. The jina cross-encoder reranker is multi-threaded ONNX inference; at 2 cores a single 20-candidate rerank took ~25s (CPU pegged), so it is allowed 12 cores while leaving 8 for the rest of the stack.
- qdrant: 4 GB / 2 cores. Three collections (knowledge, images, scenes) with 3072-dim vectors.
Everything else lives within 64 MB – 1 GB. See each service page for its exact budget: Agent, RAG, Museum, Notification, Catalog Studio.