Dataland Architecture

Welcome. This is the operator + contributor map of the Dataland backend stack: the set of microservices that powers the museum guide app, the live visitor experience, and the curator tooling behind Refik Anadol Studio's Dataland museum.

If you have just landed on the project, read Architecture for the network topology and data flows, then the Services reference for the per-service detail. If you came here because something broke, jump straight to Deploy and Observability.

Recent changes (2026-06-03 → 2026-06-04)

A large change-set landed across every repo. See the What changed this cycle section below for the full DAT-referenced summary. Highlights: the whole stack standardized on gemini-3.5-flash (DAT-269), the first museum message now produces an instant zero-LLM welcome (DAT-296), silent visitor-complaint detection went live (DAT-213), the local auth service now mirrors the CMS signing key to kill the chat-auth single point of failure (DAT-286), and deploy.sh now fails fast on placeholder prod secrets (DAT-291).

What Dataland is

Dataland is a physical AI-art museum. Visitors wear biosensor bands; sensor data flows from the RDC (Refik Anadol data center) into this stack, where it drives a live AI guide and a context-aware notification engine. The backend has four jobs:

Talk to visitors. An AI guide (agent) answers questions over SSE, grounded in live room/vitals telemetry and a retrieval corpus of artwork and museum knowledge.
Ingest the live experience. The museum bridge reads the external RDC redis (the wearable/sensor source of truth) and normalizes it into an internal telemetry stream.
React to the visit. The notification engine consumes that telemetry and fires OneSignal pushes (welcome, room-transition tips, session-flow nudges, checkout) plus internal ops alerts.
Keep the corpus current. Curators use the Catalog Studio (information-webui) to manage the artwork + museum catalog, which live-syncs into the RAG vector store.

What runs where

The whole stack runs on a single Linux host: the Spark DGX VDS (tailnet IP 100.124.170.43, repos under /home/cobanov/DATALAND/), under Docker Compose. The dataland-infrastructure repo is the orchestrator. It owns compose.yml, the deploy script, the monitoring stack, and this docs site itself.

The unified compose stack brings up three layers (data plane, app services, observability) plus the docs site:

| Layer | Services |
| --- | --- |
| Data plane | `postgres`, `redis`, `qdrant` |
| App services | `auth`, `rag`, `museum-api`, `agent`, `notification-worker`, `notification-api`, `information-webui` |
| Docs | `docs` (this site) |
| Observability | `prometheus`, `grafana`, `alertmanager`, `postgres-exporter`, `redis-exporter`, `cadvisor`, `node-exporter` |

A dev-only overlay (compose.dev.yml) and the simulator profile add museum-simulator for synthetic telemetry. The simulator is not part of production. In production, museum-api itself bridges from the external RDC redis.

Naming

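The compose project name (the name: key) is dataland, so containers are dataland-agent, dataland-rag, dataland-postgres, and so on. Two containers do not follow their service names: museum-api runs as dataland-museum and information-webui as dataland-atlas. Internal callers use those container DNS names over the Docker network (for example, http://dataland-rag:4143), never host-published ports. See Architecture.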
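As a minimal sketch of that naming rule, the helper below builds the URL an in-network caller would use. The container names and internal ports are copied from the service map on this page; the helper name `internal_url` and the plain `http` scheme for every service are assumptions made for illustration.

```python
# Internal (container-side) ports, copied from the service map on this page.
INTERNAL_PORTS = {
    "dataland-agent": 4141,
    "dataland-museum": 5001,
    "dataland-rag": 4143,
    "dataland-atlas": 4152,
    "dataland-notification-api": 8080,
    "dataland-auth": 9000,
}


def internal_url(container: str, path: str = "") -> str:
    """Build the URL an in-network caller would use for `container`.

    Hypothetical helper: it only encodes the rule "container DNS name plus
    internal port, never a host-published port".
    """
    port = INTERNAL_PORTS[container]  # raises KeyError for unknown containers
    return f"http://{container}:{port}{path}"


print(internal_url("dataland-rag", "/search"))  # http://dataland-rag:4143/search
print(internal_url("dataland-museum"))          # http://dataland-museum:5001
```

Note that museum-api is the one app service whose internal port (5001) differs from its published port (4144), so a caller that copies the public port would miss it.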

Service map at a glance

| Service | Container | Role | Public port | Internal port |
| --- | --- | --- | --- | --- |
| Agent | `dataland-agent` | AI guide. Museum + general chat over SSE, conversation history, service + ops endpoints | `4141` (`0.0.0.0`) | `4141` |
| Museum API | `dataland-museum` | RDC redis bridge, telemetry normalizer, chapters catalog + dashboard | `4144` (`0.0.0.0`) | `5001` |
| RAG | `dataland-rag` | Retrieval. Qdrant search + ingest, Gemini captioning + embeddings | `4143` (loopback / tailnet) | `4143` |
| Information WebUI | `dataland-atlas` | Catalog Studio CMS. Projects + Museum workspaces, RAG live-sync | `4152` (`0.0.0.0`) | `4152` |
| Notification worker | `dataland-notification-worker` | Telemetry rules engine → OneSignal pushes + ops alerts | n/a (no HTTP) | n/a |
| Notification API | `dataland-notification-api` | DLQ + replay + state inspector ops surface | `8080` (loopback / tailnet) | `8080` |
| Auth | `dataland-auth` | JWKS provider for RS256 JWT chat auth | `9000` (`0.0.0.0`) | `9000` |
| Redis | `dataland-redis` | `museum:telemetry` stream, ticket state, dedup keys (requirepass) | `4145` (loopback / tailnet) | `6379` |
| Postgres | `dataland-postgres` | Agent + auth databases | `5432` (loopback / tailnet) | `5432` |
| Qdrant | `dataland-qdrant` | Vector store: `knowledge`, `images`, `scenes` | `4146`/`4147` (loopback / tailnet) | `6333`/`6334` |
| Docs | `dataland-docs` | This site (MkDocs Material + nginx basic auth) | `4148` (loopback / tailnet) | `80` |

For the authoritative port + bind matrix, see Public ports.

The 30-second mental model

```mermaid
flowchart LR
  subgraph external["External"]
    Mobile["Mobile app<br/>(visitor)"]
    RDC[("RDC Redis<br/>wearable/sensor<br/>source of truth")]
    OneSignal["OneSignal"]
    GCS[("GCS<br/>public + private")]
    Gemini["Gemini API<br/>(gemini-3.5-flash<br/>+ gemini-embedding)"]
    Curator["Curator"]
  end

  subgraph dataland["Dataland host (Spark, 100.124.170.43)"]
    Agent["agent :4141"]
    Museum["museum-api :5001"]
    RAG["rag :4143"]
    WebUI["information-webui :4152"]
    NotifW["notification-worker"]
    NotifA["notification-api :8080"]
    Auth["auth :9000"]
    Redis[("redis :6379")]
    PG[("postgres :5432")]
    QD[("qdrant :6333")]
  end

  Mobile -->|SSE chat| Agent
  Agent -->|JWT verify| Auth
  Agent --> RAG
  Agent --> Museum
  Agent --> PG
  Agent --> Redis
  Agent -->|complaint detect| NotifA

  Museum -->|PSUBSCRIBE| RDC
  Museum -->|XADD museum:telemetry| Redis

  NotifW -->|XREADGROUP| Redis
  NotifW -->|service call| Agent
  NotifW --> OneSignal

  RAG --> QD
  RAG --> GCS
  RAG --> Gemini

  Curator --> WebUI
  WebUI --> RAG
  WebUI --> GCS
```

Two flows do most of the work:

Telemetry → reaction. museum-api PSUBSCRIBEs the external RDC redis, normalizes each event, and XADDs it to the internal museum:telemetry stream on dataland-redis. The notification worker XREADGROUPs that stream, evaluates rules, and fans out OneSignal pushes (and ops alerts). See the telemetry sequence.
Chat. The mobile app POSTs to the agent with a Bearer JWT; the agent verifies it against the auth JWKS, calls its tools (get_visitor_vitals, get_room_info, get_scene_flow, search_knowledge, search_artwork_images) against museum-api and RAG, and streams tokens back over SSE. See the chat sequence.
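The normalization step in the telemetry flow can be sketched roughly as below. The payload shapes (a JSON object or plain UTF-8 text), the output field names, and the example channel name `rdc:vitals` are assumptions for illustration, not the bridge's actual schema.

```python
import json
import time
from typing import Dict, Optional


def normalize_event(channel: str, payload: bytes,
                    now: Optional[float] = None) -> Dict[str, str]:
    """Flatten one raw pub/sub message into string fields for a stream entry.

    Assumed shapes (not confirmed by this page): the payload is either a JSON
    object or plain UTF-8 text, and the output field names are illustrative.
    """
    text = payload.decode("utf-8", errors="replace")
    try:
        body = json.loads(text)
    except json.JSONDecodeError:
        body = None
    if not isinstance(body, dict):
        # Plain-text (or non-object JSON) payloads are kept verbatim.
        body = {"value": text}

    stamp = time.time() if now is None else now
    fields = {"channel": channel, "received_at": f"{stamp:.3f}"}
    for key, value in body.items():
        # Stream fields are flat strings, so non-string values are re-serialized.
        fields[f"data.{key}"] = value if isinstance(value, str) else json.dumps(value)
    return fields


# "rdc:vitals" is a made-up channel name used only for this example.
print(normalize_event("rdc:vitals", b'{"ticket_id": "T1", "hr": 72}', now=0.0))
# {'channel': 'rdc:vitals', 'received_at': '0.000', 'data.ticket_id': 'T1', 'data.hr': '72'}
```

With redis-py, a dict like this could be passed directly as the `fields` argument of `xadd("museum:telemetry", fields)`; the sketch stops short of that call so it runs without a Redis server.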

Edge + access

```mermaid
flowchart LR
  Internet(("Internet")) --> CF["Cloudflare Tunnel<br/>(host systemd, token mode)"]
  Tailnet(("Tailscale tailnet")) -.->|*_PUBLIC_BIND| Host
  CF --> Host["Spark host<br/>published ports"]
  Host --> Agent["agent → dataland.chat"]
  Host --> Museum["museum-api → museum.dataland.chat"]
  Host --> WebUI["information-webui → data.dataland.chat"]
  Host --> Docs["docs → docs.dataland.chat"]
```

Cloudflare is the only public ingress. cloudflared runs as a host systemd service and routes the public *.dataland.chat hostnames to the right local port. There is no nginx/Traefik in front of the stack. TLS terminates at Cloudflare.
Tailnet is the second access path. Stateful + internal services publish on 127.0.0.1 and on a *_PUBLIC_BIND host IP (default 100.124.170.43, the tailnet interface), so tailnet peers reach them directly without an SSH tunnel. These services never bind 0.0.0.0 (DAT-73). See Public ports.
This docs site is private. Cloudflare ingresses docs.dataland.chat to the docs container; nginx inside enforces a single shared-password gate (HTTP basic auth, DOCS_USERNAME / DOCS_PASSWORD from the host .env). Same posture as the museum and Catalog Studio dashboards. Ask an admin for credentials, or hit http://127.0.0.1:4148 on the host (still password-gated). The docs service itself shipped this cycle (DAT-291).
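The bind rule for stateful + internal services lends itself to a quick check. The sketch below flags any service that publishes on an address other than loopback or the tailnet IP; the function names and the input shape (service name mapped to a list of bind IPs) are assumptions, and the `postgres` entry in the example is deliberately wrong to show a hit.

```python
import ipaddress
from typing import Dict, List

TAILNET_IP = "100.124.170.43"  # default *_PUBLIC_BIND value from this page


def bind_is_private(bind_ip: str, tailnet_ip: str = TAILNET_IP) -> bool:
    """True when a published port is reachable only via loopback or the tailnet."""
    addr = ipaddress.ip_address(bind_ip)
    return addr.is_loopback or addr == ipaddress.ip_address(tailnet_ip)


def exposed_services(published: Dict[str, List[str]]) -> List[str]:
    """Return services with at least one bind outside loopback/tailnet."""
    return sorted(
        name for name, binds in published.items()
        if not all(bind_is_private(ip) for ip in binds)
    )


# Illustrative input; the postgres entry is intentionally misconfigured.
published = {
    "redis": ["127.0.0.1", "100.124.170.43"],
    "postgres": ["127.0.0.1", "0.0.0.0"],
}
print(exposed_services(published))  # ['postgres']
```

Such a check would only apply to the loopback / tailnet rows of the service map; agent, museum-api, information-webui, and auth are listed there as publishing on 0.0.0.0.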

How to navigate these docs

| You want to… | Go to |
| --- | --- |
| Understand the topology + data flows | Architecture |
| Look up what a service does, its ports, env, and auth | Services → the per-service page |
| Find which port / Cloudflare hostname maps where | Public ports |
| Deploy, rebuild, or roll back | Deploy |
| Debug something slow, or read metrics/alerts | Observability |
| Re-home a service to another host | Service hosting & relocation |
| Plan the Spark ↔ GCP move | Migration plan |

Quick service jumps: Agent · Museum · RAG · Information WebUI · Notification · Auth · Redis · Postgres · Qdrant.

Models

The stack is standardized on Google Gemini:

| Use | Model | Where |
| --- | --- | --- |
| Chat (museum + general) | `gemini-3.5-flash` | agent (`agent_model = google-gla:gemini-3.5-flash`) |
| Image captioning at ingest | `gemini-3.5-flash` | rag (`gemini_model`) |
| RAG vector embeddings | `gemini-embedding-2` | rag (`embedding_model`) |

Single model, single source of truth

Everything chat- and caption-facing runs gemini-3.5-flash after DAT-269. Code, .env.example (GEMINI_MODEL=gemini-3.5-flash), and the live .env are aligned. Older gemini-2.5-flash / gemini-3.1-flash-lite references are deprecated and should not appear in docs or config.
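One way to keep stale model names out of docs and config is a small text scan. The sketch below is an illustration, not an existing check in the repos; the function name and the sample text are assumptions, while the deprecated names come from the note above.

```python
import re
from typing import List, Tuple

# Deprecated names listed on this page.
DEPRECATED_MODELS = ("gemini-2.5-flash", "gemini-3.1-flash-lite")


def find_deprecated_models(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, model_name) for each deprecated model in `text`."""
    pattern = re.compile("|".join(re.escape(m) for m in DEPRECATED_MODELS))
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# OLD_MODEL is a made-up key used only to trigger a hit.
sample = "GEMINI_MODEL=gemini-3.5-flash\nOLD_MODEL=gemini-2.5-flash\n"
print(find_deprecated_models(sample))  # [(2, 'gemini-2.5-flash')]
```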

What changed this cycle (2026-06-03/04)

This was a heavy cycle touching the agent, RAG, museum, notification, auth, and infra repos. Grouped by area, with Linear DAT references.

Models + chat experience

DAT-269 — Standardized on gemini-3.5-flash across agent, RAG, and infra. Code, .env.example, and the live .env aligned.
DAT-296 — Instant personalized welcome. An empty first /v1/chat/museum message now returns a personalized welcome without an LLM call and triggers a welcome push, ticket-deduped against the RDC visit_started event. The /register and /current endpoints were removed: the first /museum message lazily registers the ticket, and conversation_id == ticket_id.
DAT-284 — Follow-up suggestions restored. Post-stream follow-up suggestions now reappear when a conversation is reloaded.
DAT-281 — Real room names. The agent speaks the public gallery names, never bare codes: Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), and the Discovery Portal entrance (ON).
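The room-name rule from DAT-281 amounts to a lookup table. The mapping below is taken from the list above; the helper name and the generic fallback phrase for unknown codes are assumptions, chosen so that a bare code is never echoed back.

```python
# Room codes and public names as listed for DAT-281.
ROOM_NAMES = {
    "GA": "Data Pavilion",
    "GB": "Latent Gallery",
    "GC": "Infinity Room",
    "GD": "The Sanctuary",
    "ON": "Discovery Portal",
}


def public_room_name(code: str, default: str = "the museum") -> str:
    """Map a room code to its public name; unknown codes get a generic phrase."""
    return ROOM_NAMES.get(code.strip().upper(), default)


print(public_room_name("gb"))  # Latent Gallery
print(public_room_name("ZZ"))  # the museum
```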

Agent tools + reliability

DAT-279 / 280 / 285 / 261 — get_room_info image cap + dedup, an empty-room "flail" fix, physiological sanity bounds on vitals, and the new get_scene_flow tool.
RAG client timeout — the agent's RAG /search read timeout was raised 10s → 25s. Museum-knowledge queries were tripping a 10s timeout → 3x retry → agent_timeout.
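The effect of the timeout change can be worked through with a toy model. It assumes "3x retry" means three attempts in total, ignores backoff and connect time, and uses a made-up 15-second query; none of these figures are measurements from the stack.

```python
from typing import Tuple


def simulate_search(query_s: float, read_timeout_s: float,
                    attempts: int = 3) -> Tuple[bool, float]:
    """Return (succeeded, seconds_waited) for a query that always takes `query_s`.

    Toy model: every attempt takes the same time, and an attempt fails only
    when the query outlasts the read timeout.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if query_s <= read_timeout_s:
        return True, query_s
    # Every attempt runs into the timeout, so the caller waits for all of them.
    return False, read_timeout_s * attempts


print(simulate_search(15, 10))  # (False, 30)
print(simulate_search(15, 25))  # (True, 15)
```

Under these assumptions the old setting burned 30 seconds and still failed, while the new one answers in 15. The trade-off is a longer worst case (75 seconds in this model) when the query exceeds 25 seconds.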

Notifications

DAT-213 — Silent visitor-complaint detection. A server-side LLM judge runs off the response path, invisible to the visitor, and raises an ops ticket on detected dissatisfaction via POST /v1/ops/complaint. The ops notifier is swappable (Discord/Slack) with comma-separated multi-provider fan-out (OPS_NOTIFIER_PROVIDER); the ticket carries visitor identity, with per-session dedup + an audit log.
DAT-287 / 289 / 290 / 293 / 282 — notification gating (no content pushes until a visitor reaches Gallery A; the welcome push is exempt), Gallery-B session-flow pushes, room-transition pushes that bypass the 240s cooldown, a visit_ended / checkout push, RDC plain-UTF-8 decode for Q-SYS/audio fields, and a resolver that falls back to user_id when external_id is absent.
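The gating and cooldown rules above can be read as a single decision function. The sketch below is one possible reading: the push kind names, the `VisitState` fields, and the choice to keep room-transition pushes behind the Gallery A gate are all assumptions; only the 240s cooldown, the welcome exemption, and the room-transition cooldown bypass come from this page.

```python
from dataclasses import dataclass
from typing import Optional

COOLDOWN_S = 240  # cooldown from this page


@dataclass
class VisitState:
    reached_gallery_a: bool = False
    last_push_at: Optional[float] = None  # seconds, None if nothing sent yet


def should_send(kind: str, state: VisitState, now: float) -> bool:
    """Decide whether a push of `kind` may go out at time `now`."""
    # Gating: no content pushes before Gallery A; the welcome push is exempt.
    if kind != "welcome" and not state.reached_gallery_a:
        return False
    # Room-transition pushes bypass the cooldown.
    if kind == "room_transition":
        return True
    if state.last_push_at is None:
        return True
    return now - state.last_push_at >= COOLDOWN_S


in_gallery = VisitState(reached_gallery_a=True, last_push_at=100.0)
print(should_send("welcome", VisitState(), now=0.0))         # True
print(should_send("session_flow", VisitState(), now=0.0))    # False
print(should_send("session_flow", in_gallery, now=200.0))    # False
print(should_send("room_transition", in_gallery, now=200.0)) # True
```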

Content + retrieval

Museum corpus re-ingest — the 20 museum sections + scenes + overview were (re-)ingested into the Qdrant knowledge collection (point count 4839 → 4969).
DAT-288 — Placeholder images purged. default_reference_* placeholder images were removed from chapters.json and the GCS cobanov-public/chapters bucket. (These were never in Qdrant.)

Security + ops

DAT-286 — Auth single-point-of-failure removed. The local dataland-auth service now mirrors the CMS signing key (kid dataland-rs256-1) via data/extra_jwks.json in the auth-data volume, so chat auth no longer depends on a single remote JWKS. The agent WARNs when a fallback JWKS provider is the sole validator. Re-run after a volume wipe or CMS key rotation.
DAT-291 — Safer deploys + tailnet publishing. deploy.sh now runs the agent's real boot guard (assert_boot_required_env) against the new .env before rebuilding, so placeholder/default prod secrets abort the deploy instead of crash-looping chat. This cycle also added tailnet *_PUBLIC_BIND publishing and the docs.dataland.chat service, plus a prod secret rotation.
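To illustrate the key mirroring in DAT-286, the sketch below merges a local JWKS with an extra one and looks a key up by kid. It assumes both documents use the standard `{"keys": [...]}` JWKS layout and that the auth service serves the union; the function names and the `local-1` kid are hypothetical, and the key material is omitted.

```python
from typing import Optional


def merge_jwks(primary: dict, extra: dict) -> dict:
    """Merge two JWKS documents, keeping the first key seen for each kid."""
    merged = {}
    for jwks in (primary, extra):
        for key in jwks.get("keys", []):
            if "kid" in key:  # keys without a kid cannot be selected, so skip them
                merged.setdefault(key["kid"], key)
    return {"keys": list(merged.values())}


def find_key(jwks: dict, kid: str) -> Optional[dict]:
    """Return the key with the given kid, or None when it is absent."""
    return next((k for k in jwks.get("keys", []) if k.get("kid") == kid), None)


local = {"keys": [{"kid": "local-1", "kty": "RSA"}]}            # hypothetical local key
extra = {"keys": [{"kid": "dataland-rs256-1", "kty": "RSA"}]}   # mirrored CMS key
merged = merge_jwks(local, extra)
print([k["kid"] for k in merged["keys"]])  # ['local-1', 'dataland-rs256-1']
print(find_key(merged, "dataland-rs256-1") is not None)  # True
```

A missing kid after a volume wipe or key rotation would show up here as `find_key` returning None.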

Deploy gotcha

Because of DAT-291, deploy.sh will abort if the prod .env still holds placeholder secrets, before any rebuild. If a deploy stops at "ABORT: .env failed the agent boot guard", fix the flagged secrets and re-run. See Deploy.
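The kind of check that trips this abort can be sketched as follows. This is not the agent's assert_boot_required_env, whose actual rules are not shown on this page; the placeholder markers, the helper names, and REDIS_PASSWORD as a key name are assumptions for illustration.

```python
from typing import Dict, Iterable, List

# Assumed substrings that mark a value as a placeholder.
PLACEHOLDER_MARKERS = ("changeme", "placeholder", "replace-me")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def placeholder_secrets(env: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return required keys that are missing, empty, or look like placeholders."""
    bad = []
    for key in required:
        value = env.get(key, "").lower()
        if not value or any(marker in value for marker in PLACEHOLDER_MARKERS):
            bad.append(key)
    return bad


env = parse_env("DOCS_PASSWORD=changeme\nREDIS_PASSWORD=s3cr3t-long-value\n")
print(placeholder_secrets(env, ["DOCS_PASSWORD", "REDIS_PASSWORD"]))  # ['DOCS_PASSWORD']
```

Running a check like this against the new .env before any rebuild is what turns a crash-looping container into an early, readable abort.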