Atlas (Information WebUI): the Catalog Studio

Now deployed as Atlas on Google Cloud (2026-06)

This service has migrated to Cloud Run as the atlas service (us-west1 / project dataland-agent), served at https://atlas.dataland.chat. What changed from the on-prem description below:

Catalog → Cloud SQL (Postgres 16) via DATABASE_URL (the local SQLite file remains the dev/test default when DATABASE_URL is empty).
Images → GCS is the source of truth; the local data/ dir is ephemeral on Cloud Run and image bytes (incl. thumbnails) are hydrated from GCS on demand. Public dataland-public, private dataland-atlas-prod.
Secrets → Secret Manager + Workload Identity (no gcp-key.json). Auth env vars renamed INFORMATION_WEBUI_* → ATLAS_*.
Ingress → Cloudflare DNS-only CNAME → Cloud Run managed TLS (not the Spark Cloudflare Tunnel / data.dataland.chat).
Infra as code → dataland-ai/dataland-gcp-terraform.

The sections below still describe the original on-prem (Spark/compose) deployment, kept for reference; the cloud facts above are authoritative for production.

dataland-atlas is the non-developer CMS for everything the agent and RAG layer read about the institution. It is a FastAPI backend + a plain Vite/TypeScript single-page frontend ("Dataland Catalog Studio") served from the same container. Curators create entities, upload images, and write captions/descriptions; the service persists to a SQLite catalog, uploads bytes to GCS, generates WebP thumbnails locally, and fires a live ingest into RAG so Qdrant stays in lockstep on every save.

It exposes two workspaces over the same storage and live-sync machinery:

Projects — the Refik Anadol Studio artwork catalog (the original surface).
Museum — the physical-venue catalog: a singleton museum overview plus museum sections (galleries, orientation room, FAQ, tickets, membership, etc.) and their ordered scenes. This is what the in-museum agent answers from.


Container	`dataland-atlas`
Image	`dataland/information-webui:${IMAGE_TAG}` (multi-stage build from `dataland-atlas/Dockerfile`)
Public host	`data.dataland.chat` (via Cloudflare Tunnel → `http://localhost:4152`)
Public port	`4152`
Internal URL	`http://dataland-atlas:4152`
Healthcheck	`GET /health` → `{"status":"ok","service":"dataland-atlas"}`
Metrics	`GET /metrics` (Prometheus text; DAT-82)
API docs	`GET /docs` (Swagger UI), `GET /openapi.json`
Runtime	single uvicorn worker, `--factory` mode (`app.main:create_app`), non-root uid/gid 1000

Recent changes (DAT- references)

Museum workspace CMS shipped: sections + scenes + images, section & scene image uploads, scene thumbnails, flat image URLs, gallery-image RAG ingest.
The earlier "gallery" entity was renamed to section across DB + API + UI, and the separate hero image was dropped — header art is now derived from the section's own first image.
All 20 museum sections + 3 scenes + the museum overview were (re-)ingested into Qdrant via the RAG live-sync helpers (knowledge 4839 → 4969 points), confirmed retrievable through rag-v2 /search (e.g. "Biome Lumina").
DAT-269 standardized the model to gemini-3.5-flash everywhere, which is what backs this service's /ingest/image captioning and /search on rag-v2.
DAT-281 — the agent now speaks real section names (Data Pavilion=GA, Latent Gallery=GB, Infinity Room=GC, The Sanctuary=GD, Discovery Portal=ON); those names come from this catalog.
DAT-288 — default_reference_* placeholder images were purged (museum catalog + GCS, Qdrant-checked).
DAT-82 — /metrics exposed via a pure-ASGI middleware so streaming responses stay unbuffered.

Architecture

graph LR
  subgraph webui[dataland-atlas]
    FE["Vite SPA<br/>(web/dist)"]
    API["FastAPI /api"]
    PS["ProjectStore"]
    MS["MuseumStore"]
    RS["rag_sync<br/>(bounded ThreadPoolExecutor)"]
  end
  FE --> API
  API --> PS
  API --> MS
  PS --> SQLite[("catalog.sqlite3")]
  MS --> SQLite
  PS -. upload bytes .-> GCS[("GCS public/private")]
  MS -. upload bytes .-> GCS
  PS --> RS
  MS --> RS
  RS -. "/ingest/file (knowledge)" .-> RAG[dataland-rag]
  RS -. "/ingest/image (images)" .-> RAG
  RS -. "DELETE /ingest/by-project-slug" .-> RAG

The two stores (ProjectStore for projects, MuseumStore for the museum tree) share one SQLAlchemy engine + session factory so a single SQLite connection pool serves both writers. MuseumStore.__init__ only seeds the singleton museum row if it's missing — schema creation and migrations are owned by ProjectStore.

Data model

Projects

Entity	Fields
Project	`slug`, `name`, `short_description`, `description`, `location`, `exhibition_date`, `language`, `categories`, `tags`, plus `images[]`
Image (`ImageAsset`)	`caption`, `alt_text`, `description`, `tags`, `sort_order`, dimensions, multi-tier URLs, enrichment (`auto_caption`, `keywords`, `source_url`, `source_page`)

exhibition_date is normalized to YYYY-MM-DD; a year-only value like 2024 becomes 2024-01-01, and project lists are ordered newest-first by exhibition date. Image files are named <project-name>-001.jpg, <project-name>-002.jpg, … under data/projects/<slug>/images/.
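The date and file-naming rules above can be sketched as follows. This is a minimal illustration, not the service's code: `normalize_exhibition_date` and `image_filename` are hypothetical names, and the handling of anything other than a bare year or an ISO date is an assumption.

```python
import re
from datetime import date


def normalize_exhibition_date(raw: str) -> str:
    """Return an ISO YYYY-MM-DD string; a bare year maps to January 1."""
    value = raw.strip()
    if re.fullmatch(r"\d{4}", value):
        return f"{value}-01-01"
    # Assumption: everything else must already be an ISO date.
    # date.fromisoformat raises ValueError on input it cannot parse.
    return date.fromisoformat(value).isoformat()


def image_filename(project_name: str, index: int, ext: str = "jpg") -> str:
    """Build the zero-padded <project-name>-NNN.<ext> file name."""
    return f"{project_name}-{index:03d}.{ext}"


# Newest-first ordering: ISO dates sort correctly as plain strings.
projects = [
    {"slug": "a", "exhibition_date": "2023-05-10"},
    {"slug": "b", "exhibition_date": "2024"},
]
ordered = sorted(
    projects,
    key=lambda p: normalize_exhibition_date(p["exhibition_date"]),
    reverse=True,
)
```

Because the normalized value is always zero-padded ISO, string ordering and chronological ordering agree, so no date parsing is needed at sort time.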

Museum (sections → scenes)

Three entity types share the Projects field shape (so the uploader, thumbnailer, and RAG-sync code is reused with only the parent identifier differing):

Museum — a singleton overview/settings row (the institution itself). Get + update only; never created or deleted via the API. The store seeds the row on first run so GET /api/museum is always total. Date field is opening_date (same YYYY-MM-DD semantics as exhibition_date).
Section — a museum area: gallery-a … gallery-d, orientation-room, frequently-asked-questions, tickets-operating-hours, membership, large-nature-model, biome-lumina, etc. Fields: slug, name, location, language, sort_order, categories, tags, short_description, description, plus images[] and a scenes[] summary list.
Scene — an ordered sub-experience scoped to a section_id (e.g. the scenes under gallery-b). Same field shape as a section, plus images[].

graph TD
  M["Museum (singleton 'museum')"] --> S1["Section gallery-a"]
  M --> S2["Section gallery-b"]
  M --> S3["Section orientation-room"]
  S2 --> SC1["Scene …"]
  S2 --> SC2["Scene …"]
  S1 --> I1[["Section images"]]
  SC1 --> I2[["Scene images"]]

Naming history

An earlier gallery entity was renamed to section across the DB, API, and UI, and the separate per-section hero image was removed. The section header art is now derived from the section's first ordered image (cover_image_id in SectionSummary), not a dedicated upload.

Storage layout

SQLite at data/catalog.sqlite3 is the runtime source of truth. Project/agent JSON is generated on demand (the .../catalog routes) and never written to disk in normal flows. Older legacy JSON mirrors may linger but are not read or written by current create/update/delete paths.

data/
  catalog.sqlite3
  projects/<project-slug>/images/<project-name>-001.jpg
  museum/sections/<section-slug>/<section-name>-001.jpg
  museum/scenes/<scene-slug>/<scene-name>-001.jpg
  thumbnails/<project-slug>/<image-id>-<size>.webp
  thumbnails/museum/sections/<section-slug>/<image-id>-<size>.webp
  thumbnails/museum/scenes/<scene-slug>/<image-id>-<size>.webp

Thumbnails are WebP (quality=82, method=6), bounded 80…1200 px, regenerated only when stale relative to the source mtime. They are used for list/grid rendering; the full image loads on preview.
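A rough sketch of the size bound and staleness check described above, using only the standard library. The function names are hypothetical, and clamping an out-of-range size (rather than rejecting it) is an assumption; the WebP encoding step itself is omitted.

```python
import os

MIN_THUMB_PX = 80    # lower bound from the text above
MAX_THUMB_PX = 1200  # upper bound from the text above


def clamp_thumbnail_size(requested: int) -> int:
    """Assumption: out-of-range sizes are clamped into 80..1200 px."""
    return max(MIN_THUMB_PX, min(MAX_THUMB_PX, requested))


def thumbnail_is_stale(source_path: str, thumb_path: str) -> bool:
    """True when the thumbnail is missing or older than its source file."""
    if not os.path.exists(thumb_path):
        return True
    return os.path.getmtime(thumb_path) < os.path.getmtime(source_path)
```

A thumbnail route would call `thumbnail_is_stale` first and re-encode only when it returns `True`, which keeps repeated grid loads cheap.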

Image record fields

Every image record (project, section, or scene) carries:

Field	Source	Notes
`width`, `height`	PIL on upload	Lets the gallery pick a thumbnail tier without re-decoding. Read-only (never set from the UI).
`auto_caption`	Gemini caption via rag-v2 `/ingest/image`	Visual description. Distinct from the curator-written `caption`.
`keywords`	`app/keyword_extractor.py` (`merge_keywords`)	Deduplicated tokens merged from `caption + alt_text + description + tags + parent context`. Capped at 50, aligned with the rag-v2 keyword surface.
`source_url`, `source_page`	parsed from the image `description`	Upstream provenance (`app/source_url_parser.py`).
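To make the `keywords` row concrete, here is a minimal sketch of a deduplicating, capped merge. It is not the real `merge_keywords` from `app/keyword_extractor.py`; the tokenizer (lowercase alphanumeric runs) and the ordering (tags first, then text fields) are assumptions.

```python
import re
from typing import Optional

MAX_KEYWORDS = 50  # cap stated in the table above


def merge_keywords_sketch(
    *sources: str,
    tags: Optional[list] = None,
    limit: int = MAX_KEYWORDS,
) -> list:
    """Merge tags and free-text fields into a deduplicated keyword list."""
    candidates = list(tags or [])
    for text in sources:
        # Assumption: a token is any run of lowercase letters or digits.
        candidates.extend(re.findall(r"[a-z0-9]+", text.lower()))

    seen = set()
    merged = []
    for token in candidates:
        key = token.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
            if len(merged) == limit:
                break
    return merged
```

Calling it with the caption, alt text, description, and parent context as positional arguments and the tag list as `tags=` yields at most 50 unique tokens in first-seen order.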

A PIL decompression-bomb cap of 24,000,000 px rejects oversized declared-dimension images with 413 before they hit disk/GCS (with rollback of the local file and any GCS object already uploaded).
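The pixel cap amounts to a single multiplication on the declared dimensions. The sketch below assumes the limit is inclusive (exactly 24,000,000 px is accepted) and uses a hypothetical `ImageTooLarge` exception to carry the 413 status; neither detail is confirmed by the text above.

```python
MAX_IMAGE_PIXELS = 24_000_000  # cap stated above


class ImageTooLarge(Exception):
    """Hypothetical error type; an API layer would map it to HTTP 413."""

    status_code = 413


def check_declared_dimensions(width: int, height: int) -> None:
    """Raise ImageTooLarge when width * height exceeds the pixel cap."""
    pixels = width * height
    if pixels > MAX_IMAGE_PIXELS:
        raise ImageTooLarge(f"{pixels} px exceeds cap of {MAX_IMAGE_PIXELS} px")
```

For scale, a 6000 x 4000 image sits exactly at the cap, so typical camera output passes while pathological declared sizes are refused before decoding.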

Image URL tiers (`bestImageUrl`)

The frontend resolver in web/src/api.ts picks cdn_url > public_url > local /api/.../file fallback (passing updated_at as a cache-busting tag).

Field	Source	Use
`public_url`	`${GCS_PUBLIC_BASE_URL}/${GCS_PUBLIC_BUCKET}/<object>`	Long-term public address.
`cdn_url`	`${WEBUI_CDN_BASE_URL}/<object>` (when set)	CDN in front of the public bucket; preferred when present.
`gs_uri`	`gs://<bucket>/<object>`	Internal references (Qdrant payload, scripts).
`thumbnail_url`	generated WebP under `data/thumbnails/...`	Fast list rendering.
`original_file_url`	`/api/.../images/<id>/file`	Authenticated download served by the webui itself.
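The tier order can be expressed in a few lines. The real resolver is TypeScript in web/src/api.ts; this is a Python transliteration of the priority rule only. The `v` query parameter name, and applying the cache-busting tag to the local fallback alone, are assumptions.

```python
from urllib.parse import urlencode


def best_image_url(image: dict, local_file_url: str) -> str:
    """Pick cdn_url, then public_url, then the local file route."""
    for field in ("cdn_url", "public_url"):
        url = image.get(field)
        if url:
            return url
    updated_at = image.get("updated_at")
    if updated_at:
        # Assumption: updated_at is appended as a query tag named "v".
        return f"{local_file_url}?{urlencode({'v': updated_at})}"
    return local_file_url
```

Empty strings count as missing here, so a record whose `cdn_url` is `""` falls through to `public_url` instead of producing a blank link.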

API

All /api/* routes except /api/auth/* are auth-gated (see Auth posture). /health, /metrics, /docs, /openapi.json, and /api/auth/* are open.

Auth & system

GET  /health
GET  /metrics
GET  /docs · GET /openapi.json
GET  /api/auth/session
POST /api/auth/login
POST /api/auth/logout

Projects

GET    /api/projects
GET    /api/projects/search?q=&limit=
POST   /api/projects
GET    /api/projects/{project_id_or_slug}
PUT    /api/projects/{project_id_or_slug}
DELETE /api/projects/{project_id_or_slug}
POST   /api/projects/{id_or_slug}/images
PUT    /api/projects/{id_or_slug}/images/{image_id}
DELETE /api/projects/{id_or_slug}/images/{image_id}
GET    /api/projects/{id_or_slug}/images/{image_id}/file
GET    /api/projects/{id_or_slug}/images/{image_id}/thumbnail?size=
GET    /api/projects/{id_or_slug}/catalog        # (1)!
GET    /api/catalog                              # (2)!

(1) Emits the per-project schema-versioned RAG/agent JSON (dataland.rag.*.catalog.v1), generated on demand. This is what RAG and the agent consume; it is never written to disk in normal flows.
(2) The global catalog across all projects, generated on demand from SQLite, the runtime source of truth.

Museum workspace

GET  /api/museum                                 # (1)!
PUT  /api/museum
GET  /api/museum/catalog                          # (2)!

GET  /api/museum/sections
POST /api/museum/sections
GET  /api/museum/sections/{identifier}
PUT  /api/museum/sections/{identifier}
DELETE /api/museum/sections/{identifier}          # (3)!
GET  /api/museum/sections/{identifier}/catalog
POST /api/museum/sections/{identifier}/images
PUT  /api/museum/sections/{identifier}/images/{image_id}
DELETE /api/museum/sections/{identifier}/images/{image_id}
GET  /api/museum/sections/{identifier}/images/{image_id}/file
GET  /api/museum/sections/{identifier}/images/{image_id}/thumbnail?size=

GET  /api/museum/sections/{section_identifier}/scenes
POST /api/museum/sections/{section_identifier}/scenes
GET  /api/museum/scenes/{identifier}
PUT  /api/museum/scenes/{identifier}
DELETE /api/museum/scenes/{identifier}            # (4)!
GET  /api/museum/scenes/{identifier}/catalog
POST /api/museum/scenes/{identifier}/images
PUT  /api/museum/scenes/{identifier}/images/{image_id}
DELETE /api/museum/scenes/{identifier}/images/{image_id}
GET  /api/museum/scenes/{identifier}/images/{image_id}/file
GET  /api/museum/scenes/{identifier}/images/{image_id}/thumbnail?size=

(1) The singleton museum overview row. Get + update only: it is never created or deleted via the API, and the store seeds it on first run so this is always total.
(2) The full museum tree as RAG/agent JSON (dataland.rag.museum.catalog.v1): sections, scenes, bucket layout, RAG target collections, and inlined image entries.
(3) Deleting a section removes more than the section row: it cascades to the section's scenes, the matching GCS object prefix, and the RAG points across both collections.
(4) Deleting a scene cascades to its GCS object prefix and RAG points, via schedule_cascade_delete_for_scene.

Identifiers

Every section/scene route accepts either the slug or the id (WHERE slug = :id OR id = :id). Slugs are generated from the name via slugify() and made unique against both the DB and the on-disk directory.
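Slug generation with a uniqueness check might look like the sketch below. The real `slugify()` may differ; the ASCII folding, the `untitled` fallback, and the `-2`, `-3` suffix scheme are all assumptions. `taken` stands for the union of slugs already in the DB and directory names already on disk.

```python
import re
import unicodedata


def slugify_sketch(name: str) -> str:
    """Lowercase, ASCII-fold, and hyphenate a display name."""
    folded = unicodedata.normalize("NFKD", name)
    ascii_name = folded.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    return slug or "untitled"  # assumption: fallback for empty results


def unique_slug(name: str, taken: set) -> str:
    """Append -2, -3, ... until the slug is not in `taken`."""
    base = slugify_sketch(name)
    candidate = base
    suffix = 2
    while candidate in taken:
        candidate = f"{base}-{suffix}"
        suffix += 1
    return candidate
```

Checking against both sources matters because a directory can outlive its DB row, and reusing its name would mix old image files into a new entity.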

The .../catalog endpoints emit schema-versioned RAG/agent JSON (dataland.rag.museum.catalog.v1, dataland.rag.section.catalog.v1, dataland.rag.scene.catalog.v1) with the full tree, bucket layout, RAG target collections, and inlined image entries. The frontend deep-link routes (/projects/{slug}, /museum, /museum/{section}, /museum/{section}/{scene}) return the SPA index.html so client routing works on refresh.

RAG live-sync

Every create, update, or delete of a project, section, scene, or the museum overview triggers a fire-and-forget ingest against rag-v2 so that Qdrant tracks the catalog:

text → POST /ingest/file (the knowledge collection),
images → POST /ingest/image (the images collection, which captions via Gemini and stores multimodal vectors + denormalized parent context as payload).

Replace-by-slug semantics. Each sync first calls DELETE /ingest/by-project-slug/<slug> to wipe the entity's stale points across both collections, then re-ingests. Combined with deterministic UUIDv5 point ids on the RAG side, this gives clean upserts with no duplicate accumulation across many edits.
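The effect of delete-then-reingest with deterministic ids can be shown with an in-memory stand-in for the vector store. This is only an illustration: the namespace value, the `slug:chunk` id recipe, and the dict-based index are assumptions, not how rag-v2 derives its point ids.

```python
import uuid

# Hypothetical namespace; the real RAG service defines its own.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/rag")


def point_id(project_slug: str, chunk_key: str) -> str:
    """Same slug + chunk key always yields the same UUIDv5."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{project_slug}:{chunk_key}"))


def replace_by_slug(index: dict, project_slug: str, chunks: dict) -> None:
    """Wipe every point for the slug, then upsert the fresh chunks."""
    stale = [pid for pid, p in index.items() if p["project_slug"] == project_slug]
    for pid in stale:
        del index[pid]
    for key, text in chunks.items():
        index[point_id(project_slug, key)] = {
            "project_slug": project_slug,
            "text": text,
        }
```

Running `replace_by_slug` repeatedly with the same chunk keys leaves the index size unchanged, and a chunk dropped from a later edit disappears instead of lingering as an orphan.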

Entities are rendered to markdown (render_project_markdown / render_museum_markdown / render_section_markdown / render_scene_markdown) so the RAG chunker, which splits on headings, produces one chunk per logical block. Museum entities carry a namespaced RAG slug and an entity_type in the payload so the agent can scope retrieval, and so a museum section named archive-dreaming can never collide with a Refik project of the same slug:

Entity	RAG slug (`project_slug` grouping key)	`entity_type`
Museum overview	`museum`	`museum`
Section	`museum-section-<slug>`	`section` (its images: `section_image`)
Scene	`museum-scene-<slug>`	`scene` (its images: `scene_image`)
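The namespacing in the table reduces to a small mapping. In this sketch the function name is hypothetical, and treating a project's RAG slug as its bare slug with an entity type of `project` is an assumption, since the table only lists museum entities.

```python
def rag_slug(entity_type: str, slug: str = "") -> str:
    """Map an entity to its RAG grouping key (the project_slug payload)."""
    if entity_type == "museum":
        return "museum"
    if entity_type == "section":
        return f"museum-section-{slug}"
    if entity_type == "scene":
        return f"museum-scene-{slug}"
    if entity_type == "project":
        return slug  # assumption: projects use their bare slug
    raise ValueError(f"unknown entity_type: {entity_type!r}")
```

With this mapping, a section and a project that share the slug `archive-dreaming` produce different grouping keys, so a replace-by-slug on one never deletes the other's points.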

MuseumStore fires schedule_ingest_for_section / _scene and the museum overview on each persist (same contract ProjectStore uses with schedule_ingest_for_project). Section/scene deletes go through schedule_cascade_delete_for_section/_scene, which also delete the GCS object prefix.

Sync runtime (DAT-195)

Jobs run on a bounded ThreadPoolExecutor (replacing per-call daemon threads), so a bulk operation queues instead of forking N threads. Each step runs through _with_retry: transient failures (URLError, TimeoutError, ConnectionError, HTTP 5xx/429) retry with exponential backoff + jitter, giving up after RAG_SYNC_MAX_RETRIES. A RAG or GCS hiccup logs a warning and returns — it never propagates back into the user-facing save. The FastAPI lifespan shutdown hook drains the executor (shutdown_executor(wait=True)) so in-flight syncs finish.

Env var	Default	Purpose
`RAG_BASE_URL`	`""`	e.g. `http://dataland-rag:4143`. Empty disables the hook.
`RAG_API_KEY`	`""`	`X-API-Key` value for `/ingest/file` and `/ingest/image`.
`RAG_REQUEST_TIMEOUT_SECONDS`	`60`	Per-call timeout.
`RAG_SYNC_MAX_WORKERS`	`4`	Bounded executor concurrency.
`RAG_SYNC_MAX_RETRIES`	`3`	Retry attempts per step.
`RAG_SYNC_RETRY_BASE_S` / `RAG_SYNC_RETRY_MAX_S`	`1.0` / `8.0`	Backoff bounds.
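The retry behaviour described above can be approximated as follows. This is a sketch rather than the service's `_with_retry`: the function names are hypothetical, "full jitter" is one of several possible jitter schemes, and reading `RAG_SYNC_MAX_RETRIES` as retries after the first attempt is an assumption.

```python
import logging
import random
import time
from urllib.error import HTTPError, URLError


def is_transient(exc: Exception) -> bool:
    """Transient: network-level errors, HTTP 429, and HTTP 5xx."""
    if isinstance(exc, HTTPError):  # checked first: HTTPError is a URLError
        return exc.code == 429 or 500 <= exc.code < 600
    return isinstance(exc, (URLError, TimeoutError, ConnectionError))


def with_retry_sketch(step, *, max_retries=3, base_s=1.0, max_s=8.0, sleep=time.sleep):
    """Run step(); retry transient failures with capped exponential backoff."""
    attempt = 0
    while True:
        try:
            return step()
        except Exception as exc:
            if not is_transient(exc) or attempt >= max_retries:
                raise
            delay = min(max_s, base_s * (2 ** attempt))
            sleep(random.uniform(0, delay))  # full jitter: anywhere in [0, delay]
            attempt += 1


def run_sync_job(step, **retry_options) -> bool:
    """Log and swallow the final failure so a save is never blocked."""
    try:
        with_retry_sketch(step, **retry_options)
        return True
    except Exception as exc:
        logging.warning("rag sync step failed: %s", exc)
        return False
```

With the default bounds, the delay ceilings are 1 s, 2 s, and 4 s, and `max_s` only starts to bite from a fourth retry onward. Passing `sleep` as a parameter keeps the sketch testable without real waiting.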

Standalone mode

When RAG_BASE_URL (or RAG_API_KEY) is empty, the webui runs standalone and skips all ingest. The manual backfill for projects is scripts/ingest_to_rag.py. To bulk re-ingest the museum (all sections + scenes + overview) drive the blocking helpers in-process:

from app.config import get_settings
from app.storage import ProjectStore
from app.museum_storage import MuseumStore
from app.rag_sync import (
    ingest_museum_blocking,      # (1)!
    ingest_section_blocking,
    ingest_scene_blocking,
)

s = get_settings()
ps = ProjectStore(s)
store = MuseumStore(s, engine=ps.engine, session_factory=ps.session_factory)  # (2)!
ingest_museum_blocking(s, store.get_museum())
for summ in store.list_sections():
    section = store.get_section(summ.slug)
    ingest_section_blocking(s, section, museum_dir=store.museum_dir)
    for sc in store.list_scenes(section.slug):
        ingest_scene_blocking(s, store.get_scene(sc.slug), museum_dir=store.museum_dir)

(1) The *_blocking variants run synchronously, unlike the fire-and-forget schedule_* hooks the live service uses. Use them for a manual backfill where you want to wait for completion and see failures surface.
(2) MuseumStore reuses ProjectStore's engine + session factory so both writers share one SQLite connection pool. Schema creation and migrations are owned by ProjectStore; MuseumStore only seeds the singleton museum row if missing.

GCS uploads

Image uploads land in the public bucket so the mobile app and rag-v2 can fetch them directly. Museum content is namespaced under artworks/museum/... so it can't collide with a project that shares a slug.

GCS path	Source
`artworks/<project-slug>/<file>`	project images
`artworks/museum/sections/<section-slug>/<file>`	section images
`artworks/museum/scenes/<scene-slug>/<file>`	scene images

Env var	Default	Purpose
`GCP_PROJECT_ID`	`dataland-ai`	Used by `google-cloud-storage`.
`GCS_PUBLIC_BASE_URL`	`https://storage.googleapis.com`	Base prefix for `public_url`.
`GCS_PUBLIC_BUCKET`	`""` (empty disables uploads)	Production: `dataland-public`.
`GCS_PRIVATE_BUCKET`	`dataland-private`	Reserved for scene metadata, documents, external articles.
`GCS_ARTWORKS_PREFIX`	`artworks/`	Subprefix inside the public bucket (matches the rag-v2 scanner default).
`GCS_SCENES_PREFIX`	`museum/scenes/`	Private-bucket scene-metadata subprefix (reflected in `museum_catalog.bucket_layout`).
`GCS_CATALOGS_PREFIX`	`catalogs/`	Catalog-mirror subprefix.
`WEBUI_CDN_BASE_URL`	`""`	Optional CDN in front of the public bucket.
`GOOGLE_APPLICATION_CREDENTIALS`	`/app/gcp-key.json`	Service-account JSON, bind-mounted read-only (DAT-191).
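Putting the path table and the env defaults together, object names and their public addresses could be derived as in this sketch. The helper names are hypothetical; the path shapes and default values come from the tables above.

```python
def gcs_object_name(kind: str, slug: str, filename: str,
                    artworks_prefix: str = "artworks/") -> str:
    """Build the object name for a project, section, or scene image."""
    prefix = artworks_prefix.rstrip("/") + "/"
    if kind == "project":
        return f"{prefix}{slug}/{filename}"
    if kind == "section":
        return f"{prefix}museum/sections/{slug}/{filename}"
    if kind == "scene":
        return f"{prefix}museum/scenes/{slug}/{filename}"
    raise ValueError(f"unknown kind: {kind!r}")


def public_url(base_url: str, bucket: str, object_name: str) -> str:
    """${GCS_PUBLIC_BASE_URL}/${GCS_PUBLIC_BUCKET}/<object>"""
    return f"{base_url.rstrip('/')}/{bucket}/{object_name}"


def gs_uri(bucket: str, object_name: str) -> str:
    """gs://<bucket>/<object>"""
    return f"gs://{bucket}/{object_name}"
```

For example, a section image `gallery-a-001.jpg` in section `gallery-a` resolves to `artworks/museum/sections/gallery-a/gallery-a-001.jpg`, which cannot clash with a project that happens to be called `gallery-a`.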

GCS degrades silently

When GCS_PUBLIC_BUCKET is empty, the webui keeps images on the local volume only and never reaches GCS — and with no bucket, RAG image ingest is skipped (a rag_sync_*_images_skipped_no_gcs_bucket log line). Without the GCP key mount, GCS upload degrades to None and the only signal is a "GCS credentials missing" warning. Text ingest into knowledge still works without GCS.

Auth posture

Login is a single shared password (INFORMATION_WEBUI_PASSWORD) exchanged for an HMAC-SHA256-signed session cookie (dataland_information_session, SameSite=Lax, HttpOnly, default 12h TTL). The signing key is INFORMATION_WEBUI_SESSION_SECRET (falls back to the password if unset).
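An HMAC-SHA256-signed session value with a TTL can be sketched like this. The token layout (`<issued_at>.<hex signature>`) is an assumption made for illustration; the service's actual cookie format is not described here.

```python
import hashlib
import hmac
import time


def sign_session(secret: str, issued_at: int) -> str:
    """Return '<issued_at>.<hex hmac>' (hypothetical token layout)."""
    payload = str(issued_at)
    sig = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify_session(secret: str, token: str, max_age_s: int = 43200, now=None) -> bool:
    """Check the signature in constant time, then enforce the TTL."""
    try:
        issued_raw, sig = token.split(".", 1)
        issued_at = int(issued_raw)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(secret.encode(), issued_raw.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - issued_at <= max_age_s
```

`hmac.compare_digest` avoids leaking how many leading characters matched. The default `max_age_s` of 43200 mirrors the 12h TTL; because the key falls back to the password when no session secret is set, rotating the password would also invalidate existing sessions in that configuration.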

DAT-192 — per-IP sliding-window rate limit on POST /api/auth/login: 5 failures inside a 600s window triggers a 300s lockout (429 with Retry-After). A correct password clears the bucket. The 503 "password not configured" branch does not consume capacity.
DAT-190 — strict-Origin CSRF gate on mutating cookie-authenticated requests: every non-safe method must carry an Origin/Referer whose origin is in CORS_ORIGINS, else 403. Safe methods (GET/HEAD/OPTIONS) are exempt.
Server-to-server callers set INFORMATION_WEBUI_API_TOKEN and pass it as Authorization: Bearer <token> or X-Information-WebUI-Token: <token>. Token auth bypasses both the cookie and the CSRF check (it can't carry a cross-site cookie).
When no password is configured and AUTH_ENABLED=false, the service is intentionally open (local dev / tests).
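The DAT-192 limits (5 failures in 600 s, then a 300 s lockout) map onto a small sliding-window counter. This is an in-memory sketch with hypothetical names; clearing the failure history once a lockout starts is an assumption.

```python
import math
from collections import defaultdict, deque


class LoginRateLimiter:
    """Per-IP sliding window over failed logins."""

    def __init__(self, max_failures=5, window_s=600, lockout_s=300):
        self.max_failures = max_failures
        self.window_s = window_s
        self.lockout_s = lockout_s
        self._failures = defaultdict(deque)  # ip -> failure timestamps
        self._locked_until = {}              # ip -> unlock timestamp

    def retry_after(self, ip: str, now: float) -> int:
        """Seconds left in the lockout (a Retry-After value); 0 if allowed."""
        return max(0, math.ceil(self._locked_until.get(ip, 0.0) - now))

    def record_failure(self, ip: str, now: float) -> None:
        window = self._failures[ip]
        window.append(now)
        # Drop failures that have slid out of the window.
        while window and window[0] <= now - self.window_s:
            window.popleft()
        if len(window) >= self.max_failures:
            self._locked_until[ip] = now + self.lockout_s
            window.clear()  # assumption: history resets when lockout starts

    def record_success(self, ip: str) -> None:
        """A correct password clears the bucket."""
        self._failures.pop(ip, None)
        self._locked_until.pop(ip, None)
```

A login handler would check `retry_after` before comparing passwords and answer 429 when it is non-zero. Requests that hit the 503 "password not configured" branch would simply never call `record_failure`.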

Operator name remap

The infra deploy env exposes a few variables under an operator-friendly prefix that compose remaps to the container names: INFORMATION_WEBUI_AUTH_ENABLED → AUTH_ENABLED, INFORMATION_WEBUI_COOKIE_SECURE → AUTH_COOKIE_SECURE, INFORMATION_WEBUI_CORS_ORIGINS → CORS_ORIGINS, INFORMATION_WEBUI_SESSION_MAX_AGE_SECONDS → AUTH_SESSION_MAX_AGE_SECONDS. See the env inventory (docs/env-inventory.md).

Key env vars

APP_ENV=production
APP_HOST=0.0.0.0
APP_PORT=4152
LOG_LEVEL=info

STORAGE_DIR=/app/data
MAX_UPLOAD_SIZE_MB=50                         # (1)!
CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173,http://localhost:4152  # (2)!

AUTH_ENABLED=true
INFORMATION_WEBUI_PASSWORD=***               # (3)!
INFORMATION_WEBUI_SESSION_SECRET=***         # (4)!
INFORMATION_WEBUI_API_TOKEN=                 # (5)!
AUTH_SESSION_MAX_AGE_SECONDS=43200           # (6)!
AUTH_COOKIE_SECURE=false                     # (7)!

GCP_PROJECT_ID=dataland-ai
GCS_PUBLIC_BASE_URL=https://storage.googleapis.com
GCS_PUBLIC_BUCKET=dataland-public            # (8)!
GCS_PRIVATE_BUCKET=dataland-private
GCS_ARTWORKS_PREFIX=artworks/
GCS_SCENES_PREFIX=museum/scenes/
GCS_CATALOGS_PREFIX=catalogs/
GOOGLE_APPLICATION_CREDENTIALS=/app/gcp-key.json
GCP_KEY_FILE=./secrets/gcp-key.json          # (9)!

RAG_BASE_URL=http://dataland-rag:4143        # (10)!
RAG_API_KEY=***
RAG_REQUEST_TIMEOUT_SECONDS=60

INFORMATION_WEBUI_PUBLIC_PORT=4152
INFORMATION_WEBUI_DATA_DIR=./data            # (11)!

(1) Upload size cap. Uploads above this return 413; raise it for large assets. Note this is separate from the PIL decompression-bomb cap of 24,000,000 px that rejects oversized declared-dimension images.
(2) Allowed origins for CORS and the DAT-190 strict-Origin CSRF gate: every mutating cookie-authenticated request must carry an Origin/Referer whose origin is in this list, else 403.
(3) The single shared login password. When empty and AUTH_ENABLED=false, the service is intentionally open (local dev / tests only).
(4) HMAC-SHA256 signing key for the session cookie. Falls back to INFORMATION_WEBUI_PASSWORD if unset, so set it explicitly in production.
(5) Server-to-server bearer token. Callers pass it as Authorization: Bearer <token> or X-Information-WebUI-Token: <token>; it bypasses both the cookie and the CSRF check. Empty disables token auth.
(6) Session cookie TTL in seconds (43200 = 12h).
(7) Set to true behind TLS so the dataland_information_session cookie is sent only over HTTPS. Maps from INFORMATION_WEBUI_COOKIE_SECURE in the deploy env.
(8) Production public bucket. Empty disables all GCS uploads: images stay local-only and RAG image ingest is silently skipped (text ingest still works).
(9) Host path of the service-account JSON, bind-mounted read-only to /app/gcp-key.json (DAT-191). Without it, GCS upload degrades to None with only a "GCS credentials missing" warning.
(10) rag-v2 base URL for the live-sync hook. Empty (or empty RAG_API_KEY) disables the hook and runs the webui standalone with no ingest.
(11) Host bind-mount source mapped to /app/data (catalog.sqlite3 + images + thumbnails). Persistent so rebuilds/restarts keep the same catalog.

Volumes

/app/data         <- host: ${INFORMATION_WEBUI_DATA_DIR:-./data}  # (1)!
/app/gcp-key.json <- host: ${GCP_KEY_FILE:-./secrets/gcp-key.json} # (2)!

(1) Persistent bind-mount holding catalog.sqlite3 + images + thumbnails. SQLite is the standalone source of truth, so this must survive rebuilds/restarts. The container runs as uid/gid 1000 (DAT-210) so the mount can be chowned once.
(2) Service-account JSON, read-only (chmod 600). DAT-191 bind-mounts it :ro so the running container can never mutate the credential.

Scripts

All scripts read the same .env the service does and are idempotent.

Script	Purpose
`build_refik_dataset.py`	One-shot Refik Anadol Studio scrape into `dataset/`. Does not touch live `data/`.
`import_refik_dataset.py`	Upserts the scraped `dataset/` into `data/` (projects + image rows by slug).
`ingest_images.py`	Bulk-upload a folder of images for a given project slug.
`ingest_to_rag.py`	Manual backfill equivalent of the live-sync hook for projects.
`backfill_image_urls.py`	Rewrites `public_url` / `cdn_url` / `gs_uri` / `thumbnail_url` after a bucket rename or CDN cutover.
`refresh_project_metadata.py`	Recomputes derived metadata (keywords, ordering, thumbnails) without re-uploading source images.
`rewrite_short_descriptions.py`	Bulk LLM rewrite of `short_description` across projects.
`sync_from_qdrant.py`	Read-side cross-check; reconciles Qdrant payloads against the SQLite catalog.
`sync_project_slugs.py`	Migrates project slugs when titles change.

Reaching it

curl -fsS https://data.dataland.chat/health      # (1)!
curl -fsS http://localhost:4152/health           # (2)!
curl -fsS http://localhost:4152/metrics          # (3)!

(1) Public entry point. Cloudflare Tunnel routes data.dataland.chat to the container's local target http://localhost:4152; the container itself only needs that local URL.
(2) Host-local healthcheck. Returns {"status":"ok","service":"dataland-atlas"}. This is the same GET /health the container healthcheck hits, and it is open (not auth-gated).
(3) Prometheus scrape endpoint (DAT-82), served through a pure-ASGI middleware so streaming responses stay unbuffered. Also open, no auth.

Local development

# backend
uv sync --extra dev              # (1)!
Copy-Item .env.example .env
uv run python main.py            # (2)!

# frontend (Vite dev server, proxies /api -> 127.0.0.1:4152)
cd web
npm install
npm run dev                      # (3)!

(1) Installs the dev extra (ruff, pytest) alongside runtime deps. The same .env powers both the service and every script.
(2) Boots a single uvicorn worker in --factory mode (app.main:create_app) on http://localhost:4152. In dev with empty RAG_BASE_URL/GCS_PUBLIC_BUCKET it runs standalone (no ingest, local-only images).
(3) Vite dev server on http://localhost:5173, proxying /api to 127.0.0.1:4152. Note that the http://localhost:5173 origin must be in CORS_ORIGINS, or mutating requests are rejected by the DAT-190 CSRF gate.

Production build serves the SPA from FastAPI:

cd web; npm install; npm run build; cd ..   # (1)!
uv run python main.py            # (2)!

(1) Builds the SPA into web/dist, which FastAPI then serves directly (no separate Vite server). Deep-link routes (/projects/{slug}, /museum, etc.) fall back to index.html so client routing survives a refresh.
(2) Same single uvicorn entrypoint as dev, but now serving the prebuilt SPA from web/dist instead of proxying to Vite.

Docker:

Copy-Item .env.example .env
docker compose up -d --build     # (1)!

(1) --build rebuilds the multi-stage image (dataland-atlas/Dockerfile) so the prebuilt SPA and Python deps are baked in. The /app/data bind-mount persists the catalog across rebuilds.

Checks before a PR:

uv run --extra dev ruff check scripts app tests main.py   # (1)!
uv run --extra dev pytest
cd web; npm run build            # (2)!

(1) Lint gate over the full source set (scripts, app, tests, main.py). The --extra dev pulls in ruff/pytest without polluting the runtime env.
(2) The frontend build must pass too: a broken web/dist ships a blank SPA even when the backend is green, so this catches it before a PR.

The design standard lives in web/src/design-system/ (dark grid, square panels, mono labels, thin borders, cyan accents) and is also published as the shared @dataland-ai/design-system package for other Dataland frontends.

Deployment notes

Keep the app behind Cloudflare Tunnel/Access before exposing it publicly.
Mount /app/data as a persistent volume; SQLite is the standalone source of truth. If write volume outgrows this editor workload, move the same ORM model to Postgres rather than switching SQLite to async.
Increase MAX_UPLOAD_SIZE_MB for large assets (uploads above the cap return 413).
See sibling pages: RAG (the ingest target), Agent (the catalog consumer), observability (/metrics scrape), and deploy.
