Atlas (Information WebUI): the Catalog Studio
Now deployed as Atlas on Google Cloud (2026-06)
This service has migrated to Cloud Run as the atlas service
(us-west1 / project dataland-agent), served at
https://atlas.dataland.chat. What changed from the on-prem
description below:
- Catalog → Cloud SQL (Postgres 16) via DATABASE_URL (the local SQLite file remains the dev/test default when DATABASE_URL is empty).
- Images → GCS is the source of truth; the local data/ dir is ephemeral on Cloud Run and image bytes (incl. thumbnails) are hydrated from GCS on demand. Public dataland-public, private dataland-atlas-prod.
- Secrets → Secret Manager + Workload Identity (no gcp-key.json). Auth env vars renamed INFORMATION_WEBUI_* → ATLAS_*.
- Ingress → Cloudflare DNS-only CNAME → Cloud Run managed TLS (not the Spark Cloudflare Tunnel / data.dataland.chat).
- Infra as code → dataland-ai/dataland-gcp-terraform.
The sections below still describe the original on-prem (Spark/compose) deployment, kept for reference; the cloud facts above are authoritative for production.
dataland-atlas is the non-developer CMS for everything the agent and RAG layer read about the institution. It is a FastAPI backend + a plain Vite/TypeScript single-page frontend ("Dataland Catalog Studio") served from the same container. Curators create entities, upload images, and write captions/descriptions; the service persists to a SQLite catalog, uploads bytes to GCS, generates WebP thumbnails locally, and fires a live ingest into RAG so Qdrant stays in lockstep on every save.
It exposes two workspaces over the same storage and live-sync machinery:
- Projects — the Refik Anadol Studio artwork catalog (the original surface).
- Museum — the physical-venue catalog: a singleton museum overview plus museum sections (galleries, orientation room, FAQ, tickets, membership, etc.) and their ordered scenes. This is what the in-museum agent answers from.
| Container | dataland-atlas |
| Image | dataland/information-webui:${IMAGE_TAG} (multi-stage build from dataland-atlas/Dockerfile) |
| Public host | data.dataland.chat (via Cloudflare Tunnel → http://localhost:4152) |
| Public port | 4152 |
| Internal URL | http://dataland-atlas:4152 |
| Healthcheck | GET /health → {"status":"ok","service":"dataland-atlas"} |
| Metrics | GET /metrics (Prometheus text; DAT-82) |
| API docs | GET /docs (Swagger UI), GET /openapi.json |
| Runtime | single uvicorn worker, --factory mode (app.main:create_app), non-root uid/gid 1000 |
Recent changes (DAT- references)
- Museum workspace CMS shipped: sections + scenes + images, section & scene image uploads, scene thumbnails, flat image URLs, gallery-image RAG ingest.
- The earlier "gallery" entity was renamed to section across DB + API + UI, and the separate hero image was dropped; header art is now derived from the section's own first image.
- All 20 museum sections + 3 scenes + the museum overview were (re-)ingested into Qdrant via the RAG live-sync helpers (knowledge 4839 → 4969 points), confirmed retrievable through rag-v2 /search (e.g. "Biome Lumina").
- DAT-269 standardized the model to gemini-3.5-flash everywhere, which is what backs this service's /ingest/image captioning and /search on rag-v2.
- DAT-281: the agent now speaks real section names (Data Pavilion=GA, Latent Gallery=GB, Infinity Room=GC, The Sanctuary=GD, Discovery Portal=ON); those names come from this catalog.
- DAT-288: default_reference_* placeholder images were purged (museum catalog + GCS, Qdrant-checked).
- DAT-82: /metrics exposed via a pure-ASGI middleware so streaming responses stay unbuffered.
Architecture
graph LR
subgraph webui[dataland-atlas]
FE["Vite SPA<br/>(web/dist)"]
API["FastAPI /api"]
PS["ProjectStore"]
MS["MuseumStore"]
RS["rag_sync<br/>(bounded ThreadPoolExecutor)"]
end
FE --> API
API --> PS
API --> MS
PS --> SQLite[("catalog.sqlite3")]
MS --> SQLite
PS -. upload bytes .-> GCS[("GCS public/private")]
MS -. upload bytes .-> GCS
PS --> RS
MS --> RS
RS -. "/ingest/file (knowledge)" .-> RAG[dataland-rag]
RS -. "/ingest/image (images)" .-> RAG
RS -. "DELETE /ingest/by-project-slug" .-> RAG
The two stores (ProjectStore for projects, MuseumStore for the museum tree) share one SQLAlchemy engine + session factory so a single SQLite connection pool serves both writers. MuseumStore.__init__ only seeds the singleton museum row if it's missing — schema creation and migrations are owned by ProjectStore.
Data model
Projects
| Entity | Fields |
|---|---|
| Project | slug, name, short_description, description, location, exhibition_date, language, categories, tags, plus images[] |
| Image (ImageAsset) | caption, alt_text, description, tags, sort_order, dimensions, multi-tier URLs, enrichment (auto_caption, keywords, source_url, source_page) |
exhibition_date is normalized to YYYY-MM-DD; a year-only value like 2024 becomes 2024-01-01, and project lists are ordered newest-first by exhibition date. Image files are named <project-name>-001.jpg, <project-name>-002.jpg, … under data/projects/<slug>/images/.
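The date normalization and file-naming rules above can be illustrated with a minimal sketch. The function names `normalize_exhibition_date` and `image_filename` are hypothetical and are not taken from the service code; only the behavior (year-only values become January 1, three-digit image counters) comes from the description above.

```python
import re
from datetime import date


def normalize_exhibition_date(raw: str) -> str:
    """Return YYYY-MM-DD; a bare year maps to January 1 of that year."""
    value = raw.strip()
    if re.fullmatch(r"\d{4}", value):
        return f"{value}-01-01"
    # fromisoformat validates the date and raises ValueError if it is malformed.
    return date.fromisoformat(value).isoformat()


def image_filename(project_name: str, index: int, ext: str = "jpg") -> str:
    """Build names like <project-name>-001.jpg (zero-padded to three digits)."""
    return f"{project_name}-{index:03d}.{ext}"


def newest_first(dates):
    """ISO dates sort lexicographically, so a reverse sort gives newest-first."""
    return sorted(dates, reverse=True)
```

Because the normalized form is ISO 8601, a plain string sort is enough for the newest-first ordering; whether the service sorts in SQL or in Python is not stated here.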
Museum (sections → scenes)
Three entity types share the Projects field shape (so the uploader, thumbnailer, and RAG-sync code is reused with only the parent identifier differing):
- Museum: a singleton overview/settings row (the institution itself). Get + update only; never created or deleted via the API. The store seeds the row on first run so GET /api/museum always returns a row. Date field is opening_date (same YYYY-MM-DD semantics as exhibition_date).
- Section: a museum area, such as gallery-a … gallery-d, orientation-room, frequently-asked-questions, tickets-operating-hours, membership, large-nature-model, biome-lumina, etc. Fields: slug, name, location, language, sort_order, categories, tags, short_description, description, plus images[] and a scenes[] summary list.
- Scene: an ordered sub-experience scoped to a section_id (e.g. the scenes under gallery-b). Same field shape as a section, plus images[].
graph TD
M["Museum (singleton 'museum')"] --> S1["Section gallery-a"]
M --> S2["Section gallery-b"]
M --> S3["Section orientation-room"]
S2 --> SC1["Scene …"]
S2 --> SC2["Scene …"]
S1 --> I1[["Section images"]]
SC1 --> I2[["Scene images"]]
Naming history
An earlier gallery entity was renamed to section across the DB, API, and UI, and the separate per-section hero image was removed. The section header art is now derived from the section's first ordered image (cover_image_id in SectionSummary), not a dedicated upload.
Storage layout
SQLite at data/catalog.sqlite3 is the runtime source of truth. Project/agent JSON is generated on demand (the .../catalog routes) and never written to disk in normal flows. Older legacy JSON mirrors may linger but are not read or written by current create/update/delete paths.
data/
catalog.sqlite3
projects/<project-slug>/images/<project-name>-001.jpg
museum/sections/<section-slug>/<section-name>-001.jpg
museum/scenes/<scene-slug>/<scene-name>-001.jpg
thumbnails/<project-slug>/<image-id>-<size>.webp
thumbnails/museum/sections/<section-slug>/<image-id>-<size>.webp
thumbnails/museum/scenes/<scene-slug>/<image-id>-<size>.webp
Thumbnails are WebP (quality=82, method=6), bounded 80…1200 px, regenerated only when stale relative to the source mtime. They are used for list/grid rendering; the full image loads on preview.
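The size bounds and the mtime-based staleness rule can be sketched as follows. Both helper names are hypothetical, and clamping an out-of-range size (rather than rejecting it) is an assumption; the description above only states the 80…1200 px bounds and the regenerate-when-stale rule.

```python
from typing import Optional

MIN_PX, MAX_PX = 80, 1200  # thumbnail size bounds stated above


def clamp_thumbnail_size(requested_px: int) -> int:
    """Clamp a requested size into the 80..1200 px range (assumed behavior)."""
    return max(MIN_PX, min(MAX_PX, requested_px))


def thumbnail_is_stale(source_mtime: float, thumbnail_mtime: Optional[float]) -> bool:
    """A missing thumbnail (None) or one older than its source must be regenerated."""
    return thumbnail_mtime is None or thumbnail_mtime < source_mtime
```

A caller would pass `Path.stat().st_mtime` for both files and only re-encode the WebP when `thumbnail_is_stale` returns true.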
Image record fields
Every image record (project, section, or scene) carries:
| Field | Source | Notes |
|---|---|---|
| width, height | PIL on upload | Lets the gallery pick a thumbnail tier without re-decoding. Read-only (never set from the UI). |
| auto_caption | Gemini caption via rag-v2 /ingest/image | Visual description. Distinct from the curator-written caption. |
| keywords | app/keyword_extractor.py (merge_keywords) | Deduplicated tokens merged from caption + alt_text + description + tags + parent context. Capped at 50, aligned with the rag-v2 keyword surface. |
| source_url, source_page | parsed from the image description | Upstream provenance (app/source_url_parser.py). |
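The keywords row describes a dedupe-and-cap merge. A minimal sketch of that idea follows; it is not the real `merge_keywords` from app/keyword_extractor.py. The function name `merge_keywords_sketch`, the lowercase alphanumeric tokenization, and the first-seen ordering are all assumptions made for illustration.

```python
import re
from typing import Iterable, List

MAX_KEYWORDS = 50  # cap stated in the table above


def merge_keywords_sketch(texts: Iterable[str], limit: int = MAX_KEYWORDS) -> List[str]:
    """Merge tokens from several text fields, keeping first-seen order without duplicates."""
    seen = set()
    merged: List[str] = []
    for text in texts:
        # Assumed tokenization: lowercase runs of letters and digits.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in seen:
                continue
            seen.add(token)
            merged.append(token)
            if len(merged) >= limit:
                return merged
    return merged
```

In use, the caller would pass the caption, alt text, description, tags, and parent context in priority order, so the cap drops the least specific tokens first.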
A PIL decompression-bomb cap of 24,000,000 px rejects oversized declared-dimension images with 413 before they hit disk/GCS (with rollback of the local file and any GCS object already uploaded).
Image URL tiers (bestImageUrl)
The frontend resolver in web/src/api.ts picks cdn_url > public_url > local /api/.../file fallback (passing updated_at as a cache-busting tag).
| Field | Source | Use |
|---|---|---|
| public_url | ${GCS_PUBLIC_BASE_URL}/${GCS_PUBLIC_BUCKET}/<object> | Long-term public address. |
| cdn_url | ${WEBUI_CDN_BASE_URL}/<object> (when set) | CDN in front of the public bucket; preferred when present. |
| gs_uri | gs://<bucket>/<object> | Internal references (Qdrant payload, scripts). |
| thumbnail_url | generated WebP under data/thumbnails/... | Fast list rendering. |
| original_file_url | /api/.../images/<id>/file | Authenticated download served by the webui itself. |
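The tier preference (cdn_url, then public_url, then the local file route) can be sketched in Python for illustration. The real resolver is TypeScript in web/src/api.ts and is not reproduced here; the function name, the `v` query parameter, and applying the cache-busting tag only to the local fallback are assumptions.

```python
from typing import Optional
from urllib.parse import quote


def best_image_url(
    cdn_url: Optional[str],
    public_url: Optional[str],
    local_file_url: str,
    updated_at: str,
) -> str:
    """Prefer the CDN, then the public bucket, then the authenticated local route."""
    if cdn_url:
        return cdn_url
    if public_url:
        return public_url
    # Assumed: updated_at is appended as a cache-busting query parameter named "v".
    return f"{local_file_url}?v={quote(updated_at, safe='')}"
```

Empty strings are treated like missing values, which matches the case where WEBUI_CDN_BASE_URL is unset.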
API
All /api/* routes are auth-gated (see Auth posture). /health, /metrics, /docs, /openapi.json, and /api/auth/* are open.
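As a rough illustration of that gate, the path check could look like the sketch below. The function name is hypothetical, and treating every non-/api path (such as the SPA deep links) as ungated is an assumption; only the open paths listed above come from this page.

```python
OPEN_PATHS = {"/health", "/metrics", "/docs", "/openapi.json"}


def requires_auth(path: str) -> bool:
    """Return True when a request path falls under the auth-gated /api/* surface."""
    if path in OPEN_PATHS or path.startswith("/api/auth/"):
        return False
    # Assumed: anything outside /api/ (static SPA assets, deep links) is not gated here.
    return path.startswith("/api/")
```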
Auth & system
GET /health
GET /metrics
GET /docs · GET /openapi.json
GET /api/auth/session
POST /api/auth/login
POST /api/auth/logout
Projects
GET /api/projects
GET /api/projects/search?q=&limit=
POST /api/projects
GET /api/projects/{project_id_or_slug}
PUT /api/projects/{project_id_or_slug}
DELETE /api/projects/{project_id_or_slug}
POST /api/projects/{id_or_slug}/images
PUT /api/projects/{id_or_slug}/images/{image_id}
DELETE /api/projects/{id_or_slug}/images/{image_id}
GET /api/projects/{id_or_slug}/images/{image_id}/file
GET /api/projects/{id_or_slug}/images/{image_id}/thumbnail?size=
GET /api/projects/{id_or_slug}/catalog # (1)!
GET /api/catalog # (2)!
1. Emits the per-project schema-versioned RAG/agent JSON (dataland.rag.*.catalog.v1), generated on demand. This is what RAG and the agent consume; it is never written to disk in normal flows.
2. The global catalog across all projects. Generated on demand from SQLite, the runtime source of truth.
Museum workspace
GET /api/museum # (1)!
PUT /api/museum
GET /api/museum/catalog # (2)!
GET /api/museum/sections
POST /api/museum/sections
GET /api/museum/sections/{identifier}
PUT /api/museum/sections/{identifier}
DELETE /api/museum/sections/{identifier} # (3)!
GET /api/museum/sections/{identifier}/catalog
POST /api/museum/sections/{identifier}/images
PUT /api/museum/sections/{identifier}/images/{image_id}
DELETE /api/museum/sections/{identifier}/images/{image_id}
GET /api/museum/sections/{identifier}/images/{image_id}/file
GET /api/museum/sections/{identifier}/images/{image_id}/thumbnail?size=
GET /api/museum/sections/{section_identifier}/scenes
POST /api/museum/sections/{section_identifier}/scenes
GET /api/museum/scenes/{identifier}
PUT /api/museum/scenes/{identifier}
DELETE /api/museum/scenes/{identifier} # (4)!
GET /api/museum/scenes/{identifier}/catalog
POST /api/museum/scenes/{identifier}/images
PUT /api/museum/scenes/{identifier}/images/{image_id}
DELETE /api/museum/scenes/{identifier}/images/{image_id}
GET /api/museum/scenes/{identifier}/images/{image_id}/file
GET /api/museum/scenes/{identifier}/images/{image_id}/thumbnail?size=
1. The singleton museum overview row. Get + update only; never created or deleted via the API. The store seeds it on first run, so this route always returns a row.
2. The full museum tree as RAG/agent JSON (dataland.rag.museum.catalog.v1): sections, scenes, bucket layout, RAG target collections, and inlined image entries.
3. Deleting a section cascades to its scenes plus the matching GCS object prefix and the RAG points across both collections, not just the section row.
4. Deleting a scene cascades to its GCS object prefix and RAG points, via schedule_cascade_delete_for_scene.
Identifiers
Every section/scene route accepts either the slug or the id (WHERE slug = :id OR id = :id). Slugs are generated from the name via slugify() and made unique against both the DB and the on-disk directory.
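Slug generation with a uniqueness check against both the database and the on-disk directories might look like the sketch below. It is not the service's `slugify()`; the ASCII folding, the `untitled` fallback, and the numeric `-2`, `-3` suffix scheme are assumptions.

```python
import re
import unicodedata
from typing import Iterable


def slugify_sketch(name: str) -> str:
    """Lowercase, fold accents to ASCII, and join alphanumeric runs with hyphens."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")
    return slug or "untitled"  # assumed fallback for names with no usable characters


def unique_slug(name: str, taken_in_db: Iterable[str], taken_on_disk: Iterable[str]) -> str:
    """Pick the first candidate that is free in both the DB and the data directory."""
    base = slugify_sketch(name)
    taken = set(taken_in_db) | set(taken_on_disk)
    candidate, suffix = base, 2
    while candidate in taken:
        candidate = f"{base}-{suffix}"
        suffix += 1
    return candidate
```

Checking the directory as well as the table avoids reusing a slug whose image folder was left behind by an earlier entity.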
The .../catalog endpoints emit schema-versioned RAG/agent JSON (dataland.rag.museum.catalog.v1, dataland.rag.section.catalog.v1, dataland.rag.scene.catalog.v1) with the full tree, bucket layout, RAG target collections, and inlined image entries. The frontend deep-link routes (/projects/{slug}, /museum, /museum/{section}, /museum/{section}/{scene}) return the SPA index.html so client routing works on refresh.
RAG live-sync
Every project, section, scene, and museum-overview create / update / delete triggers a fire-and-forget ingest against rag-v2 so Qdrant tracks the catalog:
- text → POST /ingest/file (the knowledge collection),
- images → POST /ingest/image (the images collection, which captions via Gemini and stores multimodal vectors + denormalized parent context as payload).
Replace-by-slug semantics. Each sync first calls DELETE /ingest/by-project-slug/<slug> to wipe the entity's stale points across both collections, then re-ingests. Combined with deterministic UUIDv5 point ids on the RAG side, this gives clean upserts with no duplicate accumulation across many edits.
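To see why deterministic UUIDv5 ids prevent duplicates, consider this small sketch. The namespace and the `<slug>:<chunk index>` key format are hypothetical; the actual id derivation lives on the RAG side and is not documented on this page.

```python
import uuid

# Hypothetical namespace; the real one is defined by the RAG service.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/rag-points")


def point_id(project_slug: str, chunk_index: int) -> str:
    """Same slug + chunk index always yields the same id, so a re-ingest overwrites."""
    return str(uuid.uuid5(POINT_NAMESPACE, f"{project_slug}:{chunk_index}"))
```

Because the id is a pure function of its inputs, re-ingesting an unchanged chunk upserts onto the existing point, while the preceding delete-by-slug removes chunks that no longer exist after an edit.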
Entities are rendered to markdown (render_project_markdown / render_museum_markdown / render_section_markdown / render_scene_markdown) so the RAG chunker, which splits on headings, produces one chunk per logical block. Museum entities carry a namespaced RAG slug and an entity_type in the payload so the agent can scope retrieval, and so a museum section named archive-dreaming can never collide with a Refik project of the same slug:
| Entity | RAG slug (project_slug grouping key) | entity_type |
|---|---|---|
| Museum overview | museum | museum |
| Section | museum-section-<slug> | section (its images: section_image) |
| Scene | museum-scene-<slug> | scene (its images: scene_image) |
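The namespacing in the table reduces to a small mapping, sketched here. The function name is hypothetical, and the assumption that a project keeps its plain slug as the RAG slug is inferred from the DELETE /ingest/by-project-slug route rather than stated outright.

```python
def rag_slug(entity_type: str, slug: str = "") -> str:
    """Map a catalog entity to its RAG grouping key, per the table above."""
    if entity_type == "museum":
        return "museum"
    if entity_type in ("section", "section_image"):
        return f"museum-section-{slug}"
    if entity_type in ("scene", "scene_image"):
        return f"museum-scene-{slug}"
    # Assumed: projects use their own slug unchanged.
    return slug
```

With this mapping a section and a project that share the slug archive-dreaming end up under different grouping keys, so a replace-by-slug on one never deletes the other's points.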
MuseumStore fires schedule_ingest_for_section / _scene and the museum overview on each persist (same contract ProjectStore uses with schedule_ingest_for_project). Section/scene deletes go through schedule_cascade_delete_for_section/_scene, which also delete the GCS object prefix.
Sync runtime (DAT-195)
Jobs run on a bounded ThreadPoolExecutor (replacing per-call daemon threads), so a bulk operation queues instead of forking N threads. Each step runs through _with_retry: transient failures (URLError, TimeoutError, ConnectionError, HTTP 5xx/429) retry with exponential backoff + jitter, giving up after RAG_SYNC_MAX_RETRIES. A RAG or GCS hiccup logs a warning and returns — it never propagates back into the user-facing save. The FastAPI lifespan shutdown hook drains the executor (shutdown_executor(wait=True)) so in-flight syncs finish.
| Env var | Default | Purpose |
|---|---|---|
| RAG_BASE_URL | "" | e.g. http://dataland-rag:4143. Empty disables the hook. |
| RAG_API_KEY | "" | X-API-Key value for /ingest/file and /ingest/image. |
| RAG_REQUEST_TIMEOUT_SECONDS | 60 | Per-call timeout. |
| RAG_SYNC_MAX_WORKERS | 4 | Bounded executor concurrency. |
| RAG_SYNC_MAX_RETRIES | 3 | Retry attempts per step. |
| RAG_SYNC_RETRY_BASE_S / RAG_SYNC_RETRY_MAX_S | 1.0 / 8.0 | Backoff bounds. |
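The retry behavior can be sketched as below, using the table's defaults. This is not the service's `_with_retry`: the function names, the jitter range, and the reading of RAG_SYNC_MAX_RETRIES as "retries after the first attempt" are assumptions. The sketch re-raises after the last attempt, on the assumption that the caller is what logs the warning and returns.

```python
import random
import time
from urllib.error import HTTPError, URLError


def is_transient(exc: BaseException) -> bool:
    """Transient: HTTP 5xx/429, or URLError, TimeoutError, ConnectionError."""
    if isinstance(exc, HTTPError):  # checked first because HTTPError subclasses URLError
        return exc.code == 429 or 500 <= exc.code < 600
    return isinstance(exc, (URLError, TimeoutError, ConnectionError))


def with_retry_sketch(step, max_retries=3, base_s=1.0, max_s=8.0, sleep=time.sleep):
    """Run step(); retry transient failures with capped exponential backoff plus jitter."""
    attempt = 0
    while True:
        try:
            return step()
        except Exception as exc:
            if not is_transient(exc) or attempt >= max_retries:
                raise
            delay = min(max_s, base_s * (2 ** attempt))  # 1 s, 2 s, 4 s, capped at 8 s
            sleep(delay * random.uniform(0.5, 1.0))      # assumed jitter: 50-100% of delay
            attempt += 1
```

Injecting `sleep` keeps the helper testable without real waiting; in a worker thread the default `time.sleep` blocks only that executor slot.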
Standalone mode
When RAG_BASE_URL (or RAG_API_KEY) is empty, the webui runs standalone and skips all ingest. The manual backfill for projects is scripts/ingest_to_rag.py. To bulk re-ingest the museum (all sections + scenes + overview) drive the blocking helpers in-process:
from app.config import get_settings
from app.storage import ProjectStore
from app.museum_storage import MuseumStore
from app.rag_sync import (
ingest_museum_blocking, # (1)!
ingest_section_blocking,
ingest_scene_blocking,
)
s = get_settings()
ps = ProjectStore(s)
store = MuseumStore(s, engine=ps.engine, session_factory=ps.session_factory) # (2)!
ingest_museum_blocking(s, store.get_museum())
for summ in store.list_sections():
    section = store.get_section(summ.slug)
    ingest_section_blocking(s, section, museum_dir=store.museum_dir)
    for sc in store.list_scenes(section.slug):
        ingest_scene_blocking(s, store.get_scene(sc.slug), museum_dir=store.museum_dir)
1. The *_blocking variants run synchronously, unlike the fire-and-forget schedule_* hooks the live service uses. Use them for a manual backfill where you want to wait for completion and see failures surface.
2. MuseumStore reuses ProjectStore's engine + session factory so both writers share one SQLite connection pool. Schema creation and migrations are owned by ProjectStore; MuseumStore only seeds the singleton museum row if missing.
GCS uploads
Image uploads land in the public bucket so the mobile app and rag-v2 can fetch them directly. Museum content is namespaced under artworks/museum/... so it can't collide with a project that shares a slug.
| GCS path | Source |
|---|---|
| artworks/<project-slug>/<file> | project images |
| artworks/museum/sections/<section-slug>/<file> | section images |
| artworks/museum/scenes/<scene-slug>/<file> | scene images |
| Env var | Default | Purpose |
|---|---|---|
| GCP_PROJECT_ID | dataland-ai | Used by google-cloud-storage. |
| GCS_PUBLIC_BASE_URL | https://storage.googleapis.com | Base prefix for public_url. |
| GCS_PUBLIC_BUCKET | "" (empty disables uploads) | Production: dataland-public. |
| GCS_PRIVATE_BUCKET | dataland-private | Reserved for scene metadata, documents, external articles. |
| GCS_ARTWORKS_PREFIX | artworks/ | Subprefix inside the public bucket (matches the rag-v2 scanner default). |
| GCS_SCENES_PREFIX | museum/scenes/ | Private-bucket scene-metadata subprefix (reflected in museum_catalog.bucket_layout). |
| GCS_CATALOGS_PREFIX | catalogs/ | Catalog-mirror subprefix. |
| WEBUI_CDN_BASE_URL | "" | Optional CDN in front of the public bucket. |
| GOOGLE_APPLICATION_CREDENTIALS | /app/gcp-key.json | Service-account JSON, bind-mounted read-only (DAT-191). |
GCS degrades silently
When GCS_PUBLIC_BUCKET is empty, the webui keeps images on the local volume only and never reaches GCS — and with no bucket, RAG image ingest is skipped (a rag_sync_*_images_skipped_no_gcs_bucket log line). Without the GCP key mount, GCS upload degrades to None and the only signal is a "GCS credentials missing" warning. Text ingest into knowledge still works without GCS.
Auth posture
Login is a single shared password (INFORMATION_WEBUI_PASSWORD) exchanged for an HMAC-SHA256-signed session cookie (dataland_information_session, SameSite=Lax, HttpOnly, default 12h TTL). The signing key is INFORMATION_WEBUI_SESSION_SECRET (falls back to the password if unset).
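An HMAC-SHA256-signed session value with a TTL can be sketched as follows. The token layout (`<payload>.<signature>`, both URL-safe base64, with the issue time as payload) is hypothetical; the actual cookie format of dataland_information_session is not described on this page.

```python
import base64
import hashlib
import hmac


def sign_session(secret: str, issued_at: int) -> str:
    """Return '<payload>.<signature>', both URL-safe base64 (hypothetical layout)."""
    payload = str(issued_at).encode()
    signature = hmac.new(secret.encode(), payload, hashlib.sha256).digest()
    return ".".join(base64.urlsafe_b64encode(part).decode() for part in (payload, signature))


def verify_session(secret: str, token: str, now: int, max_age_s: int = 43200) -> bool:
    """Check the signature in constant time, then the TTL (12 h by default)."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
        issued_at = int(payload.decode())
    except ValueError:  # wrong part count, bad base64, or non-numeric payload
        return False
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    return 0 <= now - issued_at <= max_age_s
```

The sketch also shows why the password fallback for the signing key is worth avoiding in production: anyone who knows the shared password could then mint valid session values.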
- DAT-192: per-IP sliding-window rate limit on POST /api/auth/login: 5 failures inside a 600s window trigger a 300s lockout (429 with Retry-After). A correct password clears the bucket. The 503 "password not configured" branch does not consume capacity.
- DAT-190: strict-Origin CSRF gate on mutating cookie-authenticated requests: every non-safe method must carry an Origin/Referer whose origin is in CORS_ORIGINS, else 403. Safe methods (GET/HEAD/OPTIONS) are exempt.
- Server-to-server callers set INFORMATION_WEBUI_API_TOKEN and pass it as Authorization: Bearer <token> or X-Information-WebUI-Token: <token>. Token auth bypasses both the cookie and the CSRF check (it can't carry a cross-site cookie).
- When no password is configured and AUTH_ENABLED=false, the service is intentionally open (local dev / tests).
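The DAT-192 limits (5 failures in 600 s, 300 s lockout, cleared on success) can be modeled with a small in-memory sketch. The class and method names are hypothetical, and clearing the failure window when a lockout starts is an assumption about behavior the page does not spell out.

```python
import math
from collections import defaultdict, deque


class LoginRateLimiter:
    """Hypothetical per-IP limiter: 5 failures in 600 s trigger a 300 s lockout."""

    def __init__(self, max_failures: int = 5, window_s: float = 600.0, lockout_s: float = 300.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.lockout_s = lockout_s
        self._failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self._locked_until = {}              # ip -> time the lockout ends

    def retry_after(self, ip: str, now: float) -> int:
        """Seconds left in the lockout; 0 means the attempt may proceed."""
        return max(0, math.ceil(self._locked_until.get(ip, 0.0) - now))

    def record_failure(self, ip: str, now: float) -> None:
        window = self._failures[ip]
        window.append(now)
        # Slide the window: drop failures older than window_s.
        while window and window[0] <= now - self.window_s:
            window.popleft()
        if len(window) >= self.max_failures:
            self._locked_until[ip] = now + self.lockout_s
            window.clear()  # assumed: the count restarts once the lockout begins

    def record_success(self, ip: str) -> None:
        # A correct password clears the bucket.
        self._failures.pop(ip, None)
        self._locked_until.pop(ip, None)
```

A login handler would call `retry_after` first and answer 429 with that value in Retry-After when it is non-zero, before comparing the password at all.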
Operator name remap
The infra deploy env exposes a few variables under an operator-friendly prefix that compose remaps to the container names: INFORMATION_WEBUI_AUTH_ENABLED → AUTH_ENABLED, INFORMATION_WEBUI_COOKIE_SECURE → AUTH_COOKIE_SECURE, INFORMATION_WEBUI_CORS_ORIGINS → CORS_ORIGINS, INFORMATION_WEBUI_SESSION_MAX_AGE_SECONDS → AUTH_SESSION_MAX_AGE_SECONDS. See the env inventory (docs/env-inventory.md).
Key env vars
APP_ENV=production
APP_HOST=0.0.0.0
APP_PORT=4152
LOG_LEVEL=info
STORAGE_DIR=/app/data
MAX_UPLOAD_SIZE_MB=50 # (1)!
CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:5173,http://localhost:4152 # (2)!
AUTH_ENABLED=true
INFORMATION_WEBUI_PASSWORD=*** # (3)!
INFORMATION_WEBUI_SESSION_SECRET=*** # (4)!
INFORMATION_WEBUI_API_TOKEN= # (5)!
AUTH_SESSION_MAX_AGE_SECONDS=43200 # (6)!
AUTH_COOKIE_SECURE=false # (7)!
GCP_PROJECT_ID=dataland-ai
GCS_PUBLIC_BASE_URL=https://storage.googleapis.com
GCS_PUBLIC_BUCKET=dataland-public # (8)!
GCS_PRIVATE_BUCKET=dataland-private
GCS_ARTWORKS_PREFIX=artworks/
GCS_SCENES_PREFIX=museum/scenes/
GCS_CATALOGS_PREFIX=catalogs/
GOOGLE_APPLICATION_CREDENTIALS=/app/gcp-key.json
GCP_KEY_FILE=./secrets/gcp-key.json # (9)!
RAG_BASE_URL=http://dataland-rag:4143 # (10)!
RAG_API_KEY=***
RAG_REQUEST_TIMEOUT_SECONDS=60
INFORMATION_WEBUI_PUBLIC_PORT=4152
INFORMATION_WEBUI_DATA_DIR=./data # (11)!
1. Upload size cap. Uploads above this return 413; raise it for large assets. Note this is separate from the PIL decompression-bomb cap of 24,000,000 px that rejects oversized declared-dimension images.
2. Allowed origins for CORS and the DAT-190 strict-Origin CSRF gate: every mutating cookie-authenticated request must carry an Origin/Referer whose origin is in this list, else 403.
3. The single shared login password. When empty and AUTH_ENABLED=false, the service is intentionally open (local dev / tests only).
4. HMAC-SHA256 signing key for the session cookie. Falls back to INFORMATION_WEBUI_PASSWORD if unset, so set it explicitly in production.
5. Server-to-server bearer token. Callers pass it as Authorization: Bearer <token> or X-Information-WebUI-Token: <token>; it bypasses both the cookie and the CSRF check. Empty disables token auth.
6. Session cookie TTL in seconds (43200 = 12h).
7. Set to true behind TLS so the dataland_information_session cookie is sent only over HTTPS. Maps from INFORMATION_WEBUI_COOKIE_SECURE in the deploy env.
8. Production public bucket. Empty disables all GCS uploads: images stay local-only and RAG image ingest is silently skipped (text ingest still works).
9. Host path of the service-account JSON, bind-mounted read-only to /app/gcp-key.json (DAT-191). Without it, GCS upload degrades to None with only a "GCS credentials missing" warning.
10. rag-v2 base URL for the live-sync hook. Empty (or empty RAG_API_KEY) disables the hook and runs the webui standalone with no ingest.
11. Host bind-mount source mapped to /app/data (catalog.sqlite3 + images + thumbnails). Persistent so rebuilds/restarts keep the same catalog.
Volumes
/app/data <- host: ${INFORMATION_WEBUI_DATA_DIR:-./data} # (1)!
/app/gcp-key.json <- host: ${GCP_KEY_FILE:-./secrets/gcp-key.json} # (2)!
1. Persistent bind-mount holding catalog.sqlite3 + images + thumbnails. SQLite is the standalone source of truth, so this must survive rebuilds/restarts. The container runs as uid/gid 1000 (DAT-210) so the mount can be chowned once.
2. Service-account JSON, read-only (chmod 600). DAT-191 bind-mounts it :ro so the running container can never mutate the credential.
/app/data is a persistent bind-mount so rebuilds/restarts keep the same catalog and uploads. The container runs as uid/gid 1000 (DAT-210), matching the rest of the stack so the bind-mount can be chowned once.
Scripts
All scripts read the same .env the service does and are idempotent.
| Script | Purpose |
|---|---|
| build_refik_dataset.py | One-shot Refik Anadol Studio scrape into dataset/. Does not touch live data/. |
| import_refik_dataset.py | Upserts the scraped dataset/ into data/ (projects + image rows by slug). |
| ingest_images.py | Bulk-upload a folder of images for a given project slug. |
| ingest_to_rag.py | Manual backfill equivalent of the live-sync hook for projects. |
| backfill_image_urls.py | Rewrites public_url / cdn_url / gs_uri / thumbnail_url after a bucket rename or CDN cutover. |
| refresh_project_metadata.py | Recomputes derived metadata (keywords, ordering, thumbnails) without re-uploading source images. |
| rewrite_short_descriptions.py | Bulk LLM rewrite of short_description across projects. |
| sync_from_qdrant.py | Read-side cross-check; reconciles Qdrant payloads against the SQLite catalog. |
| sync_project_slugs.py | Migrates project slugs when titles change. |
Reaching it
curl -fsS https://data.dataland.chat/health # (1)!
curl -fsS http://localhost:4152/health # (2)!
curl -fsS http://localhost:4152/metrics # (3)!
1. Public entry point. Cloudflare Tunnel routes data.dataland.chat to the container's local target http://localhost:4152; the container itself only needs that local URL.
2. Host-local healthcheck. Returns {"status":"ok","service":"dataland-atlas"}; this is the same GET /health the container healthcheck hits, and it is open (not auth-gated).
3. Prometheus scrape endpoint (DAT-82), served through a pure-ASGI middleware so streaming responses stay unbuffered. Also open, no auth.
Cloudflare Tunnel routes data.dataland.chat to the container's local target http://localhost:4152; the container itself only needs that local URL.
Local development
# backend
uv sync --extra dev # (1)!
Copy-Item .env.example .env
uv run python main.py # (2)!
# frontend (Vite dev server, proxies /api -> 127.0.0.1:4152)
cd web
npm install
npm run dev # (3)!
1. Installs the dev extra (ruff, pytest) alongside runtime deps. The same .env powers both the service and every script.
2. Boots a single uvicorn worker in --factory mode (app.main:create_app) on http://localhost:4152. In dev with empty RAG_BASE_URL / GCS_PUBLIC_BUCKET it runs standalone (no ingest, local-only images).
3. Vite dev server on http://localhost:5173, proxying /api to 127.0.0.1:4152. Note 5173 must be in CORS_ORIGINS or mutating requests hit the DAT-190 CSRF gate.
Production build serves the SPA from FastAPI:
1. Builds the SPA into web/dist, which FastAPI then serves directly (no separate Vite server). Deep-link routes (/projects/{slug}, /museum, etc.) fall back to index.html so client routing survives a refresh.
2. Same single uvicorn entrypoint as dev, but now serving the prebuilt SPA from web/dist instead of proxying to Vite.
Docker:
--build rebuilds the multi-stage image (dataland-atlas/Dockerfile) so the prebuilt SPA and Python deps are baked in. The /app/data bind-mount persists the catalog across rebuilds.
Checks before a PR:
uv run --extra dev ruff check scripts app tests main.py # (1)!
uv run --extra dev pytest
cd web; npm run build # (2)!
1. Lint gate over the full source set (scripts, app, tests, main.py). The --extra dev pulls in ruff/pytest without polluting the runtime env.
2. The frontend build must pass too: a broken web/dist ships a blank SPA even when the backend is green, so this catches it before a PR.
The design standard lives in web/src/design-system/ (dark grid, square panels, mono labels, thin borders, cyan accents) and is also published as the shared @dataland-ai/design-system package for other Dataland frontends.
Deployment notes
- Keep the app behind Cloudflare Tunnel/Access before exposing it publicly.
- Mount /app/data as a persistent volume; SQLite is the standalone source of truth. If write volume outgrows this editor workload, move the same ORM model to Postgres rather than switching SQLite to async.
- Increase MAX_UPLOAD_SIZE_MB for large assets (uploads above the cap return 413).
- See sibling pages: RAG (the ingest target), Agent (the catalog consumer), observability (/metrics scrape), and deploy.