RAG: Search and Ingest

dataland-rag (the v2 service, repo dataland-rag-v2) is the unified retrieval service for the museum. It owns all ingestion of museum content and all search — text, image, and multimodal — backed by Qdrant and Gemini embeddings. The Agent queries it for search_knowledge and search_artwork_images; the Catalog Studio (information-webui) pushes content into it on every edit.


Container	`dataland-rag`
Image	`dataland/rag:${IMAGE_TAG}` (built from `dataland-rag-v2/Dockerfile`, app version `3.0.0`)
Internal URL	`http://dataland-rag:4143` (the docker bridge `dataland-network`)
Bind	`127.0.0.1:4143` (loopback + SSH tunnel) and `${RAG_PUBLIC_BIND}:4143` (tailnet peer, `spark:4143`) — never `0.0.0.0`
Memory / CPU	`mem_limit: 12g` (`mem_reservation: 4g`) / `cpus: 12.0`
Liveness	`GET /health` (compose healthcheck, anonymous)
Diagnostics	`GET /health/full` (auth-gated, DAT-184)

Recent changes (2026-06-03 → 2026-06-04)

DAT-269 — the generative model standardized on gemini-3.5-flash for image captioning and the reranker's Gemini fallback (GEMINI_MODEL). RAG vectors continue to use the gemini-embedding family (gemini-embedding-2-preview in production). Do not assume gemini-2.5-flash/gemini-3.1-flash-lite — both are retired here.
Museum content re-ingest — the 20 museum sections + their scenes + the museum overview were (re-)pushed into the knowledge collection, taking the point count 4839 → 4969.
Search timeout interplay: the Agent's RAG /search read timeout was raised from 10 s to 25 s. Museum-knowledge queries (query embedding + dense + sparse + rerank) round-trip in ~10 s, so a 10 s client timeout triggered ReadTimeout → 3 retries (~30 s) → a second search → 60 s agent wall-clock → agent_timeout. See "Agent search-timeout interplay" below.

What it does

```mermaid
graph LR
  AG[agent] -.search.-> RAG[dataland-rag]
  WU[information-webui] -.ingest + delete.-> RAG
  RAG --> QD[(Qdrant<br/>knowledge / images / scenes)]
  RAG --> GCS[("GCS<br/>dataland-public + dataland-private")]
  RAG --> GEM["Gemini API<br/>(embedding + 3.5-flash)"]
  RAG --> FE["FastEmbed<br/>(BM25 + Jina reranker, in-process)"]
```

Search (POST /search) — hybrid text retrieval (dense + BM25 sparse, optional text-scroll) fused with RRF, then reranked by a Jina cross-encoder.
Image search (POST /images/search/text, POST /images/search/image) — multimodal similarity against the images collection.
Ingest — POST /ingest/file (knowledge), POST /ingest/image (images, Gemini captioning), POST /ingest/sync (GCS delta scan), and DELETE /ingest/by-project-slug/{slug} (replace-by-slug for the webui).
Image serving — GET /images/{filepath} (public bucket) and GET /images/extracted/{filepath} (private bucket), both auth-gated and path-hardened (DAT-164).
Ops — POST /admin/sparse-backfill (BM25 backfill without a restart, DAT-168).

Qdrant collections

All three collections share the same dense vector config — 3072-dim, cosine (storage/schema.py: VECTOR_SIZE = 3072, Distance.COSINE). They are created idempotently at startup by ensure_collections().

| Collection | What populates it | Embed task type | Sparse vector |
| --- | --- | --- | --- |
| `knowledge` | Document chunks from GCS `documents/`, `blog-posts/`, `external/`; uploaded files via `/ingest/file`; and webui text (projects, museum overview, sections, scenes) | `RETRIEVAL_DOCUMENT` (index) / `RETRIEVAL_QUERY` (search) | `sparse` (BM25, `Modifier.IDF`) |
| `images` | Artwork images from GCS `artworks/`, webui-curated images via `/ingest/image`, and (optionally) document-extracted inline images | multimodal image embedding (no task type) | none |
| `scenes` | Scene JSON under GCS `museum/scenes/` (`embed_for_similarity` over the `description` field) | `SEMANTIC_SIMILARITY` | none |

Production point counts

knowledge ≈ 4969 after today's museum re-ingest (was 4839), images ≈ 1485, scenes small. Live counts come from GET /health/full (collections[].vectors_count).

Knowledge payload

Written by embed_and_upsert_knowledge (ingestion/ingestors/document.py). Each chunk carries: content, title, source (filename), category (documents/blog-posts/external), source_type (extension), source_path (the dedup key), doc_id (groups a document's chunks), chunk_index, token_count, optional section_title / page_number, tags, created_at. Webui-driven ingests add project_slug + entity_type via the /ingest/file metadata form field (see webui live-sync).

Reserved metadata keys

/ingest/file rejects caller-supplied metadata that contains any of the ingestor-owned keys (content, title, source, source_path, source_type, category, doc_id, chunk_index, token_count, section_title, page_number, created_at, tags) with a 400. The webui passes only safe extras like project_slug, project_name, entity_type, location, categories.
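
The reserved-key rule above can be sketched as a small validator. This is only an illustration of the behaviour described here, not the service's actual code: the function name `parse_ingest_metadata` is hypothetical, and a real route would translate the `ValueError` into the 400 response.

```python
import json

# Ingestor-owned keys listed in the paragraph above; callers may not set them.
RESERVED_KEYS = frozenset({
    "content", "title", "source", "source_path", "source_type", "category",
    "doc_id", "chunk_index", "token_count", "section_title", "page_number",
    "created_at", "tags",
})


def parse_ingest_metadata(raw: str) -> dict:
    """Parse the JSON `metadata` form field and reject reserved keys.

    Hypothetical helper: raises ValueError where the route would answer 400.
    """
    metadata = json.loads(raw) if raw else {}
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    clashes = sorted(RESERVED_KEYS.intersection(metadata))
    if clashes:
        raise ValueError("reserved metadata keys: " + ", ".join(clashes))
    return metadata
```

Safe extras such as `project_slug` or `entity_type` pass through unchanged, while a payload that tries to set `title` or `doc_id` is refused before any embedding work starts.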

Image payload

Built by the shared build_image_point helper (ingestion/ingestors/image_point.py, DAT-172) so all three image-ingest paths share 11 base keys: file_name, title, caption, keywords, tags, gs_uri, public_url, collection_name, artwork_id, source_path, ingested_at. The webui path layers ~20 extras on top (image_id, project_slug, width/height, description, alt_text, auto_caption, source_type="webui-image", CDN/source URLs, timestamps, …). Its /ingest/image route also enforces its own reserved-key set (auto_caption, source_type, width, height, size_bytes, title, tags, keywords, artwork_id, collection_name, ingested_at).

The search pipeline

POST /search accepts {query, collection, top_k≤50, filters?, rerank=true, score_threshold?}. The collection must be one of the three known names or it 400s.
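
As a rough client-side illustration of that request shape, the sketch below builds a `/search` body and applies the same two constraints locally. The helper name `build_search_payload` and the lower bound of 1 on `top_k` are assumptions; the server remains the source of truth.

```python
KNOWN_COLLECTIONS = ("knowledge", "images", "scenes")
MAX_TOP_K = 50  # documented cap on top_k


def build_search_payload(query, collection="knowledge", top_k=5,
                         filters=None, rerank=True, score_threshold=None):
    """Build a JSON-serialisable body for POST /search (hypothetical helper)."""
    if collection not in KNOWN_COLLECTIONS:
        # The server answers 400 for unknown collections; fail early instead.
        raise ValueError(f"unknown collection: {collection!r}")
    if not 1 <= top_k <= MAX_TOP_K:  # lower bound of 1 is an assumption
        raise ValueError(f"top_k must be between 1 and {MAX_TOP_K}")
    payload = {"query": query, "collection": collection,
               "top_k": top_k, "rerank": rerank}
    # Optional fields are omitted entirely when unset.
    if filters is not None:
        payload["filters"] = filters
    if score_threshold is not None:
        payload["score_threshold"] = score_threshold
    return payload
```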

```mermaid
flowchart TD
  Q[query] --> E["embed_query<br/>(RETRIEVAL_QUERY, Gemini)"]
  E --> D[dense query_points]
  E --> H["hybrid RRF prefetch<br/>(dense + BM25 sparse)"]
  Q --> T["text-scroll<br/>(MatchPhrase + MatchText)"]
  D --> M[merge_candidates]
  H --> M
  T --> M
  M --> TH["score-threshold filter<br/>+ blend weights"]
  TH --> R{rerank?}
  R -- yes --> RR["Jina cross-encoder<br/>(FastEmbed ONNX)"]
  R -- no --> TK[top_k slice]
  RR --> OUT[results]
  TK --> OUT
```

Three retrieval channels merged with weighted RRF (retrieval/searcher.py):

Dense — cosine kNN over the Gemini vector. Always runs, every collection.
Hybrid (dense + BM25 sparse) — a single Qdrant FusionQuery(RRF) with two prefetches. Only fires on the knowledge collection and only when HYBRID_SEARCH_ENABLED=true and the collection actually exposes the sparse vector. On any error it logs and falls back to dense.
Text-scroll — MatchPhrase over (section_title, title, source, content) plus a wider MatchText pass, re-scored in Python with field weights. Only on knowledge and only when TEXT_SEARCH_ENABLED=true (deployed true to surface entity-lookup matches; the code default is false).

The candidates are merged in _merge_candidates: each channel contributes an RRF term of weight / (rank + 10), with weights hybrid 1.2, dense 0.8, and text 1.5, plus a weighted absolute-score term:

```text
final = SCORE_WEIGHT_HYBRID*hybrid + SCORE_WEIGHT_DENSE*dense + SCORE_WEIGHT_TEXT*text  # (1)!
      = 0.65*hybrid + 0.45*dense + 0.18*text   (deployed)  # (2)!
```

1. The absolute-score blend that rides on top of the RRF rank terms. Each channel's raw similarity is scaled by its weight, so a channel can be down-weighted without dropping it from candidate generation.
2. Deployed weights: SCORE_WEIGHT_HYBRID=0.65, SCORE_WEIGHT_DENSE=0.45, SCORE_WEIGHT_TEXT=0.18. Hybrid leads because it already fuses dense + BM25; text is deliberately the weakest contributor since text-scroll exists mainly to surface exact entity-name hits, not to rank semantic relevance.

A candidate survives the threshold if max(dense, hybrid) ≥ score_threshold or its text score clears the internal text floor (2.0). RERANK_CANDIDATES (deployed 45, was 20) bounds how deep each channel reaches before reranking.
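
To make the scoring and the survival rule concrete, here is a minimal sketch under stated assumptions: ranks are 1-based, the RRF term and the absolute-score term are simply summed per channel, and "clears the floor" means greater than or equal. The `Candidate` class and function names are hypothetical; `_merge_candidates` in retrieval/searcher.py may differ in detail.

```python
from dataclasses import dataclass, field

RRF_K = 10                                                      # "rank + 10"
RRF_WEIGHT = {"hybrid": 1.2, "dense": 0.8, "text": 1.5}
SCORE_WEIGHT = {"hybrid": 0.65, "dense": 0.45, "text": 0.18}    # deployed blend
TEXT_FLOOR = 2.0                                                # internal text floor


@dataclass
class Candidate:
    point_id: str
    # channel name -> (1-based rank, raw score) for each channel that returned it
    hits: dict = field(default_factory=dict)


def blended_score(c: Candidate) -> float:
    """Sum the weighted RRF term and weighted raw score over all channels."""
    total = 0.0
    for channel, (rank, score) in c.hits.items():
        total += RRF_WEIGHT[channel] / (rank + RRF_K)
        total += SCORE_WEIGHT[channel] * score
    return total


def survives(c: Candidate, score_threshold: float) -> bool:
    """max(dense, hybrid) >= threshold, or the text score clears the floor."""
    semantic = max((c.hits[ch][1] for ch in ("dense", "hybrid") if ch in c.hits),
                   default=0.0)
    text = c.hits["text"][1] if "text" in c.hits else 0.0
    return semantic >= score_threshold or text >= TEXT_FLOOR
```

Note how a candidate found only by text-scroll can still survive: its text score is compared against the 2.0 floor rather than against `score_threshold`.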

Reranking: Jina cross-encoder, Gemini fallback

rerank=true (default) reorders the surviving candidates with a FastEmbed Jina cross-encoder running ONNX in-process — jinaai/jina-reranker-v2-base-multilingual (~1.1 GB ONNX, multilingual, ~80–120 ms warm). This is the primary path (RERANKER_BACKEND=fastembed).

Gemini Flash is only the fallback

retrieval/reranker.py carries a Gemini-Flash scoring path (_score_batch_gemini, using GEMINI_MODEL = gemini-3.5-flash). It runs only when the FastEmbed cross-encoder raises. The module docstring describing "Stage 2: Gemini Flash" is stale — under normal operation reranking never calls Gemini. If both fail, the original RRF order is returned unchanged.

A single result short-circuits with rerank_score = 10.0; the rerank toggle is ignored unless there are at least two candidates.
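
The fallback chain described in this section can be sketched with two injected scorers. This is an illustration only: the function name, the `content` key, and the scorer signature `(query, texts) -> list of floats` are assumptions rather than the API of retrieval/reranker.py.

```python
def rerank_with_fallback(query, candidates, primary, fallback, enabled=True):
    """Reorder candidates (dicts in RRF order) by cross-encoder score.

    primary / fallback: callables (query, texts) -> list[float]; hypothetical
    stand-ins for the FastEmbed and Gemini scoring paths.
    """
    if len(candidates) == 1:
        # Single result short-circuits, regardless of the rerank toggle.
        return [{**candidates[0], "rerank_score": 10.0}]
    if not enabled or not candidates:
        return list(candidates)
    texts = [c["content"] for c in candidates]
    for scorer in (primary, fallback):
        try:
            scores = scorer(query, texts)
        except Exception:
            continue  # try the next scorer; the fallback runs only on failure
        scored = [{**c, "rerank_score": s} for c, s in zip(candidates, scores)]
        return sorted(scored, key=lambda c: c["rerank_score"], reverse=True)
    # Both scorers failed: keep the original RRF order unchanged.
    return list(candidates)
```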

Image search thresholds

Image search (searcher.search_by_image / search_images_by_text) is a plain cosine kNN against images with no rerank and no hybrid. It applies a separate, higher floor, IMAGE_SEARCH_SCORE_THRESHOLD=0.45 (DAT-262), because the multimodal channel produces lower absolute cosines (observed top-1 0.48–0.51 for real artwork matches) than text knowledge. Reusing the 0.30 text threshold (SEARCH_SCORE_THRESHOLD) let every weakly matching selfie return an "artwork" hit. The agent's vision path (/images/search/text) consumes this.
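
A small illustration of why the image channel needs its own floor, using the deployed values SEARCH_SCORE_THRESHOLD=0.30 and IMAGE_SEARCH_SCORE_THRESHOLD=0.45 from the env section. The hit names and the 0.38 score are made up for the example; only the 0.48–0.51 range is an observed figure.

```python
SEARCH_SCORE_THRESHOLD = 0.30        # deployed text floor
IMAGE_SEARCH_SCORE_THRESHOLD = 0.45  # deployed image-only floor (DAT-262)


def filter_hits(hits, threshold):
    """Keep (name, cosine) pairs whose score meets the threshold."""
    return [(name, score) for name, score in hits if score >= threshold]


# Illustrative scores: a real artwork match near the observed top-1 range,
# and a weak selfie match that should not count as an artwork hit.
hits = [("artwork-match", 0.49), ("weak-selfie-match", 0.38)]

kept_with_text_floor = filter_hits(hits, SEARCH_SCORE_THRESHOLD)
kept_with_image_floor = filter_hits(hits, IMAGE_SEARCH_SCORE_THRESHOLD)
```

With the text floor both hits pass; with the image floor only the genuine match remains.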

Ingestion

`/ingest/file` (knowledge)

Multipart upload (file + JSON metadata form field, ≤ 50 MB). The Kreuzberg parser extracts text (VLM OCR via OCR_BACKEND=vlm, capped by MAX_PARSE_TIME_S=120), the chunker splits at CHUNK_SIZE=400 / CHUNK_OVERLAP=60, each chunk is embedded (RETRIEVAL_DOCUMENT), and dense + (if enabled) BM25 sparse vectors are upserted in batches of 100. Supported extensions come from the document ingestor's SUPPORTED_EXTENSIONS (PDF, DOCX, MD, TXT, …).
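
The CHUNK_SIZE / CHUNK_OVERLAP interaction is easiest to see on a plain token list. The sketch below is a fixed-size sliding window and nothing more; it assumes tokens are already available as a list, whereas the real chunker may also respect section or sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=60):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # 340 with the deployed settings
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks
```

For a 1000-token document this yields windows starting at tokens 0, 340, and 680, so each boundary is covered by 60 shared tokens.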

`/ingest/image` (images)

Single image (file + required JSON metadata, ≤ 25 MB). Required metadata: image_id, project_slug, source_path. The endpoint:

Optionally auto-captions the bytes with Gemini (CAPTION_PROMPT + gemini-3.5-flash, auto_caption=true default) for richer keyword extraction.
Embeds the image bytes multimodally (embedder.embed_image).
Sanitizes every user string (DAT-165) and upserts one point keyed by a deterministic UUIDv5 minted from source_path (_INGEST_IMAGE_NAMESPACE), so re-ingestion replaces in place rather than duplicating.
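
The replace-in-place behaviour follows from how UUIDv5 works: the same namespace and name always hash to the same id. A minimal sketch, with a made-up namespace standing in for `_INGEST_IMAGE_NAMESPACE` (its real value is not shown on this page):

```python
import uuid

# Hypothetical namespace for illustration only; NOT the service's real
# _INGEST_IMAGE_NAMESPACE value.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/ingest-image")


def image_point_id(source_path: str) -> str:
    """Deterministic point id: the same source_path always maps to the same UUID."""
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, source_path))
```

Because the id depends only on `source_path`, a second upsert of the same image overwrites the existing point instead of creating a new one.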

Quota + rate guards (DAT-163 / DAT-166)

Two layers protect the Gemini quota on /ingest/image:

Per-key sliding window — IMAGE_INGEST_RATE_PER_MINUTE=30 (default). Over-budget calls return 429 with Retry-After.
Daily Gemini budget — GEMINI_DAILY_BUDGET=5000 calls/UTC-day across all keys (each /ingest/image reserves 2 calls with auto-caption, 1 without). Exhaustion returns 503.

If the auto-caption call itself hits a Gemini 429, the route short-circuits with 503 before embedding (DAT-166) rather than burning a second quota slot or writing a point with an empty caption.
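
The two guards can be sketched as independent checks. This is an in-memory illustration under assumptions: class names, the injectable `clock` / `today` parameters, and the exact window arithmetic are hypothetical, and the real service may persist or share its counters differently.

```python
import time
from collections import deque
from datetime import datetime, timezone


class SlidingWindowLimiter:
    """Per-key sliding window (IMAGE_INGEST_RATE_PER_MINUTE style)."""

    def __init__(self, limit=30, window_s=60.0, clock=time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self._hits = {}  # api key -> deque of accepted-call timestamps

    def check(self, key):
        """Return 0.0 and record the call if allowed, else seconds to wait."""
        now = self.clock()
        hits = self._hits.setdefault(key, deque())
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop calls that have left the window
        if len(hits) >= self.limit:
            return self.window_s - (now - hits[0])  # would feed Retry-After on a 429
        hits.append(now)
        return 0.0


class DailyBudget:
    """Shared Gemini call budget that resets when the UTC date changes."""

    def __init__(self, budget=5000, today=lambda: datetime.now(timezone.utc).date()):
        self.budget, self.today = budget, today
        self._day, self.used = None, 0

    def reserve(self, auto_caption=True):
        """Reserve 2 calls with auto-caption, 1 without; False would map to a 503."""
        day = self.today()
        if day != self._day:
            self._day, self.used = day, 0
        cost = 2 if auto_caption else 1
        if self.used + cost > self.budget:
            return False
        self.used += cost
        return True
```

Checking the rate window first and reserving budget second keeps a throttled caller from consuming shared daily quota.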

`/ingest/sync` (GCS delta)

Scans every configured GCS prefix, computes the delta against what's already indexed (by source_path), and ingests only new blobs. SYNC_ON_STARTUP=false in production — trigger on demand. The same scanner can run at boot if flipped.
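
At its core the delta is a set difference keyed on `source_path`. A minimal sketch, assuming both sides are available as plain iterables of path strings (the function name is hypothetical, and listing GCS or scrolling Qdrant is left out):

```python
def compute_delta(gcs_paths, indexed_source_paths):
    """Return blob paths present in GCS but not yet indexed, in stable order."""
    indexed = set(indexed_source_paths)
    return sorted(p for p in set(gcs_paths) if p not in indexed)
```

Blobs that are already indexed are skipped, so repeated on-demand syncs are cheap when nothing new has been uploaded.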

Webui live-sync

The Catalog Studio keeps Qdrant in lockstep with the CMS. On every save it runs DELETE → file → image per entity so a re-sync never accumulates duplicates (dataland-atlas/app/rag_sync.py):

DELETE /ingest/by-project-slug/{slug} — wipes the entity's footprint across both images and knowledge (filters payload.project_slug == slug).
POST /ingest/file — the rendered Markdown into knowledge.
POST /ingest/image — each attached image into images.

Museum entities use namespaced RAG slugs so a museum section can't collide with a Refik project of the same name:

| Entity | `project_slug` (RAG grouping key) | `entity_type` |
| --- | --- | --- |
| Museum overview | `museum` | `museum` |
| Section | `museum-section-<slug>` | `section` |
| Scene | `museum-scene-<slug>` | `scene` |
| Scene image | `museum-scene-<slug>` | `scene_image` |

This is the path that re-ingested the 20 sections + scenes + overview today, moving knowledge from 4839 → 4969 points.
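
The slug namespacing and the DELETE → file → image ordering can be sketched as a pure planning function. Function names and the example slug are hypothetical; this is not the code in rag_sync.py, only an illustration of the table and sequence above.

```python
def rag_slug(entity_type, slug=None):
    """Map a museum entity to its namespaced project_slug (per the table above)."""
    if entity_type == "museum":
        return "museum"
    if entity_type == "section":
        return f"museum-section-{slug}"
    if entity_type in ("scene", "scene_image"):
        return f"museum-scene-{slug}"  # scene images share the scene's slug
    raise ValueError(f"unknown entity_type: {entity_type!r}")


def plan_sync(entity_type, slug, image_count):
    """Ordered (method, path) calls for one save: delete first, then re-ingest."""
    project_slug = rag_slug(entity_type, slug)
    steps = [
        ("DELETE", f"/ingest/by-project-slug/{project_slug}"),
        ("POST", "/ingest/file"),
    ]
    steps += [("POST", "/ingest/image")] * image_count
    return steps
```

Deleting by slug before re-ingesting is what keeps a repeated save from accumulating duplicate points.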

Why it's the heaviest service

The Jina reranker is ONNX inference. At cpus: 2.0 a single 20-candidate rerank pegged both cores for ~25 s (observed 2026-05-12, CPU at 209 %). The cross-encoder scales well past 8 threads, so cpus: 12.0 covers concurrent reranks with headroom on the 20-core host.
FastEmbed keeps both models resident: the reranker (~1.1 GB ONNX + ~2 GB ORT runtime) and the BM25 sparse model. Runtime working set ~3.25 GB; mem_limit: 12g leaves room for OS page cache, which matters because the ONNX session memory-maps its weights and re-reads them per call.
Model warm-up (DAT-174) — WARM_MODELS_ON_STARTUP=true eagerly loads the sparse + reranker ONNX sessions at lifespan startup (in parallel) so the first /search after deploy doesn't pay a 1–3 s cold-start tax. The image is also pre-warmed at build time via scripts/prewarm_fastembed.py into FASTEMBED_CACHE_DIR=/opt/fastembed_cache.

Agent search-timeout interplay

The end-to-end museum-knowledge query is intentionally slow: a gemini-embedding query embedding, three retrieval channels, then a cross-encoder rerank over up to 45 candidates round-trip in ~10 s. The Agent's RAG client read timeout is therefore set to 25 s (rag_search_timeout_seconds = 25.0, app/config.py):

A 10s timeout caused ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout on knowledge queries. 25s lets a single search complete on the first attempt.

When tuning RAG latency (deeper RERANK_CANDIDATES, keeping TEXT_SEARCH_ENABLED on, or moving off the warm path), keep this 25 s budget in mind: the agent gives up, and gives up expensively, before RAG does.
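
The arithmetic behind the 60 s figure can be written down as a back-of-the-envelope model. It assumes "3 retries" means three attempts per search, that every attempt of a too-slow search runs to the full read timeout, and that a search whose latency equals the timeout still times out; the function name is hypothetical.

```python
def search_wall_clock(read_timeout_s, attempts_per_search, searches, search_latency_s):
    """Rough seconds the agent spends in RAG search calls."""
    if search_latency_s < read_timeout_s:
        # Each search completes on its first attempt.
        return searches * search_latency_s
    # Every attempt burns the full read timeout before failing.
    return searches * attempts_per_search * read_timeout_s


old_budget = search_wall_clock(10.0, 3, 2, 10.0)  # 10 s timeout, two searches
new_budget = search_wall_clock(25.0, 3, 1, 10.0)  # 25 s timeout, one search
```

Under these assumptions the old setting spends 60 s without a single answer, while the new one returns in about 10 s.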

GCS bucket layout

| Bucket | Prefix | Holds |
| --- | --- | --- |
| `dataland-public` | `artworks/{collection_name}/*.jpg` | Artwork images served to the mobile app and `GET /images/{filepath}`. (`cobanov-public/chapters/` holds museum chapter imagery served by museum.) |
| `dataland-private` | `extracted-images/{doc_id}/...` | Inline images extracted from documents, served via `GET /images/extracted/{filepath}` (only when `EXTRACT_IMAGES_FROM_DOCUMENTS=true`; off by default). |
| `dataland-private` | `documents/`, `blog-posts/`, `external/` | Knowledge-base source documents. Never served publicly. |
| `dataland-private` | `museum/scenes/<slug>.json` | Scene JSON for the `scenes` collection. |

The GCS key is mounted read-only at /app/gcp-key.json (compose: ../secrets/gcp-key.json:/app/gcp-key.json:ro, GOOGLE_APPLICATION_CREDENTIALS).

DAT-288

default_reference_* placeholder artwork images were purged from chapters.json and cobanov-public/chapters. They were never in Qdrant, so no RAG reindex was needed.

Key env vars

```bash
APP_ENV=production
QDRANT_HOST=dataland-qdrant
QDRANT_PORT=6333
QDRANT_GRPC_PORT=6334
QDRANT_API_KEY=***

KNOWLEDGE_COLLECTION=knowledge
SCENES_COLLECTION=scenes
IMAGES_COLLECTION=images

GEMINI_API_KEY=***
EMBEDDING_MODEL=gemini-embedding-2-preview     # (1)!
GEMINI_MODEL=gemini-3.5-flash                  # (2)!

SEARCH_TOP_K=5
RERANK_CANDIDATES=45                           # (3)!
SEARCH_SCORE_THRESHOLD=0.30                    # (4)!
IMAGE_SEARCH_SCORE_THRESHOLD=0.45              # (5)!
HYBRID_SEARCH_ENABLED=true
TEXT_SEARCH_ENABLED=true                       # (6)!
SCORE_WEIGHT_HYBRID=0.65                       # (7)!
SCORE_WEIGHT_DENSE=0.45
SCORE_WEIGHT_TEXT=0.18

SPARSE_EMBEDDING_MODEL=Qdrant/bm25
SPARSE_VECTOR_NAME=sparse
RERANKER_BACKEND=fastembed                     # (8)!
RERANKER_MODEL=jinaai/jina-reranker-v2-base-multilingual

CHUNK_SIZE=400                                 # (9)!
CHUNK_OVERLAP=60
MAX_FILE_SIZE_MB=50

API_KEY=***                                    # (10)!
GCP_PROJECT_ID=dataland-ai
GCS_PUBLIC_BUCKET=dataland-public
GCS_ARTWORKS_PREFIX=artworks/
GCS_PRIVATE_BUCKET=dataland-private
GCS_DOCUMENTS_PREFIXES=documents/,blog-posts/,external/
GCS_SCENES_PREFIX=museum/scenes/
GOOGLE_APPLICATION_CREDENTIALS=/app/gcp-key.json  # (11)!
```

1. The embedding model that produces the 3072-dim cosine vectors for all three collections. Must stay in the gemini-embedding family; changing it changes vector geometry and forces a full reindex.
2. DAT-269: generative model for image captioning and the reranker's Gemini fallback only. RAG vectors do not use this. Do not set it to gemini-2.5-flash/gemini-3.1-flash-lite; both are retired here.
3. Raised from 20 (DAT-14). Bounds how deep each retrieval channel reaches before reranking. Deeper = better recall but more cross-encoder work, which directly inflates the ~10 s round-trip the agent's 25 s timeout budgets for.
4. Lowered from 0.35 (DAT-20). A candidate survives if max(dense, hybrid) ≥ this. The text channel uses a separate internal floor (2.0) instead.
5. DAT-262: a higher, image-only floor. The multimodal channel produces lower absolute cosines (top-1 0.48–0.51 for real matches), so reusing the 0.30 text threshold let weak selfies return false "artwork" hits.
6. Deployed true to surface exact entity-lookup matches; the code default is false. Only affects the knowledge collection's text-scroll channel.
7. The absolute-score blend weights from the search pipeline. Hybrid leads (it already fuses dense + BM25); text is the weakest since it exists for exact-name hits, not semantic ranking.
8. Primary reranker is FastEmbed ONNX, in-process. The Gemini path is fallback-only and runs solely when the cross-encoder raises.
9. Chunker config for /ingest/file: 400-token chunks with 60-token overlap. Overlap preserves context across chunk boundaries so a fact split across two chunks still retrieves.
10. DAT-162: required non-empty and not the literal dataland placeholder in production, or the service refuses to boot (enforce_runtime_invariants).
11. Read-only mount of the GCS service-account key at /app/gcp-key.json (compose binds ../secrets/gcp-key.json:ro). Used for both GCS access and Gemini auth.

Auth posture

Every endpoint except GET /health requires X-API-Key. The check is constant-time (secrets.compare_digest, api/deps.py).
GET /health/full was historically open to anonymous callers; DAT-184 moved it behind verify_api_key because it discloses collection names and vector counts. The agent's admin dashboard probe authenticates with the same key.
DAT-162 — in production the service refuses to boot if API_KEY is empty or still the literal dataland placeholder, or if GEMINI_API_KEY is empty (enforce_runtime_invariants in runtime.py, called from the lifespan).
/docs, /redoc, /openapi.json are disabled when APP_ENV=production.
Image-serving paths are hardened (DAT-164): strict ASCII per-segment allowlist, no ../backslash/control bytes/Unicode confusables, length + segment caps, and generic 400 bodies that never leak the rejection reason.
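
Two of the rules above, the constant-time key check and the DAT-162 boot invariants, can be sketched as follows. The function names are hypothetical and the issue strings are invented for illustration; only `secrets.compare_digest` and the conditions themselves come from this page.

```python
import secrets

PLACEHOLDER_API_KEY = "dataland"  # the literal placeholder named in DAT-162


def api_key_matches(presented: str, expected: str) -> bool:
    """Constant-time comparison; encoding to bytes allows non-ASCII input."""
    return secrets.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def runtime_config_issues(app_env: str, api_key: str, gemini_api_key: str) -> list:
    """Return the invariant violations that would block a production boot."""
    issues = []
    if app_env == "production":
        if not api_key or api_key == PLACEHOLDER_API_KEY:
            issues.append("API_KEY is empty or still the placeholder")
        if not gemini_api_key:
            issues.append("GEMINI_API_KEY is empty")
    return issues
```

A startup hook would refuse to continue when the returned list is non-empty; outside production the same settings are tolerated.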

Health and reaching it

GET /health returns {"status": "ok"} (liveness only). GET /health/full returns Qdrant status, the embedder model id, per-collection status + vector counts, and the runtime-config issue list — status is degraded if Qdrant errors or any config invariant is violated.

```bash
# Loopback only: open an SSH tunnel from your workstation first.
ssh -L 4143:127.0.0.1:4143 ege@100.124.170.43  # (1)!

curl -fsS http://localhost:4143/health
curl -fsS http://localhost:4143/health/full -H "X-API-Key: $RAG_API_KEY"  # (2)!

# Sample search
curl -fsS http://localhost:4143/search \
  -H "X-API-Key: $RAG_API_KEY" -H 'Content-Type: application/json' \
  -d '{"query":"Latent Gallery", "collection":"knowledge", "top_k":5}'  # (3)!

# Ops: backfill BM25 sparse vectors without a restart (loop until has_more=false)
curl -fsS -X POST "http://localhost:4143/admin/sparse-backfill?max_points=1000" \
  -H "X-API-Key: $RAG_API_KEY"  # (4)!
```

1. The service binds loopback (127.0.0.1:4143) plus the tailnet peer, never 0.0.0.0, so it is unreachable without either this tunnel or tailnet access. 100.124.170.43 is the host's tailnet address.
2. /health/full is auth-gated (DAT-184) because it discloses collection names and vector counts; only GET /health is anonymous.
3. collection must be one of knowledge/images/scenes or the endpoint 400s. top_k is capped at 50; rerank defaults to true.
4. DAT-168: backfills BM25 sparse vectors live, no restart. Page through with max_points and repeat until the response reports has_more=false.
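
The "repeat until has_more=false" loop for the backfill endpoint might look like this in Python. It is a sketch under assumptions: the response is taken to be a JSON object carrying `has_more`, the helper names and the `max_pages` safety cap are hypothetical, and the HTTP call is injected so the loop can be exercised without a live service.

```python
import json
import urllib.request


def http_post_page(max_points, base_url="http://localhost:4143", api_key=""):
    """POST one backfill page through the tunnel and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{base_url}/admin/sparse-backfill?max_points={max_points}",
        method="POST",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def run_sparse_backfill(post_page=http_post_page, max_points=1000, max_pages=10_000):
    """Call post_page until the response reports has_more=false; return page count."""
    for page in range(1, max_pages + 1):
        response = post_page(max_points)
        if not response.get("has_more", False):
            return page
    raise RuntimeError("backfill did not finish within max_pages")
```

Passing a stub for `post_page` makes the loop easy to dry-run before pointing it at production.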

The Agent admin dashboard already pings /health/full for you — see agent → /admin. Prometheus scrapes dataland-rag:4143; the alert rules cover RAG 5xx burn-rate and the mem_limit / cpus pressure that this service is most prone to.