RAG — Search and Ingest
dataland-rag (the v2 service, repo dataland-rag-v2) is the unified retrieval
service for the museum. It owns all ingestion of museum content and all
search — text, image, and multimodal — backed by Qdrant
and Gemini embeddings. The Agent queries it for search_knowledge
and search_artwork_images; the Catalog Studio (information-webui)
pushes content into it on every edit.
| Container | dataland-rag |
| Image | dataland/rag:${IMAGE_TAG} (built from dataland-rag-v2/Dockerfile, app version 3.0.0) |
| Internal URL | http://dataland-rag:4143 (the docker bridge dataland-network) |
| Bind | 127.0.0.1:4143 (loopback + SSH tunnel) and ${RAG_PUBLIC_BIND}:4143 (tailnet peer, spark:4143) — never 0.0.0.0 |
| Memory / CPU | mem_limit: 12g (mem_reservation: 4g) / cpus: 12.0 |
| Liveness | GET /health (compose healthcheck, anonymous) |
| Diagnostics | GET /health/full (auth-gated, DAT-184) |
Recent changes (2026-06-03 → 2026-06-04)
- DAT-269: the generative model is standardized on gemini-3.5-flash for image captioning and the reranker's Gemini fallback (GEMINI_MODEL). RAG vectors continue to use the gemini-embedding family (gemini-embedding-2-preview in production). Do not assume gemini-2.5-flash / gemini-3.1-flash-lite; both are retired here.
- Museum content re-ingest: the 20 museum sections + their scenes + the museum overview were (re-)pushed into the knowledge collection, taking the point count 4839 → 4969.
- Search timeout interplay: the Agent's RAG /search read timeout was raised 10s → 25s. Museum-knowledge queries (query embedding + dense + sparse + rerank) round-trip in ~10s; a 10s client timeout triggered ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout. See Agent search-timeout interplay.
What it does
graph LR
AG[agent] -.search.-> RAG[dataland-rag]
WU[information-webui] -.ingest + delete.-> RAG
RAG --> QD[(Qdrant<br/>knowledge / images / scenes)]
RAG --> GCS[("GCS<br/>dataland-public + dataland-private")]
RAG --> GEM["Gemini API<br/>(embedding + 3.5-flash)"]
RAG --> FE["FastEmbed<br/>(BM25 + Jina reranker, in-process)"]
- Search (POST /search): hybrid text retrieval (dense + BM25 sparse, optional text-scroll) fused with RRF, then reranked by a Jina cross-encoder.
- Image search (POST /images/search/text, POST /images/search/image): multimodal similarity against the images collection.
- Ingest: POST /ingest/file (knowledge), POST /ingest/image (images, Gemini captioning), POST /ingest/sync (GCS delta scan), and DELETE /ingest/by-project-slug/{slug} (replace-by-slug for the webui).
- Image serving: GET /images/{filepath} (public bucket) and GET /images/extracted/{filepath} (private bucket), both auth-gated and path-hardened (DAT-164).
- Ops: POST /admin/sparse-backfill (BM25 backfill without a restart, DAT-168).
Qdrant collections
All three collections share the same dense vector config — 3072-dim, cosine
(storage/schema.py: VECTOR_SIZE = 3072, Distance.COSINE). They are created
idempotently at startup by ensure_collections().
| Collection | What populates it | Embed task type | Sparse vector |
|---|---|---|---|
| knowledge | Document chunks from GCS documents/, blog-posts/, external/, uploaded files via /ingest/file, and webui text (projects, museum overview, sections, scenes) | RETRIEVAL_DOCUMENT (index) / RETRIEVAL_QUERY (search) | sparse (BM25, Modifier.IDF) |
| images | Artwork images from GCS artworks/, webui-curated images via /ingest/image, and (optionally) document-extracted inline images | multimodal image embedding (no task type) | none |
| scenes | Scene JSON under GCS museum/scenes/ (embed_for_similarity over the description field) | SEMANTIC_SIMILARITY | none |
Production point counts
knowledge ≈ 4969 after today's museum re-ingest (was 4839),
images ≈ 1485, scenes small. Live counts come from GET /health/full
(collections[].vectors_count).
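The idempotent startup creation described above can be sketched as follows. This is a minimal illustration, not the service's ensure_collections(): the VectorStore interface and the InMemoryStore stand-in are hypothetical and deliberately avoid the real Qdrant client so the sketch runs on its own. Only the collection names, the 3072-dim cosine config, and the sparse vector on knowledge come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

VECTOR_SIZE = 3072   # dense size shared by all three collections
DISTANCE = "Cosine"  # stands in for Distance.COSINE


@dataclass(frozen=True)
class CollectionSpec:
    name: str
    sparse_vector: Optional[str] = None  # name of the BM25 sparse vector, if any


SPECS = (
    CollectionSpec("knowledge", sparse_vector="sparse"),
    CollectionSpec("images"),
    CollectionSpec("scenes"),
)


class VectorStore(Protocol):
    """Hypothetical minimal interface; NOT the qdrant-client API."""

    def exists(self, name: str) -> bool: ...
    def create(self, spec: CollectionSpec, size: int, distance: str) -> None: ...


def ensure_collections(store: VectorStore) -> list:
    """Create any missing collection and return the names created by this call."""
    created = []
    for spec in SPECS:
        if not store.exists(spec.name):
            store.create(spec, VECTOR_SIZE, DISTANCE)
            created.append(spec.name)
    return created


class InMemoryStore:
    """Test double that records what would have been created."""

    def __init__(self):
        self.collections = {}

    def exists(self, name):
        return name in self.collections

    def create(self, spec, size, distance):
        self.collections[spec.name] = (size, distance, spec.sparse_vector)
```

Calling ensure_collections twice against the same store creates nothing the second time, which is the property that makes it safe to run on every boot.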
Knowledge payload
Written by embed_and_upsert_knowledge (ingestion/ingestors/document.py). Each
chunk carries: content, title, source (filename), category
(documents/blog-posts/external), source_type (extension), source_path
(the dedup key), doc_id (groups a document's chunks), chunk_index,
token_count, optional section_title / page_number, tags, created_at.
Webui-driven ingests add project_slug + entity_type via the /ingest/file
metadata form field (see webui live-sync).
Reserved metadata keys
/ingest/file rejects caller-supplied metadata that contains any of the
ingestor-owned keys (content, title, source, source_path,
source_type, category, doc_id, chunk_index, token_count,
section_title, page_number, created_at, tags) with a 400. The
webui passes only safe extras like project_slug, project_name,
entity_type, location, categories.
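A minimal sketch of the reserved-key check, assuming the metadata form field arrives as a JSON string. The function name parse_metadata and the ReservedKeyError exception (standing in for the HTTP 400) are hypothetical; the key list is the one given above.

```python
import json

# Ingestor-owned keys, as listed above.
RESERVED_KEYS = frozenset({
    "content", "title", "source", "source_path", "source_type", "category",
    "doc_id", "chunk_index", "token_count", "section_title", "page_number",
    "created_at", "tags",
})


class ReservedKeyError(ValueError):
    """Stands in for the 400 response the route returns."""


def parse_metadata(raw: str) -> dict:
    """Parse the metadata form field and reject ingestor-owned keys."""
    metadata = json.loads(raw) if raw else {}
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    clashes = sorted(RESERVED_KEYS.intersection(metadata))
    if clashes:
        raise ReservedKeyError("reserved metadata keys: " + ", ".join(clashes))
    return metadata
```

Safe extras such as project_slug and entity_type pass through unchanged; anything that would overwrite an ingestor-owned field fails before parsing or embedding starts.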
Image payload
Built by the shared build_image_point helper (ingestion/ingestors/image_point.py,
DAT-172) so all three image-ingest paths share 11 base keys: file_name,
title, caption, keywords, tags, gs_uri, public_url, collection_name,
artwork_id, source_path, ingested_at. The webui path layers ~20 extras
on top (image_id, project_slug, width/height, description, alt_text,
auto_caption, source_type="webui-image", CDN/source URLs, timestamps, …). Its
/ingest/image route also enforces its own reserved-key set (auto_caption,
source_type, width, height, size_bytes, title, tags, keywords,
artwork_id, collection_name, ingested_at).
The search pipeline
POST /search accepts {query, collection, top_k≤50, filters?, rerank=true, score_threshold?}.
The collection must be one of the three known names or it 400s.
flowchart TD
Q[query] --> E["embed_query<br/>(RETRIEVAL_QUERY, Gemini)"]
E --> D[dense query_points]
E --> H["hybrid RRF prefetch<br/>(dense + BM25 sparse)"]
Q --> T["text-scroll<br/>(MatchPhrase + MatchText)"]
D --> M[merge_candidates]
H --> M
T --> M
M --> TH["score-threshold filter<br/>+ blend weights"]
TH --> R{rerank?}
R -- yes --> RR["Jina cross-encoder<br/>(FastEmbed ONNX)"]
R -- no --> TK[top_k slice]
RR --> OUT[results]
TK --> OUT
Three retrieval channels are merged with weighted RRF (retrieval/searcher.py):
- Dense: cosine kNN over the Gemini vector. Always runs, every collection.
- Hybrid (dense + BM25 sparse): a single Qdrant FusionQuery(RRF) with two prefetches. Only fires on the knowledge collection and only when HYBRID_SEARCH_ENABLED=true and the collection actually exposes the sparse vector. On any error it logs and falls back to dense.
- Text-scroll: MatchPhrase over (section_title, title, source, content) plus a wider MatchText pass, re-scored in Python with field weights. Only on knowledge and only when TEXT_SEARCH_ENABLED=true (deployed true to surface entity-lookup matches; the code default is false).
The candidates are merged in _merge_candidates: each channel contributes an RRF
term (hybrid 1.2, dense 0.8, text 1.5 over rank + 10) plus a weighted
absolute-score term:
final = SCORE_WEIGHT_HYBRID*hybrid + SCORE_WEIGHT_DENSE*dense + SCORE_WEIGHT_TEXT*text # (1)!
      = 0.65*hybrid + 0.45*dense + 0.18*text (deployed) # (2)!
1. The absolute-score blend that rides on top of the RRF rank terms. Each channel's raw similarity is scaled by its weight, so a channel can be down-weighted without dropping it from candidate generation.
2. Deployed weights: SCORE_WEIGHT_HYBRID=0.65, SCORE_WEIGHT_DENSE=0.45, SCORE_WEIGHT_TEXT=0.18. Hybrid leads because it already fuses dense + BM25; text is deliberately the weakest contributor since text-scroll exists mainly to surface exact entity-name hits, not to rank semantic relevance.
A candidate survives the threshold if max(dense, hybrid) ≥ score_threshold
or its text score clears the internal text floor (2.0). RERANK_CANDIDATES
(deployed 45, was 20) bounds how deep each channel reaches before reranking.
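The merge and threshold rules above can be sketched as follows. This is one plausible reading, not the code in _merge_candidates: it assumes each channel's contribution is its RRF weight divided by (rank + 10) plus its score weight times the raw score, that ranks are 1-based, and that contributions are summed per point. The function names and input shape are hypothetical; the constants are the ones quoted in this section.

```python
RRF_K = 10
RRF_WEIGHT = {"hybrid": 1.2, "dense": 0.8, "text": 1.5}
SCORE_WEIGHT = {"hybrid": 0.65, "dense": 0.45, "text": 0.18}  # deployed values
TEXT_FLOOR = 2.0


def merge_candidates(channels):
    """Fuse per-channel hit lists into one score per point.

    channels maps a channel name to [(point_id, raw_score), ...] ordered
    best-first. Returns {point_id: fused_score}.
    """
    fused = {}
    for name, hits in channels.items():
        for rank, (point_id, score) in enumerate(hits, start=1):  # 1-based: assumption
            rrf_term = RRF_WEIGHT[name] / (rank + RRF_K)
            score_term = SCORE_WEIGHT[name] * score
            fused[point_id] = fused.get(point_id, 0.0) + rrf_term + score_term
    return fused


def survives(dense, hybrid, text, score_threshold):
    """A candidate is kept if a vector channel clears the threshold
    or its text score clears the internal text floor."""
    return max(dense, hybrid) >= score_threshold or text >= TEXT_FLOOR
```

A point found by several channels accumulates a term from each, so agreement between channels lifts it above a point that only one channel returned at a similar rank.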
Reranking — Jina cross-encoder, Gemini fallback
rerank=true (default) reorders the surviving candidates with a FastEmbed Jina
cross-encoder running ONNX in-process — jinaai/jina-reranker-v2-base-multilingual
(~1.1 GB ONNX, multilingual, ~80–120 ms warm). This is the primary path
(RERANKER_BACKEND=fastembed).
Gemini Flash is only the fallback
retrieval/reranker.py carries a Gemini-Flash scoring path (_score_batch_gemini,
using GEMINI_MODEL = gemini-3.5-flash). It runs only when the FastEmbed
cross-encoder raises. The module docstring describing "Stage 2: Gemini Flash"
is stale — under normal operation reranking never calls Gemini. If both
fail, the original RRF order is returned unchanged.
A single result short-circuits with rerank_score = 10.0; the rerank toggle is
ignored unless there are at least two candidates.
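The fallback order described above can be sketched with injected scorers. The rerank signature and the scorer callables are hypothetical stand-ins for the FastEmbed and Gemini paths; the 10.0 short-circuit and the "return the original order if both fail" behaviour come from the text.

```python
def rerank(query, candidates, primary, fallback):
    """Return [(candidate, rerank_score)] best-first.

    primary and fallback are callables (query, candidates) -> list of floats.
    The fallback runs only if the primary raises; if both raise, the input
    order is returned unchanged with a score of None.
    """
    if len(candidates) < 2:
        # Nothing to reorder: a single result short-circuits.
        return [(c, 10.0) for c in candidates]
    for scorer in (primary, fallback):
        try:
            scores = scorer(query, candidates)
        except Exception:
            continue  # try the next backend
        pairs = zip(candidates, scores)
        return sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [(c, None) for c in candidates]
```

Because sorted is stable, candidates with equal rerank scores keep their incoming (RRF) order.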
Image search thresholds
Image search (searcher.search_by_image / search_images_by_text) is a plain
cosine kNN against images with no rerank and no hybrid. It applies a
separate, higher per-channel floor — IMAGE_SEARCH_SCORE_THRESHOLD=0.45
(DAT-262) — because the multimodal channel produces lower absolute cosines
(observed top-1 0.48–0.51 for real artwork matches) than text knowledge.
Reusing the 0.30 text threshold let every weakly-matching selfie return an
"artwork" hit. The agent's vision path (/images/search/text) consumes this.
Ingestion
/ingest/file (knowledge)
Multipart upload (file + JSON metadata form field, ≤ 50 MB). The
Kreuzberg parser extracts text (VLM OCR
via OCR_BACKEND=vlm, capped by MAX_PARSE_TIME_S=120), the chunker splits at
CHUNK_SIZE=400 / CHUNK_OVERLAP=60, each chunk is embedded
(RETRIEVAL_DOCUMENT), and dense + (if enabled) BM25 sparse vectors are upserted
in batches of 100. Supported extensions come from the document ingestor's
SUPPORTED_EXTENSIONS (PDF, DOCX, MD, TXT, …).
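A minimal sketch of fixed-size chunking with overlap, using the CHUNK_SIZE=400 / CHUNK_OVERLAP=60 values above. It assumes the input is already a list of tokens; the service's tokenizer and its handling of section boundaries are not described here, so chunk_tokens is an illustration only.

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Split tokens into windows of `size` that share `overlap` tokens
    with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With these values each new chunk starts 340 tokens after the previous one, so a fact that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.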
/ingest/image (images)
Single image (file + required JSON metadata, ≤ 25 MB). Required metadata:
image_id, project_slug, source_path. The endpoint:
- Optionally auto-captions the bytes with Gemini (CAPTION_PROMPT + gemini-3.5-flash, auto_caption=true default) for richer keyword extraction.
- Embeds the image bytes multimodally (embedder.embed_image).
- Sanitizes every user string (DAT-165) and upserts one point keyed by a deterministic UUIDv5 minted from source_path (_INGEST_IMAGE_NAMESPACE), so re-ingestion replaces in place rather than duplicating.
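The deterministic point id can be illustrated with the standard uuid module. The namespace value below is a made-up placeholder (the real _INGEST_IMAGE_NAMESPACE is not shown in this page), and point_id_for is a hypothetical helper name.

```python
import uuid

# Placeholder namespace for illustration only; the service defines its own.
INGEST_IMAGE_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.invalid/ingest-image"
)


def point_id_for(source_path: str) -> str:
    """Same source_path always yields the same UUIDv5, so an upsert
    overwrites the existing point instead of adding a duplicate."""
    return str(uuid.uuid5(INGEST_IMAGE_NAMESPACE, source_path))
```

Because the id depends only on the namespace and source_path, changing either one would orphan previously ingested points, which is why the namespace has to stay fixed.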
Quota + rate guards (DAT-163 / DAT-166)
Two layers protect the Gemini quota on /ingest/image:
- Per-key sliding window: IMAGE_INGEST_RATE_PER_MINUTE=30 (default). Over-budget calls return 429 with Retry-After.
- Daily Gemini budget: GEMINI_DAILY_BUDGET=5000 calls/UTC-day across all keys (each /ingest/image reserves 2 calls with auto-caption, 1 without). Exhaustion returns 503.
If the auto-caption call itself hits a Gemini 429, the route
short-circuits with 503 before embedding (DAT-166) rather than burning a
second quota slot or writing a point with an empty caption.
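The two guard layers can be sketched as below. The class and function names are hypothetical and the state is in-process only; how the service stores counters, or rounds Retry-After, is an assumption. The limits, the 2-vs-1 call reservation, and the 429/503 split come from the text.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Per-key limit over a rolling window (default 30 calls / 60 s)."""

    def __init__(self, limit=30, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def retry_after(self, key, now):
        """Record the call and return 0.0, or return seconds to wait."""
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop calls that left the window
        if len(hits) >= self.limit:
            return self.window_s - (now - hits[0])
        hits.append(now)
        return 0.0


class DailyBudget:
    """Shared Gemini call budget that resets when the UTC day changes."""

    def __init__(self, limit=5000):
        self.limit = limit
        self._day = None
        self._used = 0

    def reserve(self, calls, day):
        if day != self._day:
            self._day, self._used = day, 0
        if self._used + calls > self.limit:
            return False
        self._used += calls
        return True


def admit(key, auto_caption, limiter, budget, now, day):
    """Return (status, headers) for one /ingest/image call."""
    wait = limiter.retry_after(key, now)
    if wait > 0:
        return 429, {"Retry-After": str(math.ceil(wait))}
    if not budget.reserve(2 if auto_caption else 1, day):
        return 503, {}
    return 200, {}
```

Checking the per-key window first means a single noisy key is throttled with 429 before it can drain the budget that all keys share.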
/ingest/sync (GCS delta)
Scans every configured GCS prefix, computes the delta against what's already
indexed (by source_path), and ingests only new blobs. SYNC_ON_STARTUP=false
in production — trigger on demand. The same scanner can run at boot if flipped.
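The delta step reduces to a set difference keyed on source_path. A minimal sketch, assuming both sides are available as plain lists of paths (compute_delta is a hypothetical name; listing blobs and querying indexed paths are out of scope here):

```python
def compute_delta(blob_paths, indexed_source_paths):
    """Return blob paths that are not indexed yet, in scan order,
    without repeating a path that the scan lists twice."""
    indexed = set(indexed_source_paths)
    seen = set()
    new_paths = []
    for path in blob_paths:
        if path in indexed or path in seen:
            continue
        seen.add(path)
        new_paths.append(path)
    return new_paths
```

Note that this only detects new blobs: a blob that changed content but kept its path is treated as already indexed, which matches the "ingests only new blobs" behaviour described above.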
Webui live-sync
The Catalog Studio keeps Qdrant in lockstep with the CMS.
On every save it runs DELETE → file → image per entity so a re-sync never
accumulates duplicates (dataland-atlas/app/rag_sync.py):
1. DELETE /ingest/by-project-slug/{slug}: wipes the entity's footprint across both images and knowledge (filters payload.project_slug == slug).
2. POST /ingest/file: the rendered Markdown into knowledge.
3. POST /ingest/image: each attached image into images.
Museum entities use namespaced RAG slugs so a museum section can't collide with a Refik project of the same name:
| Entity | project_slug (RAG grouping key) | entity_type |
|---|---|---|
| Museum overview | museum | museum |
| Section | museum-section-<slug> | section |
| Scene | museum-scene-<slug> | scene |
| Scene image | museum-scene-<slug> | scene_image |
This is the path that re-ingested the 20 sections + scenes + overview today,
moving knowledge from 4839 → 4969 points.
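The slug namespacing and the DELETE → file → image order can be sketched as a pure planning function. Both helpers are hypothetical and cover only the museum entity types in the table; the real rag_sync.py also handles ordinary projects and performs the HTTP calls.

```python
def rag_project_slug(entity_type, slug=None):
    """Map a museum entity to its namespaced RAG grouping key."""
    if entity_type == "museum":
        return "museum"
    if entity_type == "section":
        return "museum-section-" + slug
    if entity_type in ("scene", "scene_image"):
        return "museum-scene-" + slug  # scene images share the scene's slug
    raise ValueError("unsupported entity_type: " + entity_type)


def sync_plan(entity_type, slug, image_count):
    """List the requests one save would issue, in order."""
    project_slug = rag_project_slug(entity_type, slug)
    plan = [
        ("DELETE", "/ingest/by-project-slug/" + project_slug),
        ("POST", "/ingest/file"),
    ]
    plan += [("POST", "/ingest/image")] * image_count
    return plan
```

Because a scene and its images share one project_slug, the single DELETE at the start clears both the scene text and its images before they are re-pushed.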
Why it's the heaviest service
- The Jina reranker is ONNX inference. At cpus: 2.0 a single 20-candidate rerank pegged both cores for ~25 s (observed 2026-05-12, CPU at 209 %). The cross-encoder scales well past 8 threads, so cpus: 12.0 covers concurrent reranks with headroom on the 20-core host.
- FastEmbed keeps both models resident: the reranker (~1.1 GB ONNX + ~2 GB ORT runtime) and the BM25 sparse model. Runtime working set ~3.25 GB; mem_limit: 12g leaves room for OS page cache, which matters because the ONNX session memory-maps its weights and re-reads them per call.
- Model warm-up (DAT-174): WARM_MODELS_ON_STARTUP=true eagerly loads the sparse + reranker ONNX sessions at lifespan startup (in parallel) so the first /search after deploy doesn't pay a 1–3 s cold-start tax. The image is also pre-warmed at build time via scripts/prewarm_fastembed.py into FASTEMBED_CACHE_DIR=/opt/fastembed_cache.
Agent search-timeout interplay
The end-to-end museum-knowledge query is intentionally slow: a gemini-embedding
query embedding, three retrieval channels, then a cross-encoder rerank over up to
45 candidates round-trip in ~10 s. The Agent's RAG client read timeout
is therefore set to 25 s (rag_search_timeout_seconds = 25.0, app/config.py):
A 10s timeout caused ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout on knowledge queries. 25s lets a single search complete on the first attempt.
When tuning RAG latency (deeper RERANK_CANDIDATES, re-enabling TEXT_SEARCH_ENABLED,
or moving off the warm path), keep this 25 s budget in mind — the agent gives up,
and gives up expensively, before RAG does.
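The arithmetic behind that budget can be made explicit. This sketch assumes "3 retries" means three attempts that each run to the read timeout, and that a search slightly slower than the timeout never succeeds; search_elapsed is a hypothetical helper, not agent code.

```python
def search_elapsed(latency_s, read_timeout_s, attempts=3):
    """Return (seconds spent, succeeded) for one logical search."""
    if latency_s <= read_timeout_s:
        return latency_s, True  # first attempt completes
    return read_timeout_s * attempts, False  # every attempt times out


# Old setting: a ~10.5 s query against a 10 s timeout.
old_single, old_ok = search_elapsed(10.5, 10.0)
old_total = 2 * old_single  # the agent then issued a second search

# New setting: the same query against the 25 s timeout.
new_single, new_ok = search_elapsed(10.5, 25.0)
```

Under these assumptions the old setting spends 30 s per search and 60 s across two searches without ever getting a result, while the 25 s timeout returns in about 10.5 s on the first attempt.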
GCS bucket layout
| Bucket | Prefix | Holds |
|---|---|---|
| dataland-public | artworks/{collection_name}/*.jpg | Artwork images served to the mobile app and GET /images/{filepath}. (cobanov-public/chapters/ holds museum chapter imagery served by museum.) |
| dataland-public | extracted-images/{doc_id}/... | Inline images extracted from documents (only when EXTRACT_IMAGES_FROM_DOCUMENTS=true; off by default). |
| dataland-private | documents/, blog-posts/, external/ | Knowledge-base source documents. Never served publicly. |
| dataland-private | museum/scenes/<slug>.json | Scene JSON for the scenes collection. |
The GCS key is mounted read-only at /app/gcp-key.json (compose:
../secrets/gcp-key.json:/app/gcp-key.json:ro, GOOGLE_APPLICATION_CREDENTIALS).
DAT-288
default_reference_* placeholder artwork images were purged from
chapters.json and cobanov-public/chapters. They were never in Qdrant, so
no RAG reindex was needed.
Key env vars
APP_ENV=production
QDRANT_HOST=dataland-qdrant
QDRANT_PORT=6333
QDRANT_GRPC_PORT=6334
QDRANT_API_KEY=***
KNOWLEDGE_COLLECTION=knowledge
SCENES_COLLECTION=scenes
IMAGES_COLLECTION=images
GEMINI_API_KEY=***
EMBEDDING_MODEL=gemini-embedding-2-preview # (1)!
GEMINI_MODEL=gemini-3.5-flash # (2)!
SEARCH_TOP_K=5
RERANK_CANDIDATES=45 # (3)!
SEARCH_SCORE_THRESHOLD=0.30 # (4)!
IMAGE_SEARCH_SCORE_THRESHOLD=0.45 # (5)!
HYBRID_SEARCH_ENABLED=true
TEXT_SEARCH_ENABLED=true # (6)!
SCORE_WEIGHT_HYBRID=0.65 # (7)!
SCORE_WEIGHT_DENSE=0.45
SCORE_WEIGHT_TEXT=0.18
SPARSE_EMBEDDING_MODEL=Qdrant/bm25
SPARSE_VECTOR_NAME=sparse
RERANKER_BACKEND=fastembed # (8)!
RERANKER_MODEL=jinaai/jina-reranker-v2-base-multilingual
CHUNK_SIZE=400 # (9)!
CHUNK_OVERLAP=60
MAX_FILE_SIZE_MB=50
API_KEY=*** # (10)!
GCP_PROJECT_ID=dataland-ai
GCS_PUBLIC_BUCKET=dataland-public
GCS_ARTWORKS_PREFIX=artworks/
GCS_PRIVATE_BUCKET=dataland-private
GCS_DOCUMENTS_PREFIXES=documents/,blog-posts/,external/
GCS_SCENES_PREFIX=museum/scenes/
GOOGLE_APPLICATION_CREDENTIALS=/app/gcp-key.json # (11)!
1. The embedding model that produces the 3072-dim cosine vectors for all three collections. Must stay in the gemini-embedding family: changing it changes vector geometry and forces a full reindex.
2. DAT-269: generative model for image captioning and the reranker's Gemini fallback only. RAG vectors do not use this. Do not set it to gemini-2.5-flash / gemini-3.1-flash-lite; both are retired here.
3. Raised from 20 (DAT-14). Bounds how deep each retrieval channel reaches before reranking. Deeper = better recall but more cross-encoder work, which directly inflates the ~10 s round-trip the agent's 25 s timeout budgets for.
4. Lowered from 0.35 (DAT-20). A candidate survives if max(dense, hybrid) ≥ this. The text channel uses a separate internal floor (2.0) instead.
5. DAT-262: a higher, image-only floor. The multimodal channel produces lower absolute cosines (top-1 0.48–0.51 for real matches), so reusing the 0.30 text threshold let weak selfies return false "artwork" hits.
6. Deployed true to surface exact entity-lookup matches; the code default is false. Only affects the knowledge collection's text-scroll channel.
7. The absolute-score blend weights from the search pipeline. Hybrid leads (it already fuses dense + BM25); text is the weakest since it exists for exact-name hits, not semantic ranking.
8. Primary reranker is FastEmbed ONNX, in-process. The Gemini path is fallback-only and runs solely when the cross-encoder raises.
9. Chunker config for /ingest/file: 400-token chunks with 60-token overlap. Overlap preserves context across chunk boundaries so a fact split across two chunks still retrieves.
10. DAT-162: required non-empty and not the literal dataland placeholder in production, or the service refuses to boot (enforce_runtime_invariants).
11. Read-only mount of the GCS service-account key at /app/gcp-key.json (compose binds ../secrets/gcp-key.json:ro). Used for both GCS access and Gemini auth.
Auth posture
- Every endpoint except GET /health requires X-API-Key. The check is constant-time (secrets.compare_digest, api/deps.py).
- GET /health/full was opened to anonymous callers historically; DAT-184 moved it behind verify_api_key because it discloses collection names + vector counts. The agent's admin dashboard probe authenticates with the same key.
- DAT-162: in production the service refuses to boot if API_KEY is empty or still the literal dataland placeholder, or if GEMINI_API_KEY is empty (enforce_runtime_invariants in runtime.py, called from the lifespan).
- /docs, /redoc, /openapi.json are disabled when APP_ENV=production.
- Image-serving paths are hardened (DAT-164): strict ASCII per-segment allowlist, no .., backslash, control bytes, or Unicode confusables, length + segment caps, and generic 400 bodies that never leak the rejection reason.
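The key check and the boot-time invariants above can be sketched with the standard library. The function bodies are illustrative, not the contents of api/deps.py or runtime.py; the issue strings and the split into runtime_config_issues plus enforce_runtime_invariants are assumptions, while the rules themselves (constant-time compare, non-empty API_KEY that is not "dataland", non-empty GEMINI_API_KEY, production only) come from the list.

```python
import secrets

PLACEHOLDER_API_KEY = "dataland"


def verify_api_key(presented, expected):
    """Constant-time comparison of the X-API-Key header value."""
    if presented is None:
        return False
    return secrets.compare_digest(presented.encode(), expected.encode())


def runtime_config_issues(env):
    """Return the list of violated invariants (empty outside production)."""
    issues = []
    if env.get("APP_ENV") != "production":
        return issues
    api_key = env.get("API_KEY", "")
    if not api_key:
        issues.append("API_KEY is empty")
    elif api_key == PLACEHOLDER_API_KEY:
        issues.append("API_KEY is still the placeholder")
    if not env.get("GEMINI_API_KEY"):
        issues.append("GEMINI_API_KEY is empty")
    return issues


def enforce_runtime_invariants(env):
    """Refuse to boot when any invariant is violated."""
    issues = runtime_config_issues(env)
    if issues:
        raise RuntimeError("refusing to boot: " + "; ".join(issues))
```

Encoding both values to bytes before compare_digest keeps the comparison valid for non-ASCII header values, which the string form of compare_digest rejects.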
Health and reaching it
GET /health returns {"status": "ok"} (liveness only). GET /health/full
returns Qdrant status, the embedder model id, per-collection status + vector
counts, and the runtime-config issue list — status is degraded if Qdrant
errors or any config invariant is violated.
# Loopback only — open an SSH tunnel from your workstation first:
ssh -L 4143:127.0.0.1:4143 ege@100.124.170.43 # (1)!
curl -fsS http://localhost:4143/health
curl -fsS http://localhost:4143/health/full -H "X-API-Key: $RAG_API_KEY" # (2)!
# Sample search
curl -fsS http://localhost:4143/search \
-H "X-API-Key: $RAG_API_KEY" -H 'Content-Type: application/json' \
-d '{"query":"Latent Gallery", "collection":"knowledge", "top_k":5}' # (3)!
# Ops: backfill BM25 sparse vectors without a restart (loop until has_more=false)
curl -fsS -X POST "http://localhost:4143/admin/sparse-backfill?max_points=1000" \
-H "X-API-Key: $RAG_API_KEY" # (4)!
1. The service binds loopback (127.0.0.1:4143) plus the tailnet peer, never 0.0.0.0, so it is unreachable without either this tunnel or tailnet access. 100.124.170.43 is the host's tailnet address.
2. /health/full is auth-gated (DAT-184) because it discloses collection names and vector counts; only GET /health is anonymous.
3. collection must be one of knowledge / images / scenes or the endpoint 400s. top_k is capped at 50; rerank defaults to true.
4. DAT-168: backfills BM25 sparse vectors live, no restart. Page through with max_points and repeat until the response reports has_more=false.
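The same calls can be scripted from Python with the standard library. This is a sketch under assumptions: build_search_request and backfill_all are hypothetical helpers, the request is only constructed (sending it needs the tunnel and a real key), and the backfill response is assumed to be a JSON object whose only field used here is has_more.

```python
import json
from urllib import request

KNOWN_COLLECTIONS = {"knowledge", "images", "scenes"}


def build_search_request(base_url, api_key, query, collection="knowledge", top_k=5):
    """Build (but do not send) a POST /search request."""
    if collection not in KNOWN_COLLECTIONS:
        raise ValueError("unknown collection: " + collection)
    if not 1 <= top_k <= 50:
        raise ValueError("top_k must be between 1 and 50")
    body = json.dumps({"query": query, "collection": collection, "top_k": top_k})
    return request.Request(
        base_url + "/search",
        data=body.encode(),
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def backfill_all(post_page, max_points=1000, max_pages=10_000):
    """Call post_page(max_points) until it reports has_more=false.

    post_page is any callable returning the decoded JSON response of
    POST /admin/sparse-backfill. Returns the number of pages processed.
    """
    for pages in range(1, max_pages + 1):
        if not post_page(max_points).get("has_more"):
            return pages
    raise RuntimeError("backfill did not finish within max_pages")
```

Validating collection and top_k on the client mirrors the server-side limits, so a bad call fails locally instead of costing a round-trip through the tunnel.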
The Agent admin dashboard already pings /health/full for you — see
agent → /admin. Prometheus scrapes dataland-rag:4143; the alert rules cover
RAG 5xx burn-rate and the mem_limit / cpus pressure that this service is most
prone to.
See also
- Agent: primary /search + image-search consumer; owns the 25 s timeout.
- Catalog Studio (information-webui): drives ingest + delete-by-slug.
- Museum: chapters catalog and cobanov-public/chapters imagery.