RAG — Search and Ingest

dataland-rag (the v2 service, repo dataland-rag-v2) is the unified retrieval service for the museum. It owns all ingestion of museum content and all search — text, image, and multimodal — backed by Qdrant and Gemini embeddings. The Agent queries it for search_knowledge and search_artwork_images; the Catalog Studio (information-webui) pushes content into it on every edit.

Container dataland-rag
Image dataland/rag:${IMAGE_TAG} (built from dataland-rag-v2/Dockerfile, app version 3.0.0)
Internal URL http://dataland-rag:4143 (the docker bridge dataland-network)
Bind 127.0.0.1:4143 (loopback + SSH tunnel) and ${RAG_PUBLIC_BIND}:4143 (tailnet peer, spark:4143) — never 0.0.0.0
Memory / CPU mem_limit: 12g (mem_reservation: 4g) / cpus: 12.0
Liveness GET /health (compose healthcheck, anonymous)
Diagnostics GET /health/full (auth-gated, DAT-184)

Recent changes (2026-06-03 → 2026-06-04)

  • DAT-269: the generative model is standardized on gemini-3.5-flash for image captioning and the reranker's Gemini fallback (GEMINI_MODEL). RAG vectors continue to use the gemini-embedding family (gemini-embedding-2-preview in production). Do not assume gemini-2.5-flash or gemini-3.1-flash-lite; both are retired here.
  • Museum content re-ingest: the 20 museum sections, their scenes, and the museum overview were (re-)pushed into the knowledge collection, taking the point count 4839 → 4969.
  • Search timeout interplay: the Agent's RAG /search read timeout was raised 10s → 25s. Museum-knowledge queries (query embedding + dense + sparse + rerank) round-trip in ~10s; a 10s client timeout triggered ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout. See "Agent search-timeout interplay" below.

What it does

graph LR
  AG[agent] -.search.-> RAG[dataland-rag]
  WU[information-webui] -.ingest + delete.-> RAG
  RAG --> QD[(Qdrant<br/>knowledge / images / scenes)]
  RAG --> GCS[("GCS<br/>dataland-public + dataland-private")]
  RAG --> GEM["Gemini API<br/>(embedding + 3.5-flash)"]
  RAG --> FE["FastEmbed<br/>(BM25 + Jina reranker, in-process)"]
  • Search (POST /search): hybrid text retrieval (dense + BM25 sparse, optional text-scroll) fused with RRF, then reranked by a Jina cross-encoder.
  • Image search (POST /images/search/text, POST /images/search/image): multimodal similarity against the images collection.
  • Ingest: POST /ingest/file (knowledge), POST /ingest/image (images, Gemini captioning), POST /ingest/sync (GCS delta scan), and DELETE /ingest/by-project-slug/{slug} (replace-by-slug for the webui).
  • Image serving: GET /images/{filepath} (public bucket) and GET /images/extracted/{filepath} (private bucket), both auth-gated and path-hardened (DAT-164).
  • Ops: POST /admin/sparse-backfill (BM25 backfill without a restart, DAT-168).

Qdrant collections

All three collections share the same dense vector config — 3072-dim, cosine (storage/schema.py: VECTOR_SIZE = 3072, Distance.COSINE). They are created idempotently at startup by ensure_collections().

Collection | What populates it | Embed task type | Sparse vector
knowledge | Document chunks from GCS documents/, blog-posts/, external/; uploaded files via /ingest/file; and webui text (projects, museum overview, sections, scenes) | RETRIEVAL_DOCUMENT (index) / RETRIEVAL_QUERY (search) | sparse (BM25, Modifier.IDF)
images | Artwork images from GCS artworks/, webui-curated images via /ingest/image, and (optionally) document-extracted inline images | multimodal image embedding (no task type) | none
scenes | Scene JSON under GCS museum/scenes/ (embed_for_similarity over the description field) | SEMANTIC_SIMILARITY | none

Production point counts

knowledge ≈ 4969 after today's museum re-ingest (was 4839), images ≈ 1485, scenes small. Live counts come from GET /health/full (collections[].vectors_count).

Knowledge payload

Written by embed_and_upsert_knowledge (ingestion/ingestors/document.py). Each chunk carries: content, title, source (filename), category (documents/blog-posts/external), source_type (extension), source_path (the dedup key), doc_id (groups a document's chunks), chunk_index, token_count, optional section_title / page_number, tags, created_at. Webui-driven ingests add project_slug + entity_type via the /ingest/file metadata form field (see webui live-sync).

Reserved metadata keys

/ingest/file rejects caller-supplied metadata that contains any of the ingestor-owned keys (content, title, source, source_path, source_type, category, doc_id, chunk_index, token_count, section_title, page_number, created_at, tags) with a 400. The webui passes only safe extras like project_slug, project_name, entity_type, location, categories.
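The reserved-key check described above can be sketched in a few lines. This is an illustration only: the names RESERVED_KEYS and find_reserved_keys are hypothetical, and the example slug is made up. The route is described as answering 400 when the returned list would be non-empty.

```python
# Hypothetical sketch of the reserved-key check; names are illustrative,
# not taken from the service's source.
RESERVED_KEYS = frozenset({
    "content", "title", "source", "source_path", "source_type", "category",
    "doc_id", "chunk_index", "token_count", "section_title", "page_number",
    "created_at", "tags",
})


def find_reserved_keys(metadata: dict) -> list:
    """Return caller-supplied keys that collide with ingestor-owned keys."""
    return sorted(RESERVED_KEYS.intersection(metadata))


# Safe extras of the kind the webui passes (slug value is an example).
safe = {"project_slug": "museum-section-lobby", "entity_type": "section"}
# A payload that tries to override ingestor-owned fields.
unsafe = {"project_slug": "museum", "title": "override", "doc_id": "x"}

print(find_reserved_keys(safe))
print(find_reserved_keys(unsafe))
```

Returning the offending keys, rather than a bare boolean, makes it easy to log what was rejected even if the HTTP response body stays generic.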

Image payload

Built by the shared build_image_point helper (ingestion/ingestors/image_point.py, DAT-172) so all three image-ingest paths share 11 base keys: file_name, title, caption, keywords, tags, gs_uri, public_url, collection_name, artwork_id, source_path, ingested_at. The webui path layers ~20 extras on top (image_id, project_slug, width/height, description, alt_text, auto_caption, source_type="webui-image", CDN/source URLs, timestamps, …). The /ingest/image route also enforces its own reserved-key set (auto_caption, source_type, width, height, size_bytes, title, tags, keywords, artwork_id, collection_name, ingested_at).

The search pipeline

POST /search accepts {query, collection, top_k≤50, filters?, rerank=true, score_threshold?}. The collection must be one of the three known names or it 400s.
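A client can mirror these constraints before sending a request. The helper below is a hypothetical sketch (build_search_payload is not part of the service or the Agent); it assumes top_k must be at least 1 and that filters is a JSON object, neither of which is stated above.

```python
import json

KNOWN_COLLECTIONS = ("knowledge", "images", "scenes")
MAX_TOP_K = 50  # documented cap on top_k


def build_search_payload(query, collection="knowledge", top_k=5,
                         rerank=True, filters=None, score_threshold=None):
    """Build the JSON body for POST /search, failing fast on bad input."""
    if collection not in KNOWN_COLLECTIONS:
        raise ValueError(f"unknown collection: {collection!r}")
    if not 1 <= top_k <= MAX_TOP_K:  # lower bound of 1 is an assumption
        raise ValueError(f"top_k must be between 1 and {MAX_TOP_K}")
    payload = {"query": query, "collection": collection,
               "top_k": top_k, "rerank": rerank}
    # Optional fields are omitted entirely when unset.
    if filters is not None:
        payload["filters"] = filters
    if score_threshold is not None:
        payload["score_threshold"] = score_threshold
    return json.dumps(payload)


body = build_search_payload("Latent Gallery", top_k=5)
print(body)
```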

flowchart TD
  Q[query] --> E["embed_query<br/>(RETRIEVAL_QUERY, Gemini)"]
  E --> D[dense query_points]
  E --> H["hybrid RRF prefetch<br/>(dense + BM25 sparse)"]
  Q --> T["text-scroll<br/>(MatchPhrase + MatchText)"]
  D --> M[merge_candidates]
  H --> M
  T --> M
  M --> TH["score-threshold filter<br/>+ blend weights"]
  TH --> R{rerank?}
  R -- yes --> RR["Jina cross-encoder<br/>(FastEmbed ONNX)"]
  R -- no --> TK[top_k slice]
  RR --> OUT[results]
  TK --> OUT

Three retrieval channels are merged with weighted RRF (retrieval/searcher.py):

  1. Dense: cosine kNN over the Gemini vector. Always runs, every collection.
  2. Hybrid (dense + BM25 sparse): a single Qdrant FusionQuery(RRF) with two prefetches. Only fires on the knowledge collection and only when HYBRID_SEARCH_ENABLED=true and the collection actually exposes the sparse vector. On any error it logs and falls back to dense.
  3. Text-scroll: MatchPhrase over (section_title, title, source, content) plus a wider MatchText pass, re-scored in Python with field weights. Only on knowledge and only when TEXT_SEARCH_ENABLED=true (deployed true to surface entity-lookup matches; the code default is false).

The candidates are merged in _merge_candidates: each channel contributes an RRF term (channel weight divided by rank + 10, with weights hybrid 1.2, dense 0.8, text 1.5) plus a weighted absolute-score term:

final = SCORE_WEIGHT_HYBRID*hybrid + SCORE_WEIGHT_DENSE*dense + SCORE_WEIGHT_TEXT*text  # (1)!
      = 0.65*hybrid + 0.45*dense + 0.18*text   (deployed)  # (2)!
  1. The absolute-score blend that rides on top of the RRF rank terms. Each channel's raw similarity is scaled by its weight, so a channel can be down-weighted without dropping it from candidate generation.
  2. Deployed weights: SCORE_WEIGHT_HYBRID=0.65, SCORE_WEIGHT_DENSE=0.45, SCORE_WEIGHT_TEXT=0.18. Hybrid leads because it already fuses dense + BM25; text is deliberately the weakest contributor since text-scroll exists mainly to surface exact entity-name hits, not to rank semantic relevance.

A candidate survives the threshold if max(dense, hybrid) ≥ score_threshold or its text score clears the internal text floor (2.0). RERANK_CANDIDATES (deployed 45, was 20) bounds how deep each channel reaches before reranking.
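The merge and threshold rules above can be sketched as follows. This is a minimal illustration, not the code in retrieval/searcher.py. It assumes ranks are 1-based, that the RRF term and the weighted score term are simply summed per channel, that raw scores are blended without normalization, and that "clears the floor" means greater than or equal. All function and constant names are hypothetical.

```python
RRF_K = 10
RRF_WEIGHTS = {"hybrid": 1.2, "dense": 0.8, "text": 1.5}
SCORE_WEIGHTS = {"hybrid": 0.65, "dense": 0.45, "text": 0.18}  # deployed blend
TEXT_FLOOR = 2.0


def merge_candidates(channels, score_threshold=0.30):
    """Fuse per-channel hit lists into one ranked list of point ids.

    channels maps a channel name to [(point_id, raw_score), ...], best first.
    """
    raw = {}    # point_id -> {channel: raw_score}
    fused = {}  # point_id -> accumulated final score
    for name, hits in channels.items():
        for rank, (pid, score) in enumerate(hits, start=1):
            raw.setdefault(pid, {})[name] = score
            rrf_term = RRF_WEIGHTS[name] / (rank + RRF_K)
            blend_term = SCORE_WEIGHTS[name] * score
            fused[pid] = fused.get(pid, 0.0) + rrf_term + blend_term

    def survives(pid):
        s = raw[pid]
        vector_ok = max(s.get("dense", 0.0), s.get("hybrid", 0.0)) >= score_threshold
        text_ok = s.get("text", 0.0) >= TEXT_FLOOR
        return vector_ok or text_ok

    kept = [pid for pid in fused if survives(pid)]
    return sorted(kept, key=lambda pid: fused[pid], reverse=True)


example = {
    "dense": [("a", 0.62), ("b", 0.28)],
    "text": [("c", 2.4), ("b", 0.5)],
}
print(merge_candidates(example))
```

In the example, "b" appears in two channels but is dropped: its dense score is under the threshold and its text score is under the text floor. "c" has no vector score at all and survives on the text floor alone, which is the entity-lookup case the text channel exists for.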

Reranking — Jina cross-encoder, Gemini fallback

rerank=true (default) reorders the surviving candidates with a FastEmbed Jina cross-encoder running ONNX in-process — jinaai/jina-reranker-v2-base-multilingual (~1.1 GB ONNX, multilingual, ~80–120 ms warm). This is the primary path (RERANKER_BACKEND=fastembed).

Gemini Flash is only the fallback

retrieval/reranker.py carries a Gemini-Flash scoring path (_score_batch_gemini, using GEMINI_MODEL = gemini-3.5-flash). It runs only when the FastEmbed cross-encoder raises. The module docstring describing "Stage 2: Gemini Flash" is stale — under normal operation reranking never calls Gemini. If both fail, the original RRF order is returned unchanged.

A single result short-circuits with rerank_score = 10.0; the rerank toggle is ignored unless there are at least two candidates.
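The fallback order described in this section can be sketched with two interchangeable scorers. This is a hedged illustration: rerank_with_fallback and the Scorer signature are hypothetical, the real scorers are the FastEmbed cross-encoder and _score_batch_gemini, and returning None as the score when both fail is an assumption made here to signal "order unchanged".

```python
from typing import Callable, List, Optional, Sequence, Tuple

Scorer = Callable[[str, Sequence[str]], List[float]]


def rerank_with_fallback(query: str, candidates: Sequence[str],
                         primary: Scorer, fallback: Scorer
                         ) -> List[Tuple[str, Optional[float]]]:
    """Try the primary scorer, then the fallback, else keep incoming order."""
    if len(candidates) < 2:
        # Nothing to reorder: a lone result short-circuits with 10.0.
        return [(c, 10.0) for c in candidates]
    for scorer in (primary, fallback):
        try:
            scores = scorer(query, candidates)
        except Exception:
            continue  # this backend raised; try the next one
        return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    # Both backends raised: return the original (RRF) order unchanged.
    return [(c, None) for c in candidates]


def broken(query, docs):
    raise RuntimeError("backend unavailable")


def by_length(query, docs):
    # Stand-in scorer for the example: longer text scores higher.
    return [float(len(d)) for d in docs]


print(rerank_with_fallback("q", ["aa", "bbbb", "c"], broken, by_length))
```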

Image search thresholds

Image search (searcher.search_by_image / search_images_by_text) is a plain cosine kNN against images with no rerank and no hybrid. It applies a separate, higher floor, IMAGE_SEARCH_SCORE_THRESHOLD=0.45 (DAT-262), because the multimodal channel produces lower absolute cosines (observed top-1 0.48–0.51 for real artwork matches) than text knowledge. Reusing the lower text threshold (SEARCH_SCORE_THRESHOLD, deployed 0.30, previously 0.35) let every weakly-matching selfie return an "artwork" hit. The agent's vision path (/images/search/text) consumes this.

Ingestion

/ingest/file (knowledge)

Multipart upload (file + JSON metadata form field, ≤ 50 MB). The Kreuzberg parser extracts text (VLM OCR via OCR_BACKEND=vlm, capped by MAX_PARSE_TIME_S=120), the chunker splits at CHUNK_SIZE=400 / CHUNK_OVERLAP=60, each chunk is embedded (RETRIEVAL_DOCUMENT), and dense + (if enabled) BM25 sparse vectors are upserted in batches of 100. Supported extensions come from the document ingestor's SUPPORTED_EXTENSIONS (PDF, DOCX, MD, TXT, …).
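To make the CHUNK_SIZE / CHUNK_OVERLAP interaction concrete, here is a minimal sliding-window sketch. It is not the service's chunker: chunk_tokens is a hypothetical name, it works on an already-tokenized list, and it ignores sentence or section boundaries that a real chunker may respect.

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Split a token list into windows of `size` sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = list(range(1000))  # stand-in for 1000 tokens
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the deployed values each window advances by 340 tokens, so the last 60 tokens of one chunk reappear as the first 60 of the next.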

/ingest/image (images)

Single image (file + required JSON metadata, ≤ 25 MB). Required metadata: image_id, project_slug, source_path. The endpoint:

  1. Optionally auto-captions the bytes with Gemini (CAPTION_PROMPT + gemini-3.5-flash, auto_caption=true default) for richer keyword extraction.
  2. Embeds the image bytes multimodally (embedder.embed_image).
  3. Sanitizes every user string (DAT-165) and upserts one point keyed by a deterministic UUIDv5 minted from source_path (_INGEST_IMAGE_NAMESPACE), so re-ingestion replaces in place rather than duplicating.
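The replace-in-place behavior in step 3 follows directly from how UUIDv5 works: the same namespace and name always hash to the same id. A small sketch, using a made-up namespace because the actual value of _INGEST_IMAGE_NAMESPACE is not shown here:

```python
import uuid

# Hypothetical namespace for illustration; the service's
# _INGEST_IMAGE_NAMESPACE value is not documented on this page.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/ingest-image")


def point_id_for(source_path: str) -> str:
    """Derive a stable point id from source_path."""
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, source_path))


first = point_id_for("artworks/example-collection/img_001.jpg")
again = point_id_for("artworks/example-collection/img_001.jpg")
other = point_id_for("artworks/example-collection/img_002.jpg")
print(first == again, first == other)
```

Because the id is a pure function of source_path, an upsert for a path that was ingested before overwrites the existing point instead of creating a second one.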

Quota + rate guards (DAT-163 / DAT-166)

Two layers protect the Gemini quota on /ingest/image:

  • Per-key sliding window: IMAGE_INGEST_RATE_PER_MINUTE=30 (default). Over-budget calls return 429 with Retry-After.
  • Daily Gemini budget: GEMINI_DAILY_BUDGET=5000 calls/UTC-day across all keys (each /ingest/image reserves 2 calls with auto-caption, 1 without). Exhaustion returns 503.

If the auto-caption call itself hits a Gemini 429, the route short-circuits with 503 before embedding (DAT-166) rather than burning a second quota slot or writing a point with an empty caption.
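The two guards can be sketched as an in-memory limiter plus a day-scoped counter. This is an assumption-heavy illustration: the class and function names are hypothetical, state is per-process, and the order of the checks (rate first, then budget) is a guess rather than something stated above.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within the last `window_s` seconds."""

    def __init__(self, limit=30, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def check(self, key, now=None):
        """Return 0.0 and record the call if allowed, else seconds to wait."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop calls that fell out of the window
        if len(hits) >= self.limit:
            return self.window_s - (now - hits[0])
        hits.append(now)
        return 0.0


class DailyBudget:
    """Count reserved calls per UTC day, shared across all keys."""

    def __init__(self, budget=5000):
        self.budget = budget
        self._day = None
        self._used = 0

    def reserve(self, calls, day):
        if day != self._day:  # first call of a new day resets the counter
            self._day, self._used = day, 0
        if self._used + calls > self.budget:
            return False
        self._used += calls
        return True


def admit(limiter, budget, api_key, auto_caption, now, day):
    """Return (status, retry_after_seconds) for one /ingest/image call."""
    wait = limiter.check(api_key, now)
    if wait > 0:
        return 429, wait
    if not budget.reserve(2 if auto_caption else 1, day):
        return 503, None
    return 200, None


limiter = SlidingWindowLimiter(limit=2, window_s=60.0)  # tiny limits for the demo
budget = DailyBudget(budget=3)
print(admit(limiter, budget, "key-a", True, now=0.0, day="2026-06-04"))
```

In a real deployment `day` would come from something like datetime.now(timezone.utc).date(), and the wait value would be rounded up into the Retry-After header.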

/ingest/sync (GCS delta)

Scans every configured GCS prefix, computes the delta against what's already indexed (by source_path), and ingests only new blobs. SYNC_ON_STARTUP=false in production, so syncs are triggered on demand; setting it to true runs the same scanner at boot.
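The delta computation amounts to a set difference keyed on source_path. A minimal sketch, with plan_sync as a hypothetical name and plain lists of paths standing in for the GCS listing and the Qdrant payload scan:

```python
def plan_sync(bucket_paths, indexed_source_paths):
    """Return the blob paths that are in the bucket but not yet indexed."""
    indexed = set(indexed_source_paths)
    return sorted(p for p in set(bucket_paths) if p not in indexed)


in_bucket = ["documents/a.pdf", "documents/b.pdf", "blog-posts/c.md"]
already_indexed = ["documents/a.pdf"]
print(plan_sync(in_bucket, already_indexed))
```

One consequence of keying on the path alone, at least in this sketch, is that a blob overwritten in place under the same path would not be picked up as new.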

Webui live-sync

The Catalog Studio keeps Qdrant in lockstep with the CMS. On every save it runs DELETE → file → image per entity so a re-sync never accumulates duplicates (dataland-atlas/app/rag_sync.py):

  1. DELETE /ingest/by-project-slug/{slug} — wipes the entity's footprint across both images and knowledge (filters payload.project_slug == slug).
  2. POST /ingest/file — the rendered Markdown into knowledge.
  3. POST /ingest/image — each attached image into images.

Museum entities use namespaced RAG slugs so a museum section can't collide with a Refik project of the same name:

Entity | project_slug (RAG grouping key) | entity_type
Museum overview | museum | museum
Section | museum-section-<slug> | section
Scene | museum-scene-<slug> | scene
Scene image | museum-scene-<slug> | scene_image

This is the path that re-ingested the 20 sections + scenes + overview today, moving knowledge from 4839 → 4969 points.
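The slug namespacing and the DELETE → file → image order can be sketched as a pure planning function. Both rag_slug and sync_plan are hypothetical names; the real logic lives in dataland-atlas/app/rag_sync.py and is not reproduced here.

```python
def rag_slug(entity_type: str, slug: str = "") -> str:
    """Map a museum entity to its namespaced RAG project_slug."""
    if entity_type == "museum":
        return "museum"
    if entity_type == "section":
        return f"museum-section-{slug}"
    if entity_type in ("scene", "scene_image"):
        # A scene and its images share one slug, so one DELETE clears both.
        return f"museum-scene-{slug}"
    raise ValueError(f"unknown entity_type: {entity_type!r}")


def sync_plan(entity_type: str, slug: str, image_count: int):
    """List the RAG calls for one save, in the order they must run."""
    target = rag_slug(entity_type, slug)
    plan = [("DELETE", f"/ingest/by-project-slug/{target}"),
            ("POST", "/ingest/file")]
    plan += [("POST", "/ingest/image")] * image_count
    return plan


for step in sync_plan("scene", "atrium", image_count=2):
    print(step)
```

Running the delete first is what makes a re-sync idempotent: whatever the previous save left behind under that slug is gone before the new text and images arrive.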

Why it's the heaviest service

  • The Jina reranker is ONNX inference. At cpus: 2.0 a single 20-candidate rerank pegged both cores for ~25 s (observed 2026-05-12, CPU at 209 %). The cross-encoder scales well past 8 threads, so cpus: 12.0 covers concurrent reranks with headroom on the 20-core host.
  • FastEmbed keeps both models resident: the reranker (~1.1 GB ONNX + ~2 GB ORT runtime) and the BM25 sparse model. Runtime working set ~3.25 GB; mem_limit: 12g leaves room for OS page cache, which matters because the ONNX session memory-maps its weights and re-reads them per call.
  • Model warm-up (DAT-174): WARM_MODELS_ON_STARTUP=true eagerly loads the sparse + reranker ONNX sessions at lifespan startup (in parallel) so the first /search after deploy doesn't pay a 1–3 s cold-start tax. The container image is also pre-warmed at build time via scripts/prewarm_fastembed.py into FASTEMBED_CACHE_DIR=/opt/fastembed_cache.

Agent search-timeout interplay

The end-to-end museum-knowledge query is intentionally slow: a gemini-embedding query embedding, three retrieval channels, then a cross-encoder rerank over up to 45 candidates round-trip in ~10 s. The Agent's RAG client read timeout is therefore set to 25 s (rag_search_timeout_seconds = 25.0, app/config.py):

A 10s timeout caused ReadTimeout → 3 retries (~30s) → a second search → 60s agent wall-clock → agent_timeout on knowledge queries. 25s lets a single search complete on the first attempt.

When tuning RAG latency (deeper RERANK_CANDIDATES, re-enabling TEXT_SEARCH_ENABLED, or moving off the warm path), keep this 25 s budget in mind — the agent gives up, and gives up expensively, before RAG does.
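The arithmetic behind the failure chain is worth spelling out. The model below is deliberately simplified and hypothetical: it assumes every attempt of a too-slow search runs to the full read timeout, that there are three attempts per search with no backoff, and it uses 10.5 s as a stand-in for the "~10 s" round-trip.

```python
def search_cost_s(latency_s, read_timeout_s, attempts=3):
    """Return (seconds spent, succeeded) for one logical search."""
    if latency_s < read_timeout_s:
        return latency_s, True            # first attempt completes
    return attempts * read_timeout_s, False  # every attempt times out


latency = 10.5  # assumed stand-in for the ~10 s round-trip

old_cost, old_ok = search_cost_s(latency, read_timeout_s=10.0)
new_cost, new_ok = search_cost_s(latency, read_timeout_s=25.0)

# Two failed searches in a row is what pushed the agent to its wall-clock limit.
print(old_cost, old_ok, 2 * old_cost)
print(new_cost, new_ok)
```

Under these assumptions the old setting burns 30 s per search and 60 s for two, while the 25 s setting spends only the search's own latency. Any tuning that pushes RAG latency toward 25 s would bring the first pattern back with larger numbers.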

GCS bucket layout

Bucket | Prefix | Holds
dataland-public | artworks/{collection_name}/*.jpg | Artwork images served to the mobile app and via GET /images/{filepath}. (cobanov-public/chapters/ holds museum chapter imagery served by museum.)
dataland-public | extracted-images/{doc_id}/... | Inline images extracted from documents (only when EXTRACT_IMAGES_FROM_DOCUMENTS=true; off by default).
dataland-private | documents/, blog-posts/, external/ | Knowledge-base source documents. Never served publicly.
dataland-private | museum/scenes/<slug>.json | Scene JSON for the scenes collection.

The GCS key is mounted read-only at /app/gcp-key.json (compose: ../secrets/gcp-key.json:/app/gcp-key.json:ro, GOOGLE_APPLICATION_CREDENTIALS).

DAT-288

default_reference_* placeholder artwork images were purged from chapters.json and cobanov-public/chapters. They were never in Qdrant, so no RAG reindex was needed.

Key env vars

APP_ENV=production
QDRANT_HOST=dataland-qdrant
QDRANT_PORT=6333
QDRANT_GRPC_PORT=6334
QDRANT_API_KEY=***

KNOWLEDGE_COLLECTION=knowledge
SCENES_COLLECTION=scenes
IMAGES_COLLECTION=images

GEMINI_API_KEY=***
EMBEDDING_MODEL=gemini-embedding-2-preview     # (1)!
GEMINI_MODEL=gemini-3.5-flash                  # (2)!

SEARCH_TOP_K=5
RERANK_CANDIDATES=45                           # (3)!
SEARCH_SCORE_THRESHOLD=0.30                    # (4)!
IMAGE_SEARCH_SCORE_THRESHOLD=0.45              # (5)!
HYBRID_SEARCH_ENABLED=true
TEXT_SEARCH_ENABLED=true                       # (6)!
SCORE_WEIGHT_HYBRID=0.65 / DENSE=0.45 / TEXT=0.18  # (7)!

SPARSE_EMBEDDING_MODEL=Qdrant/bm25
SPARSE_VECTOR_NAME=sparse
RERANKER_BACKEND=fastembed                     # (8)!
RERANKER_MODEL=jinaai/jina-reranker-v2-base-multilingual

CHUNK_SIZE=400 / CHUNK_OVERLAP=60              # (9)!
MAX_FILE_SIZE_MB=50

API_KEY=***                                    # (10)!
GCP_PROJECT_ID=dataland-ai
GCS_PUBLIC_BUCKET=dataland-public / GCS_ARTWORKS_PREFIX=artworks/
GCS_PRIVATE_BUCKET=dataland-private
GCS_DOCUMENTS_PREFIXES=documents/,blog-posts/,external/
GCS_SCENES_PREFIX=museum/scenes/
GOOGLE_APPLICATION_CREDENTIALS=/app/gcp-key.json  # (11)!
  1. The embedding model that produces the 3072-dim cosine vectors for all three collections. Must stay in the gemini-embedding family — changing it changes vector geometry and forces a full reindex.
  2. DAT-269 — generative model for image captioning and the reranker's Gemini fallback only. RAG vectors do not use this. Do not set it to gemini-2.5-flash/gemini-3.1-flash-lite; both are retired here.
  3. Raised from 20 (DAT-14). Bounds how deep each retrieval channel reaches before reranking. Deeper = better recall but more cross-encoder work, which directly inflates the ~10 s round-trip the agent's 25 s timeout budgets for.
  4. Lowered from 0.35 (DAT-20). A candidate survives if max(dense, hybrid) ≥ this. The text channel uses a separate internal floor (2.0) instead.
  5. DAT-262 — a higher, image-only floor. The multimodal channel produces lower absolute cosines (top-1 0.48–0.51 for real matches), so reusing the 0.30 text threshold let weak selfies return false "artwork" hits.
  6. Deployed true to surface exact entity-lookup matches; the code default is false. Only affects the knowledge collection's text-scroll channel.
  7. The absolute-score blend weights from the search pipeline. Hybrid leads (it already fuses dense + BM25); text is the weakest since it exists for exact-name hits, not semantic ranking.
  8. Primary reranker is FastEmbed ONNX, in-process. The Gemini path is fallback-only and runs solely when the cross-encoder raises.
  9. Chunker config for /ingest/file: 400-token chunks with 60-token overlap. Overlap preserves context across chunk boundaries so a fact split across two chunks still retrieves.
  10. DAT-162 — required non-empty and not the literal dataland placeholder in production, or the service refuses to boot (enforce_runtime_invariants).
  11. Read-only mount of the GCS service-account key at /app/gcp-key.json (compose binds ../secrets/gcp-key.json:ro). Used for both GCS access and Gemini auth.
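The boot-time invariants from note 10 can be sketched as a function over the environment. This is an illustration, not the contents of runtime.py: config_issues and refuse_to_boot_if_invalid are hypothetical names, and only the three conditions stated on this page are checked.

```python
PLACEHOLDER_API_KEY = "dataland"  # the literal placeholder named above


def config_issues(env: dict) -> list:
    """List violated invariants; empty outside production or when all pass."""
    if env.get("APP_ENV") != "production":
        return []
    issues = []
    api_key = env.get("API_KEY", "")
    if not api_key:
        issues.append("API_KEY is empty")
    elif api_key == PLACEHOLDER_API_KEY:
        issues.append("API_KEY is still the placeholder")
    if not env.get("GEMINI_API_KEY", ""):
        issues.append("GEMINI_API_KEY is empty")
    return issues


def refuse_to_boot_if_invalid(env: dict) -> None:
    """Raise at startup when any invariant is violated."""
    issues = config_issues(env)
    if issues:
        raise RuntimeError("refusing to start: " + "; ".join(issues))


bad_env = {"APP_ENV": "production", "API_KEY": "dataland", "GEMINI_API_KEY": ""}
print(config_issues(bad_env))
```

Keeping the check as a list-returning function means the same result can feed both the startup guard and a diagnostics endpoint that reports config issues.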

Auth posture

  • Every endpoint except GET /health requires X-API-Key. The check is constant-time (secrets.compare_digest, api/deps.py).
  • GET /health/full was historically open to anonymous callers; DAT-184 moved it behind verify_api_key because it discloses collection names + vector counts. The agent's admin dashboard probe authenticates with the same key.
  • DAT-162 — in production the service refuses to boot if API_KEY is empty or still the literal dataland placeholder, or if GEMINI_API_KEY is empty (enforce_runtime_invariants in runtime.py, called from the lifespan).
  • /docs, /redoc, /openapi.json are disabled when APP_ENV=production.
  • Image-serving paths are hardened (DAT-164): strict ASCII per-segment allowlist, no ../backslash/control bytes/Unicode confusables, length + segment caps, and generic 400 bodies that never leak the rejection reason.
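An allowlist-style path check of the kind described in the last bullet could look like the sketch below. It is an assumption-based illustration, not the DAT-164 implementation: the exact allowed characters, the length cap, and the segment cap are invented for the example, and is_safe_filepath is a hypothetical name.

```python
import re

# Each segment: ASCII letters, digits, dot, underscore, hyphen; must start
# with a letter or digit (which also rules out "", "." and "..").
_SEGMENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")
MAX_PATH_LEN = 512   # assumed cap
MAX_SEGMENTS = 8     # assumed cap


def is_safe_filepath(filepath: str) -> bool:
    """Accept only paths whose every segment matches the strict allowlist."""
    if not filepath or len(filepath) > MAX_PATH_LEN:
        return False
    segments = filepath.split("/")
    if len(segments) > MAX_SEGMENTS:
        return False
    return all(bool(_SEGMENT.fullmatch(s)) and ".." not in s for s in segments)


print(is_safe_filepath("artworks/example-collection/img_001.jpg"))
print(is_safe_filepath("../secrets/gcp-key.json"))
```

Because the check is an allowlist, backslashes, control bytes, percent-encoding, and non-ASCII look-alike characters are all rejected without being enumerated. A caller would map every False to the same generic 400 so the response does not reveal which rule fired.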

Health and reaching it

GET /health returns {"status": "ok"} (liveness only). GET /health/full returns Qdrant status, the embedder model id, per-collection status + vector counts, and the runtime-config issue list — status is degraded if Qdrant errors or any config invariant is violated.

# Loopback only — open an SSH tunnel from your workstation first:
ssh -L 4143:127.0.0.1:4143 ege@100.124.170.43  # (1)!

curl -fsS http://localhost:4143/health
curl -fsS http://localhost:4143/health/full -H "X-API-Key: $RAG_API_KEY"  # (2)!

# Sample search
curl -fsS http://localhost:4143/search \
  -H "X-API-Key: $RAG_API_KEY" -H 'Content-Type: application/json' \
  -d '{"query":"Latent Gallery", "collection":"knowledge", "top_k":5}'  # (3)!

# Ops: backfill BM25 sparse vectors without a restart (loop until has_more=false)
curl -fsS -X POST "http://localhost:4143/admin/sparse-backfill?max_points=1000" \
  -H "X-API-Key: $RAG_API_KEY"  # (4)!
  1. The service binds loopback (127.0.0.1:4143) plus the tailnet peer, never 0.0.0.0 — so it is unreachable without either this tunnel or tailnet access. 100.124.170.43 is the host's tailnet address.
  2. /health/full is auth-gated (DAT-184) because it discloses collection names and vector counts; only GET /health is anonymous.
  3. collection must be one of knowledge/images/scenes or the endpoint 400s. top_k is capped at 50; rerank defaults to true.
  4. DAT-168 — backfills BM25 sparse vectors live, no restart. Page through with max_points and repeat until the response reports has_more=false.

The Agent admin dashboard already pings /health/full for you — see agent/admin. Prometheus scrapes dataland-rag:4143; the alert rules cover RAG 5xx burn-rate and the mem_limit / cpus pressure that this service is most prone to.

See also