Qdrant¶
dataland-qdrant is the vector store that backs all of Dataland's retrieval. It holds three collections (knowledge, images, and scenes), each storing 3072-dimensional Gemini embeddings compared under cosine distance. The knowledge collection additionally carries a native BM25 sparse vector so RAG can fuse dense + lexical signals on the museum corpus.
Only dataland-rag-v2 reads and writes Qdrant. Every other service that needs retrieval (the Agent's search_knowledge / search_artwork_images tools, the Information WebUI's live-sync) goes through RAG's HTTP API, never to Qdrant directly.
| Property | Value |
|---|---|
| Container | dataland-qdrant |
| Image | qdrant/qdrant:v1.12.4 (QDRANT_VERSION) |
| HTTP (REST) port | 4146 → container :6333 (QDRANT_HTTP_PORT) |
| gRPC port | 4147 → container :6334 (QDRANT_GRPC_PUBLIC_PORT) |
| Host bind | 127.0.0.1 + QDRANT_PUBLIC_BIND (tailnet 100.124.170.43). Never 0.0.0.0 |
| Internal URLs | http://dataland-qdrant:6333, dataland-qdrant:6334 (gRPC) |
| Memory / CPU | mem_limit 4 GB (reservation 1 GB) / cpus 2.0 |
| Persistence | Named volume qdrant-data → /qdrant/storage |
| API key | None (QDRANT_API_KEY empty) — the tailnet is the trust boundary |
No API key — the network is the only fence
Qdrant runs with no authentication. That is deliberate but load-bearing: the container is published on 127.0.0.1 (local + SSH tunnel) and on the Tailscale interface (QDRANT_PUBLIC_BIND=100.124.170.43, exposed as spark:4146 / spark:4147), and must never bind 0.0.0.0. RAG reaches Qdrant over the internal dataland-network Docker bridge (QDRANT_HOST=dataland-qdrant) regardless of host binding, so the published host ports exist purely for operator inspection and direct tailnet peer access. If you ever flip a *_PUBLIC_BIND to a public interface, you have handed the whole vector store to the internet. (Bind discipline tightened in DAT-73 / DAT-291.)
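The bind rule above can be expressed as a small pre-flight check. This is a minimal sketch, not part of the stack: is_safe_bind is a hypothetical helper, and it assumes tailnet addresses come from the 100.64.0.0/10 block that Tailscale allocates from.

```python
import ipaddress

# Assumption: tailnet addresses are allocated from the CGNAT block 100.64.0.0/10.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def is_safe_bind(address: str) -> bool:
    """Return True only for loopback or tailnet addresses (hypothetical helper)."""
    ip = ipaddress.ip_address(address)
    if ip.is_unspecified:
        # 0.0.0.0 (or ::) means "every interface", which is exactly what to avoid.
        return False
    return ip.is_loopback or ip in TAILNET
```

Running it against the value of a *_PUBLIC_BIND variable before a deploy would reject 0.0.0.0 and any public address while accepting 127.0.0.1 and 100.124.170.43.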
How RAG connects¶
RAG builds the client from its settings (storage/client.py):
QdrantClient(
host=settings.qdrant_host, # (1)!
port=settings.qdrant_port, # (2)!
grpc_port=settings.qdrant_grpc_port, # (3)!
api_key=settings.qdrant_api_key or None, # (4)!
prefer_grpc=False, # (5)!
)
- QDRANT_HOST, resolves to dataland-qdrant on the internal dataland-network Docker bridge. The container name is the DNS name; this is why RAG reaches Qdrant regardless of host port binding.
- QDRANT_PORT, the REST port 6333 inside the container (published to the host as 4146).
- QDRANT_GRPC_PORT, 6334. Built into the client but unused at runtime because prefer_grpc=False.
- QDRANT_API_KEY, empty today, so this evaluates to None. The tailnet is the only trust boundary; there is no key to pass.
- Forces the REST transport on :6333. The gRPC port is published for tooling but the service itself never speaks gRPC.
| Setting (env) | Default |
|---|---|
| QDRANT_HOST | dataland-qdrant |
| QDRANT_PORT | 6333 |
| QDRANT_GRPC_PORT | 6334 |
| QDRANT_API_KEY | (empty) |
| KNOWLEDGE_COLLECTION | knowledge |
| IMAGES_COLLECTION | images |
| SCENES_COLLECTION | scenes |
prefer_grpc=False means RAG talks REST on :6333; the gRPC port is published for tooling but not used by the service.
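To make the defaults and the "api_key or None" behaviour concrete, here is a minimal sketch of how such settings could be read from the environment. QdrantSettings and load_settings are hypothetical names for illustration; the real settings class in RAG may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QdrantSettings:
    """Hypothetical settings holder mirroring the env table above."""
    qdrant_host: str
    qdrant_port: int
    qdrant_grpc_port: int
    qdrant_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> QdrantSettings:
    return QdrantSettings(
        qdrant_host=env.get("QDRANT_HOST", "dataland-qdrant"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        qdrant_grpc_port=int(env.get("QDRANT_GRPC_PORT", "6334")),
        # An empty string is falsy, so an unset or blank key becomes None.
        qdrant_api_key=env.get("QDRANT_API_KEY", "") or None,
    )
```

With QDRANT_API_KEY unset or blank, qdrant_api_key comes out as None, which matches the unauthenticated setup described above.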
The three collections¶
| Name | Points (approx.) | Purpose | Write path |
|---|---|---|---|
| knowledge | ~4969 | Museum docs, sections, scenes-as-text, blog posts, external articles, uploaded files | /ingest/file, GCS /sync, Information WebUI live-sync |
| images | ~1485 | Artwork images + extracted-document images | /ingest/image, GCS /sync, Information WebUI uploads |
| scenes | n/a (small) | Museum scene metadata embedded for semantic similarity | GCS /sync of museum/scenes/*.json |
Recent changes
- Knowledge grew 4839 → 4969 when the 20 museum sections + their scenes + the museum overview were (re-)ingested into Qdrant knowledge (today's change-set). The agent now answers room/section questions from this corpus and speaks the real room names, Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON), instead of bare codes (DAT-281).
- DAT-288: the default_reference_* placeholder images were purged from chapters.json and the GCS cobanov-public/chapters bucket. Those reference images are URL-only and were never embedded into Qdrant, so the images collection was unaffected.
Why three, and why they differ¶
All three share one vector geometry, but they are populated by different embedder calls and serve different query shapes:
flowchart LR
subgraph Qdrant[dataland-qdrant 3072d cosine]
K[knowledge<br/>dense + sparse BM25]
I[images<br/>dense only]
S[scenes<br/>dense only]
end
subgraph Embedder[GeminiEmbedder gemini-embedding-2-preview]
D[RETRIEVAL_DOCUMENT]
Q[RETRIEVAL_QUERY]
SS[SEMANTIC_SIMILARITY]
M[multimodal image]
end
D -->|chunk text| K
SS -->|scene description| S
M -->|image bytes| I
Q -.->|query text| K & I
M -.->|query image| I
- knowledge stores text chunks embedded with task type RETRIEVAL_DOCUMENT; queries use RETRIEVAL_QUERY. This is the only hybrid collection (see Sparse BM25).
- images stores image vectors embedded from the image bytes themselves (multimodal, no task type). It supports both image→image (search_by_image) and text→image (search_images_by_text, using a RETRIEVAL_QUERY text vector against the same image space).
- scenes embeds the scene description with SEMANTIC_SIMILARITY.
Vector configuration¶
Every collection is created with one shared spec (storage/schema.py):
VECTOR_SIZE = 3072 # (1)!
VECTOR_DISTANCE = Distance.COSINE # (2)!
COLLECTION_VECTOR_PARAMS = VectorParams(size=VECTOR_SIZE, distance=VECTOR_DISTANCE) # (3)!
- The Gemini gemini-embedding-2-preview output dimension. This is baked into every stored point; changing it forces a full drop-and-re-ingest of all collections (see the danger admonition below).
- Cosine distance, matching how Gemini embeddings are normalized. All three collections share this geometry.
- One shared VectorParams reused for knowledge, images, and scenes at creation time, so the dense space is identical across all three.
| Property | Value | Source |
|---|---|---|
| Dense dimensions | 3072 | EMBEDDING_DIM, VECTOR_SIZE |
| Distance | Cosine | VECTOR_DISTANCE |
| Embedding model | gemini-embedding-2-preview | EMBEDDING_MODEL |
| Batch size | 50 | EMBEDDING_BATCH_SIZE |
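For readers who want to see what the cosine geometry measures, the following is a minimal pure-Python sketch of cosine similarity. It is an illustration only: Qdrant computes this server-side, and cosine_similarity is a hypothetical helper that is not part of RAG.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The dimension check is the reason a 3072-dimensional query can only be compared against 3072-dimensional points. The score depends on direction, not magnitude, so scaling a vector leaves it unchanged.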
The embedding model name is gemini-embedding-2-preview, not gemini-2.5-flash
Vectors come from Gemini's embedding model (gemini-embedding-2-preview, the effective value in both .env.example and compose; the Python class default is the same and config.py's bare default is gemini-embedding-2). This is a separate model from the chat / captioning model, which is gemini-3.5-flash (GEMINI_MODEL, standardized in DAT-269). Do not conflate the two: the chat model never touches a vector, and the embedding model never writes a chat token.
Changing the embedding model is a full re-index
The 3072 dimension and the cosine geometry are baked into every stored point. Switching EMBEDDING_MODEL to a model with a different output dimension — or even the same dimension but a different latent space — makes all existing vectors incomparable to new queries. There is no in-place migration: you would drop and re-ingest every collection (hours of Gemini-billed embed calls). Treat EMBEDDING_MODEL as immutable for the life of a qdrant-data volume.
Sparse BM25 (knowledge only)¶
The knowledge collection gets a second, sparse named vector for lexical/keyword recall, fused with the dense channel via Reciprocal Rank Fusion server-side. It is provisioned at startup (ensure_collections → _ensure_sparse_support), and the provisioning is idempotent.
Modifier.IDF tells Qdrant to apply inverse-document-frequency weighting natively, server-side, so RAG only ships raw term vectors. The vector is named sparse (SPARSE_VECTOR_NAME); images and scenes never get one, which is why hybrid search only fires when the target collection is knowledge.
| Property | Value |
|---|---|
| Sparse vector name | sparse (SPARSE_VECTOR_NAME) |
| Modifier | IDF (inverse document frequency) — Qdrant computes IDF natively |
| Sparse model | Qdrant/bm25 (SPARSE_EMBEDDING_MODEL), run client-side via FastEmbed |
| Enabled by | HYBRID_SEARCH_ENABLED=true |
The BM25 sparse vectors themselves are produced client-side by RAG's FastEmbed SparseTextEmbedding (retrieval/sparse.py) and upserted as a second vector on each knowledge point; the IDF weighting and fusion happen inside Qdrant. images and scenes have no sparse vector — hybrid search only fires when the target collection is knowledge (Searcher._hybrid_query guards on collection == knowledge_collection).
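Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. The version below is a client-side illustration of the idea, not Qdrant's implementation: the fusion runs inside Qdrant, and the constant k=60 is an assumption taken from common RRF practice rather than from this stack's configuration. rrf_fuse and the sample ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


dense_hits = ["a", "b", "c"]   # hypothetical ids from the dense channel
sparse_hits = ["b", "c", "d"]  # hypothetical ids from the BM25 channel
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)  # ['b', 'c', 'a', 'd']
```

Ids that appear in both channels ("b" and "c") outrank ids that top only one channel, which is the behaviour hybrid search relies on.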
Backfilling sparse vectors
Documents ingested before hybrid was enabled have dense-only points. SPARSE_BACKFILL_ON_STARTUP=true (default off) walks the collection at boot and adds the missing sparse vector in batches of SPARSE_BACKFILL_BATCH_SIZE (64). Day-to-day, leave it off — fresh ingests write both vectors in one pass (embed_and_upsert_knowledge).
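The backfill walk can be pictured as "find points without a sparse vector, then process them in fixed-size batches". The sketch below shows only that control flow under stated assumptions: the point dictionaries and both helper names are hypothetical, and the real backfill reads from Qdrant rather than from an in-memory list.

```python
from typing import Any, Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def missing_sparse(points: Iterable[Dict[str, Any]], sparse_name: str = "sparse") -> List[Any]:
    """Ids of points whose vector map lacks the sparse vector (hypothetical shape)."""
    return [p["id"] for p in points if sparse_name not in p["vectors"]]


def batched(items: Iterable[T], batch_size: int = 64) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch
```

With 150 dense-only points and the default batch size of 64, this yields batches of 64, 64, and 22 ids.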
Payload indexes (knowledge only)¶
On startup RAG also ensures payload indexes on knowledge so filtered and phrase searches stay fast (_ensure_payload_indexes):
- Text indexes (multilingual tokenizer, lowercase, ASCII-folding, phrase matching, min_token_len=2) on content, title, section_title, source.
- Keyword indexes on category, source_type, doc_id, source_path.
These back the optional "text-scroll" lexical channel (TEXT_SEARCH_ENABLED, off by default since DAT-167 because it is redundant with BM25 and dominates latency at scale) and any MatchValue / MatchAny filters.
Payload shapes¶
The payload is the JSON metadata stored alongside each vector. RAG returns it on every search hit (minus the bulky content, which is hoisted to the result's content field) and uses several keys for dedup, filtering, and the rerank prompt.
knowledge payload¶
Written by embed_and_upsert_knowledge (ingestion/ingestors/document.py). Point IDs are random UUIDv4 (one per chunk).
| Key | Type | Notes |
|---|---|---|
| content | str | The chunk text (≈400 tokens, 60 overlap) |
| title | str | Derived from filename |
| source | str | Original filename or URL |
| category | str | documents, blog-posts, or external (derived from GCS prefix) |
| source_type | str | File extension (pdf, md, txt, …) |
| source_path | str | bucket/prefix/filename; dedup key |
| doc_id | str | doc_<12hex>; groups all chunks of one document |
| chunk_index | int | Position within the document |
| token_count | int | |
| section_title | str? | When the parser found a heading |
| page_number | int? | When available |
| tags | list[str] | |
| created_at | str | ISO-8601 |
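The "≈400 tokens, 60 overlap" note on content implies a sliding window that advances 340 tokens at a time. The following is a minimal sketch of that windowing, assuming the input is already tokenized; chunk_tokens is a hypothetical helper, and the real chunker's tokenizer and boundary rules are not described here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 400, overlap: int = 60) -> List[List[T]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[T]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks
```

For a 1000-token document this produces three chunks starting at offsets 0, 340, and 680, which would map to chunk_index 0, 1, and 2.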
WebUI museum payloads (entity_type, museum_slug)
The Information WebUI ingests museum sections/scenes/overview into knowledge via POST /ingest/file with an extra_metadata JSON blob, which RAG merges onto the base payload (any key outside the reserved set is passed through verbatim). That is how knowledge points pick up WebUI-specific keys like entity_type (section | scene | museum) and museum_slug, alongside the RAG slug convention museum-section-<slug> / museum-scene-<slug>. The reserved keys the caller may not override are content, title, source, source_path, source_type, category, doc_id, chunk_index, token_count, section_title, page_number, created_at, tags (a 400 is returned if any appears) — they are set by the ingestor and are load-bearing for dedup and the rerank pipeline.
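The merge rule for extra_metadata can be sketched as follows. This is an illustration under stated assumptions: merge_extra_metadata is a hypothetical name, and a ValueError stands in for the HTTP 400 that the real endpoint returns.

```python
from typing import Any, Dict

RESERVED_KEYS = frozenset({
    "content", "title", "source", "source_path", "source_type", "category",
    "doc_id", "chunk_index", "token_count", "section_title", "page_number",
    "created_at", "tags",
})


def merge_extra_metadata(base: Dict[str, Any], extra: Dict[str, Any]) -> Dict[str, Any]:
    """Pass non-reserved keys through verbatim; refuse any reserved key."""
    clashes = sorted(RESERVED_KEYS.intersection(extra))
    if clashes:
        # Stand-in for the 400 response described above.
        raise ValueError(f"reserved keys may not be overridden: {clashes}")
    return {**base, **extra}
```

A caller passing {"entity_type": "section", "museum_slug": "<slug>"} gets both keys merged onto the base payload, while a blob containing doc_id is rejected outright.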
images payload¶
Every image point — regardless of which of the three ingest paths produced it — shares the same 11 base keys via the unified build_image_point helper (DAT-172):
file_name, title, caption, keywords, tags, gs_uri, public_url, collection_name, artwork_id, source_path, ingested_at.
| Key | Type | Notes |
|---|---|---|
| gs_uri | str | Canonical gs://bucket/path URI |
| public_url | str | Full HTTPS URL (rendered by the chat client) |
| collection_name | str | Sub-folder under artworks/ for GCS ingest; the project_slug for WebUI uploads |
| artwork_id | str | Filename stem |
| source_path | str | dedup key |
| caption | str | Gemini-generated (gemini-3.5-flash) on GCS/extracted ingest; curator-supplied on WebUI |
The WebUI image path (/ingest/image) layers ~20 extra keys on top through extras, including the all-important project_slug (plus project_id, project_name, image_id, width/height, description, alt_text, source_type="webui-image", and provenance fields bucket, object_path, cdn_url, source_url, source_page). These extras keep the Qdrant payload in lockstep with the WebUI's DB schema.
Deterministic IDs for WebUI images (UUIDv5)
The GCS and extracted-document image paths use random UUIDv4 IDs. The WebUI path mints a deterministic UUIDv5 from source_path (namespace 4b7c8a3a-…-1c7a6b3a51d2), so re-ingesting the same image upserts in place rather than accumulating duplicates — load-bearing for the WebUI live-sync hook. The WebUI also issues DELETE /ingest/by-project-slug/{slug} (filtering on payload.project_slug) to wipe a project's footprint across both images and knowledge before a re-sync.
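The reason a UUIDv5 upserts in place is that it is a pure function of its namespace and name. A minimal sketch with Python's uuid module follows. The namespace is elided in this document, so uuid.NAMESPACE_URL is used purely as a placeholder, and webui_image_point_id is a hypothetical helper name.

```python
import uuid

# Placeholder only: the real namespace is not reproduced here, so the ids
# this sketch produces will not match the ones stored in Qdrant.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_URL


def webui_image_point_id(source_path: str) -> str:
    """Derive a stable point id from source_path (same input, same id)."""
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, source_path))


first = webui_image_point_id("bucket/prefix/example.png")   # hypothetical path
second = webui_image_point_id("bucket/prefix/example.png")
random_id = str(uuid.uuid4())  # what the GCS and extracted-document paths use
```

Here first and second are identical, so a second ingest overwrites the same point, whereas every uuid4 call yields a fresh id and therefore a fresh point.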
scenes payload¶
Written by ingest_scene. Point IDs are random UUIDv4.
| Key | Type | Notes |
|---|---|---|
| scene_id | str | From the JSON id, or filename stem |
| title, description, location | str | description is what gets embedded |
| tags | list[str] | |
| related_images | list[str] | |
| metadata | dict | Any extra JSON fields not mapped above |
| source_path | str | dedup key |
| created_at | str | ISO-8601 |
Search filtering¶
Callers may scope a /search with a filters map, but RAG enforces an allowlist (services/search_filters.py) — any key outside it is silently dropped so a caller cannot probe internal payload state or force expensive unindexed filters:
category, source_type, source, source_path, doc_id, tags,
collection_name, artwork_id, project_slug, project_id, project_name
Per-project search isolation depends on project_slug being on this list (DAT-232): filters={"project_slug": "<x>"} was previously dropped to an unfiltered search and leaked cross-project hits. Scalar values become MatchValue; lists become MatchAny. For the full retrieval flow (dense + RRF-hybrid + rerank), see RAG.
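The allowlist behaviour can be sketched without any Qdrant dependency. In the sketch below, sanitize_filters is a hypothetical name, and the string tags "match_value" and "match_any" stand in for Qdrant's MatchValue and MatchAny conditions; the real code in services/search_filters.py may differ in shape.

```python
from typing import Any, Dict, Tuple

ALLOWED_FILTER_KEYS = frozenset({
    "category", "source_type", "source", "source_path", "doc_id", "tags",
    "collection_name", "artwork_id", "project_slug", "project_id", "project_name",
})


def sanitize_filters(filters: Dict[str, Any]) -> Dict[str, Tuple[str, Any]]:
    """Drop keys outside the allowlist and classify each remaining value."""
    clean: Dict[str, Tuple[str, Any]] = {}
    for key, value in filters.items():
        if key not in ALLOWED_FILTER_KEYS:
            continue  # silently dropped, as described above
        if isinstance(value, (list, tuple, set)):
            clean[key] = ("match_any", list(value))
        else:
            clean[key] = ("match_value", value)
    return clean
```

A request filtering on project_slug keeps its scope, a list of tags becomes a match-any condition, and an unknown key such as an internal payload field simply disappears from the filter.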
Healthcheck quirk¶
The Qdrant image ships bash but no curl / wget, so the compose healthcheck uses bash's /dev/tcp builtin to TCP-connect to :6333, send a raw GET /healthz HTTP/1.0, read the response, and grep -q passed:
test:
- "CMD-SHELL"
- "bash -c 'exec 3<>/dev/tcp/localhost/6333 && printf \"GET /healthz HTTP/1.0\\r\\nHost: localhost\\r\\n\\r\\n\" >&3 && cat <&3 | grep -q passed'" # (1)!
interval: 10s
timeout: 5s
retries: 5
start_period: 15s # (2)!
- No curl/wget in the image, so this opens a TCP socket via bash's /dev/tcp builtin, hand-writes a raw HTTP/1.0 request to /healthz, and greps the response for passed. Qdrant binds :6333 only after storage init, so one check covers both "process up" and "ready to serve".
- 15s grace before failed checks count, giving the storage layer time to initialise and open :6333. Without it the container would flap unhealthy on a cold boot.
Qdrant binds :6333 only after the storage layer is initialised, so this single check catches both "process up" and "ready to serve".
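For anyone who finds the bash one-liner hard to read, the same probe can be written out in Python. This is an explanatory sketch only (the container runs the bash version), and the names REQUEST, response_is_healthy, and probe are hypothetical.

```python
import socket

# The same raw HTTP/1.0 request the compose healthcheck prints into the socket.
REQUEST = b"GET /healthz HTTP/1.0\r\nHost: localhost\r\n\r\n"


def response_is_healthy(raw: bytes) -> bool:
    """Equivalent of `grep -q passed` on the raw response."""
    return b"passed" in raw


def probe(host: str = "localhost", port: int = 6333, timeout: float = 5.0) -> bool:
    """Open a TCP socket, send the request, read until the server closes."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(REQUEST)
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break  # HTTP/1.0: the server closes after the response
                chunks.append(data)
    except OSError:
        # Connection refused or timed out: the port is not open yet.
        return False
    return response_is_healthy(b"".join(chunks))
```

A refused connection and a response without "passed" both count as unhealthy, which is how one check covers "process up" and "ready to serve".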
Reaching it¶
# Health (loopback, on the VDS):
curl -fsS http://localhost:4146/healthz # (1)!
# List collections:
curl -fsS http://localhost:4146/collections | jq
# Per-collection stats (vector count, status, config):
curl -fsS http://localhost:4146/collections/knowledge | jq
curl -fsS http://localhost:4146/collections/images | jq
curl -fsS http://localhost:4146/collections/scenes | jq
# Native Prometheus endpoint (scraped by dataland-prometheus):
curl -fsS http://localhost:4146/metrics # (2)!
- Port 4146 is the host publish of container :6333 (QDRANT_HTTP_PORT). -fsS fails on HTTP error, stays silent on progress, but still prints errors. No API key; the loopback/tailnet boundary is the only gate.
- Qdrant's native Prometheus exposition endpoint, scraped by dataland-prometheus. Same port 4146, no auth.
From a tailnet peer, swap localhost for spark (100.124.170.43) on the same 4146 / 4147 ports. There is no API key to pass — access is gated entirely by the network boundary.
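The per-collection curl calls can also be scripted with the standard library. The sketch below assumes the response carries result.status and result.points_count, which is how Qdrant's collection-info endpoint is commonly documented; collection_url, summarize, and fetch_summary are hypothetical helper names.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def collection_url(name: str, host: str = "localhost", port: int = 4146) -> str:
    """URL of the per-collection stats endpoint on the published REST port."""
    return f"http://{host}:{port}/collections/{name}"


def summarize(body: str) -> Dict[str, Any]:
    """Pick status and point count out of the JSON body (assumed field names)."""
    result = json.loads(body)["result"]
    return {"status": result.get("status"), "points_count": result.get("points_count")}


def fetch_summary(name: str, host: str = "localhost", port: int = 4146,
                  timeout: float = 5.0) -> Dict[str, Any]:
    """Fetch and summarize one collection; no API key is sent."""
    with urlopen(collection_url(name, host, port), timeout=timeout) as resp:
        return summarize(resp.read().decode("utf-8"))
```

Calling fetch_summary("knowledge", host="spark") from a tailnet peer would hit the same endpoint as the curl example, with spark substituted for localhost.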
Volumes + backup + recovery¶
Losing qdrant-data is a full re-ingest
Qdrant is the heaviest stateful tenant (DAT-80: 4 GB / 2 cores). If you lose the volume, RAG can rebuild everything from GCS — knowledge/images/scenes re-derive from the source buckets — but expect several hours of Gemini-billed embed + caption calls plus a full Information WebUI live-sync cycle to repopulate the WebUI-curated points. The legacy GCS image points and the WebUI image points are distinguishable by project_slug (WebUI points have it; ~66 legacy points do not). Backup procedure: dataland-infrastructure/reports/backup-restore.md.
See also¶
- RAG (dataland-rag-v2): the only reader/writer; hybrid search, reranking, ingest endpoints.
- Information WebUI: live-sync source of museum + project payloads.
- Agent: consumes retrieval via search_knowledge / search_artwork_images tools.