Qdrant¶

dataland-qdrant is the vector store that backs all of Dataland's retrieval. It holds three collections (knowledge, images, and scenes), each storing 3072-dimensional Gemini embeddings compared under cosine distance. The knowledge collection additionally carries a native BM25 sparse vector so RAG can fuse dense and lexical signals on the museum corpus.

Only dataland-rag-v2 reads and writes Qdrant. Every other service that needs retrieval (the Agent's search_knowledge / search_artwork_images tools, the Information WebUI's live-sync) goes through RAG's HTTP API, never to Qdrant directly.


Container	`dataland-qdrant`
Image	`qdrant/qdrant:v1.12.4` (`QDRANT_VERSION`)
HTTP (REST) port	`4146` → container `:6333` (`QDRANT_HTTP_PORT`)
gRPC port	`4147` → container `:6334` (`QDRANT_GRPC_PUBLIC_PORT`)
Host bind	`127.0.0.1` + `QDRANT_PUBLIC_BIND` (tailnet `100.124.170.43`). Never `0.0.0.0`
Internal URLs	`http://dataland-qdrant:6333`, `dataland-qdrant:6334` (gRPC)
Memory / CPU	`mem_limit 4 GB` (reservation 1 GB) / `cpus 2.0`
Persistence	Named volume `qdrant-data` → `/qdrant/storage`
API key	None (`QDRANT_API_KEY` empty) — the tailnet is the trust boundary

No API key — the network is the only fence

Qdrant runs with no authentication. That is deliberate but load-bearing: the container is published on 127.0.0.1 (local + SSH tunnel) and on the Tailscale interface (QDRANT_PUBLIC_BIND=100.124.170.43, exposed as spark:4146 / spark:4147), and must never bind 0.0.0.0. RAG reaches Qdrant over the internal dataland-network Docker bridge (QDRANT_HOST=dataland-qdrant) regardless of host binding, so the published host ports exist purely for operator inspection and direct tailnet peer access. If you ever flip a *_PUBLIC_BIND to a public interface, you have handed the whole vector store to the internet. (Bind discipline tightened in DAT-73 / DAT-291.)
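The bind rule can be expressed as a small pre-deploy check. This is a minimal sketch, not part of the Dataland tooling: `is_safe_bind` is a hypothetical helper, and it assumes that "safe" means loopback or an address inside the 100.64.0.0/10 range that Tailscale assigns node addresses from.

```python
import ipaddress

# Tailscale hands out node addresses from the CGNAT range 100.64.0.0/10.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_safe_bind(address: str) -> bool:
    """Return True only for loopback or tailnet addresses (hypothetical helper)."""
    ip = ipaddress.ip_address(address)
    if ip.is_unspecified:  # 0.0.0.0 or :: would listen on every interface
        return False
    return ip.is_loopback or (ip.version == 4 and ip in TAILNET_RANGE)
```

A deploy script could call this on each `*_PUBLIC_BIND` value and refuse to start when it returns False.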

How RAG connects¶

RAG builds the client from its settings (storage/client.py):

QdrantClient(
    host=settings.qdrant_host,        # (1)!
    port=settings.qdrant_port,        # (2)!
    grpc_port=settings.qdrant_grpc_port,  # (3)!
    api_key=settings.qdrant_api_key or None,  # (4)!
    prefer_grpc=False,                # (5)!
)

1. QDRANT_HOST. Resolves to dataland-qdrant on the internal dataland-network Docker bridge. The container name is the DNS name, which is why RAG reaches Qdrant regardless of host port binding.
2. QDRANT_PORT. The REST port 6333 inside the container (published to the host as 4146).
3. QDRANT_GRPC_PORT, 6334. Passed to the client but unused at runtime because prefer_grpc=False.
4. QDRANT_API_KEY. Empty today, so this evaluates to None. The tailnet is the only trust boundary; there is no key to pass.
5. Forces the REST transport on :6333. The gRPC port is published for tooling but the service itself never speaks gRPC.

Setting (env)	Default
`QDRANT_HOST`	`dataland-qdrant`
`QDRANT_PORT`	`6333`
`QDRANT_GRPC_PORT`	`6334`
`QDRANT_API_KEY`	(empty)
`KNOWLEDGE_COLLECTION`	`knowledge`
`IMAGES_COLLECTION`	`images`
`SCENES_COLLECTION`	`scenes`

prefer_grpc=False means RAG talks REST on :6333; the gRPC port is published for tooling but not used by the service.
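How the environment variables and defaults in the table might turn into client arguments can be sketched as follows. RAG's real settings class is not shown here, so `QdrantSettings` and `load_qdrant_settings` are hypothetical stand-ins; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QdrantSettings:
    host: str
    port: int
    grpc_port: int
    api_key: Optional[str]


def load_qdrant_settings(env: Mapping[str, str] = os.environ) -> QdrantSettings:
    """Read connection settings, falling back to the documented defaults."""
    return QdrantSettings(
        host=env.get("QDRANT_HOST", "dataland-qdrant"),
        port=int(env.get("QDRANT_PORT", "6333")),
        grpc_port=int(env.get("QDRANT_GRPC_PORT", "6334")),
        # An empty string collapses to None, mirroring `api_key or None`.
        api_key=env.get("QDRANT_API_KEY") or None,
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in isolation.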

The three collections¶

Name	Points (approx.)	Purpose	Write path
`knowledge`	~4969	Museum docs, sections, scenes-as-text, blog posts, external articles, uploaded files	`/ingest/file`, GCS `/sync`, Information WebUI live-sync
`images`	~1485	Artwork images + extracted-document images	`/ingest/image`, GCS `/sync`, Information WebUI uploads
`scenes`	— (small)	Museum scene metadata embedded for semantic similarity	GCS `/sync` of `museum/scenes/*.json`

Recent changes

Knowledge grew 4839 → 4969 when the 20 museum sections + their scenes + the museum overview were (re-)ingested into Qdrant knowledge (today's change-set). The agent now answers room/section questions from this corpus and speaks the real room names — Data Pavilion (GA), Latent Gallery (GB), Infinity Room (GC), The Sanctuary (GD), Discovery Portal (ON) — instead of bare codes (DAT-281).
DAT-288: the default_reference_* placeholder images were purged from chapters.json and the GCS cobanov-public/chapters bucket. Those reference images are URL-only and were never embedded into Qdrant, so the images collection was unaffected.

Why three, and why they differ¶

All three share one vector geometry, but they are populated by different embedder calls and serve different query shapes:

flowchart LR
    subgraph Qdrant[dataland-qdrant 3072d cosine]
        K[knowledge<br/>dense + sparse BM25]
        I[images<br/>dense only]
        S[scenes<br/>dense only]
    end
    subgraph Embedder[GeminiEmbedder gemini-embedding-2-preview]
        D[RETRIEVAL_DOCUMENT]
        Q[RETRIEVAL_QUERY]
        SS[SEMANTIC_SIMILARITY]
        M[multimodal image]
    end
    D -->|chunk text| K
    SS -->|scene description| S
    M -->|image bytes| I
    Q -.->|query text| K & I
    M -.->|query image| I

knowledge stores text chunks embedded with task type RETRIEVAL_DOCUMENT; queries use RETRIEVAL_QUERY. This is the only hybrid collection — see Sparse BM25.
images stores image vectors embedded from the image bytes themselves (multimodal, no task type). It supports both image→image (search_by_image) and text→image (search_images_by_text, using a RETRIEVAL_QUERY text vector against the same image space).
scenes embeds the scene description with SEMANTIC_SIMILARITY.
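The task-type split described above can be summarised as a lookup table. This is an illustrative sketch only: the dictionaries and `text_query_task_type` are hypothetical names, and the query-side table lists only the two collections for which a text-query task type is stated.

```python
# Task type used when a vector is written into each collection.
# None marks the multimodal image path, which takes no task type.
INDEX_TASK_TYPE = {
    "knowledge": "RETRIEVAL_DOCUMENT",
    "images": None,
    "scenes": "SEMANTIC_SIMILARITY",
}

# Task type for text queries, limited to the collections covered above.
TEXT_QUERY_TASK_TYPE = {
    "knowledge": "RETRIEVAL_QUERY",
    "images": "RETRIEVAL_QUERY",
}


def text_query_task_type(collection: str) -> str:
    """Return the query task type, or fail loudly for undocumented collections."""
    try:
        return TEXT_QUERY_TASK_TYPE[collection]
    except KeyError:
        raise ValueError(
            f"no documented text-query task type for {collection!r}"
        ) from None
```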

Vector configuration¶

Every collection is created with one shared spec (storage/schema.py):

VECTOR_SIZE = 3072  # (1)!
VECTOR_DISTANCE = Distance.COSINE  # (2)!
COLLECTION_VECTOR_PARAMS = VectorParams(size=VECTOR_SIZE, distance=VECTOR_DISTANCE)  # (3)!

1. The Gemini gemini-embedding-2-preview output dimension. This is baked into every stored point, so changing it forces a full drop-and-re-ingest of all collections (see the danger admonition below).
2. Cosine distance, matching how Gemini embeddings are normalized. All three collections share this geometry.
3. One shared VectorParams reused for knowledge, images, and scenes at creation time, so the dense space is identical across all three.

Property	Value	Source
Dense dimensions	3072	`EMBEDDING_DIM`, `VECTOR_SIZE`
Distance	Cosine	`VECTOR_DISTANCE`
Embedding model	`gemini-embedding-2-preview`	`EMBEDDING_MODEL`
Batch size	50	`EMBEDDING_BATCH_SIZE`
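To make the shared geometry concrete, here is a plain-Python sketch of cosine similarity. It is not how Qdrant computes scores internally; it only illustrates that the measure ignores vector length and that two vectors of different dimension cannot be compared at all. `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A 3072-dimensional query against a stored vector of any other size raises immediately, which is the small-scale version of why a dimension change means re-embedding everything.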

The embedding model name is gemini-embedding-2-preview, not gemini-2.5-flash

Vectors come from Gemini's embedding model, gemini-embedding-2-preview. That is the effective value in both .env.example and compose, and the Python class default matches it; config.py's bare default, by contrast, is gemini-embedding-2. This is a separate model from the chat / captioning model, which is gemini-3.5-flash (GEMINI_MODEL, standardized in DAT-269). Do not conflate the two: the chat model never touches a vector, and the embedding model never writes a chat token.

Changing the embedding model is a full re-index

The 3072 dimension and the cosine geometry are baked into every stored point. Switching EMBEDDING_MODEL to a model with a different output dimension — or even the same dimension but a different latent space — makes all existing vectors incomparable to new queries. There is no in-place migration: you would drop and re-ingest every collection (hours of Gemini-billed embed calls). Treat EMBEDDING_MODEL as immutable for the life of a qdrant-data volume.

Sparse BM25 (`knowledge` only)¶

The knowledge collection gets a second, sparse named vector for lexical/keyword recall, fused with the dense channel via Reciprocal Rank Fusion server-side. It is provisioned at startup (ensure_collections → _ensure_sparse_support) and is idempotent:

sparse_vectors_config = {
    "sparse": SparseVectorParams(modifier=Modifier.IDF),  # (1)!
}

Modifier.IDF tells Qdrant to apply inverse-document-frequency weighting natively, server-side, so RAG only ships raw term vectors. The vector is named sparse (SPARSE_VECTOR_NAME); images and scenes never get one, which is why hybrid search only fires when the target collection is knowledge.

Property	Value
Sparse vector name	`sparse` (`SPARSE_VECTOR_NAME`)
Modifier	IDF (inverse document frequency) — Qdrant computes IDF natively
Sparse model	`Qdrant/bm25` (`SPARSE_EMBEDDING_MODEL`), run client-side via FastEmbed
Enabled by	`HYBRID_SEARCH_ENABLED=true`

The BM25 sparse vectors themselves are produced client-side by RAG's FastEmbed SparseTextEmbedding (retrieval/sparse.py) and upserted as a second vector on each knowledge point; the IDF weighting and fusion happen inside Qdrant. images and scenes have no sparse vector — hybrid search only fires when the target collection is knowledge (Searcher._hybrid_query guards on collection == knowledge_collection).
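Fusion itself runs inside Qdrant, but the idea behind Reciprocal Rank Fusion is easy to show in a few lines. The sketch below is a standalone illustration, not Qdrant's implementation: `rrf_fuse` is a hypothetical function, and `k=60` is a commonly used constant chosen here as an assumption (the value Qdrant uses internally is not stated above).

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


dense_hits = ["a", "b", "c"]   # ranked by cosine similarity
sparse_hits = ["b", "d", "a"]  # ranked by BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

A point that ranks well in both channels ("b" here) rises above one that tops only a single channel, which is the behaviour hybrid search relies on.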

Backfilling sparse vectors

Documents ingested before hybrid was enabled have dense-only points. SPARSE_BACKFILL_ON_STARTUP=true (default off) walks the collection at boot and adds the missing sparse vector in batches of SPARSE_BACKFILL_BATCH_SIZE (64). Day-to-day, leave it off — fresh ingests write both vectors in one pass (embed_and_upsert_knowledge).
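The backfill loop boils down to "find points without a sparse vector, then process them in fixed-size batches". The following sketch shows that shape under stated assumptions: points are modelled as plain dicts with `id` and `vectors` keys, which is a simplification rather than the actual client record type, and `missing_sparse_batches` is a hypothetical name.

```python
from typing import Dict, Iterable, Iterator, List


def missing_sparse_batches(
    points: Iterable[Dict],
    batch_size: int = 64,
    sparse_name: str = "sparse",
) -> Iterator[List[str]]:
    """Yield IDs of points lacking the sparse vector, batch_size at a time."""
    batch: List[str] = []
    for point in points:
        if sparse_name in point.get("vectors", {}):
            continue  # already hybrid-ready, nothing to backfill
        batch.append(point["id"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```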

Payload indexes (`knowledge` only)¶

On startup RAG also ensures payload indexes on knowledge so filtered and phrase searches stay fast (_ensure_payload_indexes):

Text indexes (multilingual tokenizer, lowercase, ASCII-folding, phrase matching, min_token_len=2) on content, title, section_title, source.
Keyword indexes on category, source_type, doc_id, source_path.

These back the optional "text-scroll" lexical channel (TEXT_SEARCH_ENABLED, off by default since DAT-167 because it is redundant with BM25 and dominates latency at scale) and any MatchValue / MatchAny filters.

Payload shapes¶

The payload is the JSON metadata stored alongside each vector. RAG returns it on every search hit (minus the bulky content, which is hoisted to the result's content field) and uses several keys for dedup, filtering, and the rerank prompt.

`knowledge` payload¶

Written by embed_and_upsert_knowledge (ingestion/ingestors/document.py). Point IDs are random UUIDv4 (one per chunk).

Key	Type	Notes
`content`	str	The chunk text (≈400 tokens, 60 overlap)
`title`	str	Derived from filename
`source`	str	Original filename or URL
`category`	str	`documents` \| `blog-posts` \| `external` (derived from GCS prefix)
`source_type`	str	File extension (`pdf`, `md`, `txt`, …)
`source_path`	str	`bucket/prefix/filename` — dedup key
`doc_id`	str	`doc_<12hex>` — groups all chunks of one document
`chunk_index`	int	Position within the document
`token_count`	int
`section_title`	str?	When the parser found a heading
`page_number`	int?	When available
`tags`	list[str]
`created_at`	str	ISO-8601
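The chunk geometry in the `content` row (about 400 tokens with 60 overlap) corresponds to a sliding window that advances 340 tokens at a time. The sketch below illustrates that arithmetic on an already-tokenized list; it is an assumption about the general technique, not a copy of RAG's chunker, and `chunk_tokens` is a hypothetical name.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 400, overlap: int = 60) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

Each window's position in the returned list is what a `chunk_index` would record.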

WebUI museum payloads (entity_type, museum_slug)

The Information WebUI ingests museum sections/scenes/overview into knowledge via POST /ingest/file with an extra_metadata JSON blob, which RAG merges onto the base payload (any key outside the reserved set is passed through verbatim). That is how knowledge points pick up WebUI-specific keys like entity_type (section | scene | museum) and museum_slug, alongside the RAG slug convention museum-section-<slug> / museum-scene-<slug>. The reserved keys the caller may not override are content, title, source, source_path, source_type, category, doc_id, chunk_index, token_count, section_title, page_number, created_at, tags (a 400 is returned if any appears) — they are set by the ingestor and are load-bearing for dedup and the rerank pipeline.
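The merge rule can be sketched as a guard followed by a dict merge. This is a minimal illustration, not the ingestor's code: `merge_extra_metadata` and `ReservedKeyError` are hypothetical names, and the HTTP 400 is represented here by a raised exception.

```python
from typing import Any, Dict, Mapping

RESERVED_KEYS = frozenset({
    "content", "title", "source", "source_path", "source_type", "category",
    "doc_id", "chunk_index", "token_count", "section_title", "page_number",
    "created_at", "tags",
})


class ReservedKeyError(ValueError):
    """Stand-in for the 400 response when a caller overrides a reserved key."""


def merge_extra_metadata(base: Mapping[str, Any], extra: Mapping[str, Any]) -> Dict[str, Any]:
    """Layer caller-supplied keys onto the base payload, rejecting reserved ones."""
    clashes = sorted(RESERVED_KEYS.intersection(extra))
    if clashes:
        raise ReservedKeyError(f"reserved keys may not be overridden: {clashes}")
    return {**base, **extra}
```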

`images` payload¶

Every image point — regardless of which of the three ingest paths produced it — shares the same 11 base keys via the unified build_image_point helper (DAT-172):

file_name, title, caption, keywords, tags, gs_uri, public_url, collection_name, artwork_id, source_path, ingested_at.

Key	Type	Notes
`gs_uri`	str	Canonical `gs://bucket/path` URI
`public_url`	str	Full HTTPS URL (rendered by the chat client)
`collection_name`	str	Sub-folder under `artworks/` for GCS ingest; the `project_slug` for WebUI uploads
`artwork_id`	str	Filename stem
`source_path`	str	dedup key
`caption`	str	Gemini-generated (`gemini-3.5-flash`) on GCS/extracted ingest; curator-supplied on WebUI

The WebUI image path (/ingest/image) layers ~20 extra keys on top through extras, including the all-important project_slug (plus project_id, project_name, image_id, width/height, description, alt_text, source_type="webui-image", and provenance fields bucket, object_path, cdn_url, source_url, source_page). These extras keep the Qdrant payload in lockstep with the WebUI's DB schema.

Deterministic IDs for WebUI images (UUIDv5)

The GCS and extracted-document image paths use random UUIDv4 IDs. The WebUI path mints a deterministic UUIDv5 from source_path (namespace 4b7c8a3a-…-1c7a6b3a51d2), so re-ingesting the same image upserts in place rather than accumulating duplicates — load-bearing for the WebUI live-sync hook. The WebUI also issues DELETE /ingest/by-project-slug/{slug} (filtering on payload.project_slug) to wipe a project's footprint across both images and knowledge before a re-sync.
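Deterministic IDs of this kind come straight from the standard library. In the sketch below, `EXAMPLE_NAMESPACE` is a stand-in (the real namespace UUID is elided above and is not reproduced), and `webui_image_point_id` is a hypothetical helper name.

```python
import uuid

# Stand-in namespace for illustration only; the real namespace is elided above.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_URL


def webui_image_point_id(source_path: str, namespace: uuid.UUID = EXAMPLE_NAMESPACE) -> str:
    """Derive a stable UUIDv5 point ID from an image's source_path."""
    return str(uuid.uuid5(namespace, source_path))
```

Because the same `source_path` always yields the same ID, a repeated upsert overwrites the existing point instead of creating a second one.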

`scenes` payload¶

Written by ingest_scene. Point IDs are random UUIDv4.

Key	Type	Notes
`scene_id`	str	From the JSON `id`, or filename stem
`title`, `description`, `location`	str	`description` is what gets embedded
`tags`	list[str]
`related_images`	list[str]
`metadata`	dict	Any extra JSON fields not mapped above
`source_path`	str	dedup key
`created_at`	str	ISO-8601

Search filtering¶

Callers may scope a /search with a filters map, but RAG enforces an allowlist (services/search_filters.py) — any key outside it is silently dropped so a caller cannot probe internal payload state or force expensive unindexed filters:

category, source_type, source, source_path, doc_id, tags,
collection_name, artwork_id, project_slug, project_id, project_name

Per-project search isolation depends on project_slug being on this list (DAT-232): before it was added, filters={"project_slug": "<x>"} was silently dropped, so the search ran unfiltered and leaked cross-project hits. Scalar values become MatchValue; lists become MatchAny. For the full retrieval flow (dense + RRF-hybrid + rerank), see RAG.
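The allowlist and the scalar/list mapping can be sketched without the client library by emitting the JSON shape that Qdrant's REST filter accepts (`match.value` for a single value, `match.any` for a list). `build_filter` is a hypothetical function for illustration; the real logic lives in services/search_filters.py and may differ in detail.

```python
from typing import Any, Dict, Mapping, Optional

ALLOWED_FILTER_KEYS = frozenset({
    "category", "source_type", "source", "source_path", "doc_id", "tags",
    "collection_name", "artwork_id", "project_slug", "project_id", "project_name",
})


def build_filter(filters: Mapping[str, Any]) -> Optional[Dict[str, Any]]:
    """Translate a caller's filters map into a Qdrant-style `must` filter."""
    must = []
    for key, value in filters.items():
        if key not in ALLOWED_FILTER_KEYS:
            continue  # silently drop anything outside the allowlist
        if isinstance(value, (list, tuple, set)):
            must.append({"key": key, "match": {"any": list(value)}})
        else:
            must.append({"key": key, "match": {"value": value}})
    # No surviving conditions means the search runs unfiltered.
    return {"must": must} if must else None
```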

Healthcheck quirk¶

The Qdrant image ships bash but no curl / wget, so the compose healthcheck uses bash's built-in /dev/tcp redirection to open a TCP connection to :6333, send a raw GET /healthz HTTP/1.0 request, read the response, and grep it for passed:

test:
  - "CMD-SHELL"
  - "bash -c 'exec 3<>/dev/tcp/localhost/6333 && printf \"GET /healthz HTTP/1.0\\r\\nHost: localhost\\r\\n\\r\\n\" >&3 && cat <&3 | grep -q passed'"  # (1)!
interval: 10s
timeout: 5s
retries: 5
start_period: 15s  # (2)!

1. No curl/wget in the image, so this opens a TCP socket via bash's built-in /dev/tcp redirection, hand-writes a raw HTTP/1.0 request to /healthz, and greps the response for passed. Qdrant binds :6333 only after storage init, so one check covers both "process up" and "ready to serve".
2. 15s grace before failed checks count, giving the storage layer time to initialise and open :6333. Without it the container would flap unhealthy on a cold boot.

Qdrant binds :6333 only after the storage layer is initialised, so this single check catches both "process up" and "ready to serve".
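For operators who want the same probe outside the container, the bash one-liner translates to a few lines of standard-library Python. This is a sketch of an equivalent check, not something the compose file uses; the function names are hypothetical.

```python
import socket


def build_healthz_request(host: str = "localhost") -> bytes:
    """The same raw HTTP/1.0 request the compose healthcheck writes by hand."""
    return f"GET /healthz HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def response_is_healthy(raw: bytes) -> bool:
    """Equivalent of `grep -q passed` on the raw response."""
    return b"passed" in raw


def check_qdrant(host: str = "localhost", port: int = 6333, timeout: float = 5.0) -> bool:
    """Connect, send the request, read until the server closes, then inspect."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_healthz_request(host))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # HTTP/1.0: the server closes after responding
                    break
                chunks.append(data)
    except OSError:  # refused, unreachable, or timed out
        return False
    return response_is_healthy(b"".join(chunks))
```

From the host, the published port applies instead, for example `check_qdrant("localhost", 4146)`.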

Reaching it¶

# Health (loopback, on the VDS):
curl -fsS http://localhost:4146/healthz  # (1)!

# List collections:
curl -fsS http://localhost:4146/collections | jq

# Per-collection stats (vector count, status, config):
curl -fsS http://localhost:4146/collections/knowledge | jq
curl -fsS http://localhost:4146/collections/images   | jq
curl -fsS http://localhost:4146/collections/scenes   | jq

# Native Prometheus endpoint (scraped by dataland-prometheus):
curl -fsS http://localhost:4146/metrics  # (2)!

1. Port 4146 is the host publish of container :6333 (QDRANT_HTTP_PORT). -fsS fails on HTTP error, stays silent on progress, but still prints errors. No API key: the loopback/tailnet boundary is the only gate.
2. Qdrant's native Prometheus exposition endpoint, scraped by dataland-prometheus. Same port 4146, no auth.

From a tailnet peer, swap localhost for spark (100.124.170.43) on the same 4146 / 4147 ports. There is no API key to pass — access is gated entirely by the network boundary.

Volumes + backup + recovery¶

qdrant-data  -> /qdrant/storage   (named volume, stateful)

Losing qdrant-data is a full re-ingest

Qdrant is the heaviest stateful tenant (DAT-80: 4 GB / 2 cores). If you lose the volume, RAG can rebuild everything from GCS — knowledge/images/scenes re-derive from the source buckets — but expect several hours of Gemini-billed embed + caption calls plus a full Information WebUI live-sync cycle to repopulate the WebUI-curated points. The legacy GCS image points and the WebUI image points are distinguishable by project_slug (WebUI points have it; ~66 legacy points do not). Backup procedure: dataland-infrastructure/reports/backup-restore.md.
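When auditing a restore, the legacy and WebUI image points can be told apart by checking for `project_slug` in the payload. A minimal sketch, assuming payloads have already been fetched as plain dicts (`split_image_points` is a hypothetical helper):

```python
from typing import Any, Dict, Iterable, List, Tuple

Payload = Dict[str, Any]


def split_image_points(payloads: Iterable[Payload]) -> Tuple[List[Payload], List[Payload]]:
    """Partition image payloads into (webui, legacy) by presence of project_slug."""
    webui: List[Payload] = []
    legacy: List[Payload] = []
    for payload in payloads:
        # WebUI uploads carry project_slug; legacy GCS points do not.
        (webui if payload.get("project_slug") else legacy).append(payload)
    return webui, legacy
```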
