Semantic hybrid search approach in VDO


Semantic Vector Search on Top of SQLite FTS in VDO

Search in a VDO looks straightforward until the library becomes large and the user stops searching with exact words.

In my Electron-based VDO, I already had a strong local search foundation built on SQLite FTS. It was fast, lightweight, offline-friendly, and excellent for lexical retrieval. For many queries, especially short and exact ones, it worked very well.

But over time I ran into a recurring systems problem:

lexical retrieval is efficient at matching tokens, but users often search by meaning.

That gap is what led me to introduce vector search as a second retrieval signal.

This was not a case of replacing full-text search with AI. It was a case of designing a hybrid retrieval architecture where lexical indexing and semantic similarity each solve the part of the problem they are best suited for.


The Problem I Was Solving

My library supports search across several content surfaces:

  • title
  • file name
  • tags
  • generated metadata
  • transcript content

The challenge is that these data sources are structurally different.

A title is short and intentional.
A file name is literal and often noisy.
A tag is curated and sparse.
A transcript is verbose and inconsistent.
A description is denser, more focused, and usually closer to a summary of what the video is actually about.

This difference matters because retrieval quality depends heavily on the shape of the indexed text.

Where SQLite FTS performed well

SQLite FTS was already doing a very good job for:

  • exact token matching
  • prefix matching
  • phrase-style retrieval
  • proximity-aware lookup
  • fast local ranking over text fields

For direct searches, this is ideal. If the user knows the title, remembers a transcript phrase, or uses a precise keyword, lexical search is usually the correct tool.

Where lexical retrieval started to break down

The limitation showed up when the query became more conceptual.

For example, a user might search with wording like:

  • “video about restoring old camera footage”
  • “someone explaining why analog visuals feel softer”
  • “repairing a vehicle engine step by step”

But the indexed metadata might contain different wording:

  • “vintage film cleanup”
  • “image softness from legacy optics”
  • “starter replacement and maintenance walkthrough”

These are semantically related, but not necessarily lexically aligned.

From an information-retrieval perspective, the issue was clear:

  • the user intent was semantically relevant
  • the text corpus contained related meaning
  • the token overlap was weak

So even though the correct result existed, lexical retrieval alone did not always rank it appropriately.
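
To make the gap concrete, here is a small sketch that measures token overlap between one of the example queries above and its matching metadata. The `tokenize` and `tokenOverlap` helpers are illustrative assumptions; they are much simpler than the tokenizer SQLite FTS uses.

```typescript
// Illustrative only: a naive tokenizer, not the one SQLite FTS uses.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Fraction of query tokens that also appear in the document text.
function tokenOverlap(query: string, doc: string): number {
  const docTokens = new Set(tokenize(doc));
  const queryTokens = tokenize(query);
  if (queryTokens.length === 0) return 0;
  const shared = queryTokens.filter((t) => docTokens.has(t)).length;
  return shared / queryTokens.length;
}

const overlap = tokenOverlap(
  "repairing a vehicle engine step by step",
  "starter replacement and maintenance walkthrough"
);
// overlap is 0: no query token appears in the indexed text
```

With zero shared tokens, a purely lexical index has nothing to match on, even though the two strings describe the same kind of video.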


Why I Did Not Replace FTS

It is easy to look at a limitation in lexical search and conclude that vector search should replace it entirely.

I did not take that route.

There were three reasons for that.

1. Lexical retrieval still provides the best precision for exact queries

If a user searches for a known phrase, title fragment, filename, or tag, a token-based inverted index is still the fastest and most precise way to retrieve that result.

2. Vector search changes the query-time cost model

Once embeddings enter the system, the cost model changes. Query processing is no longer just “parse terms and hit an index.” It becomes:

  • transform query text into an embedding vector
  • load candidate vectors
  • compute similarity scores
  • rank by vector proximity
  • apply fallback logic when confidence is low

That is manageable, but it should be used intentionally.

3. I wanted a local-first architecture that scales cleanly

This is a desktop app, not a distributed search cluster. I wanted the system to remain:

  • responsive
  • resource-aware
  • offline-capable
  • operationally simple
  • predictable under library growth

So instead of replacing lexical search, I designed vector search as a second-stage relevance mechanism.

That ended up being the right architectural tradeoff.


The Core Design: Hybrid Retrieval

The final architecture is a hybrid retrieval pipeline.

At a high level, the search system combines:

  • lexical retrieval for fast candidate generation
  • vector-based similarity for semantic reranking
  • fallback to lexical ranking when semantic confidence is weak

In practical terms, I treat search as two separate problems:

1. Candidate generation

This stage answers:

Which items are plausible matches?

This is where SQLite FTS excels. It is fast, efficient, and highly scalable for narrowing the corpus.

2. Candidate reranking

This stage answers:

Among the plausible matches, which ones are semantically closest to the user’s intent?

This is where vector search adds value.

This separation is important. It keeps the system computationally efficient and also makes the ranking behavior easier to reason about.

Rather than forcing vector search to scan the entire corpus, I use lexical search to reduce the search space first. Then I apply semantic similarity only to the reduced candidate set.

That is the single most important systems decision in the design.
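
A minimal sketch of the reranking stage, assuming each candidate returned by the lexical stage already has a stored embedding. The `Candidate` shape, the `rerank` helper, and the choice of cosine similarity are assumptions for illustration; toy two-dimensional vectors stand in for real embeddings.

```typescript
interface Candidate {
  id: string;
  embedding: number[]; // assumed: precomputed vector for the item
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Stage 2: score only the candidates that stage 1 (lexical search) returned.
function rerank(queryEmbedding: number[], candidates: Candidate[]): string[] {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .map((c) => c.id);
}

// Toy 2-dimensional vectors stand in for real embeddings.
const lexicalCandidates: Candidate[] = [
  { id: "video-a", embedding: [1, 0] },
  { id: "video-b", embedding: [0.6, 0.8] },
  { id: "video-c", embedding: [0, 1] },
];
const reranked = rerank([0, 1], lexicalCandidates);
// reranked: ["video-c", "video-b", "video-a"]
```

The work done here is proportional to the number of candidates passed in, which is the property the rest of the design depends on.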


Why I Chose Description-Centric Embeddings

One of the most important implementation decisions was choosing what text should become the embedding source.

I did not embed everything indiscriminately.

Instead, I generate embeddings only from a normalized description, which is itself derived from the video description.

That decision was deliberate.

Why not embed raw transcripts?

Transcripts are valuable for keyword retrieval, but they are not always ideal as the direct source for embeddings:

  • they can be very long
  • they can contain noise
  • they often include incidental or low-signal wording
  • their semantic center can be diluted by conversational structure

Embedding raw transcripts would increase storage and processing costs while also making the semantic signal less stable.

Why not embed all metadata fields together?

Combining title, tags, transcript, and description into one embedding input sounds attractive, but in practice it introduces blending problems:

  • short, high-signal fields get mixed with noisy fields
  • ranking behavior becomes harder to interpret
  • changes to one metadata source can disproportionately affect vector meaning
  • the embedding input becomes less uniform from one video to the next

Why the normalized description worked better

The description is already the most focused summary-level representation of the video. I take that description and produce a normalized description specifically optimized for embedding and retrieval.

That normalized description becomes the canonical semantic representation.

This gives me several advantages:

  • a consistent embedding source
  • lower token volume
  • reduced semantic noise
  • better comparability across videos
  • more predictable vector-space behavior

From a system-design perspective, this is effectively a semantic abstraction layer over raw metadata.

Instead of embedding whatever text happens to exist, I embed a curated, normalized representation of video meaning.

That improved both quality and operational efficiency.


The Role of the Normalized Description

The normalized description is central to the semantic layer.

Its job is not to be creative or verbose. Its job is to act as a retrieval-oriented semantic summary.

I treat it as a compact textual representation of the most searchable meaning in the video description.

That distinction matters.

A good embedding source should be:

  • semantically dense
  • concise
  • stable
  • comparable across documents
  • free from unnecessary stylistic variation

If the source text is inconsistent, embeddings become less reliable. If the source text is too long, similarity can become noisier. If the source text includes irrelevant detail, vector quality degrades.

By normalizing the description first, I create a more uniform vector corpus.

That makes vector similarity more meaningful.
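
How the normalization is performed is not spelled out here, so the following is only a hypothetical rule-based sketch of the idea: strip low-signal fragments, collapse formatting, and cap the length so every video contributes a similarly shaped input. The `normalizeDescription` name, the rules, and the 60-word cap are all assumptions.

```typescript
// Hypothetical rule-based normalizer; the actual normalization step may differ.
function normalizeDescription(raw: string, maxWords: number = 60): string {
  const cleaned = raw
    .replace(/https?:\/\/\S+/g, " ") // drop URLs
    .replace(/[#@]\w+/g, " ")        // drop hashtags and mentions
    .replace(/\s+/g, " ")            // collapse whitespace and line breaks
    .trim()
    .toLowerCase();
  if (cleaned.length === 0) return "";
  // Cap length so every video contributes a similarly sized embedding input.
  return cleaned.split(" ").slice(0, maxWords).join(" ");
}

const normalized = normalizeDescription(
  "Vintage film CLEANUP walkthrough.\n\nhttps://example.com  #restoration"
);
// normalized: "vintage film cleanup walkthrough."
```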


How Vector Search Fits into the Retrieval Pipeline

Once I had a reliable embedding source, the vector search architecture became much cleaner.

The pipeline works like this conceptually:

  1. the user submits a query
  2. the query is normalized
  3. lexical retrieval generates an initial candidate set
  4. if the query pattern suggests that semantic reasoning would help, the query is embedded
  5. candidate video embeddings are loaded
  6. vector similarity is computed
  7. results are reranked by semantic proximity
  8. if vector reranking does not improve the result set, the lexical ordering remains the fallback

This is not pure vector retrieval in the ANN-database sense. It is a candidate-constrained vector reranking system.

That distinction is important because it shapes both performance and scaling behavior.

I am not asking the semantic layer to perform full-corpus retrieval on every query. I am using it as a precision-enhancement layer after a cheaper retrieval stage has already done the narrowing.

That is why the architecture remains practical in a local application.
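
The steps above can be sketched as a single orchestration function. Every member of `SearchDeps` is a hypothetical hook standing in for real application code (the FTS query, the embedding model, the vector store), and the sketch is synchronous for brevity where a real implementation would likely be asynchronous.

```typescript
interface Hit { id: string; }

// All of these are hypothetical hooks; each would wrap real app code.
interface SearchDeps {
  normalizeQuery(q: string): string;
  lexicalSearch(q: string, limit: number): Hit[];       // e.g. an FTS query
  wantsSemantic(q: string): boolean;                    // routing heuristic
  embedQuery(q: string): number[];                      // embedding model call
  loadEmbeddings(ids: string[]): Map<string, number[]>; // stored vectors
  similarity(a: number[], b: number[]): number;
}

function hybridSearch(rawQuery: string, deps: SearchDeps, limit: number = 50): Hit[] {
  const query = deps.normalizeQuery(rawQuery);         // step 2
  const candidates = deps.lexicalSearch(query, limit); // step 3
  if (candidates.length === 0 || !deps.wantsSemantic(query)) {
    return candidates; // lexical order is used as-is
  }
  const queryVec = deps.embedQuery(query);                          // step 4
  const vectors = deps.loadEmbeddings(candidates.map((c) => c.id)); // step 5

  const scored: { hit: Hit; score: number }[] = [];
  const unscored: Hit[] = [];
  for (const hit of candidates) {
    const vec = vectors.get(hit.id);
    if (vec) scored.push({ hit, score: deps.similarity(queryVec, vec) }); // step 6
    else unscored.push(hit); // no embedding yet: keep lexical position
  }
  // Step 8: with no usable vectors, the lexical ordering stands.
  if (scored.length === 0) return candidates;
  scored.sort((a, b) => b.score - a.score);                         // step 7
  return scored.map((s) => s.hit).concat(unscored);
}
```

Passing the stages in as dependencies keeps each one replaceable, which matches the goal of tuning the lexical and semantic layers independently.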


The Complexity Behind One Search Box

What looks like one search box in the UI is actually a multi-stage retrieval system with several distinct responsibilities:

  • query normalization
  • lexical candidate generation
  • candidate expansion when strict retrieval is too narrow
  • vector query construction
  • embedding lookup
  • similarity scoring
  • result reranking
  • duplicate reconciliation
  • source-aware result grouping
  • UI hydration and filtering
  • fallback behavior when semantic ranking adds no value

This complexity is unavoidable once you want search to be both fast and meaning-aware.

The key was not reducing the number of moving parts. The key was isolating responsibilities cleanly so each layer could be tuned independently.

That separation has been one of the biggest engineering wins in the project.

It means I can:

  • tune lexical retrieval without changing vector logic
  • adjust embedding quality without touching UI code
  • optimize caching independently from ranking
  • evolve search policy based on query type
  • maintain predictable fallback behavior

In other words, the architecture is not simple because the problem is simple. It is maintainable because the complexity is partitioned well.


Optimizations I Added on Top

After the hybrid model was in place, I focused heavily on reducing cost and preserving responsiveness.

Candidate-first vector reranking

This is the main optimization.

Instead of running vector similarity against the entire corpus, I first use lexical indexing to produce a bounded candidate set. Vector scoring is then applied only to those candidates.

This keeps vector search computationally narrow and gives the system much better scaling characteristics.

Conditional vector reranking

I do not invoke vector reranking for every query.

Short, exact, or clearly lexical queries usually perform well with FTS alone. Semantic reranking is more useful for concept-heavy or longer queries where lexical overlap is likely to be weaker.

This conditional routing improves both performance and precision.
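
One way to express this routing is a small heuristic over the query text. The rules and the four-token threshold below are illustrative assumptions, not tuned values, and `shouldRerankSemantically` is a hypothetical name.

```typescript
// Hypothetical routing rule; the thresholds are illustrative, not tuned values.
function shouldRerankSemantically(query: string): boolean {
  const trimmed = query.trim();
  // A quoted query signals that the user wants an exact phrase.
  if (/^".*"$/.test(trimmed)) return false;
  // Something that ends like a file extension is a literal lookup.
  if (/\.\w{2,4}$/.test(trimmed)) return false;
  const tokens = trimmed.split(/\s+/).filter((t) => t.length > 0);
  // Longer queries are more likely to describe a concept than name an item.
  return tokens.length >= 4;
}

const exactLookup = shouldRerankSemantically("holiday_2019.mp4");
// exactLookup: false
const conceptual = shouldRerankSemantically("someone explaining why analog visuals feel softer");
// conceptual: true
```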

Query normalization before retrieval

The query is cleaned before any retrieval strategy is selected. This reduces noise, removes low-value tokens, and produces a better signal for both lexical and semantic stages.

This also helps determine whether the query is better treated as an exact search or a semantic-intent search.
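
A rough sketch of such a cleanup step, assuming a small stop-word list and ASCII-only input for brevity. The `STOP_WORDS` contents and the `normalizeQuery` helper are illustrative assumptions.

```typescript
// Illustrative stop-word list; a real one would be tuned to the library.
const STOP_WORDS = new Set(["a", "an", "the", "of", "about", "video", "videos"]);

function normalizeQuery(raw: string): string {
  const tokens = raw
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // ASCII-only for brevity
    .split(/\s+/)
    .filter((t) => t.length > 0);
  const kept = tokens.filter((t) => !STOP_WORDS.has(t));
  // If every token was a stop word, keep the original tokens
  // rather than searching for nothing.
  return (kept.length > 0 ? kept : tokens).join(" ");
}

const cleaned = normalizeQuery("  Video about restoring OLD camera footage! ");
// cleaned: "restoring old camera footage"
```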

Strict-first, broad-second lexical retrieval

I first retrieve with a stricter lexical strategy. If that does not return enough candidates, I expand to a broader retrieval strategy before applying vector reranking.

That means semantic search receives a richer but still bounded candidate pool.
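
The sketch below shows one possible shape for this two-pass retrieval, assuming FTS5 query syntax (whitespace-separated terms are an implicit AND, `OR` is explicit, and a trailing `*` marks a prefix match). `runFts` is a hypothetical callback that executes a MATCH expression and returns ids in rank order; the candidate thresholds are illustrative.

```typescript
// Quote a token for an FTS5 MATCH expression (assumes FTS5 query syntax).
function quoteToken(token: string): string {
  return '"' + token.replace(/"/g, '""') + '"';
}

// Strict: every token must match (FTS5 treats adjacent terms as AND).
function strictMatch(tokens: string[]): string {
  return tokens.map(quoteToken).join(" ");
}

// Broad: any token may match, and each token also matches as a prefix.
function broadMatch(tokens: string[]): string {
  return tokens.map((t) => quoteToken(t) + "*").join(" OR ");
}

// `runFts` is a hypothetical callback that runs a MATCH query and returns ids.
function retrieveCandidates(
  tokens: string[],
  runFts: (match: string, limit: number) => string[],
  minCandidates: number = 20,
  limit: number = 200
): string[] {
  const strict = runFts(strictMatch(tokens), limit);
  if (strict.length >= minCandidates) return strict;
  // Too few strict hits: widen the net, keeping strict hits first.
  const seen = new Set(strict);
  const extra = runFts(broadMatch(tokens), limit).filter((id) => !seen.has(id));
  return strict.concat(extra).slice(0, limit);
}
```

The `limit` parameter is what keeps the pool bounded even after the broader pass.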

Stable embedding source

Because all vectors come from normalized descriptions, each derived from the original video description, the embedding corpus is much more consistent than it would be if I embedded arbitrary fields directly.

This consistency improves similarity behavior and reduces semantic drift.

Fallback-safe ranking

If semantic reranking produces weak or empty results, lexical retrieval remains authoritative.

This avoids failure cascades and keeps the search experience robust even when vector signals are incomplete or low-confidence.
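
A minimal sketch of that guard, assuming similarity scores are available for some or all candidates. The `chooseRanking` helper and the 0.35 threshold are illustrative assumptions, not measured values.

```typescript
interface Scored {
  id: string;
  similarity: number;
}

// Returns the semantic order only when the best match is convincing.
// The 0.35 threshold is an illustrative assumption, not a measured value.
function chooseRanking(
  lexicalOrder: string[],
  scored: Scored[],
  minTopSimilarity: number = 0.35
): string[] {
  if (scored.length === 0) return lexicalOrder; // no vectors: lexical wins
  const best = Math.max(...scored.map((s) => s.similarity));
  if (best < minTopSimilarity) return lexicalOrder; // weak signal: lexical wins
  const semantic = scored
    .slice()
    .sort((a, b) => b.similarity - a.similarity)
    .map((s) => s.id);
  // Items without a score follow the scored ones in their lexical order.
  const ranked = new Set(semantic);
  return semantic.concat(lexicalOrder.filter((id) => !ranked.has(id)));
}
```

Because the lexical order is always passed in, the function can never return fewer results than the lexical stage produced.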

Caching and bounded result sets

Repeated searches and incremental typing benefit from in-memory caching. Bounded candidate windows and per-category result limits also help maintain responsiveness under repeated UI interaction.
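
As an illustration of the caching side, here is a small least-recently-used cache keyed by the normalized query. The LRU eviction policy and the `LruCache` class are assumptions; the only requirement stated above is that the cache lives in memory.

```typescript
// Minimal LRU cache; relies on Map preserving insertion order.
class LruCache<V> {
  private entries = new Map<string, V>();
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

const cache = new LruCache<string[]>(2);
cache.set("engine", ["video-a"]);
cache.set("engine repair", ["video-a", "video-b"]);
cache.get("engine");                    // touch: "engine" is now most recent
cache.set("film cleanup", ["video-c"]); // evicts "engine repair"
```

A fixed capacity keeps memory use predictable during incremental typing, where each keystroke can produce a new query string.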


Why This Scales to 50K Videos or More

Design choices like these start to matter once the library gets large.

At 50,000 videos or beyond, the difference between brute-force semantic retrieval and bounded hybrid retrieval becomes significant.

If I were to run vector similarity across the entire corpus for every query, the cost would scale with total corpus size:

  • more embeddings to load
  • more similarity computations
  • more memory pressure
  • more latency variability
  • more hardware sensitivity

That is not the scaling profile I want in a local Electron app.

With the architecture I chose, the cost profile is different.

The lexical index absorbs the corpus-scale retrieval problem. It narrows the search space efficiently using inverted-index mechanics that are already proven to work well at this size.

Then the vector layer operates only on a bounded candidate set.

That means the expensive stage scales primarily with candidate count, not total video count.

That is a much healthier scaling property.

So even as the library grows to 50K videos or more, I do not need semantic search to become corpus-wide on every query. The vector layer remains focused, bounded, and cost-controlled.
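
A back-of-the-envelope comparison shows the difference. The figures are assumptions chosen for illustration: 384-dimensional float32 embeddings and a candidate window of 200 items.

```typescript
// Multiply-add operations for one brute-force similarity pass over `vectors` embeddings.
function similarityOps(vectors: number, dimensions: number): number {
  return vectors * dimensions;
}

// Assumed figures for illustration: 384-dimensional embeddings, 200 candidates.
const dims = 384;
const fullCorpusOps = similarityOps(50000, dims); // 19,200,000
const candidateOps = similarityOps(200, dims);    // 76,800
const reduction = fullCorpusOps / candidateOps;   // 250

// Memory to hold every float32 vector at once:
// 50,000 * 384 * 4 bytes = 76,800,000 bytes (about 73 MiB)
const fullCorpusBytes = 50000 * dims * 4;
```

Under these assumptions the bounded approach does 250 times less similarity work per query, and that ratio keeps growing with the library while the candidate window stays fixed.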

This is what makes the system practical for large personal or professional media libraries.

It also leaves room for future improvements if scale grows further:

  • approximate nearest-neighbor indexing if full vector retrieval ever becomes necessary
  • compressed vector storage
  • preloaded hot embeddings
  • better candidate-threshold tuning
  • background embedding refresh pipelines
  • tiered search policies based on hardware capability

The current design does not block any of those future directions.


What I Learned

1. Lexical search and vector search should not compete; they should cooperate

The best results came from assigning each retrieval method a clear role. Lexical search is excellent for fast narrowing. Vector search is excellent for meaning-aware reranking.

2. Embedding source quality matters as much as embedding model quality

A well-normalized description can produce a far better vector corpus than raw, noisy, or inconsistently structured text. Good semantic retrieval starts with good semantic representation.

3. Candidate generation is a systems problem, not just a ranking problem

If the candidate pool is wrong, the reranker has no chance. Retrieval architecture matters just as much as the scoring layer.

4. Bounded vector reranking is much more practical than corpus-wide vector retrieval in a local app

This was one of the most important design conclusions. It gave me most of the benefit of semantic search without the cost profile of a full vector database approach.

5. Fallback behavior is part of relevance engineering

A search system should degrade gracefully. If semantic ranking underperforms, lexical search should still provide a strong baseline.

6. Clean separation of retrieval stages makes the whole system easier to evolve

Once the pipeline is modular, it becomes much easier to improve one part without destabilizing the others.

7. Scaling is better when expensive work depends on candidate set size rather than corpus size

This is the reason I am confident in the design even as the library grows beyond 50K videos.


Final Thoughts

Adding vector search to my VDO was not about replacing a weak search engine. SQLite FTS was already strong. The real issue was that human recall is often semantic, while traditional local search is mostly lexical.

The solution was not choosing one method over the other.

The solution was building a retrieval architecture where:

  • full-text search handles fast corpus narrowing
  • normalized-description embeddings provide a clean semantic representation
  • vector similarity improves ranking when intent matters more than wording
  • lexical fallback preserves robustness
  • the whole pipeline remains efficient enough to scale locally

For me, the most important part of this work was not just adding embeddings. It was deciding where embeddings should come from, how vector search should be constrained, and how semantic ranking should coexist with a strong lexical baseline.

That is what made the system useful in practice.

And that is also why I believe this architecture will continue to hold up well as the library grows far beyond its current size.