May 10, 2026

Designing Related Videos for VDO

One of the most useful features in a personal video library is the ability to surface related videos automatically.

On the surface, that sounds simple: compare videos, rank the closest ones, and show the result.

In practice, it becomes a balancing problem between:

explicit organization like tags, playlists, collections, and categories
text-based similarity from titles and descriptions
optional AI-generated metadata and embeddings
indexing cost on a local desktop app
keeping recommendations available even when some data is missing

In VDO, the related-videos system is designed around those tradeoffs and relies on content-based filtering (CBF). For simplicity, it currently leaves out collaborative filtering (CF), hybrid CBF/CF approaches, and deep learning models.

This post walks through how the service works today, why the scoring shifts depending on whether AI is enabled, how similarity indexes are generated and stored, and what optimizations make it practical for a local Electron application.

The goal

The purpose of related videos in VDO is not just “recommendation” in the streaming-platform sense.

It is more practical than that.

The feature helps users:

jump between videos on the same topic
rediscover content from the same collection, category, or workflow
navigate large local libraries more efficiently
benefit from metadata and AI enrichment without depending entirely on them

That last part is important.

VDO supports AI-enabled workflows, but the app should still provide useful related results when AI is disabled or when semantic metadata is not available yet.

So the service is intentionally built to support two operating modes:

AI-enabled mode
non-AI fallback mode

The scoring shifts between those modes because the available signals are different.

The core idea: combine multiple similarity signals

Relatedness in VDO is not based on one source.

The service combines several signals:

title similarity
tag similarity
playlist similarity
collection similarity
category similarity
description similarity
- via semantic embeddings when AI is enabled
- via text/token matching when AI is disabled

Each of these produces a list of candidate related videos with scores.

Those lists are then merged into one final ranking using weighted scoring.

At the end, the system stores the final related-video indexes so the app can retrieve them quickly later instead of recomputing them on every view.
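As a rough illustration of the merge step, here is a minimal TypeScript sketch. The function name `mergeFactorScores`, the data shapes, and the weights in the example are assumptions made for this post, not VDO's actual code.

```typescript
// Hypothetical shape: one score per candidate videoId, for a single factor.
type FactorScores = Map<string, number>;

interface RankedVideo {
  videoId: string;
  score: number;
}

// Merge per-factor candidate scores for ONE source video into a single ranking.
// `weights` maps factor name -> weight; factors without a weight count as 0.
function mergeFactorScores(
  factors: Record<string, FactorScores>,
  weights: Record<string, number>,
  topN: number
): RankedVideo[] {
  const combined = new Map<string, number>();
  Object.keys(factors).forEach((factor) => {
    const scores = factors[factor];
    if (!scores) return;
    const weight = weights[factor] ?? 0;
    scores.forEach((score, videoId) => {
      combined.set(videoId, (combined.get(videoId) ?? 0) + weight * score);
    });
  });
  const ranked: RankedVideo[] = [];
  combined.forEach((score, videoId) => ranked.push({ videoId, score }));
  // Highest score first; ties broken by id so the order is deterministic.
  ranked.sort((a, b) => b.score - a.score || a.videoId.localeCompare(b.videoId));
  return ranked.slice(0, topN);
}

// Example with made-up weights: "c" appears in both factors, so it outranks "b".
const mergedExample = mergeFactorScores(
  { title: new Map([["b", 1], ["c", 0.5]]), tags: new Map([["c", 1]]) },
  { title: 0.4, tags: 0.6 },
  2
);
```

A candidate that shows up in several factor lists accumulates score from each, which is what lets weaker signals reinforce each other.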

Category as an additional relevance signal

VDO also treats category as an explicit structural parameter in related-video ranking.

Each video may belong to one broad content category such as:

Home & Family
Movies
Shows & Series
Documentaries
Music Videos
Podcasts
Educational
Shorts
Travel & Vlogs
Sports & Games
Work
Surveillance
Others

These categories help the system avoid comparing videos only through text or metadata when there is already a meaningful high-level content grouping available.

For example:

a family event clip should be more likely to relate to other Home & Family videos
a film file should be more likely to relate to other Movies
an explainer or lecture is more likely to match Educational
a short-form clip is more likely to match Shorts
security camera footage should strongly remain within Surveillance
anything that does not fit clearly can fall into Others

Category is especially useful because it is:

cheap to compare
stable over time
understandable to users
effective even when tags or descriptions are sparse

It is not meant to replace finer-grained signals like tags or semantic description similarity. Instead, it acts as a broad structural anchor, much like collections, but at a content-type level rather than a library-grouping level.

The current scoring model

The current implementation does not use a formal layered scoring architecture.

Instead, it computes similarity per factor and then combines them into a single overall score after normalization.

That distinction matters.

A layered model would usually separate:

structural similarity
lexical similarity
semantic similarity

and then blend them in stages.

VDO’s current approach is simpler: every factor contributes directly to the final score.

That has some advantages:

simpler to reason about
faster to implement
easier to adjust individual factors
fewer moving parts in the ranking pipeline

It also has tradeoffs:

all factors compete in one flat weighting model
factor interactions are less expressive
confidence-based blending is harder
semantic and non-semantic signals are not grouped conceptually

Even so, for a local desktop app, this is a very practical design.

Why scoring changes when AI is enabled

This is one of the most important design choices in the system.

VDO does not treat AI-enabled and non-AI environments as the same ranking problem.

When AI is enabled, description-related similarity becomes much stronger because it can use embeddings rather than depending only on keyword overlap. In that mode, semantic description signals are allowed to contribute more strongly to the final ranking.

When AI is disabled, the system relies more heavily on the non-semantic signals already available in the library, especially the explicit structure created by the user, such as playlists, tags, collections, and categories, along with title-based relevance.

This difference is intentional.

If the same balance were used in both modes, one of two things would happen:

AI mode would underuse semantic data
non-AI mode would overtrust weaker text matching

So instead of forcing one universal formula, the service changes the balance depending on whether semantic metadata is available.

It is not a full confidence-aware model yet, but it is a good practical compromise.
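One way to picture the mode switch is as two weight profiles and a selector. The sketch below is illustrative only: the factor names follow this post, but every number is a placeholder chosen so each profile sums to 1, and `selectWeights` is a hypothetical helper rather than a function from the VDO codebase.

```typescript
type Factor = "title" | "tags" | "playlists" | "collection" | "category" | "description";
type WeightProfile = Record<Factor, number>;

// Placeholder numbers for illustration only; these are NOT VDO's real weights.
const AI_MODE_WEIGHTS: WeightProfile = {
  title: 0.15, tags: 0.2, playlists: 0.15, collection: 0.1, category: 0.1, description: 0.3,
};
const FALLBACK_WEIGHTS: WeightProfile = {
  title: 0.25, tags: 0.25, playlists: 0.2, collection: 0.1, category: 0.1, description: 0.1,
};

// Pick the profile for the current mode. Embedding-backed description
// similarity gets more weight only when AI is enabled.
function selectWeights(aiEnabled: boolean): WeightProfile {
  return aiEnabled ? AI_MODE_WEIGHTS : FALLBACK_WEIGHTS;
}

// Sanity helper: total weight of a profile.
function totalWeight(profile: WeightProfile): number {
  return (Object.keys(profile) as Factor[]).reduce((sum, factor) => sum + profile[factor], 0);
}
```

Keeping both profiles as plain data makes it easy to tune one mode without touching the other.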

Normalization before combining scores

The related-videos service normalizes scores within each factor before combining them.

This is important because each factor produces scores differently:

title similarity may come from cosine similarity over TF-IDF-like representations
description similarity may come from embeddings or text overlap
tags and playlists often behave more like overlap-based categorical signals
collection and category are closer to binary or grouped signals

Without normalization, one factor could dominate simply because its raw score range is larger.

Normalization keeps the weighting meaningful.

Instead of mixing raw numbers from incompatible scales, the service first puts factor scores into a comparable range and then combines them.

That is one of the quiet but critical parts of the ranking design.
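The post does not pin down which normalization the service uses, so the following is only one plausible choice: scale each factor by its own maximum so the best candidate in that factor gets 1. The function name `normalizeByMax` is hypothetical.

```typescript
// Scale one factor's raw scores into [0, 1] by dividing by the factor's
// largest score. This is an assumed scheme, shown for illustration.
function normalizeByMax(scores: Map<string, number>): Map<string, number> {
  let max = 0;
  scores.forEach((score) => {
    if (score > max) max = score;
  });
  const normalized = new Map<string, number>();
  scores.forEach((score, videoId) => {
    // If every score is zero (or the map is empty) there is nothing to scale.
    normalized.set(videoId, max > 0 ? score / max : 0);
  });
  return normalized;
}

// Raw overlap counts (2 and 8) become comparable with cosine-style scores.
const normalizedExample = normalizeByMax(new Map([["a", 2], ["b", 8]]));
```

With this scheme a binary factor such as category stays at 1 for every match, while count-based factors are compressed into the same range.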

Title similarity: lightweight lexical relevance

Title similarity is one of the cheapest useful signals.

VDO preprocesses titles by:

converting to lowercase
removing punctuation
splitting into tokens
removing stop words

From there, the service builds TF-IDF-style vectors and compares videos using cosine similarity.
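To make that concrete, here is a small self-contained sketch of the title path. The stop-word list, the ASCII-only punctuation rule, the smoothed IDF formula, and all function names are assumptions for illustration; the real service may differ in each of these.

```typescript
// Tiny illustrative stop-word list; a real one would be longer.
const TITLE_STOP_WORDS = new Set(["a", "an", "and", "the", "of", "to", "in", "for", "on", "with"]);

// Lowercase, strip punctuation (ASCII-only simplification), split, drop stop words.
function tokenizeTitle(title: string): string[] {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0 && !TITLE_STOP_WORDS.has(token));
}

// Build one sparse TF-IDF vector (term -> weight) per video title.
function buildTitleVectors(titles: Map<string, string>): Map<string, Map<string, number>> {
  const tokensById = new Map<string, string[]>();
  const docFreq = new Map<string, number>();
  titles.forEach((title, videoId) => {
    const tokens = tokenizeTitle(title);
    tokensById.set(videoId, tokens);
    new Set(tokens).forEach((term) => docFreq.set(term, (docFreq.get(term) ?? 0) + 1));
  });
  const total = titles.size;
  const vectors = new Map<string, Map<string, number>>();
  tokensById.forEach((tokens, videoId) => {
    const vector = new Map<string, number>();
    tokens.forEach((term) => {
      // Smoothed IDF; adding it once per occurrence yields tf * idf.
      const idf = Math.log((1 + total) / (1 + (docFreq.get(term) ?? 0))) + 1;
      vector.set(term, (vector.get(term) ?? 0) + idf);
    });
    vectors.set(videoId, vector);
  });
  return vectors;
}

// Cosine similarity between two sparse vectors.
function cosineSparse(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((weight, term) => {
    normA += weight * weight;
    dot += weight * (b.get(term) ?? 0);
  });
  b.forEach((weight) => {
    normB += weight * weight;
  });
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}
```

Titles that share no terms after preprocessing score exactly 0, which keeps unrelated videos out of the title candidate list.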

This works well for titles because titles are usually:

short
relatively clean
highly descriptive when present
cheap to process across the whole library

Title similarity is especially useful when:

descriptions are missing
AI is disabled
tags are sparse
users use consistent naming conventions

It is not sufficient on its own, but it gives good lexical grounding at low cost.

Description similarity has two paths

Description similarity is more complicated than title similarity because descriptions are larger, noisier, and more expensive to compare at scale.

That is why VDO effectively supports two paths.

Path 1: semantic description similarity with embeddings

When AI is enabled, VDO uses stored description embeddings.

The process is roughly:

fetch the embedding for each video
build an in-memory map of videoId -> embedding
compare each video against others using embedding similarity
keep only the top related candidates
store the resulting similarity map

This is handled through the embedding-driven related-videos path.
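The steps above can be sketched roughly as follows. This assumes embeddings are already loaded as plain `number[]` arrays keyed by videoId and uses cosine similarity as the comparison; the post does not name the exact metric, and `relatedByEmbedding` is a hypothetical function name.

```typescript
interface EmbeddingMatch {
  videoId: string;
  score: number;
}

// Cosine similarity between two dense vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0; // a missing dimension counts as zero
    dot += x * y;
    normA += x * x;
  });
  b.forEach((y) => {
    normB += y * y;
  });
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

// For every video, compare against all others and keep the topK matches.
function relatedByEmbedding(
  embeddings: Map<string, number[]>,
  topK: number
): Map<string, EmbeddingMatch[]> {
  const ids = Array.from(embeddings.keys());
  const result = new Map<string, EmbeddingMatch[]>();
  ids.forEach((sourceId) => {
    const source = embeddings.get(sourceId);
    if (!source) return;
    const scored: EmbeddingMatch[] = [];
    ids.forEach((otherId) => {
      if (otherId === sourceId) return;
      const other = embeddings.get(otherId);
      if (other) scored.push({ videoId: otherId, score: cosineSimilarity(source, other) });
    });
    scored.sort((x, y) => y.score - x.score);
    result.set(sourceId, scored.slice(0, topK));
  });
  return result;
}
```

Videos without an embedding are simply absent from the input map, so they produce no semantic candidates and rely on the other factors.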

The benefit is obvious: embeddings can identify semantic similarity even when two descriptions use different vocabulary.

That means videos can still be matched when they are conceptually similar but lexically different.

Examples:

“JavaScript async patterns” and “understanding promises and event loops”
“SQLite full-text search” and “fast metadata lookup”
“vector similarity” and “semantic matching”

These might not overlap strongly by keyword, but embeddings can often place them close together.

Because of that, semantic description similarity is allowed to influence ranking more when AI is available, though it still remains balanced against user-created structure like tags, playlists, titles, collections, and categories.

Path 2: non-AI description similarity with keyword/token indexing

When AI is disabled, the service falls back to text-based description similarity.

There are two relevant ideas in the code here:

a TF-IDF-based compatibility path
an optimized inverted-index candidate-first path for larger libraries

TF-IDF compatibility path

There is still TF-IDF logic for descriptions in the service.

This is useful conceptually and can work well for smaller datasets. It computes document frequencies, term frequencies, and then applies cosine similarity between description vectors.

That gives a traditional lexical similarity model.

However, full pairwise TF-IDF similarity across long descriptions becomes expensive as the library grows.

So for larger or more practical indexing flows, the service uses a more optimized approach.

Inverted index optimization for description similarity

For larger libraries, VDO uses a candidate-first approach built around an inverted index.

This is one of the most important optimizations in the related-videos design.

Why not compare every description with every other description?

Because that becomes unnecessarily expensive.

If a library contains many videos, brute-force pairwise comparison grows quickly and wastes time comparing obviously unrelated items.

The optimization approach

The description indexing path does this instead:

preprocess each description into a limited set of tokens
build a videoId -> tokenSet map
build an inverted index of term -> Set(videoId)
for each source video, gather only candidates that share terms
compute overlap-based scores only for those candidates
keep the top N

That changes the problem from:

compare everything with everything

to:

compare only videos that have at least some lexical reason to be related

That is a major performance win.
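A minimal sketch of the candidate-gathering idea, assuming token sets have already been built per video. The names `buildInvertedIndex` and `gatherCandidates` are hypothetical and the real service may structure this differently.

```typescript
// Build term -> Set(videoId) from videoId -> tokenSet.
function buildInvertedIndex(tokenSets: Map<string, Set<string>>): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  tokenSets.forEach((tokens, videoId) => {
    tokens.forEach((term) => {
      let posting = index.get(term);
      if (!posting) {
        posting = new Set<string>();
        index.set(term, posting);
      }
      posting.add(videoId);
    });
  });
  return index;
}

// Collect only the videos that share at least one term with the source video.
function gatherCandidates(
  sourceId: string,
  tokenSets: Map<string, Set<string>>,
  index: Map<string, Set<string>>
): Set<string> {
  const candidates = new Set<string>();
  const tokens = tokenSets.get(sourceId);
  if (!tokens) return candidates;
  tokens.forEach((term) => {
    const posting = index.get(term);
    if (!posting) return;
    posting.forEach((videoId) => {
      if (videoId !== sourceId) candidates.add(videoId);
    });
  });
  return candidates;
}
```

Videos with no shared terms never enter the candidate set, so they are never scored at all.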

Preprocessing details matter

The optimization works because the preprocessing is deliberately constrained.

Descriptions are normalized by:

lowercasing
stripping punctuation
splitting on whitespace
removing stop words

Then the token list is capped using a maximum unique-token limit.

This matters for performance and memory.

Descriptions can be very long, especially if they are AI-generated, transcript-derived, or imported from metadata-rich sources. Keeping only a bounded number of unique tokens prevents:

huge token maps
oversized inverted indexes
candidate explosion from noisy text
disproportionate influence from very long descriptions

This is a practical desktop-app optimization rather than an academic one, and it is exactly the kind of tradeoff that keeps indexing usable.
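Here is a small sketch of capped preprocessing. The stop-word list, the ASCII-only punctuation rule, and the `maxUniqueTokens` parameter name are assumptions for illustration; the post only states that a maximum unique-token limit exists.

```typescript
// Tiny illustrative stop-word list; a real one would be longer.
const DESCRIPTION_STOP_WORDS = new Set(["a", "an", "and", "the", "of", "to", "in", "is", "it", "for"]);

// Normalize a description and keep at most `maxUniqueTokens` distinct tokens,
// taken in the order they first appear.
function tokenizeDescription(text: string, maxUniqueTokens: number): Set<string> {
  const tokens = new Set<string>();
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // ASCII-only simplification
    .split(/\s+/);
  for (let i = 0; i < words.length && tokens.size < maxUniqueTokens; i++) {
    const word = words[i];
    if (word && !DESCRIPTION_STOP_WORDS.has(word)) tokens.add(word);
  }
  return tokens;
}

// With a cap of 3, "red" never makes it into the token set.
const cappedTokens = tokenizeDescription("The quick brown fox and the quick red fox", 3);
```

Taking the first distinct tokens is the simplest capping policy; keeping the rarest tokens instead would be another option, at the cost of needing document frequencies first.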

Candidate-first scoring nuances

The candidate-first scoring path has a few important nuances.

1. It filters overly broad terms

The service supports limits on very large posting lists during candidate generation.

This helps avoid terms that appear in too many videos.

Even after stop-word filtering, some terms can still be too common to be useful. If a term points to a large portion of the library, it is usually a poor candidate-generation signal.

So those terms can be capped or skipped during candidate generation.

That keeps the inverted index useful rather than noisy.

2. It falls back when filtering becomes too aggressive

If all terms get filtered out for a video, the service falls back to using all terms.

That is a small but very important detail.

Without this fallback, an aggressively filtered video could end up with no candidates at all, even when some good matches exist.
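The filter and its fallback fit in a few lines. In this sketch `maxPostingSize` is a hypothetical threshold and `selectCandidateTerms` is an illustrative name; how VDO actually chooses the limit is not covered here.

```typescript
// Choose which of a video's terms to use for candidate generation.
// Terms whose posting list is larger than `maxPostingSize` are skipped,
// unless that would leave no terms at all.
function selectCandidateTerms(
  terms: Set<string>,
  index: Map<string, Set<string>>,
  maxPostingSize: number
): string[] {
  const allTerms = Array.from(terms);
  const selective = allTerms.filter((term) => {
    const posting = index.get(term);
    return posting !== undefined && posting.size <= maxPostingSize;
  });
  // Fallback: every term was too broad, so use them all rather than nothing.
  return selective.length > 0 ? selective : allTerms;
}
```

The fallback trades some extra candidates for the guarantee that a video made up entirely of common words can still be matched.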

3. It uses set overlap rather than full weighted vector scoring

For candidate scoring in the optimized path, the service computes overlap between token sets and uses a Jaccard-style score.

This is cheaper than full vector comparison and works reasonably well once the candidate set has already been narrowed.

So the optimization is not just “index first.” It also uses a scoring method that is simple enough to keep the process lightweight.
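For reference, a Jaccard score is the size of the intersection divided by the size of the union. The sketch below shows that calculation plus a simple top-N ranking over already-gathered candidates; `rankByOverlap` is a hypothetical helper and the service's "Jaccard-style" score may include adjustments not shown here.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  const small = a.size <= b.size ? a : b;
  const large = small === a ? b : a;
  let shared = 0;
  small.forEach((token) => {
    if (large.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Score each candidate's token set against the source and keep the top N.
function rankByOverlap(
  source: Set<string>,
  candidates: Map<string, Set<string>>,
  topN: number
): Array<{ videoId: string; score: number }> {
  const scored: Array<{ videoId: string; score: number }> = [];
  candidates.forEach((tokens, videoId) => {
    scored.push({ videoId, score: jaccard(source, tokens) });
  });
  scored.sort((x, y) => y.score - x.score);
  return scored.slice(0, topN);
}
```

For example, token sets {a, b, c} and {b, c, d} share two of four distinct tokens, giving a score of 0.5.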

Collections and categories as structural anchors

Collections and categories are related, but they serve different purposes.

Collections

Collections are user- or library-level organizational groupings. They often reflect where a video belongs within the broader library structure.

Similarity maps are generated ahead of time

One of the most important design choices in VDO is that related-video results are generated and stored, not computed on demand every time a user opens a video.

That matters for user experience.

Instead of doing all similarity work at read time, the service:

collects all videos
computes per-factor similarity maps
merges them into final related-video rankings
stores the results in the database

Later, the app can fetch related videos directly by videoId.

This gives much faster access during normal usage.

For a desktop application, this is a strong design choice because it shifts expensive work away from interactive UI moments.

When indexes are regenerated

Related-video indexes are not rebuilt synchronously after every small change.

Instead, the service uses delayed indexing.

When relevant data changes, indexing is scheduled after a short delay rather than immediately re-running the whole pipeline for every update.

That helps in situations like:

importing many videos
updating metadata in batches
tag changes happening in bursts
category reassignment for multiple videos
AI description generation finishing for multiple videos
playlist updates coming close together

This debounce-like behavior reduces unnecessary work and avoids repeated full reindexing during active update periods.
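A debounce of this kind can be sketched in a few lines. Everything here is an assumption for illustration: the `createDelayedIndexer` name, the `TimerApi` interface (injected so the behavior can be exercised without real waiting), and the delay value in the usage comment.

```typescript
// Minimal timer abstraction so the scheduler does not depend on real time.
interface TimerApi {
  set(callback: () => void, delayMs: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: TimerApi = {
  set: (callback, delayMs) => setTimeout(callback, delayMs),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// Each request restarts the countdown, so a burst of changes
// results in a single indexing run after things go quiet.
function createDelayedIndexer(
  runIndexing: () => void,
  delayMs: number,
  timers: TimerApi = realTimers
): { requestReindex(): void } {
  let scheduled = false;
  let handle: unknown = undefined;
  return {
    requestReindex(): void {
      if (scheduled) timers.clear(handle);
      scheduled = true;
      handle = timers.set(() => {
        scheduled = false;
        runIndexing();
      }, delayMs);
    },
  };
}

// Usage sketch (the 2000 ms delay is a made-up value):
// const indexer = createDelayedIndexer(() => rebuildRelatedVideos(), 2000);
// indexer.requestReindex(); // call from tag, playlist, or metadata change handlers
```

The delay is a tradeoff: longer waits coalesce more changes per run, shorter waits make related results catch up sooner.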

Why delayed indexing is useful

Without delayed indexing, the system could re-run related-video generation too often.

For example, imagine importing a batch of videos:

first the video rows are added
then categories are assigned
then tags arrive
then playlists update
then descriptions are generated
then embeddings are added

If indexing ran after each micro-update, the service would waste a lot of effort producing incomplete intermediate results.

By waiting briefly and then indexing, the system gets:

fewer redundant runs
more complete data per run
less UI disruption
better overall throughput

This is especially important in a desktop app, where indexing competes for CPU with whatever else the user is running.

Storage for fast retrieval

After generation, the final related-video sets are saved.

That means the read path is simple:

ask for related videos by videoId
return the stored result

This is much better than recomputing from titles, descriptions, tags, playlists, collections, categories, and embeddings every time the user opens a detail view.

The indexing step is the expensive part. The read step should be cheap.

This separation is one of the strongest practical aspects of the design.
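To show how thin the read path can be, here is a toy in-memory stand-in for the stored results. VDO persists these in its database; the `RelatedVideoStore` class and its method names are hypothetical and only illustrate the write-once, read-many split.

```typescript
interface StoredRelated {
  videoId: string;
  score: number;
}

// In-memory stand-in for the persisted related-video table.
class RelatedVideoStore {
  private readonly rows = new Map<string, StoredRelated[]>();

  // Write path: called by the indexing run with the merged ranking for a video.
  save(videoId: string, related: StoredRelated[]): void {
    this.rows.set(videoId, related.slice());
  }

  // Read path: a keyed lookup with no similarity math involved.
  getRelated(videoId: string, limit: number): StoredRelated[] {
    return (this.rows.get(videoId) ?? []).slice(0, limit);
  }
}
```

A video that has not been indexed yet simply returns an empty list, which the UI can treat as "no related videos yet".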

Caching and invalidation strategy

There are two layers of caching and invalidation worth understanding here.

1. Metadata and embedding caching

The metadata service keeps caches for things like:

metadata records
embeddings
short descriptions

That helps because metadata, especially embeddings, can be expensive to fetch and parse repeatedly.

When metadata changes, the corresponding cache entries are invalidated.

That includes updates such as:

description changes
generated metadata updates
embedding updates
transcript updates

This ensures related-video generation uses fresh data without forcing the entire metadata layer to be uncached all the time.

2. Related-video map invalidation by regeneration

For related videos themselves, the strategy is less about fine-grained cache invalidation and more about regenerating and replacing stored results after data changes.

The service also tracks the previously generated related-videos map and compares it to the updated one before writing everything back.

That reduces unnecessary writes and avoids updating unchanged rows.

This is a very practical optimization.
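A simplified version of that comparison might look like the sketch below. It assumes each map stores an ordered list of related videoIds per video and compares only those ids; the function name `diffRelatedMaps` is hypothetical, and the real service may also compare scores.

```typescript
// Work out which rows actually need to be written after a regeneration.
function diffRelatedMaps(
  previous: Map<string, string[]>,
  next: Map<string, string[]>
): { upserts: string[]; deletes: string[] } {
  const upserts: string[] = [];
  const deletes: string[] = [];
  next.forEach((related, videoId) => {
    const before = previous.get(videoId);
    const unchanged =
      before !== undefined &&
      before.length === related.length &&
      before.every((id, i) => id === related[i]);
    if (!unchanged) upserts.push(videoId); // new or changed row
  });
  previous.forEach((_related, videoId) => {
    if (!next.has(videoId)) deletes.push(videoId); // row no longer exists
  });
  return { upserts, deletes };
}
```

On a large library where only a handful of videos changed, most rows fall into neither list and are never touched.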

Why embeddings use a separate cache

Embeddings are treated differently from ordinary metadata because they are larger and accessed differently.

That is exactly the right call.

A normal metadata cache and an embedding cache do not behave the same way:

descriptions are text and often read with other metadata
embeddings are vector payloads and used for similarity workflows
embeddings consume more memory
embedding access patterns are more specialized

Keeping them separate avoids polluting general metadata caching with large vector payloads.

In a feature like related videos, that separation helps keep memory behavior more predictable.
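The separation can be as simple as two instances of the same small cache class, each with its own loader. This is a sketch under assumptions: `VideoDataCache`, the loader signatures, and the placeholder return values are invented for illustration and do not mirror the metadata service's real API.

```typescript
// A small read-through cache keyed by videoId, with explicit invalidation.
class VideoDataCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly loader: (videoId: string) => V;

  constructor(loader: (videoId: string) => V) {
    this.loader = loader;
  }

  // Return the cached value, loading it on first access.
  get(videoId: string): V {
    const cached = this.entries.get(videoId);
    if (cached !== undefined) return cached;
    const value = this.loader(videoId);
    this.entries.set(videoId, value);
    return value;
  }

  // Drop one entry, for example after a description or embedding update.
  invalidate(videoId: string): void {
    this.entries.delete(videoId);
  }
}

// Usage sketch: separate instances keep large vectors out of the general cache.
// The loaders below return placeholder data instead of reading from a database.
const metadataCacheExample = new VideoDataCache<{ description: string }>((videoId) => ({
  description: "description for " + videoId,
}));
const embeddingCacheExample = new VideoDataCache<number[]>((_videoId) => [0.1, 0.2, 0.3]);
```

Because the two caches are independent, each can later get its own size limit or eviction policy without affecting the other.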

Why the service supports both AI and non-AI paths

The answer is not just feature flexibility. It is also operational stability.

In a real video library:

some videos may not have generated descriptions yet
some may not have embeddings yet
AI may be disabled by user preference
AI enrichment may happen later than import
old libraries may contain sparse metadata
category may exist even when richer metadata does not

If related-videos depended entirely on embeddings, the feature would feel inconsistent.

By supporting both:

embedding-based semantic similarity
text/index-based fallback similarity
structural matching through collections, tags, playlists, and categories

VDO ensures the feature is still usable across different library states.

That is one of the most practical aspects of the design.

Why title, tags, playlists, collections, and categories still matter in AI mode

A common mistake in AI-assisted systems is to let semantic similarity dominate everything.

That often creates recommendations that are technically plausible but less useful in the context of how a user organizes their library.

In VDO, user-created and structural organization still matters.

Playlists

Playlists often represent workflow, grouping, or curated context.

Collections

Collections are broader structural groupings and can provide a strong relevance signal.

Titles

Titles remain a cheap and often accurate lexical signal.

Even when AI is enabled, those signals continue to anchor the ranking in the user’s actual library organization and content type.

That is why the score is not “embedding only.”

Current tradeoff: simple weighting vs richer layering

At the moment, the service uses a direct weighted combination model.

That is a reasonable choice, but it is worth being honest about the tradeoff.

Current approach advantages

simpler implementation
easy to tune per factor
low conceptual overhead
fast enough for practical local indexing
works well with precomputed similarity maps

Current approach limitations

no formal structural vs lexical vs semantic grouping
semantic confidence is not modeled explicitly
partial AI coverage is handled indirectly rather than adaptively
some factors may deserve grouped treatment in the future

A layered strategy could improve this by making the ranking logic more expressive.

For example:

first combine structural signals like collections, categories, tags, and playlists
then combine lexical signals like titles and token-based descriptions
then apply semantic influence conditionally

That would likely produce a more stable and explainable model.

But it would also add complexity:

more parameters
more tuning cost
more debugging overhead
more opportunities for edge cases

So this is not a free improvement.

For a local-first app, the current design is a sensible middle ground.

A realistic next step, not a total rewrite

The most practical future direction would not be replacing the current system entirely.

It would be incrementally improving it.

Some possible next steps:

1. Add confidence-aware semantic weighting

Instead of only shifting weights globally based on whether AI is enabled, also consider whether a given video actually has reliable embedding data.

2. Introduce grouped scoring internally

Keep the external behavior simple, but internally group:

structural factors
lexical factors
semantic factors
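Since this is a possible future direction rather than existing behavior, the following is purely a thought experiment. It assumes each group has already been reduced to one score in [0, 1], uses placeholder group weights, and treats a missing semantic score as "no embedding available for this pair"; `blendGroups` is a hypothetical name.

```typescript
interface GroupScores {
  structural: number; // e.g. tags, playlists, collections, categories
  lexical: number; // e.g. titles, token-based descriptions
  semantic: number | null; // null when no embedding-based score exists
}

// Placeholder group weights for illustration only.
const STRUCTURAL_WEIGHT = 0.4;
const LEXICAL_WEIGHT = 0.3;
const SEMANTIC_WEIGHT = 0.3;

// Blend the three groups; when the semantic score is missing, renormalize
// over the remaining groups instead of silently treating it as zero.
function blendGroups(groups: GroupScores): number {
  const base = STRUCTURAL_WEIGHT * groups.structural + LEXICAL_WEIGHT * groups.lexical;
  if (groups.semantic === null) {
    return base / (STRUCTURAL_WEIGHT + LEXICAL_WEIGHT);
  }
  return base + SEMANTIC_WEIGHT * groups.semantic;
}
```

Renormalizing per pair is one simple way to keep videos without embeddings from being penalized next to videos that have them.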

3. Store more diagnostic score breakdowns

This would help debugging and tuning by showing why a related video ranked highly.

4. Further optimize large-library candidate generation

The inverted-index approach is already a strong step, but it could be extended with:

rarer-term prioritization
better candidate pruning
adaptive limits based on library size

5. Incremental reindexing

Instead of rebuilding all relationships every time, update only the affected subset when possible.

6. Category-aware weighting refinements

Not all categories behave the same. For example:

Surveillance may need stronger same-category weighting
Movies and Documentaries may benefit more from metadata-rich similarity
Educational may benefit from title and description quality
Shorts may need different ranking behavior because content is brief and often noisier
Home & Family may rely more on collection, titles, and dates
Others may need weaker category influence because it is intentionally broad

That said, even without those changes, the current implementation already reflects a thoughtful set of engineering tradeoffs.

The practical philosophy behind the feature

If I had to summarize the related-videos design in VDO in one sentence, it would be this:

make related videos useful in every environment, and better when richer metadata is available

That philosophy shows up everywhere in the implementation:

title similarity for lightweight lexical relevance
tags, playlists, collections, and categories for explicit structure
embeddings when AI is enabled
token-indexed description matching when AI is not
precomputed similarity maps for speed
delayed regeneration for efficiency
storage-based retrieval for responsiveness
caching and invalidation to keep metadata fresh

It is not trying to be an academic recommendation engine.

It is trying to be a smart, dependable local feature for a real media library.

And for a desktop app, that is exactly the right priority.

Closing

Designing related videos is really a balancing exercise between quality, consistency, and cost.

In VDO, the design chooses:

multiple relevance signals instead of one magic score
different ranking balance for AI and non-AI environments
semantic similarity where available
text-based fallback where necessary
category-aware structural relevance
indexing ahead of time instead of recomputing on demand
optimizations like inverted indexes and token caps to keep large libraries manageable

That combination gives the feature a strong practical foundation.

It may evolve toward more layered or confidence-aware ranking over time, but even in its current form, it captures an important principle:

recommendation quality improves when you combine user structure, category alignment, lexical similarity, semantic metadata, and operational pragmatism.
