Designing Related Videos for VDO
Designing Related Videos for any given Video
One of the most useful features in a personal video library is the ability to surface related videos automatically.
On the surface, that sounds simple: compare videos, rank the closest ones, and show the result.
In practice, it becomes a balancing problem between:
- explicit organization like tags, playlists, collections, and categories
- text-based similarity from titles and descriptions
- optional AI-generated metadata and embeddings
- indexing cost on a local desktop app
- keeping recommendations available even when some data is missing
In VDO, the related videos system is designed around those tradeoffs. It relies on content-based filtering (CBF) only. For simplicity, it currently leaves out collaborative filtering (CF), hybrid CBF/CF approaches, and deep learning models.
This post walks through how the service works today, why the scoring shifts depending on whether AI is enabled, how similarity indexes are generated and stored, and what optimizations make it practical for a local Electron application.
The goal
The purpose of related videos in VDO is not just “recommendation” in the streaming-platform sense.
It is more practical than that.
The feature helps users:
- jump between videos on the same topic
- rediscover content from the same collection, category, or workflow
- navigate large local libraries more efficiently
- benefit from metadata and AI enrichment without depending entirely on them
That last part is important.
VDO supports AI-enabled workflows, but the app should still provide useful related results when AI is disabled or when semantic metadata is not available yet.
So the service is intentionally built to support two operating modes:
- AI-enabled mode
- non-AI fallback mode
The scoring shifts between those modes because the available signals are different.
The core idea: combine multiple similarity signals
Relatedness in VDO is not based on one source.
The service combines several signals:
- title similarity
- tag similarity
- playlist similarity
- collection similarity
- category similarity
- description similarity (via semantic embeddings when AI is enabled, via text/token matching when AI is disabled)
Each of these produces a list of candidate related videos with scores.
Those lists are then merged into one final ranking using weighted scoring.
At the end, the system stores the final related-video indexes so the app can retrieve them quickly later instead of recomputing them on every view.
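The merge step described above can be sketched roughly as follows. This is a minimal TypeScript illustration, not VDO's actual code: the names `FactorScores`, `RelatedVideo`, and `mergeFactors`, and the shape of the weight table, are assumptions made for the example. It also assumes the per-factor scores are already on comparable scales.

```typescript
// Hypothetical shapes: the post does not show the service's real types.
type FactorScores = Map<string, number>; // candidate videoId -> score for one factor

interface RelatedVideo {
  videoId: string;
  score: number;
}

// Merge the per-factor candidate lists of one source video into one ranking.
function mergeFactors(
  factors: Record<string, FactorScores>,
  weights: Record<string, number>,
  topN: number,
): RelatedVideo[] {
  const combined = new Map<string, number>();
  for (const name of Object.keys(factors)) {
    const weight = weights[name] ?? 0; // factors without a weight contribute nothing
    factors[name].forEach((score, videoId) => {
      combined.set(videoId, (combined.get(videoId) ?? 0) + weight * score);
    });
  }
  return Array.from(combined.entries())
    .map(([videoId, score]) => ({ videoId, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}
```

A candidate that appears in several factor lists accumulates score from each of them, which is why a video sharing both tags and a playlist tends to outrank one that only shares title words.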
Category as an additional relevance signal
VDO also treats category as an explicit structural parameter in related-video ranking.
Each video may belong to one broad content category such as:
- Home & Family
- Movies
- Shows & Series
- Documentaries
- Music Videos
- Podcasts
- Educational
- Shorts
- Travel & Vlogs
- Sports & Games
- Work
- Surveillance
- Others
These categories help the system avoid comparing videos only through text or metadata when there is already a meaningful high-level content grouping available.
For example:
- a family event clip should be more likely to relate to other Home & Family videos
- a film file should be more likely to relate to other Movies
- an explainer or lecture is more likely to match Educational
- a short-form clip is more likely to match Shorts
- security camera footage should strongly remain within Surveillance
- anything that does not fit clearly can fall into Others
Category is especially useful because it is:
- cheap to compare
- stable over time
- understandable to users
- effective even when tags or descriptions are sparse
It is not meant to replace finer-grained signals like tags or semantic description similarity. Instead, it acts as a broad structural anchor, much like collections, but at a content-type level rather than a library-grouping level.
The current scoring model
The current implementation does not use a formal layered scoring architecture.
Instead, it computes similarity per factor and then combines them into a single overall score after normalization.
That distinction matters.
A layered model would usually separate:
- structural similarity
- lexical similarity
- semantic similarity
and then blend them in stages.
VDO’s current approach is simpler: every factor contributes directly to the final score.
That has some advantages:
- simpler to reason about
- faster to implement
- easier to adjust individual factors
- fewer moving parts in the ranking pipeline
It also has tradeoffs:
- all factors compete in one flat weighting model
- factor interactions are less expressive
- confidence-based blending is harder
- semantic and non-semantic signals are not grouped conceptually
Even so, for a local desktop app, this is a very practical design.
Why scoring changes when AI is enabled
This is one of the most important design choices in the system.
VDO does not treat AI-enabled and non-AI environments as the same ranking problem.
When AI is enabled, description-related similarity becomes much stronger because it can use embeddings rather than depending only on keyword overlap. In that mode, semantic description signals are allowed to contribute more strongly to the final ranking.
When AI is disabled, the system relies more heavily on the non-semantic signals already available in the library: title-based relevance and, especially, the explicit structure created by the user, such as playlists, tags, collections, and categories.
This difference is intentional.
If the same balance were used in both modes, one of two things would happen:
- AI mode would underuse semantic data
- non-AI mode would overtrust weaker text matching
So instead of forcing one universal formula, the service changes the balance depending on whether semantic metadata is available.
It is not a full confidence-aware model yet, but it is a good practical compromise.
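One way to picture the mode switch is as two weight tables selected by a single flag. The sketch below is only illustrative: the factor names follow the list in this post, but `selectWeights` and every numeric weight are placeholders invented for the example, not VDO's real configuration.

```typescript
// Factor names follow the signals listed in this post.
type Factor = "title" | "tags" | "playlists" | "collection" | "category" | "description";
type Weights = Record<Factor, number>;

// Placeholder values: description similarity gets more weight when it is
// backed by embeddings.
const AI_WEIGHTS: Weights = {
  title: 0.15, tags: 0.15, playlists: 0.15,
  collection: 0.1, category: 0.1, description: 0.35,
};

// Placeholder values: explicit structure and titles carry more weight when
// description matching is only lexical.
const FALLBACK_WEIGHTS: Weights = {
  title: 0.2, tags: 0.2, playlists: 0.2,
  collection: 0.15, category: 0.15, description: 0.1,
};

function selectWeights(aiEnabled: boolean): Weights {
  return aiEnabled ? AI_WEIGHTS : FALLBACK_WEIGHTS;
}
```

Keeping both tables summing to the same total makes final scores roughly comparable across modes, although that is a design preference rather than something the post states.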
Normalization before combining scores
The related-videos service normalizes scores within each factor before combining them.
This is important because each factor produces scores differently:
- title similarity may come from cosine similarity over TF-IDF-like representations
- description similarity may come from embeddings or text overlap
- tags and playlists often behave more like overlap-based categorical signals
- collection and category are closer to binary or grouped signals
Without normalization, one factor could dominate simply because its raw score range is larger.
Normalization keeps the weighting meaningful.
Instead of mixing raw numbers from incompatible scales, the service first puts factor scores into a comparable range and then combines them.
That is one of the quiet but critical parts of the ranking design.
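The post does not say which normalization VDO applies, so the following is a sketch under an assumption: similarity scores are non-negative, and each factor is scaled by its own maximum so the best candidate in every factor scores 1. The function name `normalizeScores` is hypothetical.

```typescript
// Scale one factor's scores into [0, 1] by dividing by the factor's maximum.
// Assumes non-negative raw scores (an assumption, not confirmed by the post).
function normalizeScores(scores: Map<string, number>): Map<string, number> {
  let max = 0;
  scores.forEach((score) => {
    if (score > max) max = score;
  });

  const normalized = new Map<string, number>();
  scores.forEach((score, videoId) => {
    // If no candidate has a positive score, the factor carries no signal.
    normalized.set(videoId, max > 0 ? score / max : 0);
  });
  return normalized;
}
```

Dividing by the maximum keeps a raw score of zero at zero, which min-max scaling would not guarantee when every candidate has some overlap.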
Title similarity: lightweight lexical relevance
Title similarity is one of the cheapest useful signals.
VDO preprocesses titles by:
- converting to lowercase
- removing punctuation
- splitting into tokens
- removing stop words
From there, the service builds TF-IDF-style vectors and compares videos using cosine similarity.
This works well for titles because titles are usually:
- short
- relatively clean
- highly descriptive when present
- cheap to process across the whole library
Title similarity is especially useful when:
- descriptions are missing
- AI is disabled
- tags are sparse
- users use consistent naming conventions
It is not sufficient alone, but it gives good lexical grounding at low cost.
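The title pipeline described above can be sketched as follows. This is an illustration under stated assumptions rather than VDO's code: the stop-word list is a tiny placeholder, punctuation stripping is ASCII-only, the IDF uses a common smoothed form, and all function names are hypothetical.

```typescript
// Placeholder stop-word list; a real one would be longer.
const TITLE_STOP_WORDS = new Set(["a", "an", "and", "the", "of", "to", "in", "for", "on", "with"]);

function tokenizeTitle(title: string): string[] {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // ASCII-only punctuation stripping (simplification)
    .split(/\s+/)
    .filter((token) => token.length > 0 && !TITLE_STOP_WORDS.has(token));
}

type SparseVector = Map<string, number>; // term -> TF-IDF weight

function buildTfIdfVectors(titles: Map<string, string>): Map<string, SparseVector> {
  const tokensById = new Map<string, string[]>();
  const docFreq = new Map<string, number>();
  titles.forEach((title, videoId) => {
    const tokens = tokenizeTitle(title);
    tokensById.set(videoId, tokens);
    // Count each term once per title for document frequency.
    new Set(tokens).forEach((term) => docFreq.set(term, (docFreq.get(term) ?? 0) + 1));
  });

  const total = titles.size;
  const vectors = new Map<string, SparseVector>();
  tokensById.forEach((tokens, videoId) => {
    const vector: SparseVector = new Map();
    tokens.forEach((term) => vector.set(term, (vector.get(term) ?? 0) + 1)); // raw counts
    vector.forEach((tf, term) => {
      // Smoothed IDF: a term present in every title keeps a non-zero weight.
      const idf = Math.log((1 + total) / (1 + (docFreq.get(term) ?? 0))) + 1;
      vector.set(term, tf * idf);
    });
    vectors.set(videoId, vector);
  });
  return vectors;
}

function cosine(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((weight, term) => {
    normA += weight * weight;
    dot += weight * (b.get(term) ?? 0);
  });
  b.forEach((weight) => {
    normB += weight * weight;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}
```

Because titles are short, the sparse vectors stay small, which is what makes this signal cheap to compute across a whole library.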
Description similarity has two paths
Description similarity is more complicated than title similarity because descriptions are larger, noisier, and more expensive to compare at scale.
That is why VDO effectively supports two paths.
Path 1: semantic description similarity with embeddings
When AI is enabled, VDO uses stored description embeddings.
The process is roughly:
- fetch the embedding for each video
- build an in-memory map of videoId -> embedding
- compare each video against others using embedding similarity
- keep only the top related candidates
- store the resulting similarity map
This is handled through the embedding-driven related-videos path.
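The steps above might look roughly like this in TypeScript. It is a sketch, not the service's implementation: embeddings are assumed to be plain number arrays, cosine similarity is assumed as the comparison, and `topKByEmbedding` and `ScoredNeighbor` are hypothetical names. It compares every pair, so its cost grows quadratically with the number of videos that have embeddings.

```typescript
// Hypothetical result shape; VDO's stored format is not shown in this post.
interface ScoredNeighbor {
  videoId: string;
  score: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// For each video, score every other video and keep the k most similar.
function topKByEmbedding(
  embeddings: Map<string, number[]>,
  k: number,
): Map<string, ScoredNeighbor[]> {
  const ids = Array.from(embeddings.keys());
  const result = new Map<string, ScoredNeighbor[]>();
  for (const sourceId of ids) {
    const source = embeddings.get(sourceId) as number[];
    const scored: ScoredNeighbor[] = [];
    for (const otherId of ids) {
      if (otherId === sourceId) continue; // a video is not related to itself
      const score = cosineSimilarity(source, embeddings.get(otherId) as number[]);
      scored.push({ videoId: otherId, score });
    }
    scored.sort((x, y) => y.score - x.score);
    result.set(sourceId, scored.slice(0, k));
  }
  return result;
}
```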
The benefit is obvious: embeddings can identify semantic similarity even when two descriptions use different vocabulary.
That means videos can still be matched when they are conceptually similar but lexically different.
Examples:
- “JavaScript async patterns” and “understanding promises and event loops”
- “SQLite full-text search” and “fast metadata lookup”
- “vector similarity” and “semantic matching”
These might not overlap strongly by keyword, but embeddings can often place them close together.
Because of that, semantic description similarity is allowed to influence ranking more when AI is available, though it still remains balanced against user-created structure like tags, playlists, titles, collections, and categories.
Path 2: non-AI description similarity with keyword/token indexing
When AI is disabled, the service falls back to text-based description similarity.
There are two relevant ideas in the code here:
- a TF-IDF-based compatibility path
- an optimized inverted-index candidate-first path for larger libraries
TF-IDF compatibility path
There is still TF-IDF logic for descriptions in the service.
This is useful conceptually and can work well for smaller datasets. It computes document frequencies, term frequencies, and then applies cosine similarity between description vectors.
That gives a traditional lexical similarity model.
However, full pairwise TF-IDF similarity across long descriptions becomes expensive as the library grows.
So for larger or more practical indexing flows, the service uses a more optimized approach.
Inverted index optimization for description similarity
For larger libraries, VDO uses a candidate-first approach built around an inverted index.
This is one of the most important optimizations in the related-videos design.
Why not compare every description with every other description?
Because that becomes unnecessarily expensive.
If a library contains many videos, brute-force pairwise comparison grows quadratically with library size and wastes time comparing obviously unrelated items.
The optimization approach
The description indexing path does this instead:
- preprocess each description into a limited set of tokens
- build a videoId -> tokenSet map
- build an inverted index of term -> Set(videoId)
- for each source video, gather only candidates that share terms
- compute overlap-based scores only for those candidates
- keep the top N
That changes the problem from:
- compare everything with everything
to:
- compare only videos that have at least some lexical reason to be related
That is a major performance win.
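The index-building and candidate-gathering steps can be sketched as below. The type aliases and function names (`TokenSets`, `InvertedIndex`, `buildInvertedIndex`, `collectCandidates`) are assumptions for illustration, and the token sets are taken as already preprocessed.

```typescript
type TokenSets = Map<string, Set<string>>;     // videoId -> description tokens
type InvertedIndex = Map<string, Set<string>>; // term -> videoIds containing it

function buildInvertedIndex(tokenSets: TokenSets): InvertedIndex {
  const index: InvertedIndex = new Map();
  tokenSets.forEach((tokens, videoId) => {
    tokens.forEach((term) => {
      let posting = index.get(term);
      if (!posting) {
        posting = new Set<string>();
        index.set(term, posting);
      }
      posting.add(videoId);
    });
  });
  return index;
}

// Gather only the videos that share at least one term with the source video.
function collectCandidates(
  sourceId: string,
  tokenSets: TokenSets,
  index: InvertedIndex,
): Set<string> {
  const candidates = new Set<string>();
  const sourceTokens = tokenSets.get(sourceId);
  if (!sourceTokens) return candidates;
  sourceTokens.forEach((term) => {
    const posting = index.get(term);
    if (!posting) return;
    posting.forEach((videoId) => {
      if (videoId !== sourceId) candidates.add(videoId);
    });
  });
  return candidates;
}
```

Videos with no shared terms never enter the candidate set, so they are never scored at all.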
Preprocessing details matter
The optimization works because the preprocessing is deliberately constrained.
Descriptions are normalized by:
- lowercasing
- stripping punctuation
- splitting on whitespace
- removing stop words
Then the token list is capped using a maximum unique-token limit.
This matters for performance and memory.
Descriptions can be very long, especially if they are AI-generated, transcript-derived, or imported from metadata-rich sources. Keeping only a bounded number of unique tokens prevents:
- huge token maps
- oversized inverted indexes
- candidate explosion from noisy text
- disproportionate influence from very long descriptions
This is a practical desktop-app optimization rather than an academic one, and it is exactly the kind of tradeoff that keeps indexing usable.
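A bounded preprocessing step could look like the sketch below. Several details are assumptions the post does not specify: the stop-word list is a short placeholder, punctuation is replaced with spaces using an ASCII-only pattern, and the cap keeps the first unique tokens in order of appearance. `preprocessDescription` is a hypothetical name.

```typescript
// Placeholder stop-word list for the example.
const DESCRIPTION_STOP_WORDS = new Set([
  "a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with", "this", "that",
]);

// Normalize a description and keep at most maxUniqueTokens unique tokens.
function preprocessDescription(text: string, maxUniqueTokens: number): Set<string> {
  const tokens = new Set<string>();
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // ASCII-only punctuation stripping (simplification)
    .split(/\s+/);

  for (const word of words) {
    if (tokens.size >= maxUniqueTokens) break; // cap reached: ignore the rest
    if (word.length === 0 || DESCRIPTION_STOP_WORDS.has(word)) continue;
    tokens.add(word);
  }
  return tokens;
}
```

Stopping at the cap also bounds the time spent on a single very long description, not just the memory it occupies in the index.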
Candidate-first scoring nuances
The candidate-first scoring path has a few important nuances.
1. It filters overly broad terms
The service supports limits on very large posting lists during candidate generation.
This helps avoid terms that appear in too many videos.
Even after stop-word filtering, some terms can still be too common to be useful. If a term points to a large portion of the library, it is usually a poor candidate-generation signal.
So those terms can be capped or skipped during candidate generation.
That keeps the inverted index useful rather than noisy.
2. It falls back when filtering becomes too aggressive
If all terms get filtered out for a video, the service falls back to using all terms.
That is a small but very important detail.
Without this fallback, an aggressively filtered video could end up with no candidates at all, even when some good matches exist.
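The filter-then-fall-back behavior from the first two points fits in a few lines. In this sketch, `selectCandidateTerms` and the `maxPostingSize` threshold are hypothetical names, and whether VDO skips or caps oversized posting lists is not specified here; the example simply skips them.

```typescript
// Choose which of a video's terms to use for candidate generation.
// Terms whose posting list exceeds maxPostingSize are considered too broad.
function selectCandidateTerms(
  terms: Set<string>,
  index: Map<string, Set<string>>,
  maxPostingSize: number,
): string[] {
  const all = Array.from(terms);
  const selective = all.filter(
    (term) => (index.get(term)?.size ?? 0) <= maxPostingSize,
  );
  // Fallback: if filtering removed every term, use all of them so the
  // video can still find candidates.
  return selective.length > 0 ? selective : all;
}
```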
3. It uses set overlap rather than full weighted vector scoring
For candidate scoring in the optimized path, the service computes overlap between token sets and uses a Jaccard-style score.
This is cheaper than full vector comparison and works reasonably well once the candidate set has already been narrowed.
So the optimization is not just “index first.” It also uses a scoring method that is simple enough to keep the process lightweight.
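For reference, a plain Jaccard score over two token sets is shown below. The post says "Jaccard-style", so the service's exact formula may differ; this sketch uses the textbook definition, intersection size divided by union size.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range [0, 1].
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 || b.size === 0) return 0;
  // Iterate the smaller set to minimize lookups.
  const small = a.size <= b.size ? a : b;
  const large = a.size <= b.size ? b : a;
  let intersection = 0;
  small.forEach((token) => {
    if (large.has(token)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}
```

For example, the sets {sqlite, index, search} and {index, search, cache} share two of four distinct tokens, giving a score of 0.5.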
Collections and categories as structural anchors
Collections and categories are related, but they serve different purposes.
Collections
Collections are user- or library-level organizational groupings. They often reflect where a video belongs within the broader library structure.
Categories
Categories describe the type of content itself.
A category can help answer questions like:
- Is this a home video?
- Is this a movie?
- Is this a documentary?
- Is this a music video?
- Is this a podcast?
- Is this an educational video?
- Is this a short-form clip?
- Is this a travel vlog?
- Is this surveillance footage?
That makes category a strong structural relevance signal, especially when:
- descriptions are unavailable
- tags are inconsistent
- titles are vague
- semantic metadata has not been generated yet
In practice, category helps preserve content-type coherence in recommendations.
A surveillance clip should not be considered strongly related to a movie just because both happen to share a few words in the title or description.
Likewise, a short-form clip and a full documentary may overlap lexically, but category can still help keep rankings sensible.
Similarity maps are generated ahead of time
One of the most important design choices in VDO is that related-video results are generated and stored ahead of time, not computed on demand every time a user opens a video.
That matters for user experience.
Instead of doing all similarity work at read time, the service:
- collects all videos
- computes per-factor similarity maps
- merges them into final related-video rankings
- stores the results in the database
Later, the app can fetch related videos directly by videoId.
This gives much faster access during normal usage.
For a desktop application, this is a strong design choice because it shifts expensive work away from interactive UI moments.
When indexes are regenerated
Related-video indexes are not rebuilt synchronously on every small change.
Instead, the service uses delayed indexing.
When relevant data changes, indexing is scheduled after a short delay rather than immediately re-running the whole pipeline for every update.
That helps in situations like:
- importing many videos
- updating metadata in batches
- tag changes happening in bursts
- category reassignment for multiple videos
- AI description generation finishing for multiple videos
- playlist updates coming close together
This debounce-like behavior reduces unnecessary work and avoids repeated full reindexing during active update periods.
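A debounce-style scheduler of this kind can be sketched as follows. The class name `DelayedIndexer`, the `Timers` interface, and the delay value are all assumptions for illustration; the timer functions are injectable only so the behavior can be exercised without waiting on real time.

```typescript
// Minimal timer abstraction so the scheduler can be driven by fake timers.
interface Timers {
  set(callback: () => void, delayMs: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: Timers = {
  set: (callback, delayMs) => setTimeout(callback, delayMs),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class DelayedIndexer {
  private pending: unknown = null;
  private readonly runIndexing: () => void;
  private readonly delayMs: number;
  private readonly timers: Timers;

  constructor(runIndexing: () => void, delayMs: number, timers: Timers = realTimers) {
    this.runIndexing = runIndexing;
    this.delayMs = delayMs;
    this.timers = timers;
  }

  // Call on every relevant change; only the last call in a burst triggers a run.
  schedule(): void {
    if (this.pending !== null) this.timers.clear(this.pending);
    this.pending = this.timers.set(() => {
      this.pending = null;
      this.runIndexing();
    }, this.delayMs);
  }
}
```

With this shape, a burst of tag, playlist, and category updates results in a single indexing run once the updates stop arriving for `delayMs` milliseconds.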
Why delayed indexing is useful
Without delayed indexing, the system could re-run related-video generation too often.
For example, imagine importing a batch of videos:
- first the video rows are added
- then categories are assigned
- then tags arrive
- then playlists update
- then descriptions are generated
- then embeddings are added
If indexing ran after each micro-update, the service would waste a lot of effort producing incomplete intermediate results.
By waiting briefly and then indexing, the system gets:
- fewer redundant runs
- more complete data per run
- less UI disruption
- better overall throughput
This is especially important in a desktop app where the CPU is shared directly with the user’s machine.
Storage for fast retrieval
After generation, the final related-video sets are saved.
That means the read path is simple:
- ask for related videos by videoId
- return the stored result
This is much better than recomputing from titles, descriptions, tags, playlists, collections, categories, and embeddings every time the user opens a detail view.
The indexing step is the expensive part. The read step should be cheap.
This separation is one of the strongest practical aspects of the design.
Caching and invalidation strategy
There are two layers of caching and invalidation worth understanding here.
1. Metadata and embedding caching
The metadata service keeps caches for things like:
- metadata records
- embeddings
- short descriptions
That helps because metadata, especially embeddings, can be expensive to parse and repeatedly fetch.
When metadata changes, the corresponding cache entries are invalidated.
That includes updates such as:
- description changes
- generated metadata updates
- embedding updates
- transcript updates
This ensures related-video generation uses fresh data without forcing the entire metadata layer to be uncached all the time.
2. Related-video map invalidation by regeneration
For related videos themselves, the strategy is less about fine-grained cache invalidation and more about regenerating and replacing stored results after data changes.
The service also tracks the previously generated related-videos map and compares it to the updated one before writing everything back.
That reduces unnecessary writes and avoids updating unchanged rows.
This is a very practical optimization.
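The compare-before-write idea can be illustrated with a small diff function. The map shape (videoId to an ordered list of related videoIds) and the names `RelatedMap` and `diffRelatedMaps` are assumptions for this sketch; the stored rows in VDO may carry more than ids.

```typescript
type RelatedMap = Map<string, string[]>; // videoId -> ordered related videoIds

function sameList(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((id, i) => id === b[i]);
}

// Return only the entries that need writing, plus ids whose rows should be removed.
function diffRelatedMaps(
  previous: RelatedMap,
  next: RelatedMap,
): { changed: RelatedMap; removed: string[] } {
  const changed: RelatedMap = new Map();
  next.forEach((related, videoId) => {
    const before = previous.get(videoId);
    if (!before || !sameList(before, related)) changed.set(videoId, related);
  });

  const removed: string[] = [];
  previous.forEach((_related, videoId) => {
    if (!next.has(videoId)) removed.push(videoId);
  });
  return { changed, removed };
}
```

Only the `changed` and `removed` entries then need to touch the database, so a regeneration that produces mostly identical results writes very little.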
Why embeddings use a separate cache
Embeddings are treated differently from ordinary metadata because they are larger and accessed differently.
That is exactly the right call.
A normal metadata cache and an embedding cache do not behave the same way:
- descriptions are text and often read with other metadata
- embeddings are vector payloads and used for similarity workflows
- embeddings consume more memory
- embedding access patterns are more specialized
Keeping them separate avoids polluting general metadata caching with large vector payloads.
In a feature like related videos, that separation helps keep memory behavior more predictable.
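As an illustration of keeping the two caches apart, the sketch below uses one small load-through cache class and two independent instances. Everything here is hypothetical: the `LoadThroughCache` class, the `VideoMetadata` shape, the use of `Float32Array` for vectors, and the two change handlers are assumptions, not the metadata service's real API.

```typescript
// Hypothetical metadata shape for the example.
interface VideoMetadata {
  description: string;
  shortDescription?: string;
}

class LoadThroughCache<T> {
  private readonly entries = new Map<string, T>();

  // Return the cached value, or load it once and remember it.
  get(videoId: string, load: (videoId: string) => T): T {
    if (this.entries.has(videoId)) return this.entries.get(videoId) as T;
    const value = load(videoId);
    this.entries.set(videoId, value);
    return value;
  }

  invalidate(videoId: string): void {
    this.entries.delete(videoId);
  }
}

// Separate instances: large vectors never crowd the general metadata cache.
const metadataCache = new LoadThroughCache<VideoMetadata>();
const embeddingCache = new LoadThroughCache<Float32Array>();

// Assumed handlers: each kind of update drops only the entries it affects.
function onDescriptionChanged(videoId: string): void {
  metadataCache.invalidate(videoId);
}

function onEmbeddingChanged(videoId: string): void {
  embeddingCache.invalidate(videoId);
}
```

Separate instances also make it possible to give each cache its own size limit or eviction policy later, which a single shared cache would not allow.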
Why the service supports both AI and non-AI paths
The answer is not just feature flexibility. It is also operational stability.
In a real video library:
- some videos may not have generated descriptions yet
- some may not have embeddings yet
- AI may be disabled by user preference
- AI enrichment may happen later than import
- old libraries may contain sparse metadata
- category may exist even when richer metadata does not
If related-videos depended entirely on embeddings, the feature would feel inconsistent.
By supporting all three of the following:
- embedding-based semantic similarity
- text/index-based fallback similarity
- structural matching through collections, tags, playlists, and categories
VDO ensures the feature is still usable across different library states.
That is one of the most practical aspects of the design.
Why title, tags, playlists, collections, and categories still matter in AI mode
A common mistake in AI-assisted systems is to let semantic similarity dominate everything.
That often creates recommendations that are technically plausible but less useful in the context of how a user organizes their library.
In VDO, user-created and structural organization still matters.
Tags
Tags reflect explicit labeling and intent.
Playlists
Playlists often represent workflow, grouping, or curated context.
Collections
Collections are broader structural groupings and can provide strong relevance.
Categories
Categories provide high-level content-type alignment such as Movies, Documentaries, Educational, Shorts, Music Videos, or Surveillance.
Titles
Titles remain a cheap and often accurate lexical signal.
Even when AI is enabled, those signals continue to anchor the ranking in the user’s actual library organization and content type.
That is why the score is not “embedding only.”
Current tradeoff: simple weighting vs richer layering
At the moment, the service uses a direct weighted combination model.
That is a reasonable choice, but it is worth being honest about the tradeoff.
Current approach advantages
- simpler implementation
- easy to tune per factor
- low conceptual overhead
- fast enough for practical local indexing
- works well with precomputed similarity maps
Current approach limitations
- no formal structural vs lexical vs semantic grouping
- semantic confidence is not modeled explicitly
- partial AI coverage is handled indirectly rather than adaptively
- some factors may deserve grouped treatment in the future
A layered strategy could improve this by making the ranking logic more expressive.
For example:
- first combine structural signals like collections, categories, tags, and playlists
- then combine lexical signals like titles and token-based descriptions
- then apply semantic influence conditionally
That would likely produce a more stable and explainable model.
But it would also add complexity:
- more parameters
- more tuning cost
- more debugging overhead
- more opportunities for edge cases
So this is not a free improvement.
For a local-first app, the current design is a sensible middle ground.
A realistic next step, not a total rewrite
The most practical future direction would not be replacing the current system entirely.
It would be incrementally improving it.
Some possible next steps:
1. Add confidence-aware semantic weighting
Instead of only shifting weights globally based on AI enabled or disabled mode, also consider whether a given video actually has reliable embedding data.
2. Introduce grouped scoring internally
Keep the external behavior simple, but internally group:
- structural factors
- lexical factors
- semantic factors
3. Store more diagnostic score breakdowns
This would help debugging and tuning by showing why a related video ranked highly.
4. Further optimize large-library candidate generation
The inverted-index approach is already a strong step, but it could be extended with:
- rarer-term prioritization
- better candidate pruning
- adaptive limits based on library size
5. Incremental reindexing
Instead of rebuilding all relationships every time, update only the affected subset when possible.
6. Category-aware weighting refinements
Not all categories behave the same. For example:
- Surveillance may need stronger same-category weighting
- Movies and Documentaries may benefit more from metadata-rich similarity
- Educational may benefit from title and description quality
- Shorts may need different ranking behavior because content is brief and often noisier
- Home & Family may rely more on collection, titles, and dates
- Others may need weaker category influence because it is intentionally broad
That said, even without those changes, the current implementation already reflects a thoughtful set of engineering tradeoffs.
The practical philosophy behind the feature
If I had to summarize the related-videos design in VDO in one sentence, it would be this:
make related videos useful in every environment, and better when richer metadata is available
That philosophy shows up everywhere in the implementation:
- title similarity for lightweight lexical relevance
- tags, playlists, collections, and categories for explicit structure
- embeddings when AI is enabled
- token-indexed description matching when AI is not
- precomputed similarity maps for speed
- delayed regeneration for efficiency
- storage-based retrieval for responsiveness
- caching and invalidation to keep metadata fresh
It is not trying to be an academic recommendation engine.
It is trying to be a smart, dependable local feature for a real media library.
And for a desktop app, that is exactly the right priority.
Closing
Designing related videos is really a balancing exercise between quality, consistency, and cost.
In VDO, the design chooses:
- multiple relevance signals instead of one magic score
- different ranking balance for AI and non-AI environments
- semantic similarity where available
- text-based fallback where necessary
- category-aware structural relevance
- indexing ahead of time instead of recomputing on demand
- optimizations like inverted indexes and token caps to keep large libraries manageable
That combination gives the feature a strong practical foundation.
It may evolve toward more layered or confidence-aware ranking over time, but even in its current form, it captures an important principle:
recommendation quality improves when you combine user structure, category alignment, lexical similarity, semantic metadata, and operational pragmatism.