It’s not difficult to hack this together with CLIP. I did this with about a tenth of my movie collection last week on a GTX 1080 - though CLIP has no temporal understanding, so you have to do the scene analysis yourself
I'm guessing you're not storing a CLIP embedding for every single frame, but rather one every second or so? Also, are you using cosine similarity? How are you finding the nearest vector?
Sure. I had a lot of help from Claude Opus 4.5, but it was roughly:
- Using PySceneDetect to split each video at the scene level (first sketch below)
- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a fixed sample rate (I don't have the exact rate handy right now, but it was 1-2 frames per scene; second sketch below)
- Batching frames in groups of around 256 and normalizing them on the GPU for CLIP embedding (I had to rewrite the normalization step for this because the default preprocessing runs on the CPU; third sketch below)
- Uploading the frame embeddings along with metadata (timestamp, etc.) into a vector DB, in my case Qdrant running locally, plus a screenshot of the frame itself for debugging (fourth sketch below)
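First, the scene splitting. A minimal sketch using PySceneDetect's detect() API with the default content detector ("movie.mkv" is a placeholder path):

    # Split a video into scenes; returns a list of (start, end) timecode pairs.
    from scenedetect import detect, ContentDetector

    scene_list = detect("movie.mkv", ContentDetector())  # placeholder path
    for start, end in scene_list:
        print(f"scene: {start.get_seconds():.1f}s -> {end.get_seconds():.1f}s")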
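Second, frame sampling with decord. This sketch samples two evenly spaced frames per scene, which may differ from the exact rate I used, and reuses scene_list from above:

    from decord import VideoReader, cpu
    import numpy as np

    vr = VideoReader("movie.mkv", ctx=cpu(0))  # same placeholder path
    fps = vr.get_avg_fps()

    def sample_indices(start_s, end_s, per_scene=2):
        # Evenly spaced timestamps strictly inside the scene, mapped to frame indices.
        ts = np.linspace(start_s, end_s, per_scene + 2)[1:-1]
        return [min(int(t * fps), len(vr) - 1) for t in ts]

    indices = [i for s, e in scene_list
               for i in sample_indices(s.get_seconds(), e.get_seconds())]
    frames = vr.get_batch(indices).asnumpy()  # (N, H, W, 3) uint8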
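Third, the GPU normalization, which was the non-obvious part. A rough sketch assuming open_clip: the stock preprocess transform runs per-image on the CPU, so this redoes the resize and normalization as batched tensor ops on the GPU instead (using CLIP's standard mean/std constants; note it skips the center crop the stock transform applies):

    import torch
    import torch.nn.functional as F
    import open_clip

    device = "cuda"
    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    model = model.to(device).eval()

    # CLIP's standard normalization constants, moved to the GPU once.
    CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073],
                             device=device).view(1, 3, 1, 1)
    CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711],
                            device=device).view(1, 3, 1, 1)

    @torch.no_grad()
    def embed_batch(frames_u8):
        # frames_u8: (N, H, W, 3) uint8 numpy array straight from decord.
        x = torch.from_numpy(frames_u8).to(device)
        x = x.permute(0, 3, 1, 2).float() / 255.0   # NHWC uint8 -> NCHW float
        x = F.interpolate(x, size=(224, 224), mode="bicubic",
                          align_corners=False)
        x = (x - CLIP_MEAN) / CLIP_STD              # normalize the whole batch on GPU
        emb = model.encode_image(x)
        return emb / emb.norm(dim=-1, keepdim=True) # unit vectors for cosine similarity

Then it's just looping over the sampled frames in chunks of ~256 and calling embed_batch on each chunk.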
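Fourth, Qdrant, which also answers the cosine similarity question: the collection is created with cosine distance and the vectors are unit-normalized, so nearest-vector search is just a query against the index and Qdrant handles the ANN part. A sketch reusing model/device from above; the collection name and payload fields are made up:

    import torch
    import open_clip
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    client = QdrantClient(host="localhost", port=6333)
    client.create_collection(                      # run once
        collection_name="movie_frames",            # hypothetical name
        vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # ViT-B/32 -> 512-d
    )

    # records: hypothetical list of (embedding, timestamp_seconds, screenshot_path)
    points = [
        PointStruct(id=i, vector=emb.tolist(),
                    payload={"movie": "Oppenheimer", "timestamp_s": ts,
                             "thumb_path": thumb})
        for i, (emb, ts, thumb) in enumerate(records)
    ]
    client.upsert(collection_name="movie_frames", points=points)

    # Text search: embed the query with CLIP's text tower, unit-normalize, query.
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    with torch.no_grad():
        q = model.encode_text(tokenizer(["a man testifying at a hearing"]).to(device))
        q = (q / q.norm(dim=-1, keepdim=True)).squeeze(0).cpu().tolist()

    hits = client.search(collection_name="movie_frames", query_vector=q, limit=5)
    for h in hits:
        print(h.payload["movie"], h.payload["timestamp_s"], h.score)

The timestamp in the payload is what lets a hit turn back into a seek position in the original file.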
I'm bottlenecked by GPU compute, so I also started experimenting with Modal for the embedding work, but then vacation ended :) Might pick it up again in a few weeks. I'd like a temporal-aware and potentially enriched search, so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and get back a timestamped clip from the movie.