It’s not difficult to hack this together with CLIP. I did this with about a tenth of my movie collection last week on a GTX 1080 - though CLIP has no temporal understanding, so you have to do the scene analysis yourself
I'm guessing you're not storing a CLIP embedding for every single frame, but rather one every second or so? Also, are you using cosine similarity? How are you finding the nearest vector?
Sure. I had a lot of help from Claude Opus 4.5, but it was roughly:
- Using PySceneDetect to split each video at the scene level (first sketch below)
- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a fixed sample rate (I don't have the exact rate handy right now, but it was 1-2 frames per scene; second sketch below)
- Batching frames in groups of around 256 and normalizing them on the GPU for CLIP embedding (I had to rewrite the normalization step for this because the default preprocessing runs on the CPU; third sketch below)
- Uploading the frame embeddings along with metadata (timestamp, etc.) into a vector DB, in my case Qdrant running locally, plus a screenshot of the frame itself for debugging (fourth sketch below)
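First, the scene splitting. A minimal sketch using PySceneDetect's detect() API with the default content detector ("movie.mkv" is a placeholder path):

    # Split a video into scenes; returns a list of (start, end) timecode pairs.
    from scenedetect import detect, ContentDetector

    scene_list = detect("movie.mkv", ContentDetector())  # placeholder path
    for start, end in scene_list:
        print(f"scene: {start.get_seconds():.1f}s -> {end.get_seconds():.1f}s")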
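Second, frame sampling with decord. This sketch samples two evenly spaced frames per scene, which may differ from the exact rate I used, and reuses scene_list from above:

    from decord import VideoReader, cpu
    import numpy as np

    vr = VideoReader("movie.mkv", ctx=cpu(0))  # same placeholder path
    fps = vr.get_avg_fps()

    def sample_indices(start_s, end_s, per_scene=2):
        # Evenly spaced timestamps strictly inside the scene, mapped to frame indices.
        ts = np.linspace(start_s, end_s, per_scene + 2)[1:-1]
        return [min(int(t * fps), len(vr) - 1) for t in ts]

    indices = [i for s, e in scene_list
               for i in sample_indices(s.get_seconds(), e.get_seconds())]
    frames = vr.get_batch(indices).asnumpy()  # (N, H, W, 3) uint8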
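Third, the GPU normalization, which was the non-obvious part. A rough sketch assuming open_clip: the stock preprocess transform runs per-image on the CPU, so this redoes the resize and normalization as batched tensor ops on the GPU instead (using CLIP's standard mean/std constants; note it skips the center crop the stock transform applies):

    import torch
    import torch.nn.functional as F
    import open_clip

    device = "cuda"
    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    model = model.to(device).eval()

    # CLIP's standard normalization constants, moved to the GPU once.
    CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073],
                             device=device).view(1, 3, 1, 1)
    CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711],
                            device=device).view(1, 3, 1, 1)

    @torch.no_grad()
    def embed_batch(frames_u8):
        # frames_u8: (N, H, W, 3) uint8 numpy array straight from decord.
        x = torch.from_numpy(frames_u8).to(device)
        x = x.permute(0, 3, 1, 2).float() / 255.0   # NHWC uint8 -> NCHW float
        x = F.interpolate(x, size=(224, 224), mode="bicubic",
                          align_corners=False)
        x = (x - CLIP_MEAN) / CLIP_STD              # normalize the whole batch on GPU
        emb = model.encode_image(x)
        return emb / emb.norm(dim=-1, keepdim=True) # unit vectors for cosine similarity

Then it's just looping over the sampled frames in chunks of ~256 and calling embed_batch on each chunk.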
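Fourth, Qdrant, which also answers the cosine similarity question: the collection is created with cosine distance and the vectors are unit-normalized, so nearest-vector search is just a query against the index and Qdrant handles the ANN part. A sketch reusing model/device from above; the collection name and payload fields are made up:

    import torch
    import open_clip
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    client = QdrantClient(host="localhost", port=6333)
    client.create_collection(                      # run once
        collection_name="movie_frames",            # hypothetical name
        vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # ViT-B/32 -> 512-d
    )

    # records: hypothetical list of (embedding, timestamp_seconds, screenshot_path)
    points = [
        PointStruct(id=i, vector=emb.tolist(),
                    payload={"movie": "Oppenheimer", "timestamp_s": ts,
                             "thumb_path": thumb})
        for i, (emb, ts, thumb) in enumerate(records)
    ]
    client.upsert(collection_name="movie_frames", points=points)

    # Text search: embed the query with CLIP's text tower, unit-normalize, query.
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    with torch.no_grad():
        q = model.encode_text(tokenizer(["a man testifying at a hearing"]).to(device))
        q = (q / q.norm(dim=-1, keepdim=True)).squeeze(0).cpu().tolist()

    hits = client.search(collection_name="movie_frames", query_vector=q, limit=5)
    for h in hits:
        print(h.payload["movie"], h.payload["timestamp_s"], h.score)

The timestamp in the payload is what lets a hit turn back into a seek position in the original file.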
I'm bottlenecked by GPU compute, so I also started experimenting with Modal for the embedding work, but then vacation ended :) Might pick it up again in a few weeks. I'd like a temporal-aware and potentially enriched search, so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and get back a timestamped clip from the movie.