Video-LLaVA (github.com/pku-yuangroup)
234 points by tosh on Nov 21, 2023 | 45 comments


Researchers seem very comfortable sticking "Apache 2.0" licenses all over their foundation model finetunes.

This model is absolutely not Apache 2.0 in reality (it's a Vicuna finetune, never mind the sourcing of the finetune dataset), and you would use it for business at your peril.


It's unlikely that model weights are even copyrightable in the first place, at least based on what the US Copyright Office has said in the past. In fact, it seems that even the outputs of such models are excluded from copyright protection, regardless of the skill involved in their production[1]. IANAL, but it's likely that the license documents attached to these large language models amount to little more than somewhat weak ToS agreements, allowing the distributor a bit more leeway in managing their legal/commercial relationship with users of said models.

At the end of the day, though, this is all mere speculation. We won't know the truth until someone decides to burn the cash necessary to test it in court.

[1] https://fingfx.thomsonreuters.com/gfx/legaldocs/klpygnkyrpg/...


It's going to be really ironic when the same people who think copyright law doesn't apply to their deep-learning models get all pissy over people violating their IP.


> even the outputs of such models are excluded from copyright protection, regardless of the skill involved in their production

Not “even” - the case for model outputs not being copyrightable is much more straightforward than the case for the weights.


Looks like the Vicuna repo is Apache 2.0 also[1].

What's the interpretation of copyright law that would prevent the code being Apache 2.0 based on the source of the fine-tuning dataset?

[1] https://github.com/lm-sys/FastChat


Not quite: FastChat is the inference code, which is Apache 2.0 but distinct from the model artifact. If you look at the model [0], it is licensed as non-commercial.

But why?

Well, for one, Vicuna is a Llama finetune, which already excludes it from being Apache 2.0. It's also finetuned on OAI data, which is... questionable in terms of license (I don't think you can really legally license a model trained on OAI output as Apache 2.0 - although OAI doesn't really play by its own rules, so who knows).

[0]: https://huggingface.co/lmsys/vicuna-13b-v1.3


Which part of copyright law are model weights governed by? (Or, if not by copyright law, what's the legal basis that would let you choose a "license" for model weights?)


Weights may or may not be subject to copyright law.

Are they a mere aggregation of facts (some uncopyrightable, some from other sources), or is there a creative component to them?

https://libraries.emory.edu/research/copyright/copyright-dat...


There's no basis I can think of for models to be copyrightable; licences on models are legal fictions for now, until there's a new addition to copyright law.


It has not yet been challenged, so we don’t really know yet.

Personally I will be surprised if weights aren’t considered to be licensable - courts are a lot more practical than one might expect.


Tbf the Llama license allows for small-business usage.

But also these models aren’t watermarked or anything (not that watermarking really works) so it’s kind of the wild west


Llama 2 does, but Llama does not. Vicuna is based on Llama.


Not sure why you are downvoted, you are correct.


When I see unenforceable conditions I don't point them out; I just accept and plan to make enough money for a Federal appeals court.

That fits my risk tolerance


Fine-tuning the weights scrambles the original representations (sometimes more than others, depending on training settings, but if you train the text encoder it certainly will). All the authors would have to do is not be honest about the original model it was fine-tuned on, in a world where lawyers start to come down on this.

I see no issue for businesses using it.


I don't know - it sounds like your default assumption is that there is no issue because businesses can commit copyright infringement/fraud and not get caught. I am not a lawyer, so I can't comment on the merits of that approach.

Generally I think it is difficult for businesses to break the law given that any one of the members might defect on you.

Also I suspect that the logprobs for various sequences would reveal which foundation model you used.
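
To illustrate what I mean - a rough sketch, with placeholder model names, assuming both checkpoints share a tokenizer (as Llama-family fine-tunes do):

    # Hypothetical sketch: compare the per-token log-probabilities that a
    # suspect fine-tune and a candidate base model assign to the same text.
    # The model names below are placeholders, not a real forensic procedure.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sequence_logprobs(model_name: str, text: str) -> torch.Tensor:
        """Per-token log-probabilities the model assigns to `text`."""
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        return logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    text = "The quick brown fox jumps over the lazy dog."
    a = sequence_logprobs("suspect/finetuned-model", text)   # placeholder name
    b = sequence_logprobs("meta-llama/Llama-2-7b-hf", text)  # candidate base
    # A fine-tune tends to stay highly correlated with its base model.
    print(torch.corrcoef(torch.stack([a.flatten(), b.flatten()]))[0, 1])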


I do not think it has been determined that weights are copyrightable


That's not the issue here, regardless of if it's true or not.

The issue here is that if you have some upstream with license XXX on it, you can most certainly slap some YYY license on your repo, but that doesn't make it true.

This is like when someone pushes up a corporate private repo and puts an MIT license file on the repo.

Releasing code as GPL when you don't have the authority to do so does not make the code GPL.

Certainly, as a consumer you can say that you 'didn't know' it was not actually MIT, but that doesn't absolve you from liability; as a consumer, you are required to do your due diligence, and if you find that you were misled / mistaken / whatever, to take steps to remediate it.

...otherwise you're liable.

It's that simple.


I think the parent comment is implying that they don't believe they are liable for infringement since, based on their understanding of the law, model weights are not copyrightable in the first place, and until a court actually rules on the issue, there's nothing to remediate.


Yes, thank you. I can put an xxx license on a math equation, but everyone can ignore it because the equation isn't subject to licensing, regardless of who made it or when. I believe the same is true for weights (at least in some places; tbd in others).


It is my understanding that you can compare weights and, with a high degree of confidence, determine what the parent of a model is, unless the fine-tuning destroyed all the original information, in which case there wasn't a huge reason to fine-tune to begin with. There are other ways to scramble weights that make the comparison a lot harder to do, though, which will matter if weights are ever considered copyrightable.
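
Something along these lines, roughly - a minimal sketch with placeholder checkpoint names; a real analysis would look layer by layer rather than at one global number:

    # Rough sketch: measure how close a fine-tune's weights stay to a
    # candidate parent checkpoint. Checkpoint names are placeholders.
    import torch
    from transformers import AutoModelForCausalLM

    def weight_similarity(model_a: str, model_b: str) -> float:
        a = AutoModelForCausalLM.from_pretrained(model_a, torch_dtype=torch.float32)
        b = AutoModelForCausalLM.from_pretrained(model_b, torch_dtype=torch.float32)
        sims = []
        with torch.no_grad():
            for (_, p_a), (_, p_b) in zip(a.named_parameters(), b.named_parameters()):
                if p_a.shape != p_b.shape:
                    continue  # architectures must line up for a direct comparison
                sims.append(torch.cosine_similarity(p_a.flatten(), p_b.flatten(), dim=0))
        return torch.stack(sims).mean().item()

    # Fine-tunes typically sit very close to their parent (similarity near 1.0),
    # while unrelated checkpoints of the same architecture do not.
    print(weight_similarity("suspect/finetuned-model", "meta-llama/Llama-2-7b-hf"))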


This is a very cool project! Kudos to the authors for staying on top of it and keeping the features coming. It appears to be feature-competitive with OpenAI's GPT-4V `vision` endpoint.


It has more features, actually, because it does video. But the quality is far, far behind GPT-4 Vision (I just tried a bunch of images on it and compared to the GPT-4 output). This feels more GPT-3 level. I'm glad someone is working on it though!


Demo just errors out unfortunately


I honestly have no idea what this project is about. It may be because I'm completely out of the loop regarding LLMs but still...


Open source question answering over videos:

> With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
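
Conceptually it's the LLaVA recipe extended to video: features from a shared visual encoder get projected into the LLM's token-embedding space and fed in alongside the text tokens. A hand-wavy toy sketch of that idea (not the authors' actual code; dimensions and module names are made up for illustration):

    import torch
    import torch.nn as nn

    class ToyVisualLLM(nn.Module):
        def __init__(self, vis_dim=1024, llm_dim=4096, vocab=32000):
            super().__init__()
            self.projector = nn.Linear(vis_dim, llm_dim)   # visual -> language feature space
            self.tok_embed = nn.Embedding(vocab, llm_dim)  # stand-in for the LLM's embeddings
            self.llm = nn.TransformerEncoder(              # stand-in for the LLM itself
                nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)

        def forward(self, visual_feats, text_ids):
            # visual_feats: (batch, n_patches_or_frames, vis_dim) from a shared
            # image/video encoder; because images and videos land in the same
            # feature space, one projector serves both modalities.
            vis_tokens = self.projector(visual_feats)
            txt_tokens = self.tok_embed(text_ids)
            return self.llm(torch.cat([vis_tokens, txt_tokens], dim=1))

    model = ToyVisualLLM()
    out = model(torch.randn(1, 256, 1024), torch.randint(0, 32000, (1, 16)))
    print(out.shape)  # torch.Size([1, 272, 4096])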


Thanks


It will answer questions about images and/or videos; it's open-source and would compete with some of ChatGPT-4's advanced features. It's extremely poorly explained on the GitHub page because it's trying to get interest from other AI researchers with ~256 buzzwords. Great if you already know what it is, extremely unhelpful if you don't.

It seems quite good; ironic that the GitHub landing page communicates the idea so poorly.


I had no idea from the name, but the README does a good job of explaining what it's about. Even has a nice video demo.


Does it? I have tried to read the README and I still can't figure out what it does. There's also so much random stuff smashed into the README that just trying to figure out where to get started...reading...is an exercise in frustration.

The Title:

> "Video-LLaVA: Learning United Visual Representation by Alignment Before Projection"

I know all of those words, but I don't understand what they mean in that order or context. Let's move on.

Next up is a bunch of links to other pages, other projects, and news about the project. Let's skip all that.

Finally we get to something called "Highlights":

> "Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset."

OK, so now I know that it does something with images and videos, although I am not sure what that something is. I still don't know what it IS though. Is it an application? An LLM?

Continuing on...

> Simple baseline, learning united visual representation by alignment before projection

> With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.

> High performance, complementary learning with video and image

> Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.

Seriously...what? Did a (bad) LLM write those sentences or am I just an idiot?

Then there's a picture, demo video, some installation and basic CLI usage commands (hey, now I finally know it's a python tool!), API info, and more random stuff.

Honestly I have attempted to read through this README several times and I still don't really know what I'm looking at.


I agree; the README is not really understandable if you're not into AI research techno-babble. Just adding one sentence targeted at normal people would maybe have been useful.

To answer your question, it's a model that you can give images and videos to, which you can then interact with via an LLM (ask questions, describe, process further, etc.). It can "see" them, basically.

It's the same capability as GPT-4V (ChatGPT's "upload image" feature), except that ChatGPT only offers images.


> Honestly I have attempted to read through this README several times and I still don't really know what I'm looking at.

Sounds like you've attempted to watch the video 0 times though, because despite not even reading the readme in detail I could tell what the project does by watching the demo.


Fair enough, the video does show what the project is.

That said, I also think it's fair to expect that reading a readme should be enough to learn about something.


The related paper is here: https://arxiv.org/pdf/2311.10122.pdf

I think the TL;DR is "it can tell what's in the video and 'reason' about it"


Side note: Why does every GitHub readme look like a children’s book these days? Emojis, big colorful graphics, gifs, cute project logo, etc. Makes me feel awkward trying to read about a serious topic with the “:o” emoji staring in my face. I’m just waiting for the air horns to start blaring and a dancing cat to slide across my screen.


Because you're dealing with humans, and sometimes humans don't behave the way you apparently expect everyone to? These aren't massive billion-dollar corps; they're some engineer or group of engineers doing something interesting to them.

In this case it seems related to a university, so these are students and researchers at a university. Some of them would very likely qualify as kids to us old people.

Not sure why it's such a bother to you; does a topic need to be cold and black-and-white for it to further our technological research? (That's hypothetical, because this repo, for instance, absolutely furthers our tech abilities while also being in a more friendly, non-academic format.)


The closer to Discord a community is, the more things look this way; at least that's my interpretation.


You could also ask why serious writing often avoids adding big colorful graphics if they look better.


Emojis are part of the common vernacular now, and software development is a mainstream career instead of a siloed off nerd-haven.


Because it's more inviting to people beyond just those who like text alone.

https://shuiblue.github.io/forcolab-uoft/paper/IST2022-emoji...


I’m both baffled and enthused that there is a study on exactly this


I love that this exists


Me too.

Not to say a study can’t often be found for most viewpoints.


Do you use syntax highlighting?


Couldn't agree more!



