- cross-posted to:
- technology@lemmy.zip
In my humble opinion, the most important aspect here is that it shouldn’t be possible to copyright AI-generated works. The models are trained on the collective body of human intellectual output, the entire public domain, if you will, and in turn whatever they produce should be public domain and available to everyone.
Certainly an AI company may charge for usage/distribution and generation of content to fund their endeavours, but that is about the limit of it, as I see it.
So, I agree with the EFF that we should not introduce some kind of new legal right to prohibit training on something just because it’s copyrighted. Nothing keeps a human from training themselves on content, and an AI shouldn’t be prohibited from doing so either.
However.
It is possible for a human to make a work that infringes existing copyrights, by producing a derivative work. Not every work inspired by something else will meet the legal bar for being derivative, but some do. And just as a human can do that, so too can AIs.
I have no problem with, say, an AI being able to emulate a style. But it’s possible for AIs today to produce works that do meet the bar for being derivative works. As things stand, I believe that would make the user of the AI liable. And yet, there’s not really a very good way for them to avoid that. That’s a legitimate point of complaint, I think, because it leads to people unwittingly making derivative works.
Existing generative AI systems don’t have a very good way of hinting to the user of the model whether a generated work is derivative.
However, I’d think that what we could do is operate something like a federal registry of images. For published, copyrighted works, we already have mandatory deposit with the Library of Congress.
If something akin to TinEye were funded by the government, it would be possible to maintain an archive of registered, copyrighted works. It would then be practical for someone who had just generated an image to check it against pre-existing registered images.
I don’t know how TinEye works today, but for this to work we’d probably need a way to recognize an image under a bunch of transformations: scale, rotation, color, etc. I’d assume some kind of feature recognition: maybe line detection, vectorizing the result, breaking the image up into a bunch of chunks, canonicalizing each chunk’s rotation based on its content, and then performing some kind of fuzzy hash on the lines.
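To make that concrete, here’s a minimal sketch of one common fuzzy-fingerprinting approach, a difference hash (dHash). This is not how TinEye actually works, and it only survives rescaling and recoloring, not rotation or cropping; the registry lookup at the end is hypothetical.

```python
# Minimal perceptual-hash sketch (dHash). Not TinEye's actual algorithm;
# just one common way to get a fingerprint that survives rescaling and
# recoloring (though not rotation or cropping).
from PIL import Image  # pip install Pillow

def dhash(path, hash_size=8):
    """Return a 64-bit difference hash of the image at `path`."""
    # Normalize away scale and color: shrink to (hash_size+1) x hash_size grayscale.
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes; small means 'probably the same image'."""
    return bin(a ^ b).count("1")

# Hypothetical usage against a registry of hashes of registered works:
# if any(hamming(dhash("generated.png"), h) <= 10 for h in registry_hashes):
#     print("possible match with a registered work - review before distributing")
```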
Then one could place an expectation that an LLM-generated work be fed through such a system before it is distributed. If it isn’t, and the work is derivative of a registered work, the presumption would be that the infringement was intentional (which IIRC entitles a rights holder to treble damages under US law). We don’t have a mathematical model today to determine whether one work is “derivative” of another, but we could make one, or at least something that gives an approximation and a warning.
I think that’s practical in most cases, both for holders of copyrighted images and for LLM users. It permits people to use LLMs to generate images for non-distributed use. It doesn’t create a legal minefield for an LLM user. It places no restrictions on model creators. It’s doable using something like existing technology. It permits a viewer of a generated image to verify that the image is not derivative.
You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).
You MIGHT even be able to do that while masking the data using hashing.
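As a sketch of what that masking could look like: the dataset holder could publish cryptographic hashes of the decoded pixel data rather than the images themselves, so exact copies can be checked without sharing the data. This only catches pixel-identical copies; the `published_digests` set is hypothetical.

```python
# Sketch of exact-match checking against hashed (masked) dataset entries.
# Only catches pixel-identical copies after decoding; any visual edit defeats it.
import hashlib
from PIL import Image  # pip install Pillow

def pixel_digest(path):
    """SHA-256 over decoded RGB pixels plus dimensions, so losslessly
    re-encoded copies still match but any visual change does not."""
    img = Image.open(path).convert("RGB")
    h = hashlib.sha256()
    h.update(f"{img.width}x{img.height}".encode())
    h.update(img.tobytes())
    return h.hexdigest()

# Hypothetical: `published_digests` is the set of hashes the dataset
# holder releases instead of the images themselves.
# if pixel_digest("generated.png") in published_digests:
#     print("exact copy of a training image")
```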
> You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).
True, but an “exact copy” almost certainly isn’t going to be what gets produced – and you can have a derivative work that isn’t an exact copy of the original, just something generated that looks a lot like part of the original. So you’d want a pretty good chance of finding a derivative work.
And that would mean that anyone who generates a model would need to provide access to their training corpus, which is gonna be huge – the models, which themselves are large, are a tiny fraction of the size of the training set – and I’m sure that some people generating models aren’t gonna want to provide all of their training corpus.
MinHash might be able to produce a similarity metric without needing exact matches and without revealing the training data.
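A minimal from-scratch MinHash sketch, assuming each image has already been reduced to a set of discrete features (e.g., the fuzzy chunk hashes mentioned earlier); that feature-extraction step is not shown, and the names here are illustrative.

```python
# Minimal MinHash sketch: estimate Jaccard similarity between two feature
# sets from short signatures, without exchanging the sets themselves.
import hashlib

NUM_PERM = 128  # signature length; more permutations = better estimate

def _h(feature: str, seed: int) -> int:
    """One of NUM_PERM cheap 'hash permutations', built from SHA-1 plus a seed."""
    return int.from_bytes(hashlib.sha1(f"{seed}:{feature}".encode()).digest()[:8], "big")

def minhash_signature(features: set[str]) -> list[int]:
    """The signature is the minimum hash value per permutation."""
    return [min(_h(f, seed) for f in features) for seed in range(NUM_PERM)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical usage: the model trainer publishes signatures of training
# images; a generated image's signature is compared against them.
# if estimated_jaccard(minhash_signature(gen_features), published_sig) > 0.8:
#     print("suspiciously similar to a training image")
```

Signatures like this reveal far less than the raw data, though they are not a formal privacy guarantee.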
It is a minefield. It can generate an almost exact copy of some things if it’s overtrained on an image, or if the stars align just right.
On a different note, LLM means Large Language Model, not the image generator.
Some frequently repeated false premises, particularly on what AI is and does, but mostly correct conclusions on the effects of regulating it through copyright expansion, IMO.