What do you think happens to data when it’s scraped? Copying the data is a fundamental requirement for using it in training. These models are trained in big datacenters where the original work is split up and tokenized and used over and over again.
Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.
The difference between you training a model and you reading a book (put online by its author in clear text, to avoid the obvious issue of actual piracy for human use) is that you reading on a website is the intention of the copyright holder and you as a person have a fundamental right to remember things and be inspired.
Copyright holders can't say what I do with their work, nor what I do with the knowledge of their book. They can only say how I copy and distribute it. I don't need consent to burn an author's book, create fan art around it, or quote characters in my blog. I do need their consent to copy and distribute their works directly.
You don’t however have a right to copy and use the text for other purposes, whether that’s making a t-shirt with a memorable line, printing it out to give to someone else, or tokenizing it to train a computer algorithm.
And at some point the text is broken down so finely that the pieces become uncopyrightable. You can't copyright most phrases or words.
Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.
It very much is. You can't just run a cipher on a copyrighted work and say "it's not the same, so I didn't copy it". Tokenization is reversible to the original text. And "distributing" is separate from violating copyright. It's not distriburight, it's copyright. Copying a work without authorization for private use is still violating copyright.
You can’t just run a cipher on a copyrighted work and say “it’s not the same, so I didn’t copy it”.
Yes I can. I can download a Web page, encrypt it on my machine, and I'm not distributing said work.
And “distributing” is separate from violating copyright. It’s not distriburight, it’s copyright. Copying a work without authorization for private use is still violating copyright.
You absolutely do not know what you're talking about. This is just trivial copyright law, but there's a weird internet mythology that if you can access something on the net you can take it as long as you don't share it further. The reason the mass-sharers tended to get prosecuted is that they were easier and more valuable targets, not that the people they were sharing it with weren't also breaking the law.
Yes I can. I can download a Web page, encrypt it on my machine, and I'm not distributing said work.
That's just false.