If the source is literally a piracy website that serves up applications on how to remove DRM from ebooks, it's absolutely piracy. You can't just deny the source and be like "it's not piracy!" The data came into your hands illicitly, not legally, especially if the DRM was circumvented and removed before it reached you.
They didn't go out and buy copies of thousands of books.
Pretty amusing that you think scraping published data somehow constitutes surveillance, though.
I don't. I was making a point about how absurdly large the language models have to be: if they need that much data on top of thousands of pirated books, they fundamentally cannot make the models work without also scraping the internet for data, which is surveillance.
If the source is literally a piracy website that serves up applications on how to remove DRM from ebooks, it’s absolutely piracy. You can’t just deny the source and be like “it’s not piracy!”
They didn’t go out and buy copies of thousands of books.
And if they went to a library and scanned all the books?
I don’t. I was making a point about how absurdly large the language models have to be: if they need that much data on top of thousands of pirated books, they fundamentally cannot make the models work without also scraping the internet for data, which is surveillance.
I mean, it's just not surveillance, by definition. There's no observation, just data ingestion. You're deliberately trying to conflate the words to associate a negative behavior with LLM training to make your argument.
I really don't get why LLMs get everybody all riled up. People have been running Web crawlers since the dawn of the Web.
The AI literally observes the training data.
Insofar as my computer observes the data on my hard disk. But I suspect you know what I meant.