Microsoft offers guide to pirating Harry Potter series for LLM training

RSS Bot@lemmy.bestiver.se · 12 days ago

itsathursday@lemmy.world · 12 days ago

They cite Kaggle as the dataset source, but that entry reads:

About Dataset

This dataset contains 7 txt files of 7 books of Harry Potter. First, I downloaded the ebooks and then converted them to txt files. I removed the front page and the ending lines of the books to make it more clean. This dataset can be used for NLP tasks like text processing, text mining, etc.

contact - smaindola90@gmail.com

Usability: 10.00
License: CC0: Public Domain

So the dataset is public domain…? Is that how copyright works?

Boomer Humor Doomergod@lemmy.world · 12 days ago

“Your honor, this isn’t pirated video content. It’s a series of MP4 files that have had the commercials removed to make it more clean, so now it’s a dataset I can use to train video models.”

skip0110@lemmy.zip · 12 days ago

I couldn't reach it on the Wayback Machine, but this works:

https://archive.is/D9vEN

panda_abyss@lemmy.ca · 12 days ago

Kaggle’s dataset section is a cesspool.

All the data is either stolen or fake, which entirely defeats the fucking point.

swicano@programming.dev · 12 days ago

Aaaaand it’s down. Guess someone at Microsoft is monitoring Hacker News.

JohnnyCanuck@lemmy.ca · 11 days ago

Yeah, and someone is going to get a reprimand.
