Fighting bots is fighting humans.
One advantage of working on freely licensed projects for over a decade is that I was forced to grapple with this decision long before mass scraping for AI training.
In my personal view, option 1 is almost strictly better. Option 2 is never as simple as "only allow actual human beings access" because determining who's a human is hard. In practice, it means putting a barrier in front of the website that makes it harder for EVERYONE to access it: gathering personal data, CAPTCHAs, paywalls, etc.
http://mollywhite.net/micro/entry/fighting-bots-is-fighting-humans
@molly0xfff I like Jeremy Keith’s 1.5 option of acknowledging, not accepting, via poisoning https://adactio.com/journal/21210
@vonExplaino i would never 😉
@molly0xfff Isn't there a possibility of option 1.1? Keep things open but have SOMETHING in place to keep the abuse, at least moderately, in check?
@molly0xfff Every time I do a bad job completing a CAPTCHA nowadays I'm afraid that I just caused a future autonomous vehicle to mow down a cyclist because I forgot to identify one of the pictures with a bicycle.
@molly0xfff Ironically enough that link is also 503ing for me, like the other microblog link the other day.
@molly0xfff
I think it's time to return to printed and mailed newsletters.
@molly0xfff For the time being, I’ve set pretty strict throttling limits. If you’re trying to access 60 pages in a minute, you’re a bot. And a badly behaved one.
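(Illustrative aside: a minimal sketch of the "60 pages in a minute" throttle described above, assuming an in-memory sliding window keyed on client IP. The threshold is taken from the reply; the storage and keying choices are assumptions, not the poster's actual setup.)

```python
# Minimal sketch of a per-client rate limit: more than 60 page requests
# in a rolling 60-second window gets treated as a bot and refused.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 60  # "60 pages in a minute"

# Recent request timestamps per client (hypothetical in-memory store).
_history: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str, now: float = None) -> bool:
    """Return True if this client is under the limit, False if throttled."""
    now = time.monotonic() if now is None else now
    window = _history[client_ip]
    # Drop timestamps that have aged out of the rolling window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # over the limit: serve a 429 instead of the page
    window.append(now)
    return True

# Example: the 61st request inside one minute is rejected.
if __name__ == "__main__":
    results = [allow_request("203.0.113.7", now=i * 0.5) for i in range(61)]
    print(results.count(True), "allowed,", results.count(False), "throttled")
```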
@molly0xfff Even if we generalize option 2 to "Make everything private and only allow X access to our content" the hard part is still X. Paying subscribers? People with an invite from an existing member? People I have met at an in person meetup? All of these are viable in their own context but none are simple and all of them restrict the author's reach.
@molly0xfff I’ve got a blog post in the works about this but there’s a third option we should be considering:
Making it more expensive to download and parse by rejecting the corporate-friendly fascism of minimalism and plaintext and re-embracing the creative possibilities of the multimedia web.
If a blog post were the same rough size as a YouTube video and required a scraper to understand a complex CSS layout and rich interactive content, the scraping difficulty and the possibilities for unique, luxurious creation go through the roof in a way that cannot be replicated or co-opted by corps. And if every website was 1000x bigger, the scraping costs go up at least 1000x.
I say it’s time to discard the markdown web and for the dawn of the indie baroque.
@leon as with trying to block bots, there are tradeoffs. in your case, people with limited bandwidth or low-end devices might also suffer
@molly0xfff with YouTube and TikTok so massive, I don’t believe we need to be anywhere near as concerned with bandwidth as we were when flash intros stalked the earth.
And if a blog post isn’t worth a 20 second load, is it worth a 30 minute read?
@molly0xfff @leon ...if your multimedia website is hard for bots to parse, it will also be hard for the accessibility tools that, for instance, vision-impaired users rely on.
@scottjenson sure. i'm not saying everyone should, say, drop DDoS protection.
but "only allow humans to access" is just not a feasible metric — you will ALWAYS let bots through and prevent humans, and you need to decide where you want to set the cutoff.
@molly0xfff Oh completely agree! Didn't mean to take away from your main point.