You have activated the Falsifiability trap card - LLMs as tutors = lol

V0ldek@awful.systems · edit-2 8 months ago

NSFW

pikasaurX4@lemm.ee · edit-2 8 months ago

This drives me up the wall. Any time I point this out, the AI fanboys are so quick to say “well, that’s v3.x. If you try on 4.x it’s actually much better.” Like, sure it is. These things are really good at sounding like they know what they’re talking about, but they will just lie. Especially any time numbers or math are involved. I’ve had a chat bot tell me things like 10+3=15. And like you pointed out, if you call it out, it always says “oh my bad” and then just lies some more or doubles down. It would be cool if they could be used to teach things, but I’ve tried it for learning the rules to games, but it will just lie and fill in important numbers with other, similar numbers and present it as completely factual. So if I ever used it for something I truly didn’t know about, I wouldn’t be able to trust anything it said

self@awful.systems · 8 months ago

just like with crypto, there’s already a long list of cliches that AI fanboys use to excuse how shitty their favorite technology is:

- that’s because you’re using GPT-3.5 Turbo. if you just pay an exorbitant amount for early access to GPT-5, you’ll see it does so much better (please ignore all previous claims of GPT-3 being revolutionary)
- the model doesn’t work as well as I think it used to, but I will still insist there’s no scaling problem or hidden human labor
- you’re prompting it wrong
- the LLM sucks because it’s being censored. please ignore that all of the uncensored models fucking suck too, when they’re not just ChatGPT with a spicy initial prompt
- multi-modal LLMs will fix this. wait no, multi-agent LLMs. fuck it, I’ll just link a bunch of research papers that read like press releases and OpenAI blog posts that are press releases
- making mistakes like a shitty computer program only makes the LLM more human-like, because my mental model for people is that they’re all shitty stupid computer programs that fuck up and lie all the time too

self@awful.systems · 8 months ago

speaking of chatgpt not knowing about games, please enjoy the classic that is chatgpt vs stockfish

froztbyte@awful.systems · 8 months ago

at multiple points I wanted to rewind that (garrrr, gifs) just to check on whether it did, in fact, just magically try to move a piece straight over another in an illegal move. amazing

Amoeba_Girl@awful.systems · edit-2 8 months ago

i think about this game all the time it’s so so good. the way the cheating escalates only for it to

spoiler

illegally move its king in front of the pawn and die

.

best game since murphy vs mr endon

swlabr@awful.systems · 8 months ago

> that’s because you’re using GPT-3.5 Turbo. if you just pay an exorbitant amount for early access to GPT-5, you’ll see it does so much better (please ignore all previous claims of GPT-3 being revolutionary)

business card scene from american psycho but it’s LLM variants

froztbyte@awful.systems · 8 months ago

openai blog post that elaborates on the press releases documenting the prevalence of bad research papers as a result of openai products

zogwarg@awful.systems · edit-2 8 months ago

Let’s not forget the:

Ah! PotemkinTurd-4.0 is getting worse! Like it’s starting to make all the same mistakes that PotemkinTurd-3.0 used to make! Honestly Poirot-2 is just as good now.

Cue an answer from PotemCorp:

We haven’t changed anything since the release of 4.0, but thanks, we’ll look into possible causes.

Like yes, those are big spaghetti monsters of RLHF and sad attempts at content filtering and/or removals of liability from PotemCorp, but isn’t a much more rational explanation that the product was never that good to begin with, fundamentally random, and that sometimes the shit sticks and sometimes it doesn’t?

swlabr@awful.systems · 8 months ago

What is great is that it only really starts approaching correct once you tell it to essentially copy paste from wikipedia.

Also, if some rando approached me on the street and showed me the wikipedia article for dijkstra’s and asked for me to help explain it, my first-ass instinct would be to check if there was a simple english version of the article, and go from there.

Disclaimer: I glossed over said SE article just now. It might not be a great explanation or even correct, but hey, it already exists and didn’t require 1.21 Jiggowatts to get there.

swlabr@awful.systems · 8 months ago

I mean would it be great if there were some way to freely generate explanations tailored to one’s individual experience? Absolutely, yes. Are LLMs that? Absolutely not. Have the cranks convinced themselves through a mashup of dunning-kruger and gell mann amnesia that LLMs are the first thing? Absolutely, yes.

V0ldek@awful.systems · 8 months ago

The description of the algorithm is correct, although I’m not sure how much easier it is to understand if you call everything “thing”. A graph is a really easy thing to explain, it’s circles and lines between them, you can just call it circles and lines, it’s okay. The pseudocode section completely changes the style out of nowhere, which is bizarre. But it doesn’t include any explanation really, it just presents the method, which ChatGPT was more-or-less able to do.

swlabr@awful.systems · 8 months ago

thanks for taking a look!

supersquirrel@sopuli.xyz · edit-2 8 months ago

edit I responded to this from an educational systems framing, not a single person using LLM/chatbot to try to understand something -framing, which was a bit awkward my bad…

I find the idea of LLMs for education really exciting… in a vacuum. In our current society where we pathologically seem unable to value human skills like teaching and the jobs of teachers in general, technology is going to be used as a cudgel to rationalize further divestment of resources from teachers and teaching. One only has to look at the educational “reform” program Bill Gates funded and pushed that warped the education system in the US for years and years that no teachers actually wanted and that received unwavering support from the general public and government because Smart Computer Guys are actually smarter than everyone else even in contexts that have nothing to do with computers… sigh

Beyond all of that I don’t really think LLMs are that useful when being prompted in a one on one conversation. There is just no way to tell how much you are being bullshitted. I do find that asking the same question to multiple LLMs on arena.lmsys.org does get me fairly quickly to technical answers however, since I can evaluate from a series of answers and cross reference (obviously you still need to google at the end of it to verify, and it is a fair point why you wouldn’t just do that in the first place).

I think in the far future (a positive vision of it) a good bit of education will be crafting questions and prompts for LLMs and then critically evaluating from a set of answers given from different LLMs/chatbots. The homework assignment could be evaluated based on how critically and intelligently a student compared several different LLM answers and triangulated an answer from it.

All that being said, LLMs are 1000% the next bitcoin, they are absolutely part of the enshittification of search engines and most of the people who are excited about them are insufferable…. but I can still step outside of that and see that there is an educational utility here, however even the act of focusing on the educational utility of LLMs in conversations about education is dangerous since it provides such a clear route for further cutting funding and resources for teachers.

Like who gives a shit about AI next to the fact that we treat human teachers like trash and give them no funding to do their jobs so they have to shell out from their own money to buy classroom supplies (???). The problem is we think education isn’t worth investing in and that teachers don’t have a professional skillset (they are just burger flippers but they pass out worksheets to kids instead of making fast food) that should be respected and nurtured.

In other words, computer people are so up their own ass they really are incapable of understanding even the basic practical problems of every day teaching that must be overcome. Those skills required to be an effective teacher (especially to determine what a human really needs to learn) are invisible to them, they are fuzzy soft skills that are at best nice to have and at worst an annoying set of behaviors and social expectations to memorize and perform. These people with this mindset will never ever be able to do anything but ruin education (again, see Bill Gates) with computers.

I am however interested in the long term to see how teachers and educators who also understand LLMs and chatbots will integrate these things into the process of teaching, but I am only interested if the computer isn’t treated as more important than the human connection between the teacher and student…

blakestacey@awful.systems · 8 months ago

Wikipedia’s coverage of math and science topics is… uneven, but that article looks to be on the decent side. It’s good enough that if you say you got absolutely nothing from it, I’d be inclined to blame your study skills before I blamed the article. And guess what? Pressing the lever to get nuggets of extruded math-substitute product will not help you develop those study skills.

self@awful.systems · 8 months ago

it was such a weird choice of an article from our esteemed guest, and that they considered the article particularly complex or hard to understand revealed so much about their quality as a supposed programming tutor (though the weird anti-math stuff also did them no favors)

like, is this really the most complex algorithm that the LLM could generate convincing bullshit for? or did their knowledge going in end here and so they didn’t even know how to ask the thing about slightly harder CS shit? I really hope their whole tutoring thing is them being an unhelpful reply guy like they were in our thread and they’re not going around teaching utterly wrong CS concepts to folks just looking to learn, but the outlook’s not too bright given how much CS woo has entered the discourse as a direct result of people regurgitating the false knowledge they’ve gotten from LLMs.

froztbyte@awful.systems · 8 months ago

I’m going to start trolling these dipshits with “I bet you don’t even understand something as basic as how BGP makes the internet work” and watch the bad takes fall out

for the viewers at home: BGP operates on (relatively) simple rules, but with hella emergent complexity with a loooooot of intermingled state and subjective truth from the point of view of each actor AS in participation. and table stakes on the internet is edge filters, MEDs, communities, RRs, (sometimes) RPKI, etc. and that’s before you get into the more esoteric shit you can do.

V0ldek@awful.systems · 8 months ago

> I bet you don’t even understand something as basic as how BGP makes the internet work

You’d bet correctly, I believe packets are moved around the wired network by gnomes and the wireless network by fairies. What I don’t do, however, is confidently tell students lies about a topic I don’t understand, which happens to be an AI chat’s job description.

AcausalRobotGod@awful.systems · 8 months ago

Look, it’s simple enough to fit on a couple napkins, how hard can it be?

froztbyte@awful.systems · 8 months ago

unironically have demonstrated it to someone in a bar using condiments and cutlery!

why yes I’m great at parties, why do you ask

Mike Knell@blat.at · 8 months ago

@froztbyte @AcausalRobotGod “So the tablecloth is the Internet routing mesh, right? Now imagine that this squeezy bottle of ketchup is a provider in Pakistan who just got ordered to block access to Twitter in the country but isn’t quite sure of the right syntax to blackhole traffic from a particular AS…”

AcausalRobotGod@awful.systems · 8 months ago

As long as there’s a ketchup stain somewhere, it’s canonical.

swlabr@awful.systems · 8 months ago

https://arstechnica.com/information-technology/2018/11/major-bgp-mishap-takes-down-google-as-traffic-improperly-travels-to-china/ BGP comes up time and time again, and like every time it’s google fucking up

froztbyte@awful.systems · 8 months ago

not exclusively google, tbh, and there’s far more of this kind of shit going on than tends to hit public awareness

relatively recently the ntt-cogent slapfight happened, impacting far more than I think most people would guess a priori

blakestacey@awful.systems · 8 months ago

AI: the amazing technology that makes computers bad at math and turns me into Miss Wormwood.

blakestacey@awful.systems · 8 months ago

One thing that would be a good addition to the recommended NSFW reading list would be an explanation of basic CS theory that’s good enough to forestall the worst of the woo, or at least give a reader who isn’t already a lost cause the tools to recognize the woo.

Lung@lemmy.world · 8 months ago

Nice work! Don’t see a lot of this, and it’s a common experience with LLMs today

I’d say they are ok for learning, but only for the simplest stuff. The syntax of a programming language you don’t know, and would be trivial to google. Basic info about cats. Some models are a little better than others, but it feels like throwing more hardware/data at it is no longer the correct answer, and breakthroughs are needed

Mainly the issue here is trust. You never know when LLMs switch from being a decent teacher to being a convincing liar. And that’s kinda the whole thing with teachers, you’re supposed to trust them. Just chatting with someone about a topic you’re both only casually familiar with is different

Generally LLMs fail spectacularly when it comes to popularity of ideas vs fundamentals of ideas. A single new publication in a physics journal could fundamentally change our perception of the universe, but the LLM is much more likely to describe a common viewpoint that it’s been trained on a lot. Even with the latest GPT it was very painful talking to it about black holes and holographic universe

V0ldek@awful.systems · edit-2 8 months ago

One thing I didn’t focus on but is important to keep in context is that the cost of a semi-competent undergrad TA is like couple k a month, a coffee machine, and some pizza; whereas the LLM industrial complex had to accelerate the climate apocalypse by several years to reach this level of incompetence.

Sam Altman personally choked an orca to death so that this thing could lie to me about graph theory.

Deborah@hachyderm.io · 8 months ago

deleted by creator

froztbyte@awful.systems · 8 months ago

> The syntax of a programming language you don’t know, and would be trivial to google

yeah so I tried that already, and it turns out these things are both dogshit and insidious in those cases - if I were a less informed user, I would’ve had bugs baked in at a deep level that would’ve taken hours to days to figure out down the line

> it feels like throwing more hardware/data at it is no longer the correct answer

it was never the correct answer

> Mainly the issue here is trust. You never know when LLMs switch from being a decent teacher to being a convincing liar. And that’s kinda the whole thing with teachers, you’re supposed to trust them

wat. I don’t really understand this point/mention - what did you mean to convey by bringing this up?

> You never know when LLMs switch from being a decent teacher to being a convincing liar.

well… no? 1) it’s always synthesising, there is no distinction between truth and falsehood. it is always creating something. some of it turns out to be factual (or factual enough) that you’re parsing as “oh it gave me true bits”, but that’s because your brainmeats are doing some real fancy shit on the fly because they’ve been tuned to deal with information filtering over a couple millennia. handy, right?

> but the LLM is much more likely to describe a common viewpoint that it’s been trained on a lot.

you mean the mass averaging machine is going to produce something that might be an average of the mass data? shock, horror!

why not just find a scientist to become friends with? it’s so much easier (and you get to enjoy them going fully human nerdy about really cool niche shit)

David Gerard@awful.systems · 8 months ago

delve

lol

V0ldek@awful.systems · 8 months ago

Is my delving not to your satisfaction?

David Gerard@awful.systems · 8 months ago

just with Paul Graham being himself

V0ldek@awful.systems · 8 months ago

God I hate this place

Atomic Orbitals@mstdn.social · 8 months ago

@dgerard @V0ldek By this line of “reasoning” _The Lord of the Rings_ was probably written by an AI as the largest town in the Shire was called “Michel Delving” or “great digging”.

Charlie Stross@wandering.shop · 8 months ago

@dgerard @techtakes I don’t read PG (life’s too short, and so is my sanity) but doesn’t this merely suggest the folks he exchanges email with have paltry, stunted vocabularies?

self@awful.systems · 8 months ago

for me it’s a clear indicator that you’re talking to a nethack fan mid-delve

Turun@feddit.de · edit-2 8 months ago

LLMs are pretty good at stuff that an untrained human can do as well. Algorithms and data structures are wayyy too specialized.

I recently asked gpt4 about semiconductor physics - not a chance, it simply does not know.

But for general topics it’s really good. For one reason that you simply glossed over - you can ask it specific questions and it will always be happy to answer.

> Okay, at least it’s not incorrect, there are no lies in this, although I would nitpick two things:
>
> It doesn’t state what the actual goal of the algorithm is. It says “fundamental method used in computer science for finding the shortest paths between nodes in a graph”, but that’s not precise; it finds the shortest paths from a node to all other nodes, whereas the wording could be taken to imply it’s between two nodes.
>
> “infinity (or a very large number)” is very weird without explanation. Dijkstra doesn’t work if you put “a very large number”, you have to make sure it’s larger than any possible path length (for example, sum of all weights of edges would work).

Those nitpicks are something you can ask it to clarify! Wikipedia doesn’t do that. If you are looking for something specific and it’s not in the Wikipedia article - tough luck, have fun digging through different articles or book excerpts to piece the missing pieces together.

The meme about stack overflow being rude to totally valid questions does not come from nothing. And ChatGPT is the perfect answer to that.

Edit: I’m late, but need to add that I can’t reproduce OP’s experience at all. Using GPT4 turbo preview, temperature 0.2, the AI correctly describes Dijkstra’s algorithm. (Distance from one node to all other nodes, picking the next node to process, initializing the nodes, etc).
To respond to one of the nitpicks I asked the AI what to do when my “distance” data type does not support infinity (a weak point of the answer that does not require me to know the actual bound to question the answer). It correctly told me a value larger than any possible path length is required.

It also correctly states that Dijkstra’s algorithm can’t find the longest path in a graph and that the problem is NP-hard for general graphs.

For negative weights it explains why Dijkstra doesn’t work (Dijkstra assumes once a node is marked as completed it has found its shortest distance to the start. This is no longer a valid assumption if edge weights can be negative) and recommends the Bellman-Ford algorithm instead. It also gives a short overview of the Bellman-Ford algorithm.
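For reference, the two nitpicks quoted above (single source to all nodes, and what "infinity" has to mean) can be made concrete with a minimal Python sketch. This is an illustration only, not output from any of the models discussed: the adjacency-list layout, the `example` graph, and the helper name `finite_infinity` are all assumptions made up here, and it assumes every neighbor also appears as a key in the graph and that all weights are non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from `source` to every node in `graph`.

    `graph` maps each node to a list of (neighbor, weight) pairs.
    Weights must be non-negative.
    """
    # Every node starts "infinitely" far away except the source.
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph[node]:
            candidate = d + weight
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


def finite_infinity(graph):
    """Hypothetical helper: a finite stand-in for infinity.

    The sum of all edge weights plus one is strictly larger than the
    length of any shortest path when weights are non-negative.
    """
    return sum(weight for edges in graph.values() for _, weight in edges) + 1


# Made-up example graph.
example = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}

print(dijkstra(example, "a"))
print(finite_infinity(example))
```

Note that the function returns a distance for every node, not for one target, and that an arbitrary "very large number" is only safe if it is at least as large as what `finite_infinity` computes for the graph at hand.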

V0ldek@awful.systems · 8 months ago

The issue with those nitpicks is that you need to already know about Dijkstra to pick up that something is fishy and ask for clarification.

I call bs on all of this, if anything my little experiment shows that sure, you can ask it for clarification (like giving a counterexample) and it will happily and gladly lie to you.

The fact that an LLM will very politely and confidently feed you complete falsehoods is precisely the problem. At least StackOverflow people don’t give you incorrect information out of rudeness.

froztbyte@awful.systems · 8 months ago

minor point of order: the format of the output (as in, the politely and confidently parts) is a style choice on the part of the companies pushing a particularly shaped product. it could just as easily be styled abrasively or in klingon or lojban or whatever

but the rest still applies

as a thought experiment: one possible patch/semi-improvement would be if these things got engineered to always provide data- and training sources concomitant with every bit of output it provides (thus giving provenance, possibly enabling reference checks, etc). it should be extremely obvious why this avenue of design not just doesn’t exist in existing implementations but also is likely to never see daylight

V0ldek@awful.systems · 8 months ago

Bing Chat tried to annotate claims with references to websites and the result was predictable, it says bullshit and then plugs a reference that doesn’t actually substantiate what it said.

froztbyte@awful.systems · 8 months ago

hmm, I might’ve missed that, will have to look into what it was

fwiw to complete out the dark matter of that idea: conceptually such source information would be pulled all the way through (think annotations + “linked lists” throughout latent spaces/layers(/whatever each model type uses internally)). naturally this would immensely increase the storage density and complexity of how to do any of these pitiful things in the first place, which is why no-one bothers (and also why there’s so much “we don’t actually know what happens inside the ${middlebit} state!” handwringing)

which is also yet another amazing little datapoint in how hilariously far these things are from any of the claimed capabilities…

Turun@feddit.de · 8 months ago

Ok, so I finally got to check this and I simply can’t reproduce your results at all.

Gpt4 turbo preview. Temperature 0.2

It answers all questions correctly. When pressed for details it did not lie to me, but instead correctly explained why Dijkstra can’t be used to find the longest path, and instead pointed out that this is an NP-hard problem. It also correctly stated that Dijkstra can’t be used for graphs with negative weights. It correctly suggested Bellman-Ford as an alternative to Dijkstra and knows their respective runtime complexities (for Dijkstra it differentiated between the og version and one with a Fibonacci heap). When I told it my data type for distances does not support infinity it correctly stated the bound to be “larger than any possible path length in your graph”.

My initial opinion was that you simply should not use a tool for something it can’t do. I assumed that GPT is simply not knowledgeable enough to answer such domain specific questions.
I have now changed my opinion. I don’t know what your version of GPT is, but GPT4 turbo preview with a temperature of 0.2 answers all the questions in your post correctly. Therefore I think GPT can be a good teacher for even domain-specific problems if they are sufficiently entry level (but still domain-specific, which is impressive!)
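For readers who want to check the negative-weights claim themselves rather than take any chatbot's word for it, here is a minimal Python sketch of Bellman-Ford. It is an illustration under assumptions: the edge-list representation and the three-node example graph are made up here and do not come from the thread.

```python
def bellman_ford(nodes, edges, source):
    """Shortest distances from `source`, allowing negative edge weights.

    `edges` is a list of (u, v, weight) triples. Raises ValueError if a
    negative cycle is reachable from `source`.
    """
    dist = {node: float("inf") for node in nodes}
    dist[source] = 0
    # Relax every edge up to len(nodes) - 1 times.
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # nothing improved, distances are final
    # One more pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from the source")
    return dist


# Made-up example: the cheapest way to "b" goes through the negative edge.
nodes = ["a", "b", "c"]
edges = [("a", "b", 4), ("a", "c", 5), ("c", "b", -3)]

print(bellman_ford(nodes, edges, "a"))
```

On this graph a textbook Dijkstra that marks `b` as completed at distance 4 never revisits it, so it misses the cheaper route through `c` (5 - 3 = 2), which is the invalid assumption described above.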

V0ldek@awful.systems · 8 months ago

I used OpenAI Chat as you can probably see on the screenshots. I’m definitely not paying them a dime for Spicy Autocomplete TURBO 9000 or whatever the fuck.

If you want to “debunk” this post then go ahead and post your own screenshots and analysis.

Turun@feddit.de · edit-2 8 months ago

I’m not here to sell you something. In fact, the reason it took so long for me to reply, was because I only have access to ChatGPT at work and had to wait until I had free time there. I’m not paying closed AI any money either, but despite that I can accept that their flagship product is actually really good.

I am criticising that your post is based on a mediocre model (which version and temperature did you use?), but written as if it were representative of the whole field. And if I’m being honest I’m kinda salty that I was downvoted based on examples from such a meh model.

Llama 3 was released a few days ago. On ai.nvidia.com you can test out different models, including the new 8B and 70B versions. I only did a quick check but even llama 3 8B beats the examples you gave here.

V0ldek@awful.systems · 8 months ago

> I am criticising that your post is based on a mediocre model (which version and temperature did you use?)

Mate, I just went to ChatGPT and asked it a question. It’s around 2°C outside if that matters.

Examine the trail of events that leads us here. A guy in a post says you can use LLMs to explain Wikipedia articles. I go to the most popular LLM site and ask it to do so. Now you barge in and tell me that’s a “mediocre model with wrong temperature”. I don’t think that’s in any way relevant to the claim? Like, if the original comment said “search engines can do X”, I went to Google and found out it cannot actually do X, and you came to tell me Google is shit, actually, you should use a real search engine that you have to pay for (or get a license from work).

I would be more sympathetic to your comments if you at least provided screenshots that show those way better responses. It’d at least be theoretically (as in academically) interesting.

David Gerard@awful.systems · 8 months ago

sorry, you’ve exhausted your goalpost movement quota, have fun!

Deborah@hachyderm.io · 8 months ago

> “Ok, so I finally got to check this and I simply can’t reproduce your results at all.”

/me stands on one leg

Software that doesn’t have reproducible results is buggy software. The rest is commentary.

Turun@feddit.de · 8 months ago

OP: buys shoes for one dollar “man, this footwear thing is absolute dog shit, I don’t know why anyone would ever use them”

Anyone else: buys shoes that are actually good

Alternative comment

People who develop random number generators: guess I’ll just die then

gerikson@awful.systems · 8 months ago

> Those nitpicks are something you can ask it to clarify! Wikipedia doesn’t do that.

This is patently unfair with regards to the Dijkstra’s algo article. It has multiple sections with both a discussion of the algorithm in text, a section with pseudocode, and a couple of animations. Now I’ve had to internalize the algo because of Advent of Code, but I found the wiki article quite helpful in that regard.

Does it require more patience than just asking an LLM? Maybe. But it will reward you more.

I’d love an actual example of an obtuse Wiki article where an LLM does better. I doubt it really exists, because training an LLM involves… reading Wikipedia, and following the examples, and modelling an output from that.

It also provides an example that these models don’t think. You’d expect a precursor to an AGI to be able to understand math. It’s a huge body of work, it’s (mostly) internally consistent, and it would be a huge boon both for math tyros and pros if it existed. Instead LLMs have statistical “knowledge” of math, nothing more.

blakestacey@awful.systems · edit-2 8 months ago

> Those nitpicks are something you can ask it to clarify! Wikipedia doesn’t do that.

https://en.wikipedia.org/wiki/Wikipedia:Reference_desk

> Those nitpicks are something you can ask it to clarify! Wikipedia doesn’t do that. If you are looking for something specific and it’s not in the Wikipedia article - tough luck, have fun digging through different articles or book excerpts to piece the missing pieces together.

Or, as we called it in my day, studying.

Turun@feddit.de · 8 months ago

Yes, but imagine if we gave kids the ability to ask questions instead of leaving them with books after they are able to read - wait, we actually do that. It’s called teaching or private tutoring, and it’s orders of magnitude better at conveying knowledge than scraping together information from different sources.
