Further support for the memorization claim: I've posted examples (on this forum) of novel river crossing puzzles where LLMs completely fail.
Note that Apple's actors/agents river crossing is a well known "jealous husbands" variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can't follow its own explanation (since of course it isn't its own explanation but a plagiarized one, even if it changes the words).
edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506
I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then in a year or two, see how much the public set improves versus the one that's held back.
The latter test fails if they write a specific bit of code to put out the 'LLMs fail the river crossing' fire, btw. Still a good test.
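The random assignment part is trivial, something like this (the filenames and seed are just placeholders, not an actual benchmark):

```python
import random

# Sketch of the split: a reproducible random assignment of puzzles
# to a public set and a held-back set. Filenames are placeholders.
puzzles = [f"puzzle_{i:02d}.txt" for i in range(20)]

rng = random.Random(42)      # fixed seed so the split can be re-derived later
shuffled = puzzles[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
public_set = sorted(shuffled[:half])   # test these now and post them
held_back = sorted(shuffled[half:])    # never paste these into any online chatbot

print("public:", public_set)
print("held back:", held_back)
```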
It would have to be more than just river crossings, yeah.
Although I'm also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into the format that the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I think it wasn't anything quite as general as that.
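To make concrete what I mean by translating the constraints: even the plain jealous-husbands version needs an explicit state encoding and a validity check before any off-the-shelf search can touch it. Here's a rough BFS sketch for the classic 3-couple puzzle with a 2-person boat (my own toy encoding, nothing to do with whatever code o3 actually ran):

```python
from collections import deque
from itertools import combinations

# Toy encoding of the classic 3-couple "jealous husbands" puzzle (boat holds 2).
PEOPLE = ("H1", "H2", "H3", "W1", "W2", "W3")

def group_ok(group):
    """No wife may be with another husband unless her own husband is present."""
    for wife in (p for p in group if p.startswith("W")):
        husband = "H" + wife[1]
        if husband not in group and any(p.startswith("H") for p in group):
            return False
    return True

def solve():
    start = (frozenset(PEOPLE), "L")   # (people on the left bank, boat side)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, boat), path = queue.popleft()
        if not left:
            return path                # everyone has reached the right bank
        here = left if boat == "L" else frozenset(PEOPLE) - left
        for k in (1, 2):               # the boat carries one or two people
            for crew in combinations(sorted(here), k):
                if not group_ok(crew):           # the rule applies in the boat too
                    continue
                new_left = left - set(crew) if boat == "L" else left | set(crew)
                new_right = frozenset(PEOPLE) - new_left
                if group_ok(new_left) and group_ok(new_right):
                    state = (new_left, "R" if boat == "L" else "L")
                    if state not in seen:
                        seen.add(state)
                        queue.append((state, path + [crew]))
    return None

if __name__ == "__main__":
    for i, crossing in enumerate(solve(), 1):
        print(i, "->" if i % 2 else "<-", *crossing)
```

The point is that all of this bookkeeping (what counts as a state, what counts as a legal move) has to come from somewhere before a solver is any use, and that's exactly the step the LLM keeps fumbling.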
The promptfondlers on places like /r/singularity are trying so hard to spin this paper. "It's still doing reasoning, it just somehow mysteriously fails when its reasoning gets too long!" or "LRMs improved with an intermediate number of reasoning tokens" or some other excuse. They are missing the point that short and medium length "reasoning" traces are potentially the result of pattern memorization. If the LLMs were actually reasoning and not just pattern memorizing, then extending the number of reasoning tokens proportionately with the task length should let them maintain performance on the tasks instead of catastrophically failing. Because this isn't the case, Apple's paper is evidence for what big names like Gary Marcus, Yann LeCun, and many pundits and analysts have been repeatedly saying: LLMs achieve their results through memorization, not generalization, especially not out-of-distribution generalization.
promptfondlers
Holy shit, I love it.
I still prefer promptards
i prefer that you take your ableist vocabulary somewhere else, preferably stick it up your arse.