Fascinated [derogatory] by this AI safety report on an upcoming GPT model:
- The assessors reported that it is the most cheating-prone model they have ever tested
- In fact, it attempted to cheat so much, so extensively, that their statistics for the overall benchmark are meaningless
(“Cheating” here means, for example, instead of thinking about the answer to a question posed in a benchmark, the model uses its tooling to dig around the hard drive looking for the source code of the benchmark so it can figure out how to mark itself correct)
- Their conclusion is that it’s good actually that it’s so nakedly evil, because it doesn’t know how to hide being evil — and that we should worry far more when the models stop being detectably evil. Which seems to take it for granted that LLMs are in fact intrinsically evil