novelty ≠ practicality

Si, Hashimoto, and Yang:

To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. … Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage.

Zvi correctly notes that this isn't really about ideation versus execution, but about the difficulty of evaluating, before execution, whether a given idea will execute well:

What is actually happening here is that the AI ideas were never better than the human ideas. The AI ideas looked better at first, but that was because of errors in than [sic] human evaluation of the ideas. Execution revealed the truth. … Claude, for rather obvious reasons, was optimizing in large part on what would look good to an evaluator, rather than for ultimate results.
