AI Competence: Evaluating Bounce & Rotation Abilities

Artificial intelligence (AI) benchmarks are becoming increasingly unconventional. One recent curiosity in the AI community is how well reasoning models handle this prompt: “Write a Python script for a bouncing yellow ball inside a shape. Make the shape rotate slowly, and make sure the ball stays within the shape.”

Interestingly, not all models excel at this “bouncing ball in a rotating shape” benchmark. Users claim that DeepSeek’s freely available R1 model outperforms OpenAI’s o1 pro, which sits behind a $200-per-month subscription.

Notably, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro reportedly struggled to get the physics right, letting the ball escape the shape. In contrast, Google’s Gemini 2.0 Flash Thinking Experimental performed well, as did OpenAI’s older GPT-4o.

The question remains: why does it matter whether an AI can code a bouncing ball inside a rotating shape? Simulating a bouncing ball is a classic programming challenge because it requires a collision detection algorithm to recognize when objects, here the ball and the walls of the shape, interact. A poorly written algorithm can hurt performance or produce obvious physics mistakes.
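To make the collision-detection point concrete, here is a minimal, headless sketch of the core logic such a prompt demands. It is not taken from any of the models’ outputs; the hexagon, gravity value, spin rate, and the assumption of a perfectly elastic bounce that ignores the wall’s own motion are all illustrative choices.

```python
import math

# Illustrative constants (assumptions, not from the article)
NUM_SIDES = 6          # a regular hexagon
POLY_RADIUS = 200.0    # circumradius of the polygon
BALL_RADIUS = 10.0
GRAVITY = -400.0       # acceleration pulling the ball toward negative y
SPIN_RATE = 0.5        # polygon angular velocity, radians per second
DT = 1.0 / 240.0       # simulation time step, seconds

def polygon_vertices(angle):
    """Vertices of the regular polygon after rotating by `angle` radians."""
    return [
        (POLY_RADIUS * math.cos(angle + 2 * math.pi * i / NUM_SIDES),
         POLY_RADIUS * math.sin(angle + 2 * math.pi * i / NUM_SIDES))
        for i in range(NUM_SIDES)
    ]

def closest_point_on_segment(p, a, b):
    """Closest point to p on the segment from a to b."""
    ax, ay = a; bx, by = b; px, py = p
    abx, aby = bx - ax, by - ay
    t = ((px - ax) * abx + (py - ay) * aby) / (abx * abx + aby * aby)
    t = max(0.0, min(1.0, t))
    return ax + t * abx, ay + t * aby

def step(pos, vel, angle):
    """Advance the ball one time step and resolve collisions with every wall."""
    vx, vy = vel[0], vel[1] + GRAVITY * DT          # integrate gravity
    x, y = pos[0] + vx * DT, pos[1] + vy * DT       # integrate position
    verts = polygon_vertices(angle)
    for i in range(NUM_SIDES):
        a, b = verts[i], verts[(i + 1) % NUM_SIDES]
        cx, cy = closest_point_on_segment((x, y), a, b)
        dx, dy = x - cx, y - cy
        dist = math.hypot(dx, dy)
        if dist < BALL_RADIUS:                      # ball overlaps this wall
            nx, ny = dx / dist, dy / dist           # normal pointing back inside
            x, y = cx + nx * BALL_RADIUS, cy + ny * BALL_RADIUS  # push ball out
            dot = vx * nx + vy * ny
            if dot < 0:                             # moving into the wall: reflect
                vx, vy = vx - 2 * dot * nx, vy - 2 * dot * ny
    return (x, y), (vx, vy)

if __name__ == "__main__":
    pos, vel, angle = (0.0, 0.0), (120.0, 40.0), 0.0
    for _ in range(int(10.0 / DT)):                 # simulate 10 seconds
        pos, vel = step(pos, vel, angle)
        angle += SPIN_RATE * DT
    inside = math.hypot(*pos) <= POLY_RADIUS        # crude containment check
    print(f"final position ({pos[0]:.1f}, {pos[1]:.1f}) | still inside? {inside}")
```

Getting the reflection normal or the push-out step wrong is exactly the kind of error that lets the ball drift through a wall, which is the failure mode users reported for some models.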

Still, coding bouncing balls and spinning shapes exercises programming skill but does not amount to a rigorous AI benchmark. Even small variations in the prompt can change the outcome. That points to the larger problem of building useful measurement systems for AI models, a hard task given that it is often difficult to distinguish one model from another except through unusual and largely irrelevant benchmarks like this one.

While we await more fitting tests such as the ARC-AGI benchmark and Humanity’s Last Exam, let’s enjoy watching balls bounce around inside rotating shapes.

Original source: Read the full article on TechCrunch