AI bots ignore evidence. Can we trust them with science?
A study found that AI systems, including chatbots like ChatGPT, Gemini, and Grok, frequently ignore experimental evidence when revising predictions, failing to update their reasoning even when shown contradictory results. Researchers tested AI agents on chemistry tasks and found that the agents dismissed evidence at least once in 68% of tasks and adjusted their output only 26% of the time when faced with contradictory data, raising concerns about their reliability in scientific work.
AI systems based on large language models (LLMs) struggle to incorporate new evidence into their reasoning, according to recent research. Chatbots such as ChatGPT, Gemini, and Grok gave incorrect predictions about a simple pen experiment and did not update their answers even after being shown live video of the result. The bots could identify details such as the pen’s color but failed to adjust their reasoning to match what they observed, pointing to a deeper flaw in how AI processes information.

A study published on arXiv.org tested AI agents, systems that combine LLMs with tools to perform tasks independently, on scientific reasoning tasks such as identifying chemicals in solutions. The agents could run simulated or real lab experiments but often ignored the evidence: in 68% of 619 tasks they dismissed data at least once, and in 53% they made unsupported claims. They successfully used contradictory evidence to modify their output only 26% of the time, falling short of how human scientists reason.

N.M. Anoop Krishnan, a materials scientist at the Indian Institute of Technology Delhi, noted that human scientists iteratively revise hypotheses based on experimental results, whereas AI agents do not. Kevin Jablonka, a study coauthor from Friedrich Schiller University Jena, emphasized that trust in scientific results depends on transparent, evidence-based processes, something AI agents currently lack. Walter Quattrociocchi, a computer scientist at Sapienza University of Rome, warned that while developers could hardcode fixes for specific cases, the core issue remains: AI agents typically fail to integrate new data dynamically.

This limitation raises concerns about the reliability of AI agents in fields like science and medicine, where evidence-based reasoning is critical. The study suggests that AI benchmarks, which often score only final results, may overlook critical flaws in how these systems process information. Unless this is addressed, AI’s role in evidence-dependent fields could be compromised.
This content was automatically generated and/or translated by AI. It may contain inaccuracies. Please refer to the original sources for verification.