AI Reliability Gap in Payments Infrastructure Revealed by Massive Open Benchmark
Recent findings from a comprehensive evaluation of autonomous AI agents in real-world e-commerce environments highlight a significant performance gap between leading architectures and the broader automation landscape. The ECOM1 benchmark, jointly developed by payments infrastructure platform COLIBRIX ONE and technology innovation organization BitGN, analyzed over 1.6 million trials across more than 100 cities.
The results show that while top-performing code-driven systems achieved success rates approaching 95%, the average agent managed only 20.2%, roughly one successful task in every five attempts. This fragility underscores a critical challenge for financial institutions seeking to automate payments processes.
Key Findings from ECOM1:
- Significant performance divergence: Top performers approached 95% success, while the average agent achieved just 20.2%
- Complex scenarios pose the greatest challenges: Tasks involving customer pressure, policy updates, and security protocols saw low success rates, ranging from 15.6% to 21.1%
- Frontier models show promise when properly deployed: Coding agents like Codex CLI and Claude Code demonstrated high reliability when paired with robust safety rails
- Operational trust is the primary barrier: While technology has advanced, building confidence in autonomous systems remains a key hurdle for widespread adoption
The benchmark revealed that current AI systems often struggle with real-time complexity, failing to adapt when they encounter unexpected inputs or changing conditions. This highlights the need for specialized architectures that combine cognitive flexibility with deterministic safety measures.
According to BitGN founder Rinat Abdullin, “Achieving reliable, fully automated agentic commerce requires highly specific engineering, continuous testing, and an unyielding commitment to operational discipline.” The true test of a commerce agent isn’t just completing straightforward transactions, but maintaining alignment with evolving policies and customer needs when issues arise.
The ECOM1 dataset represents one of the first openly documented analyses of autonomous systems operating within live transaction frameworks—providing valuable insights for financial institutions seeking to navigate the future of automated payments.