AI Shows Skill in Coding, but Falls Short on Complex Tasks

Recent research indicates that while large language models (LLMs) demonstrate proficiency in specific areas like Python programming, they often struggle with the reliability needed for broader enterprise applications. A benchmark called DELEGATE-52, developed by Microsoft researchers, tested 19 LLMs across 52 professional domains—including coding, finance, and legal work—with complex multi-step tasks.

The study found that even top models like Gemini 3.1 Pro and GPT-4 introduce errors that can silently corrupt documents over time. After 20 interactions, these frontier models lost an average of 25% of document content, while degradation averaged roughly 50% across all tested models. The researchers noted that the impact varies significantly by domain, with Python being one of the few areas where current LLMs show near-ready performance.
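To put those figures in per-edit terms, the short Python sketch below assumes that the same fraction of content survives every interaction. That constant-rate model is an illustrative assumption, not something the study reports, and the function name is hypothetical.

```python
def per_step_retention(total_retained: float, steps: int) -> float:
    """Return the constant per-step retention r such that r ** steps == total_retained.

    Assumes (for illustration only) that every interaction preserves
    the same fraction of the document.
    """
    if not 0.0 < total_retained <= 1.0:
        raise ValueError("total_retained must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return total_retained ** (1.0 / steps)


# Figures quoted in the article: 25% lost (75% kept) and roughly 50% lost
# after 20 interactions.
frontier = per_step_retention(0.75, 20)
overall = per_step_retention(0.50, 20)

print(f"Frontier models: about {1 - frontier:.1%} lost per interaction")
print(f"All models:      about {1 - overall:.1%} lost per interaction")
```

Under that assumption, a 25% loss over 20 interactions corresponds to roughly 1.4% lost per interaction, and a 50% loss to roughly 3.4%. Either rate is small enough to go unnoticed in a single edit, which is consistent with the article's point that the corruption is silent.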

Experts emphasize that the results do not amount to a wholesale failure of enterprise AI; rather, they expose its current limitations. As Info-Tech Research Group analyst Brian Jackson points out, the findings suggest a need for more robust automation designs, perhaps using multiple specialized agents instead of relying on a single model for an entire workflow.

Greyhound Research’s Sanchit Vir Gogia echoed this sentiment: “The key takeaway is that AI isn’t yet trustworthy enough to be fully delegated with critical documents.” He explained that the issue goes beyond mere hallucinations; it’s about preserving the integrity of complex work products over repeated edits. This concern is particularly relevant in enterprise settings where accuracy and compliance are paramount.

Rather than abandoning AI altogether, organizations should focus on implementing mitigation strategies—like additional training, domain-specific fine-tuning, or layered approaches with human oversight—to ensure that AI enhances productivity without compromising data integrity.
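One way to picture a layered approach with human oversight is a simple check that runs after every AI edit and escalates to a person when too much of the original text has disappeared. The Python sketch below is a minimal illustration built on the standard library's difflib. The function names and the 90% threshold are hypothetical, and the approach is not taken from the study.

```python
from difflib import SequenceMatcher


def retained_fraction(before: str, after: str) -> float:
    """Fraction of the original's words that still appear, in order, in the edited text."""
    old_words = before.split()
    new_words = after.split()
    if not old_words:
        return 1.0  # nothing to lose in an empty document
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(old_words)


def needs_human_review(before: str, after: str, min_retained: float = 0.9) -> bool:
    """Flag an edit when less than min_retained of the original survives (hypothetical threshold)."""
    return retained_fraction(before, after) < min_retained


original = (
    "The tenant shall pay rent monthly. "
    "Late fees apply after five days. "
    "Either party may terminate with notice."
)
# Simulated AI edit that silently drops the late-fee clause.
edited = "The tenant shall pay rent monthly. Either party may terminate with notice."

print(f"Retained: {retained_fraction(original, edited):.0%}")
print("Escalate to a human reviewer:", needs_human_review(original, edited))
```

A check like this only measures how much text survived, not whether what remains is correct, and it will also flag edits that were meant to delete content. It is best read as a tripwire that routes suspicious edits to a reviewer, not as a substitute for that review.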