Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

Guides have aided humanity throughout history. Prehistoric civilizations understood that the sun and the moon could be used to navigate vast distances on land and the high seas. Over time, repeated journeys led to the production of maps, enabling better planning and faster travel to familiar destinations. Centuries later, the introduction of the compass enabled seafarers to achieve greater accuracy in seeking unexplored destinations. And today, GPS navigation apps guide our every journey. In today’s world of agentic AI, AI agents have the potential to enable scalable AI adoption, transforming industries as we know them. However, an intelligent guide, agent logic, is needed to realize this potential by driving high agent quality, cost-effectiveness, and the resulting end-user trust.

Enterprise Workflows & Use Cases

Numerous studies have cited the overwhelming failure rate of AI pilots, while others have highlighted the need for AI to operate at the core of enterprise workflows to enable scalable adoption. [1] [2] To better understand this phenomenon and the associated assertion, some analysis of enterprise workflows is required. These workflows:

A. Are dynamic and long-running
B. Involve a plethora of APIs, databases and services
C. Are often constrained by business policies and/or regulations

For an agent to function effectively given these characteristics, it naturally demands an expanded model context, which state-of-the-art frontier LLMs certainly possess, but at what cost? Increased hallucinations and token consumption? Further, can LLMs be equipped with an intelligent guide, a GPS of sorts, that enables agentic AI execution at the core of the workflow and drives more desirable outcomes? We tested these hypotheses by designing and building agents, equipped with pertinent agent logic, for IBM offerings, fully considering the above characteristics. These offerings address some of the most challenging tasks confronting the subject matter experts who own various stages of the enterprise software delivery lifecycle for mission-critical workloads, including:

1. Understanding applications written in legacy code (COBOL / PL/I)
2. Expediting test generation for developers
3. Proactively responding to incidents and enabling shift-left app resiliency
4. Automating compliance modernization for critical environments

Before examining each of these domains in detail, let us define what characterizes agent logic. Agent logic consists of software primitives, such as knowledge graphs, algorithms and program analysis libraries, which operate at the agentic layer (within an agent harness) and can intentionally steer the LLM in the direction of the enterprise workflow, reducing the context space. In so doing, they tend strongly to drive more performant outcomes in a more cost-effective manner. Let us now examine how agent logic achieves such outcomes in each of the above four domains.
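As a rough illustration of this definition, the sketch below shows a toy agent harness in which a pluggable selector (standing in for agent logic) decides what context reaches the model. Everything here is hypothetical: the `Harness` class, the keyword-based selector, the sample corpus and the stub LLM are assumptions made for illustration and do not describe any IBM product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A selector is the "agent logic": given a task and all available context,
# it returns the keys of the context entries worth sending to the model.
Selector = Callable[[str, Dict[str, str]], List[str]]


@dataclass
class Harness:
    corpus: Dict[str, str]          # all context the enterprise could supply
    select: Selector                # agent logic that narrows the context
    llm: Callable[[str], str]       # any text-in, text-out model client

    def build_prompt(self, task: str) -> str:
        keys = self.select(task, self.corpus)
        context = "\n".join(f"[{k}] {self.corpus[k]}" for k in keys)
        return f"Context:\n{context}\n\nTask: {task}"

    def run(self, task: str) -> str:
        return self.llm(self.build_prompt(task))


def send_everything(task: str, corpus: Dict[str, str]) -> List[str]:
    """Baseline with no agent logic: the whole corpus goes into the prompt."""
    return list(corpus)


def keyword_selector(task: str, corpus: Dict[str, str]) -> List[str]:
    """Toy agent logic: keep only entries whose key is named in the task."""
    words = {w.strip(".,?!").lower() for w in task.split()}
    return [k for k in corpus if k.lower() in words]


def stub_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs without any external service."""
    return f"(stub reply to a {len(prompt)}-character prompt)"


# Hypothetical sample data.
CORPUS = {
    "billing": "Nightly batch job that posts invoices to the ledger.",
    "payroll": "Monthly job that computes salaries and taxes.",
    "inventory": "Service that tracks stock levels per warehouse.",
}
TASK = "Why did the billing job fail last night?"

naive = Harness(CORPUS, send_everything, stub_llm)
guided = Harness(CORPUS, keyword_selector, stub_llm)
```

Comparing `len(naive.build_prompt(TASK))` with `len(guided.build_prompt(TASK))` shows the context reduction. A real selector would be a knowledge graph query or a program analysis pass rather than a keyword match.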

Understanding applications written in legacy code (COBOL / PL/I) - program analysis. [3]

IBM watsonx Code Assistant for Z (WCA4Z), used to accelerate mainframe application development and modernization with AI and automation, is equipped with an App Insights agent for application understanding, one of the primary focus areas of enterprise clients running mission-critical workloads on IBM mainframes. This agent leverages deep static analysis across the application and stores a pre-indexed representation in a database schema that spans hundreds of interrelated tables with complex semantics. The agent can therefore retrieve precise, structured information that is already available, which improves answer accuracy, reduces token usage, and minimizes back-and-forth interactions with the language model (Mistral Medium 250B in this instance). When applied to multiple mission-critical legacy systems (up to 1M lines of code and 1K programs), this approach maintains marginally superior app understanding performance with ~30× lower token consumption than a baseline frontier LLM-only approach.
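The idea of answering from a pre-indexed representation rather than from raw source can be sketched as follows. This is a minimal illustration under stated assumptions, not the WCA4Z schema: the three tables, the program names (`PAYMAIN`, `PAYCALC`, `TAXRATE`, `RPTGEN`) and the dataset `EMPMAST` are all invented, and a real index would be produced by static analysis rather than inserted by hand.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create a tiny, hypothetical index of static-analysis facts."""
    conn.executescript(
        """
        CREATE TABLE program (name TEXT PRIMARY KEY, language TEXT);
        CREATE TABLE call_edge (caller TEXT, callee TEXT);
        CREATE TABLE data_access (program TEXT, dataset TEXT, mode TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO program VALUES (?, ?)",
        [("PAYMAIN", "COBOL"), ("PAYCALC", "COBOL"),
         ("TAXRATE", "PL/I"), ("RPTGEN", "COBOL")],
    )
    conn.executemany(
        "INSERT INTO call_edge VALUES (?, ?)",
        [("PAYMAIN", "PAYCALC"), ("PAYCALC", "TAXRATE"), ("PAYMAIN", "RPTGEN")],
    )
    conn.executemany(
        "INSERT INTO data_access VALUES (?, ?, ?)",
        [("PAYCALC", "EMPMAST", "write"), ("RPTGEN", "EMPMAST", "read")],
    )


def callers_of(conn: sqlite3.Connection, program: str) -> list:
    rows = conn.execute(
        "SELECT caller FROM call_edge WHERE callee = ? ORDER BY caller",
        (program,),
    ).fetchall()
    return [r[0] for r in rows]


def writers_of(conn: sqlite3.Connection, dataset: str) -> list:
    rows = conn.execute(
        "SELECT program FROM data_access "
        "WHERE dataset = ? AND mode = 'write' ORDER BY program",
        (dataset,),
    ).fetchall()
    return [r[0] for r in rows]


def facts_for_prompt(conn: sqlite3.Connection, program: str) -> str:
    """Return a few structured lines instead of the program's source text."""
    callees = [r[0] for r in conn.execute(
        "SELECT callee FROM call_edge WHERE caller = ? ORDER BY callee",
        (program,),
    )]
    return (
        f"program={program}; "
        f"called_by={callers_of(conn, program)}; "
        f"calls={callees}"
    )


conn = sqlite3.connect(":memory:")
build_index(conn)
```

A question such as "what calls `TAXRATE`?" is then answered by `callers_of(conn, "TAXRATE")`, and only the short string from `facts_for_prompt` needs to reach the language model, which is where the token savings in this kind of design would come from.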

Expediting test generation for developers with Aster - program analysis. [4], [5]

Aster is an IBM proprietary library, based on program analysis and data pre- and post-processing, used for agent-based generation of unit, integration, API and change-based tests. In evaluations across multiple developer communities, it achieves higher developer ratings than various open-source tools or developer-written tests. On open-source applications, it also shows superior line, branch and method coverage compared with similar open-source tools (integration tests) and with zero-shot LLMs and coding agents (unit tests). Based on these results, we have been running Aster in pre-production mode on 75+ Java IBM CIO applications (up to 560+ classes and 67K+ lines of code) with the Devstral 24B model. Steady-state results to date show a 20% to 45% improvement in line, branch and method coverage, along with superior performance on a subset of these apps compared with a state-of-the-art coding agent at up to 15× lower token consumption. The rationale for these results is that the program analysis output (used to prompt and “focus” the LLM), coupled with sub-agents for augmenting coverage and remediating runtime and compilation errors, enables a more performant outcome with significant cost reduction.
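To make the "focus the LLM with program analysis output" step concrete, here is a small sketch that extracts a few structural facts from source code and turns them into a compact test-generation prompt. It is not Aster: Aster is proprietary and the deployment above targets Java, whereas this sketch uses Python's standard `ast` module on a made-up function (`shipping_fee`) purely as a stand-in.

```python
import ast

# Hypothetical code under test.
SOURCE = '''
def shipping_fee(weight, express=False):
    if weight <= 0:
        raise ValueError("weight must be positive")
    fee = 5.0 if weight < 2 else 9.0
    if express:
        fee *= 2
    return fee
'''


def analyze(source: str) -> list:
    """Collect per-function facts that a test generator would care about."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        inner = list(ast.walk(node))
        # Count decision points: `if` statements and conditional expressions.
        branches = sum(isinstance(n, (ast.If, ast.IfExp)) for n in inner)
        # Record exception types raised via a constructor call.
        raises = sorted({
            ast.unparse(n.exc.func)
            for n in inner
            if isinstance(n, ast.Raise) and isinstance(n.exc, ast.Call)
        })
        facts.append({
            "name": node.name,
            "params": [a.arg for a in node.args.args],
            "branches": branches,
            "raises": raises,
        })
    return facts


def build_prompt(facts: list) -> str:
    """Turn analysis facts into a short, focused instruction for an LLM."""
    lines = ["Write unit tests for the following functions."]
    for f in facts:
        lines.append(
            f"- {f['name']}({', '.join(f['params'])}): "
            f"{f['branches']} decision points to cover; "
            f"may raise {', '.join(f['raises']) or 'nothing'}"
        )
    return "\n".join(lines)


FACTS = analyze(SOURCE)
PROMPT = build_prompt(FACTS)
```

In a fuller pipeline, the generated tests would then be compiled and run, and the errors and uncovered branches fed back to sub-agents, as the paragraph above describes; that loop is omitted here.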

Proactively responding to incidents and enabling shift-left app resiliency - knowledge graphs, program analysis libraries and investigation (observability)-driven orchestration. [6], [7]

While the LLM context for the app-related use cases described in the first two domains above is “restricted” to the app source code, for runtime management of apps on deployed infrastructure the underlying full IT stack comes into play. Here we define a knowledge graph (KG) encompassing entities (microservices, database/middleware services, MELT (metrics, events, logs and traces) data, etc.) coupled with embedded (“tribal”) knowledge from domain experts. With such a graph, and by bounding the LLM to local reasoning for non-deterministic outcomes, an observability-driven approach is used to achieve a reduced context space spanning the IT stack and the underlying app source code (if relevant) for incident root cause analysis (and other use cases). With this approach, leveraging the equivalent Instana data model, we have seen the proprietary Instana “I3” (intelligent incident investigation [8]) agent achieve up to 4.0× improvement over a ReAct agent with GPT-5.1, as measured using ITBench [9]. With Gemini 3 Flash, the ReAct agent's performance improves to within 17% of the I3 agent's while consuming 1.6× more tokens. We have extended this approach to source code with agents for code analysis (leveraging program dependency graphs) and bug remediation (leveraging inference scaling), also tested on ITBench. The source code analysis and bug remediation agents (Gemini 2.5 Flash) outperform a state-of-the-art coding agent both in finding the culpable microservice (3.0×) and in bug repair (1.6×), while consuming 3.7× and 5.9× fewer tokens, respectively. This multi-agent system was announced at IBM Think as part of the newly unveiled IBM Concert Platform for shift-left IT Operations and is also being piloted internally with IBM CIO. [10]
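One simple way to picture graph-bounded local reasoning is a hop-limited traversal that starts at the alerting entity and collects telemetry only for nearby nodes. The sketch below assumes a toy dependency graph and invented telemetry strings; the entity names, the `neighborhood` and `local_context` helpers and the hop limit are illustrative assumptions, not the I3 agent's algorithm or the Instana data model.

```python
from collections import deque
from typing import Dict, List

# Hypothetical knowledge graph: entity -> entities it depends on.
GRAPH: Dict[str, List[str]] = {
    "checkout": ["payments", "cart"],
    "payments": ["payments-db"],
    "cart": ["redis"],
    "payments-db": [],
    "redis": [],
    "search": ["search-index"],
    "search-index": [],
}

# Hypothetical telemetry snippets keyed by entity.
TELEMETRY: Dict[str, str] = {
    "checkout": "HTTP 500 rate 12%",
    "payments": "p99 latency 4.2s",
    "cart": "healthy",
    "payments-db": "connection pool exhausted",
    "redis": "healthy",
    "search": "healthy",
    "search-index": "reindex in progress",
}


def neighborhood(graph: Dict[str, List[str]], start: str, max_hops: int) -> Dict[str, int]:
    """Breadth-first search that stops expanding at max_hops from start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # boundary reached: do not look further out
        for nxt in graph.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


def local_context(graph, telemetry, start: str, max_hops: int) -> str:
    """Build the bounded context that would be handed to the LLM."""
    nodes = neighborhood(graph, start, max_hops)
    return "\n".join(
        f"{name} (hops={dist}): {telemetry[name]}"
        for name, dist in sorted(nodes.items(), key=lambda kv: (kv[1], kv[0]))
    )


CONTEXT = local_context(GRAPH, TELEMETRY, "checkout", 2)
```

Here an incident on `checkout` pulls in `payments-db` at two hops but never touches the unrelated `search` subtree, so its telemetry stays out of the prompt.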

Automating IT compliance modernization for critical environments - algorithms and adaptive planning and orchestration. [11]

Enterprises face increasingly complex and fragmented compliance requirements, forcing teams to spend considerable time manually creating controls, assessments and remediation plans. No centralized knowledge exists and fixes are written manually, which introduces a risk of errors and security gaps. Because compliance work is complex and multi-step, it requires coordinated policy-driven automation across specialized agents rather than manual effort or simple AI prompts. Our multi-agent system automates compliance by algorithmically decomposing complex tasks into coordinated steps, using adaptive planning, dynamic decomposition and workflow sequencing with continuous feedback to iteratively identify fixes and expand assessments. It is 1.3× to 2.0× more performant than prior agents (Claude 4 Sonnet) using fixed planning strategies, as measured using ITBench. This approach transforms compliance into a continuously guided, self-correcting process and dramatically improves outcomes, especially in complex scenarios, boosting success rates from single digits to as high as 80% (Claude 4 Sonnet). This multi-agent system and 16K+ digitized control mappings were unveiled as part of IBM Sovereign Core at IBM Think, integrated with monitoring and drift detection, providing automated evidence generation and ensuring that audit evidence stays securely within customer control. [12]
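The difference between a fixed plan and an adaptive one can be sketched with a small work queue: when an assessment fails, the planner inserts a remediation step and a re-assessment ahead of the remaining work instead of simply moving on. This is an assumption-laden toy, not the system described above; the control names, desired values and the `run_compliance` function are invented, and real assessments would call scanners or policy engines.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical controls and the values they require.
DESIRED: Dict[str, object] = {
    "ssh_root_login": "no",
    "password_min_len": 14,
}


def assess(control: str, state: Dict[str, object]) -> bool:
    return state.get(control) == DESIRED[control]


def remediate(control: str, state: Dict[str, object]) -> None:
    state[control] = DESIRED[control]


def run_compliance(
    controls: List[str],
    state: Dict[str, object],
    assess_fn: Callable[[str, Dict[str, object]], bool],
    remediate_fn: Callable[[str, Dict[str, object]], None],
    max_steps: int = 20,
) -> List[Tuple[str, str, bool]]:
    """Execute a plan that rewrites itself based on assessment feedback."""
    plan = deque(("assess", c) for c in controls)
    log: List[Tuple[str, str, bool]] = []
    steps = 0
    while plan and steps < max_steps:  # step budget guards against loops
        action, control = plan.popleft()
        steps += 1
        if action == "assess":
            ok = assess_fn(control, state)
            log.append((action, control, ok))
            if not ok:
                # Feedback-driven replanning: fix first, then verify again.
                plan.appendleft(("assess", control))
                plan.appendleft(("remediate", control))
        else:
            remediate_fn(control, state)
            log.append((action, control, True))
    return log


STATE = {"ssh_root_login": "yes", "password_min_len": 14}
LOG = run_compliance(list(DESIRED), STATE, assess, remediate)
```

A fixed plan would have recorded one failed assessment and stopped there. The adaptive version records the failure, the fix and a passing re-check, and the `max_steps` budget keeps a remediation that never succeeds from looping forever.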

The above examples illustrate the impact of agent logic in reducing LLM context and guiding the LLM to traverse the core of the workflow in a highly performant and cost-effective manner. Additionally, we have applied similar approaches in two case studies: one with a configurable generalist agent and runtime (CUGA) in the healthcare domain, and another on condition-based maintenance of physical assets with IBM Global Real Estate.

Domain Case Studies

Case Study 1: Configurable Generalist Agent (CUGA) Healthcare benchmark - algorithmic policy enforcement. [13]

The following health insurance customer care example is a compact illustration of why agentic systems outperform LLM-only conversational models in regulated environments. The policy system of CUGA (Configurable Generalist Agent) implements policy-as-code for agent governance, which is enforced at runtime independent of model prompts and without fine-tuning. Our experiments show that the agent’s policy system closes large gaps in task correctness by enforcing structured workflows, safe intent handling, reliable tool usage, and controlled output formatting across all model families tested (Claude Opus 4.5, GPT OSS 120B and GPT-4.1), with accuracy improvements ranging from 15% to 26%. Authority is enforced through least-privilege disclosure, explicit compliance rules, and human escalation paths. Intelligent actions are proposed, while authority is exercised by policy and oversight mechanisms. Reasoning is autonomous; decision rights are constrained. CUGA is also a key component of IBM Sovereign Core, launched at IBM Think.
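A minimal sketch of runtime policy-as-code, under stated assumptions, is shown below: every tool call the model proposes passes through plain functions that can allow, deny or escalate it, regardless of what the prompt said. The tool names (`get_claim`, `issue_refund`), the 500 refund threshold and the `enforce` function are hypothetical and are not CUGA's actual policy interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, object]


@dataclass
class Decision:
    outcome: str  # "allow", "deny" or "escalate"
    reason: str


# A policy inspects a proposed call and returns a Decision, or None if it
# has no objection.
Policy = Callable[[ToolCall, Dict[str, object]], Optional[Decision]]


def only_own_records(call: ToolCall, session: Dict[str, object]) -> Optional[Decision]:
    """Least-privilege disclosure: members may only read their own claims."""
    if call.tool == "get_claim" and call.args.get("member_id") != session["member_id"]:
        return Decision("deny", "claim belongs to a different member")
    return None


def large_refunds_need_human(call: ToolCall, session: Dict[str, object]) -> Optional[Decision]:
    """Human escalation path; the 500 threshold is an illustrative value."""
    if call.tool == "issue_refund" and float(call.args.get("amount", 0)) > 500:
        return Decision("escalate", "refund above threshold requires human approval")
    return None


def enforce(call: ToolCall, session: Dict[str, object], policies: List[Policy]) -> Decision:
    """Run before any tool executes; the first objecting policy wins."""
    for policy in policies:
        decision = policy(call, session)
        if decision is not None:
            return decision
    return Decision("allow", "no policy objected")


POLICIES: List[Policy] = [only_own_records, large_refunds_need_human]
SESSION = {"member_id": "M-100"}
```

Because `enforce` sits between the model's proposal and the tool's execution, the same rules apply to any model family without prompt changes or fine-tuning, which mirrors the separation of reasoning and decision rights described above.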

Case Study 2: Condition-based Maintenance of Physical Assets for IBM Global Real Estate - directed acyclic graph. [14],[15]

Enterprise maintenance systems collect copious amounts of asset data but are unable to combine them effectively, requiring experts to manually piece together fragmented signals and make decisions without unified, evidence-based insights. Our recently launched Maximo Condition Insights [16] agent analyzes large-scale asset data across thousands of assets and locations (sensors, work orders, failure mode and effects analysis), using structured evidence and validation loops to reliably identify issues, prioritize actions and support decision-making with consistent, traceable insights. We have piloted this agent (using GPT OSS 120B) internally with IBM Global Real Estate (GRE), reducing asset analysis time from 15-20 minutes to 15-30 seconds (a roughly 97% improvement) and increasing asset review coverage from ~1% to ~30% across more than 120 sites and 6K physical assets. On AssetOpsBench, the Condition Insights agent reduced unsupported claims by 57%, cut verbosity by 35%, improved rule compliance by 30%, maintained near-zero contradictions, and lowered token usage by 77% on average, while slightly increasing diagnostic specificity. The agent is equipped with a directed acyclic graph that provides structural engineering and operational context, which reduces the unsupported reasoning seen under naive prompting. Constraint-aware prompting, in turn, markedly improves rule adherence, reduces verbosity, and lowers overall token consumption without introducing instability.
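The role a directed acyclic graph can play here is sketched below: the DAG fixes the order in which evidence is assembled, and a validation pass flags any claim that cites evidence which was never gathered. The node names, the claim format and the `unsupported_claims` helper are assumptions for illustration only and do not describe the Condition Insights implementation.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical reasoning DAG: each step maps to the steps it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "sensor_anomalies": set(),
    "work_order_history": set(),
    "failure_mode_candidates": {"sensor_anomalies", "work_order_history"},
    "diagnosis": {"failure_mode_candidates"},
    "recommended_action": {"diagnosis"},
}

# static_order() yields every step after all of its dependencies.
ORDER: List[str] = list(TopologicalSorter(DEPENDS_ON).static_order())


def unsupported_claims(claims: List[dict], evidence: Dict[str, str]) -> List[str]:
    """Validation loop body: return claims citing no or missing evidence."""
    flagged = []
    for claim in claims:
        cited = claim.get("evidence", [])
        if not cited or any(e not in evidence for e in cited):
            flagged.append(claim["text"])
    return flagged


# Hypothetical evidence store and model output.
EVIDENCE = {
    "S-17": "Supply fan vibration rose 40% over 14 days",
    "WO-204": "Bearing replaced on the same fan 11 months ago",
}
CLAIMS = [
    {"text": "Fan bearing wear is the likely cause", "evidence": ["S-17", "WO-204"]},
    {"text": "The chiller compressor is also failing", "evidence": ["S-99"]},
    {"text": "Replace the unit immediately", "evidence": []},
]
FLAGGED = unsupported_claims(CLAIMS, EVIDENCE)
```

In a validation loop, the flagged claims would be sent back for revision or dropped before the insight is shown to a maintenance engineer. `graphlib` is part of the Python standard library from version 3.9 and raises `CycleError` if the graph is not acyclic.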

Summary and References

We have benefited from guides for centuries, which have simplified and enhanced our lives. As technology has evolved, so have the guides we use, enabling us to do more and further shrink our global village. With the arrival of this agentic AI era, as we seek to further enhance society in part through economies of scale, we should continue this trend and fully leverage agent logic to simplify model context and intelligently traverse enterprise workflows at the core; only then will scalable adoption at optimal operating costs be truly feasible.

[1] The GenAI Divide: State of AI in Business 2025, MIT study, https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf

[2] From AI projects to profits: How agentic AI can sustain financial returns, IBM IBV report, https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/agentic-ai-profits

[3] Understand, IBM watsonx Code Assistant for Z, Feb 27, 2026, https://www.ibm.com/docs/en/watsonx/watsonx-code-assistant-4z/2.x?topic=understand

[4] R. Pan, R. Krishna, R. Pavuluri, et al., ASTER: Natural and multi-language unit test generation with LLMs, IBM Research, Apr 30, 2025, https://research.ibm.com/blog/aster-llm-unit-testing

[5] R. Pan, R. Pavuluri, R. Huang, et al., SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents, Nov 17, 2025, https://arxiv.org/abs/2511.13305

[6] S. Jha, R. Arora, Bhavya, et al., Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation, Jan 25, 2026, https://arxiv.org/abs/2601.17915

[7] S. Cui, R. Krishna, S. Jha, et al., Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications, Dec 26, 2025, https://arxiv.org/html/2512.22113v1

[8] IBM Instana and Intelligent Incident Investigation agent, "Use agentic AI to resolve incidents faster with IBM Instana Intelligent Incident Investigation"

[9] S. Jha, R. Arora, Y. Watanabe, et al., ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks, Feb 7, 2025, https://arxiv.org/abs/2502.05352

[10] IBM Concert platform, https://www.ibm.com/new/announcements/from-insight-to-action-closing-the-gap-in-modern-it-operations

[11] Y. Watanabe, T. Yanagawa, H. Kitahara, A. Sailer, IT Compliance Automation with GenAI CISO Assessment Agent, DZone Tutorial, Dec 12, 2025, https://dzone.com/articles/itbench-part-3-it-compliance-automation-with-genai

[12] IBM Sovereign Core, https://newsroom.ibm.com/2026-05-05-think-2026-ibm-makes-digital-sovereignty-operational-with-general-availability-of-ibm-sovereign-core

[13] S. Shlomov, A. Oved, S. Marreed, et al., From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production, Dec 9, 2025, https://arxiv.org/pdf/2510.23856

[14] D. Patel, S. Lin, J. Rayfield, et al., AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance, Jun 4, 2025, https://arxiv.org/abs/2506.03828

[15] F. O'Donncha, N. Zhou, N. Martinez, et al., Evidence-Driven Reasoning for Industrial Maintenance Using Heterogeneous Data, https://arxiv.org/abs/2603.08171

[16] IBM Maximo and Condition Insights agent, https://www.ibm.com/new/announcements/maximo-condition-insight
