
Balancing cost and performance: Agentic AI development

The C-suite loves what agentic AI promises: autonomous systems that can think, decide, and act without constant human intervention. The potential for productivity and lower costs is undeniable — until the bills start rolling in. 

If your “strategy” is to ship first and figure out the cost later, you’re not building agentic AI. You’re financing a science project.

The goal is not to cut costs. It’s to engineer cost, speed, and quality to move together from day one. Because once an agent is in production, every weak decision you made in architecture, governance, and infrastructure becomes a recurring charge.

When cloud costs can spike by more than 200% overnight and development cycles stretch months beyond plan, that “transformative” agent stops looking like innovation and starts looking like a resource sink you can’t justify — to the board, to the business, or to your own team.

This isn’t another “how to save money on artificial intelligence” listicle. It reflects how leading teams using DataRobot align architecture, governance, and infrastructure with spend so autonomy doesn’t turn into a blank check. This is a comprehensive strategic framework for enterprise leaders who refuse to choose between innovation and financial discipline. We’ll surface the real cost drivers, call out where competitors routinely bleed money (so you don’t), and lay out infrastructure and operating strategies that keep your agentic AI initiatives from becoming cutting-room-floor casualties.

Key takeaways

  • Agentic AI can be more expensive than traditional AI because of orchestration, persistent context, and heavier governance and observability needs, not just raw compute.
  • The real budget killers are hidden costs like monitoring, debugging, governance, and token-heavy workflows, which compound over time if you don’t design for cost from the start.
  • Dollar-per-decision is a better ROI metric for agentic systems than cost-per-inference because it captures both the cost and the business value of each autonomous decision.
  • You can reduce development and run costs without losing quality by pairing the right models with each task, using dynamic cloud scaling, leveraging open source frameworks, and automating testing and deployment.
  • Infrastructure and operations are often the largest cost lever, and platforms like DataRobot help teams contain spend by unifying observability, governance, and agent orchestration in one place.

What is agentic AI, and why is it cost-intensive?

Agentic AI isn’t a reactive system that waits for inputs and spits out predictions. These are agents that act on their own, guided by the rules and logic you build into them. They’re contextually aware of their environment, learning from it and making decisions by taking action across multiple connected systems, workflows, and business processes simultaneously.

That autonomy is the whole point — and it’s exactly why agentic AI gets expensive in a hurry.

The cost of autonomy hits you in three ways. 

  1. Computational complexity explodes. Instead of running a single model inference, agentic systems orchestrate multiple AI components and continuously adapt based on new information. 
  2. Infrastructure requirements multiply. Real-time data access, enterprise integrations, persistent memory, and scaling behavior become table stakes, not nice-to-haves.
  3. Oversight and governance get harder. When AI can take action without a human in the loop, your control plane needs to be real, not aspirational.

Where traditional AI might cost $0.001 per inference, agentic systems can run $0.10–$1.00 per complex decision cycle. Multiply that by hundreds or thousands of daily interactions, and you’re looking at monthly bills that are hard to defend, even when the use case is “working.”
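
To make that concrete, here is a back-of-the-envelope sketch in Python. The per-unit figures come from the ranges above; the interaction volume is a hypothetical example, so substitute your own numbers.

```python
# Rough monthly cost comparison using the figures cited above.
# DAILY_INTERACTIONS is a hypothetical volume -- plug in your own.
TRADITIONAL_COST_PER_INFERENCE = 0.001  # dollars
AGENTIC_COST_PER_DECISION = 0.50        # midpoint of the $0.10-$1.00 range
DAILY_INTERACTIONS = 2_000
DAYS_PER_MONTH = 30

traditional_monthly = TRADITIONAL_COST_PER_INFERENCE * DAILY_INTERACTIONS * DAYS_PER_MONTH
agentic_monthly = AGENTIC_COST_PER_DECISION * DAILY_INTERACTIONS * DAYS_PER_MONTH

print(f"Traditional: ${traditional_monthly:,.2f}/month")  # $60.00
print(f"Agentic:     ${agentic_monthly:,.2f}/month")      # $30,000.00
```

Same workload, a 500x difference in run cost: that is the gap you have to justify with decision-level value.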

An important point here is that hidden costs in agentic AI often dwarf the obvious ones. Compute costs aren’t the real budget killers. It’s the operational complexity that nobody talks about (until it’s too late).

Key cost drivers in agentic AI projects

Let’s cut through the vendor marketing and look at where your money actually goes. Agentic AI costs break down into four major buckets, each with its own optimization challenges and budget multipliers:

  • Inference costs are the most visible, but often the least controllable. Every decision your agent makes triggers LLM calls, context retrievals, ranking steps, and reasoning cycles. A single customer service interaction might involve sentiment classification, knowledge base searches, policy checks, and response generation — each one adding to your token bill.
  • Infrastructure costs scale differently than traditional AI workloads. Agentic systems need persistent memory, real-time data pipelines, and active integration middleware running continuously. Unlike batch jobs that spin up and down, these agents maintain state and context over time. That “always on” design is where spend creeps.
  • Development costs pile up because you’re likely building orchestration layers, testing multi-agent systems and their interactions, and debugging emergent behaviors that only appear at scale… all at once. Testing an agent that makes autonomous decisions across multiple systems makes traditional MLOps look simple by comparison.
  • Maintenance costs drain budgets in the long term. Agents drift, integrations break, and edge cases crop up that require constant tuning. Unlike static systems that degrade predictably, agentic systems can fail in unexpected ways that demand immediate attention, and teams pay for that urgency.

Enterprises getting this right aren’t necessarily spending less overall. They’re just a) using their dollars in smarter ways and b) understanding which categories offer the most optimization potential and cost controls for their architecture from day one.

Hidden expenses that derail budgets

The costs that ultimately kill agentic AI projects are the operational realities that show up only after your agents start making real decisions in production environments: real invoices, real headcount burn, and real executive scrutiny.

Monitoring and debugging overhead

Your agentic AI system made 10,000 autonomous decisions overnight. Now, three customers are complaining about issues with their accounts. How do you debug that?

Traditional monitoring assumes you know what to look for. Agentic systems generate emergent behaviors that require entirely new observability approaches. You need to track decision paths, conversation flows, multi-agent interactions, tool calls, and the reasoning behind each action.

Here’s the expensive truth: Without proper observability, debugging turns into days of forensic work. That’s where labor costs quietly explode — engineers pulled off roadmap work, incident calls multiplying, and leadership demanding certainty you can’t provide because you didn’t instrument the system to explain itself.

Building observability into agent architecture is mandatory from the start. Selective logging, automated anomaly detection, and decision replay systems make debugging tractable without turning your platform into a logging furnace. And this is where unified platforms matter, because if your observability is stitched together across tools, your costs and blind spots multiply together, too.

Governance, security, and compliance

Retrofitting governance and security controls onto autonomous systems that are already making production decisions can turn your “cheap” agentic AI implementation into an expensive rewrite.

A few requirements are non-negotiable for enterprise deployments: 

  • Role-based access control
  • Audit trails
  • Explainability frameworks
  • Security layers that protect against prompt injection and data exfiltration 

Each adds another layer and cost that scales as your agent ecosystem grows.

The reality is that misbehaving AI costs scale with autonomy. When a traditional system makes a bad prediction, you can often catch it downstream. But when an agent takes incorrect actions across multiple business processes, damage branches fast, and you pay twice: once to fix the problem and again to restore trust.

That’s why compliance needs to be built into agent architecture right away. Mature governance frameworks can scale with an agent ecosystem rather than trying to secure systems designed for speed over control.

Token consumption

Agentic systems consume compute resources continuously through maintaining context, processing multi-turn conversations, and executing reasoning chains that can span thousands of tokens per single decision.

The math is brutal. A customer support agent that looks efficient at 100 tokens per interaction can easily use 2,000–5,000 tokens when the scenario requires multiple tool calls, context retrieval, and multi-step reasoning. Multiply that by enterprise-scale volumes and you can rack up monthly token bills that dwarf even your infrastructure spend.

CPU and GPU utilization follow the same compounding pattern. Every extra thousand tokens is more GPU time. At scale, those seemingly small token decisions become one of your biggest cost line items. Even an “idle” agent can still consume resources through polling, background workflows, state management, monitoring, and context upkeep.

This is exactly why infrastructure and tooling are levers, not afterthoughts. You control token burn by controlling orchestration design, context strategy, caching, routing, evaluation discipline, and the guardrails that prevent looping and runaway workflows.
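
As one illustration of those guardrails, here is a minimal per-task token budget in Python. It is a sketch under stated assumptions: `call_model` is a stub standing in for whatever client you actually use.

```python
# Per-task token budget: stops a looping agent before spend runs away.

class TokenBudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"task used {self.used} tokens, budget is {self.max_tokens}")

def call_model(prompt: str) -> tuple[str, int]:
    """Stub for a real LLM client; returns (text, tokens_used)."""
    return f"echo: {prompt}", len(prompt.split()) * 10  # hypothetical count

def run_step(budget: TokenBudget, prompt: str) -> str:
    text, tokens_used = call_model(prompt)
    budget.charge(tokens_used)  # fail fast instead of looping forever
    return text

budget = TokenBudget(max_tokens=4_000)
print(run_step(budget, "summarize the ticket history"))
```

The point is the failure mode: a runaway loop hits a hard ceiling and raises an error, instead of quietly burning tokens all night.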

Cost-effective strategies to reduce development costs without losing quality

Cost optimization in agentic AI starts with architectural intelligence. The choices you make here either compound efficiency or compound regret.

Adopt lightweight or fine-tuned foundation models

Tough truth time: Using the newest, shiniest, most advanced possible engine for every task isn’t the way to go.

Most agent decisions don’t need heavyweight reasoning. Configure your agents to use lightweight models for routine decisions, and reserve expensive large language models (LLMs) for complex scenarios that truly need advanced reasoning.

Fine-tuned, domain-specific engines often outperform larger general-purpose models while consuming fewer tokens and computational resources. This is what happens when architecture is designed intentionally. DataRobot makes this operational by turning model evaluation and routing into an architectural control, not a developer preference — which is the only way this works at enterprise scale.
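
A minimal sketch of that routing idea, assuming hypothetical model names and a stand-in complexity heuristic (a real system would use a classifier or confidence score rather than string checks):

```python
# Route routine requests to a cheap model; escalate only complex tasks.

ROUTINE_MODEL = "small-finetuned-model"  # hypothetical name
COMPLEX_MODEL = "large-general-model"    # hypothetical name

def needs_heavy_reasoning(task: str) -> bool:
    # Stand-in heuristic for illustration only.
    return len(task.split()) > 200 or "multi-step" in task.lower()

def pick_model(task: str) -> str:
    return COMPLEX_MODEL if needs_heavy_reasoning(task) else ROUTINE_MODEL

print(pick_model("Reset the customer's password"))           # small-finetuned-model
print(pick_model("Run a multi-step contract risk analysis")) # large-general-model
```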

Utilize dynamic scaling for cloud infrastructure

Infrastructure that scales with demand, not peak capacity, is necessary for controlling agentic AI costs. Auto-scaling and serverless architectures eliminate waste from over-provisioned resources while keeping performance humming during demand spikes.

Kubernetes configurations that understand agentic workload patterns can deliver 40–60% infrastructure savings, since agent workloads follow predictable patterns: higher during business hours, lower overnight, with spikes during specific business events.

This is where practitioner teams get ruthless: They treat idle capacity as a design bug. DataRobot syftr is built for that reality, helping teams right-size and optimize infrastructure so experimentation and production don’t inherit runaway cloud habits.

Off-peak optimization offers more savings opportunities. Schedule non-urgent agent tasks during low-cost periods, pre-compute common responses, and use spot instances for development and testing workloads. These strategies can reduce infrastructure costs without affecting user experience — as long as you design for them instead of bolting them on.
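
A hedged sketch of that deferral pattern, with an assumed off-peak window (the hours and task shape are illustrative):

```python
# Non-urgent agent tasks wait for a cheap window; urgent ones run immediately.
from datetime import datetime, time

OFF_PEAK_START, OFF_PEAK_END = time(22, 0), time(6, 0)  # 10pm-6am, hypothetical

def in_off_peak(now: datetime | None = None) -> bool:
    t = (now or datetime.now()).time()
    return t >= OFF_PEAK_START or t < OFF_PEAK_END  # window wraps past midnight

def schedule(task, urgent: bool, run_now, defer):
    if urgent or in_off_peak():
        run_now(task)
    else:
        defer(task)  # e.g., push to a queue drained overnight on spot capacity

tasks_run, tasks_deferred = [], []
schedule("re-embed knowledge base", urgent=False,
         run_now=tasks_run.append, defer=tasks_deferred.append)
```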

Leverage open source frameworks and pre-trained models

Open source frameworks like LangChain, AutoGen, and Haystack provide production-ready orchestration capabilities without the licensing costs of commercial alternatives. 

Here’s the catch: Open source gives you building blocks, but doesn’t give you enterprise-grade observability, governance, or cost control by default. DataRobot complements these frameworks by giving you the control plane — the visibility, guardrails, and operational discipline required to run agentic AI at scale without duct tape.

Commercial agent platforms can charge $2,000–$50,000+ per month for features that open source frameworks provide for the cost of infrastructure and internal development. For enterprises with technical capability, this can lead to substantial long-term savings.

Open source also provides flexibility that commercial solutions often lack. You can customize orchestration logic, integrate with existing systems, and avoid vendor lock-in that becomes expensive as your agent ecosystem scales.

Automate testing and deployment

Manual processes collapse under agentic complexity. Automation saves you time and reduces costs and risks, enabling reliable scaling.

Automated evaluation pipelines test agent performance across multiple scenarios to catch issues before they reach production. CI/CD for prompts and configurations accelerates iteration without increasing risk. 

Regression testing becomes vital when agents make autonomous decisions. Automated testing frameworks can simulate thousands of scenarios and validate that behavior remains consistent as you improve the system. This prevents the expensive rollbacks and emergency fixes that come with manual deployment processes — and it keeps “small” changes from becoming million-dollar incidents.
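
A bare-bones version of such a regression gate, sketched in Python with a placeholder agent and two illustrative scenarios (a real suite would replay thousands of recorded or synthetic cases):

```python
# Replay fixed scenarios; fail the build if the pass rate drops below a gate.

SCENARIOS = [
    {"input": "refund request, order 123", "expect": "refund"},
    {"input": "where is my package",       "expect": "tracking"},
]

def run_agent(text: str) -> str:
    """Placeholder agent; substitute your real agent entry point."""
    return "refund" if "refund" in text else "tracking"

def regression_pass_rate(scenarios) -> float:
    passed = sum(run_agent(s["input"]) == s["expect"] for s in scenarios)
    return passed / len(scenarios)

MIN_PASS_RATE = 0.95  # hypothetical release gate
assert regression_pass_rate(SCENARIOS) >= MIN_PASS_RATE, "block the deploy"
```

Wired into CI, that assert is the difference between catching a regression in a pipeline and catching it in production.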

Optimizing infrastructure and operations for scalable AI agents

Infrastructure isn’t a supporting actor in agentic AI. It’s a significant chunk of the total cost-savings opportunity, and the fastest way to derail a program if ignored. Getting this right means treating infrastructure as a strategic advantage rather than another cost center.

Caching strategies designed for agentic workloads deliver immediate cost benefits. Agent responses, context retrievals, and reasoning chains often have reusable components. And sometimes, too much context is a bad thing. Intelligent caching can reduce compute costs while improving response times.

This goes hand in hand with pipeline optimization, which focuses on eliminating redundant processing. Instead of running separate inference flows for each agent task, build shared pipelines multiple agents can use.
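
To illustrate the caching side, here is a minimal response cache keyed on a hash of the prompt and context. The hashing scheme and cache policy are assumptions; production systems also need eviction and invalidation rules.

```python
# Identical (prompt, context) pairs skip the model call entirely.
import hashlib

_cache: dict[str, str] = {}

def cache_key(prompt: str, context: str) -> str:
    return hashlib.sha256(f"{prompt}\x00{context}".encode()).hexdigest()

def cached_call(prompt: str, context: str, model_call) -> str:
    key = cache_key(prompt, context)
    if key not in _cache:
        _cache[key] = model_call(prompt, context)  # only pay on a miss
    return _cache[key]

calls = []
def fake_model(prompt, context):
    calls.append(prompt)  # stand-in that records how often we really "pay"
    return f"answer to {prompt}"

cached_call("reset password steps", "kb-v12", fake_model)
cached_call("reset password steps", "kb-v12", fake_model)
print(len(calls))  # 1 -- the second request was a cache hit
```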

Your deployment model choice (on-prem, cloud, or hybrid) has massive cost implications.

  • Cloud provides elasticity, but can become expensive at scale. 
  • On-prem offers cost predictability but requires a significant upfront investment (and real estate). 
  • Hybrid approaches let you optimize for both cost and performance based on workload characteristics.

Here’s your optimization checklist:

  • Implement intelligent caching. 
  • Optimize model inference pipelines. 
  • Right-size infrastructure for actual demand.
  • Automate scaling based on usage patterns.
  • Monitor and optimize token consumption.

Build vs. buy: Choosing the right path for agentic AI

The build-versus-buy decision will define both your cost structure and competitive advantage for years. Get it wrong, and you’ll either overspend on unnecessary features or under-invest in capabilities that determine success.

Building your own solution makes sense when you have specific requirements, technical capabilities, and long-term cost optimization goals. Custom development might cost $200,000–$300,000 upfront, but offers complete control and lower operational costs. You own your intellectual property and can optimize for your specific use cases.

Buying a pre-built platform provides faster time-to-market and lower upfront investment. Commercial platforms typically charge $15,000–$150,000+ annually but include support, updates, and proven scalability. The trade-off is vendor lock-in and ongoing licensing costs that grow as you scale.

Hybrid approaches allow enterprises to build core orchestration and governance capabilities while taking advantage of commercial solutions for specialized functions. This balances control with speed-to-market.

Factor | High | Medium | Low
Technical capability | Build | Hybrid | Buy
Time pressure | Buy | Hybrid | Build
Budget | Build | Hybrid | Buy
Customization needs | Build | Hybrid | Buy

A future-proof approach to cost-aware AI development

Cost discipline cannot be bolted on later. It’s a signal of readiness and a priority that needs to be embedded into your development lifecycle from day one — and frankly, it’s one of the fastest ways to tell whether an organization is ready for agentic AI or just excited about it.

This is how future-forward enterprises move fast without breaking trust or budgets. 

  • Design for cost from the beginning. Every architectural decision has cost implications that compound over time. So choose frameworks, models, and integration patterns that optimize for long-term efficiency, not just initial development speed.
  • Progressive enhancement prevents over-engineering while maintaining upgrade paths. Start with simpler agents that handle your most routine scenarios effectively, then add complexity only when the business value justifies the added costs. This “small-batch” approach lets you deliver immediate ROI while building toward more sophisticated capabilities.
  • Modular component architecture helps with optimization and reuse across your agent ecosystem. Shared authentication, logging, and data access eliminate redundant infrastructure costs. Reusable agent templates and orchestration patterns also accelerate future development while maintaining your standards.
  • Governance frameworks that scale with your agents prevent the expensive retrofitting that kills many enterprise AI projects. Build approval workflows, audit capabilities, and security controls that grow with your system rather than constraining it.

Drive real outcomes while keeping costs in check

Cost control and performance can coexist. But only if you stop treating cost like a finance problem and start treating it like an engineering requirement.

Your highest-impact optimizations fall into a few key areas:

  • Intelligent model selection that matches capability to cost
  • Infrastructure automation that eliminates waste
  • Caching strategies that reduce redundant processing
  • Open source frameworks that provide flexibility without vendor lock-in

But optimization isn’t a one-time effort. Build continuous improvement into operations through regular cost audits, optimization sprints, and performance reviews that balance efficiency with business impact. The organizations that win treat cost optimization as a competitive advantage — not a quarterly clean-up effort when Finance comes asking.

DataRobot’s Agent Workforce Platform addresses these challenges directly, unifying orchestration, observability, governance, and infrastructure control so enterprises can scale agentic AI without scaling chaos. With DataRobot’s syftr, teams can actively optimize infrastructure consumption instead of reacting to runaway spend after the fact.

Learn how DataRobot helps AI leaders deliver outcomes without excuses.

FAQs

Why is agentic AI more expensive than traditional AI or ML?
Agentic AI is costlier because it does more than return a single prediction. Agents reason through multi-step workflows, maintain context, call multiple tools, and act across systems. That means more model calls, more infrastructure running continuously, and more governance and monitoring to keep everything safe and compliant.

Where do most teams underestimate their agentic AI costs?
Most teams focus on model and GPU pricing and underestimate operational costs. The big surprises usually come from monitoring and debugging overhead, token-heavy conversations and loops, and late-stage governance work that has to be added after agents are already in production.

How do I know if my agentic AI use case is actually worth the cost?
Use a dollar-per-decision view instead of raw infrastructure numbers. For each decision, compare total cost per decision against the value created, such as labor saved, faster resolution times, or revenue protected. If the value per decision does not clearly exceed the cost, you either need to rework the use case or simplify the agent.

What are the fastest ways to cut costs without hurting performance?
Start by routing work to lighter or fine-tuned models for routine tasks, and reserve large general models for complex reasoning. Then, tighten your infrastructure with auto-scaling, caching, and better job scheduling, and turn on automated evaluation so you catch regressions before they trigger expensive rollbacks or support work.

How can a platform like DataRobot help with cost control?
A platform like DataRobot helps by bringing observability, governance, and infra controls into one place. You can see how agents behave, what they cost at a decision level, and where they drift, then adjust models, workflows, or infra settings without stitching together multiple tools. That makes it easier to keep both spend and risk under control as you scale.

The post Balancing cost and performance: Agentic AI development appeared first on DataRobot.

Production-ready agentic AI: key challenges and solutions 

As great as your AI agents may be in your POC environment, that success may not carry over to production. Perfect demo experiences often don’t translate into the same level of reliability once agents face real-world conditions.

Taking your agents from POC to production requires overcoming these five fundamental challenges:

  1. Defining success by translating business intent into measurable agent performance.

Building a reliable agent starts by converting vague business goals, such as “improve customer service,” into concrete, quantitative evaluation thresholds. The business context determines what you should evaluate and how you will monitor it. 

For example, a financial compliance agent typically requires 99.9% functional accuracy and strict governance adherence, even if that comes at the expense of speed. In contrast, a customer support agent may prioritize low latency and economic efficiency, accepting a “good enough” 90% resolution rate to balance performance with cost. (A minimal sketch of encoding such thresholds appears after this list.)

  2. Proving your agents work across models, workflows, and real-world conditions.

To reach production readiness, you need to evaluate multiple agentic workflows across different combinations of large language models (LLMs), embedding strategies, and guardrails, while still meeting strict quality, latency, and cost objectives. 

Evaluation extends beyond functional accuracy to cover corner cases, red-teaming for toxic prompts and responses, and defenses against threats such as prompt injection attacks. 

This effort combines LLM-based evaluations with human review, using both synthetic data and real-world use cases. In parallel, you assess operational performance, including latency, throughput at hundreds or thousands of requests per second, and the ability to scale up or down with demand.

  3. Ensuring agent behavior is observable so you can debug and iterate with confidence.

Tracing the execution of agent workflows step by step allows you to understand why an agent behaves the way it does. By making each decision, tool call, and handoff visible, you can identify root causes of unexpected behavior, debug failures quickly, and iterate toward the desired agentic workflow before deployment.

  4. Monitoring agents continuously in production and intervening before failures escalate.

Monitoring deployed agents in production with real-time alerting, moderation, and the ability to intervene when behavior deviates from expectations is crucial. Signals from monitoring, along with periodic reviews, should trigger re-evaluation so you can iterate on or restructure agentic workflows as agents drift from desired behavior over time, and so you can trace the root causes of that drift easily.

  5. Enforcing governance, security, and compliance across the entire agent lifecycle.

You need to apply governance controls at every stage of agent development and deployment to manage operational, security, and compliance risks. Treating governance as a built-in requirement, rather than a bolt-on at the end, ensures agents remain safe, auditable, and compliant as they evolve.

Letting success hinge on hope and good intentions isn’t good enough. Strategizing around this framework is what separates successful enterprise artificial intelligence initiatives from those that get stuck as a proof of concept. 
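
As referenced in the first item above, here is a minimal sketch of encoding business intent as checkable thresholds. The values mirror the compliance and support examples and are illustrative, not prescriptive:

```python
# Per-use-case thresholds that evaluation jobs can check against.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentThresholds:
    min_accuracy: float       # fraction of functionally correct outcomes
    max_latency_ms: int       # e.g., p95 response latency
    max_cost_per_task: float  # dollars

COMPLIANCE_AGENT = AgentThresholds(min_accuracy=0.999, max_latency_ms=10_000,
                                   max_cost_per_task=2.00)
SUPPORT_AGENT = AgentThresholds(min_accuracy=0.90, max_latency_ms=1_500,
                                max_cost_per_task=0.05)

def meets_bar(t: AgentThresholds, accuracy: float,
              latency_ms: int, cost: float) -> bool:
    return (accuracy >= t.min_accuracy
            and latency_ms <= t.max_latency_ms
            and cost <= t.max_cost_per_task)

print(meets_bar(SUPPORT_AGENT, accuracy=0.93, latency_ms=1200, cost=0.04))  # True
```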

Why agentic systems require evaluation, monitoring, and governance

As agentic AI moves beyond POCs into production systems that automate enterprise workflows, its execution and outcomes will directly impact business operations. The cascading effects of agent failures can spread across business processes, and it can all happen fast, often too fast for humans to intervene.

For a comprehensive overview of the principles and best practices that underpin these enterprise-grade requirements, see The Enterprise Guide to Agentic AI.

Evaluating agentic systems across multiple reliability dimensions

Before rolling out agents, organizations need confidence in reliability across multiple dimensions, each addressing a different class of production risk.

Functional

Reliability at the functional level depends on whether an agent correctly understands and carries out the task it was assigned. This involves measuring accuracy, assessing task adherence, and detecting failure modes such as hallucinations or incomplete responses.

Operational

Operational reliability depends on whether the underlying infrastructure can consistently support agent execution at scale. This includes validating scalability, high availability, and disaster recovery to prevent outages and disruptions. 

Operational reliability also depends on the robustness of integrations with existing enterprise systems, CI/CD pipelines, and approval workflows for deployments and updates. In addition, teams must assess runtime performance characteristics such as latency (for example, time to first token), throughput, and resource utilization across CPU and GPU infrastructure.

Security 

Secure operation requires that agentic systems meet enterprise security standards. This includes validating authentication and authorization, enforcing role-based access controls aligned with organizational policies, and limiting agent access to tools and data based on least-privilege principles. Security validation also includes testing guardrails against threats such as prompt injection and unauthorized data access.

Governance and compliance

Effective governance requires a single source of truth for all agentic systems and their associated tools, supported by clear lineage and versioning of agents and components. 

Compliance readiness further requires real-time monitoring, moderation, and intervention to address risks such as toxic or inappropriate content and PII leakage. In addition, agentic systems must be tested against applicable industry and government regulations, with audit-ready documentation readily available to demonstrate ongoing compliance.

Economic

Sustainable deployment depends on the economic viability of agentic systems. This includes measuring execution costs such as token consumption and compute usage, assessing architectural trade-offs like dedicated versus on-demand models, and understanding overall time to production and return on investment.

Monitoring, tracing, and governance across the agent lifecycle

Pre-deployment evaluation alone is not sufficient to ensure reliable agent behavior. Once agents operate in production, continuous monitoring becomes essential to detect drift from expected or desired behavior over time.

Monitoring typically focuses on a subset of metrics drawn from each evaluation dimension. Teams configure alerts on predefined thresholds to surface early signals of degradation, anomalous behavior, or emerging risk. Monitoring provides visibility into what is happening during execution, but it does not on its own explain why an agent produced a particular outcome. 
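
A minimal sketch of that threshold-driven alerting, with illustrative metric names and bounds:

```python
# Compare streaming metric snapshots to predefined thresholds.

THRESHOLDS = {
    "task_success_rate": ("min", 0.95),
    "p95_latency_ms":    ("max", 2_000),
    "tokens_per_task":   ("max", 5_000),
}

def check_metrics(snapshot: dict[str, float]) -> list[str]:
    alerts = []
    for name, (kind, bound) in THRESHOLDS.items():
        value = snapshot.get(name)
        if value is None:
            continue  # metric not reported in this snapshot
        if (kind == "min" and value < bound) or (kind == "max" and value > bound):
            alerts.append(f"{name}={value} breached {kind} bound {bound}")
    return alerts

print(check_metrics({"task_success_rate": 0.91, "p95_latency_ms": 1_800}))
# ['task_success_rate=0.91 breached min bound 0.95']
```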

To uncover root causes, monitoring must be paired with execution tracing. Execution tracing exposes: 

  • How an agent arrived at a result by capturing the sequence of reasoning steps it followed
  • The tools or functions it invoked
  • The inputs and outputs at each stage of execution. 

This visibility extends to relevant metrics such as accuracy or latency at both the input and output of each step, enabling effective debugging, faster iteration, and more confident refinement of agentic workflows.
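
A bare-bones trace recorder in that spirit, sketched generically (this is not a specific product’s tracing API):

```python
# Every step logs its tool name, inputs, outputs, and timing for replay.
import time
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    tool: str
    inputs: dict
    outputs: dict
    started_at: float
    duration_s: float

@dataclass
class Trace:
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, tool: str, inputs: dict, fn):
        start = time.time()
        result = fn(**inputs)  # run the actual tool call
        self.steps.append(TraceStep(tool, inputs, {"result": result},
                                    start, time.time() - start))
        return result

trace = Trace()
trace.record("lookup_order", {"order_id": "123"},
             lambda order_id: {"status": "shipped"})
print([s.tool for s in trace.steps])  # ['lookup_order']
```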

And finally, governance is necessary at every phase of the agent lifecycle, from building and experimentation to deployment in production. 

Governance can be classified broadly into three categories: 

  • Governance against security risks: Ensures that agentic systems are protected from unauthorized or unintended actions by enforcing robust, auditable approval workflows at every stage of the agent build, deployment, and update process. This includes strict role-based access control (RBAC) for all tools, resources, and enterprise systems an agent can access, as well as custom alerts applied throughout the agent lifecycle to detect and prevent accidental or malicious deployments.
  • Governance against operational risks: Focuses on maintaining safe and reliable behavior during runtime by implementing multi-layer defense mechanisms that prevent unwanted or harmful outputs, including PII or other confidential information leakage. This governance layer relies on real-time monitoring, notifications, intervention, and moderation capabilities to identify issues as they occur and enable rapid response before operational failures propagate.
  • Governance against regulatory risks: Ensures that all agentic solutions remain compliant with applicable industry-specific and government regulations, policies, and standards while maintaining strong security controls across the entire agent ecosystem. This includes validating agent behavior against regulatory requirements, enforcing compliance consistently across deployments, and supporting auditability and documentation needed to demonstrate adherence to evolving regulatory frameworks.

Together, monitoring, tracing, and governance form a continuous control loop for operating agentic systems reliably in production. 

Monitoring and tracing provide the visibility needed to detect and diagnose issues, while governance ensures ongoing alignment with security, operational, and regulatory requirements. We will examine governance in more detail later in this article. 

Differences between agentic tool evaluation and monitoring vs classic ML systems

Many of the evaluation and monitoring practices used today were designed for traditional machine learning systems, where behavior is largely deterministic and execution paths are well defined. Agentic systems break these assumptions by introducing autonomy, state, and multi-step decision-making. As a result, evaluating and operating agentic tools requires fundamentally different approaches than those used for classic ML models.

From deterministic models to autonomous agentic systems

Classic ML system evaluation is rooted in determinism and bounded behavior, as the system’s inputs, transformations, and outputs are largely predefined. Metrics such as accuracy, precision/recall, latency, and error rates assume a fixed execution path: the same input reliably produces the same output. Observability focuses on known failure modes, such as data drift, model performance decay, and infrastructure health, and evaluation is typically performed against static test sets or clearly defined SLAs.

By contrast, agentic tool evaluation must account for autonomy and decision-making under uncertainty. An agent does not simply produce an output; it decides what to do next: which tool to call, in what order, and with what parameters. 

As a result, evaluation shifts from single-output correctness to trajectory-level correctness, measuring whether the agent selected appropriate tools, followed intended reasoning steps, and adhered to constraints while pursuing a goal.

State, context, and compounding failures

Agentic systems by design are complex multi-component systems, consisting of a combination of large language models and other tools, which may include predictive AI models. They achieve their outcomes using a sequence of interactions with these tools, and through autonomous decision-making by the LLMs based on tool responses. Across these steps and interactions, agents maintain state and make decisions from accumulated context.

These factors make agentic evaluation significantly more complex than that of predictive AI systems. Predictive AI systems are evaluated simply based on the quality of their predictions, whether the predictions were accurate or not, and there is no preservation of state. Agentic AI systems, on the other hand, need to be judged on quality of reasoning, consistency of decision-making, and adherence to the assigned task. Additionally, there is always a risk of errors compounding across multiple interactions due to state preservation.

Governance, safety, and economics as first-class evaluation dimensions

Agentic evaluation also places far greater emphasis on governance, safety, and cost. Because agents can take actions, access sensitive data, and operate continuously, evaluation must track lineage, versioning, access control, and policy compliance across entire workflows.

Economic metrics, such as token usage, tool invocation cost, and compute consumption, become first-class signals, since inefficient reasoning paths translate directly into higher operational cost.

Agentic systems preserve state across interactions and use it as context in future interactions. For example, to be effective, a customer support agent needs access to previous conversations, account history, and ongoing issues. Losing context means starting over and degrading the user experience.

In short, while traditional evaluation asks, “Was the answer correct?”, agentic tool evaluation asks, “Did the system act correctly, safely, efficiently, and in alignment with its mandate while reaching the answer?”

Metrics and frameworks to evaluate and monitor agents

As enterprises adopt complex, multi-agent autonomous AI workflows, effective evaluation requires more than just accuracy. Metrics and frameworks must span functional behavior, operational efficiency, security, and economic cost. 

Below, we define four key categories for agentic workflow evaluation necessary to establish visibility and control.

Functional metrics

Functional metrics measure whether the agentic workflow performs the task it was designed for and adheres to its expected behavior.

Core functional metrics: 

  • Agent goal accuracy: Evaluates how well the LLM identifies and achieves the user’s goals. It can be measured against reference datasets where the “correct” goals are known, or without them.
  • Agent task adherence: Assesses whether the agent’s final response satisfies the original user request.
  • Tool call accuracy: Measures whether the agent correctly identifies and calls external tools or functions required to complete a task (e.g., calling a weather API when asked about weather).
  • Response quality (correctness / faithfulness): Beyond success/failure, evaluates whether the output is accurate and corresponds to ground truth or external data sources. Metrics such as correctness and faithfulness assess output validity and reliability. 

Why these matter: Functional metrics validate whether agentic workflows solve the problem they were built to solve and are often the first line of evaluation in playgrounds or test environments.
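
As a concrete illustration, a tool call accuracy check can be as simple as comparing observed calls to a labeled reference set. The data shape here is an assumption for illustration:

```python
# Did the agent pick the expected tool with the expected arguments?

reference = [
    {"request": "weather in Oslo", "tool": "weather_api", "args": {"city": "Oslo"}},
]
observed = [
    {"request": "weather in Oslo", "tool": "weather_api", "args": {"city": "Oslo"}},
]

def tool_call_accuracy(ref, obs) -> float:
    hits = sum(r["tool"] == o["tool"] and r["args"] == o["args"]
               for r, o in zip(ref, obs))
    return hits / len(ref)

print(tool_call_accuracy(reference, observed))  # 1.0
```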

Operational metrics 

Operational metrics quantify system efficiency, responsiveness, and the use of computational resources during execution. 

Key operational metrics

  • Time to first token (TTFT): Measures the delay between sending a prompt to the agent and receiving the first model response token. This is a common latency measure in generative AI systems and critical for user experience.
  • Latency & throughput: Measures of total response time and tokens per second that indicate responsiveness at scale.
  • Compute utilization: Tracks how much GPU, CPU, and memory the agent consumes during inference or execution. This helps identify bottlenecks and optimize infrastructure usage.

Why these matter: Operational metrics ensure that workflows not only work but do so efficiently and predictably, which is critical for SLA compliance and production readiness.
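
For instance, TTFT can be measured directly against any streaming client. `stream_tokens` below is a placeholder generator standing in for your provider’s streaming API:

```python
# Time from sending a prompt to receiving the first streamed token.
import time

def stream_tokens(prompt):
    yield from prompt.split()  # placeholder token stream

def time_to_first_token(prompt: str) -> float:
    start = time.perf_counter()
    for _token in stream_tokens(prompt):
        return time.perf_counter() - start  # stop at the first token
    return float("inf")  # the stream produced nothing

print(f"TTFT: {time_to_first_token('hello world') * 1000:.3f} ms")
```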

Security and safety metrics 

Security metrics evaluate risks related to data exposure, prompt injection, PII leakage, hallucinations, scope violations, and access control within agentic environments.

Security controls & metrics

  • Safety metrics: Real-time guards evaluating if agent outputs comply with safety and behavioral expectations, including detection of toxic or harmful language, identification and prevention of PII exposure, prompt-injection resistance, adherence to topic boundaries (stay-on-topic), and emotional tone classification, among other safety-focused controls.
  • Access management and RBAC: Role-based access control (RBAC) ensures that only authorized users can view or modify workflows, datasets, or monitoring dashboards.
  • Authentication compliance (OAuth, SSO): Enforcing secure authentication (OAuth 2.0, single sign-on) and logging access attempts supports audit trails and reduces unauthorized exposure.

Why these matter: Agents often process sensitive data and can interact with enterprise systems; security metrics are essential to prevent data leaks, abuse, or exploitation.

Economic & cost metrics

Economic metrics quantify the cost efficiency of workflows and help teams monitor, optimize, and budget agentic AI applications. 

Common economic metrics

  • Token usage: Tracking the number of prompt and completion tokens used per interaction helps understand billing impact since many providers charge per token.
  • Overall cost and cost per task: Aggregates performance and cost metrics (e.g., cost per successful task) to estimate ROI and identify inefficiencies.
  • Infrastructure costs (GPU/CPU Minutes): Measures compute cost per task or session, enabling teams to attribute workload costs and align budget forecasting.

Why these matter: Economic metrics are crucial for sustainable scale, cost governance, and showing business value beyond engineering KPIs.  

Governance and compliance frameworks for agents

Governance and compliance measures ensure workflows are traceable, auditable, compliant with regulations, and governed by policy. Governance can be classified broadly into three categories. 

Governance in the face of: 

  • Security Risks 
  • Operational Risks
  • Regulatory Risks

Fundamentally, they have to be ingrained in the entire agent development and deployment process, as opposed to being bolted on afterwards. 

Security risk governance framework

Ensuring security policy enforcement requires tracking and adhering to organizational policies across agentic systems. 

Tasks include, but are not limited to, validation and enforcement of access management through authentication and authorization that mirror broader organizational access permissions for all tools and enterprise systems that agents access. 

It also includes setting up and enforcing robust, auditable approval workflows to prevent unauthorized or unintended deployments and updates to agentic systems within the enterprise.

Operational risk governance framework

Ensuring operational risk governance requires tracking, evaluating, and enforcing adherence to organizational policies such as privacy requirements, prohibited outputs, fairness constraints, and red-flagging instances where policies are violated. 

Beyond alerting, operational risk governance systems for agents should provide effective real-time moderation and intervention capabilities to address undesired inputs or outputs. 

Finally, a critical component of operational risk governance involves lineage and versioning, including tracking versions of agents, tools, prompts, and datasets used in agentic workflows to create an auditable record of how decisions were made and to prevent behavioral drift across deployments.

Regulatory risk governance framework

Ensuring regulatory risk governance requires validating that all agentic systems comply with applicable industry-specific and government regulations, policies, and standards. 

This includes, but is not limited to, testing for compliance with frameworks such as the EU AI Act, NIST RMF, and other country- or state-level guidelines to identify risks including bias, hallucinations, toxicity, prompt injection, and PII leakage.

Why governance metrics matter 

Governance metrics reduce legal and reputational exposure while meeting growing regulatory and stakeholder expectations around trustworthiness and fairness. They provide enterprises with the confidence that agentic systems operate within defined security, operational, and regulatory boundaries, even as workflows evolve over time. 

By making policy enforcement, access controls, lineage, and compliance continuously measurable, governance metrics enable organizations to scale agentic AI responsibly, maintain auditability, and respond quickly to emerging risks without slowing innovation.

Turning agentic AI into reliable, production-ready systems

Agentic AI introduces a fundamentally new operating model for enterprise automation, one where systems reason, plan, and act autonomously at machine speed.

This enhanced power comes with risk. Organizations that succeed with agentic AI are not the ones with the most impressive demos, but the ones that rigorously evaluate behavior, monitor systems continuously in production, and embed governance across the entire agent lifecycle. Reliability, safety, and scale are not accidental outcomes. They are engineered through disciplined metrics, observability, and control.

If you’re working to move agentic AI from proof of concept into production, adopting a full-lifecycle approach can help reduce risk and improve reliability. Platforms such as DataRobot support this by bringing together evaluation, monitoring, tracing, and governance to give teams better visibility and control over agentic workflows.

To see how these capabilities can be applied in practice, you can explore a free DataRobot demo.

The post Production-ready agentic AI: key challenges and solutions  appeared first on DataRobot.

Underwater robots inspired by nature are making progress, but hurdles remain

Underwater robots face many challenges before they can truly master the deep, such as stability in choppy currents. A new paper published in the journal npj Robotics provides a comprehensive update of where the technology stands today, including significant progress inspired by the movement of rays.

Adaptive motion system helps robots achieve human-like dexterity with minimal data

Despite rapid robotic automation advancements, most systems struggle to adapt their pre-trained movements to dynamic environments with objects of varying stiffness or weight. To tackle this challenge, researchers from Japan have developed an adaptive motion reproduction system using Gaussian process regression.

Taking humanoid soccer to the next level: An interview with RoboCup trustee Alessandra Rossi

A core objective of RoboCup is to promote and advance robotics and AI research through the challenges offered by its various leagues. The ultimate goal of the soccer competition is that, by 2050, a team of fully autonomous humanoid robots will defeat the most recent winner of the FIFA World Cup. To bring this vision closer to reality, the RoboCup Federation has announced several changes to the leagues. We spoke with Alessandra Rossi, a trustee who has been involved in the humanoid soccer league for many years, to learn more.

Could you start by introducing yourself and tell us how you’ve been involved in RoboCup throughout the years, because you’ve been involved in so many aspects of the competition!

I am Alessandra Rossi from the University of Naples “Federico II”, where I am an Assistant Professor of Computer Science. I began working with and collaborating in RoboCup in 2016, when I started my PhD at the University of Hertfordshire in the UK. I am still affiliated with the University of Hertfordshire, as I remain a member of the humanoid KidSize team Bold Hearts, the longest continuously active team in the UK. After a few years, I became the team leader of Bold Hearts.

In 2019, I became a member of both the Technical Committee and the Organizing Committee of the Humanoid League. After serving on the Technical Committee for two years, I was elected to the Executive Committee of the Humanoid League. In 2025, I was elected to the Board of Trustees for the first time.

Over the years, I have steadily increased my involvement and commitment to RoboCup. I have always sought to actively engage the RoboCup community, both during competitions and outside of competition periods. I also work to encourage engagement between the major and junior leagues and to participate in regional RoboCup events.

While I was working at the University of Hertfordshire as a Visiting Lecturer, we launched an online module that uses RoboCup as a benchmark for teaching robotics to undergraduate students. The module is still running. I initially served as the module leader, and this role has since been taken over by our Bold Hearts teammate, Bente Riegler.

Last year, Maike Paetzel-Prüsmann, Merel Keijsers, and I (as lead authors), in collaboration with several trustees and many members from different leagues, published a paper on the current and future challenges in humanoid robotics. The paper was published in Autonomous Robots and is, to the best of my knowledge, the first to involve such a large and diverse group of contributors from across the RoboCup leagues. It discusses research within RoboCup and the collaboration and synergies between the leagues.

Group photo of the humanoid league teams at RoboCup 2025.

I understand that there are some changes planned for the leagues. Could you say something about that, and specifically about the changes that affect the soccer and humanoid side?

The 2050 goal of the RoboCup Federation, as many people are probably aware, is for a team of humanoid robots to play against the winners of the FIFA World Cup. To achieve this, it is necessary to push further in that direction. One of the key changes, therefore, will be a stronger focus on humanoid robots.

Another major change will be the merger of the Standard Platform League (SPL) and the KidSize Humanoid League. This merged league will have the freedom to define its exact format and to develop a new roadmap that aligns the entire league toward a shared objective. While the 2050 goal itself remains unchanged, the path toward achieving it will need to be adjusted.

It is crucial to continue fostering the engagement of teams in the leagues that will be affected by these changes. At the same time, we must recognize that technology is advancing rapidly. Over the past year, in particular, we have seen significant progress in both hardware platforms and large language models. As RoboCup serves as a global benchmark for robotics research, we should continuously strive to advance technology and research—while still having fun.

Soccer is the complex task and behavior we are studying, and it is complex in many dimensions: from physical control and robot motion, to communication and strategy, and even human-like interactions. These include responding to the referee’s whistle, verbal and non-verbal communication among team members, interactions with the coach, and communication with the referee. All of these aspects will ultimately need to be incorporated into the humanoid league.

The RoboCup Federation has agreed some new partnerships with Unitree, Fourier and Booster. What impact will this have on the humanoid league? Will there be a standard platform element with teams using a specific humanoid robot?

I believe we will see a mix of different robots. With the three companies currently sponsoring RoboCup, we have already seen that their robots can achieve a wide range of behaviors, and there have been significant improvements in robot control. Some of these robots can walk very quickly—almost to the point of running.

Initially, there may be the possibility of multiple teams using the same platform. However, we must keep in mind that both hardware and software can become obsolete very quickly, so we need to remain open to multiple options. A robot that is state of the art today may no longer be so in a year or two. As a result, committing to a single standard platform could limit future progress.

For this reason, the current idea is to remain open to multiple platforms. Many teams already have excellent custom-built robots, and further improvements to these platforms should be encouraged. That said, the exact structure has not yet been decided, and these decisions will be made in consultation with the teams. It is important to give the RoboCup community the time it needs to adapt and move forward.

There have been some big advances in the humanoid adult-size league in the past couple of years. What improvements stood out to you at RoboCup2025 in Brazil?

One major change is that we have added extra robots to each team. Previously, teams played with just two robots per side, but matches are now played three versus three.

Another important improvement is the reduced presence of humans on the field. There is no longer a handler assigned to each robot. In the past, a team member had to walk behind the robots in case they fell and risked being damaged.

I have actually played in a match against the winning humanoid team. Naturally, the human team won, but it was an enjoyable and very interesting game, as the robots were surprisingly fast.


Action from the human vs humanoid match at RoboCup 2025.


Further action from the human vs humanoid match at RoboCup 2025.

What has been the general reaction from the RoboCup community to the changes? I guess it depends on which league you’re in as to how much it affects you.

Yes, it depends on which league you are part of. The reactions have been a mix of excitement and passion. Of course, everyone is keen to see improvements, and participants have always been prepared for changes to the rules and the league structures. However, there are still some open questions, and teams are waiting to see how things will evolve. Tomorrow, there will be a meeting with the President and several trustees to address questions raised by the leagues.

The overall direction of RoboCup, guided by the 2050 goal, has not changed. Each league has been extremely valuable and has contributed in different ways toward achieving that goal. RoboCup has also been immensely valuable for robotics research more broadly. Beyond being fun, the challenges involved in making robots play soccer are extraordinarily complex. The research and solutions developed within RoboCup can be applied to many other fields and applications.

About Alessandra Rossi

Alessandra is Assistant Professor at the University of Naples Federico II, Italy. Her PhD thesis was part of the Marie Sklodowska-Curie Research ETN SECURE project at the University of Hertfordshire (UK). Her research interests include Human–(Multi) Robot Interaction, social robotics, trust, XAI, multi-agent systems, and user profiling. She is Project Manager and co-supervisor of the MSCA projects PERSEO (955778), TRAIL (101072488), and SWEET (101168792), co-PI of the project ERROR (FA8655-23-1-7060), and part of several national and international projects. Alessandra is also a trustee of the RoboCup Federation and a member of the Humanoid League team Bold Hearts. She is Chair of the IEEE P3108™ “Study Design” group and a member of the “Appendix” groups, and she is Program Chair of IEEE RO-MAN 2027. She has been Robotic Challenge Chair at ICSR 2025, Special Session Chair of IEEE RO-MAN 2024, Publicity Chair of IEEE RO-MAN 2022 and 2023, and Organising Chair of the 26th RoboCup International Symposium 2023, and she is on the program committee of several international conferences on human–robot interaction and artificial intelligence.

Robots to navigate hiking trails

If you’ve ever gone hiking, you know trails can be challenging and unpredictable. A path that was clear last week might be blocked today by a fallen tree. Poor maintenance, exposed roots, loose rocks, and uneven ground further complicate the terrain, making trails difficult for a robot to navigate autonomously. After a storm, puddles can form, mud can shift, and erosion can reshape the landscape. This was the fundamental challenge in our work: how can a robot perceive, plan, and adapt in real time to safely navigate hiking trails?

Autonomous trail navigation is not just a fun robotics problem; it has potential for real-world impact. In the United States alone, there are over 193,500 miles of trails on federal lands, with many more managed by state and local agencies. Millions of people hike these trails every year.

Robots capable of navigating trails could help with:

  • Trail monitoring and maintenance
  • Environmental data collection
  • Search-and-rescue operations
  • Assisting park staff in remote or hazardous areas

Driving off-trail introduces even more uncertainty. From an environmental perspective, leaving the trail can damage vegetation, accelerate erosion, and disturb wildlife. Still, there are moments when staying strictly on the trail is unsafe or impossible. So our question became: how can a robot get from A to B while staying on the trail when possible, and intelligently leaving it when necessary for safety?

Seeing the world two ways: geometry + semantics

Our main contribution is handling uncertainty by combining two complementary ways of understanding and mapping the environment:

  • Geometric Terrain Analysis using LiDAR, which tells us about slopes, height changes, and large obstacles.
  • Semantic-based terrain detection, using the robot camera images, which tells us what the robot is looking at: trail, grass, rocks, tree trunks, roots, potholes, and so on.

Geometry is great for detecting big hazards, but it struggles with small obstacles and with terrain types that look geometrically similar (sand versus firm ground, or shallow puddles versus dry soil) yet are dangerous enough to get a robot stuck or damaged. Semantic perception can visually distinguish these cases, especially the trail the robot is meant to follow. However, camera-based systems are sensitive to lighting and visibility, making them unreliable on their own. By fusing geometry and semantics, we obtain a far more robust representation of what is safe to drive on.

We built a hiking trail dataset, labeling images into eight terrain classes, and trained a semantic segmentation model. Notably, the model became very good at recognizing established trails. These semantic labels were projected into 3D using depth and combined with the LiDAR-based geometric terrain analysis map. Using a dual k-d tree structure, we fuse everything into a single traversability map, where each point in space has a cost representing how safe it is to traverse, prioritizing trail terrain.
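
A heavily simplified sketch of the fusion idea (this is not the authors’ implementation; the class weights and the linear blend are illustrative assumptions):

```python
# Blend geometric cost from LiDAR analysis with a per-class semantic cost.
import numpy as np

SEMANTIC_COST = {"trail": 0.0, "grass": 0.4, "root": 0.7,
                 "puddle": 0.8, "rock": 0.9}  # illustrative weights

def fuse(geometric_cost: np.ndarray, classes: list[str],
         w_geom: float = 0.6, w_sem: float = 0.4) -> np.ndarray:
    semantic_cost = np.array([SEMANTIC_COST.get(c, 1.0) for c in classes])
    return np.clip(w_geom * geometric_cost + w_sem * semantic_cost, 0.0, 1.0)

cost = fuse(np.array([0.1, 0.2, 0.9]), ["trail", "grass", "rock"])
print(cost)  # [0.06 0.28 0.9] -- low cost on trail, high on the rocky point
```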

The next step is deciding where the robot should go next, which we address using a hierarchical planning approach. At the global level, instead of planning a full path in a single pass, the planner operates in a receding-horizon manner, continuously replanning as the robot moves through the environment. We developed a custom RRT* that biases its search toward areas with higher traversability probability and uses the traversability values as its cost function. This makes it effective at generating intermediate waypoints. A local planner then handles motion between waypoints using precomputed arc trajectories and collision avoidance from the traversability and terrain analysis maps.
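
The biasing idea can be illustrated with simple rejection sampling: keep a candidate point with probability proportional to its traversability. This shows only the bias; a real RRT* adds tree extension, rewiring, and collision checking.

```python
# Favor samples from highly traversable regions of the map.
import random

def sample_biased(points_with_traversability):
    while True:
        point, p_traversable = random.choice(points_with_traversability)
        if random.random() < p_traversable:
            return point  # kept with probability p_traversable

pts = [((0, 0), 0.95), ((5, 2), 0.10), ((3, 1), 0.80)]
print(sample_biased(pts))  # usually a high-traversability point
```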

In practice, this makes the robot prefer staying on the trail, but not stubborn. If the trail ahead is blocked by a hazard, such as a large rock or a steep drop, it can temporarily route through grass or another safe area around the trail and then rejoin it once conditions improve. This behavior turns out to be crucial for real trails, where obstacles are common and rarely marked in advance.

We tested our system at the West Virginia University Core Arboretum using a Clearpath Husky robot. The video below summarizes our approach, showing the robot navigating the trail alongside the geometric traversability map, the semantic map, and the combined representation that ultimately drives planning decisions.

Overall, this work shows that robots do not need perfectly paved roads to navigate effectively. With the right combination of perception and planning, they can handle winding, messy, and unstructured hiking trails.

What is next?

There is still plenty of room for improvement. Expanding the dataset to include different seasons and trail types would increase robustness. Better handling of extreme lighting and weather conditions is another important step. On the planning side, we see opportunities to further optimize how the robot balances trail adherence against efficiency.

If you’re interested in learning more, check out our paper Autonomous Hiking Trail Navigation via Semantic Segmentation and Geometric Analysis. We’ve also made our dataset and code open-source. And if you’re an undergraduate student interested in contributing, keep an eye out for summer REU opportunities at West Virginia University; we’re always excited to welcome new people into robotics.

Playing AI Catch-Up

Training Now the Chokepoint

Wall Street Journal writer Christopher Mims reports that while AI is plenty smart across a wide spectrum of tasks, too few people know how to use AI well.

Observes Mims: “There is a huge gap between what AI can already do today and what most people are actually doing with it.”

In other news and analysis on AI writing:

*Dead Heat: New Study Finds ChatGPT, Gemini, Claude Equally Powerful: A new study finds that ChatGPT, Gemini and Claude essentially deliver the same level of results when it comes to general AI use, agentic use, programming use and scientific reasoning use.

That’s gotta sting for Google, which, just a few weeks ago, lunged ahead as the AI chatbot-to-beat across a wide range of benchmarks.

Even so, picking the best AI for your own use boils down to giving all contenders a thorough run-through on how you personally use AI — and then choosing a personal favorite.

For example: For AI-generated writing, I still strongly prefer ChatGPT 4.0, which is still the most creative writer of the bunch to this day.

*ChatGPT Still Most Popular AI – By a Mile: While Google has been coming on strong, ChatGPT still dominates the AI universe.

New analysis from Windows Latest, for example, finds that ChatGPT owns 64.5% of the market, followed by Google’s Gemini at 21%.

Somewhat embarrassing for Microsoft: Its Copilot Chatbot only commands 1% of the AI market.

*Free-for-All: AI Gmail Tools for Writing, Summarizing and Email Drafts Now Gratis: AI users just got a generous present from Google for 2026: Free access to a number of powerful AI tools for Gmail:

–Help Me Write, which helps you draft everyday emails in Gmail

–Suggested Replies, which reads your email and auto-generates a reply that includes context and tone

–AI Emails Summary, which pops up offering a bulleted summary of key points extracted from an email thread

*ChatGPT for Power Users: A Curated Video Guide: Skill Leap offers an excellent rundown on advanced uses of the chatbot in this 17-minute video.

Among the picks:

–Creating different writing styles with ChatGPT for different use cases

–Scheduling daily or weekly reminders with ChatGPT

–Getting ChatGPT to ‘disappear’ certain chats for privacy reasons

*Microsoft Copilot: Rough Going for Gmail and Outlook Email Users: In an unusual move, Microsoft CEO Satya Nadella has openly admitted that Microsoft Copilot barely works with Gmail and Outlook Email.

Observes writer Matthias Bastian: “This wasn’t a one-off complaint. Over the past few months, Microsoft’s CEO has essentially become the company’s top product (Copilot) manager.”

“To close the technical gaps, Nadella is personally investing in recruiting. He calls potential hires himself and approves unusually high salaries to poach top talent from OpenAI and Google DeepMind.”

*Brain Rot?: Not Everyone Gung-Ho on AI in the Schools: AI’s push into K-12 and beyond has some educators worried that the tech will diminish critical thinking, cause developmental issues in the young and trigger a widespread cheating culture.

Observes writer Natasha Singer: “Teachers currently have few rigorous studies to guide generative AI use in schools.”

And “researchers are just beginning to follow the long-term effects of AI chatbots on teenagers and schoolchildren,” Singer adds.

*AI and the Law: What to Expect in 2026: Fourteen experts in AI law have released a free eBook serving up their predictions on how AI will reshape the law in 2026 and beyond.

Key co-authors include:

–Richard Troman, founder, Artificial Lawyer – a media outlet

–Adam Wehler, Director of e-Discovery Strategies and Litigation Technology, Smith Anderson

–Melina Efstathiou, AI Strategic Advisor, Legal Data Intelligence

*Top Five AI Writing Tools for 2026: SSBCrack News has released its list of the top five AI writing tools for the coming year.

All are AI writing pioneers. And all have appeared on many top five and top ten lists for years now.

SSB’s Take: While no tool is perfect, these five tools balance features like content generation, editing and optimization.

*AI Big Picture: Chinese AI Running Seven Months Behind U.S.: Despite releasing head-turning, extremely inexpensive alternatives to top AI, China is still about seven months behind the U.S. in AI development.

The new study, released by Epoch AI, reveals that the trend has persisted since 2023, when Chinese alternatives to ChatGPT and similar began popping up on the market.

One downside to Chinese AI: Researchers have found that some Chinese AI apps include code that can be used to forward your data to the Chinese Communist Party.

Share a Link:  Please consider sharing a link to https://RobotWritersAI.com from your blog, social media post, publication or emails. More links leading to RobotWritersAI.com helps everyone interested in AI-generated writing.

Joe Dysart is editor of RobotWritersAI.com and a tech journalist with 20+ years experience. His work has appeared in 150+ publications, including The New York Times and the Financial Times of London.


The post Playing AI Catch-Up appeared first on Robot Writers AI.
