What to look for when evaluating AI agent monitoring capabilities
Your AI agents are making hundreds — sometimes thousands — of decisions every hour. Approving transactions. Routing customers. Triggering downstream actions you don’t directly control.
Here’s the uncomfortable question most enterprise leaders can’t answer with confidence: Do you actually know what those agents are doing?
If that question gives you pause, you’re not alone. Many organizations deploy agentic AI, wire up basic dashboards, and assume they’re covered. Uptime looks fine, latency is acceptable, and nothing is on fire, so why question it?
Because unmonitored agents can quietly change behavior, stretch policy boundaries, or drift away from the intent you originally set up. And they can do it without tripping traditional alerts, which is a governance, compliance, and liability nightmare waiting to happen.
While traditional applications generally follow predictable code paths, AI agents make their own decisions, adapt to new inputs, and interact with other systems in ways that can cascade across your entire infrastructure. When something breaks (and it will), logs and metrics won’t explain why. Without monitoring and visibility into reasoning, context, and decision paths, teams react too late and repeat the same failures.
Choosing an AI agent monitoring platform is more about control than tooling. At enterprise scale, you either have deep visibility into how agents reason, decide, and act, or you accept gaps that regulators, auditors, and incident reviews won’t tolerate. The best platforms are converging around a clear standard: decision-level transparency, end-to-end traceability, and enforceable governance built for systems that think and act autonomously.
Key takeaways
- AI agent monitoring isn’t just about uptime and latency — enterprises need visibility into why agents act the way they do so they can manage governance, risk, and performance.
- The most important capabilities fall into three buckets: reliability (drift and anomaly detection), compliance (audit trails, role-based access, policy enforcement), and optimization (cost and performance insights tied to business outcomes).
- Many tools solve only a part of the problem. Point solutions can monitor traces or tokens, but they often lack the governance, lifecycle management, and cross-environment coverage enterprises need.
- Choosing the right platform means weighing tradeoffs between control and convenience, specialization and integration, and cost and capability — especially as requirements evolve and monitoring needs to cover predictive, generative, and agentic workflows together.
What is AI agent monitoring, and why does it matter?
Traditional observability tells you what happened, but AI agent monitoring builds on observability by telling you why it happened.
When you monitor a web application, behavior is predictable: user clicks button, system processes request, database returns result. The logic is deterministic, and the failure modes are well understood.
AI agents operate differently. They evaluate context, weigh options, and make decisions based on real-time inputs and environmental factors.
Because agent behavior is non-deterministic, effective monitoring depends on observability signals: reasoning traces, context, and tool-call paths. An agent might choose to escalate a customer service request to a human representative, recommend a specific product, or trigger a supply chain adjustment, all based on criteria it infers at run time rather than on explicit rules. The outcome is clear, but the reasoning isn’t.
Here’s why that gap matters more than most teams realize:
- Governance becomes even more important: Every agent decision needs to be traceable, explainable, and auditable. When a financial services agent denies a loan application or a healthcare agent recommends a treatment path, you need complete visibility into the “why” behind the decision, not just the outcome.
- Performance degradation is subtle: Traditional systems fail faster and more obviously. Agents can drift slowly. They start making slightly different choices, responding to edge cases differently, or exhibiting bias that compounds over time. Without proper monitoring, these changes go undetected until it’s too late.
- Compliance exposure multiplies: Every autonomous decision carries regulatory risk. In regulated industries, agents that operate without in-depth monitoring create compliance gaps that auditors will find (and regulators will penalize).
With so much at stake, letting agents make autonomous decisions without visibility is a gamble you can’t afford.
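One way to make each decision traceable and auditable is to log it as a structured record alongside its reasoning. The following is a minimal sketch, not any vendor's schema: the `DecisionRecord` class, its field names, and the `to_audit_line` helper are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    """One auditable agent decision. All field names are illustrative."""
    agent_id: str
    decision: str    # the outcome, e.g. "escalate_to_human"
    rationale: str   # the agent's stated reasoning for the outcome
    inputs: dict     # the context the agent saw when deciding
    tool_calls: list = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_audit_line(record: DecisionRecord) -> str:
    """Serialize a record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON line per decision keeps the log easy to append to, search, and hand to an auditor without a custom reader.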
Key features to look for in AI agent observability
Enterprise observability tools need to move beyond logging and alerting to deliver full-lifecycle visibility across AI agents, data flows, and governance controls.
But instead of getting lost in checklists as you compare solutions, focus on the capabilities that deliver the clearest business value.
Reliability features that prevent failures:
- Real-time drift detection → fewer silent failures and faster intervention
- Context-aware anomaly analysis → anomaly detection across massive volumes of data
- Adaptive alerting → lower alert fatigue and faster response times
- Cross-agent dependency mapping → visibility into how failures cascade across multi-agent systems
Compliance features that reduce risk:
- Decision-level audit trails → faster audits and defensible explanations under regulatory scrutiny
- Role-based access controls → prevention of unauthorized actions instead of after-the-fact remediation
- Automated bias and fairness monitoring → early detection of emerging risk before it becomes a compliance issue
- Policy enforcement and remediation → consistent enforcement of governance policies across teams and environments
Optimization features that improve ROI:
- Cost monitoring across multi-cloud environments → predictable spend and fewer budget surprises
- Usage-driven performance tuning → higher throughput without overprovisioning
- Resource utilization tracking → reduced waste and smarter capacity planning
- Business impact correlation → clear linkage between agent behavior, revenue, and operational outcomes
The best platforms integrate monitoring into existing enterprise workflows, security frameworks, and governance processes. Be skeptical of tools that lean too heavily on flashy promises like “self-healing agents” or vague “AI-powered root cause analysis.” These capabilities can be helpful, but they shouldn’t distract from core fundamentals like transparent traces, robust governance, and strong integration with your existing stack.
How to choose the right AI agent monitoring tool
Choosing a monitoring platform is about fit, not features. The biggest mistake enterprises make is underestimating governance.
Point solutions often work as add-ons. They observe external flows but can’t govern them. That means no versioning, limited documentation, weak quota and policy management, and no way to intervene when agents cross boundaries.
When evaluating platforms, focus on:
- Governance alignment: Built-in governance can save months of custom development and reduce regulatory risk.
- Integration depth: The most sophisticated monitoring platform is worthless if it doesn’t integrate with your existing infrastructure, security frameworks, and operational processes.
- Scalability: Proofs of concept don’t predict production reality. Plan for 10x growth. Will the platform handle expansions without major architectural changes? If not, it’s the wrong choice.
- Expertise requirements: Platforms built on custom frameworks often require specialized skills and sustained engineering effort that your team may not have.
For most enterprises, the winning combination is a platform that balances governance maturity, operational simplicity, and ecosystem integration. Tools that excel in all three areas may justify higher upfront investments thanks to a lower barrier to entry and faster time to value.
See real business outcomes with enterprise-grade AI
Monitoring enables confidence at scale: Organizations with mature observability outperform peers on the uptime, mean time to detection, compliance readiness, and cost control metrics that matter to executive leadership.
Of course, metrics only matter if they translate to business outcomes.
When you can see what your agents are doing, understand why they’re doing it, and predict how changes will ripple across systems with confidence, AI becomes an operational asset instead of a gamble.
DataRobot’s Agent Workforce Platform delivers that confidence through unified observability and governance that spans the entire AI lifecycle. It removes the operational drag that slows AI initiatives and scales with enterprise ambition.
It’s time to look beyond point solutions. See what enterprise-grade AI observability looks like in practice with DataRobot.
FAQs
How is AI agent monitoring different from traditional application monitoring?
Traditional monitoring focuses on system health signals like CPU, memory, and uptime. AI agent monitoring has to go deeper. It tracks how agents reason, which tools they call, how they interact with other agents, and whether their behavior is drifting away from business rules or policies. In other words, it explains why something happened, not just that it happened.
What features matter most when choosing an AI agent monitoring platform?
For enterprises, the must-haves fall into three groups: reliability features like drift detection, guardrails, and anomaly analysis; compliance features like tracing, role-based access, and policy enforcement; and optimization features such as cost monitoring, performance tuning insights, and links between agent behavior and business KPIs. Anything that does not support one of those outcomes is usually secondary.
Do we really need a dedicated agent monitoring tool if we already have an observability stack?
General observability tools are useful for infrastructure and application health, but they rarely capture agent reasoning paths, decision context, or policy adherence out of the box. Most organizations end up layering a dedicated AI or agent monitoring solution on top so they can see how models and agents behave, not just how servers and APIs perform.
Should we build our own monitoring framework or buy a platform?
Building can make sense if you have strong platform engineering teams and highly specialized needs, but it is a large, ongoing investment. Monitoring requirements and metrics are changing quickly as agent architectures evolve. Most enterprises get better long-term value by buying a platform that already covers predictive, generative, and agentic components, then extending it where needed.
Where does DataRobot fit among these AI agent monitoring tools?
DataRobot AI Observability is designed as a unified platform rather than a point solution. It monitors models and agents across environments, ties monitoring to governance and compliance, and supports both predictive and generative workflows. For enterprises that want one place to manage visibility, risk, and performance across their AI estate, it serves as the central foundation other tools plug into.
The post What to look for when evaluating AI agent monitoring capabilities appeared first on DataRobot.
AI agent observability: what enterprises need to know
You wouldn’t run a hospital without monitoring patients’ vitals. Yet most enterprises deploying AI agents have no real visibility into what those agents are actually doing — or why.
What began as chatbots and demos has evolved into autonomous systems embedded in core workflows: handling customer interactions, executing decisions, and orchestrating actions across complex infrastructures. The stakes have changed. The monitoring hasn’t.
Traditional tools tell you if your servers are up and your APIs are responding. They don’t tell you why your customer service agent started hallucinating responses, or why your multi-agent workflow failed three steps into a decision tree.
That visibility gap scales with every agent you deploy. When agents operate autonomously across critical business processes, guesswork isn’t a strategy.
If you can’t see reasoning, tool calls, and behavior over time, you don’t have real observability. You have infrastructure telemetry.
Deploying agents at scale requires observability that exposes behavior, decision paths, and outcomes across the entire agent workforce. Anything less breaks down fast.
Key takeaways
- AI agent observability isn’t an extension of traditional monitoring. It’s a different discipline entirely, focused on reasoning chains, tool usage, multi-agent coordination, and behavioral drift.
- Agentic systems evolve dynamically. Without deep visibility, failures stay hidden, costs creep up, and compliance risk grows.
- Evaluating platforms means looking past basic tracing and asking harder questions about governance integration, multi-cloud support, drift detection, security controls, and explainability.
- Treating observability as core infrastructure (not a debugging add-on) accelerates growth at scale, improves reliability, and makes agentic AI safe to run in production.
What is AI agent observability?
AI agent observability gives you visibility into behavior, reasoning, tool interactions, and outcomes across your agents. It shows how agents think, act, and coordinate — not just whether they run.
Traditional app monitoring looks mostly at system health and performance metrics. Agent observability opens the intelligence layer and helps teams answer questions like:
- Why did the agent choose this approach?
- What context shaped the decision?
- How did agents coordinate across a workflow?
- Where exactly did execution fall apart?
If a platform can’t answer these questions, it isn’t agent-ready.
When agents act autonomously, human teams stay accountable for outcomes. Observability is how that accountability stays grounded in facts, covering incident prevention, cost control, compliance, and behavior understanding at scale.
There’s also a distinction worth making between monitoring and observability that most teams underestimate. Monitoring tells you what happened. Observability helps you detect what should have happened but didn’t.
If an agent is supposed to trigger every time a new sales lead arrives, and that trigger silently fails, monitoring may never surface it. Observability catches the absence, flagging that an agent ran twice today when it should have run fifty times.
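Catching an absence comes down to comparing observed runs against an expected count. Here is a minimal sketch of that check, assuming you can supply an expected daily run count per agent; the function name, the `tolerance` parameter, and the data shapes are illustrative.

```python
from collections import Counter


def find_missing_runs(expected_per_day, run_log, tolerance=0.5):
    """Return agents whose run count fell below tolerance * expected.

    expected_per_day: dict of agent name -> expected runs per day
    run_log: iterable of agent names, one entry per observed run
    """
    observed = Counter(run_log)
    flagged = {}
    for agent, expected in expected_per_day.items():
        actual = observed.get(agent, 0)
        if actual < expected * tolerance:
            flagged[agent] = {"expected": expected, "actual": actual}
    return flagged
```

An agent that never ran at all still shows up here, because the loop iterates over what was expected rather than over what was logged.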
Multi-agent systems raise the bar further. Individual agents may look fine in isolation, while coordination failures, context handoffs, or resource conflicts quietly degrade results. Traditional monitoring misses all of it.
Why AI agents require different monitoring than traditional apps
Traditional monitoring assumes predictable behavior. AI agents don’t work that way. They reason probabilistically, adapt to context, and change behavior as underlying components evolve.
Here are common failure patterns that standard monitoring misses entirely:
- Execution failures show up as silent failures, not dramatic system crashes: permission errors, API rate limits, or bad parameters that slip through and cause slow, hidden performance decay that traditional alerts never catch.
- Context window overflow happens when agents continue to run, but with incomplete context. Different large language models (LLMs) have varying context limits, and when agents exceed those boundaries, they lose important information, leading to misinformed decisions that standard monitoring can’t detect.
- Agent orchestration issues grow more complex in sophisticated architectures. Traditional monitoring may see successful API calls and normal resource utilization, while missing coordination failures that compromise the entire workflow.
- Behavioral drift happens when models, templates, or training data change, causing agents to behave differently over time. Invisible to system-level metrics, it can completely alter agent performance and decision quality.
- Cost explosion occurs when agents get caught in loops of repeated actions, such as redundant API calls, excessive token usage, or inefficient tool interactions. Traditional monitoring treats this as normal system activity.
- Latency as a false signal: For traditional systems, latency is a reliable health indicator. For LLMs, it isn’t. A request might take two seconds or 60 seconds, and both outcomes can be perfectly valid. Treating latency spikes as failure signals generates noise that obscures what actually matters: behavior, decision quality, and outcome accuracy.
If your monitoring stops at infrastructure health, you’re only seeing the shadows of agent behavior, not the behavior itself.
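The cost explosion pattern above is one of the easier ones to check for directly. As a rough sketch, assuming each agent run records its tool calls as `(tool_name, arguments)` pairs with JSON-serializable arguments, you can count identical calls and flag any that repeat suspiciously often. The function name and the `max_repeats` default are illustrative.

```python
import json
from collections import Counter


def detect_tool_loops(tool_calls, max_repeats=3):
    """Flag identical tool calls that repeat more than max_repeats times
    within one agent run, a common sign of a runaway loop.

    tool_calls: list of (tool_name, arguments_dict) tuples in call order.
    """
    # Serialize arguments so that nested dicts can be counted as keys.
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in tool_calls
    )
    return [
        {"tool": name, "args": json.loads(args), "count": n}
        for (name, args), n in counts.items()
        if n > max_repeats
    ]
```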
Key features of modern agent observability platforms
The right platforms deliver outcomes enterprises actually care about:
- Security and access controls: Strong RBAC, PII detection and redaction, audit trails, and policy enforcement let agents operate in sensitive workflows without losing control or exposing the organization to regulatory risk.
- Granular cost tracking and guardrails: Fine-grained visibility into spend by agent, workflow, and team helps leaders understand where value is coming from, shut down waste early, and prevent cost overruns before they turn into budget surprises.
- Reproducibility: When something goes wrong, “we don’t know why” isn’t an acceptable answer. Replaying agent decisions gives teams a clear line of sight into what happened, why it happened, and how to fix it, whether the issue is performance, safety, or compliance.
- Multiple testing environments: Enterprises can’t afford to discover agent behavior issues in production. Full observability in pre-production environments lets teams pressure-test agents, validate changes, and catch failures before customers or regulators do.
- Unified visibility across environments: A single, consistent view across clouds, tools, and teams makes it possible to understand agent behavior end to end. Most platforms don’t deliver this without heavy customization.
- Reasoning trace capture: Seeing how agents reason — not just what they output — supports better decision review, faster debugging, and real accountability when autonomous decisions impact the business.
- Multi-agent workflow visualization: Visualizing how agents hand off context, delegate tasks, and coordinate work exposes bottlenecks and failure points that directly affect reliability, customer experience, and operational efficiency.
- Drift detection: Detecting when behavior slowly moves away from expectations lets teams intervene early, protecting decision quality and business outcomes as systems evolve.
- Context window monitoring: Tracking context usage helps teams spot when agents are operating with incomplete information, preventing silent degradation that’s invisible to traditional performance metrics.
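Drift detection can start with something as simple as comparing the mix of decisions an agent makes today against a baseline period. The sketch below uses the population stability index (PSI), one common choice among several, and is not tied to any particular platform. It assumes both inputs are non-empty lists of categorical decision labels.

```python
import math
from collections import Counter


def population_stability_index(baseline, current, epsilon=1e-6):
    """PSI between two non-empty lists of categorical decisions."""
    categories = set(baseline) | set(current)
    base_counts, cur_counts = Counter(baseline), Counter(current)
    psi = 0.0
    for cat in categories:
        # Floor each share at epsilon so log() is defined for unseen categories.
        expected = max(base_counts[cat] / len(baseline), epsilon)
        actual = max(cur_counts[cat] / len(current), epsilon)
        psi += (actual - expected) * math.log(actual / expected)
    return psi
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Treat those cutoffs as starting points to tune against your own agents, not as fixed standards.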
How to evaluate an AI agent observability platform
Choosing the right platform goes beyond surface-level monitoring. Your evaluation process should prioritize:
Integration with existing infrastructure
Most enterprises already run across multiple clouds, on-prem systems, and custom orchestration layers. An observability platform has to fit into that reality, integrating with frameworks like LangChain, CrewAI, and custom agent orchestration layers without requiring significant architectural changes.
Cloud flexibility matters just as much. Observability should behave consistently across AWS, Azure, GCP, and hybrid or on-prem environments. If visibility changes depending on where agents run, blind spots creep in fast.
Look for OpenTelemetry (OTel) compatibility and data export capabilities. Vendor lock-in at the observability layer is especially painful because historical traces and behavioral baselines carry long-term operational value.
Cost and scalability considerations
Pricing models vary widely and can become expensive fast as agent usage scales. Review structures carefully, especially for high-volume workflows that generate extensive trace data.
Many platforms charge based on data ingestion, storage, or API calls, costs that aren’t always obvious upfront. Validate pricing against realistic scaling scenarios, including data retention costs for traces, logs, and reasoning histories.
For multi-cloud deployments, keep ingress and egress costs in mind. Data movement between regions or providers can create unexpected expenses that compound quickly at scale.
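To validate pricing against a realistic scaling scenario, a back-of-the-envelope estimate is often enough. This sketch assumes a simple model with a per-GB ingestion price and a per-GB monthly storage price. The parameter names and the pricing structure are assumptions, so substitute your vendor's actual terms.

```python
def monthly_trace_cost(
    runs_per_day,
    kb_per_run,
    ingest_price_per_gb,
    storage_price_per_gb_month,
    retention_months,
    days_per_month=30,
):
    """Estimate steady-state monthly cost of ingesting and retaining traces.

    Prices are placeholders; real pricing models may add API-call or
    egress charges that this sketch ignores.
    """
    gb_per_month = runs_per_day * days_per_month * kb_per_run / 1_000_000
    ingest = gb_per_month * ingest_price_per_gb
    # At steady state, retention_months of data is held in storage at once.
    storage = gb_per_month * retention_months * storage_price_per_gb_month
    return {
        "gb_per_month": gb_per_month,
        "ingest": ingest,
        "storage": storage,
        "total": ingest + storage,
    }
```

At 100,000 runs per day and 50 KB of trace data per run, this works out to 150 GB per month. Rerunning the estimate at 10x the volume shows quickly whether a pricing model still holds up at scale.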
Security, compliance, and governance fit
Once agents touch sensitive data or regulated workflows, observability becomes part of the organization’s risk posture. Platforms need to support enterprise-grade security without relying on bolt-ons or manual processes.
That starts with strong access controls, encryption, and auditability. AI leaders should also look for real-time PII detection and redaction, policy enforcement tied to agent behavior, and clear audit trails that explain how decisions were made and who had access.
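To make the redaction step concrete, here is a deliberately small sketch that masks a few PII types before trace text is stored. The three regular expressions are illustrative and US-centric; real PII detection needs far broader coverage than pattern matching alone.

```python
import re

# Illustrative patterns only; production PII detection needs broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text):
    """Replace each match with a [LABEL] placeholder before the text is logged."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Redacting before the trace is written, rather than at query time, means the raw values never reach the observability store in the first place.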
Alignment with relevant compliance frameworks is also a priority here, including SOC 2, HIPAA, GDPR, and industry-specific requirements that govern your organization. The platform should provide governance integration that supports audit processes and regulatory reporting.
Support for bring-your-own LLM deployments, private infrastructure, and air-gapped environments is also a differentiator. Enterprises running sensitive workloads need observability that works where their agents run — not just where vendors prefer them to run.
Dashboards, alerts, and user experience
Different stakeholders need different views of agent behavior. Builders need deep traces and reasoning paths. Operators need clear signals when workflows degrade or costs spike. Leaders need summaries that explain performance and risk in business terms.
Look for role-based views that surface the right level of detail without overwhelming each audience. Executives shouldn’t have to wade through logs to understand whether agents are behaving safely. Teams on the ground need to drill down fast when something breaks.
The platform should automatically flag drift, safety issues, or unexpected behavior, and route those alerts directly into collaboration tools like Slack or Microsoft Teams, so teams can respond without living in a dashboard.
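If you are wiring this up yourself, alert routing can be a threshold check that produces a chat message. The sketch below only builds the message body; it assumes a Slack incoming webhook, which accepts a JSON object with a `text` field, and the function name and message wording are illustrative.

```python
import json


def build_alert(agent, metric, value, threshold):
    """Return a Slack-style webhook payload, or None if within threshold."""
    if value <= threshold:
        return None
    text = (
        f":warning: {agent}: {metric}={value:.2f} "
        f"exceeded threshold {threshold:.2f}"
    )
    return json.dumps({"text": text})
```

Sending the payload is a single HTTP POST to a webhook URL that you provision. It is left out here so the sketch runs without network access.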
Best practices for implementing agent observability
Getting observability right isn’t a one-time setup. It requires ongoing attention as your agents and the systems they operate in continue to evolve.
Establish clear metrics and KPIs
System performance is important, but agent observability only delivers value when metrics align with business outcomes. Define KPIs that reflect decision quality, business impact, and operational efficiency.
That means looking at how reliably agents achieve their goals, putting guardrails in place to prevent harmful behavior, and monitoring cost-per-action to keep execution efficient.
Metrics should apply to both individual agents and multi-agent workflows. Complex workflows require coordination metrics that individual-agent KPIs don’t capture.
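As a starting point, two of these KPIs can be computed from nothing more than a per-run record of outcome and cost. This is a minimal sketch; the `succeeded` and `cost_usd` keys are hypothetical, and it measures cost per successful run as one possible reading of cost-per-action.

```python
def agent_kpis(runs):
    """Summarize runs into goal success rate and cost per successful run.

    runs: list of dicts with hypothetical keys "succeeded" (bool) and
    "cost_usd" (float).
    """
    total = len(runs)
    successes = sum(1 for r in runs if r["succeeded"])
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "success_rate": successes / total if total else 0.0,
        # Failed runs still cost money, so their spend counts against successes.
        "cost_per_success": total_cost / successes if successes else None,
    }
```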
Leverage continuous evaluation and feedback loops
Set up automated evaluation pipelines that catch drift or unexpected behaviors before they affect real business operations. Waiting until something breaks is not a detection strategy.
For sensitive, high-impact tasks, automated evaluation isn’t enough. Human review is still essential where the stakes are too high to rely solely on automated signals.
Run A/B comparisons as agents are updated to validate that changes actually improve performance. This matters especially as agents evolve through model updates or configuration changes.
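When the metric being compared is a success rate, a two-proportion z-test is one simple way to check that a difference between agent versions is more than noise. The sketch below assumes independent runs and sample sizes large enough for the normal approximation, and it will divide by zero if both versions succeed on every run or on none.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference in success rates (B minus A)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def p_value_two_sided(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, if version A succeeds on 800 of 1,000 runs and version B on 850 of 1,000, call `two_proportion_z(800, 1000, 850, 1000)` and pass the result to `p_value_two_sided`.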
The foundation of scalable, trustworthy agentic AI
Observability connects everything — platform evaluation, multi-agent monitoring, governance, security, and continuous improvement — into one operational framework. Without it, scaling agents means scaling risk.
When teams can see what agents are doing and why, autonomy becomes something to expand, not fear.
Ready to build a stronger foundation? Download the enterprise guide to agentic AI.
FAQs
How is agent observability different from traditional AI or application monitoring?
Traditional monitoring focuses on infrastructure health — CPU, memory, uptime, error rates. Agent observability goes deeper, capturing reasoning paths, tool-call chains, context usage, and multi-step workflows. That visibility explains why agents behave the way they do, not just whether systems stay up.
What metrics matter most when evaluating multi-agent system performance?
Teams need to track both technical health and decision quality. That includes tool-call success rates, reasoning accuracy, latency across workflows, cost per decision, and behavioral drift over time. For multi-agent systems, coordination signals like message passing and task delegation matter just as much.
How do I know which observability platform is best for my organization’s agent architecture?
The right platform supports multi-agent workflows, exposes reasoning paths, integrates with orchestration layers, and meets enterprise security standards. Tools that stop at tracing or token counts usually fall short in regulated or large-scale deployments. DataRobot unifies observability, governance, and lifecycle oversight in one platform, making it purpose-built for enterprise scale.
What observability capabilities are essential for maintaining compliance and safety in enterprise agent deployments?
Prioritize full audit trails, RBAC, PII protection, explainable decisions, drift detection, and automated guardrails. A unified platform simplifies this by handling observability and governance together, rather than forcing teams to stitch controls across tools.
The post AI agent observability: what enterprises need to know appeared first on DataRobot.
How to design and run an agent in rehearsal – before building it
Most AI agents fail because of a gap between design intent and production reality. Developers often spend days building only to find that escalation logic or tool calls fail in the wild, forcing a total restart. DataRobot Agent Assist closes this gap. It is a natural language CLI tool that lets you design, simulate, and validate your agent’s behavior in “rehearsal mode” before you write any implementation code. This blog will show you how to execute the full agent lifecycle from logic design to deployment within a single terminal session, saving you extra steps, rework, and time.
How to quickly develop and ship an agent from a CLI
DataRobot’s Agent Assist is a CLI tool built for designing, building, simulating, and shipping production AI agents. You run it from your terminal, describe in natural language what you want to build, and it guides the full journey from idea to deployed agent, without switching contexts, tools, or environments.
It works standalone and integrates with the DataRobot Agent Workforce Platform for deployment, governance, and monitoring. Whether you’re a solo developer prototyping a new agent or an enterprise team shipping to production, the workflow is the same: design, simulate, build, deploy.
Users are going from idea to a running agent quickly, reducing the scaffolding and setup time from days to minutes.
Why not just use a general-purpose coding agent?
General AI coding agents are built for breadth. That breadth is their strength, but it is exactly why they fall short for production AI agents.
Agent Assist was built for one thing: AI agents. That focus shapes every part of the tool. The design conversation, the spec format, the rehearsal system, the scaffolding, and the deployment are all purpose-built for how agents actually work. It understands tool definitions natively. It knows what a production-grade agent needs structurally before you tell it. It can simulate behavior because it was designed to think about agents end to end.

The agent building journey: from conversation to production
Step 1: Start designing your agent with a conversation
You open your terminal and run dr assist. No project setup, no config files, no templates to fill out. You’ll immediately get a prompt asking what you want to build.
Agent Assist asks follow-up questions, not only technical ones, but business ones too. What systems does it need access to? What does a good escalation look like versus an unnecessary one? How should it handle a frustrated customer differently from someone with a simple question?
The guided questions build a complete picture of the agent’s logic rather than just collecting a list of requirements. You can keep refining the agent’s logic and behavior in the same conversation. Add a capability, change the escalation rules, adjust the tone. The context carries forward and everything updates automatically.
For developers who want fine-grained control, Agent Assist also provides configuration options for model selection, tool definitions, authentication setup, and integration configuration, all generated directly from the design conversation.
When the picture is complete, Agent Assist generates a full specification: system prompt, model selection, tool definitions, authentication setup, and integration configuration. Something a developer can build from and a business stakeholder can actually review before any code exists. From there, that spec becomes the input to the next step: running your agent in rehearsal mode, before a single line of implementation code is written.
Step 2: Watch your agent run before you build it
This is where Agent Assist does something no other tool does.
Before writing any implementation, it runs your agent in rehearsal mode. You describe a scenario and it executes tool calls against your actual requirements, showing you exactly how the agent would behave. You see every tool that fires, every API call that gets made, every decision the agent takes.
If the escalation logic is wrong, you catch it here. If a tool returns data in an unexpected format, you see it now instead of in production. You fix it in the conversation and run it again.
You validate the logic, the integrations, and the business rules all at once, and only move to code when the behavior is exactly what you want.
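To illustrate the general idea of rehearsing behavior before implementation, here is a generic sketch. It is not Agent Assist's implementation or API. It swaps real tools for stubs that return canned responses and record every call, so you can inspect which tools would fire for a given scenario. Every name in it (`RecordingTool`, `rehearse`, `support_policy`) is hypothetical.

```python
class RecordingTool:
    """Stand-in for a real tool: returns a canned response and logs each call."""

    def __init__(self, name, response, trace):
        self.name = name
        self.response = response
        self.trace = trace

    def __call__(self, **kwargs):
        self.trace.append({"tool": self.name, "args": kwargs})
        return self.response


def rehearse(policy, scenario, canned_responses):
    """Run a policy against stub tools and return (decision, trace)."""
    trace = []
    tools = {
        name: RecordingTool(name, response, trace)
        for name, response in canned_responses.items()
    }
    decision = policy(scenario, tools)
    return decision, trace


def support_policy(scenario, tools):
    """Hypothetical rule: escalate angry customers whose order is delayed."""
    order = tools["lookup_order"](customer_id=scenario["customer_id"])
    if scenario["sentiment"] == "angry" and order["status"] == "delayed":
        tools["create_ticket"](priority="high")
        return "escalate"
    return "self_serve"
```

Calling `rehearse(support_policy, scenario, canned_responses)` with different scenarios and canned responses shows both the final decision and the ordered list of tool calls, which is where a wrong escalation rule becomes visible.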
Step 3: The code that comes out is already production-ready
When you move to code generation, Agent Assist does not hand you a starting point. It hands you a foundation.
The agent you designed and simulated comes scaffolded with everything it needs to run in production, including OAuth authentication (no shared API keys), modular MCP server components, deployment configuration, monitoring, and testing frameworks. Out of the box, Agent Assist handles infrastructure that normally takes days to piece together.
The code is clean, documented, and follows standard patterns. You can take it and continue building in your preferred environment. But from the very first file, it is something you could show to a security team or hand off to ops without a disclaimer.
Step 4: Deploy from the same terminal you built in
When you are ready to ship, you stay in the same workflow. Agent Assist knows your environment, the models available to you, and what a valid deployment requires. It validates the configuration before touching anything.
One command. Any environment: on-prem, edge, cloud, or hybrid. Validated against your target environment’s security and model constraints. The same agent that helped you design and simulate also knows how to ship it.
What teams are saying about Agent Assist
“The hardest part of AI agent development is requirement definition, specifically bridging the gap between technical teams and domain experts. Agent Assist solves this interactively. A domain user can input a rough idea, and the tool actively guides them to flesh out the missing details. Because domain experts can immediately test and validate the outputs themselves, Agent Assist dramatically shortens the time from requirement scoping to actual agent implementation.”
The road ahead for Agent Assist
AI agents are becoming core business infrastructure, not experiments, and the tooling around them needs to catch up. The next phase of Agent Assist goes deeper on the parts that matter most once agents are running in production: richer tracing and evaluation so you can understand what your agent is actually doing, local experimentation so you can test changes without touching a live environment, and tighter integration with the broader ecosystem of tools your agents work with. The goal stays the same: less time debugging, more time shipping.
The hard part was never writing the code. It was everything around it: knowing what to build, validating it before it touched production, and trusting that what shipped would keep working. Agent Assist is built around that reality, and that is the direction it will keep moving in.
Get started with Agent Assist in 3 steps
Ready to ship your first production agent? Here’s all you need:
1. Install the toolchain:
brew install datarobot-oss/taps/dr-cli uv pulumi/tap/pulumi go-task node git python
2. Install Agent Assist:
dr plugin install assist
3. Launch:
dr assist
Full documentation, examples, and advanced configuration are in the Agent Assist documentation.
The post How to design and run an agent in rehearsal – before building it appeared first on DataRobot.
Back to school: robots learn from factory workers
By Anthony King
What if training a robot to handle dirty, dangerous work on the factory floor was as simple as showing it how? Czech startup RoboTwin is doing exactly that, helping factory workers teach robots new skills by demonstration.
Instead of writing complex code, workers perform the job once and RoboTwin’s technology turns those movements into a robot programme – opening the door to automation for smaller manufacturers.
Founded in Prague in 2021, RoboTwin builds handheld devices and no-code software that capture human movements and translate them into instructions for industrial robots. The aim is to make automation faster, simpler and more accessible to manufacturers that do not have specialist robotics programmers.
“The robot basically copies the human demonstration,” said Megi Mejdrechová, RoboTwin’s co-founder and chief technology officer. “People with no coding skills can transfer their know-how and experience to robots.”
Mejdrechová, a mechanical engineer trained at the Czech Technical University in Prague, developed the core technology behind RoboTwin during her work in robotics research and industry. Her experience in robot control using AI and computer vision inspired her to create something practical for European manufacturers.
“Czech engineering is quite traditional and focused on scientific papers,” said Mejdrechová. “Visits to Singapore and Canada and other work experiences led me to focus on making a product that people could use.”
Getting started
In 2021, Mejdrechová entered a jump‑starter programme and won first prize in the manufacturing category. “We saw then that there was potential for the technology,” she said.
This encouraged her to start RoboTwin with colleagues Ladislav Dvořák and David Polák, who shared her enthusiasm for human‑robot partnerships. Mejdrechová received backing from Women TechEU, an EU scheme supporting women founders of deep‑tech startups.
The RoboTwin team shared their results on the Horizon Results Platform, an online showcase for EU‑funded innovations, which led to an invitation to the EU’s Empowering Start‑ups and SMEs initiative.
This helped fund their trip to Hannover Messe 2025, a major global manufacturing trade fair, and opened doors to new business contacts and deals.
Through a mix of public and private investment, RoboTwin has secured funding to refine its technology and expand to manufacturers in Central Europe, the Netherlands, Mexico and Canada.
In 2025, Mejdrechová was named in Forbes Czechia’s 30 Under 30 list for her work in making the training of robots accessible to more manufacturers.
Schooling robots
At the heart of RoboTwin’s system is a handheld device equipped with sensors. When a worker performs a task, for example spray painting a metal component, the system records the movement and converts it into a robot programme that can be reused in production.
Instead of requiring a specialist engineer to manually code every movement, the system captures the worker’s natural technique and translates it into precise instructions a robot can follow.
“We started with jobs that are ugly, dirty and unhealthy for workers to do manually,” said Mejdrechová.
Thanks to the no‑coding system, the process can be completed in a few steps and typically takes about a minute. For factories producing small batches or frequently changing products, this speed can make automation far more practical than traditional robot programming.
Making automation easy for all
Robotics in manufacturing is not new. The automotive industry already leads the way with about 23 000 new robots added to production lines in 2024. But while large companies can invest heavily, automation remains challenging and expensive for many SMEs.
This is where RoboTwin lends a hand. It has assisted firms in the surface‑treatment industry – companies that powder coat, paint or polish metal or plastic parts for car factories.
“Even if the batch of products you are producing is small, with our approach you can create a robot programme fast and easily,” said Mejdrechová.
For example, RoboTwin has assisted RobPainting, a Dutch company that robotises painting for SMEs to improve quality, reduce costs and minimise rework.
“With our device we can teach the robot precise trajectories that are needed for a product and about its surroundings,” said David Vobr, a robotics specialist at RoboTwin who often assists customers.
Dangerous jobs
RoboTwin’s system can work with a wide range of industrial robots, including collaborative robots designed to operate safely alongside humans.
“We can have manipulators or painting robots and also collaborative robots, which can work alongside humans because they have sensors that tell them when to stop moving if someone could get hurt,” said Vobr.
RoboTwin initially focused on surface treatment in manufacturing, where tasks such as spray painting require workers to wear protective clothing and perform repetitive movements.
“These jobs are difficult to automate because there is often a lot of hidden movement involved,” said Mejdrechová, referring to small adjustments and gestures that workers make instinctively.
The sector also faces labour shortages.
“People are often not happy doing these things and there is a lack of workers willing to take these jobs. So there is a high demand for automation.”
Customers report that many robot programmes can now be created without shutting down a production line.
RoboTwin has already worked with a number of companies, including Surfin Technology, a Czech company specialising in robotic coating solutions, and Innovative Finishing Solutions in Canada, which brings its technology to North American customers.
Scaling up
EU support for RoboTwin is ongoing. A €2.3 million grant from the European Innovation Council secured in 2025 will help accelerate product development and market expansion.
The funding will support the next generation of RoboTwin’s technology. Instead of relying solely on manual demonstrations, the system will increasingly use stored experience and data to generate robot programmes automatically based on the shape of an object.
The company says this could make automation viable for many manufacturing tasks that were previously too complex or costly to automate.
For Europe, technologies like RoboTwin could play an important role in strengthening digital sovereignty and smart industrial innovation. They can help smaller manufacturers adopt advanced robotics without needing specialised programming expertise.
As factories become more flexible and data-driven, the ability to quickly teach robots new tasks may prove increasingly valuable.
Mejdrechová believes this shift will help bring automation within reach of a much wider range of companies.
“Our goal is to make robot training something that workers can do themselves,” she said. “If we succeed, automation will no longer be limited to large factories with specialised engineers. It will become a tool that any manufacturer can use.”
Research in this article was funded by the EU’s Horizon Programme. The views of the interviewees don’t necessarily reflect those of the European Commission.
This article was originally published in Horizon, the EU Research and Innovation magazine.
SAP NLP Search Solutions: Adding Intelligent Search to Your SAP Environment
The Data Access Problem Most SAP Shops Have Stopped Talking About
The data is in SAP. Everyone knows it is there. But getting to it requires knowing which transaction code to use, which fields to filter, and often which table names to query — knowledge that lives in a small group of power users and SAP consultants, not in the operations team, the supply chain planner, or the plant manager who actually needs it.
The result is a predictable pattern: analysts spend hours pulling reports. Decisions wait for data. The people closest to the operational problem rely on spreadsheet exports that are already 24 hours stale by the time they reach the right desk.
SAP NLP search solves this at the access layer. It lets users ask questions in plain language and get answers drawn from live SAP data — without transaction codes, without filter configurations, and without a power user in the loop.
USM Business Systems is a CMMi Level 3, Oracle Gold Partner Artificial Intelligence (AI) and IT services firm based in Ashburn, VA. We design and deploy SAP NLP search solutions for manufacturers, pharma companies, logistics operators, and other enterprises where the gap between SAP data and operational decision-making is costing time and accuracy.
What SAP NLP Search Actually Is
SAP NLP search is a natural language interface layered on top of SAP data. A user types or speaks a question — ‘Which suppliers are running more than 5 days late on open POs this week?’ or ‘What is the current inventory for material X across all plants?’ — and the system retrieves the relevant SAP data and returns a plain-language answer or a structured result.
The technical architecture underneath involves three components working together:
- A retrieval layer that connects to SAP Datasphere views, HANA models, or structured data extracts and fetches the records relevant to the query
- An LLM (large language model) that interprets the natural language question, reasons about the retrieved data, and formulates a response the user can act on
- A user interface layer, typically embedded in SAP Fiori or a standalone web application, that surfaces the interaction in a format the team already uses
This architecture is known as retrieval-augmented generation (RAG). It is the standard pattern for enterprise AI search because it grounds the LLM’s responses in your actual data rather than its training knowledge — which means the answers are accurate to your environment, not generic.
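The three components above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Record` fields, the sample rows, and the `fake_llm` stand-in are all assumptions made up for the example, and the late-PO filter is hard-coded where a real system would derive it from the user's question.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row from a hypothetical SAP extract (not a real SAP schema)."""
    supplier: str
    po_number: str
    days_late: int


# Stand-in for a Datasphere view or HANA model; the rows are made up.
OPEN_POS = [
    Record("Acme Metals", "4500001", 7),
    Record("Borealis Plastics", "4500002", 2),
    Record("Cobalt Fasteners", "4500003", 9),
]


def retrieve_late_pos(records, min_days_late):
    """Retrieval layer: fetch only the records relevant to the question."""
    return [r for r in records if r.days_late > min_days_late]


def build_prompt(question, records):
    """Ground the model by placing the retrieved rows in the prompt."""
    context = "\n".join(
        f"- {r.supplier}, PO {r.po_number}, {r.days_late} days late"
        for r in records
    )
    return (
        "Answer using only the SAP records below.\n"
        f"Records:\n{context}\n"
        f"Question: {question}"
    )


def fake_llm(prompt):
    """Placeholder for a real LLM call; it only echoes the record lines."""
    lines = [ln for ln in prompt.splitlines() if ln.startswith("- ")]
    return f"{len(lines)} supplier(s) match:\n" + "\n".join(lines)


question = "Which suppliers are running more than 5 days late on open POs?"
hits = retrieve_late_pos(OPEN_POS, min_days_late=5)
answer = fake_llm(build_prompt(question, hits))
print(answer)
```

The point of the pattern is visible in `build_prompt`: the model only ever sees rows the retrieval layer selected, so its answer is constrained to data from your environment.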
Where SAP NLP Search Delivers Measurable Value
- Supply Chain and Procurement
Supply chain teams field constant questions about supplier performance, open purchase order status, inventory positions, and demand deviations. In a typical SAP environment without NLP search, each of these questions requires a different transaction, a different filter configuration, and often a trip to the analyst team.
With NLP search on SAP Ariba and S/4HANA data, a supply chain planner asks the question directly and gets the answer in under 30 seconds. Forrester research found that enterprises deploying AI-assisted data access in supply chain operations reduced average data retrieval time by 68% within 90 days of deployment.
- Manufacturing Operations
Plant managers and production supervisors need fast access to quality data, work order status, equipment maintenance history, and production schedule adherence. In SAP PP and SAP PM, this data exists but requires navigation through multiple transaction codes.
NLP search allows a plant manager to ask ‘What is the current first-pass yield for line 3 this week compared to last week?’ and get an answer pulled from SAP QM data — in the moment, on a tablet on the shop floor. The decision that used to wait for an end-of-day report happens in real time.
- Finance and Compliance
Finance teams use SAP NLP search to answer variance questions, retrieve specific transaction histories, and surface exceptions without constructing custom reports. Compliance teams in regulated environments use it to pull audit-relevant data on demand, a capability that previously required either an SAP power user or a scheduled report.
- Procurement and Sourcing
Buyers and category managers use NLP search to surface contract terms, pricing history, and supplier qualification status from SAP Ariba without navigating the full Ariba interface. A buyer preparing for a supplier negotiation asks what the last five purchase prices were for a given material category and gets the answer directly from SAP contract and PO data.
How does NLP search on SAP handle questions the system cannot answer?
A well-designed SAP NLP search system will indicate when a query falls outside its data coverage rather than generating a fabricated answer. This is controlled by the retrieval layer — if the relevant data is not in the configured Datasphere view or HANA model, the system returns a ‘data not available’ response. Configuration of the retrieval layer’s scope is a key design decision during deployment.
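One way to picture this behavior is a scope check that runs before the model is called. The sketch below is an assumption-heavy illustration: the domain names, the keyword matching, and `stub_generate` are hypothetical, and a production system would decide coverage from the configured views rather than from keywords.

```python
import re

# Hypothetical scope: the data domains this deployment's retrieval layer covers.
CONFIGURED_DOMAINS = {
    "inventory": {"inventory", "stock"},
    "purchase_orders": {"po", "pos", "supplier", "suppliers"},
}

NOT_AVAILABLE = "Data not available for this question."


def match_domain(question):
    """Return the first configured domain whose keywords appear in the question."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    for domain, keywords in CONFIGURED_DOMAINS.items():
        if tokens & keywords:
            return domain
    return None


def answer(question, generate):
    """Refuse out-of-scope questions before any model call is made."""
    domain = match_domain(question)
    if domain is None:
        # No covered data, so the model never gets the chance to fabricate.
        return NOT_AVAILABLE
    return generate(question, domain)


def stub_generate(question, domain):
    """Placeholder for retrieval plus an LLM call against the matched domain."""
    return f"[answer grounded in the '{domain}' view]"


print(answer("What is the current inventory for material X?", stub_generate))
print(answer("What was headcount in the Berlin office last year?", stub_generate))
```

The design choice worth noting is the ordering: the refusal happens in the retrieval path, not in the prompt, so it does not depend on the model following instructions.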
Can SAP NLP search be used by non-technical users without SAP training?
Yes — that is the primary value proposition. Users who have never navigated an SAP transaction code can access operational data through plain language questions. The system requires user management and access controls, but the operational interface requires no SAP knowledge. Teams report adoption rates of 80%+ within 30 days when the deployment covers data that users actively need.
What an SAP NLP Search Deployment Involves
- Phase 1: Data Domain Scoping (Weeks 1-2)
Define which SAP data the search system will cover. This is not ‘all of SAP’ — it is a specific set of data domains aligned to the team or use case being served first. Supply chain planner access to procurement and inventory data is a typical first domain. Finance team access to transaction history and variance data is another common starting point.
- Phase 2: Data Readiness (Weeks 2-4)
Build or validate the Datasphere views or HANA models that the retrieval layer will query. This phase surfaces master data quality issues that need resolution before the NLP layer can produce reliable answers. Budget 2-4 weeks depending on the cleanliness of the target data domain.
- Phase 3: Retrieval Layer Build (Weeks 4-6)
Configure the retrieval system that connects user queries to the relevant SAP data. This includes the embedding model that converts queries and data into vectors that can be compared for similarity, the vector search or structured retrieval logic, and the data access controls that ensure users only see data they are authorized to access.
- Phase 4: LLM Integration and Response Configuration (Weeks 6-8)
Connect the retrieval layer to the LLM, configure the response format, and build the prompt structure that guides the model to produce useful, accurate answers rather than general responses. Test on 50-100 representative queries across the target data domain. Tune accuracy.
- Phase 5: UI Integration and Rollout (Weeks 8-10)
Deploy the interface — typically a Fiori tile, a Teams integration, or a standalone web application — and roll out to the target user group. Collect feedback on query coverage gaps and expand the data domain in the next iteration.
A first-domain deployment typically reaches productive use in 10-12 weeks. Enterprises that have invested in SAP Datasphere can move faster because the data layer is already structured.
What Separates Good SAP NLP Search From Poor Implementations?
- Scoped retrieval, not open-ended LLM access. The model must be grounded in your SAP data rather than relying on its training knowledge. RAG architecture is the standard. Implementations without a proper retrieval layer are prone to returning hallucinated data.
- SAP data structure knowledge. The engineers building the retrieval layer need to understand SAP table relationships, master data objects, and SAP Datasphere modeling, not just LLM APIs. Both skill sets are required.
- Access control from the start. SAP data carries access restrictions for good reasons. An NLP search system that allows any user to query any data field is a governance problem. Role-based data access needs to be designed into the retrieval layer from the beginning.
- Iteration planning. No first deployment covers every query the users will try. The difference between a successful deployment and an abandoned one is whether the team has a process for expanding data coverage based on user feedback.
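The access-control point above can be made concrete with a small sketch of role-based filtering inside the retrieval layer. Everything named here is hypothetical: the `Row` fields, the role names, the plant codes, and the sample rows are invented for illustration, and a real deployment would derive scopes from existing SAP authorizations rather than a hard-coded map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    domain: str  # illustrative data domain label, e.g. "procurement"
    plant: str   # illustrative plant code
    text: str


# Hypothetical role-to-scope map, hard-coded only for the example.
ROLE_SCOPES = {
    "buyer": {"domains": {"procurement"}, "plants": {"P100", "P200"}},
    "plant_manager_p100": {"domains": {"production", "quality"}, "plants": {"P100"}},
}

ROWS = [
    Row("procurement", "P100", "PO 4500001: last price 12.40 EUR"),
    Row("procurement", "P200", "PO 4500002: last price 11.95 EUR"),
    Row("quality", "P100", "Line 3 first-pass yield: 96.1%"),
    Row("quality", "P200", "Line 1 first-pass yield: 93.4%"),
]


def authorized_rows(role, rows):
    """Filter candidate rows by role before they can reach the prompt."""
    scope = ROLE_SCOPES.get(role)
    if scope is None:
        return []  # unknown roles see nothing (deny by default)
    return [
        r for r in rows
        if r.domain in scope["domains"] and r.plant in scope["plants"]
    ]


for role in ("buyer", "plant_manager_p100", "intern"):
    visible = authorized_rows(role, ROWS)
    print(role, "->", [r.text for r in visible])
```

Filtering at retrieval time, rather than asking the model to withhold data, means restricted rows are never part of the prompt in the first place.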
Why USM Business Systems?
USM Business Systems is a CMMi Level 3, Oracle Gold Partner AI and IT services firm headquartered in Ashburn, VA. With 1,000+ engineers, 2,000+ delivered applications, and 27 years of enterprise delivery experience, USM specializes in AI implementation for supply chain, pharma, manufacturing, and SAP environments. Our SAP AI practice places specialized engineers inside enterprise programs within days — on contract, as dedicated delivery pods, or on a project basis.
Ready to put SAP AI into production? Book a 30-minute scoping call with our SAP AI team.
FAQ
- Does SAP NLP search require SAP Datasphere, or can it work with HANA directly?
Both work. SAP Datasphere is preferred for new deployments because it provides a governed, semantically structured data layer that is well-suited to retrieval-augmented generation. HANA views and OData APIs can serve as the retrieval source for organizations that have not yet adopted Datasphere, though more custom engineering is required.
- Which LLM works best for SAP NLP search?
The answer depends on your governance requirements. Azure OpenAI (GPT-4) is the most common choice for enterprises with existing Microsoft agreements and data residency requirements. Anthropic Claude and AWS Bedrock models are increasingly common in regulated industries that require stronger content controls. The LLM selection is less important than the retrieval layer architecture.
- How is accuracy measured for SAP NLP search?
The primary accuracy metric is the rate at which the system returns a correct answer to queries tested against known SAP data. A second metric is the rate of ‘I cannot answer this’ responses versus hallucinated answers — the former is acceptable; the latter is not. Measure both during the testing phase and set minimum thresholds before production rollout.
- Can SAP NLP search write data back to SAP, or is it read-only?
Most initial deployments are read-only: the system retrieves and presents data but does not modify SAP records. Write-back capability, where the system can initiate an SAP workflow or update a field based on a user instruction, is the next level and requires agentic architecture rather than pure NLP search.
- What user adoption approach works best for SAP NLP search?
Start with the team that has the most acute data access pain and the most frequent need to query SAP. Supply chain planners, procurement buyers, and plant managers are typically the highest-value early adopters. Get that team productive, collect their feedback on query gaps, and use their results as the business case for expanding to the next team.
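The accuracy measurement described in the FAQ above can be computed from a labeled test run. The sketch below assumes a reviewer has already compared each response with known SAP data and labeled it "correct", "abstained", or "hallucinated"; for simplicity any wrong answer counts as hallucinated. The sample counts and the threshold values are illustrative assumptions, not recommendations.

```python
from collections import Counter


def score(labels):
    """Turn per-query review labels into the rates discussed above."""
    if not labels:
        raise ValueError("need at least one labeled query")
    counts = Counter(labels)
    total = len(labels)
    return {
        "correct_rate": counts["correct"] / total,
        "abstention_rate": counts["abstained"] / total,
        "hallucination_rate": counts["hallucinated"] / total,
    }


def ready_for_rollout(metrics, min_correct, max_hallucination):
    """Gate production rollout on minimum thresholds set in advance."""
    return (
        metrics["correct_rate"] >= min_correct
        and metrics["hallucination_rate"] <= max_hallucination
    )


# Made-up review results for an 80-query test set.
labels = ["correct"] * 68 + ["abstained"] * 10 + ["hallucinated"] * 2
metrics = score(labels)
print(metrics)
print(ready_for_rollout(metrics, min_correct=0.80, max_hallucination=0.05))
```

Keeping abstentions and hallucinations as separate rates matters because they call for different fixes: abstentions point to gaps in data coverage, while hallucinations point to problems in retrieval or prompt design.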