
Today’s humanoid robots look remarkable, but there’s a design flaw holding them back

Watch Boston Dynamics' Atlas robot doing training routines, or the latest humanoids from Figure loading a washing machine, and it's easy to believe the robot revolution is here. From the outside, it seems the only remaining challenge is perfecting the AI (artificial intelligence) software to enable these machines to handle real-life environments.

Balancing accuracy, cost, and real‑world performance with NVIDIA Nemotron models

Every week, new models are released, along with dozens of benchmarks. But what does that mean for a practitioner deciding which model to use? How should they approach assessing the quality of a newly released model? And how do benchmarked capabilities like reasoning translate into real-world value?

In this post, we’ll evaluate the newly released NVIDIA Llama Nemotron Super 49B 1.5 model. We use syftr, our generative AI workflow exploration and evaluation framework, to ground the analysis in a real business problem and explore the tradeoffs of a multi-objective analysis.

After examining more than a thousand workflows, we offer actionable guidance on the use cases where the model shines.

Parameter count matters, but it isn’t everything

It should be no surprise that parameter count drives much of the cost of serving LLMs. Weights need to be loaded into memory, and key-value (KV) matrices cached. Bigger models typically perform better — frontier models are almost always massive. GPU advancements were foundational to AI’s rise by enabling these increasingly large models.
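To make the cost intuition concrete, here is a back-of-the-envelope memory estimate in Python. The architecture numbers (layers, KV heads, head dimension, context length) are illustrative assumptions for the sketch, not the actual configuration of any Nemotron model:

    # Rough serving-memory estimate; all architecture numbers are illustrative assumptions.
    params_billion = 49               # e.g., a 49B-parameter model
    bytes_per_param = 2               # fp16/bf16 weights
    weight_gb = params_billion * 1e9 * bytes_per_param / 1e9

    # KV cache grows with layers, KV heads, head dim, context length, and batch size.
    layers, kv_heads, head_dim = 80, 8, 128
    seq_len, batch, bytes_per_value = 8192, 4, 2
    kv_gb = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

    print(f"weights ~ {weight_gb:.0f} GB, KV cache ~ {kv_gb:.0f} GB")

Even with rough numbers like these, the weights dominate at low concurrency, while the KV cache starts to matter as context length and batch size grow.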

But scale alone doesn’t guarantee performance.

Newer generations of models often outperform their larger predecessors, even at the same parameter count. The Nemotron models from NVIDIA are a good example: they build on existing open models, pruning unnecessary parameters and distilling new capabilities.

That means a smaller Nemotron model can often outperform its larger predecessor across multiple dimensions: faster inference, lower memory use, and stronger reasoning.

We wanted to quantify those tradeoffs — especially against some of the largest models in the current generation.

How much more accurate? How much more efficient? So, we loaded them onto our cluster and got to work.

How we assessed accuracy and cost

Step 1: Identify the problem

With models in hand, we needed a real-world challenge. One that tests reasoning, comprehension, and performance inside an agentic AI flow.

Picture a junior financial analyst trying to ramp up on a company. They should be able to answer questions like: “Does Boeing have an improving gross margin profile as of FY2022?”

But they also need to explain the relevance of that metric: “If gross margin is not a useful metric, explain why.”

To test our models, we’ll assign them the task of synthesizing data delivered through an agentic AI flow, and then measure their ability to deliver an accurate answer efficiently.

To answer both types of questions correctly, the models need to:

  • Pull data from multiple financial documents (such as annual and quarterly reports)
  • Compare and interpret figures across time periods
  • Synthesize an explanation grounded in context
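To make those three steps concrete, here is a small, self-contained sketch of that kind of flow. The filings, figures, and the stubbed LLM call are all dummy placeholders; a real workflow would retrieve from actual documents and call a real model:

    # Self-contained sketch of the retrieve -> compare -> synthesize flow.
    # The filings and the LLM call below are dummy stand-ins, not real data.
    def call_llm(prompt: str) -> str:
        return f"[model answer grounded in]\n{prompt}"   # stand-in for a real model call

    def answer_financial_question(question: str, filings: dict[str, str]) -> str:
        # 1. Pull candidate figures from each filing (naive keyword match for the sketch).
        evidence = {
            name: [line for line in text.splitlines() if "margin" in line.lower()]
            for name, text in filings.items()
        }
        # 2. Compare figures across time periods by pairing them with their source document.
        comparison = "\n".join(f"{name}: {lines}" for name, lines in evidence.items())
        # 3. Synthesize an explanation grounded in that context.
        return call_llm(f"Question: {question}\nEvidence:\n{comparison}")

    filings = {
        "FY2021 annual report": "Gross margin: 4.2% (dummy figure)",
        "FY2022 annual report": "Gross margin: 5.1% (dummy figure)",
    }
    print(answer_financial_question("Is the gross margin profile improving?", filings))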


The FinanceBench benchmark is designed for exactly this type of task. It pairs filings with expert-validated Q&A, making it a strong proxy for real enterprise workflows. That’s the testbed we used.


Step 2: From models to workflows

To test in a context like this, you need to build and understand the full workflow — not just the prompt — so you can feed the right context into the model.

And you have to do this every time you evaluate a new model–workflow pair.

With syftr, we’re able to run hundreds of workflows across different models, quickly surfacing tradeoffs. The result is a set of Pareto-optimal flows like the one shown below.
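For readers who want to see what “Pareto-optimal” means operationally, here is a small, self-contained sketch that extracts the frontier from a set of evaluated workflows. The workflow names, accuracy values, and cost figures are made up for illustration, and the field names are not syftr’s actual output schema:

    # Illustrative Pareto-frontier extraction over (accuracy up, cost down).
    workflows = [
        {"name": "simple-rag",    "accuracy": 0.41, "cost_per_100_calls": 0.12},
        {"name": "hyde-rag",      "accuracy": 0.55, "cost_per_100_calls": 0.80},
        {"name": "agentic-split", "accuracy": 0.68, "cost_per_100_calls": 4.50},
        {"name": "agentic-cheap", "accuracy": 0.52, "cost_per_100_calls": 3.90},  # dominated
    ]

    def pareto_frontier(flows):
        frontier = []
        for f in flows:
            dominated = any(
                g["accuracy"] >= f["accuracy"]
                and g["cost_per_100_calls"] <= f["cost_per_100_calls"]
                and (g["accuracy"] > f["accuracy"] or g["cost_per_100_calls"] < f["cost_per_100_calls"])
                for g in flows
            )
            if not dominated:
                frontier.append(f)
        return sorted(frontier, key=lambda f: f["cost_per_100_calls"])

    for flow in pareto_frontier(workflows):
        print(flow)   # simple-rag, hyde-rag, and agentic-split survive; agentic-cheap is dominated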

Figure: FinanceBench workflows (accuracy vs. cost Pareto plot)

In the lower left, you’ll see simple pipelines using another model as the synthesizing LLM. These are inexpensive to run, but their accuracy is poor.

In the upper right are the most accurate flows, but they are also more expensive, since they typically rely on agentic strategies that break the question down, make multiple LLM calls, and analyze each chunk independently. This is why reasoning requires efficient computing and optimizations to keep inference costs in check.

Nemotron shows up strongly here, holding its own across the remaining Pareto frontier.


Step 3: Deep dive

To better understand model performance, we grouped workflows by the LLM used at each step and plotted the Pareto frontier for each.

Figure: FinanceBench Pareto frontiers by response synthesizer LLM

The performance gap is clear. Most models struggle to get anywhere near Nemotron’s performance. Some have trouble generating reasonable answers without heavy context engineering, and even then they remain less accurate and more expensive than larger models.

But when we switch to using the LLM for HyDE (Hypothetical Document Embeddings), the story changes. (Flows marked N/A don’t include HyDE.)
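For context, HyDE replaces the raw query embedding with the embedding of a hypothetical answer document written by the LLM, which often lands closer to the relevant passages. Here is a toy, self-contained sketch of the idea; the generator and embedder below are trivial stand-ins for real models:

    import math

    # Toy HyDE sketch: embed a hypothetical answer document instead of the raw question.
    def generate_hypothetical_doc(question: str) -> str:
        return f"A plausible passage answering: {question}"   # stand-in for an LLM call

    def embed(text: str) -> list[float]:
        # Toy character-frequency embedding; real flows use an embedding model.
        return [text.lower().count(c) / max(len(text), 1) for c in "abcdefghijklmnopqrstuvwxyz"]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def hyde_retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
        query_vec = embed(generate_hypothetical_doc(question))
        return sorted(corpus, key=lambda doc: cosine(embed(doc), query_vec), reverse=True)[:k]

    docs = ["Gross margin improved year over year.", "The weather was sunny."]
    print(hyde_retrieve("Is gross margin improving?", docs, k=1))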

Figure: FinanceBench Pareto frontiers by HyDE retrieval generative model

Here, several models perform well, delivering high-accuracy flows at an affordable cost.

 Key takeaways:

  • Nemotron shines in synthesis, producing high‑fidelity answers without added cost
  • Using other models that excel at HyDE frees Nemotron to focus on high-value reasoning
  • Hybrid flows are the most efficient setup, using each model where it performs best

Optimizing for value, not just size

When evaluating new models, success isn’t just about accuracy. It’s about finding the right balance of quality, cost, and fit for your workflow. Measuring latency, efficiency, and overall impact helps ensure you’re getting real value.

NVIDIA Nemotron models are built with this in mind. They’re designed not only for power, but for practical performance that helps teams drive impact without runaway costs.

Pair that with a structured, syftr-guided evaluation process, and you’ve got a repeatable way to stay ahead of model churn while keeping compute and budget in check.

To explore syftr further, check out the GitHub repository.

The post Balancing accuracy, cost, and real‑world performance with NVIDIA Nemotron models appeared first on DataRobot.

Simplified wrist mechanism gives robots a hand

Give robots a specific job—say, placing a can on a conveyor belt in a factory—and they can be extremely efficient. But in less-structured environments with varied tasks, even seemingly simple things like unscrewing a light bulb or turning a door handle, things get a lot trickier.

ChatGPT-5 Released: Top Ten Takeaways

In one of the most anticipated product launches of all time, OpenAI has released a major update to ChatGPT, which is currently used by 700 million people — each week — worldwide.

The skinny: With ChatGPT-5, OpenAI is promising a faster, easier, smarter and much more accurate experience – although many long-term users have been turned off by ChatGPT’s ‘new personality,’ which they find cold and distant.

Either way, as anticipated, ChatGPT-5’s release has dramatically altered the AI landscape.

Here are the Top Ten Takeaways:

*Expect PhD-level Intelligence: No matter what the question, ChatGPT-5 is trained to respond to you at the PhD level. Observes lead writer Angela Yang: “The company said the new model, GPT-5, is its smartest and fastest to date with wide-ranging improvements to ChatGPT’s skills in areas like coding, writing and taking on complex actions.”

*Stick With GPT-5 Thinking for Consistency for Now: ChatGPT’s overhaul comes with a new router, which is programmed to automatically select the best AI engine for your query. It selects a weaker AI engine for your easy questions, for example, and a powerful AI engine for tougher questions.

The problem: The router is less-than-perfect, often routing tough questions to a weak AI engine, resulting in disappointing responses. Consequently, the best bet for answers with consistent quality is to use GPT-5 Thinking – even though this AI engine takes longer to respond.

*Feel Free to Interrupt ChatGPT for a Quick Answer: This feature is one of the workarounds when using the slower-responding – but smarter – GPT-5 Thinking. You can click the “Interrupt for Quick Answer” link inside GPT-5 Thinking any time you’re using that AI engine and believe a weaker AI engine can deliver a good enough response.

*Look for Faster Responses: Early adopters report that using ChatGPT-5 is faster overall. Observes Nick Turley, head of product, ChatGPT: “You really get the best of both worlds. You have it reason when it needs to reason, but you don’t have to wait as long.”

*Expect Fewer Hallucinations/Made-up Facts: Early adopters also report ChatGPT-5 is less prone to making up facts. In fact, sometimes ChatGPT-5 will simply admit it does not have an answer for you. Other times, it will ask you follow-up questions to try to clarify your question.

*Even at the Free Level, Get Access to the Most Powerful Version: With ChatGPT-5, even free users get access – albeit limited – to the most powerful AI engine available from its maker, OpenAI. Previously, free users were only given access to weaker AI engines.

*Bank-on Using Advanced Voice Mode for Free, if You Prefer: If you like interacting with ChatGPT using just your voice, you can do so even at the free level now. Plus, those who currently use Advanced Voice with their paid subscription should expect higher usage limits.

*Gear-up for a New ChatGPT Personality: Many early adopters report that GPT-5’s default personality is colder, terser and far less engaging. Overall: GPT-5 is not interested in being your friend. Instead, GPT-5 is optimized to bring back results, get the job done and move on. Period.

While some users prefer this default personality, others have been seriously turned off.

Observes writer Ryan Whitwam: “On the OpenAI community forums and Reddit, long-time chatters are expressing sorrow at losing access to models like GPT-4o.

“They explain the feeling as ‘mentally devastating,’ and ‘like a buddy of mine has been replaced by a customer service representative.’ These threads are full of people pledging to end their paid subscriptions.”

*Hold-Out for ChatGPT-4o’s Return: Responding to widespread critiques that GPT-5 projects a cold, terse, standoffish personality, its maker OpenAI is promising to bring back ChatGPT-4o as an option for ChatGPT Plus users.

*Check-Out the Excellent, First-Take Video Overviews on GPT-5 Already Available: Fortunately, YouTube is awash with a number of extremely informative videos on what ChatGPT-5 looks like in action. Here are some choice picks:

–Introducing GPT-5: This is the one hour-plus video that ChatGPT’s maker released with the official launch of ChatGPT-5. It’s a great place to start for a detailed overview of all the new features – albeit from the ‘proud parent’ perspective of ChatGPT-5’s creator.

–7 Big Changes in GPT-5 (With Live Demos): Matt Maher offers an excellent, concise and balanced look at how ChatGPT-5 performs in this 22-minute video. Maher’s take is mostly positive – but he also includes some reservations about the downsides.

–What People Love and Hate About GPT-5: This 8-minute, AI Daily Brief (AIDB) video offers an unvarnished critique of the new GPT-5. People who are jazzed about the new release feel that GPT-5’s ability to pick the right AI engine for every question is, on balance, the right move, according to AIDB.

And they also report lightning-quick responses and expect GPT-5’s true power will only be revealed over time.

On the downside: ChatGPT-5’s one-size-fits-all, auto AI engine picker too often picks an engine that is weaker than what’s actually needed, according to AIDB.

–GPT-5 in Microsoft 365 Copilot: It turns out Microsoft wasted no time embedding GPT-5 as one of the AI engines you can use with its own chatbot, Microsoft Copilot. Click here for the 53-second video.

–10 Things that GPT-5 Changes: The AI Daily Brief offers an extremely thoughtful, 19-minute analysis of how things change long-term now that GPT-5 is live.

–AI Insiders Breakdown the GPT-5 Update: Peter Diamandis and friends – some of the top minds in AI – offer an extremely in-depth examination of the GPT-5 release in this nearly two-hour video.

Share a Link:  Please consider sharing a link to https://RobotWritersAI.com from your blog, social media post, publication or emails. More links leading to RobotWritersAI.com helps everyone interested in AI-generated writing.

Joe Dysart is editor of RobotWritersAI.com and a tech journalist with 20+ years experience. His work has appeared in 150+ publications, including The New York Times and the Financial Times of London.


The post ChatGPT-5 Released: Top Ten Takeaways appeared first on Robot Writers AI.

Robotic drummer gradually acquires human-like behaviors

Humanoid robots, robots with a human-like body structure, have so far been primarily tested on manual tasks that entail supporting humans in their daily activities, such as carrying objects, collecting samples in hazardous environments, supporting older adults or acting as physical therapy assistants. In contrast, their potential for completing expressive physical tasks rooted in creative disciplines, such as playing an instrument or participating in performance arts, remains largely unexplored.

Are your AI agents still stuck in POC? Let’s fix that.

Most AI teams can build a demo agent in days. Turning that demo into something production-ready that meets enterprise expectations is where progress stalls.

Weeks of iteration become months of integration, and suddenly the project is stuck in PoC purgatory while the business waits.

Turning prototypes into production-ready agents isn’t just hard. It’s a maze of tools, frameworks, and security steps that slow teams down and increase risk.

In this post, you’ll learn step by step how to build, deploy, and govern AI agents using the Agent Workforce Platform from DataRobot.

Why teams struggle to get agents into production 

Two factors keep most teams stuck in PoC purgatory:

1. Complex builds
Translating business requirements into a reliable agent workflow isn’t simple. It requires evaluating countless combinations of LLMs, smaller models, embedding strategies, and guardrails while balancing strict quality, latency, and cost objectives. The iteration alone can take weeks.
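A quick back-of-the-envelope count shows why. Even a modest set of choices multiplies into hundreds of candidate workflows before any tuning; all of the option lists below are made up purely for illustration:

    from itertools import product

    # Made-up component choices, just to illustrate the size of the search space.
    llms = ["model-a", "model-b", "model-c", "model-d"]
    embedders = ["embed-small", "embed-large"]
    chunk_sizes = [256, 512, 1024]
    retrieval_modes = ["dense", "hybrid", "hyde"]
    guardrails = ["none", "pii", "pii+toxicity"]

    combos = list(product(llms, embedders, chunk_sizes, retrieval_modes, guardrails))
    print(len(combos), "candidate workflows")   # 4 * 2 * 3 * 3 * 3 = 216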

2. Operational drag
Even after the workflow works, deploying it in production is a marathon. Teams spend months managing infrastructure, applying security guardrails, setting up monitoring, and enforcing governance to reduce compliance and operational risks.

Today’s options don’t make this easier:

  • Many tools may speed up parts of the build process but often lack integrated governance, observability, and control. They also lock users into their ecosystem, limit flexibility with model selection and GPU resources, and provide minimal support for evaluation, debugging, or ongoing monitoring.
  • Bring-your-own stacks offer more flexibility but require heavy lifting to configure, secure, and connect multiple systems. Teams must handle infrastructure, authentication, and compliance on their own — turning what should be weeks into months.


The result? Most teams never make it past proof of concept to a production-ready agent.

A unified approach to the agent lifecycle

Instead of juggling multiple tools for build, evaluation, deployment, and governance, the Agent Workforce Platform brings these stages into one workflow while supporting deployments across cloud, on-premises, hybrid, and air-gapped environments.

  • Build anywhere: Develop in Codespaces, VSCode, Cursor, or any notebook using OSS frameworks like LangChain, CrewAI, or LlamaIndex, then upload with a single command.
  • Evaluate and compare workflows: Use built-in operational and behavioral metrics, LLM-as-a-judge, and human-in-the-loop reviews for side-by-side comparisons.
  • Trace and debug issues quickly: Visualize execution at every step, then edit code in-platform and re-run evaluations to resolve errors faster.
  • Deploy with one click or command: Move agents to production without manual infrastructure setup, whether on DataRobot or your own environment.
  • Monitor with built-in and custom metrics: Track functional and operational metrics in the DataRobot dashboard, or export OTel-compliant data to your own preferred observability tool.
  • Govern from day one: Apply real-time guardrails and automated compliance reporting to enforce security, manage risk, and maintain audit readiness without extra tools.


Enterprise-grade capabilities include:

  • Managed RAG workflows with your choice of vector databases like Pinecone and Elastic for retrieval-augmented generation.
  • Elastic compute for hybrid environments, scaling to meet high-performance workloads without compromising compliance or security.
  • Broad NVIDIA NIM integration for optimized inference across cloud, hybrid, and on-premises environments.
  • “Batteries included” LLM access to OSS and proprietary models (Anthropic, OpenAI, Azure, Bedrock, and more) with a single set of credentials — eliminating API key management overhead.
  • OAuth 2.0-compliant authentication and role-based access control (RBAC) for secure agent execution and data governance.

From prototype to production: step by step

Every team’s path to production looks different. The steps below represent common jobs to be done when managing the agent lifecycle — from building and debugging to deploying, monitoring, and governing.

Use the steps that fit your workflow or follow the full sequence for an end-to-end process.

1. Build your agent

Start with the frameworks you know. Use agent templates for LangGraph, CrewAI, and LlamaIndex from DataRobot’s public GitHub repo, and the CLI for quick setup.

Clone the repo locally, edit the agent.py file, and push your prototype with a single command to prepare it for production and deeper evaluation. The Agent Workforce Platform handles dependencies, Docker containers, and integrations for tracing and authentication.
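The template repos define their own structure, so treat the following as a generic sketch of what the core of an agent.py could look like if you build with LangGraph. The single node and its placeholder logic are assumptions for illustration, not the DataRobot template itself:

    # Generic LangGraph sketch (illustrative only, not the DataRobot template).
    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class AgentState(TypedDict):
        question: str
        answer: str

    def answer_node(state: AgentState) -> dict:
        # Placeholder logic; a real node would call an LLM or a tool here.
        return {"answer": f"(draft answer to: {state['question']})"}

    graph = StateGraph(AgentState)
    graph.add_node("answer", answer_node)
    graph.set_entry_point("answer")
    graph.add_edge("answer", END)
    agent = graph.compile()

    print(agent.invoke({"question": "Summarize last quarter's revenue drivers.", "answer": ""}))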


2. Evaluate and compare workflows

After uploading your agent, configure evaluation metrics to measure performance across agents, sub-agents, and tools.

Choose from built-in options such as PII and toxicity checks, NeMo guardrails, LLM-as-a-judge, and agent-specific metrics like tool call accuracy and goal adherence.

Then, use the agent playground to prompt your agent and compare responses with evaluation scores. For deeper testing, generate synthetic data or add human-in-the-loop reviews.
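For readers new to LLM-as-a-judge, the idea is simply that a second model scores each response against a rubric or reference answer. The sketch below is illustrative only; the judge call is a stub, and the platform’s built-in metrics handle this for you:

    # Illustrative LLM-as-a-judge scoring sketch (the judge call is a stub).
    JUDGE_PROMPT = (
        "Rate the candidate answer from 1 (wrong) to 5 (fully correct and grounded).\n"
        "Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}\n"
        "Reply with a single integer."
    )

    def call_judge_llm(prompt: str) -> str:
        return "4"   # stub; in practice this would be a call to a judge model

    def judge_score(question: str, reference: str, candidate: str) -> int:
        prompt = JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
        return int(call_judge_llm(prompt).strip())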


3. Trace and debug

Use the agent playground to view execution traces directly in the UI. Drill into each task to see inputs, outputs, metadata, evaluation details, and context for every step in the pipeline.

Traces cover the top-level agent as well as sub-components, guard models, and evaluation metrics. Use this visibility to quickly identify which component is causing errors and pinpoint issues in your code. 


4. Edit and re-test your agent

If evaluation metrics or traces reveal issues, open a code space in the UI to update the agent logic. Save your changes and re-run the agent without leaving the platform. Updates are stored in the registry, ensuring a single source of truth as you iterate.

This is not only useful when you are first testing your agent, but also over time as new models, tools, and data need to be incorporated to upgrade it.


5. Deploy your agent

Deploy your agent to production with a single click or command. The platform manages hardware setup and configuration across cloud, on-premises, or hybrid environments and registers the deployment in the platform for centralized tracking.


6. Monitor and trace deployed agents

Track agent performance and behavior in real time with built-in monitoring and tracing. View key metrics such as cost, latency, task adherence, goal accuracy, and safety indicators like PII exposure, toxicity, and prompt injection risks.

OpenTelemetry (OTel)-compliant traces provide visibility into every step of execution, including tool inputs, outputs, and performance at both the component and workflow levels.

Set alerts to catch issues early and modularize components so you can upgrade tools, models, or vector databases independently while tracking their impact.
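If you also want to emit your own spans from agent code, the standard OpenTelemetry Python SDK is all you need. Here is a minimal example using the console exporter for local inspection; in production you would swap in an OTLP exporter pointed at your observability backend:

    # Minimal OpenTelemetry tracing example; console exporter is for local inspection only.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("my-agent")            # tracer name is arbitrary

    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.question", "example question")
        with tracer.start_as_current_span("tool.vector_search") as tool_span:
            tool_span.set_attribute("tool.latency_ms", 42)   # illustrative attribute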


7. Apply governance by design

Manage security, compliance, and risk as part of the workflow, not as an add-on. The registry within the Agent Workforce Platform provides a centralized source of truth for all agents and models, with access control, lineage, and traceability.

Real-time guardrails monitor for PII leakage, jailbreak attempts, toxicity, hallucinations, policy violations, and operational anomalies. Automated compliance reporting supports multiple regulatory frameworks, reducing audit effort and manual work.
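To give a feel for what a guardrail check does conceptually, here is a toy regex-based PII screen. The platform’s guardrails are model-based and far more robust; this is only a sketch of the idea:

    import re

    # Toy PII screen: flags obvious email addresses and US-style SSNs in model output.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def pii_findings(text: str) -> dict[str, list[str]]:
        hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
        return {name: found for name, found in hits.items() if found}

    print(pii_findings("Contact jane.doe@example.com, SSN 123-45-6789"))
    # {'email': ['jane.doe@example.com'], 'ssn': ['123-45-6789']}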


What makes the Agent Workforce Platform different

These are the capabilities that cut months of work down to days, without sacrificing security, flexibility, or oversight.

One platform, full lifecycle: Manage the entire agent lifecycle across on premises, multi-cloud, air-gapped, and hybrid environments without stitching together separate tools.

Evaluation, debugging, and observability built in: Perform comprehensive evaluation, trace execution, debug issues, and monitor real-time performance without leaving the platform. Get detailed metrics and alerting, even for mission-critical projects.

Integrated governance and compliance:  A central AI registry versions and tracks lineage for every asset, from agents and data to models and applications. Real-time guardrails and automated reporting eliminate manual compliance work and simplify audits.

Flexibility without trade-offs: Use any open-source or proprietary framework or model on a platform built for enterprise-grade security and scalability.

From prototype to production and beyond

Building enterprise-ready agents is just the first step. As your use cases grow, this guide gives you a foundation for moving faster while maintaining governance and control.

Ready to build? Start your free trial.

The post Are your AI agents still stuck in POC? Let’s fix that. appeared first on DataRobot.

Engineers design alternating-pressure mattress for bedsore prevention

Mechanical engineering researchers at the UCLA Samueli School of Engineering have designed a mattress that helps prevent bedsores by alternating pressure across the body and, at times, increasing peak pressure rather than reducing it to restore blood flow.

Climate-optimized construction with robots

A straight wall is not necessarily a climate-optimized wall. Depending on the wall's exposure to sun and shade, there is an ideal angle for individual bricks. The calculations come from a digital design configurator—and in the future, a robot will help craftsmen to position the bricks precisely. In a workshop with apprentice bricklayers, this human-machine cooperation in construction has been tested under real-world conditions by the Technical University of Munich (TUM) and the Munich-Ebersberg Construction Guild.