Generations in Dialogue: Bridging Perspectives in AI is a podcast from AAAI featuring thought-provoking discussions between AI experts, practitioners, and enthusiasts from different age groups and backgrounds. Each episode delves into how generational experiences shape views on AI, exploring the challenges, opportunities, and ethical considerations that come with the advancement of this transformative technology.
Embodied AI, robotics, perception, and action with Professor Roberto Martín-Martín
In the third episode of this new series from AAAI, host Ella Lan chats to Professor Roberto Martín-Martín about taking a screwdriver to his toys as a child, how his research focus has evolved over time, how different generations interact with technology, making robots for everyone, being inspired by colleagues, advice for early-career researchers, and how machines can enhance human capabilities.
About Professor Roberto Martín-Martín:
Roberto Martín-Martín is an Assistant Professor of Computer Science at the University of Texas at Austin, where his research integrates robotics, computer vision, and machine learning to build autonomous agents capable of perceiving, learning, and acting in the real world. His work spans low-level tasks like pick-and-place and navigation to complex activities such as cooking and mobile manipulation, often drawing inspiration from human cognition and integrating insights from psychology and cognitive science. He previously worked as an AI Researcher at Salesforce AI and as a Postdoctoral Scholar at the Stanford Vision and Learning Lab with Silvio Savarese and Fei-Fei Li, leading projects in visuomotor learning, mobile manipulation, and human-robot interaction. He earned his Ph.D. and M.S. from Technische Universität Berlin under Oliver Brock and a B.S. from Universidad Politécnica de Madrid. His work has been recognized with best paper awards at RSS and ICRA, and he serves as Chair of the IEEE/RAS Technical Committee on Mobile Manipulation.
About the host
Ella Lan, a member of the AAAI Student Committee, is the host of “Generations in Dialogue: Bridging Perspectives in AI.” She is passionate about bringing together voices across career stages to explore the evolving landscape of artificial intelligence. Ella is a student at Stanford University tentatively studying Computer Science and Psychology, and she enjoys creating spaces where technical innovation intersects with ethical reflection, human values, and societal impact. Her interests span education, healthcare, and AI ethics, with a focus on building inclusive, interdisciplinary conversations that shape the future of responsible AI.
While keeping pace with the seemingly endless parade of AI tools can be exhausting, getting crystal clear on the raw power embedded in Google’s new Nano Banana Pro image generator is well worth a huff-and-puff.
In a phrase, Nano Banana Pro (NBP) – released a few weeks ago – is the new gold standard in AI imaging, capable of rendering virtually anything imaginable.
Essentially: Writers now have a tool that can auto-generate one or more supplemental images for their work with a precision and power that currently has no rival.
Plus, NBP has an incredible amount of firepower under the hood that is simply not available to the competition.
For example: NBP is an exquisite image generator in its own right.
But it is also powered by Google’s Gemini 3.0 Pro, now widely considered the gold standard in consumer AI.
And, NBP can also be easily combined with Google Search, the world’s number one search engine.
Like many things AI, the secret to achieving mastery with NBP is to sample how countless, highly inspired human imaginations are already working with the tool – and then synthesize that rubber-meets-the-road knowledge to forge your own method for working with NBP.
Towards that end, here are ten excellent videos on NBP, complete with detailed demos, showing how imaginative folks are artfully using the AI – and surfacing truly world-class, head-turning images:
*Quick Overview: NBP Key Features: This 15-minute video from AI Master offers a great intro to the key new capabilities of NBP – complete with captivating visual examples. Demos include:
– blending multiple images into one
– converting stick figures into an image-rich scene
– experimenting with visual style changes on the same image
– working with much more reliable text-on-images
*A Torrent of NBP Use Cases: This incredibly organized and informative 11-minute video from Digital Assets dives deep into the wide array of use cases you can tap into using NBP. Demos include:
– historical event image generation, based on location, date, and approximate time (example: conjure the Apollo moon landing)
– multi-angle product photography
– alternate reality generation (example: depict the architecture of ancient Rome immersed in a futuristic setting)
– hyper-realistic 3D-diorama generation
*Another Torrent of NBP Use Cases: Click on this 27-minute video from Astrovah for a slew of more mind-bending use cases, including:
– text-on-image analysis of any photo you upload, including its context and key facts to know about the image
– how to make an infographic in seconds
– how to inject season and weather changes into any image
– making exploded-view images of any product
– auto-generated blueprints of any image
*Generating Hyper-Realistic Photos With NBP: This great, 22-minute video from Tao Prompts offers an inside look at how to ensure any image you generate with NBP is hyper-photorealistic – right down to the brand of photo film you’re looking to emulate.
*Infinite Camera Angles on Tap: Getting just the right camera angle on any image is now child’s play with NBP. This 11-minute video from Chase AI serves up demos on how to be the director of any image you create with NBP. Included is a detailed prompt library you can use, featuring the same camera angle descriptions used by pro photographers.
*Swapping a Face in Seconds: Short-and-sweet, this 4-minute video from AsapGuide offers a quick, down-and-dirty way to transplant any face onto any image you provide.
*Aging/De-Aging a Person in Seconds: Another great collection of use cases, this 16-minute video from Atomic Gains includes an easy-to-replicate demo on making a person look younger, or vice-versa. Also included are demos on instantly changing the lighting in an image, changing the position of a character in an image and surgically removing specific details from any image.
*NBP: Getting Technical: Once you’ve played with NBP informally, you can pick up some extremely helpful, technical tips on how to manipulate NBP with this 29-minute video from AI Samson. Tricks include how to zoom in/out on an image, how to maintain character consistency and how to use complex cinematic stylings.
*Amplifying NBP With Google AI Studio: This 58-minute video from David Ondrej recommends using NBP in the free Google AI Studio interface. The reason: Google AI Studio will give you much more granular control over your results, including precise image size, creating accurate slides with text and using NBP with Google Search. Caveat: To use Google AI Studio, you need to switch to a special Google Gemini API subscription.
*Working with NBP in Photoshop: Adobe has already integrated NBP into its toolset. And this is the perfect video (8 minutes from Sebastien Jefferies) to check out how to combine the power of NBP with the incredible precision of Photoshop. Included are lots of great demos that answer the question: what’s the long-term impact of pairing NBP with Photoshop?
–Joe Dysart is editor of RobotWritersAI.com and a tech journalist with 20+ years’ experience. His work has appeared in 150+ publications, including The New York Times and the Financial Times of London.
New findings challenge the widespread belief that AI is an environmental villain. By analyzing U.S. economic data and AI usage across industries, researchers discovered that AI’s energy consumption—while significant locally—barely registers at national or global scales. Even more surprising, AI could help accelerate green technologies rather than hinder them.
Navigating a sea of documents, scattered across various platforms, can be a daunting task, often leading to slow decision-making and missed insights. As organizational knowledge and data multiplies, teams that can’t centralize or surface the right information quickly will struggle to make decisions, innovate, and stay competitive.
This blog explores how the new Talk to My Docs (TTMDocs) agent provides a solution to the steep costs of knowledge fragmentation.
The high cost of knowledge fragmentation
Knowledge fragmentation is not just an inconvenience — it’s a hidden cost to productivity, actively robbing your team of time and insight.
A survey by Starmind across 1,000+ knowledge workers found that employees only tap into 38% of their available knowledge/expertise because of this fragmentation.
Another study by McKinsey & Company found that knowledge workers spend over a quarter of their time searching for the information they need across different platforms such as Google Drive, Box, or local systems.
The constraints of existing solutions
While there are a few options on the market designed to ease the process of querying across key documents and materials living in a variety of places, many have significant constraints in what they can actually deliver.
For example:
Vendor lock-in can severely hinder the promised experience. Unless you are strictly using the supported integrations of your vendor of choice, which in most instances is unrealistic, you end up with a limited subset of information repositories you can connect to and interact with.
Security and compliance considerations add another layer of complexity. Access to one platform or document set doesn’t imply access to another, and any misstep or missed vulnerability can open up your organization to potential risk.
Talk to My Docs takes a different approach
DataRobot’s new Talk to My Docs agent represents a different approach. We provide the developer tools and support you need to build AI solutions that actually work in enterprise contexts. Not as a vendor-controlled service, but as a customizable open-source template you can tailor to your needs.
The differentiation isn’t subtle. With TTMDocs you get:
Enterprise security and compliance built in from day one
Multi-source connectivity instead of vendor lock-in
Zero-trust access control that respects existing permissions
Complete observability through DataRobot platform integration
Multi-agent architecture that scales with complexity
Full code access and customizability instead of black box APIs
Modern infrastructure-as-code for repeatable deployments
What makes Talk to My Docs different
Talk To My Docs is an open-source application template that gives you the intuitive, familiar chat-style experience that modern knowledge workers have come to expect, coupled with the control and customizability you actually need.
This isn’t a SaaS product you subscribe to, but rather a developer-friendly template you can deploy, modify, and make your own.
Multi-source integration and real security
TTMDocs connects to Google Drive, Box, and your local filesystems out of the box, with SharePoint and JIRA integrations coming soon.
Preserve existing controls: We provide out-of-the-box OAuth integration to handle authentication securely through existing credentials. You’re not creating a parallel permission structure to manage—if you don’t have permission to see a document in Google Drive, you won’t see it in TTMDocs either.
Meet data where it lives: Unlike vendor-locked solutions, you’re not forced to migrate your document ecosystem. You can seamlessly leverage files stored in structured and unstructured connectors like Google Drive, Box, Confluence, and SharePoint available on the DataRobot platform, or upload your files locally.
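To make the permission-preserving idea concrete, here is a minimal, hypothetical sketch of the underlying pattern (not the template’s actual code, whose OAuth wiring lives in web/app/auth/oauth.py and is discussed later in this post): queries against a source like Google Drive run with the user’s own OAuth credentials, so the source’s API only ever returns documents that user can already see.

```python
# Hypothetical sketch: listing Drive files with the *user's* OAuth credentials.
# Because the API call is made as the user, Drive itself filters out anything
# they lack permission to view; no parallel permission model is needed.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

def search_user_drive(user_token: str, query: str) -> list[dict]:
    creds = Credentials(token=user_token)  # token obtained via the app's OAuth flow
    drive = build("drive", "v3", credentials=creds)
    response = drive.files().list(
        q=f"fullText contains '{query}' and trashed = false",
        fields="files(id, name, mimeType, modifiedTime)",
        pageSize=20,
    ).execute()
    return response.get("files", [])
```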
Multi-agent architecture that scales
TTMDocs uses CrewAI for multi-agent orchestration, so you can have specialized agents handling different aspects of a query.
Modular & flexible: The modular architecture means you can also swap in your preferred agentic framework, whether that’s LangGraph, LlamaIndex, or any other, if it better suits your needs.
Customizable: Want to change how agents interpret queries? Adjust the prompts. Need custom tools for domain-specific tasks? Add them. Have compliance requirements? Build those guardrails directly into the code.
Scalable: As your document collection grows and use cases become more complex, you can add agents with specialized tools and prompts rather than trying to make one agent do everything. For example, one agent might retrieve financial documents, another handle technical specifications, and a third synthesize cross-functional insights.
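That specialization example maps naturally onto CrewAI’s Agent/Task/Crew abstractions. The sketch below is not the template’s actual agent code (that lives in agent_retrieval_agent/custom_model/agent.py); it is a generic illustration with hypothetical roles and task descriptions, and it assumes an LLM API key is already configured for CrewAI’s default model.

```python
# Minimal CrewAI sketch: two hypothetical specialized agents plus a synthesizer.
# Roles, goals, and task descriptions are illustrative, not the template's own.
from crewai import Agent, Task, Crew, Process

finance_agent = Agent(
    role="Financial document analyst",
    goal="Retrieve and summarize financial statements relevant to the query",
    backstory="Knows where financial reports live and how to read them.",
)
spec_agent = Agent(
    role="Technical specification analyst",
    goal="Pull technical details that relate to the user's question",
    backstory="Familiar with engineering and product documentation.",
)
synthesizer = Agent(
    role="Cross-functional synthesizer",
    goal="Combine findings from the other agents into one grounded answer",
    backstory="Writes concise, sourced summaries for business stakeholders.",
)

query = "Are the Q3 cost projections consistent with the current hardware spec?"

tasks = [
    Task(description=f"Find financial context for: {query}",
         expected_output="Relevant figures with document references",
         agent=finance_agent),
    Task(description=f"Find technical context for: {query}",
         expected_output="Relevant specifications with document references",
         agent=spec_agent),
    Task(description="Merge both findings and flag any inconsistencies",
         expected_output="A short, sourced answer",
         agent=synthesizer),
]

crew = Crew(agents=[finance_agent, spec_agent, synthesizer],
            tasks=tasks, process=Process.sequential)
print(crew.kickoff())
```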
Enterprise platform integration
Another key aspect of Talk to My Docs is that it integrates with your existing DataRobot infrastructure.
Guarded RAG & LLM access: The template includes a Guarded RAG LLM Model for controlled document retrieval and LLM Gateway integration for access to 80+ open and closed-source LLMs.
Full observability: Every query is logged. Every retrieval is tracked. Every error is captured. This means you have full tracing and observability through the DataRobot platform, allowing you to actually troubleshoot when something goes wrong.
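The tracing itself is provided by the DataRobot platform, but the general shape of per-query observability is easy to picture. The snippet below is a generic, hypothetical illustration (not DataRobot’s instrumentation): every query logs its prompt, the retrieved document IDs, latency, and any error.

```python
# Generic tracing sketch (not DataRobot's instrumentation): wrap every query
# so the prompt, retrieved document IDs, latency, and any error are recorded.
import functools
import logging
import time

logger = logging.getLogger("ttmdocs.trace")

def traced(fn):
    @functools.wraps(fn)
    def wrapper(query: str, *args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(query, *args, **kwargs)  # assumed to return a dict with "doc_ids"
            logger.info("query=%r docs=%s latency=%.2fs",
                        query, result.get("doc_ids", []), time.perf_counter() - start)
            return result
        except Exception:
            logger.exception("query=%r failed after %.2fs",
                             query, time.perf_counter() - start)
            raise
    return wrapper
```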
Modern, modular components
The template is organized into clean, independent pieces that can be developed and deployed separately or as part of the full stack:
agent_retrieval_agent: Multi-agent orchestration using CrewAI. Core agent logic and query routing.
core: Shared Python logic, common utilities, and functions.
frontend_web: React and Vite web frontend for the user interface.
web: FastAPI backend. Manages API endpoints, authentication, and communication.
infra: Pulumi infrastructure-as-code for provisioning cloud resources.
The power of specialization: Talk to My Docs use cases
The pattern: productionized, specialized agents working together across your existing document sources, with security and observability built in.
Here are a few examples of how this is applied in the enterprise:
M&A due diligence: Cross-reference financial statements (Box), legal contracts (Google Drive), and technical documentation (local files). The permission structure ensures only the deal team sees sensitive materials.
Clinical trial documentation: Verify trial protocols align with regulatory guidelines across hundreds of documents, flagging inconsistencies before submission.
Legal discovery: Search across years of emails, contracts, and memos scattered across platforms, identifying relevant and privileged materials while respecting strict access controls.
Product launch readiness: Verify marketing materials, regulatory approvals, and supply chain documentation are aligned across regions and backed by certifications.
Insurance claims investigation: Pull policy documents, adjuster notes, and third-party assessments to cross-reference coverage terms and flag potential fraud indicators.
Research grant compliance: Cross-reference budget documents, purchase orders, and grant agreements to flag potential compliance issues before audits.
Use case: Clinical trial documentation
The challenge
A biotech company preparing an FDA submission is drowning in documentation spread across multiple systems: FDA guidance in Google Drive, trial protocols in SharePoint, lab reports in Box, and quality procedures locally. The core problem is ensuring consistency across all documents (protocols, safety, quality) before a submission or inspection, which demands a quick, unified view.
How TTMDocs helps
The company deploys a customized healthcare regulatory agent, a unified system that can answer complex compliance questions across all document sources.
Regulatory agent:
Identifies applicable FDA submission requirements for the specific drug candidate.
Clinical review agent:
Reviews trial protocols against industry standards for patient safety and research ethics.
Safety compliance agent:
Checks that safety monitoring and adverse event reporting procedures meet FDA timelines.
The result
A regulatory team member asks: “What do we need for our submission, and are our safety monitoring procedures up to standard?”
Instead of spending days gathering documents and cross-referencing requirements, they get a structured response within minutes. The system identifies their submission pathway, flags three high-priority gaps in their safety procedures, notes two issues with their quality documentation, and provides a prioritized action plan with specific timelines.
Where to look: The code that makes it happen
The best way to understand TTMDocs is to look at the actual code. The repository is completely open source and available on GitHub.
Here are the key places to start exploring:
Agent architecture (agent_retrieval_agent/custom_model/agent.py): See how CrewAI coordinates different agents, how prompts are structured, and where you can inject custom behavior.
Tool integration (agent_retrieval_agent/custom_model/tool.py): Shows how agents interact with external systems. This is where you’d add custom tools for querying an internal API or processing domain-specific file formats; see the sketch after this list.
OAuth and security (web/app/auth/oauth.py): See exactly how authentication works with Google Drive and Box and how your user permissions are preserved throughout the system.
Web backend (web/app/): The FastAPI application that ties everything together. You’ll see how the frontend communicates with agents, and how conversations are managed.
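Picking up the tool-integration point above, here is a hypothetical custom tool sketched against CrewAI’s BaseTool interface. The endpoint, response shape, and tool name are invented for illustration, and the exact import path can differ between CrewAI versions.

```python
# Hypothetical custom tool: lets an agent query an internal inventory API.
# The endpoint and response shape are illustrative only.
import requests
from crewai.tools import BaseTool  # in some versions: from crewai_tools import BaseTool

class InventoryLookupTool(BaseTool):
    name: str = "inventory_lookup"
    description: str = ("Look up current stock levels for a product SKU "
                        "via the internal inventory API.")

    def _run(self, sku: str) -> str:
        resp = requests.get(
            "https://inventory.internal.example.com/api/v1/stock",  # hypothetical endpoint
            params={"sku": sku},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        return f"SKU {sku}: {data.get('quantity', 'unknown')} units in stock."
```

A tool like this would then be added to an agent’s tools list alongside the document-retrieval tools the template already provides.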
The future of enterprise AI is open
Enterprise AI is at an inflection point. The gap between what end-user AI tools can do and what enterprises actually need is growing. Companies are realizing that “good enough” consumer AI products create more problems than they solve when you cannot compromise on enterprise requirements like security, compliance, and integration.
The future isn’t about choosing between convenience and control. It’s about having both. Talk to My Docs puts both the power and the flexibility into your hands, delivering results you can trust.
With DataRobot application templates, you’re never locked into rigid black-box systems. Gain a flexible foundation that lets you adapt, experiment, and innovate on your terms. Whether refining existing workflows or creating new AI-powered applications, DataRobot gives you the clarity and confidence to move forward.
According to Goldman Sachs, the humanoid robot market is projected to reach $38 billion by 2035, with shipments surpassing 1.4 million units, driven by labor shortages and advancements in AI technology.
Claire chatted to Shimon Whiteson from Waymo about machine learning for autonomous vehicles.
Shimon Whiteson is a Professor of Computer Science at the University of Oxford and a Senior Staff Research Scientist at Waymo UK. His research focuses on deep reinforcement learning and imitation learning, with applications in robotics and video games. He completed his doctorate at the University of Texas at Austin in 2007. He spent eight years as an Assistant and then an Associate Professor at the University of Amsterdam before joining Oxford as an Associate Professor in 2015. His spin-out company Latent Logic was acquired by Waymo in 2019.
Researchers at the University of Maryland, Baltimore County (UMBC) have extracted the building blocks of precise hand gestures used in the classical Indian dance form Bharatanatyam—and found a richer "alphabet" of movement compared to natural grasps. The work could improve how we teach hand movements to robots and offer humans better tools for physical therapy.
Enterprise AI World 2025, co-located with KMWorld 2025, offered a clear signal this year: the era of “drop a chatbot on the intranet and call it transformation” is over. The conversations shifted toward AI that sits inside real work—capturing tacit […]
Underwater octopuses change their body color and texture in the blink of an eye to blend perfectly into their surroundings when evading predators or capturing prey. They transform their bodies to match the colors of nearby corals or seaweed, turning blue or red, and move by softly curling their arms or snatching prey.
For LAPP USA, Corvus One turned a labor-intensive, error-prone process into a nightly automated workflow that provides reliable visibility, reduces costs, and improves customer service.
EPFL scientists have integrated discarded crustacean shells into robotic devices, leveraging the strength and flexibility of natural materials for robotic applications.
The ReWiND method consists of three phases: learning a reward function, pre-training a policy, and using the reward function and pre-trained policy to learn a new language-specified task online.
In their paper ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, which was presented at CoRL 2025, Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh A. Sontakke, Joseph J. Lim, Jesse Thomason, Erdem Bıyık and Jesse Zhang introduce a framework for learning robot manipulation tasks solely from language instructions without per-task demonstrations. We asked Jiahui Zhang and Jesse Zhang to tell us more.
What is the topic of the research in your paper, and what problem were you aiming to solve?
Our research addresses the problem of enabling robot manipulation policies to solve novel, language-conditioned tasks without collecting new demonstrations for each task. We begin with a small set of demonstrations in the deployment environment, train a language-conditioned reward model on them, and then use that learned reward function to fine-tune the policy on unseen tasks, with no additional demonstrations required.
Tell us about ReWiND – what are the main features and contributions of this framework?
ReWiND is a simple and effective three-stage framework designed to adapt robot policies to new, language-conditioned tasks without collecting new demonstrations. Its main features and contributions are:
Reward function learning in the deployment environment
We first learn a reward function using only five demonstrations per task from the deployment environment.
The reward model takes a sequence of images and a language instruction, and predicts per-frame progress from 0 to 1, giving us a dense reward signal instead of sparse success/failure.
To expose the model to both successful and failed behaviors without having to collect demonstrations of failures, we introduce a video rewind augmentation: for a video segment V(1:t), we choose an intermediate point t1, reverse the segment V(t1:t) to create V(t:t1), and append it back to the original sequence. This generates a synthetic sequence that resembles “making progress, then undoing progress,” effectively simulating failed attempts.
This allows the reward model to learn a smoother and more accurate dense reward signal, improving generalization and stability during policy learning.
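As a rough illustration of the rewind augmentation (our sketch, not the authors’ code), the snippet below takes the first t frames of a demo, picks an intermediate index t1, and appends the segment from t1 to t played in reverse, producing a clip that makes progress and then undoes it:

```python
# Sketch of the video rewind augmentation described above (not the authors' code).
# frames: array of shape (T, H, W, C); returns a "progress then undo" sequence.
import numpy as np

def rewind_augment(frames: np.ndarray, t: int, t1: int) -> np.ndarray:
    assert 0 < t1 < t <= len(frames)
    forward = frames[:t]              # V(1:t): the original progress
    rewound = frames[t1:t][::-1]      # V(t:t1): the same segment reversed
    return np.concatenate([forward, rewound], axis=0)

# Example: a 50-frame demo, rewound from frame 40 back toward frame 25.
demo = np.zeros((50, 64, 64, 3), dtype=np.uint8)
augmented = rewind_augment(demo, t=40, t1=25)
print(augmented.shape)  # (55, 64, 64, 3): 40 forward frames + 15 reversed frames
```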
Policy pre-training with offline RL
Once we have the learned reward function, we use it to relabel the small demonstration dataset with dense progress rewards. We then train a policy offline using these relabeled trajectories.
Policy fine-tuning in the deployment environment
Finally, we adapt the pre-trained policy to new, unseen tasks in the deployment environment. We freeze the reward function and use it as the feedback for online reinforcement learning. After each episode, the newly collected trajectory is relabeled with dense rewards from the reward model and added to the replay buffer. This iterative loop allows the policy to continually improve and adapt to new tasks without requiring any additional demonstrations.
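Schematically, the fine-tuning loop looks something like the sketch below (ours, not the authors’ implementation); env, policy, reward_model, and replay_buffer are placeholder objects standing in for the learned components and the RL machinery:

```python
# Schematic online fine-tuning loop (our sketch, not the authors' code):
# the frozen reward model relabels each new rollout with dense progress
# rewards before the off-policy RL update.
def finetune_on_new_task(env, policy, reward_model, replay_buffer,
                         instruction, num_episodes=200):
    for _ in range(num_episodes):
        frames, actions = [], []
        obs = env.reset()
        done = False
        while not done:
            action = policy.act(obs, instruction)
            obs, _, done, _ = env.step(action)   # environment reward is ignored
            frames.append(obs["image"])
            actions.append(action)
        # Frozen reward model predicts per-frame progress in [0, 1] for the rollout.
        rewards = reward_model.predict_progress(frames, instruction)
        replay_buffer.add_trajectory(frames, actions, rewards)
        policy.update(replay_buffer)             # off-policy RL update on relabeled data
    return policy
```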
Could you talk about the experiments you carried out to test the framework?
We evaluate ReWiND in both the MetaWorld simulation environment and the Koch real-world setup. Our analysis focuses on two aspects: the generalization ability of the reward model and the effectiveness of policy learning. We also compare how well different policies adapt to new tasks under our framework, demonstrating significant improvements over state-of-the-art methods.
(Q1) Reward generalization – MetaWorld analysis
We collect a MetaWorld dataset of 20 training tasks, with 5 demos per task, and hold out 17 related but unseen tasks for evaluation. We train the reward function on this MetaWorld dataset and a subset of the OpenX dataset.
We compare ReWiND to LIV[1], LIV-FT, RoboCLIP[2], VLC[3], and GVL[4]. For generalization to unseen tasks, we use video–language confusion matrices. We feed the reward model video sequences paired with different language instructions and expect the correctly matched video–instruction pairs to receive the highest predicted rewards. In the confusion matrix, this corresponds to the diagonal entries having the strongest (darkest) values, indicating that the reward function reliably identifies the correct task description even for unseen tasks.
Video-language reward confusion matrix. See the paper for more information.
For demo alignment, we measure the correlation between the reward model’s predicted progress and the actual time steps in successful trajectories using Pearson r and Spearman ρ. For policy rollout ranking, we evaluate whether the reward function correctly ranks failed, near-success, and successful rollouts. Across these metrics, ReWiND significantly outperforms all baselines—for example, it achieves 30% higher Pearson correlation and 27% higher Spearman correlation than VLC on demo alignment, and delivers about 74% relative improvement in reward separation between success categories compared with the strongest baseline LIV-FT.
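For readers unfamiliar with the demo-alignment metric, it simply correlates the model’s predicted per-frame progress with the frame index of a successful demonstration; a generic sketch (reward_model here is a placeholder, not the released code):

```python
# Generic sketch of the demo-alignment metric: correlate predicted per-frame
# progress with the frame index of a successful demonstration.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def demo_alignment(reward_model, demo_frames, instruction):
    progress = np.asarray(reward_model.predict_progress(demo_frames, instruction))
    timesteps = np.arange(len(progress))
    r, _ = pearsonr(timesteps, progress)       # linear agreement
    rho, _ = spearmanr(timesteps, progress)    # monotonic (rank) agreement
    return r, rho
```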
(Q2) Policy learning in simulation (MetaWorld)
We pre-train on the same 20 tasks and then evaluate RL on 8 unseen MetaWorld tasks for 100k environment steps.
Using ReWiND rewards, the policy achieves an interquartile mean (IQM) success rate of approximately 79%, representing a ~97.5% improvement over the best baseline. It also demonstrates substantially better sample efficiency, achieving higher success rates much earlier in training.
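For reference, the interquartile mean used here is the mean of the middle 50% of scores, a robust alternative to the plain average; a quick illustrative computation:

```python
# Interquartile mean (IQM): average of the scores lying between the
# 25th and 75th percentiles, which downweights outlier runs.
import numpy as np

def iqm(scores):
    scores = np.sort(np.asarray(scores, dtype=float))
    q25, q75 = np.percentile(scores, [25, 75])
    middle = scores[(scores >= q25) & (scores <= q75)]
    return middle.mean()

# Example with made-up per-task success rates:
print(iqm([0.2, 0.6, 0.7, 0.8, 0.9, 1.0]))  # averages only the central values
```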
(Q3) Policy learning in real robot (Koch bimanual arms)
Setup: a real-world tabletop bimanual Koch v1.1 system with five tasks, including in-distribution, visually cluttered, and spatial-language generalization tasks.
We use 5 demos for the reward model and 10 demos for the policy in this more challenging setting. With about 1 hour of real-world RL (~50k env steps), ReWiND improves average success from 12% → 68% (≈5× improvement), while VLC only goes from 8% → 10%.
Are you planning future work to further improve the ReWiND framework?
Yes, we plan to extend ReWiND to larger models and further improve the accuracy and generalization of the reward function across a broader range of tasks. In fact, we already have a workshop paper extending ReWiND to larger-scale models.
In addition, we aim to make the reward model capable of directly predicting success or failure, without relying on the environment’s success signal during policy fine-tuning. Currently, even though ReWiND provides dense rewards, we still rely on the environment to indicate whether an episode has been successful. Our goal is to develop a fully generalizable reward model that can provide both accurate dense rewards and reliable success detection on its own.
References
[1] Yecheng Jason Ma et al. “LIV: Language-image representations and rewards for robotic control.” International Conference on Machine Learning. PMLR, 2023.
[2] Sumedh Sontakke et al. “RoboCLIP: One demonstration is enough to learn robot policies.” Advances in Neural Information Processing Systems 36 (2023): 55681-55693.
[3] Minttu Alakuijala et al. “Video-Language Critic: Transferable reward functions for language-conditioned robotics.” arXiv:2405.19988 (2024).
[4] Yecheng Jason Ma et al. “Vision language models are in-context value learners.” The Thirteenth International Conference on Learning Representations. 2025.
About the authors
Jiahui Zhang is a Ph.D. student in Computer Science at the University of Texas at Dallas, advised by Prof. Yu Xiang. He received his M.S. degree from the University of Southern California, where he worked with Prof. Joseph Lim and Prof. Erdem Bıyık.
Jesse Zhang is a postdoctoral researcher at the University of Washington, advised by Prof. Dieter Fox and Prof. Abhishek Gupta. He completed his Ph.D. at the University of Southern California, advised by Prof. Jesse Thomason and Prof. Erdem Bıyık at USC, and Prof. Joseph J. Lim at KAIST.
In the future, tiny flying robots could be deployed to aid in the search for survivors trapped beneath the rubble after a devastating earthquake. Like real insects, these robots could flit through tight spaces larger robots can't reach, while simultaneously dodging stationary obstacles and pieces of falling rubble.
Imagine having a continuum soft robotic arm bend around a bunch of grapes or broccoli, adjusting its grip in real time as it lifts the object. Unlike traditional rigid robots that generally aim to avoid contact with the environment as much as possible and stay far away from humans for safety reasons, this arm senses subtle forces, stretching and flexing in ways that mimic the compliance of a human hand. Its every motion is calculated to avoid excessive force while achieving the task efficiently.