
Emergent Bartering Behaviour in Multi-Agent Reinforcement Learning

In our recent paper, we explore how populations of deep reinforcement learning (deep RL) agents can learn microeconomic behaviours, such as production, consumption, and trading of goods. We find that artificial agents learn to make economically rational decisions about production, consumption, and prices, and react appropriately to supply and demand changes.

Designing societally beneficial Reinforcement Learning (RL) systems

By Nathan Lambert, Aaron Snoswell, Sarah Dean, Thomas Krendl Gilbert, and Tom Zick

Deep reinforcement learning (DRL) is transitioning from a research field focused on game playing to a technology with real-world applications. Notable examples include DeepMind’s work on controlling the plasma in a nuclear fusion reactor and on improving YouTube video compression, or Tesla attempting to use a method inspired by MuZero for autonomous vehicle behavior planning. But the exciting potential for real-world applications of RL should also come with a healthy dose of caution: for example, RL policies are well known to be vulnerable to exploitation, and methods for safe and robust policy development are an active area of research.

At the same time as powerful RL systems emerge in the real world, the public and researchers are expressing an increased appetite for fair, aligned, and safe machine learning systems. The focus of these research efforts to date has been to account for shortcomings of datasets or supervised learning practices that can harm individuals. However, the unique ability of RL systems to leverage temporal feedback in learning complicates the types of risks and safety concerns that can arise.

This post expands on our recent whitepaper and research paper, where we aim to illustrate the different modalities that harms can take when augmented with the temporal axis of RL. To combat these novel societal risks, we also propose a new kind of documentation for dynamic machine learning systems, which aims to assess and monitor these risks both before and after deployment.

What’s Special About RL? A Taxonomy of Feedback

Reinforcement learning systems are often spotlighted for their ability to act in an environment, rather than passively make predictions. Supervised machine learning systems, such as computer vision models, consume data and return a prediction that can be used by some decision-making rule. In contrast, the appeal of RL is in its ability not only to (a) directly model the impact of actions, but also to (b) improve policy performance automatically. These key properties of acting upon an environment and learning within it can be understood by considering the different types of feedback that come into play when an RL agent acts within an environment. We classify these feedback forms in a taxonomy of (1) Control, (2) Behavioral, and (3) Exogenous feedback. The first two notions of feedback, Control and Behavioral, fall directly within the formal mathematical definition of an RL agent, while Exogenous feedback is induced as the agent interacts with the broader world.

1. Control Feedback

First is control feedback – in the control systems engineering sense – where the action taken depends on the current measurements of the state of the system. RL agents choose actions based on an observed state according to a policy, which generates environmental feedback. For example, a thermostat turns on a furnace according to the current temperature measurement. Control feedback gives an agent the ability to react to unforeseen events (e.g. a sudden cold snap) autonomously.
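To make this concrete, here is a minimal Python sketch of control feedback using the thermostat example. The setpoint, hysteresis band, and toy dynamics are our own illustrative choices, not from the paper:

```python
# A minimal sketch of control feedback: the action taken at each step
# depends only on the current measurement of the system state.
# The setpoint and hysteresis band are illustrative values.

def thermostat_policy(measured_temp: float, setpoint: float = 20.0,
                      band: float = 0.5) -> str:
    """Bang-bang controller: act on the current observation alone."""
    if measured_temp < setpoint - band:
        return "furnace_on"
    elif measured_temp > setpoint + band:
        return "furnace_off"
    return "hold"

# Control feedback loop: observe -> act -> environment responds -> observe.
temp = 15.0  # a sudden cold snap has driven the temperature down
for step in range(5):
    action = thermostat_policy(temp)
    temp += 1.5 if action == "furnace_on" else -0.5  # toy dynamics
    print(step, action, round(temp, 1))
```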

Figure 1: Control Feedback.

2. Behavioral Feedback

Next in our taxonomy of RL feedback is ‘behavioral feedback’: the trial-and-error learning that enables an agent to improve its policy through interaction with the environment. This could be considered the defining feature of RL, as compared to, e.g., ‘classical’ control theory. Policies in RL can be defined by a set of parameters that determine the actions the agent takes in the future. Because these parameters are updated through behavioral feedback, they are actually a reflection of the data collected from executions of past policy versions. RL agents are not fully ‘memoryless’ in this respect: the current policy depends on stored experience and impacts newly collected data, which in turn impacts future versions of the agent. To continue the thermostat example, a ‘smart home’ thermostat might analyze historical temperature measurements and adapt its control parameters in accordance with seasonal shifts in temperature, for instance adopting a more aggressive control scheme during winter months.
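As a minimal sketch of behavioral feedback (the two-action bandit setup and epsilon-greedy rule below are illustrative assumptions, not from the paper), note how the parameters that select actions are themselves a product of data gathered under earlier versions of those parameters:

```python
import random

# Behavioral feedback sketch: policy parameters (q_values) are updated
# from collected experience, so the current policy reflects data gathered
# under earlier versions of itself.

q_values = {"a": 0.0, "b": 0.0}    # the policy's parameters
counts = {"a": 0, "b": 0}
true_means = {"a": 0.3, "b": 0.7}  # unknown to the agent

def policy(epsilon: float = 0.1) -> str:
    """Epsilon-greedy: behavior depends on the parameters learned so far."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

for t in range(1000):
    a = policy()                                  # current parameters pick the action
    r = random.gauss(true_means[a], 0.1)          # environment responds
    counts[a] += 1
    q_values[a] += (r - q_values[a]) / counts[a]  # experience updates the parameters

print(q_values)  # the learned parameters now favor action "b"
```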

Figure 2: Behavioral Feedback.

3. Exogenous Feedback

Finally, we can consider a third form of feedback external to the specified RL environment, which we call Exogenous (or ‘exo’) feedback. While RL benchmarking tasks may be static environments, every action in the real world impacts the dynamics of both the target deployment environment and adjacent environments. For example, a news recommendation system that is optimized for clickthrough may change the way editors write headlines, pushing them towards attention-grabbing clickbait. In this RL formulation, the set of articles to be recommended would be considered part of the environment and expected to remain static, but exposure incentives cause it to shift over time.

To continue the thermostat example, as a ‘smart thermostat’ continues to adapt its behavior over time, the behavior of other adjacent systems in a household might change in response – for instance other appliances might consume more electricity due to increased heat levels, which could impact electricity costs. Household occupants might also change their clothing and behavior patterns due to different temperature profiles during the day. In turn, these secondary effects could also influence the temperature which the thermostat monitors, leading to a longer timescale feedback loop.
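Here is a minimal sketch of this longer-timescale loop, with an occupant-driven heat term that sits entirely outside the agent's model of the world. All quantities are illustrative:

```python
# Exogenous feedback sketch: an adjacent system (household occupants)
# adapts to the thermostat's behavior outside the agent's model,
# slowly shifting the dynamics the agent actually faces.

temp = 20.0
occupant_heat = 0.0  # heat from appliances/behavior; NOT in the agent's model

for day in range(10):
    action_on = temp < 20.0             # the agent's (fixed) control rule
    temp += 1.0 if action_on else -1.0  # the dynamics the designer specified
    # Exo-feedback: occupants respond to sustained warmth by running more
    # appliances, which feeds back into the temperature the agent observes.
    occupant_heat += 0.1 if temp > 20.0 else -0.05
    occupant_heat = max(occupant_heat, 0.0)
    temp += occupant_heat
    print(day, round(temp, 2), round(occupant_heat, 2))
```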

The negative costs of these external effects will not be specified in the agent-centric reward function, leaving these external environments open to manipulation or exploitation. Exo-feedback is by definition difficult for a designer to predict. Instead, we propose that it should be addressed by documenting the evolution of the agent, the targeted environment, and adjacent environments.

Figure 3: Exogenous (exo) Feedback.

How can RL systems fail?

Let’s consider how two key properties can lead to failure modes specific to RL systems: direct action selection (via control feedback) and autonomous data collection (via behavioral feedback).

First is decision-time safety. One current practice in RL research for making decisions safer is to augment the agent’s reward function with a penalty term for certain harmful or undesirable states and actions. For example, in a robotics domain we might penalize certain actions (such as extremely large torques) or state-action tuples (such as carrying a glass of water over sensitive equipment). However, it is difficult to anticipate where along a pathway an agent may encounter a crucial action, such that failure would result in an unsafe event. This aspect of how reward functions interact with optimizers is especially problematic for deep learning systems, where numerical guarantees are challenging.
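A minimal sketch of this reward-augmentation practice follows; the penalty function, field names, and weight are our own illustrative assumptions:

```python
# Reward shaping sketch: subtract a penalty term for harmful or
# undesirable states and actions from the task reward.

def penalty(state: dict, action: dict) -> float:
    """Return a positive cost for undesirable state-action pairs."""
    cost = 0.0
    if abs(action.get("torque", 0.0)) > 10.0:        # extremely large torque
        cost += 1.0
    if state.get("over_sensitive_equipment") and state.get("carrying_water"):
        cost += 5.0                                   # risky state-action tuple
    return cost

def shaped_reward(task_reward: float, state: dict, action: dict,
                  lam: float = 0.5) -> float:
    return task_reward - lam * penalty(state, action)

s = {"over_sensitive_equipment": True, "carrying_water": True}
a = {"torque": 2.0}
print(shaped_reward(1.0, s, a))  # 1.0 - 0.5 * 5.0 = -1.5
```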

Figure 4: Decision time failure illustration.

As an RL agent collects new data and the policy adapts, there is a complex interplay between current parameters, stored data, and the environment that governs the evolution of the system. Changing any one of these three sources of information will change the future behavior of the agent, and moreover these three components are deeply intertwined. This uncertainty makes it difficult to back out the cause of failures or successes.

In domains where many behaviors can possibly be expressed, the RL specification leaves many of the factors constraining behavior unsaid. For a robot learning locomotion over an uneven environment, it would be useful to know what signals in the system indicate whether it will learn to find an easier route rather than a more complex gait. In complex situations with less well-defined reward functions, these intended or unintended behaviors will encompass a much broader range of capabilities, which may or may not have been accounted for by the designer.

Figure 5: Behavior estimation failure illustration.

While these failure modes are closely related to control and behavioral feedback, Exo-feedback does not map as clearly to one type of error and introduces risks that do not fit into simple categories. Understanding exo-feedback requires that stakeholders in the broader communities (machine learning, application domains, sociology, etc.) work together on real world RL deployments.

Risks with real-world RL

Here, we discuss four types of design choices an RL designer must make, and how these choices can have an impact upon the socio-technical failures that an agent might exhibit once deployed.

Scoping the Horizon

Determining the timescale on which an RL agent can plan impacts the possible and actual behavior of that agent. In the lab, it may be common to tune the horizon length until the desired behavior is achieved. But in real-world systems, optimizations will externalize costs depending on the defined horizon. For example, an RL agent controlling an autonomous vehicle will have very different goals and behaviors if the task is to stay in a lane, navigate a contested intersection, or route across a city to a destination. This is true even if the objective (e.g. “minimize travel time”) remains the same.
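A minimal sketch of this effect under the same per-step objective (a reward of -1 per step, i.e. “minimize travel time”), with two illustrative toy plans:

```python
# Horizon-scoping sketch: which plan is "optimal" flips with the horizon,
# even though the objective (minimize total time) never changes.

def truncated_return(step_costs, horizon):
    """Sum of rewards (-cost) over the first `horizon` steps of a plan."""
    return -sum(step_costs[:horizon])

# Plan A: a quick merge that looks good locally but leads to congestion.
plan_a = [1, 1, 1] + [5] * 20
# Plan B: a slower maneuver now that avoids congestion later.
plan_b = [3, 3, 3] + [1] * 20

for horizon in (3, 23):
    best = max(("A", plan_a), ("B", plan_b),
               key=lambda p: truncated_return(p[1], horizon))
    print(f"horizon={horizon}: plan {best[0]} is preferred")
# A short horizon prefers plan A; a long horizon prefers plan B.
```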

Figure 6: Scoping the horizon example with an autonomous vehicle.

Defining Rewards

A second design choice is that of actually specifying the reward function to be maximized. This immediately raises the well-known risk of RL systems, reward hacking, where the agent discovers behaviors that maximize the specified reward while subverting the designer’s intent. In a deployed RL system, this often results in unexpected exploitative behavior – from bizarre video game agents to policies that induce errors in robotics simulators. For example, if an agent is presented with the problem of navigating a maze to reach the far side, a mis-specified reward might result in the agent avoiding the task entirely to minimize the time taken.
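A minimal sketch of such a mis-specification; the step penalty, goal bonus, and penalty-free early exit are illustrative assumptions:

```python
# Reward-hacking sketch: the designer penalizes each time step, intending
# to encourage fast maze-solving, but an episode also ends (penalty-free)
# if the agent gives up at the start. Quitting becomes "optimal".

GOAL_BONUS = 10.0
STEP_PENALTY = -1.0

def episode_return(policy: str) -> float:
    if policy == "solve_maze":        # takes 15 steps, then gets the bonus
        return 15 * STEP_PENALTY + GOAL_BONUS
    if policy == "quit_immediately":  # 1 step, episode over, no more penalty
        return 1 * STEP_PENALTY
    raise ValueError(policy)

for p in ("solve_maze", "quit_immediately"):
    print(p, episode_return(p))
# quit_immediately (-1.0) beats solve_maze (-5.0): under the specified
# reward, avoiding the task entirely is optimal, contrary to intent.
```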

Figure 7: Defining rewards example with maze navigation.

Pruning Information

A common practice in RL research is to redefine the environment to fit one’s needs – RL designers make numerous explicit and implicit assumptions to model tasks in a way that makes them amenable to virtual RL agents. In highly structured domains, such as video games, this can be rather benign. However, in the real world, redefining the environment amounts to changing the ways information can flow between the world and the RL agent. This can dramatically change the meaning of the reward function and offload risk to external systems. For example, an autonomous vehicle with sensors focused only on the road surface shifts the burden of safety from AV designers to pedestrians. In this case, the designer is pruning out information about the surrounding environment that is actually crucial to robustly safe integration within society.
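A minimal sketch of information pruning as an observation filter; the observation fields are illustrative:

```python
# Information-pruning sketch: a wrapper that redefines the environment by
# filtering what the agent can observe. Dropping "pedestrian" signals
# changes what any reward defined on the remaining observation can mean.

FULL_OBSERVATION = {
    "lane_markings": [0.1, 0.9],
    "vehicle_speed": 23.5,
    "pedestrian_positions": [(4.2, 1.0)],  # crucial for safe integration
    "sidewalk_activity": "high",
}

ROAD_ONLY_KEYS = {"lane_markings", "vehicle_speed"}

def prune(observation: dict, keep: set) -> dict:
    """The designer's modeling choice: the agent never sees dropped keys."""
    return {k: v for k, v in observation.items() if k in keep}

agent_view = prune(FULL_OBSERVATION, ROAD_ONLY_KEYS)
print(agent_view)  # pedestrians are now invisible to policy and reward alike
```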

Figure 8: Information shaping example with an autonomous vehicle.

Training Multiple Agents

There is growing interest in the problem of multi-agent RL, but as an emerging research area, little is known about how learning systems interact within dynamic environments. When the relative concentration of autonomous agents increases within an environment, the terms these agents optimize for can actually re-wire norms and values encoded in that specific application domain. An example would be the changes in behavior that will come if the majority of vehicles are autonomous and communicating (or not) with each other. In this case, if the agents have autonomy to optimize toward a goal of minimizing transit time (for example), they could crowd out the remaining human drivers and heavily disrupt accepted societal norms of transit.

Figure 9: The risks of multi-agency example on autonomous vehicles.

Making sense of applied RL: Reward Reporting

In our recent whitepaper and research paper, we proposed Reward Reports, a new form of ML documentation that foregrounds the societal risks posed by sequential data-driven optimization systems, whether explicitly constructed as an RL agent or implicitly construed via data-driven optimization and feedback. Building on proposals to document datasets and models, we focus on reward functions: the objective that guides optimization decisions in feedback-laden systems. Reward Reports comprise questions that highlight the promises and risks entailed in defining what is being optimized in an AI system, and are intended as living documents that dissolve the distinction between ex-ante (design) specification and ex-post (after the fact) harm. As a result, Reward Reports provide a framework for ongoing deliberation and accountability before and after a system is deployed.

Our proposed template for a Reward Report consists of several sections, arranged to help the reporter themselves understand and document the system. A Reward Report begins with (1) system details that contain the information context for deploying the model. From there, the report documents (2) the optimization intent, which questions the goals of the system and why RL or ML may be a useful tool. The designer then documents (3) how the system may affect different stakeholders in the institutional interface. The next two sections contain technical details on (4) the system implementation and (5) evaluation. Reward Reports conclude with (6) plans for system maintenance as additional system dynamics are uncovered.
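A minimal sketch of this six-section structure as a data type; the field names are our paraphrase of the template’s sections, not its exact schema:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of a Reward Report's structure, plus the change log that makes
# the document a living one. Field names are illustrative paraphrases.

@dataclass
class ChangeLogEntry:
    date: str
    change: str     # what was modified (reward, data, environment, ...)
    rationale: str  # why, and what harm or dynamic prompted it

@dataclass
class RewardReport:
    system_details: str           # (1) deployment context for the model
    optimization_intent: str      # (2) goals; why RL/ML is the right tool
    institutional_interface: str  # (3) stakeholders affected, and how
    implementation: str           # (4) technical details of the system
    evaluation: str               # (5) how performance and harms are measured
    maintenance_plan: str         # (6) plans as new dynamics are uncovered
    change_log: List[ChangeLogEntry] = field(default_factory=list)

report = RewardReport("smart thermostat", "minimize energy cost",
                      "occupants, utility", "tabular RL", "monthly audit",
                      "quarterly review")
report.change_log.append(ChangeLogEntry(
    "2022-06-01", "added comfort penalty to reward",
    "occupants reported overly aggressive control"))
```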

The most important feature of a Reward Report is that it allows documentation to evolve over time, in step with the temporal evolution of an online, deployed RL system! This is most evident in the change-log, which we locate at the end of our Reward Report template:

Figure 10: Reward Reports contents.

What would this look like in practice?

As part of our research, we have developed a Reward Report LaTeX template, as well as several example Reward Reports that aim to illustrate the kinds of issues that could be managed by this form of documentation. These examples include the temporal evolution of the MovieLens recommender system, the DeepMind MuZero game-playing system, and a hypothetical deployment of an RL autonomous vehicle policy for managing merging traffic, based on the Project Flow simulator.

However, these are just examples that we hope will serve to inspire the RL community: as more RL systems are deployed in real-world applications, we hope the research community will build on our ideas for Reward Reports and refine the specific content that should be included. To this end, we hope that you will join us at our (un)-workshop.

Work with us on Reward Reports: An (Un)Workshop!

We are hosting an “un-workshop” at the upcoming conference on Reinforcement Learning and Decision Making (RLDM) on June 11th from 1:00-5:00pm EST at Brown University, Providence, RI. We call this an un-workshop because we are looking for the attendees to help create the content! We will provide templates, ideas, and discussion as our attendees build out example reports. We are excited to develop the ideas behind Reward Reports with real-world practitioners and cutting-edge researchers.

For more information on the workshop, visit the website or contact the organizers at geese-org@lists.berkeley.edu.

This post is based on the following papers:

Reward Reports for Reinforcement Learning

Choices, Risks, and Reward Reports: Charting Public Policy for Reinforcement Learning Systems

Innovative ‘smart socks’ could help millions living with dementia

Left: The display that carers will see in the Milbotix app. Right: Milbotix founder and CEO Dr Zeke Steer

Inventor Dr Zeke Steer quit his job and took a PhD at Bristol Robotics Laboratory so he could find a way to help people like his great-grandmother, who became anxious and aggressive because of her dementia.

Milbotix’s smart socks track heart rate, sweat levels and motion to give insights on the wearer’s wellbeing – most importantly how anxious the person is feeling.

They look and feel like normal socks, do not need charging, are machine washable and provide a steady stream of data to carers, who can easily see their patient’s metrics on an app.

Current alternatives to Milbotix’s product are worn on the wrist, which can stigmatise wearers or even cause more stress.

Dr Steer said: “The foot is actually a great place to collect data about stress, and socks are a familiar piece of clothing that people wear every day.

“Our research shows that the socks can accurately recognise signs of stress – which could really help not just those with dementia and autism, but their carers too.”

Dr Steer was working as a software engineer in the defence industry when his great-grandmother, Kath, began showing the ill effects of dementia.

Once gentle and with a passion for jazz music, Kath became agitated and aggressive, and eventually accused Dr Steer’s grandmother of stealing from her.

Dr Steer decided to investigate how wearable technologies and artificial intelligence could help with his great-grandmother’s symptoms. He studied for a PhD at Bristol Robotics Laboratory, which is jointly run by the University of Bristol and UWE Bristol.

During the research, he volunteered at a dementia care home operated by the St Monica Trust. Garden House Care Home Manager Fran Ashby said: “Zeke’s passion was clear from his first day with us and he worked closely with staff, relatives and residents to better understand the effects and treatment of dementia.

“We were really impressed at the potential of his assistive technology to predict impending agitation and help alert staff to intervene before it can escalate into distressed behaviours.

“Using modern assistive technology examples like smart socks can help enable people living with dementia to retain their dignity and have better quality outcomes for their day-to-day life.”

While volunteering, Dr Steer hit upon the idea for Milbotix, which he launched as a business in February 2020.

“I came to see that my great-grandmother wasn’t an isolated episode, and that distressed behaviours are very common,” he explained.

Milbotix are currently looking to work with innovative social care organisations to refine and evaluate the smart socks.

The business recently joined SETsquared Bristol, the University’s world-leading incubator for high growth tech businesses.

Dr Steer was awarded one of their Breakthrough Bursaries, which provide heavily subsidised membership to founders from diverse backgrounds. Dr Steer is also currently on the University’s QUEST programme, which supports founders in commercialising their products.

Charity Alzheimer’s Society says there will be 1.6 million people with dementia in the UK by 2040, with one person developing dementia every three minutes. Dementia is thought to cost the UK £34.7 billion a year.

Meanwhile, according to the Government, autism affects 1% of the UK population, or some 700,000 people, 15-30% of whom are non-verbal part or all of the time.

Dr Steer is now growing the business: testing the socks with people living with mid- to late-stage dementia and developing the tech before bringing the product to market next year. Milbotix will begin a funding round later this year.

Milbotix is currently a team of three, including Jacqui Arnold, who has been working with people living with dementia for 40 years.

She said: “These socks could make such a difference. Having that early indicator of someone’s stress levels rising could provide the early intervention they need to reduce their distress – be that touch, music, pain relief or simply having someone there with them.”

Milbotix will be supported by Alzheimer’s Society through their Accelerator Programme, which is helping fund the smart socks’ development, providing innovation support and helping test what it described as a “brilliant product”.

Natasha Howard-Murray, Senior Innovator at Alzheimer’s Society, said: “Some people with dementia may present behaviours such as aggression, irritability and resistance to care.

“This innovative wearable tech is a fantastic, accessible way for staff to better monitor residents’ distress and agitation.”

Professor Judith Squires, Deputy Vice-Chancellor at the University of Bristol, said: “It is fantastic to see Zeke using the skills he learnt with us to improve the wellbeing of some of those most in need.

“The innovative research that Zeke has undertaken has the potential to help millions live better lives. We hope to see Milbotix flourish.”

What producers of Star Wars movies are getting wrong about androids

Robin Murphy, a roboticist at Texas A&M University, has published a Focus piece in the journal Science Robotics outlining her views on the robots portrayed in "Star Wars," particularly those featured in "The Mandalorian" and "The Book of Boba Fett." In her article, she writes that the portrayals of robots in both productions are quite creative, but suggests they are not wild enough to compete with robots that are made and used in the real world today.

The Imperative to Automate to Relieve Labor Constraints: The Reason to Attend the Automate Show

Because there is such a shortage of fork truck drivers for the foreseeable future, automation has become an imperative. The only alternative is to automate the movement of finished goods (or sub-assemblies) from docks to warehouses and from warehouses to docks.

A Generalist Agent

Inspired by progress in large-scale language modelling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

A reconfigurable robotic system for cleaning and maintenance

Reconfigurable or "transformer" systems are robots or other systems that can adapt their state, configuration, or morphology to perform different tasks more effectively. In recent years, roboticists and computer scientists worldwide have developed new autonomous and reconfigurable systems for various applications, including surveillance, cleaning, maintenance, and search and rescue.

Is AI-generated art really creative? It depends on the presentation

Ai-Da sits behind a desk, paintbrush in hand. She looks up at the person posing for her, and then back down as she dabs another blob of paint onto the canvas. A lifelike portrait is taking shape. If you didn't know a robot produced it, this portrait could pass as the work of a human artist.

A new robotic system for automated laundry

Researchers at the University of Bologna and Electrolux have recently developed a new robotic system that could assist humans with one of their most common everyday chores: doing laundry. The system, introduced in a paper published in Springer's Human-Friendly Robotics proceedings, was successfully trained to insert items into a washing machine and pick them up once a washing cycle is complete.

Swiss Robotics Day showcases innovations and collaborations between academia and industry

As the next edition of the Swiss Robotics Day is in preparation in Lausanne, let’s revisit the November 2021 edition, where the vitality and richness of Switzerland’s robotics scene was on full display at StageOne Event and Convention Hall in Zurich. It was the first edition of NCCR Robotics’s flagship event after the pandemic, and it surpassed the scale of previous editions, drawing in almost 500 people. You can see the photo gallery here.

Welcome notes from ETH President Joël Mesot and NCCR Robotics Director Dario Floreano opened a dense conference programme, chaired by NCCR Robotics co-Director Robert Riener, which included scientific presentations from Marco Hutter (ETH Zurich), Stéphanie Lacour and Herb Shea (both from EPFL), as well as industry perspectives from ABB’s Marina Bill, Simon Johnson from the Drone Industry Association and Hocoma co-founder Gery Colombo. A final roundtable – including Robert Riener, Hocoma’s Serena Maggioni, Liliana Paredes from Rehaklinik and Georg Rauter from the University of Basel – focused on the potential and the challenges of innovation in healthcare robotics.

Over 50 exhibitors – including scientific laboratories as well as start-ups and large companies – filled the 3,300 square-meter venue, demonstrating technologies ranging from mobile robots to wearable exoskeletons, from safe delivery drones to educational robots and much more. Sixteen young companies presented their innovations in a start-up carousel. Dozens of professional meetings took place throughout the day, allowing a diverse audience of entrepreneurs, funders, academics and policy makers to network and explore possible collaborations. A crowd of young researchers participated in a mentoring session where Marina Bill (ABB), Auke Ijspeert (EPFL) and Iselin Frøybu (Emovo Care) provided advice on academic and industrial careers. Sixteen students participated in the Cybathlon @school competition, experimenting with robotic technologies for disability, and the day closed with the official announcement of CYBATHLON 2024.

During the event, the next chapter for Swiss robotics was also announced: the launch of the NTN Innovation Booster on robotics, which will run from 2022 to 2025 and will be led by EPFL’s Aude Billard. Funded by Innosuisse, the NTN will act as a platform for new ideas and partnerships, supporting innovation through “idea generator bubbles” and specific funding calls.

The 2021 Swiss Robotics Day marked the beginning of NCCR Robotics’s final year. The project, launched in 2010, is on track to meet all its scientific goals in the three areas of wearable, rescue and educational robotics, while continuing to focus on supporting spin-offs, advancing robotics education and improving equality of opportunities for all robotics researchers. The conclusion of NCCR Robotics will be marked by the next edition of the Swiss Robotics Day, a larger, two-day public event that will take place in Lausanne on 4 and 5 November 2022.

Wearable robotics

The goal of the NCCR Grand Challenge on Wearable Robotics is to develop a novel generation of wearable robotic systems that will be more comfortable for patients and more extensively usable in a clinical environment. These new technological solutions will help in the recovery of movement and grasping after cerebrovascular accidents and spinal cord lesions. They can be used to enhance physiotherapy by improving training, thus encouraging the brain to repair networks (neurorehabilitation). And they can be used as assistive devices (e.g. prosthetic limbs and exoskeletons) to support paralysed people in daily life situations.

While current wearable robots are making huge advances in the lab, there is some way to go before they become part of everyday life for people with disabilities. In order to be functional, robots must work with the user and neither cause damage or irritation (in the case of externally worn devices) nor be rejected by the host (in the case of implants); they must have their own energy source that does not need to be constantly plugged in or recharged; and they need to be affordable.

Rescue robotics

After a natural disaster such as an earthquake or flood, it is often very dangerous for teams of rescue workers to go into affected areas to look for victims and survivors.

The idea behind robots for rescue activities is to create robust robots that can travel into areas too dangerous for humans and rescue dogs. Robots can be used to assess the situation, locate people who may be trapped, and relay their location back to the rescue teams, so that all efforts can be concentrated on areas where victims are known to be. Robots are also being developed to carry medical supplies and food, thereby focusing resources where they are most needed.

The main research issues within the field of mobile robotics for search and rescue missions are the durability and usability of robots: how to design robots that are easily transported, can function efficiently in all weather conditions, have long-lasting power, can navigate autonomously, and have sensors effective enough to pick out victims.

Educational robotics

In the 1970s and 1980s, robots were typically introduced in schools as a tool for teaching robotics or other Science, Technology, Engineering and Mathematics (STEM) subjects. However, this specificity held back their adoption for wider educational purposes. This early failure of adoption was also due to the robots being unreliable, expensive and limited in their applications.

Nowadays, with robots being cheaper and more easily deployable, applications in education have become easier. In the past fifteen years, an increasing number of extracurricular robotics activities have shown the popularity of robotics in informal educational contexts. However, robots are still underused in schools for formal education. Although there is no agreement over the exact reasons for this situation, it seems clear from different studies that teachers play a key role in the introduction of technology in schools.

During the first two phases of NCCR Robotics, two products were developed: the Thymio robot, a mobile robot increasingly used to teach robotics and programming, and Cellulo, a small, inexpensive and robust robot that kids can move with their hands and use in groups.

Current research focuses on two aspects. The first one is inventing new forms of interactions between learners and tangible swarms based on the Cellulo robot, and studying the learning outcomes enabled by these interactions.

The second aspect is investigating teacher adoption of robotics from two points of view: platform usability and teacher training. Research will show how to train teachers to use Thymio and Cellulo in their daily activities and how to minimize their orchestration load. Activities relating to computational thinking skills are the main target, with school topics outside the STEM domains also included.
