Archive 06.10.2020


Plan2Explore: Active model-building for self-supervised visual reinforcement learning

By Oleh Rybkin, Danijar Hafner and Deepak Pathak

To operate successfully in unstructured open-world environments, autonomous intelligent agents need to solve many different tasks and learn new tasks quickly. Reinforcement learning has enabled artificial agents to solve complex tasks both in simulation and in the real world. However, it requires collecting large amounts of experience in the environment for each individual task.

Self-supervised reinforcement learning has emerged as an alternative, where the agent only follows an intrinsic objective that is independent of any individual task, analogously to unsupervised representation learning. After acquiring general and reusable knowledge about the environment through self-supervision, the agent can adapt to specific downstream tasks more efficiently.


In this post, we explain our recent publication that develops Plan2Explore. While many recent papers on self-supervised reinforcement learning have focused on model-free agents, our agent learns an internal world model that predicts the future outcomes of potential actions. The world model captures general knowledge, allowing Plan2Explore to quickly solve new tasks through planning in its own imagination. The world model further enables the agent to explore what it expects to be novel, rather than repeating what it found novel in the past. Plan2Explore obtains state-of-the-art zero-shot and few-shot performance on continuous control benchmarks with high-dimensional input images. To make it easy to experiment with our agent, we are open-sourcing the complete source code.

How does Plan2Explore work?

At a high level, Plan2Explore works by training a world model, exploring to maximize the information gain for the world model, and using the world model at test time to solve new tasks (see figure above). Thanks to effective exploration, the learned world model is general and captures information that can be used to solve multiple new tasks with no or few additional environment interactions. We discuss each part of the Plan2Explore algorithm individually below. We assume a basic understanding of reinforcement learning in this post and otherwise recommend these materials as an introduction.

Learning the world model

Plan2Explore learns a world model that predicts future outcomes given past observations $o_{1:t}$ and actions $a_{1:t}$ (see figure below). To handle high-dimensional image observations, we encode them into lower-dimensional features $h$ and use an RSSM model that predicts forward in a compact latent state-space $s$, from which the observations can be decoded. The latent state aggregates information from past observations that is helpful for future prediction, and is learned end-to-end using a variational objective.
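As a concrete but heavily simplified illustration of this kind of latent forward model, the sketch below defines an encoder, a recurrent deterministic path, stochastic prior/posterior heads, and a decoder in PyTorch. The module names, sizes, and structure are assumptions made for illustration; the actual RSSM used by Plan2Explore differs in its details and is trained end-to-end with a variational objective on image reconstructions.

import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Minimal sketch of an RSSM-style latent forward model (illustrative only)."""

    def __init__(self, obs_dim=64 * 64 * 3, embed_dim=256, latent_dim=32, action_dim=6, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 512), nn.ELU(), nn.Linear(512, embed_dim))
        self.rnn = nn.GRUCell(latent_dim + action_dim, hidden_dim)           # deterministic path h_t
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)                   # p(s_t | h_t), used in imagination
        self.posterior = nn.Linear(hidden_dim + embed_dim, 2 * latent_dim)   # q(s_t | h_t, o_t), used with real data
        self.decoder = nn.Sequential(nn.Linear(hidden_dim + latent_dim, 512), nn.ELU(), nn.Linear(512, obs_dim))

    def step(self, latent, action, hidden, obs_embed=None):
        """Advance one step; with obs_embed we filter real data, without it we imagine."""
        hidden = self.rnn(torch.cat([latent, action], dim=-1), hidden)
        stats = self.posterior(torch.cat([hidden, obs_embed], dim=-1)) if obs_embed is not None else self.prior(hidden)
        mean, log_std = stats.chunk(2, dim=-1)
        latent = mean + log_std.exp() * torch.randn_like(mean)               # reparameterized sample of s_t
        return latent, hidden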


A novelty metric for active model-building

To learn an accurate and general world model we need an exploration strategy that collects new and informative data. To achieve this, Plan2Explore uses a novelty metric derived from the model itself. The novelty metric measures the expected information gained about the environment upon observing the new data. As the figure below shows, this is approximated by the disagreement of an ensemble of $K$ latent models. Intuitively, large latent disagreement reflects high model uncertainty, and obtaining the data point would reduce this uncertainty. By maximizing latent disagreement, Plan2Explore selects actions that lead to the largest information gain, therefore improving the model as quickly as possible.
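A minimal sketch of how such a disagreement signal could be computed is shown below, assuming an ensemble of $K$ small one-step predictors over the latent features. The architecture, and the use of mean predictions only rather than full predicted distributions, are simplifications of the objective described in the paper.

import torch
import torch.nn as nn

class DisagreementEnsemble(nn.Module):
    """Ensemble of K one-step latent predictors; variance across members acts as an intrinsic reward."""

    def __init__(self, latent_dim=32, action_dim=6, k=10, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim + action_dim, hidden), nn.ELU(), nn.Linear(hidden, latent_dim))
            for _ in range(k)
        ])

    def novelty(self, latent, action):
        x = torch.cat([latent, action], dim=-1)
        preds = torch.stack([member(x) for member in self.members])   # (K, batch, latent_dim)
        # Disagreement: variance across ensemble members, averaged over latent dimensions.
        return preds.var(dim=0).mean(dim=-1)                          # (batch,)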


Planning for future novelty

To effectively maximize novelty, we need to know which parts of the environment are still unexplored. Most prior work on self-supervised exploration used model-free methods that reinforce past behavior that resulted in novel experience. This makes these methods slow to explore: since they can only repeat exploration behavior that was successful in the past, they are unlikely to stumble onto something novel. In contrast, Plan2Explore plans for expected novelty by measuring model uncertainty of imagined future outcomes. By seeking trajectories that have the highest uncertainty, Plan2Explore explores exactly the parts of the environment that were previously unknown.

To choose actions $a$ that optimize the exploration objective, Plan2Explore leverages the learned world model as shown in the figure below. The actions are selected to maximize the expected novelty of the entire future sequence $s_{t:T}$, using imaginary rollouts of the world model to estimate the novelty. To solve this optimization problem, we use the Dreamer agent, which learns a policy $\pi_\phi$ using a value function and analytic gradients through the model. The policy is learned completely inside the imagination of the world model. During exploration, this imagination training ensures that our exploration policy is always up-to-date with the current world model and collects data that are still novel.
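The sketch below illustrates the idea of planning for expected novelty, reusing the two hypothetical modules above (LatentWorldModel and DisagreementEnsemble). For simplicity it scores randomly sampled action sequences by their total imagined disagreement; the actual agent instead trains a Dreamer-style policy and value function inside the world model, which scales much better.

import torch

def plan_for_novelty(model, ensemble, latent, hidden, horizon=15, n_candidates=256, action_dim=6):
    """Random-shooting stand-in for the learned exploration policy:
    return the first action of the imagined sequence with the highest total disagreement."""
    batch = latent.shape[0]
    actions = torch.rand(n_candidates, horizon, batch, action_dim) * 2 - 1   # candidate actions in [-1, 1]
    scores = torch.zeros(n_candidates, batch)
    for c in range(n_candidates):
        lat, hid = latent, hidden
        for t in range(horizon):
            scores[c] += ensemble.novelty(lat, actions[c, t])
            lat, hid = model.step(lat, actions[c, t], hid)   # imagined step: prior only, no real observations
    best = scores.mean(dim=-1).argmax()
    return actions[best, 0]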


Curiosity-driven exploration behavior

We evaluate Plan2Explore on 20 continuous control tasks from the DeepMind Control Suite. The agent only has access to image observations and no proprioceptive information. Instead of random exploration, which fails to take the agent far from the initial position, Plan2Explore leads to diverse movement strategies like jumping, running, and flipping. Later, we will see that these are effective practice episodes that enable the agent to quickly learn to solve various continuous control tasks.







Solving tasks with the world model

Once an accurate and general world model is learned, we test Plan2Explore on previously unseen tasks. Given a task specified with a reward function, we use the model to optimize a policy for that task. Similar to our exploration procedure, we optimize a new value function and a new policy head for the downstream task. This optimization uses only predictions imagined by the model, enabling Plan2Explore to solve new downstream tasks in a zero-shot manner without any additional interaction with the world.

The following plot shows the performance of Plan2Explore on tasks from DM Control Suite. Before 1 million environment steps, the agent doesn’t know the task and simply explores. The agent solves the task as soon as it is provided at 1 million steps, and keeps improving fast in a few-shot regime after that.


Plan2Explore is able to solve most of the tasks we benchmarked. Since prior work on self-supervised reinforcement learning used model-free agents that are not able to adapt in a zero-shot manner (such as ICM), or did not use image observations, we compare by adapting this prior work to our model-based Plan2Explore setup. Our latent disagreement objective outperforms other previously proposed objectives. More interestingly, the final performance of Plan2Explore is comparable to the state-of-the-art oracle agent that requires task rewards throughout training. In our paper, we further report the performance of Plan2Explore in the zero-shot setting, where the agent needs to solve the task before any task-oriented practice.

Future directions

Plan2Explore demonstrates that effective behavior can be learned through self-supervised exploration only. This opens multiple avenues for future research:

  • First, to apply self-supervised RL to a variety of settings, future work will investigate different ways of specifying the task and deriving behavior from the world model. For example, the task could be specified with a demonstration, description of the desired goal state, or communicated to the agent in natural language.

  • Second, while Plan2Explore is completely self-supervised, in many cases a weak supervision signal is available, such as in hard exploration games, human-in-the-loop learning, or real life. In such a semi-supervised setting, it is interesting to investigate how weak supervision can be used to steer exploration towards the relevant parts of the environment.

  • Finally, Plan2Explore has the potential to improve the data efficiency of real-world robotic systems, where exploration is costly and time-consuming, and the final task is often unknown in advance.

By designing a scalable way of planning to explore in unstructured environments with visual observations, Plan2Explore provides an important step toward self-supervised intelligent machines.


We would like to thank Georgios Georgakis for the useful feedback.

This post is based on the following paper:

This article was initially published on the BAIR blog, and appears here with the authors’ permission.

AutoML solutions overview

Introduction

 

I have been looking for a list of AutoML solutions and a way to compare them, but I haven’t been able to find one. So I thought I might as well compile that list for others to use. If you are not familiar with AutoML, read this post for a quick introduction and the pros and cons.

 

I haven’t been able to test them all and write a proper review, so this is just a comparison based on features. I tried to pick the features that felt most important to me, but they might not be the most important for you. If you think some features are missing, or if you know an AutoML solution that should be on the list, just let me know.

 

Before we get to the list, I’ll quickly go through the features and how I interpret them.

 

Features

Deployment 

Some solutions can be deployed directly to the cloud with a one-click deployment. Some just export to TensorFlow, and some even have specific exports for edge devices.

Types 

This can be text, images, video, or tabular data. I guess some of the open source ones can be stretched to do anything if you put in the work, so this might not be the complete truth.

Explainable 

Explainability in AI is a hot topic and a very important feature for some projects. Some solutions give you no insights and some give you a lot; it might even be a strategic differentiator for the provider. I have simply divided this feature into Little, Some, and Very explainable.

Monitor 

Monitoring models after deployment to catch model drift can be a very useful feature. I divided this into Yes and No.

Accessible

Some of the providers are very easy to use, and some of them require coding and at least a basic data science understanding. So I included this feature so you can pick the tool that matches the abilities you have access to.

Labeling tool

Some have an internal labeling tool so you can label data directly before training the model. That can be very useful in some cases.

General / Specialized

Most AutoML solutions are generalized for all industries but a few are specialized to specific industries. I suspect this will become more popular, so I took this feature in.

Open Source

Self-explanatory. Is it open source or not.

Includes transfer Learning

Transfer learning is one of the big advantages of AutoML. You get to piggyback on big models so you can get great results with very little data.

 

AutoML solutions list

 

Google AutoML

 

Google AutoML is the one I’m the most familiar with. I found it pretty easy to use even without coding. The biggest issue I’ve had is that the API requires a bunch of setup and is not just a simple token- or OAuth-based authentication.

 

Deployment: To cloud, export, edge

Types: Text, Images, Video, Tabular

Explainable: Little

Monitor: No

Accessible: Very

Labeling tool: Used to have one, but it is now closed

General / Specialized: Generalized

Open Source: No

Includes transfer Learning: Yes

Link: https://cloud.google.com/automl

 

Azure AutoML

Microsoft’s cloud AutoML seems to be more explainable than Google’s, but it only offers tabular data models.

 

Deployment: To cloud, some Local

Types: Only Tabular

Explainable: Some

Monitor: No

Accessible: Very

Labeling tool: No

General / Specialized: Generalized

Open Source: No

Includes transfer Learning: Yes

Link: https://azure.microsoft.com/en-us/services/machine-learning/automatedml/

Lobe.AI

This solution is still in beta but works very well in my experience. I’ll write a review as soon as it goes public. Lobe is so easy to use that you could let a 10-year-old use it to train deep learning models. I’d really recommend it for educational purposes.

Deployment: Local and export to TensorFlow

Types: Images

Explainable: Little

Monitor: -

Accessible: Very - A third grader can use this

Labeling tool: Yes

General / Specialized: Generalized

Open Source: No

Includes transfer Learning: Yes

Link: https://lobe.ai/

 

Kortical

Kortical seems to be one of the AutoML solutions that differentiates itself by being as explainable as possible. This can be a huge advantage when you are not just trying to get good results but also to understand the business problem better. For that, I’m a bit of a fan.

Deployment: To cloud

Types: Tabular

Explainable: Very

Monitor: No

Accessible: Very

Labeling tool: No

General / Specialized: Generalized

Open Source: No

Includes transfer Learning: Not sure

Link: https://kortical.com/

DataRobot

A big player that might even be the first pure AutoML company to IPO.

Deployment: To cloud

Types: Text, Images and Tabular

Explainable: Very

Monitor: Yes

Accessible: Very

Labeling tool: No

General / Specialized: Generalized

Open Source: No

Includes transfer Learning: Yes

Link: https://www.datarobot.com/platform/automated-machine-learning/

 

AWS Sagemaker Autopilot

Amazon’s AutoML. It requires more technical skill than the other big cloud providers’ offerings and is quite limited, supporting only two algorithms: XGBoost and logistic regression.

 
Deployment: To cloud and export

Types: Tabular

Explainable: Some

Monitor: Yes

Accessible: Requires coding

Labeling tool: Yes

General / Specialized: Generalized

Open Source: No

Includes transfer Learning: Yes

Link: https://aws.amazon.com/sagemaker/autopilot/

MLJar

 Deployment: Export and Cloud

Types: Tabular

Explainable: Yes

Monitor: -

Accessible: Very

Labeling tool: No

General / Specialized: Generalized

Open Source: MLJar has both an open source (https://github.com/mljar/mljar-supervised) and a closed source solution.

Includes transfer Learning: Yes

Link: https://mljar.com/

Autogluon

 Deployment: Export

Types: Text, Images, tabular

Explainable: -

Monitor: -

Accessible: Requires coding

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: Yes

Link: https://autogluon.mxnet.io/

JadBio

 Deployment: Cloud and Export

Types: Tabular

Explainable: Some

Monitor: No

Accessible: Very

Labeling tool: No

General / Specialized: LifeScience

Open Source: No

Includes transfer Learning: -

Link: https://www.jadbio.com/

  

AUTOWEKA

This solution supports Bayesian models, which is pretty cool.

 

Deployment: Export

Types: -

Explainable: -

Monitor: -

Accessible: Requires Code

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: No

Link: https://www.cs.ubc.ca/labs/beta/Projects/autoweka/

 

H2O Driverless AI

Also supports Bayesian models.

Deployment: Export

Types: -

Explainable: -

Monitor: -

Accessible: Semi

Labeling tool: No

General / Specialized: Generalized

Open Source: Both options

Includes transfer Learning: -

Link: https://www.h2o.ai/

 

Autokeras

Autokeras is one of the most popular open source solutions and is definitely worth trying out.

Deployment: Export

Types: Text, Images, tabular

Explainable: Possible

Monitor: -

Accessible: Requires Code

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: -

Link: https://autokeras.com/

 

TPOT

 Deployment: Export

Types: Images and Tabular

Explainable: Possible

Monitor: -

Accessible: Requires Code

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: -

Link: http://epistasislab.github.io/tpot/

 

Pycaret

Deployment: Export

Types: Text, Tabular

Explainable: Possible

Monitor: -

Accessible: Requires Code

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: -

Link: https://github.com/pycaret/pycaret

AutoSklearn

Deployment: Export

Types: Tabular

Explainable: Possible

Monitor: -

Accessible: Requires Code

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: -

Link: https://automl.github.io/auto-sklearn/master/
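For the open source tools marked "Requires Code", the amount of code needed is still small. As a rough illustration, a minimal AutoSklearn run on a toy tabular dataset could look like the sketch below; the dataset and the time budgets are placeholders, not recommendations.

# Minimal AutoSklearn sketch for tabular classification (illustrative data and budgets).
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,   # total search budget in seconds
    per_run_time_limit=30,         # budget per candidate pipeline
)
automl.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, automl.predict(X_test)))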

TransmogrifAI

Made by Salesforce.

Deployment: Export

Types: Text and Tabular

Explainable: Possible

Monitor: -

Accessible: Requires Code

Labeling tool: No

General / Specialized: Generalized

Open Source: Yes

Includes transfer Learning: -

Link: https://transmogrif.ai/

 

Wearable technologies to make rehab more precise

A team led by Wyss Associate Faculty member Paolo Bonato, Ph.D., found in a recent study that wearable technology is suitable to accurately track motor recovery of individuals with brain injuries and thus allow clinicians to choose more effective interventions and to improve outcomes. Credit: Shutterstock/Dmytro Zinkevych

By Tim Sullivan / Spaulding Rehabilitation Hospital Communications

A group based out of the Spaulding Motion Analysis Lab at Spaulding Rehabilitation Hospital published “Enabling Precision Rehabilitation Interventions Using Wearable Sensors and Machine Learning to Track Motor Recovery” in the newest issue of Nature Digital Medicine. The aim of the study is to lay the groundwork for the design of “precision rehabilitation” interventions by using wearable technologies to track the motor recovery of individuals with brain injury.

The study found that the technology is suitable to accurately track motor recovery and thus allow clinicians to choose more effective interventions and improve outcomes. The study was a collaborative effort among students and former students connected to the Motion Analysis Lab, carried out under faculty mentorship.

Paolo Bonato, Ph.D., Director of the Spaulding Motion Analysis Lab and senior author on the study, said, “By providing clinicians precise data will enable them to design more effective interventions to improve the care we deliver. To have so many of our talented young scientists and researchers from our lab collaborate to create this meaningful paper is especially gratifying for all of our faculty who support our ongoing research enterprise.” Bonato is also an Associate Faculty member at Harvard’s Wyss Institute for Biologically Inspired Engineering.

Catherine Adans-Dester, P.T., Ph.D., a member of Dr. Bonato’s team served as lead author on the manuscript. “The need to develop patient-specific interventions is apparent when one considers that clinical studies often report satisfactory motor gains only in a portion of participants, which suggests that clinical outcomes could be improved if we had better tools to develop patient-specific interventions. Data collected using wearable sensors provides clinicians with the opportunity to do so with little burden on clinicians and patients,” said Dr. Adans-Dester. The approach proposed in the paper relied on machine learning-based algorithms to derive clinical score estimates from wearable sensor data collected during functional motor tasks. Sensor-based score estimates showed strong agreement with those generated by clinicians.


The results of the study demonstrated that wearable sensor data can be used to derive accurate estimates of clinical scores utilized in the clinic to capture the severity of motor impairments and the quality of upper-limb movement patterns. In the study, the upper-limb Fugl-Meyer Assessment (FMA) scale was used to generate clinical scores of the severity of motor impairments, and the Functional Ability Scale (FAS) was used to generate clinical scores of the quality of movement. Wearable sensor data (i.e., accelerometer data) was collected during the performance of eight functional motor tasks taken from the Wolf-Motor Function Test, thus providing a sample of gross arm movements and fine motor control tasks. Machine learning-based algorithms were developed to derive accurate estimates of the FMA and FAS clinical scores from the sensor data. A total of 37 study participants (16 stroke survivors and 21 traumatic brain injury survivors) participated in the study.
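As a rough sketch of the general recipe (not the study’s actual features or models, which are described in the paper), one could summarize each accelerometer recording with a few statistics, regress those features onto the clinician-assigned scores, and then check how well the estimates agree with the clinicians:

# Illustrative sketch only: placeholder data, features, and model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def summary_features(window):
    """window: (n_samples, 3) accelerometer signal recorded during one motor task."""
    magnitude = np.linalg.norm(window, axis=1)
    return [magnitude.mean(), magnitude.std(), magnitude.max(), np.abs(np.diff(magnitude)).mean()]

rng = np.random.default_rng(0)
windows = [rng.normal(size=(500, 3)) for _ in range(37 * 8)]   # placeholder recordings (participants x tasks)
X = np.array([summary_features(w) for w in windows])
y = rng.uniform(0, 66, size=len(X))                            # placeholder clinical scores

estimates = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5)
print("agreement with clinician scores (r):", np.corrcoef(y, estimates)[0, 1])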

Involved in the study, in addition to Dr. Bonato and Dr. Adans-Dester, were Nicolas Hankov, Anne O’Brien, Gloria Vergara-Diaz, Randie Black-Schaffer, MD, and Ross Zafonte, DO, from the Harvard Medical School Department of Physical Medicine & Rehabilitation at Spaulding Rehabilitation Hospital, Boston, MA, USA; Jennifer Dy of the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA; and Sunghoon I. Lee of the College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA.

Insects found to use natural wing oscillations to stabilize flight

A team of researchers from the University of California, the University of North Carolina at Chapel Hill and Pacific Northwest National Laboratory has found that insects use natural oscillations to stabilize their flight. In their study, published in the journal Science Robotics, the researchers used what they describe as "a type of calculus" (chronological calculus) to better understand the factors that are involved in keeping flapping-winged insects aloft. Matěj Karásek, with Delft University of Technology, has published a Focus piece in the same journal issue describing the work done by the team on this new effort.

Brain activity reveals individual attitudes toward humanoid robots

The way humans interpret the behavior of AI-endowed artificial agents, such as humanoid robots, depends on specific individual attitudes that can be detected from neural activity. Researchers at IIT-Istituto Italiano di Tecnologia (Italian Institute of Technology) demonstrated that people's bias toward robots—that is, attributing intentionality or considering them as "mindless things"—can be correlated with distinct brain activity patterns. The research results have been published in Science Robotics and are important for understanding the way humans engage with robots, while also considering their acceptance in healthcare applications and daily life.

AWAC: Accelerating online reinforcement learning with offline datasets


By Ashvin Nair and Abhishek Gupta

Robots trained with reinforcement learning (RL) have the potential to be used across a huge variety of challenging real-world problems. To apply RL to a new problem, you typically set up the environment, define a reward function, and train the robot to solve the task by allowing it to explore the new environment from scratch. While this may eventually work, these “online” RL methods are data hungry, and repeating this data-inefficient process for every new problem makes it difficult to apply online RL to real-world robotics problems. What if, instead of repeating the data collection and learning process from scratch every time, we were able to reuse data across multiple problems or experiments? By doing so, we could greatly reduce the burden of data collection with every new problem that is encountered.



Our method learns complex behaviors by training offline from prior datasets (expert demonstrations, data from previous experiments, or random exploration data) and then fine-tuning quickly with online interaction.

With hundreds to thousands of robot experiments being constantly run, it is of crucial importance to devise an RL paradigm that can effectively use the large amount of already available data while still continuing to improve behavior on new tasks.

The first step towards moving RL to a data-driven paradigm is to consider the general idea of offline (batch) RL. Offline RL considers the problem of learning optimal policies from arbitrary off-policy data, without any further exploration. This eliminates the data collection problem in RL and makes it possible to incorporate data from arbitrary sources, including other robots or teleoperation. However, depending on the quality of the available data and the problem being tackled, we will often need to augment offline training with targeted online improvement. This problem setting actually has unique challenges of its own. In this blog post, we discuss how we can move RL from training from scratch with every new problem to a paradigm which is able to reuse prior data effectively, with some offline training followed by online finetuning.


Figure 1: The problem of accelerating online RL with offline datasets. In (1), the robot learns a policy entirely from an offline dataset. In (2), the robot gets to interact with the world and collect on-policy samples to improve the policy beyond what it could learn offline.

Challenges in Offline RL with Online Fine-tuning

We analyze the challenges in the problem of learning from offline data and subsequent fine-tuning, using the standard benchmark HalfCheetah locomotion task. The following experiments are conducted with a prior dataset consisting of 15 demonstrations from an expert policy and 100 suboptimal trajectories sampled from a behavioral clone of these demonstrations.


Figure 2: On-policy methods are slow to learn compared to off-policy methods, due to the ability of off-policy methods to “stitch” good trajectories together, illustrated on the left. Right: in practice, we see slow online improvement using on-policy methods.

1. Data Efficiency

A simple way to utilize prior data such as demonstrations for RL is to pre-train a policy with imitation learning, and fine-tune with on-policy RL algorithms such as AWR or DAPG. This has two drawbacks. First, the prior data may not be optimal, so imitation learning may be ineffective. Second, on-policy fine-tuning is data inefficient as it does not reuse the prior data in the RL stage. For real-world robotics, data efficiency is vital. Consider the robot illustrated in Figure 2, trying to reach the goal state with prior trajectories $\tau_1$ and $\tau_2$. On-policy methods cannot effectively use this data, but off-policy algorithms that do dynamic programming can, by effectively “stitching” $\tau_1$ and $\tau_2$ together with the use of a value function or model. This effect can be seen in the learning curves in Figure 2, where on-policy methods are an order of magnitude slower than off-policy actor-critic methods.


Figure 3: Bootstrapping error is an issue when using off-policy RL for offline training. Left: an erroneous Q value far away from the data is exploited by the policy, resulting in a poor update of the Q function. Middle: as a result, the robot may take actions that are out of distribution. Right: bootstrap error causes poor offline pretraining when using SAC and its variants.

2. Bootstrapping Error

Actor-critic methods can in principle learn efficiently from off-policy data by estimating a value estimate $V(s)$ or action-value estimate $Q(s, a)$ of future returns by Bellman bootstrapping. However, when standard off-policy actor-critic methods are applied to our problem (we use SAC), they perform poorly, as shown in Figure 3: despite having a prior dataset in the replay buffer, these algorithms do not benefit significantly from offline training (as seen by the comparison between the SAC(scratch) and SACfD(prior) lines in Figure 3). Moreover, even if the policy is pre-trained by behavior cloning (“SACfD (pretrain)”) we still observe an initial decrease in performance.

This challenge can be attributed to off-policy bootstrapping error accumulation. During training, the Q estimates will not be fully accurate, particularly in extrapolating actions that are not present in the data. The policy update exploits overestimated Q values, making the estimated Q values worse. The issue is illustrated in the figure: incorrect Q values result in an incorrect update to the target Q values, which may result in the robot taking a poor action.

3. Non-stationary Behavior Models

Prior offline RL algorithms such as BCQ, BEAR, and BRAC propose to address the bootstrapping issue by preventing the policy from straying too far from the data. The key idea is to prevent bootstrapping error by constraining the policy $\pi$ close to the “behavior policy” $\pi_\beta$: the actions that are present in the replay buffer. The idea is illustrated in the figure below: by sampling actions from $\pi_\beta$, you avoid exploiting incorrect Q values far away from the data distribution.

However, $\pi_\beta$ is typically not known, especially for offline data, and must be estimated from the data itself. Many offline RL algorithms (BEAR, BCQ, ABM) explicitly fit a parametric model to samples from the replay buffer for the distribution $\pi_\beta$. After forming an estimate $\hat{\pi}_\beta$, prior methods implement the policy constraint in various ways, including penalties on the policy update (BEAR, BRAC) or architecture choices for sampling actions for policy training (BCQ, ABM).

While offline RL algorithms with constraints perform well offline, they struggle to improve with fine-tuning, as shown in the third plot in Figure 1. We see that the purely offline RL performance (at “0K” in Fig.1) is much better than SAC. However, with additional iterations of online fine-tuning, the performance increases very slowly (as seen from the slope of the BEAR curve in Fig 1). What causes this phenomenon?

The issue is in fitting an accurate behavior model as data is collected online during fine-tuning. In the offline setting, behavior models must only be trained once, but in the online setting, the behavior model must be updated online to track incoming data. Training density models online (in the “streaming” setting) is a challenging research problem, made more difficult by a potentially complex multi-modal behavior distribution induced by the mixture of online and offline data. In order to address our problem setting, we require an off-policy RL algorithm that constrains the policy to prevent offline instability and error accumulation, but is not so conservative that it prevents online fine-tuning due to imperfect behavior modeling. Our proposed algorithm, which we discuss in the next section, accomplishes this by employing an implicit constraint, which does not require any explicit modeling of the behavior policy.


Figure 4: an illustration of AWAC. High-advantage transitions are regressed on with high weight, while low advantage transitions have low weight. Right: algorithm pseudocode.

Advantage Weighted Actor Critic

In order to avoid these issues, we propose an extremely simple algorithm: advantage weighted actor critic (AWAC). AWAC avoids the pitfalls described in the previous section through careful design decisions. First, for data efficiency, the algorithm uses a critic trained with dynamic programming. Now, how can we use this critic for offline training while avoiding the bootstrapping problem, and while also avoiding modeling the data distribution, which may be unstable? To avoid bootstrapping error, we optimize the following constrained problem:
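Written out (following the notation of the AWAC paper), the constrained policy improvement problem is

$$\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_k}(s, a)\big] \quad \text{subject to} \quad D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s)\big) \le \epsilon,$$

where $A^{\pi_k}$ is the advantage under the current policy and $\epsilon$ controls how far the policy may stray from the behavior data.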

We can compute the optimal solution for this equation and project our policy onto it, which results in the following actor update:
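Following the paper, the resulting advantage-weighted actor update takes the form

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{(s, a) \sim \beta}\Big[\log \pi_\theta(a \mid s)\, \exp\!\Big(\tfrac{1}{\lambda} A^{\pi_k}(s, a)\Big)\Big],$$

where $\beta$ denotes the replay buffer (offline data plus online experience) and $\lambda$ is a temperature derived from the Lagrangian of the constraint.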

This yields an intuitive actor update that is also very effective in practice. The update resembles weighted behavior cloning; if the Q function were uninformative, it would reduce to behavior cloning on the replay buffer. But with a well-formed Q estimate, we weight the policy towards only good actions. An illustration is given in the figure above: the agent regresses onto high-advantage actions with a large weight, while almost ignoring low-advantage actions. Please see the paper for an expanded derivation and implementation details.
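A minimal sketch of what this weighted-regression update could look like in code is given below. The policy, the twin critics, and the crude one-sample value estimate are placeholders rather than the authors' implementation (the released code in rlkit is the reference).

import torch

def awac_actor_loss(policy, q1, q2, states, actions, lam=1.0):
    """Sketch of an advantage-weighted actor update (placeholder components).

    policy(states) is assumed to return a torch.distributions object over actions;
    q1/q2 map (states, actions) to Q-value tensors of shape (batch,);
    (states, actions) are sampled from the replay buffer.
    """
    dist = policy(states)
    with torch.no_grad():
        q_data = torch.min(q1(states, actions), q2(states, actions))
        a_pi = dist.sample()
        v_estimate = torch.min(q1(states, a_pi), q2(states, a_pi))   # crude one-sample value estimate
        weights = torch.exp((q_data - v_estimate) / lam)             # exponential advantage weights
    # Weighted behavior cloning: maximize log-likelihood of buffer actions, scaled by advantage weights.
    log_prob = dist.log_prob(actions)
    if log_prob.dim() > 1:                                           # e.g. factorized Gaussian: sum over action dims
        log_prob = log_prob.sum(dim=-1)
    return -(log_prob * weights).mean()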

Experiments

So how well does this actually do at addressing our concerns from earlier? In our experiments, we show that we can learn difficult, high-dimensional, sparse reward dexterous manipulation problems from human demonstrations and off-policy data. We then evaluate our method with suboptimal prior data generated by a random controller. Results on standard MuJoCo benchmark environments (HalfCheetah, Walker, and Ant) are also included in the paper.

Dexterous Manipulation


Figure 5. Top: performance shown for various methods after online training (pen: 200K steps, door: 300K steps, relocate: 5M steps). Bottom: learning curves on dexterous manipulation tasks with sparse rewards are shown. Step 0 corresponds to the start of online training after offline pre-training.

We aim to study tasks representative of the difficulties of real-world robot learning, where offline learning and online fine-tuning are most relevant. One such setting is the suite of dexterous manipulation tasks proposed by Rajeswaran et al., 2017. These tasks involve complex manipulation skills using a 28-DoF five-fingered hand in the MuJoCo simulator: in-hand rotation of a pen, opening a door by unlatching the handle, and picking up a sphere and relocating it to a target location. These environments exhibit many challenges: high dimensional action spaces, complex manipulation physics with many intermittent contacts, and randomized hand and object positions. The reward functions in these environments are binary 0-1 rewards for task completion. Rajeswaran et al. provide 25 human demonstrations for each task, which are not fully optimal but do solve the task. Since this dataset is very small, we generated another 500 trajectories of interaction data by constructing a behavior-cloned policy and then sampling from this policy.

First, we compare our method on the dexterous manipulation tasks described earlier against prior methods for off-policy learning, offline learning, and bootstrapping from demonstrations. The results are shown in the figure above. Our method uses the prior data to quickly attain good performance, and the efficient off-policy actor-critic component of our approach fine-tunes much quicker than DAPG. For example, our method solves the pen task in 120K timesteps, the equivalent of just 20 minutes of online interaction. While the baseline comparisons and ablations are able to make some amount of progress on the pen task, alternative off-policy RL and offline RL algorithms are largely unable to solve the door and relocate task in the time-frame considered. We find that the design decisions to use off-policy critic estimation allow AWAC to significantly outperform AWR while the implicit behavior modeling allows AWAC to significantly outperform ABM, although ABM does make some progress.

Fine-Tuning from Random Policy Data

An advantage of using off-policy RL is that we can also incorporate suboptimal data, rather than only demonstrations. In this experiment, we evaluate on a simulated tabletop pushing environment with a Sawyer robot.

To study the potential to learn from suboptimal data, we use an off-policy dataset of 500 trajectories generated by a random process. The task is to push an object to a target location in a 40cm x 20cm goal space.

The results are shown in the figure to the right. We see that while many methods begin at the same initial performance, AWAC learns the fastest online and is actually able to make use of the offline dataset effectively as opposed to some methods which are completely unable to learn.

Future Directions

Being able to use prior data and fine-tune quickly on new problems opens up many new avenues of research. We are most excited about using AWAC to move from the single-task regime in RL to the multi-task regime, with data sharing and generalization between tasks. The strength of deep learning has been its ability to generalize in open-world settings, which we have already seen transform the fields of computer vision and natural language processing. To achieve the same type of generalization in robotics, we will need RL algorithms that take advantage of vast amounts of prior data. But one key distinction in robotics is that collecting high-quality data for a task is very difficult – often as difficult as solving the task itself. This is in contrast to, for instance, computer vision, where humans can label the data. Thus, active data collection (online learning) will be an important piece of the puzzle.

This work also suggests a number of algorithmic directions to move forward. Note that in this work we focused on mismatched action distributions between the policy $\pi$ and the behavior data $\pi_\beta$. When doing off-policy learning, there is also a mismatch in the marginal state distributions between the two. Intuitively, consider a problem with two solutions, A and B, where B is a higher-return solution and the provided off-policy data demonstrates solution A. Even if the robot discovers solution B during online exploration, the off-policy data still consists mostly of data from path A. Thus the Q-function and policy updates are computed over states encountered while traversing path A, even though the agent will not encounter these states when executing the optimal policy. This problem has been studied previously. Accounting for both types of distribution mismatch will likely result in better RL algorithms.

Finally, we are already using AWAC as a tool to speed up our research. When we set out to solve a task, we do not usually try to solve it from scratch with RL. First, we may teleoperate the robot to confirm the task is solvable; then we might run some hard-coded policy or behavioral cloning experiments to see if simple methods can already solve it. With AWAC, we can save all of the data in these experiments, as well as other experimental data such as when hyperparameter sweeping an RL algorithm, and use it as prior data for RL.


A preprint of the work this blog post is based on is available here. Code is now included in rlkit. The code documentation also contains links to the data and environments we used. The project website is available here.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.

6 things you should know before beginning with AI projects

Artificial intelligence (AI) projects are becoming commonplace for big businesses and entrepreneurs alike. As a result, many people with no prior experience with AI are now being put in charge of AI projects. Almost five years ago that happened to me for the first time, and I’ve since learned a lot. So here are six things I wish I had known when I did my first AI project.

1. Data is the most expensive part

AI is often talked about as being technically very difficult, requiring extensive resources to develop. But in fact that’s not the complete truth. The development can be costly, but the vast majority of the work and resources usually goes into acquiring, cleaning and preparing data for the development to take place.

Data is also the most crucial element when trying to make the AI successfully do its job. As a result you should always prefer superior data over superior technology when making AI models.

So when budgeting for an AI project, make sure that you set aside most of the time and money for getting a lot of good quality data. And remember that you might even need to acquire fresh data continuously if the domain you work in has changing conditions.

2. AI technology is more accessible than you think

In a very short time, AI has made the jump from requiring specialist data scientists and machine learning engineers to a point where we can now make AI models without a single line of code. A multitude of AutoML (automatic machine learning) vendors have appeared in recent years, and they are rapidly improving. That means that getting started with AI doesn’t require as much investment as before.

The data acquisition and the human processes, like training and onboarding, still require hard work though, and neither should be underestimated.

3. AI is experimental 

Developing AI is an experimental process. You cannot know how long it will take to develop what you have in mind or how good it will be. In some cases you cannot even be sure that AI is a feasible solution to your problem before trying. 

The best way to succeed with uncertain project conditions like this is to time-cap and milestone-fund the project. Set short milestones and only release more funds for a project if the goals for each milestone have been met, or at least if you see meaningful progress. If you fund the whole project up front, you might end up pouring all your money into a dead end that could have been caught early.

4. Be clear on what success means for your project

Before getting started you should be very clear with your stakeholders about what a successful project will look like. New technology like AI can quickly be held to golden standards that it will never achieve. If expectations are not aligned before the kick-off, you might end up thinking you made a fantastic solution while some of your stakeholders are disappointed. In my experience, the exact same AI solution can amaze some people and underwhelm others.

A good way to deal with this is to make all stakeholders agree that the first version of the AI should just be able to deliver the status quo. From there you can improve and gradually increase the value.

5. Users will lose their sense of control

It can be hard to explain the inner workings and the reasoning behind an AI’s output. At the same time, you cannot know exactly what output it will give for a specific input. That can make it feel just as unpredictable as, or even more unpredictable than, humans doing the same tasks. Because the users of an AI cannot ask it questions or know whether the feedback they give the AI will make a difference, they will often feel a loss of control.

To avoid that feeling, you must first of all prepare the users for this new paradigm. It’s much easier if they buy in on these conditions before they get to try the AI. If possible, you can also provide feedback mechanisms so the users will at least feel that they can make a difference. It won’t work every time, but it’s better than nothing.

It’s also a good idea to manage expectations through the right narrative. Make it clear whether the AI is a decision system that makes decisions on its own, or a support system that only makes suggestions. When users clearly understand the purpose of the AI, they usually get comfortable with it more quickly.

6. People have very different understandings of what AI is 

As a rule of thumb, everyone has a different understanding of AI. The managers, users, developers and all other stakeholders will have their own unique understanding of what AI actually is. That’s fair enough, since AI does not have one definitive definition, but it will be a source of problems if everyone involved in a project has a different understanding of what is going on. So before you start a project, don’t take for granted that everybody thinks the way you do. Be explicit about what AI means to you and how you will approach it.

Wearable exosuit that lessens muscle fatigue could redesign the future of work

Vanderbilt University engineers have determined that their back-assist exosuit, a clothing-like device that supports human movement and posture, can reduce fatigue by an average of 29-47 percent in lower back muscles. The exosuit's functionality presents a promising new development for individuals who work in physically demanding fields and are at risk for back pain, including medical professionals and frontline workers.

A 3D-printed tensegrity structure for soft robotics applications

Tensegrity is a design principle that has often been applied by artists, architects and engineers to build a wide range of structures, including sculptures, frames and buildings. This principle essentially describes the dynamics that occur when a structure maintains its stability via a pervasive tensional force.