
End-to-end deep reinforcement learning without reward engineering


By Avi Singh

Communicating the goal of a task to another person is easy: we can use language, show them an image of the desired outcome, point them to a how-to video, or use some combination of all of these. On the other hand, specifying a task to a robot for reinforcement learning requires substantial effort. Most prior work that has applied deep reinforcement learning to real robots makes use of specialized sensors to obtain rewards, or studies tasks where the robot’s internal sensors can be used to measure reward. Examples include thermal cameras for tracking fluids and purpose-built computer vision systems for tracking objects. Since such instrumentation needs to be done for any new task that we may wish to learn, it poses a significant bottleneck to widespread adoption of reinforcement learning for robotics, and precludes the use of these methods directly in open-world environments that lack this instrumentation.

We have developed an end-to-end method that allows robots to learn from a modest number of images that depict successful completion of a task, without any manual reward engineering. The robot initiates learning from this information alone (around 80 images), and occasionally queries a user for additional labels. In these queries, the robot shows the user an image and asks for a label to determine whether that image represents successful completion of the task or not. We require a small number of such queries (around 25-75), and using these queries, the robot is able to learn directly in the real world in 1-4 hours of interaction time, resulting in one of the most efficient real-world image-based robotic RL methods. We have open-sourced our implementation.






Our method allows us to solve a host of real world robotics problems from pixels in an end-to-end fashion without any hand-engineered reward functions.

Classifier-based rewards

While most prior work uses purpose-built systems for obtaining rewards to solve the task at hand, a simple alternative has been previously explored. We can specify the task using a set of goal images, and then train a classifier to distinguish between goal and non-goal images. The success probabilities from this classifier can then be used as reward for training an RL agent to achieve the goal.
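As a rough illustration of using classifier outputs as rewards, here is a minimal sketch. It assumes a tiny logistic classifier over a hypothetical two-number feature vector with hand-picked weights; in the setting described here the classifier would be a neural network operating on camera images. All names and values are illustrative and not taken from the released implementation.

```python
import math

def success_probability(features, weights, bias):
    """Logistic classifier: probability that an observation shows the goal."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def classifier_reward(features, weights, bias):
    """Reward handed to the RL agent: the classifier's success probability."""
    return success_probability(features, weights, bias)

# Hypothetical weights and observations, chosen by hand for illustration.
weights, bias = [4.0, -4.0], 0.0
near_goal = [0.9, 0.1]
far_from_goal = [0.1, 0.9]

r_near = classifier_reward(near_goal, weights, bias)
r_far = classifier_reward(far_from_goal, weights, bias)
print(f"reward near goal: {r_near:.3f}, reward far from goal: {r_far:.3f}")
```

The agent receives a higher reward for observations the classifier considers goal-like, so no task-specific sensing is needed beyond the camera.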






It’s often straightforward to specify a task via example images. For example, in the images above, the task could be “pour this much wine in the glass”, “fold clothes like this”, or “set the table like this”.

Problem with classifiers

While classifiers are an intuitive and straightforward solution to specify tasks for RL agents in the real world, they also pose a number of issues when applied to real-world problems. A user that is specifying a task with goal classifiers must provide not only positive examples for the task, but also negative examples. Moreover, this set of negative examples must be exhaustive and cover all parts of the space that the robot can potentially visit. If the set of negative examples is not exhaustive, then the RL algorithm can easily fool the classifier by finding situations that the classifier did not see during training. An example of this classifier exploitation problem can be seen below.




In this task, the goal is to push the green object onto the red marker. The robot is trained via RL using a classifier as a reward function. The success probability from the classifier is visualized over time in the lower right. As we see, while the classifier outputs a success probability of 1.0, the robot does not solve the task. The RL algorithm has managed to exploit the classifier by moving the robot arm in a peculiar way, since the classifier was not trained on this specific kind of negative example.

Overcoming classifier exploitation

Our recent approach, which we call variational inverse control with events (VICE), seeks to solve this issue by instead mining the negative examples required by the classifier in an adversarial fashion. The method begins by randomly initializing the classifier and the policy. It first fixes the classifier and updates the policy to maximize the reward. Then, it trains the classifier to distinguish between user-provided goal examples and samples collected by the policy. The RL algorithm then utilizes this updated classifier as reward for learning a policy to achieve the desired goal, and this alternating process continues until the samples collected by the policy are indistinguishable from the user-provided goal examples. This process resembles generative adversarial networks and is based on a form of inverse reinforcement learning, but in contrast to standard inverse reinforcement learning, it does not require example demonstrations, only example success images provided at the beginning of training for the classifier. VICE (as shown below) is effective at combating the exploitation problem faced by naive classifiers, and the user no longer needs to provide any negative examples at all.
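The alternating structure described above can be sketched on a one-dimensional toy problem. This is not the VICE implementation: the states are plain numbers instead of images, the "classifier" is a unit-variance Gaussian discriminant whose training reduces to computing class means, and the "RL update" is a single gradient step on the classifier logit. Every function name and constant below is an assumption made for illustration.

```python
import math

def fit_classifier(goal_examples, policy_samples):
    """Goal examples are positives; policy samples serve as the negatives.

    Assumption: unit-variance Gaussian classes with equal priors, so the
    logit is linear in the state and fitting means computing class means.
    """
    m_pos = sum(goal_examples) / len(goal_examples)
    m_neg = sum(policy_samples) / len(policy_samples)
    slope = m_pos - m_neg
    offset = -(m_pos ** 2 - m_neg ** 2) / 2.0
    return slope, offset

def success_probability(state, classifier):
    slope, offset = classifier
    return 1.0 / (1.0 + math.exp(-(slope * state + offset)))

def collect_samples(policy_mean):
    """Stand-in for rolling out the policy: states clustered around its mean."""
    return [policy_mean + delta for delta in (-0.1, 0.0, 0.1)]

def improve_policy(policy_mean, classifier, step_size=0.5):
    """One gradient step on the classifier logit, standing in for an RL update."""
    slope, _ = classifier
    return policy_mean + step_size * slope

def run_toy_vice(goal_examples, policy_mean=0.0, rounds=20):
    classifier = (0.0, 0.0)  # uninformative initial classifier
    for _ in range(rounds):
        # Classifier fixed: update the policy to increase the reward.
        policy_mean = improve_policy(policy_mean, classifier)
        # Policy fixed: retrain the classifier on goal examples vs. policy samples.
        classifier = fit_classifier(goal_examples, collect_samples(policy_mean))
    return policy_mean, classifier

goal_examples = [0.9, 1.0, 1.1]  # hypothetical user-provided goal states
final_mean, final_classifier = run_toy_vice(goal_examples)
final_prob = success_probability(final_mean, final_classifier)
print(f"policy mean: {final_mean:.4f}, classifier output there: {final_prob:.4f}")
```

In this toy, the policy mean moves toward the goal examples, and the classifier output at the policy's states approaches 0.5, which is the sense in which the policy's samples become indistinguishable from the goal examples. The negatives are never supplied by the user; they come from the policy's own samples.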




We see that the success probabilities learned by the classifier correlate strongly with actual success, allowing the robot to learn a policy that successfully accomplishes the task.

Leveraging active learning

While VICE is capable of learning end-to-end policies for solving real world robotic tasks without any engineering for obtaining rewards, it does have a limitation: it needs thousands of positive examples provided upfront in order to learn, and this could be a burden on the human user. To combat this problem, we developed a new approach that enables the robot to query the user for labels, in addition to using a modest number of initially-provided goal examples. We refer to this approach as reinforcement learning with active goal queries (RAQ). In these active queries, the robot shows the user an image and asks for a label to determine whether the image represents successful completion of the task. While requesting labels for every single state would amount to asking the user to manually provide the reward signal, our method requires labels for only a tiny fraction of the images seen during training, making it an efficient and practical approach for learning skills without manually engineered rewards.




In this task, the goal is to place a book into any one of the empty slots in the bookshelf. This figure shows some example queries made by our algorithm. The algorithm has picked each of these images from the experience it collected while learning to solve the task (using probability estimates from the learned classifier), and the user provides a binary success/failure label for each of them.
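A minimal sketch of such a query loop follows. The selection rule used here (ask about the unlabeled observation with the highest predicted success probability) is an assumption; the text only says that the classifier's probability estimates are used. The observations are plain numbers, and `ask_user` is a hypothetical stand-in for the human labeler.

```python
def select_query(experience, success_probability, already_queried):
    """Index of the unlabeled observation the classifier rates most goal-like."""
    candidates = [i for i in range(len(experience)) if i not in already_queried]
    if not candidates:
        return None
    return max(candidates, key=lambda i: success_probability(experience[i]))

def ask_user(observation):
    """Hypothetical human labeler: True if the observation shows success."""
    return observation >= 0.8

# Toy experience buffer; each number stands in for an image.
experience = [0.1, 0.95, 0.4, 0.7, 0.85]

def toy_probability(obs):
    """Pretend classifier output for this toy example."""
    return obs

queried, positives, negatives = set(), [], []
for _ in range(3):  # a small query budget
    index = select_query(experience, toy_probability, queried)
    queried.add(index)
    if ask_user(experience[index]):
        positives.append(experience[index])   # new goal example
    else:
        negatives.append(experience[index])   # new labeled failure

print("positives:", positives, "negatives:", negatives)
```

Only three of the five observations are shown to the user, which mirrors the idea that labels are requested for a small fraction of the collected experience.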

The combined method, which we call VICE-RAQ, is able to solve real world robotics tasks with about 80 goal example images provided up front, followed by 25-75 active queries. We make use of the recently introduced soft actor-critic algorithm for policy optimization, and are able to solve tasks in about 1-4 hours of real world interaction time, which is much faster than prior work for a policy trained end-to-end on images.




Our method is able to learn the pushing task (where the goal is to push the mug onto the white coaster) in slightly over an hour of interaction time, and requires only 25 queries. Even for the more complex bookshelf and draping tasks, our method requires under four hours of interaction time and fewer than 75 active queries.

Solving tasks involving deformable objects

Since we learn a reward function on pixels, we can solve tasks for which it would be difficult to manually specify a reward function. One of the tasks in our experiments is to drape a cloth over a box, which is essentially a miniaturized version of a tablecloth draping task. To succeed, the robot must drape the cloth smoothly, without crumpling it and without creating any wrinkles. We see that our method is able to successfully solve this task. To demonstrate the challenges associated with this task, we evaluate a method that only uses the robot’s end-effector position as observation and a hand-defined reward function on this observation (Euclidean distance to the goal). We observe that this baseline fails to achieve the objective of this task, as it simply moves the end effector in a straight line motion to the goal, while this task cannot be solved using any straight-line trajectory.
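To make the baseline concrete, here is a small sketch of a hand-defined reward of this kind. The negative sign, the coordinates, and the function name are assumptions for illustration; the text only specifies Euclidean distance to the goal on the end-effector position.

```python
import math

def distance_reward(end_effector_xyz, goal_xyz):
    """Hand-defined reward: negative Euclidean distance from gripper to goal."""
    return -math.dist(end_effector_xyz, goal_xyz)

# Hypothetical gripper start and goal positions, in meters.
start, goal = (0.0, 0.0, 0.0), (0.3, 0.4, 0.0)

# Points along the straight line from start to goal.
straight_line = [
    tuple(s + t * (g - s) for s, g in zip(start, goal))
    for t in (0.0, 0.5, 1.0)
]
rewards = [distance_reward(point, goal) for point in straight_line]
print(rewards)
```

The reward increases steadily along the straight line and carries no information about the state of the cloth, which is consistent with the baseline's straight-line behavior.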





Left: resulting policy with a hand-defined reward on the gripper position. Right: resulting policy from a learned reward function on pixels.

Solving tasks with multiple goal conditions

Classifiers are more expressive than just goal images for describing a task, and this can best be seen in tasks for which there are multiple images that describe our goal. In the bookshelf task in our experiments, the goal is to insert a book into an empty slot on a bookshelf. The initial position of the arm holding the book is randomized, requiring the robot to succeed from any starting position. Crucially, the bookshelf has several open slots, which means that, from different starting positions, different slots may be preferred. Here, we see that our method learns a policy to insert the book in different slots in the bookshelf depending on where the book is at the start of a trajectory. The robot usually prefers to put the book in the nearest slot, since this maximizes the reward that it can obtain from the classifier.





Left: robot chooses to insert book in left slot. Right: robot chooses to insert book in the right slot.

Several data-driven approaches have been proposed for the reward specification problem, and inverse reinforcement learning (IRL) is one of the more prominent frameworks in this setting. VICE is closely related to recent IRL methods like guided cost learning and adversarial inverse reinforcement learning. While these methods require trajectories of (state,action) pairs provided by a human expert, VICE only requires the final desired state, making it substantially easier to specify the task, and also making it possible for the reinforcement learning algorithm to discover novel ways to complete the task on its own (instead of simply mimicking the expert).

Our method is also related to generative adversarial networks. Techniques inspired by GANs have been applied to control problems, but these techniques also require expert trajectories similar to the IRL techniques mentioned before. Our method demonstrates that such adversarial learning frameworks can be extended to settings where we don’t have expert demonstrations, and only have examples of desired states that we would like to achieve.

End-to-end perception and control for robotics have gained prominence in the last few years, but initial approaches either required access to low-dimensional states (e.g. the positions of objects) at training time, or separately trained intermediate representations. More recent approaches are able to learn policies directly on pixels without using low-dimensional states during training, but still require instrumentation for obtaining rewards. Our method goes a step further: it learns both a policy and a reward function on pixels. This allows us to solve tasks for which rewards would otherwise be hard to specify, such as the draping task.

Conclusion

By enabling robotic reinforcement learning without user-programmed reward functions or demonstrations, we believe that our approach represents a significant step towards making reinforcement learning a practical, automated, and readily usable tool for enabling versatile and capable robotic manipulation. By making it possible for robots to improve their skills directly in real-world environments, without any instrumentation or manual reward design, we believe that our method also represents a step toward enabling lifelong learning for robotic systems that learn directly “in the wild”. This capability can make it feasible in the future for robots to acquire broad and highly generalizable skill repertoires directly through interaction with the real world.

This post is based on the following papers:

I would like to thank Sergey Levine, Chelsea Finn and Kristian Hartikainen for their feedback while writing this blog post. This article was initially published on the BAIR blog, and appears here with the authors’ permission.

Sensor-packed glove learns signatures of the human grasp

MIT researchers have developed a low-cost, sensor-packed glove that captures pressure signals as humans interact with objects. The glove can be used to create high-resolution tactile datasets that robots can leverage to better identify, weigh, and manipulate objects.
Image: Courtesy of the researchers
By Rob Matheson

Wearing a sensor-packed glove while handling a variety of objects, MIT researchers have compiled a massive dataset that enables an AI system to recognize objects through touch alone. The information could be leveraged to help robots identify and manipulate objects, and may aid in prosthetics design.

The researchers developed a low-cost knitted glove, called “scalable tactile glove” (STAG), equipped with about 550 tiny sensors across nearly the entire hand. Each sensor captures pressure signals as humans interact with objects in various ways. A neural network processes the signals to “learn” a dataset of pressure-signal patterns related to specific objects. Then, the system uses that dataset to classify the objects and predict their weights by feel alone, with no visual input needed.

In a paper published today in Nature, the researchers describe a dataset they compiled using STAG for 26 common objects — including a soda can, scissors, tennis ball, spoon, pen, and mug. Using the dataset, the system predicted the objects’ identities with up to 76 percent accuracy. The system can also predict the correct weights of most objects within about 60 grams.

Similar sensor-based gloves used today run thousands of dollars and often contain only around 50 sensors that capture less information. Even though STAG produces very high-resolution data, it’s made from commercially available materials totaling around $10.

The tactile sensing system could be used in combination with traditional computer vision and image-based datasets to give robots a more human-like understanding of interacting with objects.

“Humans can identify and handle objects well because we have tactile feedback. As we touch objects, we feel around and realize what they are. Robots don’t have that rich feedback,” says Subramanian Sundaram PhD ’18, a former graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “We’ve always wanted robots to do what humans can do, like doing the dishes or other chores. If you want robots to do these things, they must be able to manipulate objects really well.”

The researchers also used the dataset to measure the cooperation between regions of the hand during object interactions. For example, when someone uses the middle joint of their index finger, they rarely use their thumb. But the tips of the index and middle fingers always correspond to thumb usage. “We quantifiably show, for the first time, that, if I’m using one part of my hand, how likely I am to use another part of my hand,” he says.

Prosthetics manufacturers can potentially use this information to, say, choose optimal spots for placing pressure sensors and help customize prosthetics to the tasks and objects people regularly interact with.

Joining Sundaram on the paper are: CSAIL postdocs Petr Kellnhofer and Jun-Yan Zhu; CSAIL graduate student Yunzhu Li; Antonio Torralba, a professor in EECS and director of the MIT-IBM Watson AI Lab; and Wojciech Matusik, an associate professor in electrical engineering and computer science and head of the Computational Fabrication group.  

STAG is laminated with an electrically conductive polymer that changes resistance to applied pressure. The researchers sewed conductive threads through holes in the conductive polymer film, from fingertips to the base of the palm. The threads overlap in a way that turns them into pressure sensors. When someone wearing the glove feels, lifts, holds, and drops an object, the sensors record the pressure at each point.

The threads connect from the glove to an external circuit that translates the pressure data into “tactile maps,” which are essentially brief videos of dots growing and shrinking across a graphic of a hand. The dots represent the location of pressure points, and their size represents the force — the bigger the dot, the greater the pressure.

From those maps, the researchers compiled a dataset of about 135,000 video frames from interactions with 26 objects. Those frames can be used by a neural network to predict the identity and weight of objects, and provide insights about the human grasp.

To identify objects, the researchers designed a convolutional neural network (CNN), which is usually used to classify images, to associate specific pressure patterns with specific objects. But the trick was choosing frames from different types of grasps to get a full picture of the object.

The idea was to mimic the way humans can hold an object in a few different ways in order to recognize it, without using their eyesight. Similarly, the researchers’ CNN chooses up to eight semirandom frames from the video that represent the most dissimilar grasps — say, holding a mug from the bottom, top, and handle.

But the CNN can’t just choose random frames from the thousands in each video, or it probably won’t choose distinct grips. Instead, it groups similar frames together, resulting in distinct clusters corresponding to unique grasps. Then, it pulls one frame from each of those clusters, ensuring it has a representative sample. Then the CNN uses the contact patterns it learned in training to predict an object classification from the chosen frames.
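The cluster-then-sample step can be sketched as follows. The article does not name the clustering algorithm, so the k-means routine, the farthest-point initialization, the seeded random pick, and the two-number "frames" are all assumptions for illustration; real tactile frames would hold one pressure value per sensor.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(frames, k, iterations=10):
    """Plain k-means; farthest-point initialization keeps the sketch deterministic."""
    centers = [frames[0]]
    while len(centers) < k:
        centers.append(
            max(frames, key=lambda f: min(squared_distance(f, c) for c in centers))
        )
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for index, frame in enumerate(frames):
            nearest = min(range(k), key=lambda j: squared_distance(frame, centers[j]))
            clusters[nearest].append(index)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        centers = [
            [sum(frames[i][d] for i in members) / len(members)
             for d in range(len(frames[0]))] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    return clusters

def pick_frames(frames, k, seed=0):
    """Pull one frame index from each non-empty cluster of similar frames."""
    rng = random.Random(seed)
    return [rng.choice(members) for members in kmeans(frames, k) if members]

# Hypothetical pressure frames from three distinct grasps (three frames each).
frames = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # grasp A
    (1.0, 0.0), (0.9, 0.0), (1.0, 0.1),   # grasp B
    (0.0, 1.0), (0.1, 1.0), (0.0, 0.9),   # grasp C
]
picks = pick_frames(frames, k=3)
print("chosen frame indices:", picks)
```

Because each chosen frame comes from a different cluster, the selection covers all three grasps, whereas three uniformly random frames could easily come from the same grasp.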

“We want to maximize the variation between the frames to give the best possible input to our network,” Kellnhofer says. “All frames inside a single cluster should have a similar signature that represent the similar ways of grasping the object. Sampling from multiple clusters simulates a human interactively trying to find different grasps while exploring an object.”

For weight estimation, the researchers built a separate dataset of around 11,600 frames from tactile maps of objects being picked up by finger and thumb, held, and dropped. Notably, the CNN wasn’t trained on any frames it was tested on, meaning it couldn’t learn to just associate weight with an object. In testing, a single frame was inputted into the CNN. Essentially, the CNN picks out the pressure around the hand caused by the object’s weight, and ignores pressure caused by other factors, such as hand positioning to prevent the object from slipping. Then it calculates the weight based on the appropriate pressures.

The system could be combined with the sensors already on robot joints that measure torque and force to help them better predict object weight. “Joints are important for predicting weight, but there are also important components of weight from fingertips and the palm that we capture,” Sundaram says.

Omron Helps University of Houston Engineering Students Gain Real-World Skills with New Design and Robotics Laboratory

Omron Foundation, the charitable arm of the U.S.-based operations of industrial automation solutions provider Omron, donated a new laboratory complete with workstations and state-of-the-art equipment to give University of Houston students the opportunity to prepare for real-

#287: Robonomics Platform: Integrating Robots into the Economy, with Aleksandr Kapitonov



In this episode, Lilly Clark interviews Aleksandr Kapitonov, “robot economics” academic society professor at Airalab, on his work for Robonomics Platform, an Ethereum network infrastructure for integrating robots and cyber-physical systems directly into the economy. Kapitonov discusses the advantages of using blockchain, use cases including a fully autonomous vending machine, and the Robonomics technology stack.

Below are two videos showing the Robonomics Platform in action via a fully autonomous robot artist and drones for environmental monitoring.

Aleksandr Kapitonov

Aleksandr Kapitonov is a “robot economics” academic society professor at Airalab (the team behind Robonomics Platform), an assistant professor of Control Systems and Robotics at ITMO University, and regional coordinator of the Erasmus+ IOT-OPEN.EU project for researching and developing IoT education practices. His research focuses on navigation, computer vision, control of mobile robots and communication for multi-agent systems.


Model-based reinforcement learning from pixels with structured latent variable models

By Marvin Zhang and Sharad Vikram

Imagine a robot trying to learn how to stack blocks and push objects using visual inputs from a camera feed. In order to minimize cost and safety concerns, we want our robot to learn these skills with minimal interaction time, but efficient learning from complex sensory inputs such as images is difficult. This work introduces SOLAR, a new model-based reinforcement learning (RL) method that can learn skills – including manipulation tasks on a real Sawyer robot arm – directly from visual inputs with under an hour of interaction. To our knowledge, SOLAR is the most efficient RL method for solving real world image-based robotics tasks.



Our robot learns to stack a Lego block and push a mug onto a coaster with only inputs from a camera pointed at the robot. Each task takes an hour or less of interaction to learn.

In the RL setting, an agent such as our robot learns from its own experience through trial and error, in order to minimize a cost function corresponding to the task at hand. Many challenging tasks have been solved in recent years by RL methods, but most of these success stories come from model-free RL methods, which typically require substantially more data than model-based methods. However, model-based methods often rely on the ability to accurately predict into the future in order to plan the agent’s actions. This is an issue for image-based learning as predicting future images itself requires large amounts of interaction, which we wish to avoid.

There are some model-based RL methods that do not require accurate future prediction, but these methods typically place stringent assumptions on the state. The LQR-FLM method has been shown to learn new tasks very efficiently, including for real robotic systems, by modeling the dynamics of the state as approximately linear. This assumption, however, is prohibitive for image-based learning, as the dynamics of pixels in a camera feed are far from linear. The question we study in our work is: how can we relax this assumption in order to develop a model-based RL method that can solve image-based tasks without requiring accurate future predictions?

We tackle this problem by learning a latent state representation using deep neural networks. When our agent is faced with images from the task, it can encode the images into their latent representations, which can then be used as the state inputs to LQR-FLM rather than the images themselves. The key insight in SOLAR is that, in addition to learning a compact latent state that accurately captures the objects, we specifically learn a representation that works well with LQR-FLM by encouraging the latent dynamics to be linear. To that end, we introduce a latent variable model that explicitly represents latent linear dynamics, and this model combined with LQR-FLM provides the basis for the SOLAR algorithm.

Stochastic Optimal Control with Latent Representations

SOLAR stands for stochastic optimal control with latent representations, and it is an efficient and general solution for image-based RL settings. The key ideas behind SOLAR are learning latent state representations where linear dynamics are accurate, as well as utilizing a model-based RL method that does not rely on future prediction, which we describe next.

Linear Dynamical Control


Using the system state, LQR-FLM and related methods have been used to successfully learn a myriad of tasks including robotic manipulation and locomotion. We aim to extend these capabilities by automatically learning the state input to LQR-FLM from images.

One of the best-known results in control theory is the linear-quadratic regulator (LQR), a set of equations that provides the optimal control strategy for a system in which the dynamics are linear and the cost is quadratic. Though real world systems are almost never linear, approximations to LQR such as LQR with fitted linear models (LQR-FLM) have been shown to perform well at a variety of robotic control tasks. LQR-FLM has been one of the most efficient RL methods at learning control skills, even compared to other model-based RL methods. This efficiency is enabled by the simplicity of linear models as well as the fact that these models do not need to predict accurately into the future. This makes LQR-FLM an appealing method to build on. However, its key limitation is that it normally assumes access to the system state, e.g., the joint configuration of the robot and the positions of objects of interest, whose dynamics can often be reasonably modeled as approximately linear. We instead work from images and relax this assumption by learning a representation that we can use as the input to LQR-FLM.
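For readers unfamiliar with LQR, the following sketch solves the simplest case: a scalar system with known linear dynamics and quadratic cost over a finite horizon, using the standard backward Riccati recursion. It illustrates plain LQR only, not LQR-FLM, and the system constants are arbitrary values chosen for the example.

```python
def lqr_gains(a, b, q, r, horizon):
    """Feedback gains for x_next = a*x + b*u with stage cost q*x^2 + r*u^2.

    The terminal cost is q*x^2. The recursion runs backward in time.
    """
    p = q  # cost-to-go coefficient at the final time step
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        gains.append(k)
    gains.reverse()  # gains[t] is the gain to apply at time t
    return gains

def rollout_cost(a, b, q, r, x0, gains):
    """Total cost of applying u = -k*x with the given time-indexed gains."""
    x, cost = x0, 0.0
    for k in gains:
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost + q * x * x  # add the terminal cost

a, b, q, r, horizon = 1.0, 1.0, 1.0, 1.0, 20
gains = lqr_gains(a, b, q, r, horizon)
lqr_cost = rollout_cost(a, b, q, r, 1.0, gains)
no_control_cost = rollout_cost(a, b, q, r, 1.0, [0.0] * horizon)
print(f"LQR cost: {lqr_cost:.3f}, cost with no control: {no_control_cost:.1f}")
```

With these constants the LQR controller drives the state toward zero at a small total cost, while applying no control leaves the state where it started and accumulates cost at every step.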

Learning Latent States from Images


The graphical model we set up presumes that the images we observe are a function of a latent state, and the states evolve according to linear dynamics modulated by actions, and where the costs are given by a quadratic function of the state and action.

We want our agent to extract, from its visual input, a state representation where the dynamics of the state are as close to linear as possible. To accomplish this, we devise a latent variable model in which the latent states obey linear dynamics, as detailed in the graphic above. The dark nodes are what we observe from interacting with the environment: images, actions taken by the agent, and costs. The light nodes are the underlying states, which form the representation that we wish to learn, and we posit that the next state is a linear function of the current state and action. This model bears strong resemblance to the structured variational auto-encoder (SVAE), a model previously applied to problems such as characterizing videos of mice. The method that we use to fit our model is also based on the method presented in this prior work.
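One way to write down the model described in this section is given below. The symbols are assumptions introduced here for illustration: $o_t$ is the image, $z_t$ the latent state, $a_t$ the action, and $c_t$ the cost. The Gaussian noise term is likewise an assumption; the text only states that the dynamics are linear and the cost is quadratic.

```latex
\[
\begin{aligned}
z_{t+1} &\sim \mathcal{N}\!\left(A z_t + B a_t,\ \Sigma\right)
  && \text{(linear latent dynamics)} \\
o_t &\sim p(o_t \mid z_t)
  && \text{(images generated from the latent state)} \\
c_t &\approx \tfrac{1}{2}
  \begin{bmatrix} z_t \\ a_t \end{bmatrix}^{\top}
  C
  \begin{bmatrix} z_t \\ a_t \end{bmatrix}
  + c^{\top} \begin{bmatrix} z_t \\ a_t \end{bmatrix}
  && \text{(quadratic cost in state and action)}
\end{aligned}
\]
```

Under this reading, learning the representation means choosing an encoder for $z_t$ such that a single pair of matrices $A$ and $B$ explains the observed transitions well.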

At a high level, our method learns both the state dynamics and an encoder, which is a function that takes as input the current and past images and outputs a guess of the current state. If we encode many observation sequences corresponding to the agent’s interactions with the environment, we can see if these state sequences behave according to our learned linear dynamics – if they don’t, we adjust our dynamics and our encoder to bring them closer in line. One key aspect of this procedure is that we do not directly optimize our model to be accurate at predicting into the future, since we only fit linear models retrospectively to the agent’s previous interactions. This strongly complements LQR-FLM which, again, does not rely on prediction for good performance. Our paper provides more details about our model learning procedure.
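The idea of fitting a linear model retrospectively to past interactions can be illustrated with ordinary least squares on a scalar latent state. This is a simplification under stated assumptions: the latent states are taken as given rather than produced by a learned encoder, the model is not time-varying, and plain least squares stands in for the inference procedure used in the paper. The function name and data are hypothetical.

```python
def fit_linear_dynamics(transitions):
    """Least-squares fit of z_next ~ a*z + b*u from (z, u, z_next) tuples."""
    szz = sum(z * z for z, u, _ in transitions)
    szu = sum(z * u for z, u, _ in transitions)
    suu = sum(u * u for z, u, _ in transitions)
    szy = sum(z * y for z, u, y in transitions)
    suy = sum(u * y for z, u, y in transitions)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = szz * suu - szu * szu
    if abs(det) < 1e-12:
        raise ValueError("transitions do not determine a unique linear model")
    a = (szy * suu - suy * szu) / det
    b = (suy * szz - szy * szu) / det
    return a, b

# Hypothetical past transitions generated by z_next = 0.9*z + 0.5*u.
inputs = [(1.0, 0.0), (0.5, 1.0), (-1.0, 0.5), (0.2, -1.0)]
transitions = [(z, u, 0.9 * z + 0.5 * u) for z, u in inputs]

a_hat, b_hat = fit_linear_dynamics(transitions)
print(f"fitted dynamics: z_next = {a_hat:.3f} * z + {b_hat:.3f} * u")
```

The fit only has to explain transitions that were already observed, so it is never asked to predict far into the future.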

The SOLAR Algorithm


Our robot iteratively interacts with its environment, uses this data to update its model, uses this model to estimate the latent states and their dynamics, and uses these dynamics to update its behavior.

Now that we have described the building blocks of our method, how do these pieces fit together into the SOLAR method? The agent acts in the environment according to its policy, which prescribes actions based on the current latent state estimate. These interactions produce trajectories of images, actions, and costs that are then used to fit the model detailed in the previous section. Afterwards, using these entire trajectories of interactions, our model retrospectively refines its estimate of the latent dynamics, which allows LQR-FLM to produce an updated policy that should perform better at the given task, i.e., incur lower costs. The updated policy is then used to collect more trajectories, and the procedure repeats. The graphic above depicts these stages of the algorithm.

The key difference between LQR-FLM and most other model-based RL methods is that the resulting models are only used for policy improvement and not for prediction into the future. This is useful in settings where the observations are complex and difficult to predict, and we extend this benefit into image-based settings by introducing latent states that we can estimate alongside the dynamics. As seen in the next section, SOLAR can produce good policies for image-based robotic manipulation tasks using only one hour of interaction time with the environment.

Experiments


Left: For Lego block stacking, we experiment with multiple starting positions of the arm and block. For pushing, we only use sparse rewards provided by a human pushing a key when the robot succeeds. Example image observations are given in the bottom row. Right: Examples of successful behaviors learned by SOLAR.

Our main testbed for SOLAR is the Sawyer robotic arm, which has seven degrees of freedom and can be used for a variety of manipulation tasks. We feed the robot images from a camera pointed at its arm and the relevant objects in the scene, and we task our robot with learning Lego block stacking and mug pushing, as detailed below.

Lego Block Stacking

https://youtube.com/watch?v=X5RjE–TUGs%3Frel%3D0

Using SOLAR, our Sawyer robot efficiently learns stacking from only image observations from all three initial positions. The ablations are less successful, and DVF does not learn as quickly as SOLAR. In particular, these methods have difficulty with the challenging setting where the block starts on the table.

The main challenge for block stacking stems from the precision required to succeed, as the robot must very accurately place the block in order to properly connect the pieces. Using SOLAR, the Sawyer learns this precision from only the camera feed, and moreover the robot can successfully learn to stack from a number of starting configurations of the arm and block. In particular, the configuration where the block starts on the table is the most challenging, as the Sawyer must learn to first lift the block off the table before stacking it – in other words, it can’t be “greedy” and simply move toward the other block.

We first compare SOLAR to an ablation that uses a standard variational auto-encoder (VAE) rather than the SVAE, which means that the state representation is not learned to follow linear dynamics. This ablation is only successful on the easiest starting configuration. In order to understand what benefits we extract from not requiring accurate future predictions, we compare to another ablation which replaces LQR-FLM with an alternative planning method known as model-predictive control (MPC), and we also compare to a state-of-the-art prior method that uses MPC, deep visual foresight (DVF). MPC has been used in a number of prior and subsequent works, and it relies on being able to generate accurate future predictions using the learned model in order to determine what actions are likely to lead to good performance.

The MPC ablation learns more quickly on the two easier configurations. However, it fails in the most difficult setting because MPC greedily reduces the distance between the two blocks rather than lifting the block off the table. MPC acts greedily because it only plans over a short horizon, as predicting future images becomes increasingly inaccurate over longer horizons, and this is exactly the failure mode that SOLAR is able to overcome by utilizing LQR-FLM to avoid future predictions altogether. Finally, we find that DVF can make progress but ultimately is not able to solve the two harder settings even with more data than what we use for our method. This highlights our method’s data efficiency, as we use in total a few hours of robot data compared to days or weeks of data as in DVF.

Mug Pushing

https://youtube.com/watch?v=buk4YE2mFTs

Despite the challenge of only having sparse rewards provided by a human key press, our robot running SOLAR learns to push the mug onto the coaster in under an hour. DVF is again not as efficient and does not learn as quickly as SOLAR.

We add an additional challenge to mug pushing by replacing the costs with a sparse reward signal, i.e., the robot is only told when it has completed the task, and it is told nothing otherwise. As seen in the picture above, the human presses a key on the keyboard in order to provide the sparse reward, and the robot must reason about how to improve its behavior in order to achieve this reward. This is implemented via a straightforward extension to SOLAR, as we detail in the paper. Despite this additional challenge, our method learns a successful policy in about an hour of interaction time, whereas DVF performs worse than our method using a comparable amount of data.

Simulated Comparisons


Left: an illustration of the car and reacher environments we experiment with, along with example image observations in the bottom row. Right: our method generally performs better than the ablations we compare to, as well as RCE. PPO has better final performance, but it requires one to three orders of magnitude more data than SOLAR to reach this performance.

In addition to the Sawyer experiments, we also run several comparisons in simulation, as most prior work does not experiment with real robots. In particular, we set up a 2D navigation domain where the underlying system actually has linear dynamics and quadratic cost, but we can only observe images that show a top-down view of the agent and the goal. We also include two domains that are more complex: a car that must drive from the bottom right to the top left of a 2D plane, and a 2-degree-of-freedom arm that is tasked with reaching a goal in the bottom left. All domains are learned with only image observations that provide a top-down view of the task.
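As a rough illustration of what such a 2D navigation domain could look like, here is a minimal sketch with linear dynamics, a quadratic cost, and a top-down image observation. This is not our simulation code: the unit-square workspace, the 16x16 image size, the pixel intensities, the identity dynamics matrices, and the action weight are all assumptions chosen to keep the example short.

```python
# Hypothetical sketch of a 2D navigation domain; the workspace bounds, image
# size, pixel intensities, and cost weights are assumptions, not the paper's.
IMAGE_SIZE = 16


def step(pos, action):
    """Linear dynamics x' = A x + B u, here with A = B = identity."""
    return (pos[0] + action[0], pos[1] + action[1])


def cost(pos, action, goal, action_weight=0.1):
    """Quadratic cost: squared distance to the goal plus an action penalty."""
    dist_sq = (pos[0] - goal[0]) ** 2 + (pos[1] - goal[1]) ** 2
    act_sq = action[0] ** 2 + action[1] ** 2
    return dist_sq + action_weight * act_sq


def to_pixel(coord):
    """Map a coordinate in [0, 1] to a pixel index, clipping at the borders."""
    return min(IMAGE_SIZE - 1, max(0, int(coord * IMAGE_SIZE)))


def render(pos, goal):
    """Top-down grayscale image: the goal pixel is 0.5, the agent pixel 1.0."""
    image = [[0.0] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
    image[to_pixel(goal[1])][to_pixel(goal[0])] = 0.5
    image[to_pixel(pos[1])][to_pixel(pos[0])] = 1.0
    return image


pos, goal = (0.25, 0.25), (0.75, 0.75)
action = (0.25, 0.0)
pos = step(pos, action)
observation = render(pos, goal)  # the learner only ever sees this image
print(pos, cost(pos, action, goal))
```

The underlying state is just the two-dimensional position, but the learner receives only `observation`, so it has to recover a latent representation in which the dynamics are linear and the cost is quadratic.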

We compare to robust locally-linear controllable embeddings (RCE), which takes a different approach to learning latent state representations that follow linear dynamics. We also compare to proximal policy optimization (PPO), a model-free RL method that has been used to solve a number of simulated robotics domains but is not data efficient enough for real-world learning. We find that SOLAR learns faster and achieves better final performance than RCE. PPO typically achieves better final performance than SOLAR, but it requires one to three orders of magnitude more data to do so, which again is prohibitive for most real-world learning tasks. This kind of tradeoff is typical: model-free methods tend to achieve better final performance, but model-based methods learn much faster. Videos of the experiments can be viewed on our project website.

Related Work

Approaches to learning latent representations of images have proposed objectives such as reconstructing the image and predicting future images. These objectives do not line up perfectly with our objective of accomplishing tasks – for example, a robot tasked with sorting objects into bins by color does not need to perfectly reconstruct the color of the wall in front of it. There has also been work on learning state representations that are suitable for control, including identifying points of interest within the image and learning latent states such that dimensions are independently controllable. A recent survey paper categorizes the landscape of state representation learning.

Separately from control, there have been a number of recent works that learn structured representations of data, many of which extend VAEs. The SVAE is an example of one such framework, and some other methods also attempt to explain the data with linear dynamics. Beyond this, there have been works that learn latent representations with mixture model structure, various discrete structures, and Bayesian nonparametric structures.

Ideas that are closely related to ours have been proposed in prior and subsequent work. As mentioned before, DVF has also learned robotics tasks directly from vision, and a recent blog post summarizes their results. Embed to control and its successor RCE also aim to learn latent state representations with linear dynamics. We compare to these methods in our paper and demonstrate that our method tends to exhibit better performance. Subsequent to our work, PlaNet learns latent state representations with a mixture of deterministic and stochastic variables and uses them in conjunction with MPC, one of the baseline methods in our evaluation, demonstrating good results on several simulated tasks. As shown by our experiments, LQR-FLM and MPC each have their respective strengths and weaknesses, and we found that LQR-FLM was typically more successful for robotic control, avoiding the greedy behavior of MPC.

Future Work

We see several exciting directions for future work, and we’ll briefly mention two. First, we want our robots to be able to learn complex, multi-stage tasks, such as building Lego structures rather than just stacking one block, or setting a table rather than just pushing one mug. One way we may realize this is by providing intermediate images of the goals we want the robot to accomplish, and if we expect that the robot can learn each stage separately, it may be able to string these policies together into more complex and interesting behaviors. Second, humans don’t just learn representations of states but also actions – we don’t think about individual muscle movements, we group such movements together into “macro-actions” to perform highly coordinated and sophisticated behaviors. If we can similarly learn action representations, we can enable our robots to more efficiently learn how to use hardware such as dexterous hands, which will further increase their ability to handle complex, real-world environments.

This post is based on the following paper:

We would like to thank our co-authors, without whom this work would not be possible, for also contributing to and providing feedback on this post, in particular Sergey Levine. We would also like to thank the many people that have provided insightful discussions, helpful suggestions, and constructive reviews that have shaped this work. This article was initially published on the BAIR blog, and appears here with the authors’ permission.
