Learning preferences by looking at the world
By Rohin Shah and Dmitrii Krasheninnikov
It would be great if we could all have household robots do our chores for us. Chores are tasks that we want done to make our houses cater more to our preferences; they are a way in which we want our house to be different from the way it currently is. However, most “different” states are not very desirable:
Surely our robot wouldn’t be so dumb as to go around breaking stuff when we ask it to clean our house? Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out. Generally, it is easy to get the reward wrong by forgetting to include preferences for things that should stay the same, since we are so used to having these preferences satisfied, and there are so many of them. Consider the room below, and imagine that we want a robot waiter that serves people at the dining table efficiently. We might implement this using a reward function that provides 1 reward whenever the robot serves a dish, and use discounting so that the robot is incentivized to be efficient. What could go wrong with such a reward function? How would we need to modify the reward function to take this into account? Take a minute to think about it.
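To see why discounting pushes the robot to hurry, here is a minimal sketch of the discounted return under this reward. The discount factor and the two serving schedules are made-up values for illustration, not taken from any experiment.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

GAMMA = 0.9  # hypothetical discount factor

# Two hypothetical schedules that both serve three dishes (1 reward per dish).
careful = [0, 0, 1, 0, 0, 1, 0, 0, 1]  # slows down to avoid side effects
hasty = [1, 1, 1, 0, 0, 0, 0, 0, 0]    # rushes and ignores side effects

careful_return = discounted_return(careful, GAMMA)
hasty_return = discounted_return(hasty, GAMMA)
print(careful_return, hasty_return)
```

Both schedules serve the same dishes, yet the hasty one earns a higher return, and nothing in the reward accounts for what was knocked over along the way.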
Here’s an incomplete list we came up with:
- The robot might track dirt and oil onto the pristine furniture while serving food, even if it could clean itself up, because there’s no reason to clean but there is a reason to hurry.
- In its hurry to deliver dishes, the robot might knock over the cabinet of wine bottles, or slide plates to people and knock over the glasses.
- In case of an emergency, such as the electricity going out, we don’t want the robot to keep trying to serve dishes – it should at least be out of the way, if not trying to help us.
- The robot may serve empty or incomplete dishes, dishes that no one at the table wants, or even split apart dishes into smaller dishes so there are more of them.
Note that we’re not talking about problems with robustness and distributional shift: while those problems are worth tackling, the point is that even if we achieve robustness, the simple reward function still incentivizes the above unwanted behaviors.
It’s common to hear the informal solution that the robot should try to minimize its impact on the environment, while still accomplishing the task. This could potentially allow us to avoid the first three problems above, though the last one still remains as an example of specification gaming. This idea leads to impact measures that attempt to quantify the “impact” that an agent has, typically by looking at the difference between what actually happened and what would have happened had the robot done nothing. However, this also penalizes things we want the robot to do. For example, if we ask our robot to get us coffee, it might buy coffee rather than making coffee itself, because making it would have “impact” on the water, the coffee maker, etc. Ultimately, we’d like to only prevent negative impacts, which means that we need our AI to have a better idea of what the right reward function is.
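As a rough illustration of this failure mode, consider a deliberately naive impact measure that counts how many state features differ from a "robot did nothing" baseline. This is a toy stand-in, not any published impact measure, and the feature names are hypothetical.

```python
def impact(actual_state, baseline_state):
    """Count features that differ from the 'robot did nothing' baseline."""
    return sum(actual_state[k] != baseline_state[k] for k in baseline_state)

# Hypothetical state features for a small kitchen.
baseline = {"vase_intact": True, "water_in_tank": True, "coffee_made": False}
break_vase = {"vase_intact": False, "water_in_tank": True, "coffee_made": False}
make_coffee = {"vase_intact": True, "water_in_tank": False, "coffee_made": True}

vase_penalty = impact(break_vase, baseline)
coffee_penalty = impact(make_coffee, baseline)
print(vase_penalty, coffee_penalty)
```

Under this toy measure, making the requested coffee changes more features than breaking the vase, so the desired behavior is penalized more heavily than the undesired one.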
Our key insight is that while it might be hard for humans to make their preferences explicit, some preferences are implicit in the way the world looks: the world state is a result of humans having acted to optimize their preferences. This explains why we often want the robot to by default “do nothing” – if we have already optimized the world state for our preferences, then most ways of changing it will be bad, and so doing nothing will often (though not always) be one of the better options available to the robot.
Since the world state is a result of optimization for human preferences, we should be able to use that state to infer what humans care about. For example, we surely don’t want dirty floors in our pristine room; otherwise we would have dirtied them ourselves. We also can’t be indifferent to dirty floors, because then at some point we would have walked around the room with dirty shoes and gotten a dirty floor. The only explanation is that we want the floor to be clean.
A simple setting
Let’s see if we can apply this insight in the simplest possible setting: gridworlds with a small number of states, a small number of actions, a known dynamics model (i.e. a model of “how the world works”), but an incorrect reward function. This is a simple enough setting that our robot understands all of the consequences of its actions. Nevertheless, the problem remains: while the robot understands what will happen, it still cannot distinguish good consequences from bad ones, since its reward function is incorrect. In these simple environments, it’s easy to figure out what the correct reward function is, but this is infeasible in a real, complex environment.
For example, consider the room to the right, where Alice asks her robot to navigate to the purple door. If we were to encode this as a reward function that only rewards the robot while it is at the purple door, the robot would take the shortest path to the purple door, knocking over and breaking the vase – since no one said it shouldn’t do that. The robot is perfectly aware that its plan causes it to break the vase, but by default it doesn’t realize that it shouldn’t break the vase.
In this environment, does it help us to realize that Alice was optimizing the state of the room for her preferences? Well, if Alice didn’t care about whether the vase was broken, she would have probably broken it some time in the past. If she wanted the vase broken, she definitely would have broken it some time in the past. So the only consistent explanation is that Alice cared about the vase being intact, as illustrated in the gif below.
While this example has the robot infer that it shouldn’t take the action of breaking a vase, the robot can also infer goals that it should actively pursue. For example, if the robot observes a basket of apples near an apple tree, it can reasonably infer that Alice wants to harvest apples, since the apples didn’t walk into the basket themselves – Alice must have put effort into picking the apples and placing them in the basket.
Reward Learning by Simulating the Past
We formalize this idea by considering an MDP in which our robot observes the initial state $s_0$ at deployment, and assumes that it is the result of a human optimizing some unknown reward for $T$ timesteps.
Before we get to our actual algorithm, consider a completely intractable algorithm that should do well: for each possible reward function, simulate the trajectories that Alice would take if she had that reward, and see if the resulting states are compatible with $s_0$. This set of compatible reward functions gives the candidates for Alice’s reward function. This is the algorithm that we implicitly use in the gif above.
Intuitively, this works because:
- Anything that requires effort on Alice’s part (e.g. keeping a vase intact) will not happen for the vast majority of reward functions, and will force the reward functions to incentivize that behavior (e.g. by rewarding intact vases).
- Anything that does not require effort on Alice’s part (e.g. a vase becoming dusty) will happen for most reward functions, and so the inferred reward functions need not incentivize that behavior (e.g. there’s no particular value on dusty/clean vases).
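The brute-force idea can be sketched on a toy version of the vase room. Everything here is an assumption made for illustration: a single reward weight on "vase intact", a small effort cost for walking around the vase, and an Alice who greedily maximizes her immediate payoff instead of planning.

```python
EFFORT_COST = 0.1  # hypothetical cost of walking around the vase
T = 5              # hypothetical number of past timesteps

def step(vase_intact, action):
    """Deterministic toy dynamics: walking through the vase breaks it."""
    return vase_intact and action == "go_around"

def best_action(vase_intact, theta_vase):
    """Alice picks the action with the higher immediate payoff (myopic)."""
    def payoff(action):
        next_intact = step(vase_intact, action)
        effort = EFFORT_COST if action == "go_around" else 0.0
        return theta_vase * next_intact - effort
    return max(["walk_through", "go_around"], key=payoff)

def simulate(theta_vase):
    """Roll Alice forward for T steps and return the final vase state."""
    vase_intact = True
    for _ in range(T):
        vase_intact = step(vase_intact, best_action(vase_intact, theta_vase))
    return vase_intact

observed_s0 = True              # the robot sees an intact vase
candidates = [-1.0, 0.0, 1.0]   # candidate weights on "vase intact"
compatible = [th for th in candidates if simulate(th) == observed_s0]
print(compatible)
```

Because keeping the vase intact takes effort, only the candidate that rewards an intact vase reproduces the observed state; the indifferent and vase-hating Alices both end up breaking it.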
Another way to think of it is that we can consider all possible past trajectories that are compatible with $s_0$, infer the reward function that makes those trajectories most likely, and keep those reward functions as plausible candidates, weighted by the number of past trajectories they explain. Such an algorithm should work for similar reasons. Phrased this way, it sounds like we want to use inverse reinforcement learning to infer rewards for every possible past trajectory, and aggregate the results. This is still intractable, but it turns out we can take this insight and turn it into a tractable algorithm.
We follow Maximum Causal Entropy Inverse Reinforcement Learning (MCEIRL), a commonly used algorithm for small MDPs. In this framework, we know the action space and dynamics of the MDP, as well as a set of good features of the state, and the reward is assumed to be linear in these features. In addition, the human is modelled as Boltzmann-rational: Alice’s probability of taking a particular action from a given state is assumed to be proportional to the exponential of the state-action value function Q, computed using soft value iteration. Given these assumptions, we can calculate $p(\tau \mid \theta_A)$, the distribution over the possible trajectories $\tau = s_{-T} a_{-T} \dots s_{-1} a_{-1} s_0$ under the assumption that Alice’s reward was $\theta_A$. MCEIRL then finds the $\theta_A$ that maximizes the probability of a set of trajectories.
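A Boltzmann-rational policy from soft value iteration can be sketched as follows. This is a minimal finite-horizon version on a made-up two-state MDP, assuming the reward is collected on arrival in the next state; the MDP, the single feature, and the function names are illustrative and are not taken from the paper's code.

```python
import math

# Hypothetical toy MDP: state 0 = "vase intact", state 1 = "vase broken".
# Action 0 = "go around", action 1 = "walk through".
# P[s][a] lists the probability of each next state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from "intact"
    [[0.0, 1.0], [0.0, 1.0]],  # "broken" is absorbing
]
FEATURES = [[1.0], [0.0]]      # one feature: "vase is intact"

def reward(theta, s):
    """Reward linear in the state features."""
    return sum(w * f for w, f in zip(theta, FEATURES[s]))

def soft_value_iteration(theta, horizon):
    """Return policies[t][s][a], with pi(a | s) proportional to exp(Q(s, a))."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    policies = []
    for _ in range(horizon):
        Q = [[sum(p * (reward(theta, s2) + V[s2])
                  for s2, p in enumerate(P[s][a]))
              for a in range(n_actions)] for s in range(n_states)]
        # Soft maximum: V(s) = log sum_a exp Q(s, a)
        V = [math.log(sum(math.exp(q) for q in Q[s])) for s in range(n_states)]
        policies.append([[math.exp(q - V[s]) for q in Q[s]]
                         for s in range(n_states)])
    policies.reverse()  # policies[0] is the earliest timestep
    return policies

policy = soft_value_iteration(theta=[2.0], horizon=3)
print(policy[0][0])  # action probabilities in the "intact" state at the first step
```

With a positive weight on the intact vase, Alice most likely goes around it but still walks through it with small probability; with a weight of zero both actions become equally likely.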
Rather than considering all possible trajectories and running MCEIRL on all of them to maximize each of their probabilities individually, we instead maximize the probability of the evidence that we see: the single state $s_0$. To get a distribution over $s_0$, we marginalize out the human’s behavior prior to the robot’s initialization:
$$p(s_0 \mid \theta_A) = \sum_{s_{-T}, a_{-T}, \dots, s_{-1}, a_{-1}} p(\tau = s_{-T} a_{-T} \dots s_{-1} a_{-1} s_0 \mid \theta_A)$$
We then find a reward $\theta_A$ that maximizes the likelihood above using gradient ascent, where the gradient is analytically computed using dynamic programming. We call this algorithm Reward Learning by Simulating the Past (RLSP) since it infers the unknown human reward from a single state by considering what must have happened in the past.
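The sketch below illustrates the idea on a made-up two-state MDP. It computes $p(s_0 \mid \theta_A)$ by pushing the state distribution forward through Alice's Boltzmann policy, which sums over all past trajectories without enumerating them. Note one deliberate swap: for brevity it uses a finite-difference gradient rather than the analytic gradient described above. The MDP, the assumed starting distribution, and the hyperparameters are all hypothetical.

```python
import math

# Hypothetical toy MDP: state 0 = "vase intact", state 1 = "vase broken".
# Action 0 = "go around", action 1 = "walk through".
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
FEATURES = [[1.0], [0.0]]  # one feature: "vase is intact"
T = 3                      # assumed number of past timesteps
START = [1.0, 0.0]         # assumed distribution over s_{-T}: vase starts intact

def boltzmann_policies(theta):
    """Finite-horizon soft value iteration; returns policies[t][s][a]."""
    V = [0.0, 0.0]
    policies = []
    for _ in range(T):
        Q = [[sum(p * (sum(w * f for w, f in zip(theta, FEATURES[s2])) + V[s2])
                  for s2, p in enumerate(P[s][a]))
              for a in range(2)] for s in range(2)]
        V = [math.log(sum(math.exp(q) for q in Q[s])) for s in range(2)]
        policies.append([[math.exp(q - V[s]) for q in Q[s]] for s in range(2)])
    return policies[::-1]

def likelihood(theta, s0):
    """p(s_0 | theta): marginalize over past trajectories with a forward pass."""
    dist = START[:]
    for pi in boltzmann_policies(theta):
        nxt = [0.0, 0.0]
        for s in range(2):
            for a in range(2):
                for s2 in range(2):
                    nxt[s2] += dist[s] * pi[s][a] * P[s][a][s2]
        dist = nxt
    return dist[s0]

def fit(s0, steps=50, lr=0.5, eps=1e-4):
    """Gradient ascent on log p(s_0 | theta) using central differences."""
    theta = [0.0]
    for _ in range(steps):
        up = math.log(likelihood([theta[0] + eps], s0))
        down = math.log(likelihood([theta[0] - eps], s0))
        theta[0] += lr * (up - down) / (2 * eps)
    return theta

theta_intact = fit(s0=0)  # robot observes an intact vase
theta_broken = fit(s0=1)  # robot observes a broken vase
print(theta_intact, theta_broken)
```

Observing an intact vase drives the inferred weight on "vase intact" upward, while observing a broken one drives it downward. With no prior on the weight, the likelihood keeps improving as the weight grows in magnitude, so this sketch simply stops after a fixed number of steps.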
Using the inferred reward
While RLSP infers a reward that captures the information about human preferences contained in the initial state, it is not clear how we should use that reward. This is a challenging problem – we have two sources of information, the inferred reward from $s_0$, and the specified reward $\theta_{\text{spec}}$, and they will conflict. If Alice has a messy room, $\theta_A$ is not going to incentivize cleanliness, even though $\theta_{\text{spec}}$ might.
Ideally, we would note the scenarios under which the two rewards conflict, and ask Alice how she would like to proceed. However, in this work, to demonstrate the algorithm we use the simple heuristic of adding the two rewards, giving us a final reward $\theta_A + \lambda \theta_{\text{spec}}$, where $\lambda$ is a hyperparameter that controls the tradeoff between the rewards.
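The additive heuristic is simple enough to show in a few lines. The feature set, the weights standing in for an RLSP-inferred reward, and the value of $\lambda$ below are all made up for illustration.

```python
# Hypothetical feature order: [vase_intact, at_purple_door, steps_taken]
THETA_SPEC = [0.0, 1.0, -0.1]  # specified reward: reach the door quickly
THETA_A = [1.5, 0.0, 0.0]      # made-up stand-in for an inferred reward
LAMBDA = 1.0                   # hypothetical tradeoff hyperparameter

def combine(theta_a, theta_spec, lam):
    """Final reward weights: theta_A + lambda * theta_spec."""
    return [a + lam * s for a, s in zip(theta_a, theta_spec)]

def score(theta, features):
    """Linear reward for a feature vector."""
    return sum(w * f for w, f in zip(theta, features))

# Hypothetical outcomes of two plans.
short_path = [0.0, 1.0, 3.0]  # breaks the vase, reaches the door in 3 steps
long_path = [1.0, 1.0, 5.0]   # keeps the vase, reaches the door in 5 steps

theta_final = combine(THETA_A, THETA_SPEC, LAMBDA)
print(score(THETA_SPEC, short_path), score(THETA_SPEC, long_path))
print(score(theta_final, short_path), score(theta_final, long_path))
```

Under the specified reward alone the short, vase-breaking path scores higher; once the inferred weights are added, the longer path that keeps the vase intact wins.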
We designed a suite of simple gridworlds to showcase the properties of RLSP. The top row shows the behavior when optimizing the (incorrect) specified reward, while the bottom row shows the behavior you get when you take into account the reward inferred by RLSP. A more thorough description of each environment is given in the paper. The last environment in particular shows a limitation of our method. In a room where the vase is far away from Alice’s most probable trajectories, the only trajectories that Alice could have taken to break the vase are all very long and contribute little to the RLSP likelihood. As a result, observing the intact vase doesn’t tell the robot much about whether Alice wanted to actively avoid breaking the vase, since she wouldn’t have been likely to break it in any case.
What’s next?
Now that we have a basic algorithm that can learn the human preferences from one state, the natural next step is to scale it to realistic environments where the states cannot be enumerated, the dynamics are not known, and the reward function is not linear. This could be done by adapting existing inverse RL algorithms, similarly to how we adapted Maximum Causal Entropy IRL to the one-state setting.
The unknown dynamics setting, where we don’t know “how the world works”, is particularly challenging. Our algorithm relies heavily on the assumption that our robot knows how the world works – this is what gives it the ability to simulate what Alice “must have done” in the past. We certainly can’t learn how the world works just by observing a single state of the world, so we would have to learn a dynamics model while acting that can then be used to simulate the past (and these simulations will get better as the model gets better).
Another avenue for future work is to investigate ways to decompose the inferred reward into $\theta_{A, \text{task}}$, which says which task Alice is performing (“go to the purple door”), and $\theta_{\text{frame}}$, which captures what Alice prefers to keep unchanged (“don’t break the vase”). Given the separate $\theta_{\text{frame}}$, the robot could optimize $\theta_{\text{spec}}+\theta_{\text{frame}}$ and ignore the parts of the reward function that correspond to the task Alice is trying to perform.
Since $\theta_{\text{frame}}$ is in large part shared across many humans, we could infer it using models where multiple humans are optimizing their own unique $\theta_{H,\text{task}}$ but the same $\theta_{\text{frame}}$, or we could have one human whose task changes over time. Another direction would be to assume a different structure for what Alice prefers to keep unchanged, such as constraints, and learn them separately.
You can learn more about this research by reading our paper, or by checking out our poster at ICLR 2019. The code is available here.
This article was initially published on the BAIR blog, and appears here with the authors’ permission.
Is the green new deal sustainable?
This week Washington DC was abuzz with news that had nothing to do with the occupant of The White House. A group of progressive legislators, led by Alexandria Ocasio-Cortez, in the House of Representatives, introduced “The Green New Deal.” The resolution responds to the Intergovernmental Panel on Climate Change’s report and the alarming Fourth National Climate Assessment, and aims to reduce global “greenhouse gas emissions from human sources of 40 to 60 percent from 2010 levels by 2030; and net-zero global emissions by 2050.” While the bill is largely targeting the transportation industry, many proponents suggest that it would be more impactful, and healthier, to curb America’s insatiable appetite for animal agriculture.
According to a recent BBC report, “Food production accounts for one-quarter to one-third of all anthropogenic greenhouse gas emissions worldwide, and the brunt of responsibility for those numbers falls to the livestock industry.” The average US family “emits more greenhouse gases because of the meat they eat than from driving two cars,” quipped Professor Tim Benton of the University of Leeds. “Most people don’t think of the consequences of food on climate change. But just eating a little less meat right now might make things a whole lot better for our children and grandchildren,” sighed Benton.
Americans continue to chow down more than 26 billion pounds of meat a year, distressing environmentalists who assert that the current status quo is unsustainable. While veganism would provide a 70% relief to greenhouse gases worldwide, it is not foreseeable that 7 billion people would instantly change their diets to save the planet. Robotics, and even more so, artificial intelligence, is now being embraced by venture-backed entrepreneurs to artificially grow meat alternatives as creative gastronomic replacements.
Chilean startup Not Company (NotCo) built a machine learning platform named Giuseppe to search for animal ingredient substitutes. NotCo founder Matias Muchnick explains, “Giuseppe was created to understand molecular connections between food and the human perception of taste and texture.” While Muchnick did not disclose his techniques, he revealed to Business Insider that the company has hired teams of food and data scientists to classify ingredients into bits for Giuseppe. Muchnick explains the AI begins the work of processing the “data regarding how the brain works when it’s given certain flavors, when you taste salty, umami, [or] sweet.” Today, the company has a line of egg and milk alternatives on the shelves including: “Not Mayo,” “Not Cheese,” “Not Yogurt,” and “Not Milk.” The NotCo website states that this is only the first step in a larger scheme for the deep learning algorithm: “NotCo currently has a very ambitious development plan for Giuseppe, which includes the generation of new databases with information of a different nature, such as production processes and other molecular properties of food, in such a way that Giuseppe gets closer and closer to be the most advanced chef and food scientist in the world.”
NotCo competes in a growing landscape of other animal substitute upstarts. Hampton Creek, which recently rebranded as JUST, also offers an array of dairy and egg alternatives from plant-based ingredients. The ultimate test for all these companies is creating meat in a petri dish. When responding to the challenge, JUST announced, “Through a first-of-its-kind partnership, JUST will develop cultured Wagyu beef using cells from Toriyama prized cows. Then, Awano Food Group (a premier international supplier of meat and seafood) will market and sell the meat to clients exactly how they do today with conventionally produced Toriyama Wagyu.” Today, a handful of companies, many ironically backed by livestock corporations, are also tackling the $90 billion cellular agriculture market, including: Mosa Meat, Impossible Burger, Beyond Meat, and Memphis Meats. Mosa, backed by Google co-founder Sergey Brin, unveiled the first synthetic burger in 2013 at a staggering cost of nearly a half million dollars.
While costs are declining, cultured meat is too expensive to supplement the American diet, especially when $1 still buys one a fast food dinner. The key to mass acceptance is attacking the largest pain point in the lab: acquiring enough genetic material from bovine tissue. Currently, the cost of such serums is close to $1,000 an ounce, and they are not exactly cruelty free, as they are derived from animals. Many clean meat founders are proudly vegan with the implicit goal of replacing animal ingredients altogether. In order to accomplish this task, companies like JUST have invested in building robust AI and robotic systems to automatically scour the globe for plant-based alternatives. “Over 300,000 species are in the plant kingdom. That’s over 18 billion proteins, 108 million lipids, and 4 million polysaccharides. It’s an abundance almost entirely unexplored, until now,” exclaims their website. The company boasts that it is on the verge of major discoveries, “The more we explore, the more data we gather along the way. And the faster we’ll find the answers. It’s almost impossible to look at the data and say, ‘Here’s a pattern. Here’s an answer.’ So, we have to come up with algorithms to rank the materials and give downstream experiments a recommendation. In this way, we’re using data to increase the probability of discoveries.”
The next few years will unearth major breakthroughs; Mosa has already announced it will have an affordable product on the shelves by 2021. To accomplish this task, the company turned to Merck’s corporate venture arm, M Ventures, and Bell Food to lead its previous financing round. Last July, Forbes reported that the strategic partnerships are critical to Mosa’s vision of mass producing meat. According to Mosa’s founder, Mark Post, “Merck’s experience with cell cultures is very attractive from a strategic standpoint. Cell production is key to scaling cultured meat production, as they still need to figure out how to get cells to grow more rapidly and at higher numbers. In short, new technology needs to be developed. That’s where companies like Merck can lend a hand.” In addition to leveraging the conglomerate’s expertise in the lab, food-packaging powerhouse Bell Food provides a huge distribution advantage. Already, Lorenz Wyss, CEO of Bell Food Group, excitedly predicts, “Meat demand is soaring and in the future it won’t be met by livestock agriculture alone. We believe this technology can become a true alternative for environment-conscious consumers, and we are delighted to bring our know-how and expertise of the meat business into this strategic partnership with Mosa Meat.”
While the Green New Deal has been met with skepticism, the charging forces of climate change and technology are steaming ahead. Today, we have the computational and the mechatronic power to turn back the tides of destruction to implant positive change across the planet, quite possibly starting with scaling back animal agriculture. Even Winston Churchill commented in 1931, “We shall escape the absurdity of growing a whole chicken in order to eat the breast or wing, by growing these parts separately under a suitable medium.”
Are our food sources and AgTech networks under attack? Learn more at the next RobotLab on “Cybersecurity & Machines” with John Frankel of ffVC and Guy Franklin of SOSA on February 12th in New York City. RSVP today!