To err is algorithm: Algorithm fallibility and economic organisation

Algorithmic fails

Dig below the surface of some of today’s biggest tech controversies and you are likely to find an algorithm misfiring:[1]

These errors are not primarily caused by problems in the data that can make algorithms discriminatory, or their inability to improvise creatively. No, they stem from something more fundamental: the fact that algorithms, even when they are generating routine predictions based on non-biased data, will make errors. To err is algorithm.

The costs and benefits of algorithmic decision-making

We should not stop using algorithms simply because they make errors.[2] Without them, many popular and useful services would be unviable.[3] However, we need to recognise that algorithms are fallible, and that their failures have costs. This points at an important trade-off between more (algorithm-enabled) beneficial decisions and more (algorithm-caused) costly errors. Where lies the balance?

Economics is the science of trade-offs, so why not think about this topic like economists? This is what I have done ahead of this blog, creating three simple economics vignettes that look at key aspects of algorithmic decision-making.[4] These are the key questions:

  • Risk: When should we leave decisions to algorithms, and how accurate do those algorithms need to be?
  • Supervision: How do we combine human and machine intelligence to achieve desired outcomes?
  • Scale: What factors enable and constrain our ability to ramp up algorithmic decision-making?

The two sections that follow give the gist of the analysis and its implications. The appendix at the end describes the vignettes in more detail (with equations!).

Modelling the modelling

1. Risk: go with the odds

As the American psychologist and economist Herbert Simon once pointed out:

in an information rich world, attention becomes a scarce resource.

This applies to organisations as much as it does to individuals.

The ongoing data revolution risks overwhelming our ability to process information and make decisions, and algorithms can help address this. They are machines that automate decision-making, potentially increasing the number of good decisions that an organisation can make.[5] This explains why they have taken off first in industries where the volume and frequency of potential decisions go beyond what a human workforce can process.[6]

What drives this process? For an economist, the main question is how much value the algorithm will create with its decisions. Rational organisations will adopt algorithms with high expected values.

An algorithm’s expected value depends on two factors: its accuracy (the probability that it will make a correct decision), and the balance between the reward from a correct decision and the penalty from an erroneous one.[7]  Riskier decisions (where penalties are big compared to rewards) should be made by highly accurate algorithms. You would not want a flaky robot running a nuclear power station, but it might be ok if it is simply advising you about what TV show to watch tonight.

2. Supervision: watch out

We could bring in human supervisors to check the decisions made by the algorithm and fix any errors they find. This makes more sense if the algorithm is not very accurate (so supervisors do not spend a lot of time checking correct decisions), and if the net benefits from correcting the wrong decisions (i.e. extra rewards plus avoided penalties) are high. Costs matter too. A rational organisation has more incentive to hire human supervisors if they do not get paid a lot, and if they are highly productive (i.e. it only takes a few of them to do the job).

Following from the example before, if a human supervisor fixes a silly recommendation in a TV website, this is unlikely to create a lot of value for the owner. The situation in a nuclear power station is completely different.

3. Scale: a race between machines and reality

What happens when we scale up the number of algorithmic decisions? Are there any limits to their growth?

This depends on several things, including whether algorithms gain or lose accuracy as they make more decisions, and the costs of ramping up algorithmic decision-making. In this situation, there are two interesting races going on.

1. There is a race between an algorithm’s ability to learn from the decisions it makes, and the amount of information that it obtains from new decisions. New machine learning techniques help algorithms ‘learn from experience’, making them more accurate as they make more decisions.[8] However, more decisions can also degrade an algorithm’s accuracy. Perhaps it is forced to deal with weirder cases, or new situations it is not trained to deal with.[9] To make things worse, when an algorithm becomes very popular (makes more decisions), people have more reasons to game it.

My prior is that the ‘entropic forces’ that degrade algorithm accuracy will win out in the end: no matter how much more data you collect, it is just impossible to make perfect predictions about a complex, dynamic reality.

2. The second race is between the data scientists creating the algorithms and the supervisors checking those algorithms’ decisions. Data scientists are likely to ‘beat’ the human supervisors because their productivity is higher: a single algorithm, or an improvement in an algorithm, can be scaled up over millions of decisions. By contrast, supervisors need to check each decision individually. This means that as the number of decisions increases, most of the organisation’s labour bill will be spent on supervision, with potentially spiralling costs as the supervision process gets bigger and more complicated.

What happens at the end?

When considered together, the decline in algorithmic accuracy and the increase in labour costs I just described are likely to limit the number of algorithmic decisions an organisation can make economically. But if and when this happens depends on the specifics of the situation.

Implications for organisations and policy

The processes I discussed above have many interesting organisational and policy implications. Here are some of them:

1. Finding the right algorithm-domain fit

As I said, algorithms making decisions in situations where the stakes are high need to be very accurate to make up for high penalties when things go wrong.[10] On the flipside, if the penalty from making an error is low, even inaccurate algorithms might be up to the task.

For example, the recommendation engines in platforms like Amazon or Netflix often make irrelevant recommendations, but this is not a big problem because the penalty from these errors is relatively low – we just ignore them. Data scientist Hillary Parker picked up on the need to consider the fit between model accuracy and decision context in a recent edition of the ‘Not So Standard Deviations’ podcast:

Most statistical methods have been tuned for the clinical trial implementation where you are talking about people’s lives and people dying with the wrong treatment, whereas in business settings the trade-offs are completely different.

One implication from this is that organisations in ‘low-stakes’ environments can experiment with new and unproven algorithms, including some with low accuracy early on. As these are improved, they can be transferred to ‘high-stakes’ domains. The tech companies that develop these algorithms often release them as open source software for others to download and improve, making these spill-overs possible.

2. There are limits to algorithmic decision-making in high stakes domains

Algorithms need to be applied much more carefully in domains where the penalties from errors are high, such as health or the criminal justice system, and when dealing with groups who are more vulnerable to algorithmic errors.[11] Only highly accurate algorithms are suitable for these risky decisions, unless they are complemented with expensive human supervisors who can find and fix errors. This will create natural limits to algorithmic decision-making: how many people can you hire to check an expanded number of decisions? Human attention remains a bottleneck to more decisions.

If policymakers want more and better use of algorithms in these domains, they should invest in R&D to improve algorithmic accuracy, encourage the adoption of high-performing algorithms from other sectors, and experiment with new ways of organising that help algorithms and their supervisors work better as a team.

Commercial organisations are not immune to some of these problems: YouTube has, for example, started blocking adverts in videos with fewer than ten thousand views. In those videos, the rewards from correct algorithmic ad-matching are probably low (they have low viewership) and the penalties could be high (many of these videos are of dubious quality). In other words, these decisions have low expected value, so YouTube has decided to stop making them. Meanwhile, Facebook just announced that it is hiring 3,000 human supervisors (almost a fifth of its current workforce) to moderate the content in its network. You could imagine how the need to supervise more decisions might put some brakes on its ability to scale up algorithmic decision-making indefinitely.

3. The pros and cons of crowdsourced supervision

One way to keep supervision costs low and coverage of decisions high is to crowdsource supervision to users, for example by giving them tools to report errors and problems. YouTube, Facebook and Google have all done this in response to their algorithmic controversies. Alas, getting users to police online services can feel unfair and upsetting. As Sarah T Roberts, a Law professor, pointed out in a recent interview about the Facebook violent video controversy:

The way this material is often interrupted is because someone like you or me encounters it. This means a whole bunch of people saw it and flagged it, contributing their own labour and non-consensual exposure to something horrendous. How are we going to deal with community members who may have seen that and are traumatized today?

4. Why you should always keep a human in the loop

Even when penalties from error are low, it still makes sense to keep humans in the loop of algorithmic decision-making systems.[12] Their supervision provides a buffer against sudden declines in performance if (as) the accuracy of algorithms decreases.  When this happens, the number of erroneous decisions detected by humans and the net benefit from fixing them increase. They can also ring the alarm, letting everyone know that there is a problem with the algorithms that needs fixing.[13]

This could be particularly important in situations where errors create penalties with a delay, or penalties that are hard to measure or hidden (say if erroneous recommendations result in self-fulfilling prophecies, or costs that are incurred outside the organisation).

There are many examples of this. In the YouTube advertising controversy, the big accumulated penalty from previous errors only became apparent with a delay, when brands noticed that their adverts were being posted against hate videos. The controversy with fake news after the US election is an example of hard to measure costs: algorithms’ inability to discriminate between real news and hoaxes creates costs for society, potentially justifying stronger regulations and more human supervision. Politicians have made this point when calling on Facebook to step up its fight against fake news in the run-up to the UK election:

Looking at some of the work that has been done so far, they don’t respond fast enough or at all to some of the user referrals they can get. They can spot quite quickly when something goes viral. They should then be able to check whether that story is true or not and, if it is fake, blocking it or alerting people to the fact that it is disputed. It can’t just be users referring the validity of the story. They [Facebook] have to make a judgment about whether a story is fake or not.

5. From abstract models to real systems

Before we use economic models to inform action, we need to define and measure model accuracy, penalties and rewards, changes in algorithmic performance due to environmental volatility, levels of supervision and their costs, and that is only the beginning.[14]

This is hard but important work that could draw on existing technology assessment and evaluation tools, including methods to quantify non-economic outcomes (e.g. in health).[15] One could even use rich data from an organisation’s information systems to simulate the impact of algorithmic decision-making and its organisation before implementing it. We are seeing more examples of these applications, such as the financial ‘regtech’ pilots that the European Commission is running, or the ‘collusion incubators’ mentioned in a recent Economist article on price discrimination.

Coda: Piecemeal social engineering in an age of algorithms

In a Nature article last year, US researchers Ryan Calo and Kate Crawford called for

a practical and broadly applicable social-systems analysis [that] thinks through all the possible effects of AI systems on all parties [drawing on] philosophy, law, sociology, anthropology and science-and-technology studies, among other disciplines.

Calo and Crawford did not include economists in their list. Yet as this blog suggests, economic thinking has much to contribute to these important analyses and debates. Thinking about algorithmic decisions in terms of their benefits and costs, the organisational designs we can use to manage their downsides, and the impact of more decisions on the value that algorithms create can help us make better decisions about when and how to use them.

This reminds me of a point that Jaron Lanier made in his 2013 book, Who Owns the Future?:

With every passing year, economics must become more and more about the design of the machines that mediate human social behaviour. A networked information system guides people in a more direct, detailed and literal way than does policy. Another way to put it is that economics must turn into a large-scale, systemic version of user interface design.

Designing organisations where algorithms and humans work together to make better decisions will be an important part of this agenda.

Acknowledgements

This blog benefited from comments from Geoff Mulgan, and was inspired by conversations with John Davies. The image above represents a precision-recall curve in a multi-label classification problem. It shows the propensity of a random forests classification algorithm to make mistakes when one sets different rules (probability thresholds) for putting observations in a category.

Appendix: Three economics vignettes about algorithmic decision-making

The three vignettes below are very simplified formalisations of algorithmic decision-making situations. My main inspiration was Human fallibility and economic organization, a 1985 paper by Joe Stiglitz and Raj Sah where the authors model how two organisational designs – hierarchies and ‘polyarchies’ (flat organisations) – cope with human error. Their analysis shows that hierarchical organisations where decision-makers lower in the hierarchy are supervised by people further up tend to reject more good projects, while polyarchies where agents make decisions independently from each other, tend to accept more bad projects. A key lesson from their model is that errors are inevitable, and the optimal organisational design depends on context.

Vignette 1: Algorithm says maybe

Let’s imagine an online video company that matches adverts with videos in its catalogue. This company hosts millions of videos so it would be economically unviable for it to rely on human labour to do the job. Instead, its data scientists develop algorithms to do this automatically.[16] The company looks for the algorithm that maximises the expected value of the matching decisions. This value depends on three factors:[17]

  • Algorithm accuracy (a): The probability (between 0 and 1) that the algorithm will make the correct decision.[18]
  • Decision reward (r): The reward when the algorithm makes the right decision.
  • Error penalty (p): The cost of making the wrong decision.

We can combine accuracy, reward and penalty to calculate the expected value of the decision:

E = ar – (1-a)p [1]

This value is positive when the expected benefits from the algorithm’s decision outweigh the expected costs (or risks):

ar > (1-a)p [2]

Which is the same as saying that:

a/(1-a) > p/r [3]

The odds of making the right decision should be higher than the ratio between penalty and reward.
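Equations [1] to [3] can be checked numerically. The Python sketch below is only an illustration: the function names and the reward and penalty figures are assumptions made up for the example, not values from the analysis. It also uses the fact that rearranging [3] gives a break-even accuracy of p/(r+p).

```python
def expected_value(a, r, p):
    """Expected value of one algorithmic decision, as in equation [1]."""
    return a * r - (1 - a) * p


def break_even_accuracy(r, p):
    """Accuracy at which E = 0; rearranging [3] gives a = p / (r + p)."""
    return p / (r + p)


# Hypothetical low-stakes case (a TV recommendation): the penalty is half the reward.
tv_threshold = break_even_accuracy(r=1.0, p=0.5)
# Hypothetical high-stakes case: the penalty is 99 times the reward.
high_stakes_threshold = break_even_accuracy(r=1.0, p=99.0)

print(round(tv_threshold, 3))           # 0.333
print(round(high_stakes_threshold, 3))  # 0.99
```

Under these made-up numbers, an algorithm that is right one time in three already breaks even on recommendations, while the high-stakes decision needs 99% accuracy before it is worth automating.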

Enter human

We can reduce the risk of errors by bringing a human supervisor into the situation. This human supervisor can recognise and fix errors in algorithmic decisions. The impact of this strategy on the expected value of a decision depends on two parameters:

  • Coverage ratio (k): The probability that the human supervisor will check a decision by the algorithm. If k is 1, all algorithmic decisions are checked by a human.
  • Supervision cost (cs(k)): The cost of supervising the decisions of the algorithm. The cost depends on the coverage ratio k because checking more decisions takes time.

The expected value of an algorithmic decision with human supervision is the following:[19]

Es = ar + (1-a)kr – (1-a)(1-k)p – cs(k) [4]

This equation picks up the fact that some errors are detected and rectified (earning the reward r), and others are not (incurring the penalty p). We subtract [1] from [4] to obtain the extra expected value from supervision, which is (r+p)(1-a)k – cs(k). Supervision is worthwhile when this is positive, that is, when:

(r+p)(1-a)k > cs(k) [5]

Supervision only makes economic sense when its expected benefit (which depends on the probability that the algorithm has made a mistake, that this mistake is detected, and the net benefits from flipping a mistake into a correct decision) is larger than the cost of supervision.
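The supervision condition can be sketched in a few lines of Python. This is a hedged illustration rather than part of the original analysis: it assumes, in line with [5] and [6], that checked errors are fixed and earn the reward while unchecked errors incur the penalty, and the function names, the flat `cost` argument standing in for cs(k), and all numbers are hypothetical.

```python
def supervised_expected_value(a, r, p, k, cost):
    """Expected value per decision with supervision.

    A share k of errors is caught and fixed (earning r); the remaining
    share (1 - k) still incurs the penalty p. `cost` stands in for cs(k).
    """
    return a * r + (1 - a) * k * r - (1 - a) * (1 - k) * p - cost


def supervision_gain(a, r, p, k, cost):
    """Extra expected value from supervision: (r + p)(1 - a)k - cs(k), as in [5]."""
    return (r + p) * (1 - a) * k - cost


# Hypothetical numbers: an 80%-accurate algorithm, half of decisions checked.
gain = supervision_gain(a=0.8, r=1.0, p=4.0, k=0.5, cost=0.2)
print(round(gain, 3))  # 0.3, so supervision pays for itself here
```

With the same inputs but a = 0.99, the gain turns negative: the more accurate the algorithm, the less there is for supervisors to fix.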

Scaling up

Here, I consider what happens when we start increasing n, the number of decisions being made by the algorithm.

The expected value is:

E(n) = nar + n(1-a)kr – n(1-a)(1-k)p [6]

And the costs are C(n)

How do these things change as n grows?

I make some assumptions to simplify things: the organisation wants to hold k constant, and the rewards r and penalties p remain constant as n increases.[20]

This leaves us with two variables that change as n increases: a and C.

  • I assume that algorithmic accuracy a declines with the number of decisions because the processes that degrade accuracy are stronger than those that improve it.
  • I assume that C, the production costs, only depend on the labour of data scientists (Lds) and supervisors (Ls). These two occupations are paid salaries wds and ws respectively.

Based on this, and some calculus, we get the changes in expected benefits as we make more decisions as:

∂E(n)/∂n = kr + (a + n(∂a/∂n))(1-k)(r+p) – p(1-k) [7]

This means that as more decisions are made, the aggregated expected benefits grow in a way that is modified by changes in the marginal accuracy of the algorithm. On the one hand, more decisions mean scaled up benefits from more correct decisions. On the other, the decline in accuracy generates an increasing number of errors and penalties. Some of these are offset by human supervisors.

This is what happens with costs:

∂C/∂n = (∂C/∂Lds)(∂Lds/∂n) + (∂C/∂Ls)(∂Ls/∂n) [8]

As the number of decisions increases, costs grow because the organisation has to recruit more data scientists and supervisors.

Each extra worker costs a salary (∂C/∂Lds = wds and ∂C/∂Ls = ws). Writing zds = ∂n/∂Lds and zs = ∂n/∂Ls for the marginal productivity of each occupation, [8] is the same as saying:

∂C/∂n = wds/zds + ws/zs [9]

The labour costs of each occupation are directly related to its salary, and inversely related to its marginal productivity. If we assume that data scientists are more productive than supervisors, this means that most of the increases in costs with n will be caused by increases in the supervisor workforce.

The expected value (benefits minus costs) from decision-making for the organisation is maximised with an equilibrium number of decisions ne where the marginal value of an extra decision equals its marginal cost:

kr + (a + n(∂a/∂n))(1-k)(r+p) – p(1-k) = wds/zds + ws/zs [10]
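To see how an equilibrium number of decisions can emerge, here is a rough numerical sketch in Python. Everything specific in it is an assumption made for illustration: the linear accuracy decline, the fixed data-science bill (an extreme version of data scientists being far more productive than supervisors), the cost per checked decision, and all parameter values. It simply searches a grid of candidate values of n for the one with the highest expected benefit from [6] minus cost.

```python
def accuracy(n, a0=0.95, decay=1e-6):
    """Assumed accuracy curve: falls linearly with the number of decisions n."""
    return max(0.0, a0 - decay * n)


def expected_benefit(n, r, p, k):
    """Aggregate expected value of n decisions, as in equation [6]."""
    a = accuracy(n)
    return n * (a * r + (1 - a) * k * r - (1 - a) * (1 - k) * p)


def labour_cost(n, k, fixed_ds_cost=1000.0, cost_per_check=0.5):
    """Assumed cost: a fixed data-science bill plus a cost per checked decision."""
    return fixed_ds_cost + cost_per_check * k * n


def best_scale(r, p, k, candidates):
    """Return the candidate n with the highest benefit minus cost."""
    return max(candidates,
               key=lambda n: expected_benefit(n, r, p, k) - labour_cost(n, k))


# Hypothetical rewards, penalties and coverage; grid in steps of 1,000 decisions.
candidates = range(0, 400_001, 1_000)
n_e = best_scale(r=1.0, p=2.0, k=0.1, candidates=candidates)
print(n_e)
```

Under these assumptions the net value rises and then falls as n grows, so the search returns an interior value of n rather than the largest candidate. Different assumptions about accuracy or costs could remove that limit, which is why the outcome depends on the specifics of the situation.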

Extensions

Above, I have kept things simple by making some strong assumptions about each of the situations being modelled. What would happen if we relaxed these assumptions?

Here are some ideas:

Varieties of error

First, the analysis does not take into account that different types of errors (e.g. false positives and negatives, errors made with different degrees of certainty etc.) could have different rewards and penalties. I have also assumed certainty in rewards and penalties, when it would be more realistic to model them as random draws from probability distributions. This extension would help incorporate fairness and bias into the analysis. For example, if errors are more likely to affect vulnerable people (who suffer higher penalties), and these errors are less likely to be detected, this could increase the expected penalty from errors.
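As a small illustration of why this matters, the Python sketch below splits errors into false positives and false negatives with separate penalties. The function name, the probabilities and the penalty values are all hypothetical; the point is only that two algorithms with the same overall accuracy can have very different expected values once error types are priced differently.

```python
def expected_value_by_error_type(p_correct, p_fp, p_fn, r, penalty_fp, penalty_fn):
    """Expected value per decision when the two error types carry different penalties."""
    assert abs(p_correct + p_fp + p_fn - 1.0) < 1e-9, "probabilities must sum to 1"
    return p_correct * r - p_fp * penalty_fp - p_fn * penalty_fn


# Two hypothetical algorithms, both 90% accurate, with opposite error profiles.
# False negatives are assumed to be ten times as costly as false positives.
mostly_false_positives = expected_value_by_error_type(0.90, 0.08, 0.02, 1.0, 1.0, 10.0)
mostly_false_negatives = expected_value_by_error_type(0.90, 0.02, 0.08, 1.0, 1.0, 10.0)

print(round(mostly_false_positives, 2))  # 0.62
print(round(mostly_false_negatives, 2))  # 0.08
```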

Humans are not perfect either

All of the above assumes that algorithms err but humans do not. This is clearly not the case. In many domains, algorithms can be a desirable alternative to humans with deep-rooted biases and prejudices. In those situations, humans’ ability to detect and address errors is impaired, and this reduces the incentives to recruit them (this is equivalent to a decline in their productivity). Organisations deal with all this by investing in technologies (e.g. crowdsourcing platforms) and quality assurance systems (including extra layers of human and algorithmic supervision) that manage the risks of human and algorithmic fallibility.

Scaling up rewards and penalties

Before, I assumed that the marginal penalties and rewards remain constant as the number of algorithmic decisions increases. This need not be the case. The table below shows examples of situations where these parameters change with the number of decisions being made:

|           | Increases with more decisions | Decreases with more decisions |
| Rewards   | The organisation gains market power, or is able to use price discrimination in more transactions | The organisation runs out of valuable decisions to make |
| Penalties | The organisation becomes more prominent and its mistakes receive more attention | Users get accustomed to errors |

Getting an empirical handle on these processes is very important, as they could determine if there is a natural limit to the number of algorithmic decisions that an organisation can make economically in a domain or market, with potential implications for its regulation.

Endnotes

[1] I use the term ‘algorithm’ in a restricted sense, to refer to technologies that turn information into predictions (and depending on the system receiving the predictions, decisions). There are many processes to do this, including rule-based systems, statistical systems, machine learning systems and Artificial Intelligence (AI). These systems vary on their accuracy, scalability, interpretability, and ability to learn from experience, so their specific features should be considered in the analysis of algorithmic trade-offs.

[2] One could even say that machine learning is the science that manages trade-offs caused by the impossibility of eliminating algorithmic error. The famous ‘bias-variance’ trade off between fitting a model to known observations and predicting unknown ones is a good example of this.

[3] Some people would say that personalisation is undesirable because it can lead to discrimination and ‘filter bubbles’, but that is a question for another blog post.

[4] Dani Rodrik’s ‘Economics Rules’ makes a compelling case for models as simplistic but useful formalisations of complex reality.

[5] In a 2016 Harvard Business Review article, Ajay Agrawal and colleagues sketched out an economic analysis of machine learning as a technology that lowers the costs of prediction. My way of looking at algorithms is similar because predictions are inputs into decision-making.

[6] This includes personalised experiences and recommendations in e-commerce and social networking sites, or fraud detection and algorithmic trading in finance.

[7] For example, if YouTube shows me an advert which is highly relevant to my interests, I might buy the product, and this generates income for the advertiser, the video producer and YouTube. If it shows me a completely irrelevant or even offensive advert, I might stop using YouTube, or kick up a fuss in my social network of choice.

[8] Reinforcement learning builds agents that use the rewards and penalties from previous actions to make new decisions.

[9] This is what happened with the Google Flu Trends system used to predict flu outbreaks based on Google searches – people changed their search behaviour, and the algorithm broke down.

[10] In many cases, the penalties might be so high that we decide that an algorithm should never be used, unless it is supervised by humans.

[11] Unfortunately, care is not always taken when implementing algorithmic systems in high-stakes situations. Cathy O’Neil’s ‘Weapons of Math Destruction’ gives many examples of this, going from the criminal justice system to university admissions.

[12] Mechanisms for accountability and due process are another example of human supervision.

[13] Using Albert Hirschman’s model of exit, voice and loyalty, we could say that supervision plays the role of ‘voice’, helping organisations detect a decline in quality before users begin exiting.

[14] The appendix flags up some of my key assumptions, and suggests extensions.

[15] This includes rigorous evaluation of algorithmic decision-making and its organisation using Randomised Controlled Trial methods like those proposed by Nesta’s Innovation Growth Lab.

[16] This decision could be based on how well similar adverts perform when matched with different types of videos, on demographic information about the people who watch the videos, or other things.

[17] The analysis in this blog assumes that the results of algorithmic decisions are independent from each other. This assumption might be violated in situations where algorithms generate self-fulfilling prophecies (e.g. logically, a user is more likely to click an advert she is shown than one she is not). This is a hard problem to tackle, but researchers are developing methods based on randomisation of algorithmic decisions to address it.

[18] This does not distinguish between different types of error (e.g. false positives and false negatives). I come back to this at the end.

[19] Here, I am assuming that human supervisors are perfectly accurate. As we know from behavioural economics, this is a very strong assumption. I consider this issue at the end.

[20] I consider the implications of making different assumptions about marginal rewards and penalties at the end.

This post was originally published on Nesta.

The Force was strong in this robot competition

An Imperial Snowtrooper inspects a competitor’s entry at the 2017 MIT Mechanical Engineering 2.007 Student Design Final Robot Competition. Photo: Tony Pulsone

Thursday night, dozens of robots designed and built by undergraduates in a mechanical engineering class endured hours of intense, boisterous, and often jubilant competition as they scrambled to rack up points in one-on-one clashes on special “Star Wars”-themed playing arenas.

As has often happened in these contests — which have been going on, and constantly evolving, since 1970 — the ultimate winner in the single-elimination tournament was not the one that’d most consistently racked up the highest scores all evening. Rather, it was a high-scoring bot that triumphed when its competitor missed a crucial scoring opportunity because its starting position was just slightly out of alignment.

The class, 2.007 (Design and Manufacturing I), which has 165 mostly sophomore students, begins by giving each student an identical kit of parts, from which they each have to create a robot to carry out a variety of tasks to score points. This year, in a nod to the 40th anniversary of the first “Star Wars” film, released in 1977, the robots crawled around and over a replica of a “Star Wars” X-wing Starfighter. Students could earn points by pulling up a sliding frame to rescue prisoners trapped in carbonite; by dumping Imperial stormtroopers into a trash trench; by activating a cantina band; or by spinning up one or both of two large cylindrical thrusters on the wings. Students could choose which tasks to have their robot try to accomplish, and had just one semester to design, test, and operate their bot.

The devices could be pre-programmed to carry out set tasks, but could also be manually controlled through a radio-linked controller. As in past years, the open-ended nature of the assignment — and the variety of different ways to score — led to a wide range of strategies and designs, spanning from tall towers that would extend by telescoping out or with hinged sections, to elevator-like lifting devices, to small and nimble bots that scurried around to carry out multiple tasks, to an array of arms and devices for grasping or turning the different pieces. They sported names like Dodocopter, Bonnie and Clyde, Pitfall, Torque Toilet, Spinit to Winit, and Nicki Spinaj.

Students could earn extra points by accomplishing any of the tasks during an initial period when the robot had to perform autonomously, before the start of a manually remote-controlled round. The students were allowed to create multiple robots to carry out different tasks, as long as they were all made from the basic kit of parts, and all fit within a designated starting area. Most of the students opted to build two devices, and some even made three.

Second-place finisher Richard Moyer, with his small but powerful and robust robot called Tornado, consistently scored 960.5 points in every round (the highest score achieved by any of the bots), by spinning both the lower and upper thrusters to their maximum speeds — and by using the lower thruster during the high-scoring autonomous period. But on the final matchup, Tornado was just slightly out of place in the starting box, and missed the thruster, losing out on that big initial score.

The robot used a simple but reliable design, which sported a single horizontally-mounted drive wheel that it used to spin both the lower and upper thrusters, and also to activate an elevator mechanism that carried it from one wing to the other. It was “like the Swiss army knife of robots,” thanks to this multifunction device, said Sangbae Kim, an associate professor of mechanical engineering and co-instructor of the course, who was dressed as the “Star Wars” wookie, Chewbacca.

The grand-prize winner, Tom Frejowski, also built a compact, powerful robot that concentrated on the spinning task, and scored 640 in the final round to take home the top trophy (a replica of the MIT dome). To ensure that it made a straight shot from the starting position to the thruster and lined up just right to spin the heavy cylinder, Frejowski’s robot used a single motor to drive both of its front wheels, which helped him earn consistently high scores. “That’s how he goes dead straight every time,” said co-instructor Amos Winter, an assistant professor of mechanical engineering, who was dressed as Darth Vader and shared the emcee duties with Kim.

During the tournament, which took place in the Johnson Ice Rink, all of the course teachers and assistants were dressed in various “Star Wars” costumes, and a packed audience of fellow students, families, and visitors of all ages cheered their encouragement with great enthusiasm. During a break, each of the teaching assistants was presented with a special memento: a beaver-cut twig from a beaver dam in Nova Scotia, symbolizing MIT’s beaver mascot, and nature’s original mechanical engineer.

Echoing the sentiments of many students in the class, sophomore James Li said of the class in a pre-taped video: “I had a bit of building experience, but I never had to design and build anything of this complexity. … It was a great experience.”

RoboCup video series: 20 years of history

RoboCup is an international scientific initiative with the goal to advance the state of the art of intelligent robots. Established in 1997, the original mission was to field a team of robots capable of winning against the human soccer World Cup champions by 2050. 

The competition has now grown into an international movement with a variety of leagues that go beyond soccer. Teams compete to make robots for rescue missions, the home, and industry. And it’s not just researchers, kids also have their own league. Last year, almost 3,000 participants and 1,200 robots competed.

To celebrate 20 years of RoboCup, the Federation is launching a video series featuring each of the leagues with one short video for those who just want a taster, and one long video for the full story. Robohub will be featuring one league every week leading up to RoboCup 2017 in Nagoya, Japan.

This week, we take a whirlwind tour of the RoboCup competition, spanning all the leagues. You’ll hear about the history and ambitions of RoboCup from the trustees, and inspiring teams from around the world.

Short Version

Long Version

Can’t wait to watch the rest? You can view all the videos on the RoboCup playlist below:
https://www.youtube.com/playlist?list=PLEfaZULTeP_-bqFvCLBWnOvFAgkHTWbWC

Please spread the word! And if you would like to join a team, check here for more information.

Watch this omnicopter fetch a ball

We have developed a computationally efficient trajectory generator for six degrees-of-freedom multirotor vehicles, i.e. vehicles that can independently control their position and attitude. The trajectory generator is capable of generating approximately 500’000 trajectories per second that guide the multirotor vehicle from any initial state, i.e. position, velocity and attitude, to any desired final state in a given time. In this video, we show an example application that requires the evaluation of a large number of trajectories in real time.
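To give a feel for what guiding a vehicle "from any initial state to any desired final state in a given time" can involve, here is a minimal Python sketch for a single translational axis. It is not the authors' generator: it assumes a quintic polynomial per axis, ignores attitude and actuator limits, and the names `AxisState`, `quintic_coeffs` and `evaluate` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AxisState:
    """Position, velocity and acceleration along one axis (illustrative type)."""
    pos: float
    vel: float
    acc: float


def quintic_coeffs(start: AxisState, end: AxisState, duration: float):
    """Coefficients c0..c5 of p(t) = sum(c_k * t**k) matching both end states.

    Assumption: a quintic polynomial is used because it has exactly six free
    coefficients for the six boundary conditions (pos, vel, acc at each end).
    """
    if duration <= 0.0:
        raise ValueError("duration must be positive")
    T = duration
    c0, c1, c2 = start.pos, start.vel, 0.5 * start.acc
    # Residuals left after the start conditions have been propagated to time T.
    dp = end.pos - (start.pos + start.vel * T + 0.5 * start.acc * T**2)
    dv = end.vel - (start.vel + start.acc * T)
    da = end.acc - start.acc
    # Closed-form solution of the remaining 3x3 linear system.
    c3 = (10.0 * dp - 4.0 * dv * T + 0.5 * da * T**2) / T**3
    c4 = (-15.0 * dp + 7.0 * dv * T - da * T**2) / T**4
    c5 = (6.0 * dp - 3.0 * dv * T + 0.5 * da * T**2) / T**5
    return (c0, c1, c2, c3, c4, c5)


def evaluate(coeffs, t: float) -> AxisState:
    """State along the polynomial trajectory at time t."""
    c0, c1, c2, c3, c4, c5 = coeffs
    pos = c0 + c1 * t + c2 * t**2 + c3 * t**3 + c4 * t**4 + c5 * t**5
    vel = c1 + 2 * c2 * t + 3 * c3 * t**2 + 4 * c4 * t**3 + 5 * c5 * t**4
    acc = 2 * c2 + 6 * c3 * t + 12 * c4 * t**2 + 20 * c5 * t**3
    return AxisState(pos, vel, acc)


if __name__ == "__main__":
    # Rest-to-rest move of one metre in two seconds along a single axis.
    coeffs = quintic_coeffs(AxisState(0.0, 0.0, 0.0), AxisState(1.0, 0.0, 0.0), 2.0)
    for t in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(t, evaluate(coeffs, t))
```

Because the coefficients come from a closed-form expression, each candidate costs only a handful of arithmetic operations. That kind of property is what makes it plausible to evaluate very large numbers of candidates per second, although the real generator also has to handle attitude and feasibility checks that this sketch leaves out.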

Multirotor vehicle

The multirotor vehicle used in the demonstration is an omni-directional eight-rotor vehicle. Its unique actuator configuration gives it full force and torque authority in all three dimensions, allowing it to fly novel maneuvers. For more details, please refer to the YouTube video or the research paper: “Design, Modeling and Control of an Omni-Directional Aerial Vehicle”, IEEE International Conference on Robotics and Automation (ICRA), 2016.

Researchers

Dario Brescianini and Raffaello D’Andrea
Institute for Dynamic Systems and Control (IDSC), ETH Zurich, Switzerland – http://www.idsc.ethz.ch

Location

Flying Machine Arena, ETH Zurich, Switzerland.

Acknowledgements

This work is supported by and builds upon prior contributions by numerous collaborators in the Flying Machine Arena project. See the list here. This research was supported by the Swiss National Science Foundation (SNSF).

Kids celebrate robotics at RoboFes 2017

Robo Done, the robotic academy franchise for kids from Osaka, Japan, celebrated Japan’s Day of the Children on the 5th of May at their annual event, Robot Festival 2017 or RoboFes. The event welcomed over 1,000 attendees, including children and their parents.

This was the second time Robo Done has held the festival. In just one year, attendance has nearly tripled (from 350 attendees in 2016 to 1,012 in 2017). It was held at the KANDAI MeRise Campus of Kansai University in Osaka, Japan, and has become the biggest event at the campus.

The main activity was the Robot Contest, using LEGO Mindstorms, with morning and afternoon leagues. Over 200 children, aged 6 and up, participated in the championship. The kids built robots in pairs and programmed their creations, repeating the process of trial and error against a time limit. Several IT and robot-related companies had booths, as did students of the university, offering a variety of activities for the kids to enjoy.

Robo Done will hold RoboFes again in 2018, hoping to inspire even more kids to enjoy robotics and programming. We hope RoboFes will become a regular event during Japan’s “Golden Week!”

On the future of human-centered robotics

“The new frontier is learning how to design the relationships between people, robots, and infrastructure,” says David Mindell, the Dibner Professor of the History of Engineering and Manufacturing, and a professor of aeronautics and astronautics. “We need new sensors, new software, new ways of architecting systems.” Photo: Len Rubenstein

Science and technology are essential tools for innovation, and to reap their full potential, we also need to articulate and solve the many aspects of today’s global issues that are rooted in the political, cultural, and economic realities of the human world. With that mission in mind, MIT’s School of Humanities, Arts, and Social Sciences has launched The Human Factor — an ongoing series of stories and interviews that highlight research on the human dimensions of global challenges. Contributors to this series also share ideas for cultivating the multidisciplinary collaborations needed to solve the major civilizational issues of our time.

David Mindell, the Frances and David Dibner Professor of the History of Engineering and Manufacturing and Professor of Aeronautics and Astronautics at MIT, researches the intersections of human behavior, technological innovation, and automation. Mindell is the author of five acclaimed books, most recently “Our Robots, Ourselves: Robotics and the Myths of Autonomy” (Viking, 2015). He is also the co-founder of Humatics Corporation, which develops technologies for human-centered automation. SHASS Communications recently asked him to share his thoughts on the relationship of robotics to human activities, and the role of multidisciplinary research in solving complex global issues.

Q: A major theme in recent political discourse has been the perceived impact of robots and automation on the United States labor economy. In your research into the relationship between human activity and robotics, what insights have you gained that inform the future of human jobs, and the direction of technological innovation?

A: In looking at how people have designed, used, and adopted robotics in extreme environments like the deep ocean, aviation, or space, my most recent work shows how robotics and automation carry with them human assumptions about how work gets done, and how technology alters those assumptions. For example, the U.S. Air Force’s Predator drones were originally envisioned as fully autonomous — able to fly without any human assistance. In the end, these drones require hundreds of people to operate.

The new success of robots will depend on how well they situate into human environments. As in chess, the strongest players are often the combinations of human and machine. I increasingly see that the three critical elements are people, robots, and infrastructure — all interdependent.

Q: In your recent book “Our Robots, Ourselves,” you describe the success of a human-centered robotics, and explain why it is the more promising research direction — rather than research that aims for total robotic autonomy. How is your perspective being received by robotic engineers and other technologists, and do you see examples of research projects that are aiming at human-centered robotics?

A: One still hears researchers describe full autonomy as the only way to go; often they overlook the multitude of human intentions built into even the most autonomous systems, and the infrastructure that surrounds them. My work describes situated autonomy, where autonomous systems can be highly functional within human environments such as factories or cities. Autonomy as a means of moving through physical environments has made enormous strides in the past ten years. As a means of moving through human environments, we are only just beginning. The new frontier is learning how to design the relationships between people, robots, and infrastructure. We need new sensors, new software, new ways of architecting systems.

Q: What can the study of the history of technology teach us about the future of robotics?

A: The history of technology does not predict the future, but it does offer rich examples of how people build and interact with technology, and how it evolves over time. Some problems just keep coming up over and over again, in new forms in each generation. When the historian notices such patterns, he can begin to ask: Is there some fundamental phenomenon here? If it is fundamental, how is it likely to appear in the next generation? Might the dynamics be altered in unexpected ways by human or technical innovations?

One such pattern is how autonomous systems have been rendered less autonomous when they make their way into real world human environments. Like the Predator drone, future military robots will likely be linked to human commanders and analysts in some ways as well. Rather than eliding those links, designing them to be as robust and effective as possible is a worthy focus for researchers’ attention.

Q: MIT President L. Rafael Reif has said that the solutions to today’s challenges depend on marrying advanced technical and scientific capabilities with a deep understanding of the world’s political, cultural, and economic realities. What barriers do you see to multidisciplinary, sociotechnical collaborations, and how can we overcome them?

A: I fear that as our technical education and research continues to excel, we are building human perspectives into technologies in ways not visible to our students. All data, for example, is socially inflected, and we are building systems that learn from those data and act in the world. As a colleague from Stanford recently observed, go to Google image search and type in “Grandma” and you’ll see the social bias that can leak into data sets — the top results all appear white and middle class.

Now think of those data sets as bases of decision making for vehicles like cars or trucks, and we become aware of the social and political dimensions that we need to build into systems to serve human needs. For example, should driverless cars adjust their expectations for pedestrian behavior according to the neighborhoods they’re in?

Meanwhile, too much of the humanities has developed islands of specialized discourse that are inaccessible to outsiders. I used to be more optimistic about multidisciplinary collaborations to address these problems. Departments and schools are great for organizing undergraduate majors and graduate education, but the old two-cultures divides remain deeply embedded in the daily practices of how we do our work. I’ve long believed MIT needs a new school to address these synthetic, far-reaching questions and train students to think in entirely new ways.

Interview prepared by MIT SHASS Communications
Editorial team: Emily Hiestand (series editor), Daniel Evans Pritchard

The Drone Center’s Weekly Roundup 5/15/17

Sailors assigned to Explosive Ordnance Disposal Mobile Unit 5 (EODMU5) Platoon 142 recover an unmanned underwater vehicle onto a Coastal Riverine Group 1 Detachment Guam MK VI patrol boat in the Pacific Ocean May 10, 2017. Credit: Mass Communication Specialist 1st Class Torrey W. Lee/ U.S. Navy

May 8, 2017 – May 14, 2017

If you would like to receive the Weekly Roundup in your inbox, please subscribe at the bottom of the page.

News

The International Civil Aviation Organization announced that it plans to develop global standards for small unmanned aircraft traffic management. In a statement at the Association for Unmanned Vehicle Systems International’s Xponential trade conference, the United Nations agency said that as part of the initiative it has issued a Request for Information on air traffic management systems for drones. (GPS World)

Virginia Governor Terry McAuliffe has created a new office dedicated to drones and autonomous systems. According to Gov. McAuliffe, the Autonomous Systems Center for Excellence will serve as a “clearinghouse and coordination point” for research and development programs related to autonomous technologies. (StateScoop)

Commentary, Analysis, and Art

At the Telegraph, Alan Tovey writes that the U.K.’s exit from the European Union is unlikely to affect cross-channel cooperation on developing fighter drones.

At the Dead Prussian Podcast, Ulrike Franke discusses the role that drones currently play in the military.

At IHS Jane’s 360, Daniel Wasserbly writes that the U.S. Marine Corps will slow its acquisition of the Boeing Insitu Blackjack drone.

At the Bulletin of the Atomic Scientists, James Rogers argues that the Trump administration policy on drones is “likely to prove counterproductive.”

At IEEE Spectrum, David Schneider examines state and local drone regulations.

In the Journal of Archaeological Science, Sean Field, Matt Waite, and LuAnn Wandsnider consider the utility of drones for archeological surveys.

At RJI Online, Jennifer Nelson looks at what a television station in Idaho is learning about using drones for news coverage.

A report by the European Center for Constitutional and Human Rights considers the “impact of drone attacks on law, warfare and society.”

At The New York Times, William Grimes visits “Drones: Is the Sky the Limit?,” a new exhibition at the Intrepid Sea, Air & Space Museum.

In a paper in the International Organization journal, Matthew Fuhrmann and Michael C. Horowitz consider the reasons that states acquire drones.

At Bloomberg, Justin Bachman looks at how different companies are seeking an advantage in managing data from drones for commercial purposes.

At the Associated Press, Dario Lopez and Joshua Goodman write about a U.S. Coast Guard program using drones to counter maritime smuggling.

In a speech at the Xponential 2017 trade show, Intel Corporation CEO Brian Krzanich argued that data will be the most significant aspect of the drone industry. (AUVSI)

At the South China Morning Post, Li Tao writes that China’s popular consumer drone brands are increasingly turning to the commercial sector.

At Defense One, Marcus Weisgerber writes that the Pentagon is using machine-learning to help identify ISIS targets.

Know Your Drone

Saudi Arabia’s King Abdulaziz City for Science and Technology unveiled the Saqr 1, an armed drone with a range of up to 2,500 km. (IHS Jane’s 360)  

U.S. drone maker AeroVironment unveiled the Snipe, a nano quadcopter that weighs just 150 grams. (New Atlas)

In a test, startup Volans-i flew a delivery drone along a 100-mile route in Texas, a new record for a drone delivery. (TechCrunch)

Energy firm twingtec is developing a tethered drone that harvests power from the wind. (Design Boom)

The U.S. Army is seeking a midsize cargo drone that could operate with a high level of autonomy. (FlightGlobal)

Nautilus, a California startup, is developing a cargo drone that could carry thousands of pounds of goods over long distances. (Air & Space Magazine)

Drone maker Pulse Aerospace unveiled two new rotorcraft drones for military and commercial applications, the Radius 65 and the Vapor 15. (Press Release)

Piasecki Aerospace will likely submit its ARES demonstrator drone for the U.S. Marine Corps’ Unmanned Expeditionary Capabilities program. (FlightGlobal)

Turkish defense firm Aselsan has unveiled two new counter-drone systems. (IHS Jane’s 360)

Defense firm Kratos confirmed that it has conducted several demonstration flights of a high performance jet drone for an undisclosed customer. (FlightGlobal)

Technology firm Southwest Research Institute has been granted a patent for a system by which military drones can collaborate with unmanned ground vehicles. (Unmanned Aerial Online)

The U.S. Army is interested in developing a mid-size unmanned cargo vehicle that could carry up to 800 pounds of payload. (FlightGlobal)

A student at the Milwaukee Institute of Art and Design has created a drone designed to help parents track their children. (Milwaukee Journal Sentinel)

French drone maker Parrot is set to begin developing a line of prosumer drones. (Recode)

Defense firm Qinetiq has announced that it will pursue the U.S. Army’s Lightweight Reconnaissance Robot program. (IHS Jane’s 360)

The U.S. Army is seeking a replacement engine for the RQ-7 Shadow tactical drone. (FlightGlobal) 

Researchers at Carnegie Mellon have been crashing autonomous drones repeatedly in order to teach them how to avoid crashing. (IEEE)

An Air Force investigation found that the cause of the crash of an MQ-9 Reaper drone in Nevada last summer was pilot error. (Press Release)

A Defense Advanced Research Projects Agency press release describes in detail its recent military academy swarming competition.

Raytheon announced that it has installed ground-based sense-and-avoid systems at a number of air bases in the U.S. (IHS Jane’s 360)

The Digital Circuit has put together a compilation of images of some of the more interesting drones at this year’s Xponential drone conference.

Drones at Work

A drone flying over a bike race in Rancho Cordova, California crashed into a cyclist. (MarketWatch)

Meanwhile, a consumer drone crashed into a car crossing the Sydney Harbour Bridge in Australia. It is the second time a drone has crashed at the site of the bridge in the past nine months. (Sydney Morning Herald)

Insurance company Travelers has trained over 150 drone operators to use drones for insurance appraisals over properties. (Insurance Journal)

Kazakhstan’s armed forces displayed a number of its recently acquired unmanned aircraft during a military parade. (IHS Jane’s 360)

A Latvian technology firm used a large multirotor drone to carry a skydiver to altitude before he parachuted back down to earth. (Phys.org)

Clear Flight Solutions and AERIUM Analytics are set to begin integrating the Robird drone system, a falcon-like drone that scares birds away from air traffic, at Edmonton International Airport. (Unmanned Systems Technology)

Industry Intel

The U.S. Army awarded General Atomics Aeronautical Systems a $221.6 million contract modification for 20 extended range Gray Eagle drones and associated equipment. (DoD)

The U.S. Air Force awarded General Electric a $14 million contract for work that includes the Thermal Management System for unmanned aircraft. (DoD)

The U.S. Navy awarded Boeing Insitu an $8.1 million contract for spare parts for the RQ-21A Blackjack. (DoD)

The United Arab Emirates awarded Canada-based CAE a contract estimated at $40.9 million to train drone operators. (UPI)

Airbus opened a subsidiary in Atlanta that will sell imagery from satellites and drones to commercial clients. (AIN Online)

Turkish Aerospace Industries will begin cooperating with ANTONOV Company on the development of unmanned systems. (Press Release)

Aker, a company that develops drones for agriculture, won $950,000 in funding from the Clean Energy Trust Challenge. (Chicago Tribune)

For updates, news, and commentary, follow us on Twitter. The Weekly Drone Roundup is a newsletter from the Center for the Study of the Drone. It covers news, commentary, analysis and technology from the drone world. You can subscribe to the Roundup here.

Drones land back to Earth at Xponential 2017

PhoneDrone Ethos, Kickstarter campaign. Credit: xCraft/YouTube

JD Claridge’s story epitomizes the current state of the drone industry. Claridge, founder of xCraft, is best known for being the first contestant on Shark Tank to receive money from all the Sharks – even Kevin O’Leary! Walking the floor of Xponential 2017, the annual convention of the Association for Unmanned Vehicle Systems International (AUVSI), Claridge remarked to me how the drone industry has grown up since his TV appearance.

Claridge has gone from pitching cellphone cases that turn into drones (aka phonedrone) to solving mission-critical problems. The age of fully autonomous flight is near, and the drone industry is finally recovering from the hangover of overhyped Kickstarter videos (see Lily drone’s $34 million fraud). xCraft’s pivot to lightweight, power-efficient enterprise drones is an example of this evolved marketplace. During the three days of Xponential 2017, several far-reaching announcements were made between stalwarts of the tech industry and aviation startups. Claridge introduced me to his new partner, Rajant, which is a leader in industrial wireless networks. xCraft’s latest models utilize Rajant’s mesh networks to launch swarms of drones with one controller. Flying more drones simultaneously lets users work around the flight-time limitations of lithium batteries by covering greater areas within a single mission.

Bob Schena, Rajant’s CEO, said, “Rajant’s network technology now makes it possible for one pilot to operate many aircrafts concurrently, with flight times of 45 minutes. We’re pleased to partner with xCraft and bring more intelligence, mobility and autonomy to UAV communication infrastructures covering greater aerial distances while supporting various drone payloads.”

The battery has been the Achilles heel of the small drone industry since its inception. While large winged craft rely heavily on fossil fuels, multirotor battery-operated drones have been plagued with short missions of under 45 minutes. Innovators like Claridge are leading the way for a new wave of creative solutions:

Solar Powered Wings

Airbus showcased its Zephyr drones, or High Altitude Pseudo-Satellite (HAPS) UAVs, which use solar-powered wings. Zephyr UAVs can fly for months at a time, saving thousands of tons of fuel. The HAPS also offers a number of lightweight payload options, from voice communications to persistent internet to real-time surveillance. Airbus was not the only solar solution on display; there were a handful of Chinese upstarts and solar cell purveyors for retrofitting existing aircraft.

Hybrid Fuel Solutions  

In the Startup Pavilion, William Fredericks of the Advanced Aircraft Company (AAC) demoed a novel technology using a hybrid of diesel fuel and lithium batteries with flexible fixed wings and multirotors, resulting in over 3 hours of flying time. AAC’s prototype, the Hercules, is remarkably lightweight and fast. Fredericks is an aircraft designer by trade with 12 designs flying in the air, including NASA’s Greased Lightning, which looks remarkably similar to Boeing’s Osprey. The Hercules is available for sale on the company’s website for multiple use cases, including agriculture, first response, and package delivery. It is interesting to note that a few rows from Fredericks’s booth was his former employer, NASA, promoting its new Autonomy Incubator for “intelligent flight systems” and its “autonomy innovation lab” (definitely an incubator to watch).

Vertical Take Off & Landing

In addition to hybrid fuel strategies, entrepreneurs are also rethinking launch procedures. AAC’s Hercules and xCraft’s commercial line of drones take off vertically to reduce wind resistance and maximize energy efficiency. Australian startup Iridium Dynamics takes this approach to a new level with astonishing results. Its winged craft, Halo, uses a patent-pending “hover thrust” of its entire craft so that its wings actually create the vertical lift to hover with minimal power. The drone also has two rotors to fly horizontally. According to Dion Gonano, Control Systems Engineer, it can fly for over 2 hours. The Halo also lands vertically into a stationary mechanical arm. While the website lists a number of commercial applications for this technology, it was unclear in my discussions with Gonano if they have deployed this technology in real tests.

New Charging Efficiencies

Prior to Xponential, Seattle-based WiBotic announced the closing of its $2.5 million seed round to fund its next generation of battery charging technologies. The company has created a novel approach to wireless inductive charging for robotics. Its charging platform includes a patent-pending auto-detect feature that can begin recharging once the robot enters the proximity of the base station, even during flight. According to Dr. Ben Waters, the CEO, it charges faster than traditional solutions presently on the market. Dr. Waters demonstrated for me its suite of software tools that monitor battery performance, providing clients with a complete power management analytics platform. WiBotic is already piloting its technology with leading commercial customers in the energy and security sectors. WiBotic is the first inductive charging platform; other companies have created innovative battery-swapping techniques. Airobotics’ unique drone storage box, currently deployed at power plants in Israel, includes a robotic arm, housed inside, that services the robot post-flight by switching out the payload and battery.

Reducing Payload Weight

In addition to aircraft design, payload weight is a big factor in battery drain. A growing trend within the industry is reducing the size and cost of components. Ultimately, the mission of a drone is directly related to the type of payload, from cameras for collecting images to Light Detection and Ranging (Lidar) sensors for precise measurements. Lidar is typically deployed in autonomous vehicles to provide the most precise position for the robot in a crowded area, like a self-driving car on the road. However, Lidar is currently extremely expensive and too large for many multirotor surveys. Chris Brown of Z-Senz, a former scientist with the National Institute of Standards and Technology (NIST), hopes to change the landscape of drones with his miniaturized Lidar sensor. Brown’s reduced sensor, SKY1, offers major advantages in size, weight, and power consumption without losing accuracy in long-distance sensing. A recent study estimates that the Lidar market will exceed $5 billion by 2022, with Velodyne and Quanergy already gaining significant investment. Z-Senz is aiming to be commercially available by 2018.

Lidar is not the only measuring methodology; Global Positioning System (GPS) solutions have been deployed widely. Two of the finalists of the Xponential Startup Showdown were startups focused on reducing GPS chip sizes and increasing functionality. Inertial Sense has produced a chip the size of a dime that is capable of housing an Inertial Measurement Unit (IMU), Attitude Heading Reference System (AHRS), and GPS-aided Inertial Navigation System (INS). Their website claims that their “advanced algorithms fuse output from MEMs inertial sensors, magnetometers, barometric pressure, and a high-sensitivity GPS (GNSS) receiver to deliver fast, accurate, and reliable attitude, velocity, and position even in the most dynamic environments.” The chips and micro navigation accessories are available on the company’s e-store.

The winner of the Showdown, uAvionix, is a leading developer of avionics for both manned and unmanned flight. The company claims its new transceivers and transponders are “the smallest, and lightest and most affordable on the market” (GPS is already a commodity). uAvionix presented its “Ping Network System that reduces weight on average by 40% as compared to the two-piece installations.” The company also claims barometric altitude precision for the Ping products, with accuracy beyond 80,000 ft.

Paul Beard, CEO of uAvionix, said, “our customers have asked for even smaller and lighter solutions; integrating the transceivers, GPS receivers, GPS antennas, and barometric pressure sensors into a single form factor facilitates easier installation and lowers weight and power draw requirements resulting in a longer usable flight time.”

As I rushed to the airport to catch my manned flight, I felt reenergized about the drone industry, although follies will persist. I mean who wouldn’t want a pool deckchair drone this summer?

This and all other autonomous subjects will be explored at RobotLabNYC’s next event with Dr. Howard Morgan (FirstRound Capital) and Tom Ryden (MassRobotics) – RSVP.

Back to the core of intelligence … to really move to the future

Guest post by José Hernández-Orallo, Professor at Technical University of Valencia

Two decades ago I started working on metrics of machine intelligence. At that time, during the glacial days of the second AI winter, few were really interested in measuring something that AI lacked completely. And very few, such as David L. Dowe and I, were interested in metrics of intelligence linked to algorithmic information theory, where the models of interaction between an agent and the world were sequences of bits, and intelligence was formulated using Solomonoff’s and Wallace’s theories of inductive inference.

In the meantime, seemingly dozens of variants of the Turing test were proposed every year, the CAPTCHAs were introduced and David showed how easy it is to solve some IQ tests using a very simple program based on a big-switch approach. And, today, a new AI spring has arrived, triggered by a blossoming machine learning field, bringing a more experimental approach to AI with an increasing number of AI benchmarks and competitions (see a previous entry in this blog for a survey).

Considering this 20-year perspective, last year was special in many ways. The first in a series of workshops on evaluating general-purpose AI took off, echoing the increasing interest in the assessment of artificial general intelligence (AGI) systems, capable of finding diverse solutions for a range of tasks. Evaluating these systems is different, and more challenging, than the traditional task-oriented evaluation of specific systems, such as a robotic cleaner, a credit scoring model, a machine translator or a self-driving car. The idea of evaluating general-purpose AI systems using videogames had caught on. The arcade learning environment (the Atari 2600 games) or the more flexible Video Game Definition Language and associated competition became increasingly popular for the evaluation of AGI and its recent breakthroughs.

Last year also witnessed the introduction of a different kind of AI evaluation platform, such as Microsoft’s Malmö, GoodAI’s School, OpenAI’s Gym and Universe, DeepMind’s Lab, Facebook’s TorchCraft and CommAI-env. Based on a reinforcement learning (RL) setting, these platforms make it possible to create many different tasks and connect RL agents through a standard interface. Many of these platforms are well suited for the new paradigms in AI, such as deep reinforcement learning and some open-source machine learning libraries. After thousands of episodes or millions of steps against a new task, these systems are able to excel, usually with better-than-human performance.

Despite the myriad applications and breakthroughs that have been derived from this paradigm, there seems to be a consensus in the field that the main open problem lies in how an AI agent can reuse the representations and skills from one task to new ones, making it possible to learn a new task much faster, with a few examples, as humans do. This can be seen as a mapping problem (usually under the term transfer learning) or can be seen as a sequential problem (usually under the terms gradual, cumulative, incremental, continual or curriculum learning).

A key notion associated with a system’s capability to build new concepts and skills on top of previous ones is usually referred to as “compositionality”, which is well documented in humans from early childhood. Systems are able to combine the representations, concepts or skills that have been learned previously in order to solve a new problem. For instance, an agent can combine the ability of climbing up a ladder with its use as a possible way out of a room, or an agent can learn multiplication after learning addition.

In my opinion, two of the previous platforms are better suited for compositionality: Malmö and CommAI-env. Malmö has all the ingredients of a 3D game, and AI researchers can experiment and evaluate agents with vision and 3D navigation, which is what many research papers using Malmö have done so far, as this is a hot topic in AI at the moment. However, to me, the most interesting feature of Malmö is building and crafting, where agents must necessarily combine previous concepts and skills in order to create more complex things.

CommAI-env is clearly an outlier in this set of platforms. It is not a video game in 2D or 3D. Video or audio don’t have any role there. Interaction is just produced through a stream of input/output bits and rewards, which are just +1, 0 or -1. Basically, actions and observations are binary. The rationale behind CommAI-env is to give prominence to communication skills, but it still allows for rich interaction, patterns and tasks, while “keeping all further complexities to a minimum”.

Examples of interaction within the CommAI-mini environment.

When I learned that the General AI Challenge was using CommAI-env for their warm-up round I was ecstatic. Participants could focus on RL agents without the complexities of vision and navigation. Of course, vision and navigation are very important for AI applications, but they create many extra complications if we want to understand (and evaluate) gradual learning. For instance, two equal tasks for which the texture of the walls changes can be seen as requiring higher transfer effort than two slightly different tasks with the same texture. In other words, these would be extra confounding factors that would make the analysis of task transfer and task dependencies much harder. It is then a wise choice to exclude them from the warm-up round. There will be occasions during other rounds of the challenge for including vision, navigation and other sorts of complex embodiment. Starting with a minimal interface to evaluate whether the agents are able to learn incrementally is not only challenging but also addresses an important open problem for general AI.

Also, the warm-up round has modified CommAI-env in such a way that bits are packed into 8-bit (1 byte) characters. This makes the definition of tasks more intuitive and makes the ASCII coding transparent to the agents. Basically, the set of actions and observations is extended to 256. But interestingly, the set of observations and actions is the same, which allows many possibilities that are unusual in reinforcement learning, where these subsets are different. For instance, an agent with primitives such as “copy input to output” and other sequence transformation operators can compose them in order to solve the task. Variables, and other kinds of abstractions, play a key role.
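As a rough illustration of the byte-level interface and of composing sequence primitives, here is a minimal Python sketch. The function names (`pack_bits`, `unpack_bits`, `copy_input`, `reverse`, `compose`) are hypothetical and not part of CommAI-env, and the bit order (most significant bit first) is an assumption.

```python
from functools import reduce
from typing import Callable, Iterable, List

Primitive = Callable[[bytes], bytes]


def pack_bits(bits: Iterable[int]) -> bytes:
    """Pack a bit stream into 8-bit characters (assumed MSB-first order)."""
    bits = list(bits)
    if len(bits) % 8 != 0:
        raise ValueError("bit stream length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for b in bits[i:i + 8]:
            value = (value << 1) | (b & 1)
        out.append(value)
    return bytes(out)


def unpack_bits(data: bytes) -> List[int]:
    """Inverse of pack_bits: expand each byte into its 8 bits."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def copy_input(data: bytes) -> bytes:
    """Primitive: copy input to output unchanged."""
    return data


def reverse(data: bytes) -> bytes:
    """Primitive: reverse the byte sequence."""
    return data[::-1]


def compose(*primitives: Primitive) -> Primitive:
    """Build a new operator that applies the primitives left to right."""
    return lambda data: reduce(lambda acc, p: p(acc), primitives, data)


if __name__ == "__main__":
    print(pack_bits([0, 1, 0, 0, 0, 0, 0, 1]))       # the byte for ASCII 'A'
    print(compose(copy_input, reverse)(b"abc"))
```

Because observations and actions share the same 256-symbol alphabet, the output of one primitive can be fed directly into another, which is what makes this kind of composition possible.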

This might give the impression that we are back to Turing machines and symbolic AI. In a way, this is the case, and very much in line with Turing’s vision in his 1950 paper: “it is possible to teach a machine by punishments and rewards to obey orders given in some language, e.g., a symbolic language”. But in 2017 we have a range of techniques that weren’t available just a few years ago. For instance, Neural Turing Machines and other neural networks with symbolic memory can be very well suited for this problem.

By no means does this indicate that the legion of deep reinforcement learning enthusiasts cannot bring their apparatus to this warm-up round. Indeed they won’t be disappointed by this challenge if they really work hard to adapt deep learning to this problem. They probably won’t need a convolutional network tuned for visual pattern recognition, but there are many possibilities and challenges in how to make deep learning work in a setting like this, especially because the fewer examples, the better, and deep learning usually requires many examples.

As a plus, the simple, symbolic sequential interface opens the challenge to many other areas in AI, not only recurrent neural networks but techniques from natural language processing, evolutionary computation, compression-inspired algorithms or even areas such as inductive programming, with powerful string-handling primitives and its appropriateness for problems with very few examples.

I think that all of the above makes this warm-up round a unique competition. Of course, since we haven’t had anything similar in the past, we might have some surprises. It might happen that an unexpected (or even naïve) technique could behave much better than others (and humans) or perhaps we find that no technique is able to do something meaningful at this time.

I’m eager to see how this round develops and what the participants are able to integrate and invent in order to solve the sequence of micro and mini-tasks. I’m sure that we will learn a lot from this. I hope that machines will, too. And all of us will move forward to the next round!

José Hernández-Orallo is a professor at the Technical University of Valencia and author of “The Measure of All Minds: Evaluating Natural and Artificial Intelligence”, Cambridge University Press, 2017.


Back to the core of intelligence … to really move to the future was originally published in AI Roadmap Institute Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unsolved Problems in AI

Guest post by Simon Andersson, Senior Research Scientist @GoodAI

Executive summary

  • Tracking major unsolved problems in AI can keep us honest about what remains to be achieved and facilitate the creation of roadmaps towards general artificial intelligence.
  • This document currently identifies 29 open problems.
  • For each major problem, example tests are suggested for evaluating research progress.

Introduction

This document identifies open problems in AI. It seeks to provide a concise overview of the greatest challenges in the field and of the current state of the art, in line with the “open research questions” theme of focus of the AI Roadmap Institute.

The challenges are grouped into AI-complete problems, closed-domain problems, and fundamental problems in commonsense reasoning, learning, and sensorimotor ability.

I realize that this first attempt at surveying the open problems will necessarily be incomplete and welcome reader feedback.

To help accelerate the search for general artificial intelligence, GoodAI is organizing the General AI Challenge (GoodAI, 2017), which aims to solve some of the problems outlined below through a series of milestone challenges starting in early 2017.

Sources, method, and related work

The collection of problems presented here is the result of a review of the literature in the areas of

  • Machine learning
  • Machine perception and robotics
  • Open AI problems
  • Evaluation of AI systems
  • Tests for the achievement of human-level intelligence
  • Benchmarks and competitions

To be considered for inclusion, a problem must be

  1. Highly relevant for achieving general artificial intelligence
  2. Closed in scope, not subject to open-ended extension
  3. Testable

Problems vary in scope and often overlap. Some may be contained entirely in others. The second criterion (closed scope) excludes some interesting problems such as learning all human professions; a few problems of this type are mentioned separately from the main list. To ensure that problems are testable, each is presented together with example tests.

Several websites, some listed below, provide challenge problems for AI.

In the context of evaluating AI systems, Hernández-Orallo (2016a) reviews a number of open AI problems. Lake et al. (2016) offer a critique of the current state of the art in AI and discuss problems like intuitive physics, intuitive psychology, and learning from few examples.

A number of challenge problems for AI were proposed in (Brooks, et al., 1996) and (Brachman, 2006).

The challenges

The rest of the document lists AI challenges as outlined below.

  1. AI-complete problems
  2. Closed-domain problems
  3. Commonsense reasoning
  4. Learning
  5. Sensorimotor problems

AI-complete problems

AI-complete problems are ones likely to contain all or most of human-level general artificial intelligence. A few problems in this category are listed below.

  1. Open-domain dialog
  2. Text understanding
  3. Machine translation
  4. Human intelligence and aptitude tests
  5. Coreference resolution (Winograd schemas)
  6. Compound word understanding

Open-domain dialog

Open-domain dialog is the problem of competently conducting a dialog with a human when the subject of the discussion is not known in advance. The challenge includes language understanding, dialog pragmatics, and understanding the world. Versions of the task include spoken and written dialog. The task can be extended to include multimodal interaction (e.g., gestural input, multimedia output). Possible success criteria are usefulness and the ability to conduct dialog indistinguishable from human dialog (“Turing test”).

Tests

Dialog systems are typically evaluated by human judges. Events where this has been done include

  1. The Loebner prize (Loebner, 2016)
  2. The Robo chat challenge (Robo chat challenge, 2014)

Text understanding

Text understanding is an unsolved problem. There has been remarkable progress in the area of question answering, but current systems still fail when common-sense world knowledge, beyond that provided in the text, is required.

Tests

  1. McCarthy (1976) provided an early text understanding challenge problem.
  2. Brachman (2006) suggested the problem of reading a textbook and solving its exercises.

Machine translation

Machine translation is AI-complete since it includes problems requiring an understanding of the world (e.g., coreference resolution, discussed below).

Tests

While translation quality can be evaluated automatically using parallel corpora, the ultimate test is human judgement of quality. Corpora such as the Corpus of Contemporary American English (Davies, 2008) contain samples of text from different genres. Translation quality can be evaluated using samples of

  1. Newspaper text
  2. Fiction
  3. Spoken language transcriptions
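To make the idea of automatic evaluation against a parallel corpus concrete, here is a minimal Python sketch of a clipped n-gram precision between a candidate translation and a reference. It is a simplified overlap measure chosen for illustration, not a full metric such as BLEU, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference, so repeating a correct word does not inflate the score.
    Tokenization is naive whitespace splitting (an assumption).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


if __name__ == "__main__":
    print(clipped_precision("the cat sat on the mat", "the cat is on the mat"))
```

A score of this kind only measures surface overlap with one reference, which is why human judgement remains the ultimate test.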

Intelligence tests

Human intelligence and aptitude tests (Hernández-Orallo, 2017) are interesting in that they are designed to be at the limit of human ability and to be hard or impossible to solve using memorized knowledge. Human-level performance has been reported for Raven’s progressive matrices (Lovett and Forbus, 2017) but artificial systems still lack the general reasoning abilities to deal with a variety of problems at the same time (Hernández-Orallo, 2016b).

Tests

  1. Brachman (2006) suggested using the SAT as an AI challenge problem.

Coreference resolution

The overlapping problems of coreference resolution, pronoun disambiguation, and Winograd schemas require picking out the referents of pronouns or noun phrases.

Tests

  1. Davis (2011) lists 144 Winograd schemas.
  2. Commonsense Reasoning (2016b) lists pronoun disambiguation problems: 62 sample problems and 60 problems used in the first Winograd Schema Challenge, held at IJCAI-16.

Compound word understanding

In many languages, there are compound words with set meanings. Novel compound words can be produced, and we are good at guessing their meaning. We understand that a water bird is a bird that lives near water, not a bird that contains or is constituted by water, and that schadenfreude is felt when others, not we, are hurt.

Tests

  1. “The meaning of noun phrases” at (Commonsense Reasoning, 2015)

Closed-domain problems

Closed-domain problems are ones that combine important elements of intelligence but reduce the difficulty by limiting themselves to a circumscribed knowledge domain. Game playing agents are examples of this and artificial agents have achieved superhuman performance at Go (Silver et al., 2016) and more recently poker (Aupperlee, 2017; Brown and Sandholm, 2017). Among the open problems are:

  1. Learning to play board, card, and tile games from descriptions
  2. Producing programs from descriptions
  3. Source code understanding

Board, card, and tile games from descriptions

Unlike specialized game players, systems that have to learn new games from descriptions of the rules cannot rely on predesigned algorithms for specific games.

Tests

  1. The problem of learning new games from formal-language descriptions has appeared as a challenge at the AAAI conference (Genesereth et al., 2005; AAAI, 2013).
  2. Even more challenging is the problem of learning games from natural language descriptions; such descriptions for card and tile games are available from a number of websites (e.g., McLeod, 2017).

Programs from descriptions

Producing programs in a programming language such as C from natural language input is a problem of obvious practical interest.

Tests

  1. The “Description2Code” challenge proposed at (OpenAI, 2016) has 5000 descriptions for programs collected by Ethan Caballero.

Source code understanding

Related to source code production is source code understanding, where the system can interpret the semantics of code and detect situations where the code differs in non-trivial ways from the likely intention of its author. Allamanis et al. (2016) report progress on the prediction of procedure names.

Tests

  1. The International Obfuscated C Code Contest (OCCC, 2016) publishes code that is intentionally hard to understand. Source code understanding could be tested as the ability to improve the readability of the code as scored by human judges.

Commonsense reasoning

Commonsense reasoning is likely to be a central element of general artificial intelligence. Some of the main problems in this area are listed below.

  1. Causal reasoning
  2. Counterfactual reasoning
  3. Intuitive physics
  4. Intuitive psychology

Causal reasoning

Causal reasoning requires recognizing and applying cause-effect relations.

Tests

  1. “Strength of evidence” at (Commonsense Reasoning, 2015)
  2. “Wolves and rabbits” at (Commonsense Reasoning, 2015)

Counterfactual reasoning

Counterfactual reasoning is required for answering hypothetical questions. It uses causal reasoning together with the system’s other modeling and reasoning capabilities to consider situations possibly different from anything that ever happened in the world.

Tests

  1. “The cruel and unusual Yale shooting problem” at (Commonsense Reasoning, 2015)

Intuitive physics

A basic understanding of the physical world, including object permanence and the ability to predict likely trajectories, helps agents learn faster and make better predictions. This is now a very active research area; some recent work is reported in (Agrawal et al., 2016; Chang et al., 2016; Degrave et al., 2016; Denil et al., 2016; Finn et al., 2016; Fragkiadaki et al., 2016; Hamrick et al., 2016; Li et al., 2016; Mottaghi et al., 2016; Nair et al., 2016; Stewart and Ermon, 2016).

Tests

  1. The “Physical reasoning” section at (Commonsense Reasoning, 2015) (8 problems)
  2. “The handle problem” at (Commonsense Reasoning, 2015)

Intuitive psychology

Intuitive psychology, or theory of mind, allows the agent to understand goals and beliefs and infer them from the behavior of other agents.

Tests

  1. The “Naive psychology” section at (Commonsense Reasoning, 2015) (4 problems)

Learning

Despite remarkable advances in machine learning, important learning-related problems remain mostly unsolved. They include:

  1. Gradual learning
  2. Unsupervised learning
  3. Strong generalization
  4. Category learning from few examples
  5. Learning to learn
  6. Compositional learning
  7. Learning without forgetting
  8. Transfer learning
  9. Knowing when you don’t know
  10. Learning through action

Gradual learning

Humans are capable of lifelong learning of increasingly complex tasks. Artificial agents should be, too. Versions of this idea have been discussed under the rubrics of life-long (Thrun and Mitchell, 1995), continual, and incremental learning. At GoodAI, we have adopted the term gradual learning (Rosa et al., 2016) for the long-term accumulation of knowledge and skills. It requires the combination of several abilities discussed below:

  • Compositional learning
  • Learning to learn
  • Learning without forgetting
  • Transfer learning

Tests

  1. A possible test applies to a household robot that learns household and house maintenance tasks, including obtaining tools and materials for the work. The test evaluates the agent on two criteria: Continuous operation (Nilsson in Brooks, et al., 1996) where the agent needs to function autonomously without reprogramming during its lifetime, and improving capability, where the agent must exhibit, at different points in its evolution, capabilities not present at an earlier time.

Unsupervised learning

Unsupervised learning has been described as the next big challenge in machine learning (LeCun, 2016). It appears to be fundamental to human lifelong learning (supervised and reinforcement signals do not provide nearly enough data) and is closely related to prediction and common-sense reasoning (“filling in the missing parts”). A hard problem (Yoshua Bengio, in the “Brains and bits” panel at NIPS 2016) is unsupervised learning in hierarchical systems, with components learning jointly.

Tests

In addition to the possible tests in the vision domain, speech recognition also presents opportunities for unsupervised learning. While current state-of-the-art speech recognizers rely largely on supervised learning on large corpora, unsupervised recognition requires discovering, without supervision, phonemes, word segmentation, and vocabulary. Progress has been reported in this direction, so far limited to small-vocabulary recognition (Riccardi and Hakkani-Tur, 2003; Park and Glass, 2008; Kamper et al., 2016).

  1. A full-scale test of unsupervised speech recognition could be to train on the audio part of a transcribed speech corpus (e.g., TIMIT (Garofolo, 1993)), then learn to predict the transcriptions with only very sparse supervision.

Strong generalization

Humans can transfer knowledge and skills across situations that share high-level structure but are otherwise radically different, adapting to the particulars of a new setting while preserving the essence of the skill, a capacity that (Tarlow, 2016; Gaunt et al., 2016) refer to as strong generalization. If we learn to clean up a room, we know how to clean up most other rooms.

Tests

  1. A general assembly robot could learn to build a toy castle in one material (e.g., lego blocks) and be tested on building it from other materials (sand, stones, sticks).
  2. A household robot could be trained on cleaning and cooking tasks in one environment and be tested in highly dissimilar environments.

Category learning from few examples

Lake et al. (2015) achieved human-level recognition and generation of characters using few examples. However, learning more complex categories from few examples remains an open problem.

Tests

  1. The ImageNet database (Deng et al., 2009) contains images organized by the semantic hierarchy of WordNet (Miller, 1995). Correctly determining ImageNet categories from images with very little training data could be a challenging test of learning from few examples.

Learning to learn

Learning to learn or meta-learning (e.g., Harlow, 1949; Schmidhuber, 1987; Thrun and Pratt, 1998; Andrychowicz et al., 2016; Chen et al., 2016; de Freitas, 2016; Duan et al., 2016; Lake et al., 2016; Wang et al., 2016) is the acquisition of skills and inductive biases that facilitate future learning. The scenarios considered in particular are ones where a more general and slower learning process produces a faster, more specialized one. An example is biological evolution producing efficient learners such as human beings.

Tests

  1. Learning to play Atari video games is an area that has seen some remarkable recent successes, including in transfer learning (Parisotto et al., 2016). However, there is so far no system that first learns to play video games, then is capable of learning a new game, as humans can, from a few minutes of play (Lake et al., 2016).

Compositional learning

Compositional learning (de Freitas, 2016; Lake et al., 2016) is the ability to recombine primitive representations to accelerate the acquisition of new knowledge. It is closely related to learning to learn.

Tests

Tests for compositional learning need to verify both that the learner is effective and that it uses compositional representations.

  1. Some ImageNet categories correspond to object classes defined largely by their arrangements of component parts, e.g., chairs and stools, or unicycles, bicycles, and tricycles. A test could evaluate the agent’s ability to learn categories with few examples and to report the parts of the object in an image.
  2. Compositional learning should be extremely helpful in learning video games (Lake et al., 2016). A learner could be tested on a game already mastered, but where component elements have changed appearance (e.g., different-looking fish in the Frostbite game). It should be able to play the variant game with little or no additional learning.

Learning without forgetting

In order to learn continually over its lifetime, an agent must be able to generalize over new observations while retaining previously acquired knowledge. Recent progress towards this goal is reported in (Kirkpatrick et al., 2016) and (Li and Hoiem, 2016). Work on memory augmented neural networks (e.g., Graves et al., 2016) is also relevant.

Tests

A test for learning without forgetting needs to present learning tasks sequentially (earlier tasks are not repeated) and test for retention of early knowledge. It may also test for declining learning time for new tasks, to verify that the agent exploits the knowledge acquired so far.

  1. A challenging test for learning without forgetting would be to learn to recognize all the categories in ImageNet, presented sequentially.
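The sequential protocol described above can be sketched as a small evaluation harness in Python. The learner interface (`train`, `predict`), the task representation, and the toy `BoundedMemoryLearner` are all hypothetical, introduced only to make the sketch runnable. It covers retention only and does not measure learning time.

```python
from typing import Dict, List, Protocol, Tuple

Task = List[Tuple[str, str]]  # (input, expected output) pairs


class Learner(Protocol):
    def train(self, task: Task) -> None: ...
    def predict(self, x: str) -> str: ...


def accuracy(learner: Learner, task: Task) -> float:
    return sum(learner.predict(x) == y for x, y in task) / len(task)


def sequential_retention(learner: Learner, tasks: List[Task]) -> List[List[float]]:
    """Train on tasks one at a time, never repeating earlier ones.

    Returns scores where scores[i][j] is the accuracy on task j measured
    right after training on task i (for j <= i).
    """
    scores = []
    for i, task in enumerate(tasks):
        learner.train(task)
        scores.append([accuracy(learner, tasks[j]) for j in range(i + 1)])
    return scores


def forgetting(scores: List[List[float]]) -> List[float]:
    """Per task: accuracy just after learning it minus accuracy at the end."""
    final = scores[-1]
    return [scores[j][j] - final[j] for j in range(len(scores) - 1)]


class BoundedMemoryLearner:
    """Toy learner that keeps only the most recent `capacity` pairs."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.memory: Dict[str, str] = {}

    def train(self, task: Task) -> None:
        for x, y in task:
            self.memory.pop(x, None)
            self.memory[x] = y
            while len(self.memory) > self.capacity:
                self.memory.pop(next(iter(self.memory)))  # evict the oldest pair

    def predict(self, x: str) -> str:
        return self.memory.get(x, "")


if __name__ == "__main__":
    tasks = [[("a", "1"), ("b", "2")], [("c", "3"), ("d", "4")]]
    print(forgetting(sequential_retention(BoundedMemoryLearner(2), tasks)))
```

A forgetting value near zero for every earlier task indicates that early knowledge was retained; a large value indicates catastrophic forgetting.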

Transfer learning

Transfer learning (Pan and Yang, 2010) is the ability of an agent trained in one domain to master another. Results in the area of text comprehension are currently poor unless the agent is given some training on the new domain (Kadlec, et al., 2016).

Tests

Sentiment classification (Blitzer et al., 2007) provides a possible testing ground for transfer learning. Learners can be trained on one corpus, tested on another, and compared to a baseline learner trained directly on the target domain.

  1. Reviews of movies and of businesses are two domains dissimilar enough to make knowledge transfer challenging. Corpora for the domains are Rotten Tomatoes movie reviews (Pang and Lee, 2005) and the Yelp Challenge dataset (Yelp, 2017).
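The comparison described above (train on one corpus, test on another, compare against a learner trained directly on the target domain) can be sketched in Python as follows. The word-count classifier and the tiny corpora are made-up stand-ins for a real learner and for the Rotten Tomatoes and Yelp data; all names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, int]  # (text, label), where 1 = positive and 0 = negative

# Made-up toy data, standing in for real review corpora.
MOVIE_TRAIN: List[Example] = [
    ("a gripping and moving film", 1),
    ("a dull and tedious film", 0),
]
RESTAURANT_TRAIN: List[Example] = [
    ("tasty food and friendly staff", 1),
    ("bland food and rude staff", 0),
]
RESTAURANT_TEST: List[Example] = [
    ("friendly staff and tasty pasta", 1),
    ("rude waiter and bland soup", 0),
]


def train(examples: List[Example]) -> Counter:
    """Give each word +1 per positive text and -1 per negative text."""
    weights: Counter = Counter()
    for text, label in examples:
        for word in text.lower().split():
            weights[word] += 1 if label == 1 else -1
    return weights


def predict(weights: Counter, text: str) -> int:
    score = sum(weights[word] for word in text.lower().split())
    return 1 if score > 0 else 0


def accuracy(weights: Counter, examples: List[Example]) -> float:
    return sum(predict(weights, t) == y for t, y in examples) / len(examples)


def transfer_gap(source_train: List[Example],
                 target_train: List[Example],
                 target_test: List[Example]) -> Tuple[float, float, float]:
    """Return (transferred accuracy, in-domain baseline accuracy, gap)."""
    transferred = accuracy(train(source_train), target_test)
    baseline = accuracy(train(target_train), target_test)
    return transferred, baseline, baseline - transferred


if __name__ == "__main__":
    print(transfer_gap(MOVIE_TRAIN, RESTAURANT_TRAIN, RESTAURANT_TEST))
```

The smaller the gap between the transferred learner and the in-domain baseline, the better the knowledge transfer.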

Knowing when you don’t know

While uncertainty is modeled differently by different learning algorithms, it seems to be true in general that current artificial systems are not nearly as good as humans at “knowing when they don’t know.” One example is deep neural networks that achieve state-of-the-art accuracy on image recognition but assign 99.99% confidence to the presence of objects in images completely unrecognizable to humans (Nguyen et al., 2015).

Human performance on confidence estimation would include

  1. In induction tasks, like program induction or sequence completion, knowing when the provided examples are insufficient for induction (multiple reasonable hypotheses could account for them)
  2. In speech recognition, knowing when an utterance has not been interpreted reliably
  3. In visual tasks such as pedestrian detection, knowing when a part of the image has not been analyzed reliably

Tests

  1. A speech recognizer can be compared against a human baseline, measuring the ratio of the average confidence to the confidence on examples where recognition fails.
  2. The confidence of image recognition systems can be tested on generated adversarial examples.
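The ratio mentioned in the first test could be computed as in the minimal Python sketch below. The function name and the input format (one confidence value and one correctness flag per example) are assumptions, as is the reading that a ratio well above 1 means the system is appropriately less confident when it fails.

```python
from statistics import mean
from typing import Sequence


def confidence_ratio(confidences: Sequence[float],
                     correct: Sequence[bool]) -> float:
    """Average confidence over all examples divided by the average
    confidence over the examples where recognition failed."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    failed = [c for c, ok in zip(confidences, correct) if not ok]
    if not failed:
        raise ValueError("no failed examples to compare against")
    failed_mean = mean(failed)
    if failed_mean == 0:
        raise ValueError("confidence on failed examples is zero")
    return mean(confidences) / failed_mean


if __name__ == "__main__":
    # A system that is less sure of itself when it is wrong.
    print(confidence_ratio([0.9, 0.8, 0.4, 0.3], [True, True, False, False]))
```

A system that assigns the same high confidence to its failures as to its successes would score close to 1, and the same computation applied to human confidence ratings would give the baseline.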

Learning through action

Human infants are known to learn about the world through experiments, observing the effects of their own actions (Smith and Gasser, 2005; Malik, 2015). This seems to apply both to higher-level cognition and perception. Animal experiments have confirmed that the ability to initiate movement is crucial to perceptual development (Held and Hein, 1963) and some recent progress has been made on using motion in learning visual perception (Agrawal et al., 2015). In (Agrawal et al., 2016), a robot learns to predict the effects of a poking action.

“Learning through action” thus encompasses several areas, including

  • Active learning, where the agent selects the training examples most likely to be instructive
  • Undertaking epistemological actions, i.e., activities aimed primarily at gathering information
  • Learning to perceive through action
  • Learning about causal relationships through action

Perhaps most importantly, for artificial systems, learning the causal structure of the world through experimentation is still an open problem.

Tests

For learning through action, it is natural to consider problems of motor manipulation where in addition to the immediate effects of the agent’s actions, secondary effects must be considered as well.

  1. Learning to play billiards: An agent with little prior knowledge and no fixed training data is allowed to explore a real or virtual billiard table and should learn to play billiards well.

Sensorimotor problems

Outstanding problems in robotics and machine perception include:

  1. Autonomous navigation in dynamic environments
  2. Scene analysis
  3. Robust general object recognition and detection
  4. Robust, lifetime simultaneous localization and mapping (SLAM)
  5. Multimodal integration
  6. Adaptive dexterous manipulation

Autonomous navigation

Despite recent progress in self-driving cars by companies like Tesla, Waymo (formerly the Google self-driving car project) and many others, autonomous navigation in highly dynamic environments remains a largely unsolved problem, requiring knowledge of object semantics to reliably predict future scene states (Ess et al., 2010).

Tests

  1. Fully automatic driving in crowded city streets and residential areas is still a challenging test for autonomous navigation.

Scene analysis

The challenge of scene analysis extends far beyond object recognition and includes the understanding of surfaces formed by multiple objects, scene 3D structure, causal relations (Lake et al., 2016), and affordances. It is not limited to vision but can depend on audition, touch, and other modalities, e.g., electroreception and echolocation (Lewicki et al., 2014; Kondo et al., 2017). While progress has been made, e.g., in recognizing anomalous and improbable scenes (Choi et al., 2012), predicting object dynamics (Fouhey and Zitnick, 2014), and discovering object functionality (Yao et al., 2013), we are still far from human-level performance in this area.

Tests

Some possible challenges for understanding the causal structure in visual scenes are:

  1. Recognizing dangerous situations: A corpus of synthetic images could be created where the same objects are recombined to form “dangerous” and “safe” scenes as classified by humans.
  2. Recognizing physically improbable scenes: A synthetic corpus could be created to show physically plausible and implausible scenes containing the same objects.
  3. Recognizing useless objects: Images of useless objects have been created by (Kamprani, 2017).

Object recognition

While object recognition has seen great progress in recent years (e.g., Han et al., 2016), matches or surpasses human performance for many problems (Karpathy, 2014), and can approach perfection in closed environments (Song et al., 2015), state-of-the-art systems still struggle with the harder cases such as open objects (interleaved with background), broken objects, truncation and occlusion in dynamic environments (e.g., Rajaram et al., 2015).

Tests

Environments that are cluttered and contain objects drawn from a large, open-ended, and changing set of types are likely to be challenging for an object recognition system. An example would be

  1. Seeing photos of the insides of pantries and refrigerators and listing the ingredients available to the owners

Simultaneous localization and mapping

While the problem of simultaneous localization and mapping (SLAM) is considered solved for some applications, the challenge of SLAM for long-lived autonomous robots, in large-scale, time-varying environments, remains open (Cadena et al., 2016).

Tests

  1. Lifetime localization and mapping, without detailed maps provided in advance and robust to changes in the environment, for an autonomous car based in a large city

Multimodal integration

The integration of multiple senses (Lahat, 2015) is important, e.g., in human communication (Morency, 2015) and scene understanding (Lewicki et al., 2014; Kondo et al., 2017). Having multiple overlapping sensory systems seems to be essential for enabling human children to educate themselves by perceiving and acting in the world (Smith and Gasser, 2005).

Tests

Spoken communication in noisy environments, where lip reading and gestural cues are indispensable, can provide challenges for multimodal fusion. An example would be

  1. A robot bartender: The agent needs to interpret customer requests in a noisy bar.

Adaptive dexterous manipulation

Current robot manipulators do not come close to the versatility of the human hand (Ciocarlie, 2015). Hard problems include manipulating deformable objects and operating from a mobile platform.

Tests

  1. Taking out clothes from a washing machine and hanging them on clothes lines and coat hangers in varied places while staying out of the way of humans

Open-ended problems

Some noteworthy problems were omitted from the list for being too open-ended in scope: they encompass sets of tasks that evolve over time or can be endlessly extended. This makes it hard to decide whether a problem has been solved. Problems of this type include

  • Enrolling in a human university and taking classes like humans (Goertzel, 2012)
  • Automating all types of human work (Nilsson, 2005)
  • Puzzlehunt challenges, e.g., the annual TMOU game in the Czech Republic (TMOU, 2016)

Conclusion

I have reviewed a number of open problems in an attempt to delineate the current front lines of AI research. The problem list in this first version, as well as the problem descriptions, example tests, and mentions of ongoing work in the research areas, are necessarily incomplete. I plan to extend and improve the document incrementally and warmly welcome suggestions either in the comment section below or at the institute’s discourse forum.

Acknowledgements

I thank Jan Feyereisl, Martin Poliak, Petr Dluhoš, and the rest of the GoodAI team for valuable discussion and suggestions.

References

AAAI. “AAAI-13 International general game playing competition.” Online under http://www.aaai.org/Conferences/AAAI/2013/aaai13games.php (2013)

Agrawal, Pulkit, Joao Carreira, and Jitendra Malik. “Learning to see by moving.” Proceedings of the IEEE International Conference on Computer Vision. 2015.

Agrawal, Pulkit, et al. “Learning to poke by poking: Experiential learning of intuitive physics.” arXiv preprint arXiv:1606.07419 (2016).

AI•ON. “The AI•ON collection of open research problems.” Online under http://ai-on.org/projects (2016)

Allamanis, Miltiadis, Hao Peng, and Charles Sutton. “A convolutional attention network for extreme summarization of source code.” arXiv preprint arXiv:1602.03001 (2016).

Andrychowicz, Marcin, et al. “Learning to learn by gradient descent by gradient descent.” Advances in Neural Information Processing Systems. 2016.

Aupperlee, Aaron. “No bluff: Supercomputer outwits humans in poker rematch.” Online under http://triblive.com/local/allegheny/11865933-74/rematch-aaron-aupperlee (2017)

Blitzer, John, Mark Dredze, and Fernando Pereira. “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification.” ACL. Vol. 7. 2007.

Brachman, Ronald J. “AI more than the sum of its parts.” AI Magazine 27.4 (2006): 19.

Brooks, R., et al. “Challenge problems for artificial intelligence.” Thirteenth National Conference on Artificial Intelligence-AAAI. 1996.

Brown, Noam, and Tuomas Sandholm. “Safe and Nested Endgame Solving for Imperfect-Information Games.” Online under http://www.cs.cmu.edu/~noamb/papers/17-AAAI-Refinement.pdf (2017)

Cadena, Cesar, et al. “Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age.” IEEE Transactions on Robotics 32.6 (2016): 1309–1332.

Chang, Michael B., et al. “A compositional object-based approach to learning physical dynamics.” arXiv preprint arXiv:1612.00341 (2016).

Chen, Yutian, et al. “Learning to Learn for Global Optimization of Black Box Functions.” arXiv preprint arXiv:1611.03824 (2016).

Choi, Myung Jin, Antonio Torralba, and Alan S. Willsky. “Context models and out-of-context objects.” Pattern Recognition Letters 33.7 (2012): 853–862.

Ciocarlie, Matei. “Versatility in Robotic Manipulation: the Long Road to Everywhere.” Online under https://www.youtube.com/watch?v=wiTQ6qOR8o4 (2015)

Commonsense Reasoning. “Commonsense reasoning problem page.” Online under http://commonsensereasoning.org/problem_page.html (2015)

Commonsense Reasoning. “Commonsense reasoning Winograd schema challenge.” Online under http://commonsensereasoning.org/winograd.html (2016a)

Commonsense Reasoning. “Commonsense reasoning pronoun disambiguation problems.” Online under http://commonsensereasoning.org/disambiguation.html (2016b)

Davies, Mark. The corpus of contemporary American English. BYU, Brigham Young University, 2008.

Davis, Ernest. “Collection of Winograd schemas.” Online under http://www.cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCollection.html (2011)

de Freitas, Nando. “Learning to Learn and Compositionality with Deep Recurrent Neural Networks: Learning to Learn and Compositionality.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

Degrave, Jonas, Michiel Hermans, and Joni Dambre. “A Differentiable Physics Engine for Deep Learning in Robotics.” arXiv preprint arXiv:1611.01652 (2016).

Deng, Jia, et al. “Imagenet: A large-scale hierarchical image database.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

Denil, Misha, et al. “Learning to Perform Physics Experiments via Deep Reinforcement Learning.” arXiv preprint arXiv:1611.01843 (2016).

Duan, Yan, et al. “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning.” arXiv preprint arXiv:1611.02779 (2016).

Ess, Andreas, et al. “Object detection and tracking for autonomous navigation in dynamic environments.” The International Journal of Robotics Research 29.14 (2010): 1707–1725.

Finn, Chelsea, and Sergey Levine. “Deep Visual Foresight for Planning Robot Motion.” arXiv preprint arXiv:1610.00696 (2016).

Fouhey, David F., and C. Lawrence Zitnick. “Predicting object dynamics in scenes.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Fragkiadaki, Katerina, et al. “Learning visual predictive models of physics for playing billiards.” arXiv preprint arXiv:1511.07404 (2015).

Garofolo, John, et al. “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.” Web Download. Philadelphia: Linguistic Data Consortium, 1993.

Gaunt, Alexander L., et al. “Terpret: A probabilistic programming language for program induction.” arXiv preprint arXiv:1608.04428 (2016).

Genesereth, Michael, Nathaniel Love, and Barney Pell. “General game playing: Overview of the AAAI competition.” AI magazine 26.2 (2005): 62.

Goertzel, Ben. “What counts as a conscious thinking machine?” Online under https://www.newscientist.com/article/mg21528813.600-what-counts-as-a-conscious-thinking-machine (2012)

GoodAI. “General AI Challenge.” Online under https://www.general-ai-challenge.org/ (2017)

Graves, Alex, et al. “Hybrid computing using a neural network with dynamic external memory.” Nature 538.7626 (2016): 471–476.

Hamrick, Jessica B., et al. “Imagination-Based Decision Making with Physical Models in Deep Neural Networks.” Online under http://phys.csail.mit.edu/papers/5.pdf (2016)

Han, Dongyoon, Jiwhan Kim, and Junmo Kim. “Deep Pyramidal Residual Networks.” arXiv preprint arXiv:1610.02915 (2016).

Harlow, Harry F. “The formation of learning sets.” Psychological review 56.1 (1949): 51.

Held, Richard, and Alan Hein. “Movement-produced stimulation in the development of visually guided behavior.” Journal of comparative and physiological psychology 56.5 (1963): 872.

Hernández-Orallo, José. “Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement.” Artificial Intelligence Review (2016a): 1–51.

Hernández-Orallo, José, et al. “Computer models solving intelligence test problems: progress and implications.” Artificial Intelligence 230 (2016b): 74–107.

Hernández-Orallo, José. “The measure of all minds.” Cambridge University Press, 2017.

IOCCC. “The International Obfuscated C Code Contest.” Online under http://www.ioccc.org (2016)

Kadlec, Rudolf, et al. “Finding a jack-of-all-trades: an examination of semi-supervised learning in reading comprehension.” Under review at ICLR 2017, online under https://openreview.net/pdf?id=rJM69B5xx

Kamper, Herman, Aren Jansen, and Sharon Goldwater. “Unsupervised word segmentation and lexicon discovery using acoustic word embeddings.” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.4 (2016): 669–679.

Kamprani, Katerina. “The uncomfortable.” Online under http://www.kkstudio.gr/#the-uncomfortable (2017)

Karpathy, Andrej. “What I learned from competing against a ConvNet on ImageNet.” Online under http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet (2014)

Kirkpatrick, James, et al. “Overcoming catastrophic forgetting in neural networks.” arXiv preprint arXiv:1612.00796 (2016).

Kondo, H. M., et al. “Auditory and visual scene analysis: an overview.” Philosophical transactions of the Royal Society of London. Series B, Biological sciences 372.1714 (2017).

Lahat, Dana, Tülay Adali, and Christian Jutten. “Multimodal data fusion: an overview of methods, challenges, and prospects.” Proceedings of the IEEE 103.9 (2015): 1449–1477.

Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. “Human-level concept learning through probabilistic program induction.” Science 350.6266 (2015): 1332–1338.

Lake, Brenden M., et al. “Building machines that learn and think like people.” arXiv preprint arXiv:1604.00289 (2016).

LeCun, Yann. “The Next Frontier in AI: Unsupervised Learning.” Online under http://www.ri.cmu.edu/event_detail.html?event_id=1211&&menu_id=242&event_type=seminars (2016)

Lewicki, Michael S., et al. “Scene analysis in the natural environment.” Frontiers in psychology 5 (2014): 199.

Li, Wenbin, Aleš Leonardis, and Mario Fritz. “Visual stability prediction and its application to manipulation.” arXiv preprint arXiv:1609.04861 (2016).

Li, Zhizhong, and Derek Hoiem. “Learning without forgetting.” European Conference on Computer Vision. Springer International Publishing, 2016.

Loebner, Hugh. “Home page of the Loebner prize-the first Turing test.” Online under http://www.loebner.net/Prizef/loebner-prize.html (2016).

Lovett, Andrew, and Kenneth Forbus. “Modeling visual problem solving as analogical reasoning.” Psychological Review 124.1 (2017): 60.

Malik, Jitendra. “The Hilbert Problems of Computer Vision.” Online under https://www.youtube.com/watch?v=QaF2kkez5XU (2015)

McCarthy, John. “An example for natural language understanding and the AI Problems it raises.” Online under http://www-formal.stanford.edu/jmc/mrhug/mrhug.html (1976)

McLeod, John. “Card game rules — card games and tile games from around the world.” Online under https://www.pagat.com (2017)

Miller, George A. “WordNet: a lexical database for English.” Communications of the ACM 38.11 (1995): 39–41.

Mottaghi, Roozbeh, et al. ““What happens if…” Learning to Predict the Effect of Forces in Images.” European Conference on Computer Vision. Springer International Publishing, 2016.

Morency, Louis-Philippe. “Multimodal Machine Learning.” Online under https://www.youtube.com/watch?v=pMb_CIK14lU (2015)

Nair, Ashvin, et al. “Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation.” Online under http://phys.csail.mit.edu/papers/15.pdf (2016)

Nguyen, Anh, Jason Yosinski, and Jeff Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.

Nilsson, Nils J. “Human-level artificial intelligence? Be serious!.” AI magazine 26.4 (2005): 68.

OpenAI. “Requests for research.” Online under https://openai.com/requests-for-research (2016)

Pan, Sinno Jialin, and Qiang Yang. “A survey on transfer learning.” IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345–1359.

Pang, Bo, and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.” Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005.

Parisotto, Emilio, Jimmy Lei Ba, and Ruslan Salakhutdinov. “Actor-mimic: Deep multitask and transfer reinforcement learning.” arXiv preprint arXiv:1511.06342 (2015).

Park, Alex S., and James R. Glass. “Unsupervised pattern discovery in speech.” IEEE Transactions on Audio, Speech, and Language Processing 16.1 (2008): 186–197.

Rajaram, Rakesh Nattoji, Eshed Ohn-Bar, and Mohan M. Trivedi. “An exploration of why and when pedestrian detection fails.” 2015 IEEE 18th International Conference on Intelligent Transportation Systems. IEEE, 2015.

Riccardi, Giuseppe, and Dilek Z. Hakkani-Tür. “Active and unsupervised learning for automatic speech recognition.” Interspeech. 2003.

Robo chat challenge. “Robo chat challenge 2014.” Online under http://www.robochatchallenge.com (2014)

Rosa, Marek, Jan Feyereisl, and The GoodAI Collective. “A Framework for Searching for General Artificial Intelligence.” arXiv preprint arXiv:1611.00685 (2016).

Schmidhuber, Jurgen. “Evolutionary principles in self-referential learning (On learning how to learn: The meta-meta-… hook).” Diploma thesis, Institut f. Informatik, Tech. Univ. Munich (1987).

Silver, David, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016): 484–489.

Smith, Linda, and Michael Gasser. “The development of embodied cognition: Six lessons from babies.” Artificial life 11.1–2 (2005): 13–29.

Song, Shuran, Linguang Zhang, and Jianxiong Xiao. “Robot in a room: Toward perfect object recognition in closed environments.” CoRR (2015).

Stewart, Russell, and Stefano Ermon. “Label-free supervision of neural networks with physics and domain knowledge.” arXiv preprint arXiv:1609.05566 (2016).

Tarlow, Daniel. “In Search of Strong Generalization.” Online under https://uclmr.github.io/nampi/talk_slides/tarlow-nampi.pdf (2016)

Thrun, Sebastian, and Tom M. Mitchell. “Lifelong robot learning.” Robotics and autonomous systems 15.1–2 (1995): 25–46.

Thrun, Sebastian, and Lorien Pratt. “Learning to learn: Introduction and overview.” Learning to learn. Springer US, 1998. 3–17.

TMOU. “Archiv TMOU.” Online under http://www.tmou.cz/archiv/index (2016)

Verschae, Rodrigo, and Javier Ruiz-del-Solar. “Object detection: current and future directions.” Frontiers in Robotics and AI 2 (2015): 29.

Wang, Jane X., et al. “Learning to reinforcement learn.” arXiv preprint arXiv:1611.05763 (2016).

Yao, Bangpeng, Jiayuan Ma, and Li Fei-Fei. “Discovering object functionality.” Proceedings of the IEEE International Conference on Computer Vision. 2013.

Yelp. “The Yelp Dataset Challenge.” Online under https://www.yelp.com/dataset_challenge (2017)


Unsolved Problems in AI was originally published in AI Roadmap Institute Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Roadmap Comparison at GoodAI

Guest post by Martin Stránský, Research Scientist @GoodAI

Figure 1. GoodAI architecture development roadmap comparison

Recent progress in artificial intelligence, especially in the area of deep learning, has been breath-taking. This is very encouraging for anyone interested in the field, yet the true progress towards human-level artificial intelligence is much harder to evaluate.

The evaluation of artificial intelligence is a very difficult problem for a number of reasons. For example, the lack of consensus on the basic desiderata necessary for intelligent machines is one of the primary barriers to the development of unified approaches towards comparing different agents. Despite a number of researchers specifically focusing on this topic (e.g. José Hernández-Orallo or Kristinn R. Thórisson to name a few), the area would benefit from more attention from the AI community.

Methods for evaluating AI are important tools that help to assess the progress of already built agents. The comparison and evaluation of roadmaps and approaches towards building such agents is however less explored. Such comparison is potentially even harder, due to the vagueness and limited formal definitions within such forward-looking plans.

Nevertheless, we believe that in order to steer towards promising areas of research and to identify potential dead-ends, we need to be able to meaningfully compare existing roadmaps. Such comparison requires a framework that defines how to acquire important and comparable information from the existing documents outlining each roadmap. Without such a unified framework, roadmaps may differ not only in their targets (e.g. general AI, human-level AI, conversational AI, etc.) but also in their approaches to reaching those targets, which can make them impossible to compare and contrast.

This post offers a glimpse of how we, at GoodAI, are starting to look at this problem internally (comparing the progress of our three architecture teams), and how this might scale to comparisons across the wider community. This is still very much a work in progress, but we believe it might be beneficial to share these initial thoughts with the community, to start a discussion about what we believe is an important topic.

Overview

In the first part of this article, a comparison of three GoodAI architecture development roadmaps is presented and a technique for comparing them is discussed. The main purpose is to estimate the potential and completeness of plans for every architecture to be able to direct our effort to the most promising one.

To manage adding roadmaps from other teams we have developed a general plan of human-level AI development called a meta-roadmap. This meta-roadmap consists of 10 steps which must be passed in order to reach an ‘ultimate’ target. We hope that most of the potentially disparate plans solve one or more problems identified in the meta-roadmap.

Next, we tried to compare our approaches with that of Mikolov et al. by assigning the current documents and open tasks to problems in the meta-roadmap. We found this useful, as it showed us what is comparable and that each problem needs a different technique of comparison.

Architecture development plans comparison

Three teams from GoodAI have been working on their architectures for a few months. Now we need a method to measure the potential of the architectures to be able to, for example, direct our effort more efficiently by allocating more resources to the team with the highest potential. We know that determining which way is the most promising based on the current state is still not possible, so we asked the teams working on unfinished architectures to create plans for future development, i.e. to create their roadmaps.

Based on the provided responses, we have iteratively unified requirements for those plans. After numerous discussions, we came up with the following structure:

  • A unit of a plan is called a milestone and describes some piece of work on a part of the architecture (e.g. a new module, a different structure, an improvement of a module by adding functionality, tuning parameters, etc.)
  • Each milestone contains a time estimate (the expected time spent on the milestone, assuming the current team size), a characterization of the work or new features, and a test of the new features.
  • A plan can be interrupted by checkpoints which serve as common tests for two or more architectures.
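The plan structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class and field names, the choice of weeks as the time unit, and the example milestones are hypothetical and are not taken from the actual GoodAI roadmaps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Milestone:
    """One unit of a plan: a piece of work on part of the architecture."""
    name: str
    time_estimate_weeks: float  # expected time, assuming current team size (unit is an assumption)
    characteristic: str         # description of the work or new features
    test: str                   # how the new features will be tested


@dataclass
class Checkpoint:
    """A common test shared by two or more architectures."""
    name: str
    after_milestone: int  # index of the milestone this checkpoint follows


@dataclass
class Roadmap:
    architecture: str
    milestones: List[Milestone] = field(default_factory=list)
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def total_weeks(self) -> float:
        """Sum of all milestone time estimates."""
        return sum(m.time_estimate_weeks for m in self.milestones)

    def weeks_until(self, checkpoint_name: str) -> float:
        """Estimated time needed before the named checkpoint can be attempted."""
        cp = next(c for c in self.checkpoints if c.name == checkpoint_name)
        done = self.milestones[: cp.after_milestone + 1]
        return sum(m.time_estimate_weeks for m in done)


# Hypothetical example plan, for illustration only.
plan_a = Roadmap(
    architecture="architecture-A",
    milestones=[
        Milestone("memory module", 6, "adds a new module", "recall test"),
        Milestone("parameter tuning", 4, "tunes existing modules", "benchmark rerun"),
    ],
    checkpoints=[Checkpoint("shared test 1", after_milestone=0)],
)
print(plan_a.total_weeks(), plan_a.weeks_until("shared test 1"))
```

Keeping checkpoints separate from milestones mirrors the text: milestones belong to one team's plan, while a checkpoint with the same name can appear in several roadmaps and be used to line them up.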

Now we have a set of basic tools to monitor progress:

  • We will see whether a particular team will achieve their self-designed tests and thereby can fulfill their original expectations on schedule.
  • Due to checkpoints it is possible to compare architectures in the middle of development.
  • We can see how far a team sees. Ideally after finishing the last milestone, the architecture should be prepared to pass through a curriculum (which will be developed in the meantime) and a final test afterwards.
  • Total time estimates. We can compare them as well.
  • We are still working on a unified set (among GoodAI architectures) of features which we will require from an architecture (desiderata for an architecture).

The particular plans were placed side by side (cf. Figure 1) and a few checkpoints were defined, for now only vaguely. As we can see, the teams have rough plans of their work for more than one year ahead, but the plans are not complete, in the sense that the architectures will not be ready for any curriculum. Two architectures use a connectionist approach and are easy to compare. The third, OMANN, manipulates symbols, so from the beginning it can perform tasks that are hard for the other two architectures, and vice versa. This means that no checkpoints for OMANN have been defined yet. We see the lack of common tests as a serious issue with the plan and are looking for changes that would make the architecture more comparable with the others, although this may cause some delays in development.

There was an effort to include another architecture in the comparison, but we have not been able to find a document describing future work in such detail, with the exception of the paper by Weston et al. After further analysis, we determined that the paper was focused on a slightly different problem than the development of an architecture. We will address this later in the post.

Assumptions for a common approach

We would like to take a look at the problem from the perspective of the unavoidable steps required to develop an intelligent agent. First we must make a few assumptions about the whole process. We realize that these are somewhat vague — we want to make them acceptable to other AI researchers.

  1. The target is to produce software (referred to as an architecture) that can be part of some agent in some world.
  2. In the world there will be tasks that the agent should solve, or a reward based on world states that the agent should seek.
  3. An intelligent agent can adapt to an unknown/changing environment and solve previously unseen tasks.
  4. To check whether the ultimate goal was reached (no matter how defined), every approach needs some well-defined final test, which shows how intelligent the agent is (preferably compared to humans).

Before the agent is able to pass its final test, there must be a learning phase in order to teach the agent all necessary skills or abilities. If there is a possibility that the agent can pass the final test without learning anything, the final test is insufficient with respect to point 3. The description of the learning phase (which can also include a description of the world) is called a curriculum.

Meta-roadmap

Using the above assumptions (and a few more obvious ones which we won’t enumerate here) we derive Figure 2 describing the list of necessary steps and their order. We call this diagram a meta-roadmap.

Figure 2. Overview of a meta-roadmap

The most important and imminent tasks in the diagram are

  • The definition of an ultimate target,
  • A final test specification,
  • The proposed design of a curriculum, and
  • A roadmap for the development of an architecture.

We think that the majority of current approaches solve one or more of these open problems, each from a different point of view depending on the ultimate target and the beliefs of the authors. To make this clearer, we will divide the approaches described in published papers into groups according to the problem they solve, and compare them within those groups. Of course, approaches are hard to compare across groups (yet it is not impossible; for example, a final test can be comparable to a curriculum under specific circumstances). Even within one group, comparison can be very hard when the requirements (which are the first thing that should be defined according to our diagram) differ significantly.

Also an analysis of complexity and completeness of an approach can be made within this framework. For example, if a team omits one or more of the open problems, it indicates that the team may not have considered that particular issue and are proceeding without a complete notion of the ‘big picture’.
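A completeness check of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration: it uses only the four tasks named in the text rather than all ten steps of the meta-roadmap, and the function names, the scoring rule, and the example approach are hypothetical.

```python
# The four tasks highlighted in the text; the full meta-roadmap has more steps.
META_ROADMAP_PROBLEMS = (
    "ultimate target",
    "final test",
    "curriculum",
    "architecture roadmap",
)


def missing_problems(addressed):
    """Return the meta-roadmap problems that an approach does not address."""
    addressed = set(addressed)
    unknown = addressed - set(META_ROADMAP_PROBLEMS)
    if unknown:
        raise ValueError(f"not a meta-roadmap problem: {sorted(unknown)}")
    return [p for p in META_ROADMAP_PROBLEMS if p not in addressed]


def completeness(addressed):
    """Fraction of the listed problems that an approach addresses (a crude, assumed score)."""
    missing = missing_problems(addressed)
    return 1 - len(missing) / len(META_ROADMAP_PROBLEMS)


# Hypothetical approach that defines a target and a curriculum only.
example = {"ultimate target", "curriculum"}
print(missing_problems(example))  # the omitted problems, in meta-roadmap order
print(completeness(example))
```

A single fraction hides which problems were skipped, so the list returned by `missing_problems` is likely the more useful output when discussing an approach with its authors.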

Problem assignment

We would like to show an attempt to assign approaches to problems and compare them. First, we have analyzed GoodAI’s and Mikolov/Weston’s approach as the latter is well described. You can see the result in Figure 3 below.

Figure 3. Meta-roadmap with incorporated desiderata for different roadmaps

As the diagram suggests, we work on a few common problems. We will not provide the full analysis here, but will make several observations to demonstrate the meaningfulness of the meta-roadmap. In the desiderata, according to Mikolov’s “A Roadmap towards Machine Intelligence”, the target is an agent that can understand human language. In contrast with the GoodAI approach, modalities other than text are not considered important. In the curriculum, GoodAI wants to teach an agent in a more anthropocentric way (visual input first, language later), while Weston’s curriculum consists entirely of language-oriented tasks.

Mikolov et al. do not provide a development plan for their architecture, so we can compare their curriculum roadmap to ours, but it is not possible to include their desiderata into the diagram in Figure 1.

Conclusion

We have presented our meta-roadmap and a comparison of three GoodAI development roadmaps. We hope that this post will offer a glimpse into how we started this process at GoodAI and will invigorate a discussion on how this could be improved and scaled beyond internal comparisons. We will be glad to receive any feedback — the generality of our meta-roadmap should be discussed further, as well as our methods for estimating roadmap completeness and their potential to achieve human-level AI.


Roadmap Comparison at GoodAI was originally published in AI Roadmap Institute Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
