Archive 20.02.2017

Back to the core of intelligence … to really move to the future

Guest post by José Hernández-Orallo, Professor at Technical University of Valencia

Two decades ago I started working on metrics of machine intelligence. At that time, during the glacial days of the second AI winter, few were really interested in measuring something that AI lacked completely. And very few, such as David L. Dowe and I, were interested in metrics of intelligence linked to algorithmic information theory, where the models of interaction between an agent and the world were sequences of bits, and intelligence was formulated using Solomonoff’s and Wallace’s theories of inductive inference.

In the meantime, seemingly dozens of variants of the Turing test were proposed every year, CAPTCHAs were introduced, and David showed how easy it is to solve some IQ tests using a very simple program based on a big-switch approach. And, today, a new AI spring has arrived, triggered by a blossoming machine learning field, bringing a more experimental approach to AI with an increasing number of AI benchmarks and competitions (see a previous entry in this blog for a survey).

Considering this 20-year perspective, last year was special in many ways. The first in a series of workshops on evaluating general-purpose AI took off, echoing the increasing interest in the assessment of artificial general intelligence (AGI) systems, capable of finding diverse solutions for a range of tasks. Evaluating these systems is different from, and more challenging than, the traditional task-oriented evaluation of specific systems, such as a robotic cleaner, a credit scoring model, a machine translator or a self-driving car. The idea of evaluating general-purpose AI systems using video games had caught on. The Arcade Learning Environment (the Atari 2600 games) and the more flexible Video Game Description Language, with its associated competition, became increasingly popular for the evaluation of AGI and its recent breakthroughs.

Last year also witnessed the introduction of a different kind of AI evaluation platform, such as Microsoft’s Malmö, GoodAI’s School, OpenAI’s Gym and Universe, DeepMind’s Lab, Facebook’s TorchCraft and CommAI-env. Based on a reinforcement learning (RL) setting, these platforms make it possible to create many different tasks and connect RL agents through a standard interface. Many of these platforms are well suited for the new paradigms in AI, such as deep reinforcement learning and some open-source machine learning libraries. After thousands of episodes or millions of steps on a new task, these systems are able to excel, usually with better-than-human performance.

Despite the myriad applications and breakthroughs that have been derived from this paradigm, there seems to be a consensus in the field that the main open problem lies in how an AI agent can reuse the representations and skills from one task in new ones, making it possible to learn a new task much faster, from a few examples, as humans do. This can be seen as a mapping problem (usually under the term transfer learning) or as a sequential problem (usually under the terms gradual, cumulative, incremental, continual or curriculum learning).

One of the key notions associated with a system’s capability to build new concepts and skills on top of previous ones is usually referred to as “compositionality”, which is well documented in humans from early childhood. Systems are able to combine the representations, concepts or skills that have been learned previously in order to solve a new problem. For instance, an agent can combine the ability to climb a ladder with its use as a possible way out of a room, or an agent can learn multiplication after learning addition.

In my opinion, two of the previous platforms are better suited for compositionality: Malmö and CommAI-env. Malmö has all the ingredients of a 3D game, and AI researchers can experiment with and evaluate agents with vision and 3D navigation, which is what many research papers using Malmö have done so far, as this is a hot topic in AI at the moment. However, to me, the most interesting feature of Malmö is building and crafting, where agents must necessarily combine previous concepts and skills in order to create more complex things.

CommAI-env is clearly an outlier in this set of platforms. It is not a video game in 2D or 3D. Video and audio play no role there. Interaction takes place solely through a stream of input/output bits and rewards, which are just +1, 0 or -1. Basically, actions and observations are binary. The rationale behind CommAI-env is to give prominence to communication skills, but it still allows for rich interaction, patterns and tasks, while “keeping all further complexities to a minimum”.
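To make the bit-level setting concrete, here is a minimal sketch of such an interaction loop. It does not use the CommAI-env API: the task (`echo_task_reward`), the loop (`run_episode`) and the two agents are hypothetical names invented for this illustration, and the toy task never emits the neutral reward 0.

```python
import random
from typing import Callable

# Hypothetical toy task, not part of CommAI-env: the agent is rewarded
# with +1 for echoing the bit it has just observed and -1 otherwise.
def echo_task_reward(observed_bit: int, action_bit: int) -> int:
    return 1 if action_bit == observed_bit else -1


def run_episode(agent: Callable[[int], int], steps: int, seed: int = 0) -> int:
    """Feed the agent one bit per step and sum the rewards it receives."""
    rng = random.Random(seed)
    total_reward = 0
    for _ in range(steps):
        observation = rng.randint(0, 1)   # one input bit
        action = agent(observation)       # one output bit
        total_reward += echo_task_reward(observation, action)
    return total_reward


def copy_agent(bit: int) -> int:
    """Always repeats the observed bit."""
    return bit


def inverting_agent(bit: int) -> int:
    """Always outputs the opposite of the observed bit."""
    return 1 - bit


if __name__ == "__main__":
    print(run_episode(copy_agent, steps=10))
    print(run_episode(inverting_agent, steps=10))
```

The point of the sketch is only that the whole interface is a bit in, a bit out and a scalar reward; everything else about a task has to be discovered from that stream.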

Examples of interaction within the CommAI-mini environment.

When I learned that the General AI Challenge was using CommAI-env for its warm-up round I was ecstatic. Participants could focus on RL agents without the complexities of vision and navigation. Of course, vision and navigation are very important for AI applications, but they create many extra complications if we want to understand (and evaluate) gradual learning. For instance, two equal tasks for which the texture of the walls changes can be seen as requiring higher transfer effort than two slightly different tasks with the same texture. In other words, these would be extra confounding factors that would make the analysis of task transfer and task dependencies much harder. It is then a wise choice to exclude them from the warm-up round. There will be occasions during other rounds of the challenge for including vision, navigation and other sorts of complex embodiment. Starting with a minimal interface to evaluate whether the agents are able to learn incrementally is not only a challenging but also an important open problem for general AI.

Also, the warm-up round has modified CommAI-env in such a way that bits are packed into 8-bit (1 byte) characters. This makes the definition of tasks more intuitive and makes the ASCII coding transparent to the agents. Basically, the sets of actions and observations are each extended to 256 symbols. Interestingly, the set of observations and the set of actions are the same, which allows many possibilities that are unusual in reinforcement learning, where the two sets usually differ. For instance, an agent with primitives such as “copy input to output” and other sequence transformation operators can compose them in order to solve the task. Variables, and other kinds of abstractions, play a key role.
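The two ideas in this paragraph, packing bits into bytes and composing sequence primitives, can be sketched in a few lines. This is an illustration under stated assumptions, not the challenge's code: the most-significant-bit-first ordering is assumed, and `pack_bits`, `unpack_bytes`, `compose` and the three primitives are hypothetical names.

```python
from typing import Callable, List

def pack_bits(bits: List[int]) -> bytes:
    """Pack a bit list into 8-bit characters (assumed MSB-first order)."""
    if len(bits) % 8 != 0:
        raise ValueError("bit stream length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def unpack_bytes(data: bytes) -> List[int]:
    """Inverse of pack_bits: expand each byte into 8 bits, MSB first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


# Because observations and actions share one alphabet, a primitive is
# simply a function from a byte string to a byte string.
Primitive = Callable[[bytes], bytes]


def compose(*primitives: Primitive) -> Primitive:
    """Chain primitives left to right into a single new primitive."""
    def composed(data: bytes) -> bytes:
        for primitive in primitives:
            data = primitive(data)
        return data
    return composed


def copy_input(data: bytes) -> bytes:
    return data


def reverse(data: bytes) -> bytes:
    return data[::-1]


def uppercase(data: bytes) -> bytes:
    return data.upper()


if __name__ == "__main__":
    shout_backwards = compose(copy_input, reverse, uppercase)
    print(shout_backwards(b"abc"))
```

Since every primitive maps the shared alphabet to itself, any composition is again a valid primitive, which is what makes building new skills out of old ones so natural in this setting.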

This might give the impression that we are back to Turing machines and symbolic AI. In a way, this is the case, and very much in line with Turing’s vision in his 1950 paper: “it is possible to teach a machine by punishments and rewards to obey orders given in some language, e.g., a symbolic language”. But in 2017 we have a range of techniques that weren’t available just a few years ago. For instance, Neural Turing Machines and other neural networks with symbolic memory can be very well suited for this problem.

By no means does this indicate that the legion of deep reinforcement learning enthusiasts cannot bring their apparatus to this warm-up round. Indeed, they won’t be disappointed by this challenge if they really work hard to adapt deep learning to this problem. They probably won’t need a convolutional network tuned for visual pattern recognition, but there are many possibilities and challenges in how to make deep learning work in a setting like this, especially because the fewer examples, the better, and deep learning usually requires many examples.

As a plus, the simple, symbolic sequential interface opens the challenge to many other areas in AI, not only recurrent neural networks but also techniques from natural language processing, evolutionary computation, compression-inspired algorithms or even areas such as inductive programming, with its powerful string-handling primitives and its suitability for problems with very few examples.

I think that all of the above makes this warm-up round a unique competition. Of course, since we haven’t had anything similar in the past, we might have some surprises. It might happen that an unexpected (or even naïve) technique could behave much better than others (and humans) or perhaps we find that no technique is able to do something meaningful at this time.

I’m eager to see how this round develops and what the participants are able to integrate and invent in order to solve the sequence of micro and mini-tasks. I’m sure that we will learn a lot from this. I hope that machines will, too. And all of us will move forward to the next round!

José Hernández-Orallo is a professor at Technical University of Valencia and author of “The Measure of All Minds: Evaluating Natural and Artificial Intelligence”, Cambridge University Press, 2017.

“Back to the core of intelligence … to really move to the future” was originally published in AI Roadmap Institute Blog on Medium.

Unsolved Problems in AI

Guest post by Simon Andersson, Senior Research Scientist @GoodAI

Executive summary

  • Tracking major unsolved problems in AI can keep us honest about what remains to be achieved and facilitate the creation of roadmaps towards general artificial intelligence.
  • This document currently identifies 29 open problems.
  • For each major problem, example tests are suggested for evaluating research progress.


This document identifies open problems in AI. It seeks to provide a concise overview of the greatest challenges in the field and of the current state of the art, in line with the AI Roadmap Institute’s focus on “open research questions”.

The challenges are grouped into AI-complete problems, closed-domain problems, and fundamental problems in commonsense reasoning, learning, and sensorimotor ability.

I realize that this first attempt at surveying the open problems will necessarily be incomplete and welcome reader feedback.

To help accelerate the search for general artificial intelligence, GoodAI is organizing the General AI Challenge (GoodAI, 2017), which aims to solve some of the problems outlined below through a series of milestone challenges starting in early 2017.

Sources, method, and related work

The collection of problems presented here is the result of a review of the literature in the areas of

  • Machine learning
  • Machine perception and robotics
  • Open AI problems
  • Evaluation of AI systems
  • Tests for the achievement of human-level intelligence
  • Benchmarks and competitions

To be considered for inclusion, a problem must be

  1. Highly relevant for achieving general artificial intelligence
  2. Closed in scope, not subject to open-ended extension
  3. Testable

Problems vary in scope and often overlap. Some may be contained entirely in others. The second criterion (closed scope) excludes some interesting problems such as learning all human professions; a few problems of this type are mentioned separately from the main list. To ensure that problems are testable, each is presented together with example tests.

Several websites, some listed below, provide challenge problems for AI.

In the context of evaluating AI systems, Hernández-Orallo (2016a) reviews a number of open AI problems. Lake et al. (2016) offer a critique of the current state of the art in AI and discuss problems like intuitive physics, intuitive psychology, and learning from few examples.

A number of challenge problems for AI were proposed in (Brooks et al., 1996) and (Brachman, 2006).

The challenges

The rest of the document lists AI challenges as outlined below.

  1. AI-complete problems
  2. Closed-domain problems
  3. Commonsense reasoning
  4. Learning
  5. Sensorimotor problems

AI-complete problems

AI-complete problems are ones whose solution is likely to require all or most of human-level general artificial intelligence. A few problems in this category are listed below.

  1. Open-domain dialog
  2. Text understanding
  3. Machine translation
  4. Human intelligence and aptitude tests
  5. Coreference resolution (Winograd schemas)
  6. Compound word understanding

Open-domain dialog

Open-domain dialog is the problem of competently conducting a dialog with a human when the subject of the discussion is not known in advance. The challenge includes language understanding, dialog pragmatics, and understanding the world. Versions of the task include spoken and written dialog. The task can be extended to include multimodal interaction (e.g., gestural input, multimedia output). Possible success criteria are usefulness and the ability to conduct dialog indistinguishable from human dialog (“Turing test”).


Dialog systems are typically evaluated by human judges. Events where this has been done include

  1. The Loebner prize (Loebner, 2016)
  2. The Robo chat challenge (Robo chat challenge, 2014)

Text understanding

Text understanding is an unsolved problem. There has been remarkable progress in the area of question answering, but current systems still fail when common-sense world knowledge, beyond that provided in the text, is required.


  1. McCarthy (1976) provided an early text understanding challenge problem.
  2. Brachman (2006) suggested the problem of reading a textbook and solving its exercises.

Machine translation

Machine translation is AI-complete since it includes problems requiring an understanding of the world (e.g., coreference resolution, discussed below).


While translation quality can be evaluated automatically using parallel corpora, the ultimate test is human judgement of quality. Corpora such as the Corpus of Contemporary American English (Davies, 2008) contain samples of text from different genres. Translation quality can be evaluated using samples of

  1. Newspaper text
  2. Fiction
  3. Spoken language transcriptions
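As a rough illustration of automatic evaluation against a parallel corpus, the sketch below computes clipped n-gram precision between a candidate translation and a single reference. This is a simplification in the spirit of n-gram overlap metrics such as BLEU, not a full metric (there is no brevity penalty and no combination of n-gram orders), and the function names are chosen here for illustration only.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference, so repeating a correct word does not help.
    Tokenization is plain whitespace splitting (an assumption).
    """
    candidate_counts = ngrams(candidate.split(), n)
    reference_counts = ngrams(reference.split(), n)
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return matched / total


if __name__ == "__main__":
    print(clipped_precision("the cat sat on", "the cat sat", n=2))
```

Scores of this kind are cheap to compute across genres, which is why they are useful for tracking progress, but they only approximate the human judgement that remains the ultimate test.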

Intelligence tests

Human intelligence and aptitude tests (Hernández-Orallo, 2017) are interesting in that they are designed to be at the limit of human ability and to be hard or impossible to solve using memorized knowledge. Human-level performance has been reported for Raven’s progressive matrices (Lovett and Forbus, 2017) but artificial systems still lack the general reasoning abilities to deal with a variety of problems at the same time (Hernández-Orallo, 2016b).


  1. Brachman (2006) suggested using the SAT as an AI challenge problem.

Coreference resolution

The overlapping problems of coreference resolution, pronoun disambiguation, and Winograd schemas require picking out the referents of pronouns or noun phrases.


  1. Davis (2011) lists 144 Winograd schemas.
  2. Commonsense Reasoning (2016b) lists pronoun disambiguation problems: 62 sample problems and 60 problems used in the first Winograd Schema Challenge, held at IJCAI-16.

Compound word understanding

In many languages, there are compound words with set meanings. Novel compound words can be produced, and we are good at guessing their meaning. We understand that a water bird is a bird that lives near water, not a bird that contains or is constituted by water, and that schadenfreude is felt when others, not we, are hurt.


  1. “The meaning of noun phrases” at (Commonsense Reasoning, 2015)

Closed-domain problems

Closed-domain problems are ones that combine important elements of intelligence but reduce the difficulty by limiting themselves to a circumscribed knowledge domain. Game-playing agents are examples of this, and artificial agents have achieved superhuman performance at Go (Silver et al., 2016) and more recently poker (Aupperlee, 2017; Brown and Sandholm, 2017). Among the open problems are:

  1. Learning to play board, card, and tile games from descriptions
  2. Producing programs from descriptions
  3. Source code understanding

Board, card, and tile games from descriptions

Unlike specialized game players, systems that have to learn new games from descriptions of the rules cannot rely on predesigned algorithms for specific games.


  1. The problem of learning new games from formal-language descriptions has appeared as a challenge at the AAAI conference (Genesereth et al., 2005; AAAI, 2013).
  2. Even more challenging is the problem of learning games from natural language descriptions; such descriptions for card and tile games are available from a number of websites (e.g., McLeod, 2017).

Programs from descriptions

Producing programs in a programming language such as C from natural language input is a problem of obvious practical interest.


  1. The “Description2Code” challenge proposed at (OpenAI, 2016) has 5000 descriptions for programs collected by Ethan Caballero.

Source code understanding

Related to source code production is source code understanding, where the system can interpret the semantics of code and detect situations where the code differs in non-trivial ways from the likely intention of its author. Allamanis et al. (2016) report progress on the prediction of procedure names.


  1. The International Obfuscated C Code Contest (IOCCC, 2016) publishes code that is intentionally hard to understand. Source code understanding could be tested as the ability to improve the readability of the code as scored by human judges.

Commonsense reasoning

Commonsense reasoning is likely to be a central element of general artificial intelligence. Some of the main problems in this area are listed below.

  1. Causal reasoning
  2. Counterfactual reasoning
  3. Intuitive physics
  4. Intuitive psychology

Causal reasoning

Causal reasoning requires recognizing and applying cause-effect relations.


  1. “Strength of evidence” at (Commonsense Reasoning, 2015)
  2. “Wolves and rabbits” at (Commonsense Reasoning, 2015)

Counterfactual reasoning

Counterfactual reasoning is required for answering hypothetical questions. It uses causal reasoning together with the system’s other modeling and reasoning capabilities to consider situations possibly different from anything that ever happened in the world.


  1. “The cruel and unusual Yale shooting problem” at (Commonsense Reasoning, 2015)

Intuitive physics

A basic understanding of the physical world, including object permanence and the ability to predict likely trajectories, helps agents learn faster and make better predictions. This is now a very active research area; some recent work is reported in (Agrawal et al., 2016; Chang et al., 2016; Degrave et al., 2016; Denil et al., 2016; Finn et al., 2016; Fragkiadaki et al., 2016; Hamrick et al., 2016; Li et al., 2016; Mottaghi et al., 2016; Nair et al., 2016; Stewart and Ermon, 2016).


  1. The “Physical reasoning” section at (Commonsense Reasoning, 2015) (8 problems)
  2. “The handle problem” at (Commonsense Reasoning, 2015)

Intuitive psychology

Intuitive psychology, or theory of mind, allows the agent to understand goals and beliefs and infer them from the behavior of other agents.


  1. The “Naive psychology” section at (Commonsense Reasoning, 2015) (4 problems)


Learning

Despite remarkable advances in machine learning, important learning-related problems remain mostly unsolved. They include:

  1. Gradual learning
  2. Unsupervised learning
  3. Strong generalization
  4. Category learning from few examples
  5. Learning to learn
  6. Compositional learning
  7. Learning without forgetting
  8. Transfer learning
  9. Knowing when you don’t know
  10. Learning through action

Gradual learning

Humans are capable of lifelong learning of increasingly complex tasks. Artificial agents should be, too. Versions of this idea have been discussed under the rubrics of life-long (Thrun and Mitchell, 1995), continual, and incremental learning. At GoodAI, we have adopted the term gradual learning (Rosa et al., 2016) for the long-term accumulation of knowledge and skills. It requires the combination of several abilities discussed below:

  • Compositional learning
  • Learning to learn
  • Learning without forgetting
  • Transfer learning


  1. A possible test applies to a household robot that learns household and house maintenance tasks, including obtaining tools and materials for the work. The test evaluates the agent on two criteria: continuous operation (Nilsson in Brooks et al., 1996), where the agent needs to function autonomously without reprogramming during its lifetime, and improving capability, where the agent must exhibit, at different points in its evolution, capabilities not present at an earlier time.

Unsupervised learning

Unsupervised learning has been described as the next big challenge in machine learning (LeCun, 2016). It appears to be fundamental to human lifelong learning (supervised and reinforcement signals do not provide nearly enough data) and is closely related to prediction and common-sense reasoning (“filling in the missing parts”). A hard problem (Yoshua Bengio, in the “Brains and bits” panel at NIPS 2016) is unsupervised learning in hierarchical systems, with components learning jointly.


In addition to the possible tests in the vision domain, speech recognition also presents opportunities for unsupervised learning. While current state-of-the-art speech recognizers rely largely on supervised learning on large corpora, unsupervised recognition requires discovering, without supervision, phonemes, word segmentation, and vocabulary. Progress has been reported in this direction, so far limited to small-vocabulary recognition (Riccardi and Hakkani-Tur, 2003; Park and Glass, 2008; Kamper et al., 2016).

  1. A full-scale test of unsupervised speech recognition could be to train on the audio part of a transcribed speech corpus (e.g., TIMIT (Garofolo, 1993)), then learn to predict the transcriptions with only very sparse supervision.

Strong generalization

Humans can transfer knowledge and skills across situations that share high-level structure but are otherwise radically different, adapting to the particulars of a new setting while preserving the essence of the skill, a capacity that Tarlow (2016) and Gaunt et al. (2016) refer to as strong generalization. If we learn to clean up a room, we know how to clean up most other rooms.


  1. A general assembly robot could learn to build a toy castle in one material (e.g., Lego blocks) and be tested on building it from other materials (sand, stones, sticks).
  2. A household robot could be trained on cleaning and cooking tasks in one environment and be tested in highly dissimilar environments.

Category learning from few examples

Lake et al. (2015) achieved human-level recognition and generation of characters using few examples. However, learning more complex categories from few examples remains an open problem.


  1. The ImageNet database (Deng et al., 2009) contains images organized by the semantic hierarchy of WordNet (Miller, 1995). Correctly determining ImageNet categories from images with very little training data could be a challenging test of learning from few examples.

Learning to learn

Learning to learn or meta-learning (e.g., Harlow, 1949; Schmidhuber, 1987; Thrun and Pratt, 1998; Andrychowicz et al., 2016; Chen et al., 2016; de Freitas, 2016; Duan et al., 2016; Lake et al., 2016; Wang et al., 2016) is the acquisition of skills and inductive biases that facilitate future learning. The scenarios considered in particular are ones where a more general and slower learning process produces a faster, more specialized one. An example is biological evolution producing efficient learners such as human beings.


  1. Learning to play Atari video games is an area that has seen some remarkable recent successes, including in transfer learning (Parisotto et al., 2016). However, there is so far no system that first learns to play video games, then is capable of learning a new game, as humans can, from a few minutes of play (Lake et al., 2016).

Compositional learning

Compositional learning (de Freitas, 2016; Lake et al., 2016) is the ability to recombine primitive representations to accelerate the acquisition of new knowledge. It is closely related to learning to learn.


Tests for compositional learning need to verify both that the learner is effective and that it uses compositional representations.

  1. Some ImageNet categories correspond to object classes defined largely by their arrangements of component parts, e.g., chairs and stools, or unicycles, bicycles, and tricycles. A test could evaluate the agent’s ability to learn categories with few examples and to report the parts of the object in an image.
  2. Compositional learning should be extremely helpful in learning video games (Lake et al., 2016). A learner could be tested on a game already mastered, but where component elements have changed appearance (e.g., different-looking fish in the Frostbite game). It should be able to play the variant game with little or no additional learning.

Learning without forgetting

In order to learn continually over its lifetime, an agent must be able to generalize over new observations while retaining previously acquired knowledge. Recent progress towards this goal is reported in (Kirkpatrick et al., 2016) and (Li and Hoiem, 2016). Work on memory augmented neural networks (e.g., Graves et al., 2016) is also relevant.


A test for learning without forgetting needs to present learning tasks sequentially (earlier tasks are not repeated) and test for retention of early knowledge. It may also test for declining learning time for new tasks, to verify that the agent exploits the knowledge acquired so far.

  1. A challenging test for learning without forgetting would be to learn to recognize all the categories in ImageNet, presented sequentially.
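The test protocol described above (tasks presented sequentially, earlier tasks never repeated, retention checked afterwards) can be sketched as a small evaluation harness. Everything here is hypothetical and for illustration only: `retention_curve` is one possible way to organize the measurement, and `BoundedMemorizer` is a deliberately forgetful toy learner, not a method from the cited work. The sketch measures retention only, not the declining learning time mentioned above.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]                       # (input, label)
Task = Tuple[List[Example], List[Example]]      # (training set, test set)


def accuracy(learner, examples: Sequence[Example]) -> float:
    """Fraction of examples the learner labels correctly."""
    if not examples:
        return 0.0
    return sum(learner.predict(x) == y for x, y in examples) / len(examples)


def retention_curve(learner, tasks: Sequence[Task]) -> List[List[float]]:
    """Train on each task once, in order, never revisiting earlier ones.

    After task i, the learner is scored on the test sets of tasks 0..i,
    so row i shows how much of the earlier knowledge has survived.
    """
    history = []
    for i, (train_set, _) in enumerate(tasks):
        learner.train(train_set)
        history.append([accuracy(learner, test_set)
                        for _, test_set in tasks[:i + 1]])
    return history


class BoundedMemorizer:
    """Toy learner that remembers only its most recent `capacity` examples."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.memory: Dict[str, str] = {}

    def train(self, examples: Sequence[Example]) -> None:
        for x, y in examples:
            self.memory.pop(x, None)            # re-insert as most recent
            self.memory[x] = y
            while len(self.memory) > self.capacity:
                oldest = next(iter(self.memory))  # dicts keep insertion order
                del self.memory[oldest]

    def predict(self, x: str) -> str:
        return self.memory.get(x, "?")


if __name__ == "__main__":
    task_a = [("a", "1"), ("b", "1")]
    task_b = [("c", "2"), ("d", "2")]
    tasks = [(task_a, task_a), (task_b, task_b)]
    print(retention_curve(BoundedMemorizer(capacity=2), tasks))
    print(retention_curve(BoundedMemorizer(capacity=4), tasks))
```

A learner that forgets shows early columns dropping in later rows; a learner that retains keeps them flat.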

Transfer learning

Transfer learning (Pan and Yang, 2010) is the ability of an agent trained in one domain to master another. Results in the area of text comprehension are currently poor unless the agent is given some training on the new domain (Kadlec et al., 2016).


Sentiment classification (Blitzer et al., 2007) provides a possible testing ground for transfer learning. Learners can be trained on one corpus, tested on another, and compared to a baseline learner trained directly on the target domain.

  1. Reviews of movies and of businesses are two domains dissimilar enough to make knowledge transfer challenging. Corpora for the domains are Rotten Tomatoes movie reviews (Pang and Lee, 2005) and the Yelp Challenge dataset (Yelp, 2017).
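The comparison described above (train on one corpus, test on another, compare with a baseline trained directly on the target domain) can be sketched as follows. The word-vote classifier is a deliberately naive stand-in chosen for brevity, the function names are hypothetical, and the tiny corpora in the usage example are invented placeholders rather than samples from the datasets cited.

```python
from collections import Counter
from typing import List, Tuple

Doc = Tuple[str, int]   # (text, label), where label is +1 or -1


def train_word_votes(corpus: List[Doc]) -> Counter:
    """Each word accumulates the labels of the documents it appears in."""
    votes: Counter = Counter()
    for text, label in corpus:
        for word in text.lower().split():
            votes[word] += label
    return votes


def classify(votes: Counter, text: str) -> int:
    """Sum the votes of known words; ties and unknown words default to +1."""
    score = sum(votes[word] for word in text.lower().split())
    return 1 if score >= 0 else -1


def accuracy(votes: Counter, corpus: List[Doc]) -> float:
    return sum(classify(votes, text) == label for text, label in corpus) / len(corpus)


def transfer_gap(source_train: List[Doc],
                 target_train: List[Doc],
                 target_test: List[Doc]) -> Tuple[float, float, float]:
    """Return (transferred accuracy, in-domain baseline accuracy, gap)."""
    transferred = accuracy(train_word_votes(source_train), target_test)
    baseline = accuracy(train_word_votes(target_train), target_test)
    return transferred, baseline, baseline - transferred


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    movies = [("a moving brilliant film", 1), ("dull plot boring film", -1)]
    business_train = [("friendly staff tasty food", 1), ("rude staff cold food", -1)]
    business_test = [("tasty food", 1), ("cold rude", -1)]
    print(transfer_gap(movies, business_train, business_test))
```

The smaller the gap between the two accuracies, the more of what was learned in the source domain carried over to the target domain.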

Knowing when you don’t know

While uncertainty is modeled differently by different learning algorithms, it seems to be true in general that current artificial systems are not nearly as good as humans at “knowing when they don’t know.” One example is deep neural networks that achieve state-of-the-art accuracy on image recognition but assign 99.99% confidence to the presence of objects in images completely unrecognizable to humans (Nguyen et al., 2015).

Human performance on confidence estimation would include

  1. In induction tasks, like program induction or sequence completion, knowing when the provided examples are insufficient for induction (multiple reasonable hypotheses could account for them)
  2. In speech recognition, knowing when an utterance has not been interpreted reliably
  3. In visual tasks such as pedestrian detection, knowing when a part of the image has not been analyzed reliably
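The first item, recognizing that the examples underdetermine the answer, can be illustrated with a toy sequence-completion routine that abstains whenever its hypotheses disagree. The three hypothesis families and the function `predict_or_abstain` are assumptions made for this sketch; a real inducer would search a far richer hypothesis space.

```python
from typing import Callable, List, Optional

# A hypothesis returns the next term if it fits the sequence, else None.
Hypothesis = Callable[[List[int]], Optional[int]]


def arithmetic(seq: List[int]) -> Optional[int]:
    """Constant difference between consecutive terms."""
    if len(seq) < 2:
        return None
    step = seq[1] - seq[0]
    if all(b - a == step for a, b in zip(seq, seq[1:])):
        return seq[-1] + step
    return None


def geometric(seq: List[int]) -> Optional[int]:
    """Constant integer ratio between consecutive terms."""
    if len(seq) < 2 or 0 in seq or seq[1] % seq[0] != 0:
        return None
    ratio = seq[1] // seq[0]
    if all(b == a * ratio for a, b in zip(seq, seq[1:])):
        return seq[-1] * ratio
    return None


def fibonacci_like(seq: List[int]) -> Optional[int]:
    """Each term is the sum of the two before it.

    With only two terms the check is vacuous, so the hypothesis fits.
    """
    if len(seq) < 2:
        return None
    if all(c == a + b for a, b, c in zip(seq, seq[1:], seq[2:])):
        return seq[-2] + seq[-1]
    return None


HYPOTHESES: List[Hypothesis] = [arithmetic, geometric, fibonacci_like]


def predict_or_abstain(seq: List[int]) -> Optional[int]:
    """Predict only when all fitting hypotheses agree on the next term."""
    predictions = {h(seq) for h in HYPOTHESES} - {None}
    if len(predictions) == 1:
        return predictions.pop()
    return None   # no hypothesis fits, or several fit and disagree


if __name__ == "__main__":
    print(predict_or_abstain([1, 2]))         # several hypotheses disagree
    print(predict_or_abstain([1, 2, 3, 5]))   # only one hypothesis fits
```

Returning no answer when competing hypotheses remain is the behavior the item asks for: the system reports that it does not yet know rather than guessing with confidence.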


  1. A speech recognizer can be compared against a human baseline, measuring the ratio of the average confidence to the confidence on examples where recognition fails.
  2. The confidence of image recognition systems can be tested on generated adversarial examples.
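The ratio mentioned in the first test can be computed directly from a list of recognition results. The sketch below assumes each result is a pair of a confidence in [0, 1] and a flag saying whether recognition succeeded; the function name and data layout are hypothetical.

```python
from typing import List, Tuple

def confidence_ratio(results: List[Tuple[float, bool]]) -> float:
    """Mean confidence over all examples divided by mean confidence on failures.

    Each result is (confidence, was_correct). A ratio well above 1 suggests
    the system lowers its confidence when it fails; a ratio near 1 suggests
    it is just as confident when it is wrong.
    """
    failures = [confidence for confidence, correct in results if not correct]
    if not results or not failures:
        raise ValueError("need at least one result and at least one failure")
    mean_all = sum(confidence for confidence, _ in results) / len(results)
    mean_failed = sum(failures) / len(failures)
    if mean_failed == 0:
        return float("inf")
    return mean_all / mean_failed


if __name__ == "__main__":
    calibrated = [(0.9, True), (0.8, True), (0.2, False), (0.1, False)]
    overconfident = [(0.9, True), (0.9, False)]
    print(confidence_ratio(calibrated))
    print(confidence_ratio(overconfident))
```

Computing the same number for human listeners on the same utterances gives the baseline against which the recognizer is compared.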

Learning through action

Human infants are known to learn about the world through experiments, observing the effects of their own actions (Smith and Gasser, 2005; Malik, 2015). This seems to apply both to higher-level cognition and perception. Animal experiments have confirmed that the ability to initiate movement is crucial to perceptual development (Held and Hein, 1963) and some recent progress has been made on using motion in learning visual perception (Agrawal et al., 2015). In (Agrawal et al., 2016), a robot learns to predict the effects of a poking action.

“Learning through action” thus encompasses several areas, including

  • Active learning, where the agent selects the training examples most likely to be instructive
  • Undertaking epistemological actions, i.e., activities aimed primarily at gathering information
  • Learning to perceive through action
  • Learning about causal relationships through action

Perhaps most importantly, for artificial systems, learning the causal structure of the world through experimentation is still an open problem.


For learning through action, it is natural to consider problems of motor manipulation where in addition to the immediate effects of the agent’s actions, secondary effects must be considered as well.

  1. Learning to play billiards: An agent with little prior knowledge and no fixed training data is allowed to explore a real or virtual billiard table and should learn to play billiards well.

Sensorimotor problems

Outstanding problems in robotics and machine perception include:

  1. Autonomous navigation in dynamic environments
  2. Scene analysis
  3. Robust general object recognition and detection
  4. Robust, lifetime simultaneous localization and mapping (SLAM)
  5. Multimodal integration
  6. Adaptive dexterous manipulation

Autonomous navigation

Despite recent progress in self-driving cars by companies like Tesla, Waymo (formerly the Google self-driving car project) and many others, autonomous navigation in highly dynamic environments remains a largely unsolved problem, requiring knowledge of object semantics to reliably predict future scene states (Ess et al., 2010).


  1. Fully automatic driving in crowded city streets and residential areas is still a challenging test for autonomous navigation.

Scene analysis

The challenge of scene analysis extends far beyond object recognition and includes the understanding of surfaces formed by multiple objects, scene 3D structure, causal relations (Lake et al., 2016), and affordances. It is not limited to vision but can depend on audition, touch, and other modalities, e.g., electroreception and echolocation (Lewicki et al., 2014; Kondo et al., 2017). While progress has been made, e.g., in recognizing anomalous and improbable scenes (Choi et al., 2012), predicting object dynamics (Fouhey and Zitnick, 2014), and discovering object functionality (Yao et al., 2013), we are still far from human-level performance in this area.


Some possible challenges for understanding the causal structure in visual scenes are:

  1. Recognizing dangerous situations: A corpus of synthetic images could be created where the same objects are recombined to form “dangerous” and “safe” scenes as classified by humans.
  2. Recognizing physically improbable scenes: A synthetic corpus could be created to show physically plausible and implausible scenes containing the same objects.
  3. Recognizing useless objects: Images of useless objects have been created by (Kamprani, 2017).

Object recognition

While object recognition has seen great progress in recent years (e.g., Han et al., 2016), matches or surpasses human performance for many problems (Karpathy, 2014), and can approach perfection in closed environments (Song et al., 2015), state-of-the-art systems still struggle with the harder cases such as open objects (interleaved with background), broken objects, truncation and occlusion in dynamic environments (e.g., Rajaram et al., 2015).


Environments that are cluttered and contain objects drawn from a large, open-ended, and changing set of types are likely to be challenging for an object recognition system. An example would be

  1. Seeing photos of the insides of pantries and refrigerators and listing the ingredients available to the owners

Simultaneous localization and mapping

While the problem of simultaneous localization and mapping (SLAM) is considered solved for some applications, the challenge of SLAM for long-lived autonomous robots in large-scale, time-varying environments remains open (Cadena et al., 2016).


  1. Lifelong localization and mapping, without detailed maps provided in advance and robust to changes in the environment, for an autonomous car based in a large city

Multimodal integration

The integration of multiple senses (Lahat et al., 2015) is important, e.g., in human communication (Morency, 2015) and scene understanding (Lewicki et al., 2014; Kondo et al., 2017). Having multiple overlapping sensory systems seems to be essential for enabling human children to educate themselves by perceiving and acting in the world (Smith and Gasser, 2005).


Spoken communication in noisy environments, where lip reading and gestural cues are indispensable, can provide challenges for multimodal fusion. An example would be:

  1. A robot bartender: The agent needs to interpret customer requests in a noisy bar.

Adaptive dexterous manipulation

Current robot manipulators do not come close to the versatility of the human hand (Ciocarlie, 2015). Hard problems include manipulating deformable objects and operating from a mobile platform.


  1. Taking out clothes from a washing machine and hanging them on clothes lines and coat hangers in varied places while staying out of the way of humans

Open-ended problems

Some noteworthy problems were omitted from the list because their scope is too open-ended: they encompass sets of tasks that evolve over time or can be endlessly extended. This makes it hard to decide whether a problem has been solved. Problems of this type include:

  • Enrolling in a human university and taking classes like humans (Goertzel, 2012)
  • Automating all types of human work (Nilsson, 2005)
  • Puzzlehunt challenges, e.g., the annual TMOU game in the Czech Republic (TMOU, 2016)


I have reviewed a number of open problems in an attempt to delineate the current front lines of AI research. The problem list in this first version, as well as the problem descriptions, example tests, and mentions of ongoing work in the research areas, are necessarily incomplete. I plan to extend and improve the document incrementally and warmly welcome suggestions either in the comment section below or at the institute’s discourse forum.


I thank Jan Feyereisl, Martin Poliak, Petr Dluhoš, and the rest of the GoodAI team for valuable discussion and suggestions.


AAAI. “AAAI-13 International general game playing competition.” Online under (2013)

Agrawal, Pulkit, Joao Carreira, and Jitendra Malik. “Learning to see by moving.” Proceedings of the IEEE International Conference on Computer Vision. 2015.

Agrawal, Pulkit, et al. “Learning to poke by poking: Experiential learning of intuitive physics.” arXiv preprint arXiv:1606.07419 (2016).

AI•ON. “The AI•ON collection of open research problems.” Online under (2016)

Allamanis, Miltiadis, Hao Peng, and Charles Sutton. “A convolutional attention network for extreme summarization of source code.” arXiv preprint arXiv:1602.03001 (2016).

Andrychowicz, Marcin, et al. “Learning to learn by gradient descent by gradient descent.” Advances in Neural Information Processing Systems. 2016.

Aupperlee, Aaron. “No bluff: Supercomputer outwits humans in poker rematch.” Online under (2017)

Blitzer, John, Mark Dredze, and Fernando Pereira. “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification.” ACL. Vol. 7. 2007.

Brachman, Ronald J. “AI more than the sum of its parts.” AI Magazine 27.4 (2006): 19.

Brooks, R., et al. “Challenge problems for artificial intelligence.” Thirteenth National Conference on Artificial Intelligence-AAAI. 1996.

Brown, Noam, and Tuomas Sandholm. “Safe and Nested Endgame Solving for Imperfect-Information Games.” Online under (2017)

Cadena, Cesar, et al. “Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age.” IEEE Transactions on Robotics 32.6 (2016): 1309–1332.

Chang, Michael B., et al. “A compositional object-based approach to learning physical dynamics.” arXiv preprint arXiv:1612.00341 (2016).

Chen, Yutian, et al. “Learning to Learn for Global Optimization of Black Box Functions.” arXiv preprint arXiv:1611.03824 (2016).

Choi, Myung Jin, Antonio Torralba, and Alan S. Willsky. “Context models and out-of-context objects.” Pattern Recognition Letters 33.7 (2012): 853–862.

Ciocarlie, Matei. “Versatility in Robotic Manipulation: the Long Road to Everywhere.” Online under (2015)

Commonsense Reasoning. “Commonsense reasoning problem page.” Online under (2015)

Commonsense Reasoning. “Commonsense reasoning Winograd schema challenge.” Online under (2016a)

Commonsense Reasoning. “Commonsense reasoning pronoun disambiguation problems.” Online under (2016b)

Davies, Mark. The corpus of contemporary American English. BYU, Brigham Young University, 2008.

Davis, Ernest. “Collection of Winograd schemas.” Online under (2011)

de Freitas, Nando. “Learning to Learn and Compositionality with Deep Recurrent Neural Networks: Learning to Learn and Compositionality.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

Degrave, Jonas, Michiel Hermans, and Joni Dambre. “A Differentiable Physics Engine for Deep Learning in Robotics.” arXiv preprint arXiv:1611.01652 (2016).

Deng, Jia, et al. “Imagenet: A large-scale hierarchical image database.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

Denil, Misha, et al. “Learning to Perform Physics Experiments via Deep Reinforcement Learning.” arXiv preprint arXiv:1611.01843 (2016).

Duan, Yan, et al. “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning.” arXiv preprint arXiv:1611.02779 (2016).

Ess, Andreas, et al. “Object detection and tracking for autonomous navigation in dynamic environments.” The International Journal of Robotics Research 29.14 (2010): 1707–1725.

Finn, Chelsea, and Sergey Levine. “Deep Visual Foresight for Planning Robot Motion.” arXiv preprint arXiv:1610.00696 (2016).

Fouhey, David F., and C. Lawrence Zitnick. “Predicting object dynamics in scenes.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Fragkiadaki, Katerina, et al. “Learning visual predictive models of physics for playing billiards.” arXiv preprint arXiv:1511.07404 (2015).

Garofolo, John, et al. “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.” Web Download. Philadelphia: Linguistic Data Consortium, 1993.

Gaunt, Alexander L., et al. “Terpret: A probabilistic programming language for program induction.” arXiv preprint arXiv:1608.04428 (2016).

Genesereth, Michael, Nathaniel Love, and Barney Pell. “General game playing: Overview of the AAAI competition.” AI magazine 26.2 (2005): 62.

Goertzel, Ben. “What counts as a conscious thinking machine?” Online under (2012)

GoodAI. “General AI Challenge.” Online under (2017)

Graves, Alex, et al. “Hybrid computing using a neural network with dynamic external memory.” Nature 538.7626 (2016): 471–476.

Hamrick, Jessica B., et al. “Imagination-Based Decision Making with Physical Models in Deep Neural Networks.” Online under (2016)

Han, Dongyoon, Jiwhan Kim, and Junmo Kim. “Deep Pyramidal Residual Networks.” arXiv preprint arXiv:1610.02915 (2016).

Harlow, Harry F. “The formation of learning sets.” Psychological review 56.1 (1949): 51.

Held, Richard, and Alan Hein. “Movement-produced stimulation in the development of visually guided behavior.” Journal of comparative and physiological psychology 56.5 (1963): 872.

Hernández-Orallo, José. “Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement.” Artificial Intelligence Review (2016a): 1–51.

Hernández-Orallo, José, et al. “Computer models solving intelligence test problems: progress and implications.” Artificial Intelligence 230 (2016b): 74–107.

Hernández-Orallo, José. “The measure of all minds.” Cambridge University Press, 2017.

IOCCC. “The International Obfuscated C Code Contest.” Online under (2016)

Kadlec, Rudolf, et al. “Finding a jack-of-all-trades: an examination of semi-supervised learning in reading comprehension.” Under review at ICLR 2017, online under

Kamper, Herman, Aren Jansen, and Sharon Goldwater. “Unsupervised word segmentation and lexicon discovery using acoustic word embeddings.” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.4 (2016): 669–679.

Kamprani, Katerina. “The uncomfortable.” Online under (2017)

Karpathy, Andrej. “What I learned from competing against a ConvNet on ImageNet.” Online under (2014)

Kirkpatrick, James, et al. “Overcoming catastrophic forgetting in neural networks.” arXiv preprint arXiv:1612.00796 (2016).

Kondo, H. M., et al. “Auditory and visual scene analysis: an overview.” Philosophical transactions of the Royal Society of London. Series B, Biological sciences 372.1714 (2017).

Lahat, Dana, Tülay Adali, and Christian Jutten. “Multimodal data fusion: an overview of methods, challenges, and prospects.” Proceedings of the IEEE 103.9 (2015): 1449–1477.

Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. “Human-level concept learning through probabilistic program induction.” Science 350.6266 (2015): 1332–1338.

Lake, Brenden M., et al. “Building machines that learn and think like people.” arXiv preprint arXiv:1604.00289 (2016).

LeCun, Yann. “The Next Frontier in AI: Unsupervised Learning.” Online under (2016)

Lewicki, Michael S., et al. “Scene analysis in the natural environment.” Frontiers in psychology 5 (2014): 199.

Li, Wenbin, Aleš Leonardis, and Mario Fritz. “Visual stability prediction and its application to manipulation.” arXiv preprint arXiv:1609.04861 (2016).

Li, Zhizhong, and Derek Hoiem. “Learning without forgetting.” European Conference on Computer Vision. Springer International Publishing, 2016.

Loebner, Hugh. “Home page of the Loebner prize-the first Turing test.” Online under (2016)

Lovett, Andrew, and Kenneth Forbus. “Modeling visual problem solving as analogical reasoning.” Psychological Review 124.1 (2017): 60.

Malik, Jitendra. “The Hilbert Problems of Computer Vision.” Online under (2015)

McCarthy, John. “An example for natural language understanding and the AI Problems it raises.” Online under (1976)

McLeod, John. “Card game rules — card games and tile games from around the world.” Online under (2017)

Miller, George A. “WordNet: a lexical database for English.” Communications of the ACM 38.11 (1995): 39–41.

Morency, Louis-Philippe. “Multimodal Machine Learning.” Online under (2015)

Mottaghi, Roozbeh, et al. “‘What happens if…’ Learning to Predict the Effect of Forces in Images.” European Conference on Computer Vision. Springer International Publishing, 2016.

Nair, Ashvin, et al. “Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation.” Online under (2016)

Nguyen, Anh, Jason Yosinski, and Jeff Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.

Nilsson, Nils J. “Human-level artificial intelligence? Be serious!” AI magazine 26.4 (2005): 68.

OpenAI. “Requests for research.” Online under (2016)

Pan, Sinno Jialin, and Qiang Yang. “A survey on transfer learning.” IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345–1359.

Pang, Bo, and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.” Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005.

Parisotto, Emilio, Jimmy Lei Ba, and Ruslan Salakhutdinov. “Actor-mimic: Deep multitask and transfer reinforcement learning.” arXiv preprint arXiv:1511.06342 (2015).

Park, Alex S., and James R. Glass. “Unsupervised pattern discovery in speech.” IEEE Transactions on Audio, Speech, and Language Processing 16.1 (2008): 186–197.

Rajaram, Rakesh Nattoji, Eshed Ohn-Bar, and Mohan M. Trivedi. “An exploration of why and when pedestrian detection fails.” 2015 IEEE 18th International Conference on Intelligent Transportation Systems. IEEE, 2015.

Riccardi, Giuseppe, and Dilek Z. Hakkani-Tür. “Active and unsupervised learning for automatic speech recognition.” Interspeech. 2003.

Robo chat challenge. “Robo chat challenge 2014.” Online under (2014)

Rosa, Marek, Jan Feyereisl, and The GoodAI Collective. “A Framework for Searching for General Artificial Intelligence.” arXiv preprint arXiv:1611.00685 (2016).

Schmidhuber, Jürgen. “Evolutionary principles in self-referential learning (On learning how to learn: The meta-meta-… hook).” Diploma thesis, Institut f. Informatik, Tech. Univ. Munich (1987).

Silver, David, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016): 484–489.

Smith, Linda, and Michael Gasser. “The development of embodied cognition: Six lessons from babies.” Artificial life 11.1–2 (2005): 13–29.

Song, Shuran, Linguang Zhang, and Jianxiong Xiao. “Robot in a room: Toward perfect object recognition in closed environments.” CoRR (2015).

Stewart, Russell, and Stefano Ermon. “Label-free supervision of neural networks with physics and domain knowledge.” arXiv preprint arXiv:1609.05566 (2016).

Tarlow, Daniel. “In Search of Strong Generalization.” Online under (2016)

Thrun, Sebastian, and Tom M. Mitchell. “Lifelong robot learning.” Robotics and autonomous systems 15.1–2 (1995): 25–46.

Thrun, Sebastian, and Lorien Pratt. “Learning to learn: Introduction and overview.” Learning to learn. Springer US, 1998. 3–17.

TMOU. “Archiv TMOU.” Online under (2016)

Verschae, Rodrigo, and Javier Ruiz-del-Solar. “Object detection: current and future directions.” Frontiers in Robotics and AI 2 (2015): 29.

Wang, Jane X., et al. “Learning to reinforcement learn.” arXiv preprint arXiv:1611.05763 (2016).

Yao, Bangpeng, Jiayuan Ma, and Li Fei-Fei. “Discovering object functionality.” Proceedings of the IEEE International Conference on Computer Vision. 2013.

Yelp. “The Yelp Dataset Challenge.” Online under (2017)

Unsolved Problems in AI was originally published in AI Roadmap Institute Blog on Medium.