In a step toward understanding how humans infer the way objects respond to physical forces, researchers have developed a computer system capable of predicting just that: how objects respond to physical forces.
Josh Tenenbaum, a professor of brain and cognitive sciences at MIT, and his team presented the work at this year’s Conference on Neural Information Processing Systems. In it, they examine two fundamental cognitive abilities that an intelligent agent needs to navigate the world: discerning distinct objects and inferring how those objects respond to physical forces.
The researchers behind the computer system say that such systems could help answer questions about what information-processing resources human beings use at which stages of development. Along the way, the work might also yield insights useful for robotic vision systems.
The work spans four studies: three deal with inferring information about the physical structure of objects from both visual and aural data, and the fourth with predicting how objects will behave on the basis of that information.
All four papers rely on machine learning, a technique in which computers learn to perform computational tasks by analyzing huge sets of training data. In a typical machine-learning system, the training data are labeled: human analysts will have, say, identified the objects in a visual scene or transcribed the words of a spoken sentence. The system attempts to learn which features of the data correlate with which labels, and it’s judged on how well it labels previously unseen data.
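For contrast with the approach described next, here is a generic sketch of that standard labeled-data setup. The dataset and classifier are purely illustrative and are not taken from the papers:

```python
# A generic illustration of supervised learning with labeled data, not the
# researchers' code: the model learns a mapping from features to
# human-provided labels and is judged on held-out examples.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)            # images and human-assigned labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)        # any off-the-shelf classifier would do
clf.fit(X_train, y_train)                      # learn feature-label correlations
print("accuracy on unseen data:", clf.score(X_test, y_test))
```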
In the new studies, by contrast, the computer system is trained to infer a physical model of the world: the 3-D shapes of objects that are mostly hidden from view, for instance. It then works backward, using that model to resynthesize the input data, and its performance is judged on how well the reconstructed data match the original.
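The sketch below illustrates that analysis-by-synthesis idea in miniature, assuming a toy encoder-decoder and a mean-squared reconstruction error; the real system infers 3-D shape and re-renders it, which is considerably more involved. The key point is that the loss compares the reconstruction to the input itself, so no human-provided labels are needed:

```python
# A minimal sketch of "infer a model, then resynthesize the input" training.
# The architecture and dimensions are placeholders, not the authors' design.
import torch
import torch.nn as nn

class AnalysisBySynthesis(nn.Module):
    def __init__(self, image_dim=64 * 64, model_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(),
                                     nn.Linear(256, model_dim))   # infer a compact "world model"
        self.decoder = nn.Sequential(nn.Linear(model_dim, 256), nn.ReLU(),
                                     nn.Linear(256, image_dim))   # resynthesize the input from it

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = AnalysisBySynthesis()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
images = torch.rand(32, 64 * 64)               # stand-in for training images

opt.zero_grad()
recon = net(images)
loss = nn.functional.mse_loss(recon, images)   # judged on matching the input; no labels required
loss.backward()
opt.step()
```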
For instance, using visual images to build a 3-D model of an object in a scene requires stripping away any occluding objects; filtering out confounding visual textures, reflections, and shadows; and inferring the shape of unseen surfaces. Once the system has built such a model, however, it rotates it in space and adds visual textures back in until it can approximate the input data.
The researchers’ system is based on the influential theories of the MIT neuroscientist David Marr, who died in 1980 at the tragically young age of 35. Marr hypothesized that in interpreting a visual scene, the brain first creates what he called a 2.5-D sketch of the objects it contained — a representation of just those surfaces of the objects facing the viewer. Then, on the basis of the 2.5-D sketch — not the raw visual information about the scene — the brain infers the full, three-dimensional shapes of the objects.
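A rough sketch of that two-stage idea follows, assuming the 2.5-D sketch is represented as a per-pixel depth map and the full shape as a coarse voxel grid. Both networks are hypothetical placeholders, not the authors' architectures:

```python
# Marr's two-stage pipeline in miniature: image -> 2.5-D sketch -> 3-D shape.
# Architectures and grid sizes here are illustrative assumptions only.
import torch
import torch.nn as nn

class TwoPointFiveDSketch(nn.Module):
    """Image -> viewer-facing surfaces, here a single-channel depth map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, image):                  # (B, 3, H, W) -> (B, 1, H, W)
        return self.net(image)

class ShapeFromSketch(nn.Module):
    """2.5-D sketch -> full 3-D shape, here a 32x32x32 voxel occupancy grid."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 512), nn.ReLU(),
                                 nn.Linear(512, 32 ** 3))

    def forward(self, sketch):
        return torch.sigmoid(self.net(sketch)).view(-1, 32, 32, 32)

image = torch.rand(1, 3, 64, 64)
voxels = ShapeFromSketch()(TwoPointFiveDSketch()(image))  # two stages, as Marr proposed
```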
The system is initially trained on synthetic data, but once it has been, it can be fine-tuned using real data. That’s because its ultimate performance criterion is the accuracy with which it reconstructs the input data: it’s still building 3-D models, but they don’t need to be compared to human-constructed models for performance assessment.
In evaluating their system, the researchers used a measure called intersection over union, which is common in the field. On that measure, their system outperforms its predecessors. But a given intersection-over-union score leaves a lot of room for local variation in the smoothness and shape of a 3-D model, so the researchers also had human participants compare reconstructions: 74 percent preferred the new system’s reconstructions to those of its predecessors.
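Intersection over union itself is simple to compute. Below is a minimal sketch for voxel occupancy grids; the grid size and the 0.5 occupancy threshold are illustrative assumptions, not values from the papers:

```python
# Intersection over union for two voxel occupancy grids: the overlap of the
# occupied regions divided by their combined extent.
import numpy as np

def voxel_iou(pred, target, threshold=0.5):
    pred_occ = pred >= threshold                        # binarize predicted occupancy
    target_occ = target >= threshold                    # binarize reference occupancy
    intersection = np.logical_and(pred_occ, target_occ).sum()
    union = np.logical_or(pred_occ, target_occ).sum()
    return intersection / union if union > 0 else 1.0

# Two random 32x32x32 grids, purely for illustration.
print(voxel_iou(np.random.rand(32, 32, 32), np.random.rand(32, 32, 32)))
```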