Researchers at Massachusetts Institute of Technology (MIT) have developed a semantic parser that learns through observation, mimicking a child’s language-acquisition process – with the potential to greatly extend computing’s capabilities.
Traditionally, language-learning computer systems are trained on sentences annotated by humans that describe the structure and meaning behind words. These methods underpin web searches, natural-language database querying, and virtual assistants from the likes of Amazon and Google.
The data annotation process is often time consuming and can raise complications around how to correctly label an image, and accurately reflect natural speech patterns.
In an attempt to tackle these issues, MIT researchers have developed a natural language processing parser that learns the structure of language through simple observation.
The parser watched around 400 captioned videos and associated the words with the recorded objects and actions. It could then use what it had learnt about the structure of the language to accurately predict a new sentence’s meaning, without any video.
According to the researchers, this new approach could expand the types of data that can be used and reduce the effort required to train parsers. It’s a “weakly supervised” approach that enables a few annotated sentences to be combined with more easily acquired captioned videos to boost performance.
The process could improve natural interaction between humans and personal robots in the future, allowing robots to constantly observe, and learn from, the interactions going on around them.
As co-author Andrei Barbu, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT explains:
People talk to each other in partial sentences, run-on thoughts, and jumbled language. You want a robot in your home that will adapt to their particular way of speaking … and still figure out what they mean.
Watch & learn
Looking at the system from other side, the parser could also help researchers better understand how young children learn language. Co-author Boris Katz, a principal research scientist and head of the InfoLab Group at CSAIL said:
“A child has access to redundant, complementary information from different modalities, including hearing parents and siblings talk about the world, as well as tactile information and visual information, [which help him or her] to understand the world.
It’s an amazing puzzle, to process all this simultaneous sensory input. This work is part of bigger piece to understand how this kind of learning happens in the world.
First author Candace Ross, a graduate student in the Department of Electrical Engineering and Computer Science and CSAIL, described the advantage of using video instead of images to train the parser:
“There are temporal components — objects interacting with each other and with people — and high-level properties you wouldn’t see in a still image or just in language.”
The researchers plan to progress to modelling interactions, as well as passive observations, again leaning on how children interact with an environment to learn about it.
Internet of Business says
Many people have grown used to interacting with digital assistants, often through smart home products, such as Google Home and Amazon Alexa. However, while these experiences have gradually become less clunky and more capable, they are still a long way off the intelligence and natural language processing ability required for deeper and lengthier interactions.
These platforms usually require users to know a list of commands and functions, thereby adapting to the digital assistant’s requirements, rather than the other way around. Secondary questions and commands can be hit-or-miss, revealing how little such assistants truly understand.
Google’s recently announced Duplex system promises to be more capable in closed domains, but the ultimate goal of a natural language system that can learn without the need for developers to annotate vast quantities of data his been elusive – until now.
This isn’t MIT’s first foray into leveraging computer vision to allow an AI to learn more independently. We recently reported on their advances in object recognition and manipulation.
MIT Media Lab researchers have also developed a system that outperforms existing models in recognising small facial expressions and their corresponding emotions.
With plans for a $1 billion AI college, MIT is well placed to play a key role in natural language processing research going forward.