Combining vision and language could be the key to more capable AI – TechCrunch

Depending on the theory of intelligence you subscribe to, achieving “human-level” AI requires a system that can use multiple modalities (e.g., sound, vision and text) to reason about the world. For example, when shown an image of an overturned truck and a police cruiser on a snowy highway, a human-level AI could conclude that dangerous road conditions caused an accident. Or, running on a robot and asked to get a can of soda from the fridge, it would navigate around people, furniture and pets to retrieve the can and place it within reach of the requester.

Today’s AI falls short. However, new research shows signs of encouraging progress, from robots that can figure out steps to fulfill basic commands (e.g., “Get a water bottle”) to text-generating systems that learn from explanations. In this revived edition of Deep Science, our weekly series on the latest developments in AI and the broader scientific field, we report on work from DeepMind, Google and OpenAI that makes strides toward systems that can, if not perfectly understand the world, solve narrow tasks like generating images with impressive robustness.

AI research lab OpenAI’s improved DALL-E, DALL-E 2, is by far the most impressive project to emerge from the depths of an AI research lab. As my colleague Devin Coldewey writes, while the original DALL-E demonstrated a remarkable ability to create images that match virtually any prompt (e.g., “a dog in a beret”), DALL-E 2 takes it a step further. The images it produces are much more detailed, and DALL-E 2 can intelligently replace a specific area in an image – for example, by inserting a table into a photo of a marble floor, complete with the appropriate reflections.

An example of the types of images DALL-E 2 can produce.

DALL-E 2 got the most attention this week. But on Thursday, researchers at Google detailed an equally impressive visual understanding system called Visually-Driven Prosody for Text-to-Speech (VDTTS) in a post published on Google’s AI blog. VDTTS can produce realistic-sounding, lip-synced speech given nothing more than text and video frames of the person speaking.

While not a perfect substitute for recorded dialogue, the speech generated by VDTTS is still quite good, with convincingly human-like expressiveness and timing. Google sees it being used in a studio one day to replace original audio that may have been recorded in noisy conditions.

Of course, visual understanding is only one step towards more powerful AI. Another component is language comprehension, which lags behind in many aspects – even if you ignore AI’s well-documented toxicity and bias problems. As a glaring example, a cutting-edge system from Google, Pathways Language Model (PaLM), memorized 40% of the data used to “train” it, according to one paper, leading to PaLM plagiarizing text right down to copyright notices in code snippets.

Luckily, DeepMind, the Alphabet-backed AI lab, is among those researching techniques to address this. In a new study, DeepMind researchers investigate whether AI language systems, which learn to generate text from many examples of existing text (think books and social media), could benefit from being given explanations of those texts. After annotating dozens of language tasks (e.g., “Answer these questions by noting whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence”) with explanations (e.g., “David’s eyes weren’t literally daggers, it’s a metaphor used to suggest that David was glaring at Paul.”) and evaluating the performance of various systems on them, the DeepMind team found that explanations do improve the systems’ performance.

DeepMind’s approach, if it catches on within the academic community, could one day be applied in robotics, forming the building blocks of a robot that can understand vague requests (e.g., “throw out the trash”) without step-by-step instructions. Google’s new “Do As I Can, Not As I Say” project provides a glimpse into that future – albeit with significant limitations.

Do As I Can, Not As I Say, a collaboration between Robotics at Google and the Everyday Robotics team at Alphabet’s X lab, seeks to condition an AI language system to suggest actions that are “feasible” and “contextually appropriate” for a robot when given an arbitrary task. The robot acts as the “hands and eyes” of the language system while the system provides high-level semantic knowledge about the task – the theory being that the language system encodes a wealth of knowledge useful to the robot.

Google robotics

Photo credit: Robotics at Google

A system called SayCan selects which skill the robot should perform in response to a command, taking into account (1) the likelihood that a particular skill will be useful and (2) the chance of successfully performing that skill. For example, if someone says, “I spilled my coke, can you get me something to clean up?” SayCan can instruct the robot to find a sponge, pick up the sponge, and bring it to the person who asked for it.
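The two-factor selection described above can be sketched in a few lines of Python. This is only an illustration, not Google's implementation: it assumes the two signals are combined by multiplying them, and the `SkillScore` type, the `select_skill` function and all of the numbers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SkillScore:
    """Hypothetical container for the two signals described in the article."""
    name: str
    usefulness: float  # (1) likelihood that the skill helps with the command
    success: float     # (2) chance that the robot can perform the skill right now


def select_skill(candidates: list[SkillScore]) -> SkillScore:
    """Return the candidate with the highest combined score.

    Assumption: the signals are combined by multiplication, so a skill has
    to be both relevant and feasible to be chosen.
    """
    if not candidates:
        raise ValueError("no candidate skills to choose from")
    return max(candidates, key=lambda s: s.usefulness * s.success)


# Made-up scores for the "I spilled my coke" request.
candidates = [
    SkillScore("find a sponge", usefulness=0.60, success=0.90),
    # Relevant, but unlikely to succeed before a sponge has been located.
    SkillScore("pick up the sponge", usefulness=0.70, success=0.10),
    SkillScore("find a vacuum", usefulness=0.20, success=0.80),
]

best = select_skill(candidates)
print(best.name)
```

With these invented numbers, "pick up the sponge" is the most relevant skill on paper but loses because the robot is unlikely to pull it off before finding a sponge, which is the kind of trade-off the two factors are meant to capture.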

SayCan is limited by robotics hardware: on more than one occasion, the research team observed the robot they chose for their experiments accidentally dropping objects. Still, along with DALL-E 2 and DeepMind’s work in contextual understanding, it is an example of how AI systems, when combined, can bring us that much closer to a Jetsons-type future.
