If future AI is going to look through our eyes via AR glasses and other wearables, as many tech companies hope, it will need to learn to understand the human perspective. We take that perspective for granted, but very little first-person video footage of everyday tasks exists, so Facebook collected thousands of hours of it for a new publicly available dataset.
The problem Facebook is trying to solve is that even today's most impressive object- and scene-recognition models are trained almost exclusively on third-person imagery. A model can recognize a person cooking, but only if it sees that person standing in a kitchen; it can't recognize cooking as seen through the cook's own eyes. It can recognize a bike, but not from the rider's point of view. This shift in perspective is such a natural part of our experience that we take it for granted, but computers find it very difficult.
The solution to a machine learning problem is usually either more data or better data, and in this case both will help. So Facebook enlisted research partners around the world to collect first-person video of common activities like cooking, grocery shopping, tying shoelaces, or just hanging out.
Thirteen partner universities collected thousands of hours of video from more than 700 participants in nine countries. To be clear up front: the participants were volunteers who controlled their own level of involvement and how they were identified. Those thousands of hours were trimmed to 3,000 by a research team that watched, edited, and annotated the footage, supplementing it with their own video of staged environments that couldn't be captured in the wild. It's all described in this research paper.
The footage was captured in a variety of ways, from eyeglass-mounted cameras to GoPros and other devices. Some participants also scanned the environments they were interacting in, while others had their gaze direction and other metrics tracked. It all goes into a Facebook dataset called Ego4D, which is freely available to the research community at large.
“For AI systems to interact with the world the way we do, the AI field needs to evolve to an entirely new paradigm of first-person perception. That means teaching AI to understand daily life activities through human eyes in the context of real-time motion, interaction, and multi-sensory observations,” said lead researcher Kristen Grauman in a Facebook blog post.
Believe it or not, this research has nothing to do with Ray-Ban Stories smart shades, except insofar as Facebook clearly believes that first-person understanding is increasingly important across multiple disciplines. (The 3D scans, however, are intended for the company's Habitat AI training simulator.)
“Our research is strongly motivated by applications in augmented reality and robotics,” Grauman told TechCrunch. “First-person perception is critical to enabling future AI assistants, especially as wearables like AR glasses become an integral part of how people live and move in their daily lives. Think about how beneficial it would be if the assistant on your device could remove cognitive overload from your life, understanding your world through your eyes.”
The global nature of the collected videos is a very deliberate choice. Including footage from only a single country or culture would be fundamentally shortsighted. American kitchens look different from French, Rwandan, and Japanese kitchens. Making the same dish with the same ingredients, or performing the same common task (cleaning, exercise), can look very different not just across cultures but between individuals. So, as the Facebook post puts it: “Compared to existing datasets, the Ego4D dataset provides a greater diversity of scenes, people, and activities, which increases the applicability of models trained for people across backgrounds, ethnicities, professions, and ages.”
Facebook isn't just releasing the database, either. With this kind of leap in data collection, it's common to also publish a set of benchmarks for testing how well a given model is using the information. For example, a standard benchmark for telling dogs from cats might use a set of labeled images of each to test how effectively a model can distinguish between them.
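As a sketch of what such a benchmark amounts to under the hood (a hypothetical toy example, not Facebook's actual evaluation code), a model is simply scored on how many labeled examples it classifies correctly:

```python
from typing import Callable, List, Tuple

def run_benchmark(model: Callable[[str], str],
                  examples: List[Tuple[str, str]]) -> float:
    """Score a classifier against labeled examples; return accuracy in [0, 1]."""
    correct = sum(1 for image, label in examples if model(image) == label)
    return correct / len(examples)

# Toy stand-in "model": classifies by a keyword in the (hypothetical) filename.
toy_model = lambda path: "dog" if "dog" in path else "cat"

examples = [("dog_01.jpg", "dog"), ("cat_01.jpg", "cat"), ("cat_02.jpg", "dog")]
print(run_benchmark(toy_model, examples))  # 2 of 3 correct, so roughly 0.667
```

Real benchmarks differ mainly in scale and in how the score is defined (accuracy, precision/recall, etc.), but the shape is the same: a fixed labeled test set and a standard scoring rule so that different models can be compared fairly.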
In this case, things are a little more complex. Simply identifying objects from a first-person perspective isn't especially hard (it's really just another angle), and it would be neither new nor particularly useful. Do you really need AR glasses to tell you “that's a tomato”? No: like any other tool, an AR device should tell you something you don't already know, and to do that it needs a deeper understanding of things like intent, context, and linked behaviors.
To that end, the researchers came up with five tasks that could, in theory, be accomplished by analyzing this first-person imagery:
- Episodic memory: Tracking objects and concepts in time and space so the system can answer arbitrary questions like “Where are my keys?”
- Forecasting: Understanding sequences of events so the system can answer questions like “What's next in this recipe?” or preemptively flag things like “You left your car keys at home.”
- Hand-object interaction: Identifying how people grasp and manipulate objects, and what happens when they do. This can feed episodic memory, or signal actions for a robot to imitate.
- Audio-visual diarization: Associating sounds with events and objects so speech and music can be tracked intelligently, for situations like identifying what song was playing in a café or what your boss said at the end of a meeting. (“Diarization” is their word.)
- Social interaction: Understanding who is talking to whom and what is being said, both to inform the other processes and for immediate uses like live captioning in a noisy room with multiple speakers.
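As a loose illustration of the episodic-memory idea above (a hypothetical sketch, not part of Ego4D's actual tooling), a “Where are my keys?” query could be answered by keeping a log of timestamped object sightings and returning the most recent known location:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sighting:
    timestamp: float  # seconds into the wearer's video stream
    obj: str          # detected object label, e.g. "keys"
    location: str     # scene/place label, e.g. "kitchen counter"

def last_seen(log: List[Sighting], obj: str) -> Optional[str]:
    """Answer a 'Where is my X?' query from the most recent sighting of X."""
    hits = [s for s in log if s.obj == obj]
    return max(hits, key=lambda s: s.timestamp).location if hits else None

log = [
    Sighting(10.0, "keys", "hallway table"),
    Sighting(95.5, "mug", "desk"),
    Sighting(120.2, "keys", "kitchen counter"),
]
print(last_seen(log, "keys"))  # prints "kitchen counter"
```

The hard part, of course, is everything this sketch assumes away: producing reliable object and place labels from raw first-person video in the first place, which is exactly what the benchmarks are meant to measure.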
Of course, these are not the only possible applications or benchmarks. They're just an initial set of ideas for testing whether a given AI model actually grasps what's happening in first-person video. The Facebook researchers established baseline levels of performance for each task, described in their paper, which serve as starting points. There are also some pie-in-the-sky examples of what each task might look like if successful in this video summarizing the research.
While 3,000 hours (annotated over more than 250,000 hours of researcher effort, as Grauman was careful to point out) is orders of magnitude more than what's currently available, there's still room to grow, she says. The consortium plans to expand the dataset and is actively adding partners.
If you're interested in using the data, keep an eye on the Facebook AI Research blog and perhaps get in touch with one of the many people listed in the paper. It will be released in the coming months, once the consortium figures out exactly how best to do so.
Facebook researchers collect thousands of hours of first-person video to train AI – TechCrunch