New AI tech to bring human-like understanding of our 3D world

Read how AIML's leading research is bridging the 3D / 2D domain gap.

Humans move effortlessly around our rich and detailed three-dimensional world without much second thought. But, like most mammals, our eyes actually sense the world two-dimensionally; it's our brains that take those 2D images and interpret them into a 3D understanding of reality.

Even without the stereo visual input from our two eyes, we're experts at looking at a flat 2D image and instantly 'lifting' it back to its 3D origins; we do it every time we watch TV or look at photos on our phone.

But computers and robots have a much harder time doing this 'lifting'. It's a problem that AI researchers are hard at work to fix.
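To see why this 'lifting' is hard, consider a minimal sketch of how a camera flattens the world. It assumes an idealised pinhole camera; the function name `project` and the `focal_length` parameter are illustrative choices, not taken from the research described here. Two different 3D points that lie on the same line of sight land on exactly the same spot in the image, so a single 2D view cannot tell them apart.

```python
def project(point_3d, focal_length=1.0):
    """Pinhole projection: map a 3D point (x, y, z) to 2D image coordinates.

    Assumes an idealised camera at the origin looking along the z axis.
    """
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two different 3D points on the same ray from the camera centre:
# the second is twice as far away and twice as far off-axis.
near = (1.0, 0.5, 2.0)
far = (2.0, 1.0, 4.0)

near_2d = project(near)
far_2d = project(far)

# Both points project to the same image location, so depth is lost.
print(near_2d, far_2d)
```

Recovering the missing depth needs extra information, such as a second viewpoint or prior knowledge about the shape of the thing being observed.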

Making computers able to understand 3D space from only 2D input is considered such an important capability, with diverse applications ranging from mobile phones to driverless vehicles, that Professor Simon Lucey, director of the Australian Institute for Machine Learning (AIML), has received a $435,000 grant from the Australian Research Council to build a geometric reasoning system that can exhibit human-like performance.

human eye up close

Contrary to popular belief, humans can't actually sense the world in 3D. Our brains interpret the stereoscopic 2D input from our eyes into a 3D understanding of the world. Photo: iStock / Mark Kuiken.

"When cameras try to sense the world, like humans do, what's coming into the robot is still just 2D. It's missing that component that we have in our brains that can lift it out to 3D, that's what we're trying to give it," Lucey says.

If 3D understanding from normal cameras is so difficult, why not instead equip computer vision systems with proper 3D sensors like LiDAR, a sensing method that uses lasers? It's not that easy. Building and improving hardware technology is slow and expensive, and often out of reach for the many smaller tech startups seeking to innovate AI research commercially.

"You could take ten years and billions of dollars and it would still be very, very risky to generate... but when you're doing something in software, you can deploy it straight away, and you can continually update and make it better," Lucey explains.

AIML researchers are among the world's leaders in computer vision, a field of AI that enables computers to obtain meaningful information from digital images and video footage.

Building computer vision systems that can understand the real world typically requires vast troves of labeled training data using something called supervised machine learning. That means millions of images, each labeled 'dog', 'strawberry' or 'President Obama'; or thousands of hours of driving footage where coloured boxes are drawn to mark each pedestrian, stop sign and traffic light. If you've ever had a website ask you to 'click all the squares with bicycles' to prove you're really human, you've helped train a supervised machine learning model.

AI researchers are taking these vast collections of labeled 2D training data and working out how to apply them so that AI systems can develop a 3D geometric understanding similar to that of humans.

"How can I take 2D supervision that humans can easily provide," asks Professor Lucey, "and, using some elegant math, allow it to act as 3D supervision for modern AI systems?"
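One common way to let 2D labels supervise a 3D estimate is a reprojection error: project the guessed 3D points back into the image and measure how far they land from the human-provided 2D labels. The sketch below illustrates that general idea only. It is an assumption about the kind of math involved, not the specific formulation used by Professor Lucey's team, and the names `project` and `reprojection_error` are hypothetical.

```python
def project(point_3d, focal_length=1.0):
    """Idealised pinhole projection of a 3D point onto the 2D image."""
    x, y, z = point_3d
    return (focal_length * x / z, focal_length * y / z)


def reprojection_error(predicted_3d, labels_2d, focal_length=1.0):
    """Mean squared image-space distance between projected 3D guesses and 2D labels."""
    total = 0.0
    for point, (u_label, v_label) in zip(predicted_3d, labels_2d):
        u, v = project(point, focal_length)
        total += (u - u_label) ** 2 + (v - v_label) ** 2
    return total / len(labels_2d)


# 2D labels, such as keypoints a person clicked on in an image.
labels_2d = [(0.5, 0.25), (0.0, 0.0)]

# Two candidate 3D reconstructions of those keypoints.
good_guess = [(1.0, 0.5, 2.0), (0.0, 0.0, 3.0)]
bad_guess = [(1.0, 1.0, 2.0), (0.3, 0.0, 3.0)]

good_error = reprojection_error(good_guess, labels_2d)
bad_error = reprojection_error(bad_guess, labels_2d)
print(good_error, bad_error)
```

A learning system can then adjust its 3D guess to push this error down. A low error from a single view does not pin down depth on its own, which is why extra views or prior knowledge are still needed.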

One application of this kind of computer vision is something called 3D motion capture, where earlier advances brought us Gollum in The Lord of the Rings movies. It's still a popular technique and one that's widely used in film visual effects, video game production and even medicine and sports science. But even today it uses a number of expensive and finely calibrated cameras, and sometimes still requires people to wear special reflective dots on their body and perform in front of a greenscreen, and that's a problem.

"People want the data they're collecting to be realistic... they don't want a white background. They don't want a green screen. They would love to be out in the field, or in areas that are highly unconstrained. And the sheer cost of this limits the application of technology at the moment," says Professor Lucey. "You can only apply it to problems where companies are willing to invest millions to build these things."

camera lens seen up close with multiple light reflections

Traditional motion capture systems require as many as 40 to 60 finely calibrated cameras to accurately track a person's movement in 3D space. New developments in computer vision AI could reduce that to just two or three. Photo: iStock / Anake Seenadee

But in a 2021 project that saw Professor Lucey collaborate with researchers from Apple and Carnegie Mellon University[1], the team was able to demonstrate a new AI method for 3D motion capture that is sure to make the technology far more accessible and affordable.

"The work we've done on this paper has tried to ask the question: how few cameras could we get away with if we were willing to use AI to do this 3D lifting trick?"

The team used something called a neural prior: a mathematical way of giving an AI system an initial set of beliefs, expressed as a probability distribution, before any real data is provided.

As a result, the new method can perform 3D motion capture from normal video footage (no green screens or special reflective dots required) using only two or three uncalibrated camera views. It delivers 3D reconstruction accuracy similar to what earlier methods achieve only with as many as 40 to 60 cameras.
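To give a feel for why even a second camera view helps so much, here is a sketch of classical stereo depth estimation. This is not the team's neural prior method: it assumes two calibrated, side-by-side cameras with a known focal length and spacing, which is exactly the kind of setup the new method avoids needing. The function name and all parameter values are illustrative.

```python
def depth_from_disparity(u_left, u_right, focal_length_px, baseline_m):
    """Depth (in metres) of a point seen by two rectified, side-by-side cameras.

    u_left and u_right are the point's horizontal pixel positions in each image.
    The shift between them (the disparity) shrinks as the point moves further away.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_length_px * baseline_m / disparity


# A point that appears 20 pixels further left in the right-hand image.
depth = depth_from_disparity(
    u_left=420.0,
    u_right=400.0,
    focal_length_px=800.0,  # assumed camera focal length, in pixels
    baseline_m=0.25,        # assumed distance between the two cameras, in metres
)
print(depth)  # 10.0
```

Halving the disparity doubles the estimated depth, so small pixel errors matter more for distant points. That is one reason traditional systems rely on many carefully calibrated cameras.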

Professor Lucey highlights the importance of AI research that focuses on finding efficiencies and significant cost breakthroughs as a way of bringing technology to those who'd otherwise not have been able to afford it.

"It's democratic AI. You could be a small startup and you could use this, whereas with other methods you'd need to be very well resourced financially," he said.

The potential applications are broad and go well beyond capturing humans in motion: they range from mobile phone filters, autonomous vehicles and wildlife conservation to improved robots and even space satellites.


[1] The paper was presented at the 2021 International Conference on 3D Vision, 1 December 2021.
