Moving Camera, Moving People: A Deep Learning Approach to Depth Prediction (googleblog.com)
213 points by skybrian on May 23, 2019 | 41 comments


The best part of this paper is that they used mannequin challenge videos as their training dataset. That's super clever.


I have seen people suggest that the "10 year challenge" was created to build an age-related training dataset. While the mannequin challenge was probably just spontaneous, I wonder if we will see an increasing number of viral challenges in the future that center around the creation of structured information.


You can tell that the authors have a very fast internet connection by the fact that this website weighs in at 91.6 MB and takes over a minute to fully load on a 25 Mbit/s connection.


Jesus. Why couldn't they use embedded video files instead of 30 megabyte gifs?


A state-of-the-art deep learning neural net designed by digital video experts within one of the most technology-savvy companies in the world...

What do they use to reveal it to the world? GIFs!


This is an image processing project. It's tradition to demo it to the world exclusively using grainy low resolution images. In this case it needs animation too, so gifs are the obvious choice!


From the guidelines:

Be kind. Don't be snarky. Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.


Always worth looking at a point cloud versus a disparity map.

Grayscale disparity/depth maps are somewhat misleading - the large regions of constant intensity suggest that the algorithm is good at segmenting areas of constant depth. However, the flickering in the map suggests that if you actually tried to plot this in 3D, it'd be pretty noisy. Not to disparage the result, but 2D depth/disparity maps tend to look better than what they represent.

You can see this in the synthetic camera wiggle video; focus on the actor's hands, for example.

You can also see this effect in the Stereolabs Zed promo video.
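
If you want to check this yourself, back-projecting the depth map into a point cloud only takes a few lines. A minimal Python sketch, assuming a pinhole camera with known intrinsics (fx, fy, cx, cy are placeholders, not values from the demo):

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy):
        # depth: HxW array of metric depths; fx, fy, cx, cy: pinhole intrinsics.
        # Returns an (H*W) x 3 array of 3D points in the camera frame.
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

Dump the result to a .xyz file (np.savetxt) and open it in MeshLab or CloudCompare; the noise that a flat greyscale map hides becomes obvious.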


To visualize depth maps, it is best to look at their derivatives (e.g., a directional derivative or the Laplacian). Mapping the depths to intensities directly loses a lot of information.
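
Something along these lines, assuming OpenCV and a single-channel depth image ("depth.png" is a placeholder):

    import cv2
    import numpy as np

    depth = cv2.imread("depth.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

    # Directional derivatives (Sobel) and the Laplacian of the depth map.
    dx = cv2.Sobel(depth, cv2.CV_32F, 1, 0, ksize=3)
    dy = cv2.Sobel(depth, cv2.CV_32F, 0, 1, ksize=3)
    lap = cv2.Laplacian(depth, cv2.CV_32F, ksize=3)

    def to_u8(img):
        # Rescale absolute response to 0-255 for viewing.
        img = np.abs(img)
        return np.uint8(255 * img / (img.max() + 1e-8))

    cv2.imwrite("depth_gradient.png", to_u8(np.hypot(dx, dy)))
    cv2.imwrite("depth_laplacian.png", to_u8(lap))

Depth discontinuities and surface curvature pop out in the derivative images even when the raw intensity map looks like a flat grey blob.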


I can’t wait until techniques like this find their way into open source photogrammetry pipelines. I came up with a way of training neural nets for robotics using a monocular camera, photogrammetry, and a simulation environment with the captured 3D scene, but the photogrammetry was error-prone and computationally intensive even on a beefy cloud server.

I’d love for OpenSFM or OpenMVS (check GitHub) to gain this kind of capability.

I’d also love to see an implementation of this on GitHub, but hopefully that will follow in time.


I personally do not believe that depth generated purely from deep learning can be used as input to photogrammetry anytime soon.

Photogrammetry works exceedingly well because the depth maps it generates are quite precise and accurate, and mesh reconstruction usually assumes that these points are quite close to ground truth.

Deep learning approaches usually have medium accuracy but low precision, which causes the flickering and smooth surfaces that you see on the person. Even the background has flickering despite being computed through stereo, likely because the camera motion is primarily forward-backward (vs. more accurate side-to-side motion), the baseline is likely small, and the depth isn't globally optimized.

This type of research is super great for applications requiring lower accuracy, typically visual-only applications (e.g. selective blurring, faking stereo on a frame, etc.). But as an input to photogrammetry — probably not anytime soon, until the problems above get resolved.


Interesting. Perhaps my idea of this being inserted into existing algorithms would not work.

However, I do ultimately seek a low-accuracy, “visually approximate” 3D scene that I could use for simulation purposes. I guess I could rephrase my desire as: I’d love to see this kind of approach used to train an end-to-end deep learning photogrammetry system. I feel like the parallel nature of neural nets, as well as their ability to approximate results, could lead to a much less computationally intensive solution to my photogrammetry needs.

(I want to train my four wheel drive robot to follow forest trails using the training method described in the “world models” research paper, which requires a simulation to work.)


Some of my friends recently put out http://gibsonenv.stanford.edu/

A full simulation with realistic 3D spaces; it enables embodied agents to interact with and learn from real-world spaces. Not forest trails, but a real-world environment.

If you really want to create a 3D model of forest trails, photogrammetry should be sufficient, because forest scenes are richly-textured.


Yes, I did come across Gibsonenv, and it looks great for indoor scenes.

As far as photogrammetry of forest trails goes, I found it to be very computationally intensive (a 32-core GCE instance took 30+ hours and 90+ GB of RAM to compute a scene, and even then with errors that made it unusable). It felt very heavy-handed, and given all the great work I've seen in scene understanding using neural nets, it seems like deep learning would be a promising approach here. Maybe there is commercial photogrammetry software with better pipelines, but I want to be able to compute my scenes on Linux and use hundreds of images.

I did my computation with OpenSFM and OpenMVS, both wonderful projects for being free and open source, and I did get a lot of good results. But I am convinced a simpler way is possible with deep learning.


OpenSFM is quite out of date, so it's inefficient and rather inaccurate (e.g., exhaustive matching is O(n^2), and there are smarter approaches that are closer to O(n)).
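
The usual trick is to pick candidate pairs with image retrieval instead of running local-feature matching on every pair. A rough sketch; global_descs is assumed to be one precomputed global descriptor per image (e.g. a bag-of-words or NetVLAD vector), not something OpenSFM gives you out of the box:

    import numpy as np

    def candidate_pairs(global_descs, k=10):
        # global_descs: (n, d) array of L2-normalized per-image descriptors.
        # Match each image only against its k nearest neighbors instead of
        # all n*(n-1)/2 pairs, so the expensive matching stage is ~O(n*k).
        sims = global_descs @ global_descs.T        # cosine similarities
        np.fill_diagonal(sims, -np.inf)             # skip self-matches
        nn = np.argsort(-sims, axis=1)[:, :k]       # top-k per image
        return sorted({(min(i, int(j)), max(i, int(j)))
                       for i in range(len(nn)) for j in nn[i]})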

Also, one of the main steps of mesh reconstruction is depth map generation. It typically takes anywhere from 30% to 75% of the compute time for dense reconstruction, IF it's parallelized on the GPU. If you're using only the CPU to calculate depth maps, you're probably slowing yourself down by an order of magnitude.

If you have a GPU and use a better SFM-MVS solution, then you can quite easily reconstruct datasets of 1k-10k images within 24 hours.


What would you recommend as a better SFM-MVS solution?


> I personally do not believe that depth generated purely from deep learning can be used as input to photogrammetry anytime soon.

6d.ai uses depthnets in its mobile photogrammetry pipeline. Demo: https://twitter.com/mattmiesnieks/status/1106722396889702406


How do you know that 6D AI uses deep learning to predict depth maps?

I'm very familiar with their work (they're doing a great job), but the demo video you linked appears to show a photogrammetry-based approach. You can tell because highly textured surfaces are readily mapped, but low-texture regions remain unmapped, despite high coverage by the camera.

Maybe they use learned features for things like persistent AR, but I'm quite certain that they do not use deep learning to predict depth maps ab initio.


Why wouldn't they use 3D renderings as a large part of their training data set? You could have perfect depth maps generated alongside the image input, and you could adjust things like focal length to all kinds of values so the model learns how shifting items correlate to depth across a variety of focal lengths. To be honest, I'm not even sure how they're training it with live footage; how are they even getting the depth maps from the training footage to begin with?


> we make use of an existing source of data for supervision: YouTube videos in which people imitate mannequins by freezing in a wide variety of natural poses, while a hand-held camera tours the scene. Because the entire scene is stationary (only the camera is moving), triangulation-based methods--like multi-view-stereo (MVS)--work, and we can get accurate depth maps for the entire scene including the people in it

I suspect the reason for not using 3D rendering is the desire to cope with the noise and variability of real video.


The reason is that there is no training data of the sort you describe out there.

By using MVS-based approaches, they are able to get over the data hurdle by compiling a dataset from ordinary YouTube videos, instead of creating 3D renderings that include dynamic people. Importantly, MVS is really quite accurate, and in many cases can be considered ground truth.

Being able to forgo 3D renderings to use video only is almost certainly a reason why their results are so good.


Here is the actual paper: https://arxiv.org/pdf/1904.11111.pdf


Could this technology one day become so good as to eliminate the need for lidar for self-driving cars? Or will lidar be so inexpensive by that point that there will be no need to eliminate it?


This is a great hack, but I'd love to see more detail on how they did pose initialization to approximate ground truth on depth/pose from the Mannequin set. The paper says they are using ORB-SLAM2, but AFAIK ORB still needs a height label.

Maybe it's the case that this system doesn't actually return an X,Y,Z camera pose for new inputs, but rather just per-pixel depth.


The predictions seem to have rapid flickering, which means the model is saying lots of items are moving back and forth extremely quickly. Since this seems common in video analysis (rapid changes per frame), is it that smoothing, or taking multiple frames into consideration, is too slow? Or does it cause more issues than it solves?
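
The naive fix I have in mind is just an exponential moving average over the per-frame depth maps, as in the sketch below; I assume the catch is that this ignores camera and object motion, so it lags and smears depth across anything that genuinely moves.

    import numpy as np

    def smooth_depth_sequence(depth_frames, alpha=0.8):
        # depth_frames: iterable of HxW depth maps from the per-frame model.
        # alpha: weight on the running estimate; higher = less flicker,
        # but more lag and more smearing on genuinely moving content.
        smoothed, out = None, []
        for d in depth_frames:
            d = np.asarray(d, dtype=np.float32)
            smoothed = d if smoothed is None else alpha * smoothed + (1 - alpha) * d
            out.append(smoothed)
        return out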


DeepMind seems to be conducting research in a similar direction. Any insight into how the two projects are related or different?

https://deepmind.com/blog/neural-scene-representation-and-re...


At this point I feel like I'm psychic. Every single time I see an image processing project posted on here I think to myself "I bet the only examples are tiny low resolution thumbnails" and every. single. time. I'm proven right. Whyyyyyyy?

To be fair, this particular application doesn't really need more to show its improvement over other approaches, but still.


Because it's a lot less data to crunch for the network.


I'm not sure if this is due to a different mapping into greyscale, or whether their method is completely killing far-distance detail.

Compared to "Chen et al.", which is a bit flickery in the foreground but full of stable background detail, their result is almost completely black beyond about 3 m.


Can other companies use the YouTube database for free, say for research in computer vision?


I think it's a gray area, but researchers often just do it. Better to ask for forgiveness than permission, I guess. You could never collect datasets like ImageNet if you had to obtain individual permissions.


At least some jurisdictions have research exemptions in their copyright laws, so in those places I don't need the copyright owner's permission to use data for research purposes.

I'd still prefer to use explicitly open datasets because they allow for simpler data sharing and easier reproducibility; however, in cases where that's not possible, whatever is available will do, even if I'm restricted in how I can redistribute the data.


Likely to fall under fair use, both for the research aspect and given the very low impact on factors 3 and 4 of 17 U.S.C. § 107 for videos used to train neural nets.


Sure, but having it fully tagged in a data center across from you helps.


I believe yes, if the content is under a Creative Commons license.


Seems related to the Tesla video-based depth perception work?


Tesla's approach works chiefly on video scenes with static objects, like parked cars.

They train a DepthCNN to infer depth from monocular images (with lidar or stereo for supervision) and make sure it's temporally consistent by adjusting with pixel transformations from the previous and next frames, estimated by a PoseCNN. https://arxiv.org/abs/1704.07813

The folks at Google use optical flow (only from the previous frame) to make sure their model, trained on static-object video sequences, works when the scene is dynamic, using a mask for a specific object class (humans here). They do have to make sure nothing but humans is dynamic in the scene.
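
For reference, the core supervision in the linked SfM-Learner paper is a view-synthesis loss: warp a neighboring frame into the target view with the predicted depth and relative pose, then penalize the photometric error. A rough PyTorch sketch with assumed tensor shapes (not either team's actual code):

    import torch
    import torch.nn.functional as F

    def view_synthesis_loss(target, source, depth, pose, K):
        # target, source: (B, 3, H, W) frames; depth: (B, 1, H, W) predicted
        # depth for the target view; pose: (B, 4, 4) target->source transform;
        # K: (B, 3, 3) intrinsics. Shapes and names are assumptions.
        B, _, H, W = target.shape
        dev = target.device

        # Homogeneous pixel grid of the target view.
        ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                                torch.arange(W, device=dev), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1)

        # Back-project pixels into 3D using the predicted depth.
        cam = (torch.inverse(K) @ pix) * depth.view(B, 1, -1)
        cam = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], dim=1)

        # Transform into the source view and project back to pixel coordinates.
        src = K @ (pose @ cam)[:, :3]
        src = src[:, :2] / (src[:, 2:3] + 1e-7)

        # Sample the source frame at those locations and compare photometrically.
        gx = 2 * src[:, 0] / (W - 1) - 1
        gy = 2 * src[:, 1] / (H - 1) - 1
        grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
        warped = F.grid_sample(source, grid, align_corners=True)
        return (warped - target).abs().mean()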


That was my thought too, particularly because Tesla described extracting distance and velocity for moving objects by processing video frames two at a time on the upcoming hardware.


This can improve fake bokeh on smartphones to pro-camera-level quality.
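
A toy version of the idea, assuming OpenCV, an RGB frame, and a roughly metric depth map (real pipelines do per-layer compositing with proper bokeh kernels rather than a per-pixel Gaussian):

    import cv2
    import numpy as np

    def fake_bokeh(image, depth, focus_depth, levels=5, max_kernel=31):
        # Blur each pixel more the further its depth is from the focal plane.
        # image: HxWx3 uint8; depth: HxW float; focus_depth: depth to keep sharp.
        diff = np.abs(depth.astype(np.float32) - focus_depth)
        diff /= diff.max() + 1e-8
        out = image.copy()
        for i in range(1, levels + 1):
            k = int(1 + (max_kernel - 1) * i / levels) | 1   # odd kernel size
            blurred = cv2.GaussianBlur(image, (k, k), 0)
            mask = diff > (i - 0.5) / levels
            out[mask] = blurred[mask]
        return out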


Is it as good as LIDAR?


I do not think so.



