TL;DR: 3DGS + no poses during training/inference; shared feature-extraction backbone; simultaneous prediction of 3D Gaussian primitives + camera poses in a canonical space from unposed images (one feed-forward step).
📖TL;DR: Any-to-Bokeh is a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects.
TL;DR: Efficient streamable representations for free-viewpoint videos with dynamic Gaussians. Reduces model size to just 0.7 MB per frame while training in < 5 s and rendering at 350 FPS.
TL;DR: Data-driven transformer operating in a feed-forward manner; dense reconstruction in dynamic environments with 3D Gaussians and velocities; self-supervised scene flow.
TL;DR: feed-forward model that reconstructs and tracks dynamic video content; DUSt3R-like pointmaps for a pair of frames captured at different moments.
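For context: a pointmap is a per-pixel grid of 3D coordinates, one H×W×3 map per frame, expressed in a shared coordinate system. A minimal NumPy sketch of the concept follows; the model regresses pointmaps directly, so building one from depth + intrinsics here is illustration only.

```python
import numpy as np

def depth_to_pointmap(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a depth map into a DUSt3R-style pointmap.

    depth: (H, W) metric depth per pixel.
    K:     (3, 3) camera intrinsics.
    Returns an (H, W, 3) array of 3D points in the camera frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))      # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                     # unproject to camera rays
    return rays * depth[..., None]                      # scale each ray by its depth

# Two frames' pointmaps expressed in one common frame is what lets a
# feed-forward model both reconstruct and track across time.
```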
TL;DR: feed-forward model; cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes.
TL;DR: multi-view generalization of DUSt3R; processes many views in parallel: a Transformer-based architecture handles N images in a single forward pass, bypassing the need for iterative alignment.
from @alibabagroup.bsky.social's Tongyi Lab with:
Zeyinzi Jiang* Zhen Han* Chaojie Mao*† Jingfeng Zhang Yulin Pan Yu Liu
*Equal contribution, †Project lead
TL;DR: a single-step image diffusion model trained to enhance rendered novel views and remove the artifacts caused by underconstrained regions of the 3D representation.
TL;DR: SAM2.1-based; distractor-distilled (DiDi) dataset to better study the distractor problem
TL;DR: object-level 2D segmentation+relative depth; GPT-based model to analyze inter-object spatial relationships; occlusion-aware large-scale 3D generation model
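A minimal sketch of the segmentation + relative-depth step as I read it: rank segmented objects by depth so inter-object spatial relations and occlusions can be reasoned about. The function name and the median-depth heuristic are my assumptions, not the paper's.

```python
import numpy as np

def depth_order(masks: list[np.ndarray], depth: np.ndarray) -> list[int]:
    """Rank segmented objects front-to-back by median depth.

    masks: list of (H, W) boolean instance masks.
    depth: (H, W) relative depth map (larger = farther).
    Returns object indices sorted nearest-first.
    """
    med = [np.median(depth[m]) for m in masks]  # one robust depth per object
    return sorted(range(len(masks)), key=lambda i: med[i])
```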
"The Art of Deception: Color Visual Illusions and Diffusion Models"
TL;DR: Diffusion models exhibit human-like perceptual shifts in brightness and color within their latent space.
"The Art of Deception: Color Visual Illusions and Diffusion Models"
TL;DR: Diffusion models exhibit human-like perceptual shifts in brightness and color within their latent space.
TL;DR: While more accurate volumetric rendering can help for low numbers of primitives, efficient optimization + large number of Gaussians allows 3DGS to outperform volumetric rendering despite its approximations
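The approximation in question is 3DGS's sorted, per-pixel alpha compositing, C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j), in place of exact volumetric integration. A minimal per-pixel sketch:

```python
import numpy as np

def composite(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """3DGS-style compositing of depth-sorted Gaussians covering one pixel.

    colors: (N, 3) RGB of the Gaussians, sorted front to back.
    alphas: (N,) opacity of each Gaussian evaluated at this pixel.
    """
    out = np.zeros(3)
    T = 1.0                       # accumulated transmittance
    for c, a in zip(colors, alphas):
        out += T * a * c          # c_i * alpha_i * prod_{j<i}(1 - alpha_j)
        T *= 1.0 - a
        if T < 1e-4:              # early termination, as in the 3DGS rasterizer
            break
    return out
```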
"Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering"
"Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering"
TL;DR: Self-calibration + cubemap-based resampling strategy to support large-FOV images
TL;DR: motion from a source video + environmental representations captured as conditional inputs. Shape-agnostic mask strategy for the character/environment relationship.
TL;DR: 1K Multiview Diffusion Transformer pre-trained on 3B human images without captions; post-trained on 2.5K studio captures with pixel-aligned control via ControlMLP; generates >5× as many views at inference.
The University of Hong Kong and ByteDance present "Goku: Flow Based Video Generative Foundation Models"
TL;DR: Unified framework for scene completion; jointly models image and camera pose estimation to reconstruct missing parts of casually captured scenes. A 1B-parameter diffusion model trained from scratch.
TL;DR: Manipulates 3D tracking videos to link frames, significantly enhancing the temporal consistency of the generated videos; 3 days of training on 8 H800 GPUs using less than 10k videos.
TL;DR: diffusion-based; raymap conditioning to augment visual features with spatial information from different viewpoints; multi-task generation of images and depth maps.
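A raymap packs each pixel's camera-ray origin and direction into a 6-channel conditioning image. A minimal sketch assuming a world-from-camera pose (R, t); the helper name is mine.

```python
import numpy as np

def make_raymap(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                H: int, W: int) -> np.ndarray:
    """Per-pixel ray origins + directions as an (H, W, 6) conditioning map.

    K: (3, 3) intrinsics; R: (3, 3) world-from-camera rotation;
    t: (3,) camera center in world coordinates.
    """
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)  # pixel centers
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)
    dirs = pix @ np.linalg.inv(K).T @ R.T                       # rays in world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)        # unit directions
    origins = np.broadcast_to(t, dirs.shape)                    # camera center per pixel
    return np.concatenate([origins, dirs], axis=-1)             # (H, W, 6)
```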
TL;DR: multi-scale temporal attention module for spatial accuracy. Noise rescheduling mechanism & latent transition approach for temporal consistency
TL;DR: 360° panoramas using diffusion-based image models. By combining cubemap representations with fine-tuning of pretrained txt2img models, CubeDiff simplifies the panorama generation process, delivering high-quality, consistent panoramas.
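A cubemap covers the sphere with six 90°-FOV perspective faces, each of which looks like a normal image to a pretrained txt2img model. A minimal sketch of resampling one face out of an equirectangular panorama; the face-orientation convention here is my assumption.

```python
import numpy as np

# rotations taking the +z "front" face to each cube face (assumed convention)
FACES = {
    "front": np.eye(3),
    "back":  np.diag([-1.0, 1.0, -1.0]),
    "right": np.array([[0, 0, 1], [0, 1, 0], [-1, 0, 0]], float),
    "left":  np.array([[0, 0, -1], [0, 1, 0], [1, 0, 0]], float),
    "up":    np.array([[1, 0, 0], [0, 0, 1], [0, -1, 0]], float),
    "down":  np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], float),
}

def face_from_equirect(pano: np.ndarray, face: str, size: int = 512) -> np.ndarray:
    """Resample one 90°-FOV cubemap face from an (H, W, 3) equirectangular pano."""
    H, W, _ = pano.shape
    s = np.linspace(-1, 1, size)
    x, y = np.meshgrid(s, -s)                        # face plane at z = 1
    d = np.stack([x, y, np.ones_like(x)], -1) @ FACES[face].T
    d /= np.linalg.norm(d, axis=-1, keepdims=True)   # unit ray directions
    lon = np.arctan2(d[..., 0], d[..., 2])           # [-pi, pi]
    lat = np.arcsin(d[..., 1])                       # [-pi/2, pi/2]
    u = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
    v = ((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return pano[v.clip(0, H - 1), u.clip(0, W - 1)]  # nearest-neighbor sample
```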
TL;DR: fully perspective projection model without applying heuristics; estimates depth, focal parameters, 3D pose, and 2D alignment.
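"Fully perspective" here means projecting with the true pinhole model x = f·X/Z + c rather than the weak-perspective shortcut x ≈ s·X + t that many pose-estimation pipelines use. A one-function sketch:

```python
import numpy as np

def project(points: np.ndarray, f: float, c: np.ndarray) -> np.ndarray:
    """Pinhole (fully perspective) projection of (N, 3) camera-frame points.

    f: focal length in pixels; c: (2,) principal point.
    Weak perspective would instead apply a single scale to all depths.
    """
    return f * points[:, :2] / points[:, 2:3] + c
```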
TL;DR: An online 3D reasoning framework for various 3D tasks from only RGB inputs