We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors.
We present Text2Tex, a novel method for generating high-quality textures for 3D meshes from given text prompts.
Performing 3D dense captioning and visual grounding requires a shared understanding of the underlying multimodal relationships.
In federated learning, all networked clients cooperatively contribute to model training.
Our D3Net unifies 3D dense captioning and visual grounding in a self-critical manner.