no code implementations • NeurIPS 2023 • Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, Trevor Darrell
We propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors that can be used for downstream tasks.
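As a rough illustration of what such consolidation can look like, below is a minimal sketch that resizes hypothetical multi-scale, multi-timestep feature maps to a common resolution and mixes them with learned weights into per-pixel descriptors; the shapes and the mixing scheme are assumptions, not the paper's actual aggregation network.

```python
# Minimal sketch of consolidating multi-scale, multi-timestep feature maps into
# per-pixel descriptors. Shapes and the weighted mixing are illustrative
# assumptions, not the paper's aggregation network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregator(nn.Module):
    def __init__(self, num_maps, in_channels, out_channels, out_size):
        super().__init__()
        # One learnable scalar weight per (timestep, scale) feature map.
        self.weights = nn.Parameter(torch.ones(num_maps))
        # Shared projection to a common descriptor dimension.
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.out_size = out_size

    def forward(self, feature_maps):
        # feature_maps: list of tensors, each (B, in_channels, h_i, w_i)
        resized = [
            F.interpolate(f, size=(self.out_size, self.out_size),
                          mode="bilinear", align_corners=False)
            for f in feature_maps
        ]
        stacked = torch.stack(resized, dim=0)                    # (N, B, C, H, W)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1, 1)
        mixed = (w * stacked).sum(dim=0)                         # (B, C, H, W)
        return self.proj(mixed)                                  # per-pixel descriptors

# Toy usage: three hypothetical feature maps at different resolutions.
maps = [torch.randn(1, 64, s, s) for s in (16, 32, 64)]
agg = FeatureAggregator(num_maps=3, in_channels=64, out_channels=32, out_size=64)
descriptors = agg(maps)   # (1, 32, 64, 64)
```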
no code implementations • 1 Dec 2022 • Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
We introduce precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion.
no code implementations • 10 Dec 2021 • Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell
We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
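For intuition, here is a hedged sketch of gradient-based guidance of the kind this describes: at each denoising step the predicted noise is shifted by the gradient of a semantic score (for example, CLIP similarity to a caption or a reference image). All names, scales, and the toy usage below are placeholders, not the paper's implementation.

```python
# Hedged sketch of guidance-augmented denoising: the predicted noise is shifted
# by the gradient of a guidance score. Function names are placeholders.
import torch

def guided_denoise_step(x_t, t, eps_model, guidance_fn, sigma_t, scale=100.0):
    """One reverse-diffusion step with gradient-based semantic guidance.

    x_t         : current noisy image (B, C, H, W)
    eps_model   : callable(x, t) -> predicted noise
    guidance_fn : callable(x) -> scalar score to *increase* (e.g. CLIP similarity)
    sigma_t     : noise level at step t
    """
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        score = guidance_fn(x_in)                       # higher = more on-target
        grad = torch.autograd.grad(score.sum(), x_in)[0]
    eps = eps_model(x_t, t)
    # Shift the noise prediction along the guidance gradient.
    return eps - scale * sigma_t * grad

# Toy usage with dummy stand-ins for the diffusion model and the semantic score.
x = torch.randn(1, 3, 32, 32)
eps_model = lambda x, t: torch.zeros_like(x)
guidance_fn = lambda x: -(x ** 2).mean()
eps = guided_denoise_step(x, t=10, eps_model=eps_model,
                          guidance_fn=guidance_fn, sigma_t=0.5)
```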
1 code implementation • 27 Oct 2021 • Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen, Trevor Darrell, Yang Gao
Permutations then serve as target generation orders for training an insertion-based Transformer language model.
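A minimal sketch of how a sampled permutation can be converted into insertion-order training targets (partial sequence, token to insert, and insertion slot); the helper below is illustrative and is not the paper's training code.

```python
# Sketch: turn a permutation of token indices into insertion-based training
# targets. At each step the model should produce (token, slot), where slot is
# the insertion position within the partial sequence built so far.
def permutation_to_insertion_targets(tokens, order):
    """tokens: list of tokens; order: permutation of indices into `tokens`."""
    targets = []   # list of (partial_sequence, token_to_insert, slot)
    placed = []    # indices already inserted, kept in left-to-right order
    for idx in order:
        # Slot = number of already-placed tokens to the left of this one.
        slot = sum(1 for j in placed if j < idx)
        partial = [tokens[j] for j in placed]
        targets.append((partial, tokens[idx], slot))
        placed.insert(slot, idx)
    return targets

# Example: generating "a b c d" in the order c, a, d, b.
print(permutation_to_insertion_targets(["a", "b", "c", "d"], [2, 0, 3, 1]))
```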
no code implementations • 12 Aug 2021 • Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk
Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively unexplored.
Ranked #26 on Image Classification on ObjectNet (using extra training data)
no code implementations • 1 Jan 2021 • Dong Huk Park, Trevor Darrell
To this end, reconstruction-based learning is often used, in which the normality of an observation is expressed by how well it can be reconstructed.
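A minimal sketch of this idea, assuming a small autoencoder: fit it on normal data, then use reconstruction error as the anomaly score at test time. The architecture, data, and the untrained toy usage below are placeholders.

```python
# Sketch of reconstruction-based normality scoring with an autoencoder:
# higher reconstruction error = less "normal". Sizes are illustrative.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, dim=784, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, x):
    # Per-sample mean squared reconstruction error.
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=-1)

# In practice the autoencoder is trained on normal data first; left untrained here.
model = TinyAutoencoder()
normal_batch = torch.randn(8, 784) * 0.1   # stand-in for in-distribution data
odd_batch = torch.randn(8, 784) * 3.0      # stand-in for outliers
print(anomaly_score(model, normal_batch), anomaly_score(model, odd_batch))
```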
1 code implementation • ICLR 2021 • Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen, Trevor Darrell, Yang Gao
One strategy to recover this information is to decode both the content and location of tokens.
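As a rough sketch of that strategy, the head below predicts both a token (content) and an insertion slot (location) from a hidden state; the backbone, vocabulary, and objective are stand-ins, not the paper's model.

```python
# Sketch of a decoding head that outputs both what token to insert (content)
# and where to insert it (location). All sizes are illustrative.
import torch
import torch.nn as nn

class ContentLocationHead(nn.Module):
    def __init__(self, hidden=256, vocab=1000, max_slots=64):
        super().__init__()
        self.token_head = nn.Linear(hidden, vocab)      # which token to insert
        self.slot_head = nn.Linear(hidden, max_slots)   # where to insert it

    def forward(self, h):
        # h: (B, hidden) summary of the partial sequence
        return self.token_head(h), self.slot_head(h)

head = ContentLocationHead()
h = torch.randn(2, 256)                     # stand-in encoder state
token_logits, slot_logits = head(h)
next_token = token_logits.argmax(dim=-1)    # content
insert_at = slot_logits.argmax(dim=-1)      # location
```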
no code implementations • 17 Dec 2020 • Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk
The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that transformer-based architectures can achieve results competitive with convolutional networks on benchmark classification tasks.
no code implementations • 5 Aug 2019 • Andrew Zhai, Hao-Yu Wu, Eric Tzeng, Dong Huk Park, Charles Rosenberg
The solution we present not only allows us to train for multiple application objectives in a single deep neural network architecture, but also takes advantage of correlated information across the combined training data from each application, producing a unified embedding that outperforms all of the specialized embeddings previously deployed for each product.
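A hedged sketch of the general pattern: a shared trunk produces one unified embedding, each application objective contributes its own lightweight head, and the per-task losses are summed. Sizes and heads below are illustrative assumptions, not the deployed system.

```python
# Sketch of training one shared embedding against several application objectives.
import torch
import torch.nn as nn

class UnifiedEmbeddingModel(nn.Module):
    def __init__(self, in_dim=512, embed_dim=128, task_num_classes=(10, 20, 5)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        # One head per application objective (placeholders for e.g. search, ads, recs).
        self.heads = nn.ModuleList([nn.Linear(embed_dim, n) for n in task_num_classes])

    def forward(self, x):
        emb = nn.functional.normalize(self.trunk(x), dim=-1)   # the unified embedding
        return emb, [head(emb) for head in self.heads]

model = UnifiedEmbeddingModel()
x = torch.randn(4, 512)
labels = [torch.randint(0, n, (4,)) for n in (10, 20, 5)]
emb, logits = model(x)
loss = sum(nn.functional.cross_entropy(l, y) for l, y in zip(logits, labels))
loss.backward()
```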
1 code implementation • ICCV 2019 • Dong Huk Park, Trevor Darrell, Anna Rohrbach
We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning.
1 code implementation • CVPR 2018 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths.
no code implementations • 17 Nov 2017 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
We also introduce a multimodal methodology for generating visual and textual explanations simultaneously.
no code implementations • 14 Dec 2016 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions.
10 code implementations • EMNLP 2016 • Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
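For reference, the baselines listed above are shown alongside a hedged sketch of compact bilinear pooling, which approximates the outer product of the visual and textual features via count sketch and FFT; feature sizes and the sketch dimension are illustrative assumptions.

```python
# Baseline multimodal pooling operators plus a sketch of compact bilinear pooling.
import torch

def count_sketch(x, h, s, d):
    # Scatter x (B, n) into d buckets using fixed hash indices h and signs s.
    out = torch.zeros(x.size(0), d, dtype=x.dtype)
    out.index_add_(1, h, x * s)
    return out

def compact_bilinear_pool(v, t, d=2048):
    g = torch.Generator().manual_seed(0)   # fixed hashes, shared across calls
    hv = torch.randint(0, d, (v.size(1),), generator=g)
    ht = torch.randint(0, d, (t.size(1),), generator=g)
    sv = (torch.randint(0, 2, (v.size(1),), generator=g) * 2 - 1).to(v.dtype)
    st = (torch.randint(0, 2, (t.size(1),), generator=g) * 2 - 1).to(t.dtype)
    # Circular convolution of the sketches = element-wise product in Fourier space.
    fv = torch.fft.rfft(count_sketch(v, hv, sv, d), n=d)
    ft = torch.fft.rfft(count_sketch(t, ht, st, d), n=d)
    return torch.fft.irfft(fv * ft, n=d)

visual = torch.randn(4, 2048)    # stand-in image features
textual = torch.randn(4, 2048)   # stand-in text features, projected to same size

product = visual * textual                            # element-wise product
summed = visual + textual                             # element-wise sum
concat = torch.cat([visual, textual], dim=-1)         # concatenation
bilinear = compact_bilinear_pool(visual, textual)     # approximate outer product
```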