no code implementations • 12 Mar 2024 • Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, ZiYi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Serena Yeung-Levy, Curtis P. Langlotz, Sheng Wang, Hoifung Poon
Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications.
1 code implementation • CVPR 2024 • Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
Ranked #1 on
Visual Grounding
on RefCOCO+ val
(using extra training data)
9 code implementations • 19 Apr 2022 • Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, Jianfeng Gao
In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks.
Ranked #1 on
Object Detection
on ELEVATER
no code implementations • 30 Nov 2021 • Xiaotian Han, Quanzeng You, Chunyu Wang, Zhizheng Zhang, Peng Chu, Houdong Hu, Jiang Wang, Zicheng Liu
This dataset provides a more reliable benchmark of multi-camera, multi-object tracking systems in cluttered and crowded environments.
Ranked #2 on
Object Tracking
on MMPTRACK
2 code implementations • 22 Nov 2021 • Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.
1 code implementation • 27 Jul 2021 • Xiaotian Han, Jianwei Yang, Houdong Hu, Lei Zhang, Jianfeng Gao, Pengchuan Zhang
There is a surge of interest in image scene graph generation (object, attribute and relationship detection) due to the need of building fine-grained image understanding models that go beyond object detection.
4 code implementations • ECCV 2020 • Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiao-Wei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.
Ranked #1 on
Image Retrieval
on MS COCO
(Recall@10 metric)
no code implementations • 28 Oct 2019 • Kyle Xiao, Houdong Hu, Yan Wang
Query reformulation is the process by which a input search query is refined by the user to match documents outside the original top-n results.
3 code implementations • 24 Sep 2019 • Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
The model is unified in that (1) it can be fine-tuned for either vision-language generation (e. g., image captioning) or understanding (e. g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.
Ranked #1 on
Image Captioning
on Flickr30k Captions test
no code implementations • 22 Sep 2019 • Kuang-Huei Lee, Hamid Palangi, Xi Chen, Houdong Hu, Jianfeng Gao
In this work, we tackle two fundamental language-and-vision tasks: image-text matching and image captioning, and demonstrate that neural scene graph generators can learn effective visual relation features to facilitate grounding language to visual relations and subsequently improve the two end applications.
no code implementations • 12 Apr 2018 • Ning Ma, Alexey Volkov, Aleksandr Livshits, Pawel Pietrusinski, Houdong Hu, Mark Bolin
We propose a new framework to rank image attractiveness using a novel pairwise deep network trained with a large set of side-by-side multi-labeled image pairs from a web image index.
6 code implementations • ECCV 2018 • Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He
Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture limited number of semantic alignments which is less interpretable.
Ranked #4 on
Image Retrieval
on PhotoChat
no code implementations • 14 Feb 2018 • Houdong Hu, Yan Wang, Linjun Yang, Pavel Komlev, Li Huang, Xi Chen, Jiapei Huang, Ye Wu, Meenaz Merchant, Arun Sacheti
In this paper, we introduce a web-scale general visual search system deployed in Microsoft Bing.