no code implementations • 6 Jul 2024 • Haonan Xu, Dian Chao, Xiangyu Wu, Zhonghua Wan, Yang Yang
Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition.
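A minimal sketch of the CLIP-alignment idea behind this line of work (not the paper's actual method): score each textual label against an image embedding by cosine similarity and predict labels above a threshold. The embeddings and the 0.5 threshold here are illustrative stand-ins.

```python
import numpy as np

def zero_shot_multilabel_scores(image_emb, label_embs):
    """Score each textual label against an image embedding by cosine
    similarity, CLIP-style; embeddings are assumed to come from CLIP's
    image and text encoders."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return txt @ img  # one cosine score per label

# toy example with hypothetical 4-d embeddings
image = np.array([1.0, 0.0, 0.5, 0.0])
labels = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "a photo of a dog"
                   [0.0, 1.0, 0.0, 0.0]])  # e.g. "a photo of a cat"
scores = zero_shot_multilabel_scores(image, labels)
predicted = scores > 0.5  # labels above a threshold count as present
```

In the multi-label setting each label is scored independently, so any number of labels can fire for one image.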
no code implementations • 5 Jul 2024 • Xiangyu Wu, Zhouyang Chi, Yang Yang, Jianfeng Lu
We designed a three-stage solution for this task.

no code implementations • 4 Jul 2024 • Xiangyu Wu, Jinling Xu, Longfei Huang, Yang Yang
This report introduces a solution to the task of RGB-TIR object detection from the perspective of unmanned aerial vehicles.
no code implementations • 1 Jul 2024 • Yurui Huang, Yang Yang, Shou Chen, Xiangyu Wu, QingGuo Chen, Jianfeng Lu
In this paper, we propose a solution for improving the quality of temporal sound localization.
no code implementations • 1 Jul 2024 • Xiangyu Wu, Hailiang Zhang, Yang Yang, Jianfeng Lu
The retrieval augmentation constructs a mini-knowledge base that enriches the model's input, while the similarity bucket further perceives the noise within the mini-knowledge base, guiding the model to generate higher-quality diagnostic reports based on the similarity prompts.
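One way to picture the similarity-bucket idea (a hedged sketch, not the paper's code): map each retrieved report's similarity score to a discrete bucket token that is prepended to the prompt, so the generator can weigh noisy retrievals. The bucket boundaries and token names below are illustrative.

```python
import numpy as np

def bucket_prompt(similarities, boundaries=(0.3, 0.6, 0.8)):
    """Map each retrieved report's similarity score to a discrete
    bucket token; the token is prepended to the generation prompt so
    the model can discount low-similarity (noisier) retrievals.
    Boundaries here are illustrative, not from the paper."""
    buckets = np.digitize(similarities, boundaries)
    return [f"<sim_{b}>" for b in buckets]

# three retrieved reports with ascending similarity to the query image
sims = np.array([0.25, 0.65, 0.9])
tokens = bucket_prompt(sims)  # one bucket token per retrieved report
```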
1 code implementation • 11 May 2024 • Xiangyu Wu, Qing-Yuan Jiang, Yang Yang, Yi-Feng Wu, Qing-Guo Chen, Jianfeng Lu
Then, a co-learning strategy with a dual-adapter module is designed to transfer visual knowledge from pseudo-visual prompt to text prompt, enhancing their visual representation abilities.
no code implementations • 19 Apr 2024 • Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li
Subsequently, we propose a caption-level strategy for the high-quality caption data generated by the image captioning models and integrate it, together with the retrieval augmentation strategy, into the template, compelling the model to generate higher-quality, better-matching, and semantically richer captions based on the retrieval augmentation prompts.
no code implementations • 26 Mar 2024 • Dian Chao, Xin Song, Shupeng Zhong, Boyuan Wang, Xiangyu Wu, Chen Zhu, Yang Yang
In this paper, we propose a solution for improving the quality of captions generated for figures in papers.
no code implementations • 10 Oct 2023 • Xiangyu Wu, Yang Yang, Shengdong Xu, Yifeng Wu, QingGuo Chen, Jianfeng Lu
At the data level, inspired by the challenge paper, we categorized all questions into eight types and utilized the llama-2-chat model to directly generate the type for each question in a zero-shot manner.
no code implementations • 10 Oct 2023 • Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu
In this paper, we present our solution to the New frontiers for Zero-shot Image Captioning Challenge.
no code implementations • 5 Sep 2023 • TaeHoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee, Kyounghoon Bae, Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim, Wooyoung Kang, Won Young Jhoo, Byungseok Roh, Jonghwan Mun, Solgil Oh, Kenan Emir Ak, Gwang-Gook Lee, Yan Xu, Mingwei Shen, Kyomin Hwang, Wonsik Shin, Kamin Lee, Wonhark Park, Dongkwan Lee, Nojun Kwak, Yujin Wang, Yimu Wang, Tiancheng Gu, Xingchang Lv, Mingmao Sun
In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge.
no code implementations • 26 Jun 2023 • Jiaxin Deng, Dong Shen, Shiyao Wang, Xiangyu Wu, Fan Yang, Guorui Zhou, Gaofeng Meng
However, most previous works treat the live stream as a whole item and explore the Click-Through-Rate (CTR) prediction framework at the item level, neglecting the dynamic changes that occur even within the same live room.
no code implementations • 15 Apr 2023 • Yang Yang, Zhongtian Fu, Xiangyu Wu, Wenjie Li
To address this challenge, in this paper, we experimentally observe that vision-language divergence may give rise to strong and weak modalities, and that hard cross-modal consistency cannot prevent the relationships among strong-modal instances from being perturbed by the weak modality, even when consistent representations are learned. To this end, we propose a novel and directly Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the desynchrony between the cross-modal alignment and single-modal cluster-preserving tasks.
no code implementations • 14 Mar 2023 • Xing Cheng, Xiangyu Wu, Dong Shen, Hezheng Lin, Fan Yang
Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
no code implementations • 19 Nov 2022 • Jiaxin Deng, Dong Shen, Haojie Pan, Xiangyu Wu, Ximan Liu, Gaofeng Meng, Fan Yang, Size Li, Ruiji Fu, Zhongyuan Wang
Furthermore, based on this dataset, we propose an end-to-end model that jointly optimizes the video understanding objective with knowledge graph embedding, which can not only better inject factual knowledge into video understanding but also generate effective multi-modal entity embedding for KG.
no code implementations • 19 Sep 2022 • Dingqi Zhang, Antonio Loquercio, Xiangyu Wu, Ashish Kumar, Jitendra Malik, Mark W. Mueller
This paper proposes an adaptive near-hover position controller for quadcopters, which can be deployed to quadcopters of very different mass, size and motor constants, and also shows rapid adaptation to unknown disturbances during runtime.
2 code implementations • 9 Sep 2021 • Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen
In this paper, we propose a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two types of heterogeneity.
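A hedged sketch of the Dual Softmax Loss idea (simplified, not the paper's implementation): each retrieval direction's similarity matrix is reweighted by the softmax of the opposite direction, which acts as a prior before the standard InfoNCE-style cross-entropy on the matched diagonal. The temperature and toy similarities below are illustrative.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax_loss(sim, temp=1.0):
    """Simplified Dual Softmax Loss sketch: reweight each direction's
    similarities by the opposite direction's softmax (a 'prior'), then
    apply the usual cross-entropy over the matched diagonal pairs."""
    n = sim.shape[0]
    diag = np.arange(n)
    v2t = sim * softmax(temp * sim, axis=0)  # reweight by text->video prior
    t2v = sim * softmax(temp * sim, axis=1)  # reweight by video->text prior
    loss_v2t = -np.log(softmax(temp * v2t, axis=1)[diag, diag])
    loss_t2v = -np.log(softmax(temp * t2v, axis=0)[diag, diag])
    return (loss_v2t.mean() + loss_t2v.mean()) / 2

# a toy similarity matrix where the diagonal (matched) pairs score highest
sim = np.array([[0.9, 0.1],
                [0.2, 0.8]])
loss = dual_softmax_loss(sim)
```

The reweighting suppresses items that are a plausible match for many queries at once, which is the revised-similarity intuition behind DSL.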
Ranked #9 on Video Retrieval on MSVD (using extra training data)
no code implementations • 7 Aug 2021 • Shuxiao Chen, Xiangyu Wu, Mark W. Mueller, Koushil Sreenath
The capabilities of autonomous flight with unmanned aerial vehicles (UAVs) have significantly increased in recent times.
1 code implementation • 11 Jun 2021 • Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Nian Shi, Honglin Liu
The task of multi-label image classification is to recognize all the object labels present in an image.
Ranked #12 on Multi-Label Classification on MS-COCO
1 code implementation • 10 Jun 2021 • Hezheng Lin, Xing Cheng, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Qing Song, Wei Yuan
In this paper, we propose a new attention mechanism in Transformer, termed Cross Attention, which alternates attention within each image patch, instead of over the whole image, to capture local information, with attention between image patches, which are divided from single-channel feature maps, to capture global information.
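The alternating scheme can be sketched as follows (a minimal illustration, not the paper's architecture; the mean-pooled per-patch summary is a simplifying stand-in for the single-channel feature map tokens):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_attention_block(x):
    """Illustrative alternation: x has shape
    (num_patches, tokens_per_patch, dim).
    1) Inner-patch attention: attend only among the tokens of one
       patch, capturing local information.
    2) Cross-patch attention: attend among one summary token per patch
       (here a mean over the patch's tokens), capturing global
       information."""
    # local: attention restricted to tokens inside each patch
    local = np.stack([attention(p, p, p) for p in x])
    # global: one token per patch attends across patches
    summaries = local.mean(axis=1)
    global_out = attention(summaries, summaries, summaries)
    # broadcast the global context back to every token in its patch
    return local + global_out[:, None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 9, 8))  # 4 patches, 9 tokens each, dim 8
out = cross_attention_block(x)
```

Restricting attention to within-patch tokens keeps the quadratic cost local, while the per-patch pass restores global receptive field at a much smaller sequence length.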