Search Results for author: Jinfa Huang

Found 22 papers, 17 papers with code

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

1 code implementation • 26 Nov 2024 • Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan

We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model.

Text-to-Video Generation · Video Generation
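The snippet above mentions leveraging frequency information for identity preservation. The paper's exact decomposition scheme is not given in this listing; what follows is only a generic sketch of splitting a frame into low- and high-frequency components with an FFT low-pass mask (all names and the `cutoff` parameter are illustrative):

```python
import numpy as np

def frequency_decompose(frame: np.ndarray, cutoff: float = 0.1):
    """Split a 2D frame into low- and high-frequency components.

    A circular low-pass mask of radius `cutoff * min(H, W)` is applied in
    the Fourier domain; the high-frequency part is the residual, so the
    two components always sum back to the original frame.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    h, w = frame.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_pass = dist <= cutoff * min(h, w)
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * low_pass)))
    high = frame - low
    return low, high

frame = np.random.rand(64, 64)
low, high = frequency_decompose(frame)
assert np.allclose(low + high, frame)  # lossless split
```

Low-frequency content carries coarse appearance (useful for coarse identity cues), while the high-frequency residual carries fine detail; a hierarchical strategy could condition different training stages on different bands.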

A Survey of Camouflaged Object Detection and Beyond

1 code implementation • 26 Aug 2024 • Fengyang Xiao, Sujie Hu, Yuqi Shen, Chengyu Fang, Jinfa Huang, Chunming He, Longxiang Tang, Ziyun Yang, Xiu Li

Camouflaged Object Detection (COD) refers to the task of identifying and segmenting objects that blend seamlessly into their surroundings, posing a significant challenge for computer vision systems.

Instance Segmentation · Object +4

MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

1 code implementation • 20 Aug 2024 • Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries.

Mamba · Natural Language Queries +2

Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection

no code implementations • 30 Jul 2024 • Jinfa Huang, Jinsheng Pan, Zhongwei Wan, Hanjia Lyu, Jiebo Luo

To this end, we propose Evolver, which incorporates LMMs via Chain-of-Evolution (CoE) Prompting, by integrating the evolution attribute and in-context information of memes.

Attribute

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

4 code implementations • 26 Jun 2024 • Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation.

Text-to-Video Generation · Video Generation

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

1 code implementation • 26 Jun 2024 • Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference, as the growth of their multimodal Key-Value (KV) cache with increasing input length challenges both memory and time efficiency.
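The cache growth described above can be made concrete with a back-of-the-envelope memory estimate; the model shape numbers below are illustrative, not taken from the paper:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes held by the K and V caches of a decoder-only transformer.

    Two tensors (K and V) per layer, each of shape
    (batch, n_kv_heads, seq_len, head_dim), stored at `bytes_per_elem`
    bytes per element (2 for fp16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A hypothetical 7B-class model (32 layers, 32 KV heads of dim 128) at a
# 32k-token context holds roughly 16.8 GB of fp16 KV cache.
cache_gb = kv_cache_bytes(32, 32, 128, seq_len=32_000) / 1e9
```

The cache scales linearly with input length, which is why long multimodal contexts quickly dominate inference memory and motivate KV-cache compression methods such as LOOK-M.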

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

3 code implementations • 7 Apr 2024 • Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo

Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.

Text-to-Video Generation · Video Generation

LLMBind: A Unified Modality-Task Integration Framework

1 code implementation • 22 Feb 2024 • Bin Zhu, Munan Ning, Peng Jin, Bin Lin, Jinfa Huang, Qi Song, Junwu Zhang, Zhenyu Tang, Mingjun Pan, Xing Zhou, Li Yuan

In the multi-modal domain, the dependence of various models on specific input formats leads to user confusion and hinders progress.

AI Agent · Audio Generation +3

Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach

1 code implementation • 28 Jan 2024 • Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, Junchi Yan

At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings.

Image Outpainting

GPT-4V(ision) as A Social Media Analysis Engine

1 code implementation • 13 Nov 2023 • Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo

Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content.

Hallucination · Hate Speech Detection +1

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

1 code implementation • 9 Nov 2023 • Hongjian Zhou, Fenglin Liu, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu, Yiru Li, Sam S. Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, David A. Clifton

Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face.

Improving Scene Graph Generation with Superpixel-Based Interaction Learning

no code implementations • 4 Aug 2023 • Jingyi Wang, Can Zhang, Jinfa Huang, Botao Ren, Zhidong Deng

(ii) We explore intra-entity and cross-entity interactions among the superpixels to enrich fine-grained interactions between entities at an earlier stage.

Graph Generation · Scene Graph Generation +1

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

4 code implementations • 20 May 2023 • Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

In this paper, we propose the Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings.

Retrieval · Video Retrieval

Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

1 code implementation • 15 May 2023 • Jingyi Wang, Jinfa Huang, Can Zhang, Zhidong Deng

In this paper, we propose a Time-variant Relation-aware TRansformer (TR$^2$), which aims to model the temporal change of relations in dynamic scene graphs.

Relation · Scene Graph Generation +1

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

4 code implementations • CVPR 2023 • Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction upon pre-defined video-text pairs.

Contrastive Learning · Question Answering +5
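The CLIP-style contrastive objective referred to above projects paired video and text embeddings into a shared space and pulls matched pairs together. A minimal NumPy sketch of the symmetric InfoNCE loss (the generic baseline, not the paper's Banzhaf-interaction method; all names are illustrative):

```python
import numpy as np

def info_nce(video_emb: np.ndarray, text_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B): matched pairs on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return float(-np.log(probs[labels, labels]).mean())

    # average the video->text and text->video retrieval directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With well-aligned pairs the diagonal dominates the similarity matrix and the loss approaches zero; shuffling the pairing drives it up, which is exactly the signal retrieval models train on.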

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

4 code implementations • 21 Nov 2022 • Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.

Ranked #2 on Video Retrieval on LSMDC (text-to-video Mean Rank metric)

Contrastive Learning · Representation Learning +5

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

no code implementations • 21 Sep 2022 • Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

Under this setting, these 2D spatial reasoning approaches cannot distinguish the fine-grain spatial relations between visual objects and scene texts on the same image plane, thereby impairing the interpretability and performance of TextVQA models.

Image Captioning · Optical Character Recognition (OCR) +2

Guoym at SemEval-2020 Task 8: Ensemble-based Classification of Visuo-Lingual Metaphor in Memes

no code implementations • SEMEVAL 2020 • YingMei Guo, Jinfa Huang, Yanlong Dong, Mingxing Xu

In our system, we utilize five types of representation of data as input of base classifiers to extract information from different aspects.
