Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

1 code implementation • 8 Jul 2024 • Bangbang Zhou, Yadong Qu, Zixiao Wang, Zicheng Li, Boqiang Zhang, Hongtao Xie

To address the above issues, we propose a novel method that enriches the character features to enhance the discriminability of characters.

Scene Text Recognition

Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation

no code implementations • 21 Jun 2024 • Yu Bai, Yukai Miao, Li Chen, Dan Li, Yanyu Ren, Hongtao Xie, Ce Yang, Xuhui Cai

We argue that the alignment issue between LLMs and external knowledge ranking methods is tied to the model-centric paradigm dominant in RAG systems.

Information Retrieval RAG +1

Hallucination Mitigation Prompts Long-term Video Understanding

no code implementations • 17 Jun 2024 • Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie

Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce severe hallucination issues.

Answer Generation Hallucination +4

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

1 code implementation • CVPR 2024 • Yuhao Sun, Lingyun Yu, Hongtao Xie, Jiaming Li, Yongdong Zhang

In this paper, we propose a novel face protection approach, dubbed DiffAM, which leverages the powerful generative ability of diffusion models to generate high-quality protected face images with adversarial makeup transferred from reference images.

Adversarial Attack Face Recognition

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

1 code implementation • CVPR 2024 • Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang

The Q-Formers are trained using paired images rather than the identical target, in which the reference image and the ground-truth image share the same style or semantics.


OTE: Exploring Accurate Scene Text Recognition Using One Token

no code implementations • CVPR 2024 • Jianjun Xu, Yuxin Wang, Hongtao Xie, Yongdong Zhang

Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to represent scene text images, we prove that just one token is enough to characterize the entire text image and achieve accurate text recognition.

Decoder Scene Text Recognition +1

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

1 code implementation • 19 Dec 2023 • Zhihang Liu, Jun Li, Hongtao Xie, Pandeng Li, Jiannan Ge, Sun-Ao Liu, Guoqing Jin

In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels.

cross-modal alignment Moment Retrieval +2

CARIS: Context-Augmented Referring Image Segmentation

1 code implementation • ACM MM 2023 • Sun-Ao Liu, Yiheng Zhang, Zhaofan Qiu, Hongtao Xie, Yongdong Zhang, Ting Yao

Technically, CARIS develops a context-aware mask decoder with sequential bidirectional cross-modal attention to integrate the linguistic features with visual context, which are then aligned with pixel-wise visual features.

Decoder Image Segmentation +2

Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval

no code implementations • 12 Oct 2023 • Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, Yongdong Zhang

Hence, we address this problem by decomposing video information into reconstruction-dependent and semantic-dependent information, which disentangles the semantic extraction from reconstruction constraint.

Retrieval Semantic Retrieval +3

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

1 code implementation • 8 Oct 2023 • Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, Yongdong Zhang

In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP.

Optical Character Recognition (OCR) Scene Text Recognition

Balanced Classification: A Unified Framework for Long-Tailed Object Detection

1 code implementation • 4 Aug 2023 • Tianhao Qi, Hongtao Xie, Pandeng Li, Jiannan Ge, Yongdong Zhang

In this paper, we contend that the learning bias originates from two factors: 1) the unequal competition arising from the imbalanced distribution of foreground categories, and 2) the lack of sample diversity in tail categories.

Hallucination Long-tailed Object Detection +1

MomentDiff: Generative Video Moment Retrieval from Random to Real

1 code implementation • NeurIPS 2023 • Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, Yongdong Zhang

Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description.

Moment Retrieval Retrieval

TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition

1 code implementation • 9 May 2023 • Tianlun Zheng, Zhineng Chen, Jinfeng Bai, Hongtao Xie, Yu-Gang Jiang

In this work, we introduce TPS++, an attention-enhanced TPS transformation that incorporates the attention mechanism into text rectification for the first time.

Optical Character Recognition (OCR) Scene Text Recognition

Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

1 code implementation • 9 May 2023 • Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Yongdong Zhang

Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task.

Scene Text Recognition

Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation

1 code implementation • CVPR 2023 • Sun-Ao Liu, Yiheng Zhang, Zhaofan Qiu, Hongtao Xie, Yongdong Zhang, Ting Yao

POP builds a set of orthogonal prototypes, each of which represents a semantic class, and makes the prediction for each class separately based on the features projected onto its prototype.

Generalized Few-Shot Semantic Segmentation

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval

1 code implementation • ICCV 2023 • Pandeng Li, Chen-Wei Xie, Liming Zhao, Hongtao Xie, Jiannan Ge, Yun Zheng, Deli Zhao, Yongdong Zhang

In the event-sentence prototype matching phase, we design a temporal prototype generation mechanism to associate intra-frame objects and interact inter-frame temporal relations.

Diversity Object +3

Exploring Stroke-Level Modifications for Scene Text Editing

1 code implementation • 5 Dec 2022 • Yadong Qu, Qingfeng Tan, Hongtao Xie, Jianjun Xu, Yuxin Wang, Yongdong Zhang

Moreover, two new datasets (Tamper-Syn2k and Tamper-Scene) are proposed to fill the gap in public evaluation datasets.

Attribute Scene Text Editing

ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting

1 code implementation • 19 Nov 2022 • Shancheng Fang, Zhendong Mao, Hongtao Xie, Yuxin Wang, Chenggang Yan, Yongdong Zhang

In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input.

Blocking Language Modelling +2

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

1 code implementation • 12 Oct 2022 • Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang

On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand-new "head token" design in the multi-head self-attention module to help re-calibrate channel representations and make the representations of different channel groups interact with each other.

Inductive Bias

Geometry Aligned Variational Transformer for Image-conditioned Layout Generation

no code implementations • 2 Sep 2022 • Yunning Cao, Ye Ma, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, Yuning Jiang

First, a self-attention mechanism is adopted to model the contextual relationship within layout elements, while a cross-attention mechanism is used to fuse the visual information of conditional images.

Layout Design Object Localization

REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer

no code implementations • 1 Sep 2022 • Quanwei Yang, Xinchen Liu, Wu Liu, Hongtao Xie, Xiaoyan Gu, Lingyun Yu, Yongdong Zhang

Given an image of a source person, Human Video Motion Transfer (HVMT) aims to generate a video of that person imitating the motion of the driving person.

Partial Class Activation Attention for Semantic Segmentation

1 code implementation • CVPR 2022 • Sun-Ao Liu, Hongtao Xie, Hai Xu, Yongdong Zhang, Qi Tian

Current attention-based methods for semantic segmentation mainly model pixel relation through pairwise affinity and coarse segmentation.

Relation Segmentation +1

CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

2 code implementations • 22 Nov 2021 • Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, Yu-Gang Jiang

In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding.

Position Scene Text Recognition

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

4 code implementations • ICCV 2021 • Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang

This operation guides the vision model to use not only the visual texture of characters, but also the linguistic information in the visual context for recognition when the visual cues are confused (e.g., occlusion, noise).

Language Modelling Scene Text Recognition

Look Back Again: Dual Parallel Attention Network for Accurate and Robust Scene Text Recognition

2 code implementations • ICMR 2021 • Zilong Fu, Guoqing Jin, Hongtao Xie, Junbo Guo

To tackle this issue, in this paper, we propose a dual parallel attention network (DPAN), in which a newly designed parallel context attention module (PCAM) is cascaded with the original PPAM, using linguistic contextual information to compensate for the information inconsistency between queries and keys.

Language Modelling Position +1

PERT: A Progressively Region-based Network for Scene Text Removal

1 code implementation • 24 Jun 2021 • Yuxin Wang, Hongtao Xie, Shancheng Fang, Yadong Qu, Yongdong Zhang

However, there exist two problems: 1) the implicit erasure guidance causes excessive erasure of non-text areas; 2) the one-stage erasure lacks exhaustive removal of text regions.

Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval

no code implementations • 29 Mar 2021 • Rui Zhao, Kecheng Zheng, Zheng-Jun Zha, Hongtao Xie, Jiebo Luo

The cross-modal memory module is employed to record the instance embeddings of all the datasets for global negative mining.

Text Retrieval Video-Text Retrieval

Frequency-aware Discriminative Feature Learning Supervised by Single-Center Loss for Face Forgery Detection

no code implementations • CVPR 2021 • Jiaming Li, Hongtao Xie, Jiahong Li, Zhongyuan Wang, Yongdong Zhang

Face forgery detection is raising ever-increasing interest in computer vision since facial manipulation technologies cause serious worries.

Hierarchical Granularity Transfer Learning

no code implementations • NeurIPS 2020 • Shaobo Min, Hongtao Xie, Hantao Yao, Xuran Deng, Zheng-Jun Zha, Yongdong Zhang

In this paper, we introduce a new task, named Hierarchical Granularity Transfer Learning (HGTL), to recognize sub-level categories with basic-level annotations and semantic descriptions for hierarchical categories.

Transfer Learning

Curriculum Learning for Natural Language Understanding

no code implementations • ACL 2020 • Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, Yongdong Zhang

With the great success of pre-trained language models, the pretrain-finetune paradigm has become the undoubtedly dominant solution for natural language understanding (NLU) tasks.

Natural Language Understanding

ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection

1 code implementation • CVPR 2020 • Yuxin Wang, Hongtao Xie, Zheng-Jun Zha, Mengting Xing, Zilong Fu, Yongdong Zhang

Then a novel Local Orthogonal Texture-aware Module (LOTM) models the local texture information of proposal features in two orthogonal directions and represents text region with a set of contour points.

Region Proposal Scene Text Detection +1

Graph Structured Network for Image-Text Matching

1 code implementation • CVPR 2020 • Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang

The GSMN explicitly models object, relation and attribute as a structured phrase, which not only allows learning the correspondence of object, relation and attribute separately, but also benefits learning the fine-grained correspondence of structured phrases.

Attribute Image-text matching +3

Domain-aware Visual Bias Eliminating for Generalized Zero-Shot Learning

1 code implementation • CVPR 2020 • Shaobo Min, Hantao Yao, Hongtao Xie, Chaoqun Wang, Zheng-Jun Zha, Yongdong Zhang

Recent methods focus on learning a unified semantic-aligned visual representation to transfer knowledge between two domains, while ignoring the effect of semantic-free visual representation in alleviating the biased recognition problem.

Generalized Zero-Shot Learning

Multi-Objective Matrix Normalization for Fine-grained Visual Recognition

1 code implementation • 30 Mar 2020 • Shaobo Min, Hantao Yao, Hongtao Xie, Zheng-Jun Zha, Yongdong Zhang

In this paper, we propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation in terms of square-root, low-rank, and sparsity.

Fine-Grained Visual Recognition

ACE-Net: Biomedical Image Segmentation with Augmented Contracting and Expansive Paths

no code implementations • 23 Aug 2019 • Yanhao Zhu, Zhineng Chen, Shuai Zhao, Hongtao Xie, Wenming Guo, Yongdong Zhang

Nowadays, U-Net-like FCNs predominate in various biomedical image segmentation applications and attain promising performance, largely due to their elegant architectures, e.g., symmetric contracting and expansive paths as well as lateral skip-connections.

Image Segmentation Segmentation +1

Domain-Specific Embedding Network for Zero-Shot Recognition

1 code implementation • 12 Aug 2019 • Shaobo Min, Hantao Yao, Hongtao Xie, Zheng-Jun Zha, Yongdong Zhang

In contrast to previous methods, the DSEN decomposes the domain-shared projection function into one domain-invariant and two domain-specific sub-functions to explore the similarities and differences between two domains.

Decoder Zero-Shot Learning

CA3Net: Contextual-Attentional Attribute-Appearance Network for Person Re-Identification

no code implementations • 19 Nov 2018 • Jiawei Liu, Zheng-Jun Zha, Hongtao Xie, Zhiwei Xiong, Yongdong Zhang

An appearance network is developed to learn appearance features from the full body, horizontal and vertical body parts of pedestrians with spatial dependencies among body parts.

Attribute Multi-Task Learning +1

