no code implementations • 1 Dec 2023 • Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
Lastly, even though some studies define the problem to detect more components in order to provide as much information as other types of solutions, they ignore the fact that this definition makes the task a multi-label detection problem, because a row, a projected row header and a column header can share identical bounding boxes.
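Since a single physical region can belong to several of these classes at once, its target must be multi-hot rather than one-hot. A tiny illustration (the class list and box coordinates below are hypothetical placeholders, not the paper's label set):

```python
# Why shared bounding boxes imply multi-label detection targets.
CLASSES = ["row", "projected row header", "column header"]

# One physical box covering a projected row header: it is simultaneously a
# "row" and a "projected row header", so its target is multi-hot.
box = [12, 340, 980, 372]   # x1, y1, x2, y2
target = [1, 1, 0]          # multi-hot vector over CLASSES

# A single-label (softmax) classification head cannot represent this box;
# a per-class sigmoid head can.
```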
no code implementations • 10 Nov 2023 • Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
no code implementations • 2 Nov 2023 • Xiuli Bi, Bo Liu, Fan Yang, Bin Xiao, Weisheng Li, Gao Huang, Pamela C. Cosman
This paper approaches the generated image detection problem from a new perspective: Start from real images.
no code implementations • 1 Oct 2023 • YanJie Li, Bin Xie, Songtao Guo, Yuanyuan Yang, Bin Xiao
Lots of papers have emerged to investigate the robustness and safety of deep learning models against adversarial attacks.
1 code implementation • ICCV 2023 • Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Chen, Xinggang Wang, Hongyang Chao, Han Hu
In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
1 code implementation • 19 Sep 2023 • Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, WeiPeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering.
no code implementations • 10 Aug 2023 • YanJie Li, Mingxing Duan, Xuelong Dai, Bin Xiao
In the first stage, we extract multi-scale style embeddings with a pyramid-like network and identity embeddings with a pretrained FR model, and propose a novel Attention-guided Adaptive Instance Normalization layer (AAIN) to merge them via background-patch cross-attention maps.
1 code implementation • 24 Jul 2023 • Xuelong Dai, Kaisheng Liang, Bin Xiao
Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques.
no code implementations • 4 Jun 2023 • Xinhang Wan, Bin Xiao, Xinwang Liu, Jiyuan Liu, Weixuan Liang, En Zhu
Such an incomplete continual data problem (ICDP) in MVC is tough to solve since incomplete information with continual data increases the difficulty of extracting consistent and complementary knowledge among views.
no code implementations • 30 May 2023 • Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
Table Detection (TD) is a fundamental task to enable visually rich document understanding, which requires the model to extract information without information loss.
no code implementations • 21 May 2023 • ZiYi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities.
no code implementations • 4 May 2023 • Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
Moreover, to enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets.
1 code implementation • CVPR 2023 • Kaisheng Liang, Bin Xiao
Our method can prevent adversarial examples from using non-robust style features and help generate transferable perturbations.
no code implementations • 20 Apr 2023 • Lingyuan Meng, Ke Liang, Bin Xiao, Sihang Zhou, Yue Liu, Meng Liu, Xihong Yang, Xinwang Liu
Moreover, most of the existing methods ignore leveraging the beneficial information from aliasing relations (AR), i.e., data-rich relations with similar contextual semantics to the target data-poor relation.
no code implementations • CVPR 2023 • Ping Chen, Xingpeng Zhang, Ye Li, Ju Tao, Bin Xiao, Bing Wang, Zongjie Jiang
Inspired by transfer learning, we design the Delta Age AdaIN (DAA) operation to obtain the feature difference with respect to each age, producing the style map of each age from learned values representing the mean and standard deviation.
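As a rough reading of the AdaIN-style operation described above, the sketch below re-styles a content feature map with a learned per-age mean and standard deviation; the module and parameter names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AgeAdaIN(nn.Module):
    """Illustrative AdaIN with one learned (mean, std) pair per age bin."""
    def __init__(self, num_ages: int, channels: int):
        super().__init__()
        self.age_mean = nn.Parameter(torch.zeros(num_ages, channels))
        self.age_std = nn.Parameter(torch.ones(num_ages, channels))

    def forward(self, feat: torch.Tensor, age: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); age: (B,) integer age-bin indices
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-5
        normalized = (feat - mu) / sigma                      # strip content statistics
        target_mu = self.age_mean[age].unsqueeze(-1).unsqueeze(-1)
        target_sigma = self.age_std[age].unsqueeze(-1).unsqueeze(-1)
        return target_sigma * normalized + target_mu          # apply age "style"

# Usage: AgeAdaIN(num_ages=100, channels=64)(features, age_indices)
```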
no code implementations • 15 Feb 2023 • Wenxuan Tu, Bin Xiao, Xinwang Liu, Sihang Zhou, Zhiping Cai, Jieren Cheng
With the development of various applications, such as social networks and knowledge graphs, graph data has been ubiquitous in the real world.
1 code implementation • CVPR 2023 • Bin Xiao, Yang Hu, Bo Liu, Xiuli Bi, Weisheng Li, Xinbo Gao
Since their binarization processes are not a component of the network, learning-based binary descriptors cannot fully utilize the advances of deep learning.
1 code implementation • CVPR 2023 • Yongchao Wang, Bin Xiao, Xiuli Bi, Weisheng Li, Xinbo Gao
Inspired by the plain contrast idea, MCF introduces two different subnets and exploits the discrepancies between them to correct the cognitive bias of the model.
no code implementations • 28 Nov 2022 • Jingcan Duan, Bin Xiao, Siwei Wang, Haifang Zhou, Xinwang Liu
The average node-pair similarity can be regarded as the topology anomaly degree of nodes within substructures.
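A minimal sketch of computing the average node-pair similarity inside one substructure, as described above; cosine similarity over node embeddings is an assumption, and the paper may use a different measure.

```python
import torch
import torch.nn.functional as F

def avg_node_pair_similarity(emb: torch.Tensor) -> torch.Tensor:
    """emb: (n, d) embeddings of the n nodes in one substructure, n > 1."""
    z = F.normalize(emb, dim=1)            # unit-norm rows
    sim = z @ z.t()                        # (n, n) pairwise cosine similarities
    n = emb.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (n * (n - 1))        # mean over distinct node pairs
```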
no code implementations • 3 Nov 2022 • Bin Xiao, Yakup Akkaya, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
Table Structure Recognition (TSR) aims to represent tables with complex structures in a machine-interpretable format so that the tabular data can be processed automatically.
1 code implementation • 12 Oct 2022 • Bin Xiao, Chien-Liang Liu, Wen-Hoar Hsaio
Our proposed model uses word-embedding representations as semantic features to help train the embedding network, and a semantic cross-attention module to bridge the semantic features into the visual modality.
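A minimal sketch of a cross-attention module of the kind described above, assuming visual tokens act as queries and word-embedding (semantic) tokens as keys and values; dimensions and naming are illustrative.

```python
import torch
import torch.nn as nn

class SemanticCrossAttention(nn.Module):
    def __init__(self, vis_dim: int, sem_dim: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.q = nn.Linear(vis_dim, dim)          # queries from visual features
        self.kv = nn.Linear(sem_dim, 2 * dim)     # keys/values from word embeddings
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, vis_dim)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, vis_dim), semantic: (B, Ns, sem_dim)
        q = self.q(visual)
        k, v = self.kv(semantic).chunk(2, dim=-1)
        fused, _ = self.attn(q, k, v)
        return visual + self.out(fused)           # visual tokens enriched by semantics
```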
no code implementations • 11 Aug 2022 • Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
To transform the tabular data in electronic documents into a machine-interpretable format and provide layout and semantic information for information extraction and interpretation, we define a Table Structure Recognition (TSR) task and a Table Cell Type Classification (CTC) task.
1 code implementation • 26 Jul 2022 • Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space.
2 code implementations • 21 Jul 2022 • Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan
It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, being comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters.
Ranked #131 on Image Classification on ImageNet
no code implementations • 19 Jul 2022 • Zhenrong Shen, Xi Ouyang, Bin Xiao, Jie-Zhi Cheng, Qian Wang, Dinggang Shen
Moreover, we propose to synthesize nodule CXR images by controlling the disentangled nodule attributes for data augmentation, in order to better compensate for the nodules that are easily missed in the detection task.
no code implementations • CVPR 2023 • YanJie Li, Yiquan Li, Xuelong Dai, Songtao Guo, Bin Xiao
2D face recognition has been proven insecure for physical adversarial attacks.
no code implementations • 3 May 2022 • ZiYi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview.
no code implementations • 22 Apr 2022 • Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
Ranked #4 on Visual Question Answering (VQA) on VCR (Q-A) test
2 code implementations • CVPR 2022 • Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan
The central idea of MiniViT is to multiplex the weights of consecutive transformer blocks.
Ranked #203 on Image Classification on ImageNet (using extra training data)
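As a rough sketch of the weight-multiplexing idea in the MiniViT entry above: one transformer block's weights are reused across several consecutive layers, keeping only a cheap per-layer normalization private. MiniViT additionally applies weight transformations and distillation, which are omitted here.

```python
import torch
import torch.nn as nn

class MultiplexedEncoder(nn.Module):
    def __init__(self, dim: int = 384, heads: int = 6, depth: int = 12):
        super().__init__()
        # A single shared block instead of `depth` independent ones.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        # Lightweight per-layer parameters preserve some layer-wise diversity.
        self.per_layer_norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        for norm in self.per_layer_norm:
            tokens = self.shared_block(norm(tokens))   # same weights, reused each layer
        return tokens
```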
1 code implementation • CVPR 2022 • Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao
Particularly, it attains gains up to 9.2% and 14.5% in average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively.
3 code implementations • 7 Apr 2022 • Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan
We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention.
Ranked #12 on Image Classification on ImageNet
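A minimal sketch of the channel self-attention described above, in which attention scores form a C x C map so every channel token aggregates all spatial positions; the scaling and the absence of channel grouping are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) spatial tokens
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (B, N, C)
        attn = (q.transpose(1, 2) @ k) / (N ** 0.5)       # (B, C, C) channel-to-channel scores
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)  # back to (B, N, C)
        return self.proj(out)
```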
no code implementations • 8 Mar 2022 • Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
The Table Structure Recognition (TSR) problem aims to recognize the structure of a table and transform unstructured tables into a structured, machine-readable format so that the tabular data can be further analysed by downstream tasks, such as semantic modeling and information retrieval.
no code implementations • 15 Jan 2022 • Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
1 code implementation • NeurIPS 2021 • Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
With focal attention, we propose a new variant of Vision Transformer models, called Focal Transformers, which achieve superior performance over the state-of-the-art (SoTA) Vision Transformers on a range of public image classification and object detection benchmarks.
1 code implementation • 22 Nov 2021 • Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.
Ranked #1 on Action Recognition In Videos on Kinetics-600
1 code implementation • 17 Nov 2021 • Xuelong Dai, YanJie Li, Hua Dai, Bin Xiao
The unrestricted adversarial attack loss is incorporated in the special adversarial training of the GAN, which enables the generator to produce adversarial examples that spoof the target network.
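A minimal sketch of the kind of generator objective described above: a GAN realism term plus an attack term that pushes a frozen target classifier toward a chosen label; the non-saturating GAN form and the loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_logits_fake, target_logits, target_label, attack_weight=1.0):
    # Realism: fool the discriminator into calling generated images real.
    gan_term = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake)
    )
    # Attack: make the (frozen) target network predict the attacker's label.
    attack_term = F.cross_entropy(target_logits, target_label)
    return gan_term + attack_weight * attack_term
```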
no code implementations • 29 Sep 2021 • Haoxuan You, Luowei Zhou, Bin Xiao, Noel C Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
Large-scale multimodal contrastive pretraining has demonstrated great utility to support high performance in a range of downstream tasks by mapping multiple modalities into a shared embedding space.
3 code implementations • 1 Jul 2021 • Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks.
Ranked #15 on Instance Segmentation on COCO test-dev
1 code implementation • ICML Workshop AML 2021 • Fan Liu, Shuyu Zhao, Xuelong Dai, Bin Xiao
Although adversarial training (AT) methods such as Adversarial Query (AQ) can improve the adversarially robust performance of meta-learning models, AT remains computationally expensive.
1 code implementation • ICLR 2022 • Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning.
Ranked #11 on Self-Supervised Image Classification on ImageNet
3 code implementations • CVPR 2021 • Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, Lei Zhang
In this paper, we present a novel dynamic head framework to unify object detection heads with attentions.
Ranked #1 on Object Detection on COCO 2017 val (AP75 metric)
12 code implementations • CVPR 2021 • Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, Jingdong Wang
We introduce a lightweight unit, conditional channel weighting, to replace costly pointwise (1x1) convolutions in shuffle blocks.
Ranked #37 on Pose Estimation on COCO test-dev
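A minimal, single-resolution sketch of conditional channel weighting as a stand-in for a pointwise (1x1) convolution: per-channel weights are predicted from the input and applied elementwise, costing O(C) multiplications per pixel instead of O(C^2). Lite-HRNet computes such weights across resolutions, which is omitted here.

```python
import torch
import torch.nn as nn

class ConditionalChannelWeighting(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1) summary of the input
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.weight_net(x)   # elementwise reweighting, no dense 1x1 channel mixing
```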
2 code implementations • CVPR 2021 • Zigang Geng, Ke Sun, Bin Xiao, Zhaoxiang Zhang, Jingdong Wang
Our motivation is that regressing keypoint positions accurately needs to learn representations that focus on the keypoint regions.
14 code implementations • ICCV 2021 • Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
Ranked #2 on Image Classification on Flowers-102 (using extra training data)
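A minimal sketch of one way convolutions enter a ViT as described above, using a strided convolution to produce overlapping token embeddings; kernel and stride values are assumptions, and CvT's convolutional projections inside attention are omitted.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    def __init__(self, in_chans: int = 3, embed_dim: int = 64,
                 kernel_size: int = 7, stride: int = 4):
        super().__init__()
        # Overlapping strided convolution instead of non-overlapping patch slicing.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, img: torch.Tensor):
        feat = self.proj(img)                        # (B, D, H', W')
        B, D, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)     # (B, H'*W', D) for the transformer stage
        return self.norm(tokens), (H, W)
```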
3 code implementations • ICCV 2021 • Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. for encoding high-resolution images using two techniques.
Ranked #42 on Instance Segmentation on COCO minival
no code implementations • ICCV 2021 • Bin Xiao, Haifeng Wu, Xiuli Bi
The proposed DTMNet is an end-to-end deep neural network with only one convolutional layer and three fully connected layers.
no code implementations • ICCV 2021 • Xiuli Bi, Zhipeng Zhang, Bin Xiao
For detecting the tampered regions, a forgery localization generator GM is proposed based on a multi-decoder-single-task strategy.
no code implementations • 1 Jan 2021 • Depu Meng, Zigang Geng, Zhirong Wu, Bin Xiao, Houqiang Li, Jingdong Wang
The proposed consistent instance classification (ConIC) approach simultaneously optimizes the classification loss and an additional consistency loss explicitly penalizing the feature dissimilarity between the augmented views from the same instance.
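A minimal sketch of the two-term objective described above: an instance classification loss plus a consistency term penalizing dissimilarity between the features of two augmented views of the same instance; the cosine form of the consistency term and the loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_augmented_loss(logits_v1, logits_v2, feat_v1, feat_v2, labels, lam=1.0):
    # Classification on both augmented views.
    cls_loss = F.cross_entropy(logits_v1, labels) + F.cross_entropy(logits_v2, labels)
    # Explicitly penalize feature dissimilarity between the two views.
    consistency = 1.0 - F.cosine_similarity(feat_v1, feat_v2, dim=1).mean()
    return cls_loss + lam * consistency
```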
no code implementations • 11 Dec 2020 • Bin Xiao, Tao Geng, Xiuli Bi, Weisheng Li
In this paper, a color-related local binary pattern (cLBP), which learns the dominant patterns from the decoded LBP, is proposed for color image recognition.
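For context, a minimal NumPy sketch of the standard 8-neighbour LBP code on a grayscale image; the colour-related cLBP and its dominant-pattern learning are not reproduced here.

```python
import numpy as np

def lbp_codes(gray: np.ndarray) -> np.ndarray:
    """gray: 2D array. Returns 8-bit LBP codes for the interior pixels."""
    c = gray[1:-1, 1:-1]                               # centre pixels
    neighbors = [gray[0:-2, 0:-2], gray[0:-2, 1:-1], gray[0:-2, 2:],
                 gray[1:-1, 2:],   gray[2:, 2:],     gray[2:, 1:-1],
                 gray[2:, 0:-2],   gray[1:-1, 0:-2]]   # 8 neighbours, clockwise
    code = np.zeros(c.shape, dtype=np.uint8)
    for bit, nb in enumerate(neighbors):
        # Set the bit when the neighbour is at least as bright as the centre.
        code += (nb >= c).astype(np.uint8) * np.uint8(1 << bit)
    return code
```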
no code implementations • 3 Dec 2020 • Bo Liu, Ranglei Wu, Xiuli Bi, Bin Xiao, Weisheng Li, Guoyin Wang, Xinbo Gao
The unfixed encoder autonomously learns the image fingerprints that differentiate between the tampered and non-tampered regions, whereas the fixed encoder intentionally provides the direction information that assists the learning and detection of the network.
1 code implementation • 9 Sep 2020 • Bin Xiao, Chien-Liang Liu, Wen-Hoar Hsaio
We conclude that the success of metric-learning based approaches lies in the data embedding, the representative of each class, and the distance metric.
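A minimal sketch of the three ingredients named above: an embedding, a class representative (here the mean embedding, i.e. a prototype), and a distance metric (here squared Euclidean). This is the generic prototypical recipe, not the specific model of the paper.

```python
import torch

def prototype_classify(support: torch.Tensor, support_labels: torch.Tensor,
                       query: torch.Tensor, num_classes: int) -> torch.Tensor:
    # support: (Ns, d) embedded support samples; query: (Nq, d) embedded queries.
    # Assumes every class appears at least once in the support set.
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])        # (K, d) class representatives
    dists = torch.cdist(query, protos) ** 2                    # (Nq, K) squared Euclidean distances
    return (-dists).argmax(dim=1)                              # nearest-prototype prediction
```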
1 code implementation • 28 Jun 2020 • Ke Sun, Zigang Geng, Depu Meng, Bin Xiao, Dong Liu, Zhao-Xiang Zhang, Jingdong Wang
The typical bottom-up human pose estimation framework includes two stages, keypoint detection and grouping.
no code implementations • AAAI 2020 • Haiping Wu, Bin Xiao
In this work, we tackle the problem of estimating 3D human pose in camera space from a monocular image.
Ranked #20 on 3D Human Pose Estimation on MPI-INF-3DHP (PCK metric)
1 code implementation • CVPR 2019 Workshop • Xiuli Bi, Yang Wei, Bin Xiao, Weisheng Li
The core idea of the RRU-Net is to strengthen the learning of the CNN, inspired by the recall and consolidation mechanisms of the human brain and implemented through the propagation and feedback of residuals in the CNN.
19 code implementations • CVPR 2020 • Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, Lei Zhang
HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes.
Ranked #2 on Pose Estimation on UAV-Human
42 code implementations • 20 Aug 2019 • Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection.
Ranked #1 on Object Detection on COCO test-dev (Hardware Burden metric)
39 code implementations • 9 Apr 2019 • Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, Jingdong Wang
The proposed approach achieves superior results to existing single-model networks on COCO object detection.
Ranked #5 on Semantic Segmentation on LIP val
39 code implementations • CVPR 2019 • Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang
We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel.
Ranked #1 on Pose Estimation on BRACE
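A minimal two-branch sketch of the parallel multi-resolution idea described above: a high-resolution and a low-resolution stream are kept side by side and exchange information by downsampling and upsampling; channel counts and the number of branches are assumptions, and HRNet uses more branches and repeated fusions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    def __init__(self, c_high: int = 32, c_low: int = 64):
        super().__init__()
        self.high_to_low = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)  # downsample high stream
        self.low_to_high = nn.Conv2d(c_low, c_high, 1)                        # channel-match low stream

    def forward(self, high: torch.Tensor, low: torch.Tensor):
        # high: (B, c_high, H, W); low: (B, c_low, H/2, W/2) with even H, W
        up = F.interpolate(self.low_to_high(low), size=high.shape[-2:],
                           mode="bilinear", align_corners=False)
        new_high = high + up                       # low-resolution context into the high-res stream
        new_low = low + self.high_to_low(high)     # high-resolution detail into the low-res stream
        return new_high, new_low
```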
25 code implementations • ECCV 2018 • Bin Xiao, Haiping Wu, Yichen Wei
There has been significant progress on pose estimation and increasing interest in pose tracking in recent years.
Ranked #2 on 2D Human Pose Estimation on JHMDB (2D poses only)
2 code implementations • ECCV 2018 • Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, Yichen Wei
State-of-the-art human pose estimation methods are based on heat map representation.
Ranked #23 on Pose Estimation on MPII Human Pose
no code implementations • ICCV 2017 • Ting Zhang, Guo-Jun Qi, Bin Xiao, Jingdong Wang
The main point lies in a novel building block, a pair of two successive interleaved group convolutions: primary group convolution and secondary group convolution.
2 code implementations • 10 Jul 2017 • Ting Zhang, Guo-Jun Qi, Bin Xiao, Jingdong Wang
The main point lies in a novel building block, a pair of two successive interleaved group convolutions: primary group convolution and secondary group convolution.
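A minimal sketch of the building block described above: a primary group convolution, a channel permutation so that channels from different primary groups meet, then a secondary group convolution over complementary groups; group and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class InterleavedGroupConvBlock(nn.Module):
    def __init__(self, channels: int = 64, primary_groups: int = 8):
        super().__init__()
        assert channels % primary_groups == 0
        self.L = primary_groups                      # number of primary partitions
        self.M = channels // primary_groups          # channels per primary partition
        # Primary: spatial group convolution over L groups of M channels.
        self.primary = nn.Conv2d(channels, channels, 3, padding=1, groups=self.L)
        # Secondary: pointwise group convolution over M groups of L channels.
        self.secondary = nn.Conv2d(channels, channels, 1, groups=self.M)

    def shuffle(self, x: torch.Tensor) -> torch.Tensor:
        # Interleave channels so each secondary group sees one channel per primary group.
        B, C, H, W = x.shape
        return x.view(B, self.L, self.M, H, W).transpose(1, 2).reshape(B, C, H, W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.secondary(self.shuffle(self.primary(x)))
```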