no code implementations • 21 Aug 2024 • Haochen Wang, Kai Hu, Haoyu Dong, Liangcai Gao
To the best of our knowledge, this problem has not been previously explored.
no code implementations • 21 Aug 2024 • Yue Hu, Kai Hu, Patrick X. Zhao, Javed Khan, Chengming Xu
Large Language Models (LLMs) have significantly advanced artificial intelligence, excelling in numerous tasks.
1 code implementation • 13 Jul 2024 • Tianjun Yao, Yongqiang Chen, Zhenhao Chen, Kai Hu, Zhiqiang Shen, Kun Zhang
To bridge this gap, we introduce a novel graph invariance learning paradigm, which induces a robust and general inductive bias.
no code implementations • 7 Jul 2024 • Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
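The two-stage design described above (an LM that maps text to speech tokens, followed by a conditional flow matching model that maps tokens to speech) can be illustrated with a minimal PyTorch sketch. The module choices, sizes, and toy training step below are assumptions for illustration, not the CosyVoice implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a two-stage text-to-speech pipeline: stage 1 predicts discrete
# speech tokens from text with a toy Transformer LM, stage 2 maps those tokens to
# acoustic frames with conditional flow matching. All sizes are illustrative.

class TextToTokenLM(nn.Module):
    def __init__(self, text_vocab=1000, speech_vocab=4096, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, speech_vocab)

    def forward(self, text_ids):
        return self.head(self.backbone(self.embed(text_ids)))  # logits over speech tokens

class FlowMatchingDecoder(nn.Module):
    """Predicts a velocity field v(x_t, t | tokens) that transports noise to mel frames."""
    def __init__(self, speech_vocab=4096, d_model=256, mel_dim=80):
        super().__init__()
        self.token_embed = nn.Embedding(speech_vocab, d_model)
        self.net = nn.Sequential(
            nn.Linear(mel_dim + d_model + 1, d_model), nn.SiLU(),
            nn.Linear(d_model, mel_dim),
        )

    def forward(self, x_t, t, tokens):
        cond = self.token_embed(tokens)                       # (B, T, d_model)
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)      # broadcast time step
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

lm, decoder = TextToTokenLM(), FlowMatchingDecoder()

# Stage 1: text -> speech-token logits (toy forward pass).
text = torch.randint(0, 1000, (2, 30))
token_logits = lm(text)                                       # (2, 30, 4096)

# Stage 2: one conditional flow-matching training step on random data.
tokens = torch.randint(0, 4096, (2, 50))                      # speech tokens
mel = torch.randn(2, 50, 80)                                  # target mel frames
noise = torch.randn_like(mel)
t = torch.rand(1)
x_t = (1 - t) * noise + t * mel                               # linear interpolation path
loss = ((decoder(x_t, t, tokens) - (mel - noise)) ** 2).mean()
loss.backward()
```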
2 code implementations • 4 Jul 2024 • Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Siqi Zheng
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
no code implementations • 30 May 2024 • Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj
They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data paired with conditioning information such as image-text and image-class pairs.
1 code implementation • 30 May 2024 • Fangyi Chen, Han Zhang, Zhantao Yang, Hao Chen, Kai Hu, Marios Savvides
Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which could be learned from massive region-text pairs.
Ranked #11 on Open Vocabulary Object Detection on LVIS v1.0 (using extra training data)
no code implementations • 24 May 2024 • Chengming Xu, Kai Hu, Qilin Wang, Donghao Luo, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Chengjie Wang
Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images.
no code implementations • 20 May 2024 • Jiawei Wang, Kai Hu, Qiang Huo
Document layout analysis (DLA) is crucial for understanding the physical layout and logical structure of documents, serving applications such as information retrieval, document summarization, and knowledge extraction.
no code implementations • 15 May 2024 • Kai Hu, Weichen Yu, Tianjun Yao, Xiang Li, Wenhe Liu, Lijun Yu, Yining Li, Kai Chen, Zhiqiang Shen, Matt Fredrikson
Our approach relaxes the discrete jailbreak optimization into a continuous optimization and progressively increases the sparsity of the optimizing vectors.
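The general idea of relaxing a discrete search into a continuous one and progressively sparsifying it can be sketched as follows; the toy objective, embedding table, and top-k schedule are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

# Toy sketch: relax a discrete token search into a continuous distribution over the
# vocabulary, optimize it by gradient descent, and progressively force it toward a
# sparse (near one-hot) selection. Everything here is illustrative.

vocab, seq_len, d = 100, 8, 16
torch.manual_seed(0)
embedding = torch.randn(vocab, d)                  # frozen toy embedding table
target = torch.randn(d)                            # stand-in for an adversarial objective

logits = torch.zeros(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    probs = F.softmax(logits, dim=-1)              # continuous relaxation of token choice
    k = max(1, vocab // 2 ** (step // 40))         # progressively fewer candidates kept
    topk = torch.topk(probs, k, dim=-1)
    mask = torch.zeros_like(probs).scatter(-1, topk.indices, 1.0)
    sparse_probs = probs * mask
    sparse_probs = sparse_probs / sparse_probs.sum(-1, keepdim=True)
    soft_embeds = sparse_probs @ embedding         # (seq_len, d) soft token embeddings
    loss = ((soft_embeds.mean(0) - target) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

hard_tokens = logits.argmax(-1)                    # final discrete selection
print(hard_tokens.tolist())
```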
1 code implementation • 22 Jan 2024 • Jiawei Wang, Kai Hu, Zhuoyao Zhong, Lei Sun, Qiang Huo
Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets (PubLayNet and DocLayNet), a high-quality hierarchical document structure reconstruction dataset (HRDoc), and our Comp-HRDoc benchmark.
no code implementations • 17 Jan 2024 • Kai Hu, Jiawei Wang, WeiHong Lin, Zhuoyao Zhong, Lei Sun, Qiang Huo
This unified approach allows for the definition of various relation types and effectively tackles hierarchical relationships in form-like documents.
no code implementations • 17 Jan 2024 • Jiawei Wang, Shunchi Zhang, Kai Hu, Chixiang Ma, Zhuoyao Zhong, Lei Sun, Qiang Huo
Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within the complexity of natural scenes.
no code implementations • 13 Oct 2023 • Ravi Mangal, Klas Leino, Zifan Wang, Kai Hu, Weicheng Yu, Corina Pasareanu, Anupam Datta, Matt Fredrikson
There are three layers to this inquiry, which we address in this paper: (1) why do we care about robustness research?
2 code implementations • 7 Oct 2023 • Zhihao Du, JiaMing Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang
Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features.
1 code implementation • 4 Oct 2023 • Kai Hu, Klas Leino, Zifan Wang, Matt Fredrikson
A key challenge, supported both theoretically and empirically, is that robustness demands greater network capacity and more data than standard training.
no code implementations • 1 Oct 2023 • Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu
This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components.
1 code implementation • 14 Sep 2023 • Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis.
no code implementations • 17 Apr 2023 • Kai Hu, Zhuoyuan Wu, Zhuoyao Zhong, WeiHong Lin, Lei Sun, Qiang Huo
In this paper, we present a new question-answering (QA) based key-value pair extraction approach, called KVPFormer, to robustly extract key-value relationships between entities from form-like document images.
2 code implementations • NeurIPS 2023 • Kai Hu, Andy Zou, Zifan Wang, Klas Leino, Matt Fredrikson
We show that fast ways of bounding the Lipschitz constant for conventional ResNets are loose, and show how to address this by designing a new residual block, leading to the Linear ResNet (LiResNet) architecture.
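A minimal sketch of the underlying point: for a conventional residual block y = x + f(x), the generic bound 1 + Lip(f) is loose, whereas a linear residual block y = x + Wx has an exactly computable constant ||I + W||_2. The power-iteration estimate and layer size below are illustrative assumptions, not the paper's certification code.

```python
import torch

# Lipschitz bounds for residual blocks: for y = x + f(x), sub-additivity gives the
# loose bound 1 + Lip(f); for the linear block y = x + W x, the constant is exactly
# ||I + W||_2, which is never larger and usually smaller.

def spectral_norm(mat, iters=100):
    """Largest singular value of `mat` via power iteration."""
    v = torch.randn(mat.shape[1])
    for _ in range(iters):
        u = mat @ v
        u = u / u.norm()
        v = mat.t() @ u
        v = v / v.norm()
    return (u @ mat @ v).item()

d = 64
W = 0.1 * torch.randn(d, d)

loose = 1.0 + spectral_norm(W)                 # generic bound 1 + Lip(f) with f(x) = W x
tight = spectral_norm(torch.eye(d) + W)        # exact constant of the linear block

print(f"loose bound 1 + ||W||_2 = {loose:.3f}")
print(f"exact ||I + W||_2       = {tight:.3f}")
```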
2 code implementations • CVPR 2023 • Fangyi Chen, Han Zhang, Kai Hu, Yu-Kai Huang, Chenchen Zhu, Marios Savvides
This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage.
Ranked #14 on Object Detection on COCO 2017 val
no code implementations • 26 Nov 2022 • Jianhong Tu, Zeyu Cui, Xiaohuan Zhou, Siqi Zheng, Kai Hu, Ju Fan, Chang Zhou
To achieve this task, we construct a synthetic dataset and develop an effective framework.
no code implementations • 20 Oct 2022 • Xian Qian, Kai Hu, Jiaqiang Wang, Yifeng Liu, Xingyuan Pan, Jun Cao, Mingxuan Wang
This report describes our VolcTrans system for the WMT22 shared task on large-scale multilingual machine translation.
no code implementations • 6 Jul 2022 • Yansong Li, Kai Hu, Kohei Nakajima, Yongping Pan
The echo state network (ESN), a kind of recurrent neural network, consists of a fixed reservoir in which neurons are connected randomly and recursively; the desired output is obtained by training only the output connection weights.
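A minimal NumPy sketch of this training scheme (fixed random reservoir, readout trained by ridge regression); the toy task, sizes, and spectral-radius scaling are illustrative assumptions.

```python
import numpy as np

# Minimal echo state network: reservoir weights are random and fixed;
# only the linear readout is trained, here by ridge regression.

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 200, 1000

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

# Toy task: predict sin(t + 0.2) from sin(t).
t = np.linspace(0, 20 * np.pi, T)
u = np.sin(t)[:, None]
y = np.sin(t + 0.2)[:, None]

# Run the reservoir and collect its states.
x = np.zeros(n_res)
states = np.zeros((T, n_res))
for k in range(T):
    x = np.tanh(W_in @ u[k] + W @ x)
    states[k] = x

# Train only the readout with ridge regression (discard a short washout).
washout, ridge = 100, 1e-6
S, Y = states[washout:], y[washout:]
W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ Y)

pred = states @ W_out
print("readout fit MSE:", float(np.mean((pred[washout:] - Y) ** 2)))
```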
no code implementations • 28 May 2022 • Kai Hu, Yu Liu, Renhe Liu, Wei Lu, Gang Yu, Bin Fu
In the asymmetric codec, we adopt a mixed multi-path residual block (MMRB) to gradually extract weak texture features from input images, which better preserves the original facial features and avoids hallucinating details.
no code implementations • 25 May 2021 • WeiHong Lin, Qifang Gao, Lei Sun, Zhuoyao Zhong, Kai Hu, Qin Ren, Qiang Huo
In this paper, we propose a new multi-modal backbone network, named ViBERTgrid, that concatenates a BERTgrid (a grid of word embeddings) to an intermediate layer of a CNN whose input is the document image, generating a more powerful grid-based document representation.
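A minimal sketch of the grid-concatenation idea, assuming hypothetical shapes, a toy CNN stem, and a simple box-to-cell mapping rather than the ViBERTgrid implementation:

```python
import torch
import torch.nn as nn

# Sketch: rasterize each word's embedding into the cells covered by its normalized
# bounding box, then concatenate that grid channel-wise with an intermediate CNN
# feature map of the document image. All shapes are illustrative.

emb_dim = 32

def build_word_grid(word_embs, boxes, h, w):
    """word_embs: (N, emb_dim); boxes: (N, 4) as (x0, y0, x1, y1) in [0, 1]."""
    grid = torch.zeros(1, word_embs.shape[1], h, w)
    for emb, (x0, y0, x1, y1) in zip(word_embs, boxes):
        r0, r1 = int(y0 * h), max(int(y0 * h) + 1, int(y1 * h))
        c0, c1 = int(x0 * w), max(int(x0 * w) + 1, int(x1 * w))
        grid[0, :, r0:r1, c0:c1] = emb[:, None, None]
    return grid

cnn_stem = nn.Sequential(                      # toy stand-in for the early CNN stages
    nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU(),
)
fuse = nn.Conv2d(64 + emb_dim, 64, 1)          # mix image and text channels after concat

image = torch.randn(1, 3, 256, 256)
word_embs = torch.randn(5, emb_dim)            # stand-in for BERT word embeddings
boxes = torch.tensor([[0.10, 0.10, 0.30, 0.15],
                      [0.40, 0.10, 0.70, 0.15],
                      [0.10, 0.30, 0.50, 0.35],
                      [0.55, 0.30, 0.90, 0.35],
                      [0.10, 0.50, 0.60, 0.55]])

feat = cnn_stem(image)                         # (1, 64, 64, 64)
text_grid = build_word_grid(word_embs, boxes, feat.shape[2], feat.shape[3])
fused = fuse(torch.cat([feat, text_grid], dim=1))
print(fused.shape)                             # torch.Size([1, 64, 64, 64])
```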
1 code implementation • CVPR 2022 • Kai Hu, Wentong Liao, Michael Ying Yang, Bodo Rosenhahn
Text-to-image synthesis (T2I) aims to generate photo-realistic images which are semantically consistent with the text descriptions.
no code implementations • ICCV 2021 • Kai Hu, Jie Shao, YuAn Liu, Bhiksha Raj, Marios Savvides, Zhiqiang Shen
To address this, we present a contrast-and-order representation (CORP) framework for learning self-supervised video representations that can automatically capture both the appearance information within each frame and temporal information across different frames.
Ranked #3 on Self-Supervised Action Recognition Linear on UCF101
1 code implementation • NeurIPS 2020 • Jie Shao, Kai Hu, Changhu Wang, xiangyang xue, Bhiksha Raj
In this paper, we study what would happen when normalization layers are removed from the network, and show how to train deep neural networks without normalization layers and without performance degradation.
no code implementations • 7 Apr 2020 • Shuo Tian, Lianhua Qu, Kai Hu, Nan Li, Lei Wang, Weixia Xu
By exploring the design space of network architectures and parameters, recent works have demonstrated great potential for improving the accuracy of LSM models with low complexity.
no code implementations • 18 Nov 2019 • Kai Hu, Barnabas Poczos
We further use a noise analysis method to interpret the difference between RotationOut and Dropout in co-adaptation reduction.
no code implementations • 19 Nov 2018 • Kai Hu, Bhiksha Raj
Capturing spatiotemporal dynamics is an essential topic in video recognition.
no code implementations • 4 Nov 2018 • Kai Hu, Zhijian Ou, Min Hu, Junlan Feng
Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling.
no code implementations • 12 Jun 2018 • Xiaoteng Zhang, Yixin Bao, Feiyun Zhang, Kai Hu, Yicheng Wang, Liang Zhu, Qinzhu He, Yining Lin, Jie Shao, Yao Peng
We also propose new non-local-based models for further improvement on the recognition accuracy.
no code implementations • 12 Jan 2018 • Fen Xiao, Wenzheng Deng, Liangchan Peng, Chunhong Cao, Kai Hu, Xieping Gao
Salient object detection is a fundamental problem that has received a great deal of attention in computer vision.