Search Results for author: Jing Shi

Found 37 papers, 11 papers with code

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

2 code implementations • 7 May 2023 • Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu

(3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM.

Attribute Instruction Following +4

891

Paper
Code

Neural Speaker Diarization with Speaker-Wise Chain Rule

1 code implementation • 2 Jun 2020 • Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu

Speaker diarization is an essential step for processing multi-speaker audio.

speaker-diarization Speaker Diarization

347

Paper
Code

VLP: A Survey on Vision-Language Pre-training

1 code implementation • 18 Feb 2022 • Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu

Finally, we discuss the new frontiers in VLP.

277

Paper
Code

Audio-Visual Event Localization in Unconstrained Videos

2 code implementations • ECCV 2018 • Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.

audio-visual event localization Temporal Localization

158

Paper
Code

Learning to Generate Scene Graph from Natural Language Supervision

1 code implementation • ICCV 2021 • Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li

To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.

Graph Generation Scene Graph Generation +1

Paper
Code

Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation

2 code implementations • 30 Jan 2023 • Minglun Han, Feilong Chen, Jing Shi, Shuang Xu, Bo Xu

Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Code

Hierarchical Memory Networks for Answer Selection on Unknown Words

1 code implementation • COLING 2016 • Jiaming Xu, Jing Shi, Yiqun Yao, Suncong Zheng, Bo Xu

Recently, end-to-end memory networks have shown promising results on Question Answering task, which encode the past facts into an explicit memory and perform reasoning ability by making multiple computational steps on the memory.

Answer Selection Sentence

Paper
Code

Learning by Planning: Language-Guided Global Image Editing

1 code implementation • CVPR 2021 • Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu

Recently, language-guided global image editing draws increasing attention with growing application potentials.

Paper
Code

Matching-based Term Semantics Pre-training for Spoken Patient Query Understanding

1 code implementation • 2 Mar 2023 • Zefa Hu, Xiuyi Chen, Haoran Wu, Minglun Han, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu

Medical Slot Filling (MSF) task aims to convert medical queries into structured information, playing an essential role in diagnosis dialogue systems.

slot-filling Slot Filling

Paper
Code

Actor-Action Video Classification CSC 249/449 Spring 2020 Challenge Report

1 code implementation • 1 Aug 2020 • Jing Shi, Zhiheng Li, Haitian Zheng, Yihang Xu, Tianyou Xiao, Weitao Tan, Xiaoning Guo, Sizhe Li, Bin Yang, Zhexin Xu, Ruitao Lin, Zhongkai Shangguan, Yue Zhao, Jingwen Wang, Rohan Sharma, Surya Iyer, Ajinkya Deshmukh, Raunak Mahalik, Srishti Singh, Jayant G Rohra, Yi-Peng Zhang, Tongyu Yang, Xuan Wen, Ethan Fahnestock, Bryce Ikeda, Ian Lawson, Alan Finkelstein, Kehao Guo, Richard Magnotti, Andrew Sexton, Jeet Ketan Thaker, Yiyang Su, Chenliang Xu

This technical report summarizes submissions and compiles from Actor-Action video classification challenge held as a final project in CSC 249/449 Machine Vision course (Spring 2020) at University of Rochester

General Classification Video Classification

Paper
Code

A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue Information Extraction

1 code implementation • 30 Jul 2023 • Zefa Hu, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu

However, these generative methods output a whole sequence consisting of term-status pairs in one stage and ignore integrating prior knowledge, which demands a deeper understanding to model the relationship between terms and infer the status of each term.

Paper
Code

Concept Learning through Deep Reinforcement Learning with Memory-Augmented Neural Networks

no code implementations • 15 Nov 2018 • Jing Shi, Jiaming Xu, Yiqun Yao, Bo Xu

In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning.

One-Shot Learning Outlier Detection +2

Paper
Add Code

GAN-EM: GAN based EM learning framework

no code implementations • 2 Dec 2018 • Wentian Zhao, Shaojie Wang, Zhihuai Xie, Jing Shi, Chenliang Xu

To overcome such limitation, we propose a GAN based EM learning framework that can maximize the likelihood of images and estimate the latent variables with only the constraint of L-Lipschitz continuity.

Clustering Dimensionality Reduction +2

Paper
Add Code

Combining Lexical and Semantic-based Features for Answer Sentence Selection

no code implementations • WS 2016 • Jing Shi, Jiaming Xu, Yiqun Yao, Suncong Zheng, Bo Xu

As the result of the evaluation shows, our solution provides a valuable and brief model which could be used in modelling question answering or sentence semantic relevance.

Feature Engineering Open-Domain Question Answering +1

Paper
Add Code

Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity and Visual Clustering Losses

no code implementations • CVPR 2019 • Jing Shi, Jia Xu, Boqing Gong, Chenliang Xu

We invest the problem of weakly-supervised video grounding, where only video-level sentences are provided.

Clustering Sentence +1

Paper
Add Code

Learning Continuous-Time Dynamics by Stochastic Differential Networks

no code implementations • 11 Jun 2020 • Yingru Liu, Yucheng Xing, Xuewen Yang, Xin Wang, Jing Shi, Di Jin, Zhaoyue Chen

Learning continuous-time stochastic dynamics is a fundamental and essential problem in modeling sporadic time series, whose observations are irregular and sparse in both time and dimension.

Time Series Time Series Analysis

Paper
Add Code

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

no code implementations • NeurIPS 2020 • Jing Shi, Xuankai Chang, Pengcheng Guo, Shinji Watanabe, Yusuke Fujita, Jiaming Xu, Bo Xu, Lei Xie

This model additionally has a simple and efficient stop criterion for the end of the transduction, making it able to infer the variable number of output sequences.

Ranked #3 on Speech Separation on WSJ0-4mix

speech-recognition Speech Recognition +1

Paper
Add Code

Speaker-Conditional Chain Model for Speech Separation and Extraction

no code implementations • 25 Jun 2020 • Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu

With the predicted speaker information from whole observation, our model is helpful to solve the problem of conventional speech separation and speaker extraction for multi-round long recordings.

Audio and Speech Processing Sound

Paper
Add Code

Cubic Spline Smoothing Compensation for Irregularly Sampled Sequences

no code implementations • 3 Oct 2020 • Jing Shi, Jing Bi, Yingru Liu, Chenliang Xu

The marriage of recurrent neural networks and neural ordinary differential networks (ODE-RNN) is effective in modeling irregularly-observed sequences.

Paper
Add Code

A Benchmark and Baseline for Language-Driven Image Editing

no code implementations • 5 Oct 2020 • Jing Shi, Ning Xu, Trung Bui, Franck Dernoncourt, Zheng Wen, Chenliang Xu

To solve this new task, we first present a new language-driven image editing dataset that supports both local and global editing with editing operation and mask annotations.

Paper
Add Code

Audio-visual Speech Separation with Adversarially Disentangled Visual Representation

no code implementations • 29 Nov 2020 • Peng Zhang, Jiaming Xu, Jing Shi, Yunzhe Hao, Bo Xu

In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.

Speech Separation

Paper
Add Code

A Simple Baseline for Weakly-Supervised Scene Graph Generation

no code implementations • ICCV 2021 • Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, Chenliang Xu

We investigate the weakly-supervised scene graph generation, which is a challenging task since no correspondence of label and object is provided.

Contrastive Learning Graph Generation +2

Paper
Add Code

Language-Guided Global Image Editing via Cross-Modal Cyclic Mechanism

no code implementations • ICCV 2021 • Wentao Jiang, Ning Xu, Jiayun Wang, Chen Gao, Jing Shi, Zhe Lin, Si Liu

Given the cycle, we propose several free augmentation strategies to help our model understand various editing requests given the imbalanced dataset.

Paper
Add Code

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

no code implementations • 9 Oct 2021 • Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-Yi Lee, Shinji Watanabe

We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations • 27 Oct 2021 • Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e. g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1

Paper
Add Code

SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing

no code implementations • 30 Nov 2021 • Jing Shi, Ning Xu, Haitian Zheng, Alex Smith, Jiebo Luo, Chenliang Xu

Recently, large pretrained models (e. g., BERT, StyleGAN, CLIP) have shown great knowledge transfer and generalization capability on various downstream tasks within their domains.

Image-to-Image Translation Retrieval +1

Paper
Add Code

Anomaly Crossing: New Horizons for Video Anomaly Detection as Cross-domain Few-shot Learning

no code implementations • 12 Dec 2021 • Guangyu Sun, Zhang Liu, Lianggong Wen, Jing Shi, Chenliang Xu

Video anomaly detection aims to identify abnormal events that occurred in videos.

Anomaly Detection cross-domain few-shot learning +1

Paper
Add Code

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

no code implementations • 17 Dec 2021 • Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.

regression Speech Separation

Paper
Add Code

SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing

no code implementations • CVPR 2022 • Jing Shi, Ning Xu, Haitian Zheng, Alex Smith, Jiebo Luo, Chenliang Xu

Recently, large pretrained models (e. g., BERT, StyleGAN, CLIP) show great knowledge transfer and generalization capability on various downstream tasks within their domains.

Image-to-Image Translation Retrieval +1

Paper
Add Code

Dynamic Event-Triggered Discrete-Time Linear Time-Varying System with Privacy-Preservation

no code implementations • 28 Oct 2022 • Xuefeng Yang, Li Liu, Wenju Zhou, Jing Shi, Yinggang Zhang, Xin Hu, Huiyu Zhou

Moreover, the privacy of the system is analyzed to ensure the security of the real data.

Privacy Preserving

Paper
Add Code

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

no code implementations • 6 Apr 2023 • Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung

First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder.

Diffusion Personalization Tuning Free Text-to-Image Generation

Paper
Add Code

Mixture of personality improved Spiking actor network for efficient multi-agent cooperation

no code implementations • 10 May 2023 • Xiyun Li, Ziyi Ni, Jingqing Ruan, Linghui Meng, Jing Shi, Tielin Zhang, Bo Xu

Inspired by this two-step psychology theory, we propose a biologically plausible mixture of personality (MoP) improved spiking actor network (SAN), whereby a determinantal point process is used to simulate the complex formation and integration of different types of personality in MoP, and dynamic and spiking neurons are incorporated into the SAN for the efficient reinforcement learning.

Multi-agent Reinforcement Learning reinforcement-learning

Paper
Add Code

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian

In this work, we propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA, that can simply fine-tune lightweight visual-text alignment modules with frozen modality-specific encoders to update visual-aligned text embeddings as the condition.

AudioCaps Audio Generation +1

Paper
Add Code

VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

no code implementations • 31 May 2023 • Ziyi Ni, Minglun Han, Feilong Chen, Linghui Meng, Jing Shi, Pin Lv, Bo Xu

In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately, to facilitate speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

no code implementations • 22 Feb 2024 • Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated.

Video Generation

Paper
Add Code

VIXEN: Visual Text Comparison Network for Image Difference Captioning

no code implementations • 29 Feb 2024 • Alexander Black, Jing Shi, Yifei Fan, Tu Bui, John Collomosse

We present VIXEN - a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present.

Language Modelling Large Language Model +1

Paper
Add Code

Text-to-Audio Generation Synchronized with Videos

no code implementations • 8 Mar 2024 • Shentong Mo, Jing Shi, Yapeng Tian

Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.

AudioCaps Audio Generation +1

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.