Search Results for author: Jing Shi

Found 39 papers, 11 papers with code

FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

no code implementations23 Apr 2024 Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction.

Hallucination In-Context Learning +2

Text-to-Audio Generation Synchronized with Videos

no code implementations8 Mar 2024 Shentong Mo, Jing Shi, Yapeng Tian

Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.

AudioCaps Audio Generation +1

VIXEN: Visual Text Comparison Network for Image Difference Captioning

no code implementations29 Feb 2024 Alexander Black, Jing Shi, Yifei Fan, Tu Bui, John Collomosse

We present VIXEN - a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present.

Language Modelling Large Language Model +1

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

no code implementations22 Feb 2024 Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated.

Video Generation

A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue Information Extraction

1 code implementation30 Jul 2023 Zefa Hu, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu

However, these generative methods output a whole sequence consisting of term-status pairs in one stage and ignore integrating prior knowledge, which demands a deeper understanding to model the relationship between terms and infer the status of each term.

VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

no code implementations31 May 2023 Ziyi Ni, Minglun Han, Feilong Chen, Linghui Meng, Jing Shi, Pin Lv, Bo Xu

In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately, to facilitate speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

no code implementations22 May 2023 Shentong Mo, Jing Shi, Yapeng Tian

In this work, we propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA, that can simply fine-tune lightweight visual-text alignment modules with frozen modality-specific encoders to update visual-aligned text embeddings as the condition.

AudioCaps Audio Generation +1

Mixture of personality improved Spiking actor network for efficient multi-agent cooperation

no code implementations10 May 2023 Xiyun Li, Ziyi Ni, Jingqing Ruan, Linghui Meng, Jing Shi, Tielin Zhang, Bo Xu

Inspired by this two-step psychology theory, we propose a biologically plausible mixture of personality (MoP) improved spiking actor network (SAN), whereby a determinantal point process is used to simulate the complex formation and integration of different types of personality in MoP, and dynamic and spiking neurons are incorporated into the SAN for the efficient reinforcement learning.

Multi-agent Reinforcement Learning reinforcement-learning

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

2 code implementations7 May 2023 Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu

(3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM.

Attribute Instruction Following +4

Matching-based Term Semantics Pre-training for Spoken Patient Query Understanding

1 code implementation2 Mar 2023 Zefa Hu, Xiuyi Chen, Haoran Wu, Minglun Han, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu

Medical Slot Filling (MSF) task aims to convert medical queries into structured information, playing an essential role in diagnosis dialogue systems.

slot-filling Slot Filling

SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing

no code implementations CVPR 2022 Jing Shi, Ning Xu, Haitian Zheng, Alex Smith, Jiebo Luo, Chenliang Xu

Recently, large pretrained models (e. g., BERT, StyleGAN, CLIP) show great knowledge transfer and generalization capability on various downstream tasks within their domains.

Decoder Image-to-Image Translation +2

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

no code implementations17 Dec 2021 Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.

regression Speech Separation

SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing

no code implementations30 Nov 2021 Jing Shi, Ning Xu, Haitian Zheng, Alex Smith, Jiebo Luo, Chenliang Xu

Recently, large pretrained models (e. g., BERT, StyleGAN, CLIP) have shown great knowledge transfer and generalization capability on various downstream tasks within their domains.

Decoder Image-to-Image Translation +2

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations27 Oct 2021 Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e. g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1

Learning to Generate Scene Graph from Natural Language Supervision

1 code implementation ICCV 2021 Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li

To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.

Graph Generation Scene Graph Generation +1

Learning by Planning: Language-Guided Global Image Editing

1 code implementation CVPR 2021 Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu

Recently, language-guided global image editing draws increasing attention with growing application potentials.

A Simple Baseline for Weakly-Supervised Scene Graph Generation

no code implementations ICCV 2021 Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, Chenliang Xu

We investigate the weakly-supervised scene graph generation, which is a challenging task since no correspondence of label and object is provided.

Contrastive Learning Graph Generation +2

Language-Guided Global Image Editing via Cross-Modal Cyclic Mechanism

no code implementations ICCV 2021 Wentao Jiang, Ning Xu, Jiayun Wang, Chen Gao, Jing Shi, Zhe Lin, Si Liu

Given the cycle, we propose several free augmentation strategies to help our model understand various editing requests given the imbalanced dataset.

Audio-visual Speech Separation with Adversarially Disentangled Visual Representation

no code implementations29 Nov 2020 Peng Zhang, Jiaming Xu, Jing Shi, Yunzhe Hao, Bo Xu

In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.

Speech Separation

A Benchmark and Baseline for Language-Driven Image Editing

no code implementations5 Oct 2020 Jing Shi, Ning Xu, Trung Bui, Franck Dernoncourt, Zheng Wen, Chenliang Xu

To solve this new task, we first present a new language-driven image editing dataset that supports both local and global editing with editing operation and mask annotations.

Cubic Spline Smoothing Compensation for Irregularly Sampled Sequences

no code implementations3 Oct 2020 Jing Shi, Jing Bi, Yingru Liu, Chenliang Xu

The marriage of recurrent neural networks and neural ordinary differential networks (ODE-RNN) is effective in modeling irregularly-observed sequences.

Speaker-Conditional Chain Model for Speech Separation and Extraction

no code implementations25 Jun 2020 Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu

With the predicted speaker information from whole observation, our model is helpful to solve the problem of conventional speech separation and speaker extraction for multi-round long recordings.

Audio and Speech Processing Sound

Learning Continuous-Time Dynamics by Stochastic Differential Networks

no code implementations11 Jun 2020 Yingru Liu, Yucheng Xing, Xuewen Yang, Xin Wang, Jing Shi, Di Jin, Zhaoyue Chen

Learning continuous-time stochastic dynamics is a fundamental and essential problem in modeling sporadic time series, whose observations are irregular and sparse in both time and dimension.

Time Series Time Series Analysis

GAN-EM: GAN based EM learning framework

no code implementations2 Dec 2018 Wentian Zhao, Shaojie Wang, Zhihuai Xie, Jing Shi, Chenliang Xu

To overcome such limitation, we propose a GAN based EM learning framework that can maximize the likelihood of images and estimate the latent variables with only the constraint of L-Lipschitz continuity.

Clustering Dimensionality Reduction +2

Concept Learning through Deep Reinforcement Learning with Memory-Augmented Neural Networks

no code implementations15 Nov 2018 Jing Shi, Jiaming Xu, Yiqun Yao, Bo Xu

In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning.

One-Shot Learning Outlier Detection +2

Combining Lexical and Semantic-based Features for Answer Sentence Selection

no code implementations WS 2016 Jing Shi, Jiaming Xu, Yiqun Yao, Suncong Zheng, Bo Xu

As the result of the evaluation shows, our solution provides a valuable and brief model which could be used in modelling question answering or sentence semantic relevance.

Feature Engineering Open-Domain Question Answering +1

Hierarchical Memory Networks for Answer Selection on Unknown Words

1 code implementation COLING 2016 Jiaming Xu, Jing Shi, Yiqun Yao, Suncong Zheng, Bo Xu

Recently, end-to-end memory networks have shown promising results on Question Answering task, which encode the past facts into an explicit memory and perform reasoning ability by making multiple computational steps on the memory.

Answer Selection Sentence

Cannot find the paper you are looking for? You can Submit a new open access paper.