Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation.
An exciting advancement in the field of multilingual models is the emergence of autoregressive models with zero- and few-shot capabilities, a phenomenon widely reported in large-scale language models.
We employ ScoPE to facilitate text generation in the target domain by integrating it with language models through a cascading approach.
P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis.
The advent of artificial intelligence (AI) has enabled a comprehensive exploration of materials for various applications.
Mass spectra, which are agglomerations of ionized fragments from targeted molecules, play a crucial role across various fields for the identification of molecular structures.
Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks.
Diffusion-Stego achieved a high capacity of messages (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp with 90% accuracy) as well as high quality (with a FID score of 2.77 for 1.0 bpp on the FFHQ 64$\times$64 dataset) that makes it challenging to distinguish from real images in the PNG format.
Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts.
Although text-to-video (TTV) models have recently achieved remarkable success, few approaches extend TTV to video editing.
In particular, in comparison with existing self-supervised learning methods for tabular data, we propose a different corruption method for state and action representations that is robust to diverse distortions.
The aim of continual learning is to learn new tasks continuously (i.e., plasticity) without forgetting previously learned knowledge from old tasks (i.e., stability).
While the existing methods require the collection of auxiliary data or model weights to generate a counterpart, FedClassAvg only requires clients to communicate with a couple of fully connected layers, which is highly communication-efficient.
In terms of image quality, the LPIPS score improves by up to 12% and the reconstruction speed is 87% higher than that of ET-Net.
Unlike existing confidence scores that use knowledge from only the source or only the target domain, the JMDS score uses knowledge from both.
Finally, for each element in the feature set, the aggregation features are extracted by calculating the weighted means and variances, where the weights are derived from the similarity distributions.
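The weighted-statistics step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax used to turn similarities into weights, the function names, and the toy inputs are all assumptions made for the example.

```python
import math


def similarity_weights(similarities):
    """Turn similarity scores into weights that sum to 1.

    A softmax is an assumption here; the source only says the weights
    are derived from the similarity distributions.
    """
    peak = max(similarities)
    exps = [math.exp(s - peak) for s in similarities]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def weighted_mean_and_variance(features, weights):
    """Per-dimension weighted mean and weighted variance of feature vectors."""
    dim = len(features[0])
    mean = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    variance = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return mean, variance


# Toy feature set: three 2-D features with equal similarity to the query element.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = similarity_weights([0.0, 0.0, 0.0])
mean, variance = weighted_mean_and_variance(features, weights)
```

With equal similarities the weights are uniform, so the result reduces to the ordinary mean and population variance; unequal similarities pull both statistics toward the most similar features.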
Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments.
Ranked #4 on Speech Synthesis on LibriTTS
We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method and further fine-tune the diffusion model on the reference speech of the target speaker for adaptation, which only takes 40 seconds.
This manipulation is realized in an anti-adversarial manner, so that the original image is perturbed along pixel gradients in directions opposite to those used in an adversarial attack.
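As a rough illustration of the anti-adversarial idea, the sketch below uses a toy linear classifier, for which the gradient of the class score with respect to the pixels is simply the weight vector. The linear model, the plain-gradient update, and all names and values are assumptions for the example; the source does not specify the model or the step rule.

```python
def class_score(weights, bias, pixels):
    """Score of the target class under a toy linear classifier (assumed model)."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def score_gradient(weights):
    """Gradient of the linear class score with respect to the pixels."""
    return list(weights)


def adversarial_step(weights, pixels, step_size):
    """Move the pixels against the gradient, lowering the class score."""
    grad = score_gradient(weights)
    return [p - step_size * g for p, g in zip(pixels, grad)]


def anti_adversarial_step(weights, pixels, step_size):
    """Move the pixels along the gradient, the opposite of the attack direction."""
    grad = score_gradient(weights)
    return [p + step_size * g for p, g in zip(pixels, grad)]


# Hypothetical three-pixel "image" and classifier parameters.
weights, bias = [0.5, -1.0, 2.0], 0.0
pixels = [0.2, 0.4, 0.6]

original = class_score(weights, bias, pixels)
attacked = class_score(weights, bias, adversarial_step(weights, pixels, 0.1))
manipulated = class_score(weights, bias, anti_adversarial_step(weights, pixels, 0.1))
```

Here the adversarial step lowers the class score while the anti-adversarial step raises it, which is the only property the sketch is meant to show.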
Weakly supervised object localization aims to find a target object region in a given image with only weak supervision, such as image-level labels.
Diffusion models learn to restore noisy data, which is corrupted with different levels of noise, by optimizing the weighted sum of the corresponding loss terms, i.e., denoising score matching loss.
To take such non-linear characteristics into account, we introduce Label-Gradient Alignment (LGA), a novel NTK-based metric whose inherent formulation allows it to capture the large amount of non-linear advantage present in modern neural architectures.
However, classifiers trained only on class labels suffer from the spurious correlation between foreground and background cues (e.g., train and rail), fundamentally bounding the performance of WSSS.
However, in this study, we prove that the existing DC methods can perform worse than the random selection method when task-irrelevant information forms a significant part of the training dataset.
We investigate the design choices used in previous studies in terms of accuracy and number of spikes and find that they are not well suited to SNNs.
The computer-aided diagnosis of focal liver lesions (FLLs) can help improve workflow and enable correct diagnoses; FLL detection is the first step in such a computer-aided diagnosis.
Namely, we make the supervised pre-training of Neural DUDE compatible with the adaptive fine-tuning of the parameters based on the given noisy data subject to denoising.
For TTS synthesis, we guide the generative process of the diffusion model with a phoneme classifier trained on a large-scale speech recognition dataset.
We implemented large-batch synchronous training of DNNs based on Caffe, a deep learning library.
Weakly supervised semantic segmentation produces pixel-level localization from class labels; however, a classifier trained on such labels is likely to focus on a small discriminative region of the target object.
Ranked #13 on Weakly-Supervised Semantic Segmentation on COCO 2014 val
In this work, we present Facial Identity Controllable GAN (FICGAN), which not only generates high-quality de-identified face images with ensured privacy protection but also offers detailed controllability on attribute preservation for enhanced data utility.
Several recent studies have shown that the use of extra in-distribution data can lead to a high level of adversarial robustness.
By modeling the unconditional distribution for speech, our model can utilize the untranscribed data for training.
Unsupervised domain adaptation (UDA) aims to achieve high performance within the unlabeled target domain by leveraging the labeled source domain.
Non-autoregressive neural machine translation (NART) models suffer from the multi-modality problem, which causes translation inconsistencies such as token repetition.
Furthermore, we question the potential of existing TAD methods by showing that an untrained model obtains comparable detection performance to the existing methods even when PA is forbidden.
AGG addresses the degeneration problem by gating the specific part of the gradient for rare token embeddings.
In this work, we propose Iterative Latent Variable Refinement (ILVR), a method to guide the generative process in DDPM to generate high-quality images based on a given reference image.
Current efforts to improve the robustness of neural networks against adversarial examples are focused on developing robust training methods, which update the weights of a neural network in a more robust direction.
From our observations, the generator's implicit positional encoding is translation-variant, making the generator spatially biased.
In recent years, molecular representation learning has emerged as a key area of focus in various chemical tasks.
Message passing neural networks provide an effective framework for capturing molecular geometric features by treating a molecule as a graph.
On the MNIST dataset, our proposed student SNN achieves up to 0.09% higher accuracy and produces 65% fewer spikes compared to the student SNN trained with a conventional knowledge distillation method.
Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density.
By analyzing proxy data constructed using various selection methods through data entropy, we propose a novel proxy data selection method tailored for NAS.
Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner.
Weakly supervised segmentation methods using bounding box annotations focus on obtaining a pixel-level mask from each box containing an object.
Weakly supervised semantic segmentation produces a pixel-level localization from a classifier, but it is likely to restrict its focus to a small discriminative region of the target object.
Herein, we propose a data augmentation method to improve generalization in both adversarial and standard learning by using out-of-distribution (OOD) data that are devoid of the abovementioned issues.
To demystify the "black box" property of deep neural networks for natural language processing (NLP), several methods have been proposed to interpret their predictions by measuring the change in prediction probability after erasing each token of an input.
In this paper, we address the problem of image anomaly detection and segmentation.
Ranked #7 on Anomaly Detection on BTAD (using extra training data)
Enhancing feature transferability by matching marginal distributions has led to improvements in domain adaptation, although this is at the expense of feature discrimination.
Normalizing flows (NFs) have become a prominent method for deep generative models that allow for an analytic probability density estimation and efficient synthesis.
By leveraging the properties of flows, MAS searches for the most probable monotonic alignment between text and the latent representation of speech.
Ranked #4 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
Spiking neural networks (SNNs) have gained considerable interest due to their energy-efficient characteristics, yet the lack of a scalable training algorithm has restricted their applicability in practical machine learning problems.
In this paper, we identify Adversarial Feature Overfitting (AFO), which may cause poor adversarially robust generalization, and we show that adversarial training can overshoot the optimal point in terms of robust generalization, leading to AFO in our simple Gaussian model.
Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling.
We propose a method of using videos automatically harvested from the web to identify a larger region of the target object by using temporal information, which is not present in the static image.
We show that our model outperforms state-of-the-art approaches for various text-to-SQL datasets in two aspects: 1) the SQL generation accuracy for the trained templates, and 2) the adaptability to the unseen SQL templates based on a single example without any additional training.
Over the past decade, deep neural networks (DNNs) have demonstrated remarkable performance in a variety of applications.
We propose a novel domain adaptation method based on label propagation and cycle consistency so that the feature clusters from the two domains overlap exactly and become clearly separated, yielding high accuracy.
We propose a DL based steganalysis technique that effectively removes secret images by restoring the distribution of the original images.
The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations.
We evaluate the classification performance (F1-score) of the proposed method with 20% missingness and confirm up to a 5% improvement in comparison with the performance of combinations of state-of-the-art methods.
We compared our method to state-of-the-art techniques and observed that it preserves the same level of privacy as differential privacy (DP) while yielding better prediction results.
Most modern text-to-speech architectures use a WaveNet vocoder for synthesizing high-fidelity waveform audio, but its slow autoregressive sampling scheme limits practical applications.
Spiking neural networks (SNNs) are considered among the most promising artificial neural networks due to their energy-efficient computing capability.
Furthermore, the privacy of the data involved in model training is also threatened by attacks such as the model-inversion attack, or by dishonest service providers of AI applications.
We present a focal liver lesion detection model that leverages custom-designed multi-phase computed tomography (CT) volumes, which reflect real-world clinical lesion detection practice, using a Single Shot MultiBox Detector (SSD).
Knowledge tracing (KT), a key component of an intelligent tutoring system, is a machine learning technique that estimates the mastery level of a student based on his/her past performance.
The experimental results with real-world data confirm the effectiveness of the system and models.
Video prediction can be performed by finding features in recent frames, and using them to generate approximations to upcoming frames.
Ranked #1 on Video Prediction on KTH (Cond metric)
A recommender system aims to recommend items that a user is interested in among many items.
The objective of this study is to train an autonomous navigation model that uses a simulator (instead of real labeled data) and an inexpensive monocular camera.
Distributed training of deep neural networks has received significant research interest, and its major approaches include implementations on multiple GPUs and clusters.
Generative Adversarial Networks (GANs) have received wide attention in the machine learning field for their potential to learn high-dimensional, complex real data distributions.
Electronic health records (EHRs) have contributed to the computerization of patient records and can thus be used not only for efficient and systematic medical services, but also for research on biomedical data science.
In this paper, we identify memory addressing (specifically, content-based addressing) as the main reason for the performance degradation and propose a robust quantization method for MANNs to address the challenge.
We propose an application of sequence generative adversarial networks (SeqGAN), which are generative adversarial networks for discrete sequence generation, for creating polyphonic musical sequences.
Recommender systems aim to find an accurate and efficient mapping from historic data of user-preferred items to a new item that is to be liked by a user.
To guarantee the commutative property for homogeneous interaction, we apply model sharing and hidden representation merging techniques.
We compare our proposed method to various existing methods and biological sequence analysis methods implemented on top of our framework.
Experiments on Czech-German and French-German translations demonstrate the efficacy of the proposed pseudo parallel corpus, which shows not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth real parallel corpus.
Graphs provide a powerful means for representing complex interactions between entities.
Under the assumption that using such an automatically generated dataset could relieve the burden of manual question-answer generation, we tried to use this dataset to train an instance of Watson and checked the training efficiency and accuracy.
To eliminate this workaround, a new class of SNNs named deep spiking networks (DSNs) was recently proposed; DSNs can be trained directly (without a mapping from conventional deep networks) by error backpropagation with stochastic gradient descent.
In computer security, designing a robust intrusion detection system is one of the most fundamental and important problems.
The second is the popularity of NAND flash-based solid-state drives (SSDs) containing multicore processors that can accommodate extra computation for data processing.
Since microRNAs (miRNAs) play a crucial role in post-transcriptional gene regulation, miRNA identification is one of the most essential problems in computational biology.
MicroRNAs (miRNAs) are short sequences of ribonucleic acids that control the expression of target messenger RNAs (mRNAs) by binding them.
The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling massive data and parameters involved in DNN training.
A eukaryotic gene consists of multiple exons (protein coding regions) and introns (non-coding regions), and a splice junction refers to the boundary between an exon and an intron.
Learning meaningful representations using deep neural networks involves designing efficient training schemes and well-structured networks.
Motivated by the need for fast and accurate classification of unlabeled nucleotide sequences on a large scale, we developed NASCUP, a new classification method that captures statistical structures of nucleotide sequences by compact context-tree models and universal probability from information theory.
Robust classification becomes challenging when each class consists of multiple subclasses.