LST-Net: Learning a Convolutional Neural Network with a Learnable Sparse Transform

The 2D convolutional (Conv2d) layer is the fundamental element of a deep convolutional neural network (CNN).

Momentum Batch Normalization for Deep Learning with Small Batch Size

To gain a deeper understanding of BN, in this work we prove that BN actually introduces a certain level of noise into the sample mean and variance during the training process, while the noise level depends only on the batch size.

CPPF: A contextual and post-processing-free model for automatic speech recognition

To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper, and focus on integrating multiple ASR text processing tasks related to speech recognition into the ASR model.

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

In this paper, we present Delta-LoRA, which is a novel parameter-efficient approach to fine-tune large language models (LLMs).

Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization

However, the existing methods along this line either fail to keep faithful pixel-wise image structures or resort to extra skipped connections to reproduce details, which requires additional training in image space and limits their extension to other related tasks in latent space such as image stylization.

Neural Interactive Keypoint Detection

Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective annotation process.

Uncovering User Interest from Biased and Noised Watch Time in Video Recommendation

In video recommendation, watch time is commonly adopted as an indicator of user interest.

Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance

When it is applied to fault-tolerant neural networks enhanced with fault-aware retraining and constrained activation functions, the resulting model accuracy generally shows significant improvement in the presence of various faults.

Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation

Recent leading zero-shot video object segmentation (ZVOS) works are devoted to integrating appearance and motion information by elaborately designing feature fusion modules and identically applying them in multiple feature stages.

A Benchmark for Chinese-English Scene Text Image Super-resolution

Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input.

Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport

Weakly-supervised image segmentation has recently attracted increasing research attention, aiming to avoid expensive pixel-wise labeling.

DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting

Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based, either use estimated depth to get pseudo LiDAR features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieve finer semantics while suffering from a depth ambiguity problem.

CORE: Cooperative Reconstruction for Multi-Agent Perception

This paper presents CORE, a conceptually simple, effective and communication-efficient model for multi-agent cooperative perception.

Neural Quantile Optimization for Edge-Cloud Computing

The network structure reflects the edge-cloud computing topology and is trained to minimize the expectation of the cost function for unconstrained continuous optimization problems.

Semantic-SAM: Segment and Recognize Anything at Any Granularity

In this paper, we introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity.

MomentDiff: Generative Video Moment Retrieval from Random to Real

Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description.

Impact of UAVs Equipped with ADS-B on the Civil Aviation Monitoring System

However, due to the limited frequency of ADS-B technique, UAVs equipped with ADS-B devices result in the loss of packets to both UAVs and civil aviation.


Steganographic Capacity of Deep Learning Models

We find that the steganographic capacity of the learning models tested is surprisingly high, and that in each case, there is a clear threshold after which model performance rapidly degrades.

DreamTime: An Improved Optimization Strategy for Text-to-3D Content Creation

Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled text-to-3D content creation by optimizing a randomly initialized Neural Radiance Fields (NeRF) with score distillation.

Understanding Optimization of Deep Learning

This article provides a comprehensive understanding of optimization in deep learning, with a primary focus on the challenges of gradient vanishing and gradient exploding, which normally lead to diminished model representational ability and training instability, respectively.

detrex: Benchmarking Detection Transformers

To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation.

Recognize Anything: A Strong Image Tagging Model

We are releasing the RAM at https://recognize-anything.github.io/ to foster the advancements of large models in computer vision.

Efficient and Interpretable Compressive Text Summarisation with Unsupervised Dual-Agent Reinforcement Learning

Recently, compressive text summarisation offers a balance between the conciseness issue of extractive summarisation and the factual hallucination issue of abstractive summarisation.


Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Despite the progress in semantic image synthesis, it remains a challenging problem to generate photo-realistic parts from the input semantic map.

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars

We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior.

Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model

Extensive experiments demonstrate that DiffHOI significantly outperforms the state-of-the-art in regular detection (i.e., 41.50 mAP) and zero-shot detection.

Cognition Guided Human-Object Relationship Detection

Human-object relationship detection reveals the fine-grained relationship between humans and objects, helping the comprehensive understanding of videos.

MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks

In contrast, though human engineers have the incredible ability to understand tasks and reason about solutions, their experience and knowledge are often sparse and difficult to utilize by quantitative approaches.


A marker-less human motion analysis system for motion-based biomarker discovery in knee disorders

In recent years, the NHS has had increasing difficulty seeing all low-risk patients, including but not limited to suspected osteoarthritis (OA) patients.

A Strong and Reproducible Object Detector with Only Public Datasets

This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without any test time augmentation.

LipsFormer: Introducing Lipschitz Continuity to Vision Transformers

In contrast to previous practical tricks that address training instability by learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability.

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

In this way, we can reduce the GPU memory consumption of contrastive loss computation from $O(B^2)$ to $O(\frac{B^2}{N})$, where $B$ and $N$ are the batch size and the number of GPUs used for training.
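
As a rough illustration of the scaling stated above, the sketch below counts the similarity-matrix entries a single GPU would hold. It assumes that B is the global batch size, that the batch splits evenly across the N GPUs, and that the entry count is a fair proxy for memory; the function name similarity_entries is hypothetical and is not taken from the DisCo-CLIP code.

```python
def similarity_entries(batch_size: int, num_gpus: int = 1) -> int:
    """Count the similarity-matrix entries held by one GPU.

    Assumption: each GPU keeps only the rows for its local share of the
    batch (B / N rows) against all B columns, i.e. B^2 / N entries.
    With num_gpus=1 this is the full B x B matrix.
    """
    if num_gpus < 1 or batch_size % num_gpus != 0:
        raise ValueError("batch_size must be divisible by num_gpus")
    local_rows = batch_size // num_gpus
    return local_rows * batch_size


# Illustrative numbers only; they are not results reported by the paper.
full_matrix = similarity_entries(32768)          # B^2 entries on one GPU
per_gpu_block = similarity_entries(32768, 64)    # B^2 / N entries per GPU
reduction_factor = full_matrix // per_gpu_block  # equals N under these assumptions
```

Under these assumptions the per-GPU count shrinks by exactly the number of GPUs, which matches the stated reduction from O(B^2) to O(B^2/N).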

Language Guided Local Infiltration for Interactive Image Retrieval

Interactive Image Retrieval (IIR) aims to retrieve images that are generally similar to the reference image but under the requested text modification.

Detection Transformer with Stable Matching

We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR.

HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation

While such a plug-and-play approach is appealing, the inevitable and uncertain conflicts between the original images produced from the frozen SD branch and the given condition incur significant challenges for the learnable branch, which essentially conducts image feature editing for condition enforcement.

Multi-view Adversarial Discriminator: Mine the Non-causal Factors for Object Detection in Unseen Domains

In this work, we present an idea to remove non-causal factors from common features by multi-view adversarial training on source domains, because we observe that such insignificant non-causal factors may still be significant in other latent spaces (views) due to the multi-mode structure of data.

Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer

Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR) by accessing both the source and target data.

One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer

It is challenging to perform this task with a single network due to resolution issues, i.e., the face and hands are usually located in extremely small regions.

BoxVIS: Video Instance Segmentation with Box Annotations

As a result, the amount of pixel-wise annotations in existing video instance segmentation (VIS) datasets is small, limiting the generalization capability of trained VIS models.

Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases

However, current research rarely studies the impact of different amounts of instruction data on model performance, especially in real-world use cases.

OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering

In this paper, we propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a generalized controllable tri-plane rendering solution so that each personalized avatar can be constructed from only one portrait as the reference.

MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos

The proposed MDQE is the first VIS method with per-clip input that achieves state-of-the-art results on challenging videos and competitive performance on simple videos.

Human Guided Ground-truth Generation for Realistic Image Super-resolution

A human guided GT image dataset with both positive and negative samples is then constructed, and a loss function is proposed to train the Real-ISR models.

One-to-Few Label Assignment for End-to-End Dense Detection

The positive and negative weights of these soft anchors are dynamically adjusted during training so that they can contribute more to "representation learning" in the early training stage, and contribute more to "duplicated prediction removal" in the later stage.

Sharpness-Aware Gradient Matching for Domain Generalization

In this paper, we present two conditions to ensure that the model could converge to a flat minimum with a small loss, and present an algorithm, named Sharpness-Aware Gradient Matching (SAGM), to meet the two conditions for improving model generalization capability.

Towards Diverse Binary Segmentation via A Simple yet General Gated Network

They ignore two key problems when the encoder exchanges information with the decoder: one is the lack of an interference control mechanism between them; the other is the failure to consider the disparity in the contributions from different encoder levels.

MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences

Current top-performing multi-frame detectors mostly follow a Detect-and-Fuse framework, which extracts features from each frame of the sequence and fuses them to detect the objects in the current frame.

DynaMask: Dynamic Mask Selection for Instance Segmentation

The representative instance segmentation methods mostly segment different object instances with a mask of a fixed resolution, e.g., a 28×28 grid.

A Simple Framework for Open-Vocabulary Segmentation and Detection

We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.

Synthesizing Realistic Image Restoration Training Pairs: A Diffusion Approach

In supervised image restoration tasks, one key issue is how to obtain the aligned high-quality (HQ) and low-quality (LQ) training image pairs.

Tag2Text: Guiding Vision-Language Model via Image Tagging

This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features.

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

In addition to the unprecedented ability in imaginary creation, large text-to-image models are expected to take customized concepts in image generation.

Spatial-Frequency Attention for Image Denoising

In this paper, we propose the spatial-frequency attention network (SFANet) to enhance the network's ability in exploiting long-range dependency.

Introducing Depth into Transformer-based 3D Object Detection

To address the second issue, we introduce an auxiliary learning task called Depth-aware Negative Suppression loss.

Towards a Sustainable Internet-of-Underwater-Things based on AUVs, SWIPT, and Reinforcement Learning

In this paper, we propose a sustainable scheme to improve the throughput and lifetime of underwater networks, enabling them to potentially operate indefinitely.

Variation Enhanced Attacks Against RRAM-based Neuromorphic Computing System

The RRAM-based neuromorphic computing system has attracted explosive interest for its data processing capability and energy efficiency, which are superior to those of traditional architectures, and is thus widely used in many data-centric applications.

Dual Graph Multitask Framework for Imbalanced Delivery Time Estimation

To address the issue, we propose a novel Dual Graph Multitask framework for imbalanced Delivery Time Estimation (DGM-DTE).

DRGCN: Dynamic Evolving Initial Residual for Deep Graph Convolutional Networks

Our experimental results show that our model effectively relieves the problem of over-smoothing in deep GCNs and outperforms the state-of-the-art (SOTA) methods on various benchmark datasets.

Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, where it unifies the contextual learning between human-level (global) and keypoint-level (local) information.

Adversarial Style Augmentation for Domain Generalization

By updating the model against the adversarial statistics perturbation during training, we allow the model to explore the worst-case domain and hence improve its generalization performance.

Towards Precise Model-free Robotic Grasping with Sim-to-Real Transfer Learning

In physical robotic experiments, our grasping framework grasped single known objects and novel complex-shaped household objects with a success rate of 90.91%.

Towards Accurate Acne Detection via Decoupled Sequential Detection Head

In addition, we build a high-quality acne detection dataset named ACNE-DET to verify the effectiveness of DSDH.

Revisiting Prototypical Network for Cross Domain Few-Shot Learning

Prototypical Network is a popular few-shot solver that aims at establishing a feature metric generalizable to novel few-shot classification (FSC) tasks using deep neural networks.

Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information Bottleneck Principle

In this paper, we try to solve this problem by introducing a principled and unified framework based on Information Bottleneck (IB) theory, which further guides us to an automatic pruning approach.

RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning

We evaluate our approach on two datasets and show that our proposed RCA-NOC approach outperforms state-of-the-art methods by a large margin, demonstrating its effectiveness in improving vision-language representation for novel object captioning.

Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset

In this work, we develop, for the first time to the best of our knowledge, an HDR image dataset by using mobile phone cameras, namely the Mobile-HDR dataset.


A General Regret Bound of Preconditioned Gradient Method for DNN Training

Though the full-matrix preconditioned gradient methods theoretically have a lower regret bound, they are impractical for training DNNs because of their high complexity.

Exploring Vision Transformers as Diffusion Learners

We further provide a hypothesis on the implication of disentangling the generative backbone as an encoder-decoder structure and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with ASymmetriC ENcoder Decoder (ASCEND).

Accelerating Dataset Distillation via Model Augmentation

Dataset Distillation (DD), a newly emerging field, aims at generating much smaller but efficient synthetic training datasets from large ones.

Multi-adversarial Faster-RCNN with Paradigm Teacher for Unrestricted Object Detection

Our proposed MAF has two distinct contributions: (1) The Hierarchical Domain Feature Alignment (HDFA) module is introduced to minimize the image-level domain disparity, where Scale Reduction Module (SRM) reduces the feature map size without information loss and increases the training efficiency.

Benchmark Dataset and Effective Inter-Frame Alignment for Real-World Video Super-Resolution

On the other hand, alignment algorithms in existing VSR methods perform poorly for real-world videos, leading to unsatisfactory results.

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution

In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention.

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction.

Parameter-Efficient Transformer with Hybrid Axial-Attention for Medical Image Segmentation

To this end, we propose a parameter-efficient transformer to explore intrinsic inductive bias via position information for medical image segmentation.

Point-MA2E: Masked and Affine Transformed AutoEncoder for Self-supervised Point Cloud Learning

Generally, we corrupt the point cloud with affine transformation and masking as input and learn an encoder-decoder model to reconstruct the original point cloud from its corrupted version.
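
The corruption step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Point-MA2E implementation: the affine transformation is reduced to a random per-axis scale plus translation, masking drops individual points at random, and the function name corrupt_point_cloud, the scale and shift ranges, and the default mask_ratio are all hypothetical.

```python
import random


def corrupt_point_cloud(points, mask_ratio=0.5, seed=0):
    """Return an affine-transformed, masked copy of a 3D point cloud.

    Assumptions (not from the source): the affine map is a random
    per-axis scale plus a translation, and masking removes a random
    subset of individual points.
    """
    rng = random.Random(seed)
    scale = [rng.uniform(0.8, 1.2) for _ in range(3)]
    shift = [rng.uniform(-0.1, 0.1) for _ in range(3)]

    # Apply the same affine map to every point.
    transformed = [
        tuple(s * c + t for c, s, t in zip(point, scale, shift))
        for point in points
    ]

    # Keep a random subset; the remaining points are the masked ones.
    num_kept = len(points) - int(len(points) * mask_ratio)
    kept_indices = sorted(rng.sample(range(len(points)), num_kept))
    return [transformed[i] for i in kept_indices]


# Toy input: eight points; an encoder-decoder would be trained to
# reconstruct `cloud` from `corrupted`.
cloud = [(float(i), float(i) / 2, 1.0) for i in range(8)]
corrupted = corrupt_point_cloud(cloud, mask_ratio=0.5)
```

The reconstruction target in such a setup is the original, uncorrupted cloud, so the model has to undo both the transformation and the masking.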

Rethinking the transfer learning for FCN based polyp segmentation in colonoscopy

Besides the complex nature of colonoscopy frames with intrinsic frame formation artefacts such as light reflections and the diversity of polyp types/shapes, the publicly available polyp segmentation training datasets are limited, small and imbalanced.

Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

Chinese word segmentation (CWS) models have achieved very high performance when the training data is sufficient and in-domain.

Mitigating spectral bias for the multiscale operator learning with hierarchical attention

Neural operators have emerged as a powerful tool for learning the mapping between infinite-dimensional parameter and solution spaces of partial differential equations (PDEs).

Motion correction in MRI using deep learning and a novel hybrid loss function

Evaluation used simulated T1 and T2-weighted axial, coronal, and sagittal images unseen during training, as well as T1-weighted images with motion artifacts from real scans.


TLDW: Extreme Multimodal Summarisation of News Videos

Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data.

Learning Dual Memory Dictionaries for Blind Face Restoration

Generally, it is a challenging and intractable task to improve the photo-realistic performance of blind restoration and adaptively handle the generic and specific restoration scenarios with a single unified model.

Attention Diversification for Domain Generalization

Under this guidance, a novel Attention Diversification framework is proposed, in which Intra-Model and Inter-Model Attention Diversification Regularization collaborate to reassign appropriate attention to diverse task-related features.

From Face to Natural Image: Learning Real Degradation for Blind Image Super-Resolution

Notably, LQ face images, which may have the same degradation process as natural images, can be robustly restored with photo-realistic textures by exploiting their strong structural priors.

Skin Lesion Recognition with Class-Hierarchy Regularized Hyperbolic Embeddings

Accordingly, the learned prototypes preserve the semantic class relations in the embedding space and we can predict the label of an image by assigning its feature to the nearest hyperbolic class prototype.
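
The nearest-prototype prediction described above can be sketched as follows, assuming the embeddings live in the Poincaré ball model of hyperbolic space (the excerpt does not say which model is used). The function names and the toy class labels are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Both points must have Euclidean norm strictly below 1.
    """
    def squared_norm(x):
        return sum(a * a for a in x)

    squared_diff = squared_norm([a - b for a, b in zip(u, v)])
    denominator = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * squared_diff / denominator)


def nearest_prototype(feature, prototypes):
    """Return the label whose prototype is closest to the feature."""
    return min(
        prototypes,
        key=lambda label: poincare_distance(feature, prototypes[label]),
    )


# Toy prototypes with made-up labels; real prototypes would be learned.
prototypes = {"class_a": [0.5, 0.0], "class_b": [-0.5, 0.0]}
predicted = nearest_prototype([0.4, 0.1], prototypes)
```

Because distances grow rapidly near the boundary of the ball, prototypes placed at different radii can encode a class hierarchy, which is one common motivation for hyperbolic embeddings.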

Statistical Foundation Behind Machine Learning and Its Impact on Computer Vision

This paper revisits the principle of uniform convergence in statistical learning, discusses how it acts as the foundation behind machine learning, and attempts to gain a better understanding of the essential problem that current deep learning algorithms are solving.

Recurrent LSTM-based UAV Trajectory Prediction with ADS-B Information

It is noted that the recurrent neural network (RNN) is available for the UAV trajectory prediction, in which the long short-term memory (LSTM) is specialized in dealing with the time-series data.

Generative Action Description Prompts for Skeleton-based Action Recognition

More specifically, we employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions for body parts movements of actions, and propose a multi-modal training scheme by utilizing the text encoder to generate feature vectors for different body parts and supervise the skeleton encoder for action representation learning.

Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers on video data will bring heavy computation and memory burdens due to the largely increased number of patches and the quadratic complexity of self-attention computation.

Spatial-Temporal Federated Learning for Lifelong Person Re-identification on Distributed Edges

Then, the learnt knowledge from edge clients will be aggregated by a centralized parameter server, where the knowledge will be selectively and attentively distilled along the spatial and temporal dimensions with carefully designed mechanisms.

Auto Machine Learning for Medical Image Analysis by Unifying the Search on Data Augmentation and Neural Architecture

To address the problem, an improved augmentation search strategy, named Augmented Density Matching, was proposed by randomly sampling policies from a prior distribution for training.

A Survey on Leveraging Pre-trained Generative Adversarial Networks for Image Editing and Restoration

Generative adversarial networks (GANs) have drawn enormous attention due to the simple yet effective training mechanism and superior image generation quality.

Box-supervised Instance Segmentation with Level Set Evolution

A simple mask supervised SOLOv2 model is adapted to predict the instance-aware mask map as the level set for each instance.

Mind the Gap: Polishing Pseudo labels for Accurate Semi-supervised Object Detection

Instead of directly exploiting the pseudo labels produced by the teacher detector, we take the first attempt at reducing their deviation from ground truth using dual polishing learning, where two differently structured polishing networks are elaborately developed and trained using synthesized paired pseudo labels and the corresponding ground truth for categories and bounding boxes on the given annotated objects, respectively.

MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources

These algorithms are usually achieved by mapping the multi-channel audio input to a single output (i.e., the overall spatial pseudo-spectrum (SPS) of all sources), which is called MISO.

E2FIF: Push the limit of Binarized Deep Imagery Super-resolution using End-to-end Full-precision Information Flow

Binary neural network (BNN) provides a promising solution to deploy parameter-intensive deep single image super-resolution (SISR) models onto real devices with limited storage and computational resources.

Domain Gap Estimation for Source Free Unsupervised Domain Adaptation with Many Classifiers

However, for source free UDA, the source domain data cannot be accessed during adaptation, which poses a great challenge for measuring the domain gap.

Learning High-quality Proposals for Acne Detection

Acne detection is crucial for interpretative diagnosis and precise treatment of skin disease.

Masked Surfel Prediction for Self-Supervised Point Cloud Learning

In this work, we make the first attempt, to the best of our knowledge, to explicitly incorporate local geometry information into masked auto-encoding, and propose a novel Masked Surfel Prediction (MaskSurf) method.

Universal Domain Adaptive Object Detector

Universal domain adaptive object detection (UniDAOD) is more challenging than domain adaptive object detection (DAOD) since the label space of the source domain may not be the same as that of the target and the scale of objects in the universal scenarios can vary dramatically (i.e., category shift and scale shift).

Approximating Discontinuous Nash Equilibrial Values of Two-Player General-Sum Differential Games

This paper investigates two potential solutions to this problem: a hybrid method that leverages both supervised Nash equilibria and the HJI PDE, and a value-hardening method where a sequence of HJIs is solved with a gradually hardening reward.

Improving Nighttime Driving-Scene Segmentation via Dual Image-adaptive Learnable Filters

With DIAL-Filters, we design both unsupervised and supervised frameworks for nighttime driving-scene segmentation, which can be trained in an end-to-end manner.

Saliency Guided Inter- and Intra-Class Relation Constraints for Weakly Supervised Semantic Segmentation

Specifically, we propose a saliency guided class-agnostic distance module to pull the intra-category features closer by aligning features to their class prototypes.

SP-ViT: Learning 2D Spatial Priors for Vision Transformers

Unlike convolutional inductive biases, which are forced to focus exclusively on hard-coded local regions, our proposed SPs are learned by the model itself and take a variety of spatial relations into account.

Learning Domain Adaptive Object Detection with Probabilistic Teacher

In addition, we conduct anchor adaptation in parallel with localization adaptation, since the anchor can be regarded as a learnable parameter.

Are Transformers Effective for Time Series Forecasting?

Recently, there has been a surge of Transformer-based solutions for the long-term time series forecasting (LTSF) task.

Multiple Domain Cyberspace Attack and Defense Game Based on Reward Randomization Reinforcement Learning

In order to improve the defense ability of the defender, a game model based on reward randomization reinforcement learning is proposed.

SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images

The existing deep learning fusion methods mainly concentrate on convolutional neural networks, and few attempts have been made with transformers.

OTExtSum: Extractive Text Summarisation with Optimal Transport

Optimal sentence extraction is conceptualised as obtaining an optimal summary that minimises the transportation cost to a given document regarding their semantic distributions.

Dense Learning based Semi-Supervised Object Detection

Semi-supervised object detection (SSOD) aims to facilitate the training and deployment of object detectors with the help of a large amount of unlabeled data.

Rapid model transfer for medical image segmentation via iterative human-in-the-loop update: from labelled public to unlabelled clinical datasets for multi-organ segmentation in CT

Despite the remarkable success of deep learning in medical image analysis, how to rapidly transfer AI models from one dataset to another for clinical applications is still under exploration.

Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance

Concretely, we propose a Depth-Guided Outpainting Network to model different feature representations of two modalities and learn the structure-aware cross-modal fusion.

Large-Scale Pre-training for Person Re-identification with Noisy Labels

Since these ID labels automatically derived from tracklets inevitably contain noise, we develop a large-scale Pre-training framework utilizing Noisy Labels (PNL), which consists of three learning modules: supervised Re-ID learning, prototype-based contrastive learning, and label-guided contrastive learning.

Efficient and Degradation-Adaptive Network for Real-World Image Super-Resolution

Specifically, a tiny regression network is employed to predict the degradation parameters of the input image, while several convolutional experts with the same topology are jointly optimized to specify the network parameters via a non-linear mixture of experts.

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds

VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel by two cross-attentions and models features in a hidden space induced by a group of latent codes.

Towards Robust 2D Convolution for Reliable Visual Recognition

2D convolution (Conv2d), which is responsible for extracting features from the input image, is one of the key modules of a convolutional neural network (CNN).

A Dual Weighting Label Assignment Scheme for Object Detection

Existing LA methods mostly focus on the design of the positive (pos) weighting function, while the negative (neg) weight is directly derived from the pos weight.

Beyond a Video Frame Interpolator: A Space Decoupled Learning Approach to Continuous Image Transition

Most existing deep-learning-based video frame interpolation (VFI) methods adopt off-the-shelf optical flow algorithms to estimate the bidirectional flows and interpolate the missing frames accordingly.

Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution

In this paper, we demonstrate that it is possible to train a GAN-based SISR model which can stably generate perceptually realistic details while inhibiting visual artifacts.

Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization

In this work, we propose, for the first time to the best of our knowledge, to perform Exact Feature Distribution Matching (EFDM) by exactly matching the empirical Cumulative Distribution Functions (eCDFs) of image features, which can be implemented by applying Exact Histogram Matching (EHM) in the image feature space.
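The idea of matching eCDFs exactly can be illustrated with a rank-based sketch: when two feature vectors have the same length, assigning each content value the style value of the same rank makes the two empirical distributions identical. The function name `exact_histogram_match`, the flat-list inputs, and the index-order tie-breaking are assumptions made for illustration; this is not the authors' implementation, which operates on feature maps inside a network.

```python
def exact_histogram_match(content, style):
    """Give each content value the style value of the same rank.

    Illustrative assumption: both inputs are flat lists of equal length.
    Ties in `content` are broken by original index order (sorted is stable).
    """
    if len(content) != len(style):
        raise ValueError("content and style must have the same length")

    # Indices of the content values, ordered from smallest to largest.
    order = sorted(range(len(content)), key=lambda i: content[i])
    # Style values ordered from smallest to largest.
    sorted_style = sorted(style)

    matched = [0.0] * len(content)
    for rank, idx in enumerate(order):
        # The content element with this rank takes the style value of that rank.
        matched[idx] = sorted_style[rank]
    return matched


if __name__ == "__main__":
    content_features = [0.3, -1.0, 2.0, 0.5]
    style_features = [10.0, 40.0, 20.0, 30.0]
    print(exact_histogram_match(content_features, style_features))
```

The output keeps the ordering of the content values while its sorted values equal the sorted style values, so the two eCDFs coincide.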

Efficient Long-Range Attention Network for Image Super-resolution

A highly efficient long-range attention block (ELAB) is then built by simply cascading two shift-conv layers with a GMSA module, which is further accelerated by using a shared attention mechanism.

One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out

Based on the fact that adjacent frames in a short clip are highly coherent in content, we propose to extend the one-stage FiFo framework to a clip-in clip-out (CiCo) one, which performs VIS clip by clip.

Unfolded Deep Kernel Estimation for Blind Image Super-resolution

Nonetheless, the existing deep unfolding methods cannot explicitly solve the data term of the unfolding objective function, limiting their capability in blur kernel estimation.

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results.

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension.

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code to achieve a remarkable improvement.

BP-Triplet Net for Unsupervised Domain Adaptation: A Bayesian Perspective

In our work, considering the different importance of pair-wise samples for both feature learning and domain alignment, we deduce our BP-Triplet loss for effective UDA from the perspective of Bayesian learning.

Winograd Convolution: A Perspective from Fault Tolerance

Winograd convolution was originally proposed to reduce computing overhead by replacing multiplications in neural networks (NNs) with additions via linear transformations.
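As a small illustration of trading multiplications for additions, the sketch below uses the standard 1D Winograd F(2,3) minimal filtering algorithm, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. The function names are hypothetical, and the sketch covers only the arithmetic identity, not the fault-tolerance analysis of the paper.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [y0, y1]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3): 4 multiplications.

    The filter-side terms (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2 depend
    only on the weights, so in practice they can be precomputed once.
    """
    g_sum = (g[0] + g[1] + g[2]) / 2.0
    g_alt = (g[0] - g[1] + g[2]) / 2.0

    # The four element-wise products of the transformed input and filter.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]

    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    inputs = [1.0, 2.0, 3.0, 4.0]
    weights = [1.0, 0.0, -1.0]
    print(direct_f23(inputs, weights))
    print(winograd_f23(inputs, weights))
```

Both functions compute the same values up to floating-point rounding; the saving comes from moving work into the input, filter, and output transforms.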

DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR

We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR.

Adversarial Examples for Good: Adversarial Examples Guided Imbalanced Learning

Adversarial examples are inputs for machine learning models that have been designed by attackers to cause the model to make mistakes.

Online Multi-Object Tracking with Unsupervised Re-Identification Learning and Occlusion Estimation

In addition, this practice of re-identification still cannot track highly occluded objects when they are missed by the detector.

Towards Efficient Data Free Black-Box Adversarial Attack

The proposed method can efficiently imitate the target model through a small number of queries and achieve a high attack success rate.

Neural Architecture Search With Representation Mutual Information

Building upon RMI, we further propose a new search algorithm termed RMI-NAS, supported by a theorem that guarantees the global optimality of the searched architecture.

DeePN$^2$: A deep learning-based non-Newtonian hydrodynamic model

A long-standing problem in the modeling of non-Newtonian hydrodynamics of polymeric flows is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics.

CausalMTA: Eliminating the User Confounding Bias for Causal Multi-touch Attribution

Existing methods first train a model to predict the conversion probability of the advertisement journeys with historical data and calculate the attribution of each touchpoint using counterfactual predictions.

Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions

Though deep-learning-based object detection methods have achieved promising results on conventional datasets, it is still challenging to locate objects in low-quality images captured in adverse weather conditions.

