PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
Full-waveform inversion (FWI) is a powerful geophysical imaging technique that infers high-resolution subsurface physical parameters by solving a non-convex optimization problem.
Previous studies have revealed that ICL is sensitive to the selection and the ordering of demonstrations.
Experiments on our proposed datasets demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain.
Considering that Vision-Language Pre-Training (VLP) models acquire a vast amount of such knowledge from large-scale web-harvested data, it is promising to use the generalizability of VLP models to incorporate knowledge into image descriptions.
Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns.
In this paper, we introduce two types of novel Asymptotic-Preserving Convolutional Deep Operator Networks (APCONs) designed to address the multiscale time-dependent linear transport problem.
To address this issue, this paper proposes an extension to PINNs called Laplace-based fractional physics-informed neural networks (Laplace-fPINNs), which can effectively solve the forward and inverse problems of fractional diffusion equations.
To alleviate the problem, we propose a global distribution fitting (GDF) method that uses a penalty term to constrain the distribution of the generated data.
A large number of numerical experiments demonstrate that the operator learning method proposed in this work can efficiently solve the forward problems and Bayesian inverse problems of the subdiffusion equation.
In recent years, vision and language pre-training (VLP) models have advanced the state-of-the-art results in a variety of cross-modal downstream tasks.
Data imputation is an effective way to handle missing data, which is common in practical applications.
To this end, we present OpenCalib, a calibration toolbox that contains a rich set of sensor calibration methods.
With CRPD, a unified detection and recognition network with high efficiency is presented as the baseline.
In this paper, we propose a machine learning approach via a model-operator-data network (MOD-Net) for solving PDEs.
It draws class-wise features closer than either coarse feature alignment or class-wise feature alignment alone, and therefore improves the model's performance to a great extent.
Channel pruning is broadly recognized as an effective approach to obtaining a small, compact model by eliminating unimportant channels from a large, cumbersome network.
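As a rough illustration of what "eliminating unimportant channels" can look like, the sketch below ranks channels by the L1 norm of their filter weights and keeps only the strongest ones. The L1-norm criterion is a generic magnitude heuristic chosen here for illustration and is not claimed to be the method of the listed paper; the names `l1_norm`, `prune_channels`, and `keep_ratio` are hypothetical.

```python
def l1_norm(filt):
    """Importance score (assumed criterion): sum of absolute weights."""
    return sum(abs(w) for w in filt)


def prune_channels(filters, keep_ratio):
    """Keep the highest-scoring channels of one layer.

    filters: list of channels, each a flat list of float weights.
    keep_ratio: fraction of channels to keep (rounded down, at least 1).
    Returns (kept_indices, kept_filters), both in original channel order.
    """
    n_keep = max(1, int(len(filters) * keep_ratio))
    # Rank channel indices from most to least important.
    ranked = sorted(range(len(filters)),
                    key=lambda i: l1_norm(filters[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [filters[i] for i in kept]


# Toy layer with four channels; two of them have near-zero weights.
layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.5], [-0.03, 0.0]]
kept_idx, small_layer = prune_channels(layer, keep_ratio=0.5)
```

In a real network, removing an output channel also requires removing the matching input channel of the next layer, and the pruned model is usually fine-tuned afterwards.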
Why heavily parameterized neural networks (NNs) do not overfit the data is an important, long-standing open question.
A supervised learning problem is to find a function in a hypothesis function space given values on isolated data points.
Recent works show an intriguing phenomenon, the Frequency Principle (F-Principle): deep neural networks (DNNs) fit the target function from low to high frequency during training, which provides insight into the training and generalization behavior of DNNs in complex tasks.
To handle the data explosion in the era of the Internet of Things (IoT), it is of interest to investigate decentralized networks, with the aim of relieving the burden on the central server while preserving data privacy.
In this work, inspired by the phase diagram in statistical mechanics, we draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit for a complete characterization of its dynamical regimes and their dependence on hyperparameters related to initialization.
Borrowing ideas from physics, we propose path integral based graph neural networks (PAN) for classification and regression tasks on graphs.
We study the problem of distilling knowledge from a large deep teacher network to a much smaller student network for the task of road marking segmentation.
Ranked #1 on Semantic Segmentation on ApolloScape
We demonstrate that our two-stream architecture is robust to adversarial examples built by currently known attacking algorithms.
The input of each pooling layer is transformed by the compressive Haar basis of the corresponding clustering.
To achieve high coverage of target boxes, a common strategy of conventional one-stage anchor-based detectors is to utilize multiple priors at each spatial position, especially in scene text detection tasks.
Recently, scene text recognition methods based on deep learning have proliferated in the computer vision community.
Training deep models for lane detection is challenging due to the very subtle and sparse supervisory signals inherent in lane annotations.
Ranked #6 on Lane Detection on BDD100K val
Graph Neural Networks (GNNs) have become a topic of intense research recently due to their powerful capability in high-dimensional classification and regression tasks for graph-structured data.
Alongside the fruitful applications of Deep Neural Networks (DNNs) to realistic problems, recent empirical studies of DNNs have reported a universal phenomenon, the Frequency Principle (F-Principle): a DNN tends to learn a target function from low to high frequencies during training.
It remains a puzzle why deep neural networks (DNNs), with more parameters than samples, often generalize well.
Overall, our work serves as a baseline for the further investigation of the impact of initialization and loss function on the generalization of DNNs, which can potentially guide and improve the training of DNNs in practice.
In this paper, we propose PAN, a new graph convolution framework that involves every path linking the message sender and receiver with learnable weights depending on the path length, which corresponds to the maximal entropy random walk.
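One way to picture path-length-dependent weights is a propagation matrix built as a weighted sum of powers of the adjacency matrix, since entry (i, j) of A^n counts the length-n walks from node i to node j. The sketch below is a simplified illustration under stated assumptions: the weights are fixed numbers rather than learned parameters, the sum is cut off at a maximum length, and row normalization is chosen for simplicity. The function names are hypothetical and the code is not the authors' implementation.

```python
def matmul(a, b):
    """Plain dense matrix product for nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def path_weighted_propagation(adj, weights):
    """Return the row-normalized matrix sum_n weights[n] * adj^n.

    weights[n] is the weight given to walks of length n
    (n = 0 is the self-connection).
    """
    n = len(adj)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # adj^0
    total = [[0.0] * n for _ in range(n)]
    for w in weights:
        for i in range(n):
            for j in range(n):
                total[i][j] += w * power[i][j]
        power = matmul(power, adj)  # move on to the next walk length
    # Assumed normalization: each row sums to 1 (rows of zeros stay zero).
    normalized = []
    for row in total:
        s = sum(row)
        normalized.append([v / s if s else 0.0 for v in row])
    return normalized


def propagate(adj, weights, features):
    """One message-passing step: mix node features along weighted walks."""
    return matmul(path_weighted_propagation(adj, weights), features)


# Path graph 0 - 1 - 2, with weight 1.0 for length 0 and 0.5 for length 1.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
out = propagate(adj, [1.0, 0.5], [[1.0], [0.0], [0.0]])
```

Extending `weights` to longer lengths lets a node receive messages from more distant nodes in a single step, with the contribution of each walk length controlled by its own weight.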
3D face reconstruction from a single 2D image is a challenging problem with broad applications.
Ranked #7 on Face Alignment on AFLW2000-3D
Since sparse unmixing emerged as a promising approach to hyperspectral unmixing, spatial-contextual information in hyperspectral images has recently been exploited to improve unmixing performance.
We propose a CNN framework using sparsely labeled data from the target domain to learn features that are invariant across domains for face anti-spoofing.
Spatio-temporal information is very important to capture the discriminative cues between genuine and fake faces from video sequences.
Face anti-spoofing (a.k.a. presentation attack detection) has drawn growing attention due to the high-security demand in face authentication systems.
Ranked #2 on Face Anti-Spoofing on MSU-MFSD
Reinforcement learning agents need exploratory behaviors to escape from local optima.
In this paper, we considerably improve the accuracy and robustness of predictions through heterogeneous auxiliary networks feature mimicking, a new and effective training method that provides us with much richer contextual signals apart from steering direction.
Ranked #1 on Steering Control on BDD100K val
Previous approaches for scene text detection usually rely on manually defined sliding windows.
Ranked #1 on Scene Text Detection on COCO-Text
The goal of this paper is to evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking.
For each region, a sliding window (ROI) is passed over the density map to calculate the instance count within each ROI.
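The sliding-window count described above amounts to summing the density map inside each ROI. A minimal sketch follows, assuming a rectangular window, a fixed stride, and a density map given as a nested list; the function name `roi_counts` and its parameters are hypothetical. An integral image is used so that each window sum costs four lookups.

```python
def roi_counts(density, win_h, win_w, stride=1):
    """Sum the density map inside each sliding window (ROI).

    density: 2D nested list of non-negative floats.
    Returns a 2D list where entry [r][c] is the estimated count in the
    window whose top-left corner is (r * stride, c * stride).
    """
    rows, cols = len(density), len(density[0])
    # integral[i][j] holds the sum of density over rows < i and columns < j.
    integral = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            integral[i + 1][j + 1] = (density[i][j] + integral[i][j + 1]
                                      + integral[i + 1][j] - integral[i][j])
    counts = []
    for top in range(0, rows - win_h + 1, stride):
        row = []
        for left in range(0, cols - win_w + 1, stride):
            bottom, right = top + win_h, left + win_w
            row.append(integral[bottom][right] - integral[top][right]
                       - integral[bottom][left] + integral[top][left])
        counts.append(row)
    return counts


# Toy 4x4 density map: one object spread over two pixels, one on a single pixel.
density = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
counts = roi_counts(density, win_h=2, win_w=2, stride=2)
```

With overlapping windows (a stride smaller than the window size), the same object contributes to several ROIs, so the per-window counts should not simply be added up to obtain a total.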
Next, the number of people is estimated in a set of overlapping sliding windows on the temporal slice image, using a regression function that maps from local features to a count.