Information extraction from documents such as receipts or invoices is a fundamental and crucial step for office automation.
We propose a deep convolutional neural network (CNN) to estimate surface normals from a single color image accompanied by a low-quality depth channel.
Online mis- and disinformation have become a significant societal problem, serving as major sources of diverse online harms.
We verify that importance sampling the seed chain in the continuous space achieves the goal of importance sampling the discrete admissible specular chain.
Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation.
The main idea of multimodal recommendation is to make rational use of an item's multimodal information to improve recommendation performance.
CLIP (Contrastive Language-Image Pretraining) is well developed for open-vocabulary zero-shot image-level recognition, but its application to pixel-level tasks is less investigated; most efforts directly adopt CLIP features without deliberate adaptation.
Recent methods mostly rely on convolutional neural networks (CNNs) to fill in the missing content of the warped panorama.
Recently, Neural Radiance Fields (NeRF) have emerged as a potent method for synthesizing novel views from a dense set of images.
Extensive experiments and comparisons demonstrate the superiority and generalization ability of our method, which achieves state-of-the-art performance on unsupervised completion of real-scene objects.
Then, a multi-granularity shared space is established with the designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module, which strengthens the semantic correspondence between local and global information and yields more accurate feature representations for the image and text modalities.
Automatic keyword extraction (AKE) has grown in importance with the increasing amount of digital textual data that modern computing systems process.
30 Jun 2022 • Yuting Wang, Hangning Zhou, Zhigang Zhang, Chen Feng, Huadong Lin, Chaofei Gao, Yizhi Tang, Zhenting Zhao, Shiyu Zhang, Jie Guo, Xuefeng Wang, Ziyao Xu, Chi Zhang
This technical report presents an effective method for motion prediction in autonomous driving.
Ranked #12 on Motion Forecasting on Argoverse
CVPR 2020
In this paper, we propose an end-to-end network, named CS-Net, to complete point clouds contaminated by noise or containing outliers.
In this paper, we propose a novel point cloud simplification network (PCS-Net) dedicated to high-quality surface mesh reconstruction while maintaining geometric fidelity.
At its core is a new lighting model (dubbed DSGLight) based on depth-augmented Spherical Gaussians (SG) and a Graph Convolutional Network (GCN) that infers the new lighting representation from a single LDR image of limited field-of-view.
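The Spherical Gaussian representation named above admits a compact sketch. The following is a minimal numpy illustration of the standard SG lobe, G(v) = a · exp(λ(v·μ − 1)), and of summing a mixture of lobes into incident lighting; DSGLight's depth augmentation and GCN inference are not modeled here, and all function and parameter names are our own:

```python
import numpy as np

def eval_sg(v, mu, lam, a):
    """Evaluate one Spherical Gaussian lobe G(v) = a * exp(lam * (v . mu - 1)).

    v   : (N, 3) unit query directions
    mu  : (3,)   unit lobe axis
    lam : scalar lobe sharpness
    a   : (3,)   RGB amplitude
    """
    dot = v @ mu                                     # (N,) cosine to lobe axis
    return np.exp(lam * (dot - 1.0))[:, None] * a    # (N, 3) radiance

def eval_lighting(v, lobes):
    """Sum a mixture of SG lobes to get incident radiance per direction."""
    out = np.zeros((v.shape[0], 3))
    for mu, lam, a in lobes:
        out += eval_sg(v, np.asarray(mu, float), lam, np.asarray(a, float))
    return out
```

At the lobe axis the exponent vanishes and the lobe returns its full amplitude, which is what makes the per-lobe parameters directly interpretable as light direction, sharpness, and color.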
In this paper, we propose a learning-based method for predicting dense depth values of a scene from a monocular omnidirectional image.
Ranked #7 on Depth Estimation on Stanford2D3D Panoramic
The TOAA block processes low-level information with an attention mechanism in both row and column directions and fuses it with high-level information to capture the shape characteristics of targets and suppress noise.
The vigorous development of the Internet of Things makes it possible to extend its computing and storage capabilities to computing tasks in aerial systems through cloud-edge collaboration, especially for artificial intelligence (AI) tasks based on deep learning (DL).
For efficiency, we train the network in two stages: reusing a trained model to initialize the SVBRDFs, and then fine-tuning it based on the input image.
However, most of them ignore the domain generalization scenario and scale variance, performing poorly under domain shift and degrading further under intra-domain and inter-domain scale variance.
In order to obtain the similarity of a pair of videos, we predict the alignment scores between all pairs of temporal positions in the two videos with the temporal alignment prediction function.
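The pairwise alignment scheme described above can be sketched briefly. Here a plain cosine similarity stands in for the learned temporal alignment prediction function, and the max-then-mean aggregation is one plausible reduction of the score matrix, not necessarily the paper's:

```python
import numpy as np

def alignment_scores(feat_a, feat_b):
    """Score every pair of temporal positions: (Ta, D) x (Tb, D) -> (Ta, Tb).
    Cosine similarity is used as a stand-in for the learned alignment
    prediction function."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return a @ b.T

def video_similarity(feat_a, feat_b):
    """Aggregate the pairwise alignment matrix into a single similarity by
    averaging each temporal position's best match, symmetrized over the
    two videos (an illustrative choice of reduction)."""
    s = alignment_scores(feat_a, feat_b)
    return 0.5 * (s.max(axis=1).mean() + s.max(axis=0).mean())
```

Because every temporal position in one video is scored against every position in the other, the similarity is robust to differences in pacing between the two clips.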
We demonstrate that local geometry has a greater impact on the sound than global geometry and offers more cues for material recognition.
This leads to a new problem of confidence discrepancy for the detector ensembles.
Natural language video localization (NLVL), which aims to locate within a video a target moment that semantically corresponds to a text query, is a novel and challenging task.
We consider the scattering of light in participating media composed of sparsely and randomly distributed discrete particles.
However, naively compressing an outdoor panorama into a low-dimensional latent vector, as existing models have done, causes two major problems.
We consider online change detection of high dimensional data streams with sparse changes, where only a subset of data streams can be observed at each sensing time point due to limited sensing capacities.
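The limited-sensing setting described above can be illustrated with a minimal sketch: per-stream one-sided CUSUM statistics, where only k streams are sensed per time point. The greedy rule of sensing the streams with the largest current statistics is an assumption for illustration, not necessarily the paper's allocation policy, and all names are our own:

```python
import numpy as np

class PartialSensingCusum:
    """Monitor d streams for a positive mean shift, observing only k per step.

    Each sensed stream i updates a one-sided CUSUM statistic
        W_i <- max(0, W_i + x_i - drift),
    and the k streams with the largest current statistics are sensed next
    (a greedy allocation, assumed here for illustration).
    """

    def __init__(self, d, k, drift=0.5, threshold=5.0):
        self.w = np.zeros(d)
        self.k, self.drift, self.threshold = k, drift, threshold

    def step(self, x):
        """x: length-d data vector; only the sensed entries are used.
        Returns (alarm, sensed_indices)."""
        sensed = np.argsort(self.w)[-self.k:]          # largest statistics
        self.w[sensed] = np.maximum(0.0, self.w[sensed] + x[sensed] - self.drift)
        return bool(self.w.max() > self.threshold), sensed
```

A stream with a sustained shift accumulates statistic mass, so the greedy rule keeps re-sensing it until the threshold is crossed, while unaffected streams hover near zero.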
To the best of our knowledge, we are the first to enhance facial attractiveness with GANs in both the geometry and appearance aspects.
We propose the OpenHybrid framework, which is composed of an encoder to encode the input data into a joint embedding space, a classifier to classify samples to inlier classes, and a flow-based density estimator to detect whether a sample belongs to the unknown category.
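The three-part layout described above (shared encoder, closed-set classifier, density estimator for unknowns) can be sketched compactly. In this illustration the encoder is a fixed linear projection, the classifier is nearest-centroid, and the flow-based density estimator is replaced by a Gaussian kernel density estimate; all of these substitutions and names are our own, chosen only to show how the pieces compose:

```python
import numpy as np

class OpenHybridSketch:
    """Toy composition: encoder -> {classifier, density estimator}.
    Samples whose estimated density falls below tau are flagged unknown."""

    def __init__(self, proj, centroids, bandwidth=1.0, tau=1e-3):
        self.proj = proj              # (D, d) linear "encoder" stand-in
        self.centroids = centroids    # (C, d) one centroid per inlier class
        self.bw, self.tau = bandwidth, tau
        self.train_z = None

    def encode(self, x):
        return x @ self.proj

    def fit_density(self, x_train):
        self.train_z = self.encode(x_train)

    def density(self, x):
        z = self.encode(x)                                        # (N, d)
        d2 = ((z[:, None, :] - self.train_z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.bw ** 2)).mean(axis=1)      # KDE stand-in

    def predict(self, x):
        """Class index for inliers, -1 for detected unknowns."""
        z = self.encode(x)
        cls = ((z[:, None, :] - self.centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        return np.where(self.density(x) < self.tau, -1, cls)
```

The key design point survives the simplification: classification and open-set rejection share one embedding, so the density threshold operates in the same space the classifier is trained on.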
To this end, we propose to build a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture the appearance and position relation between actors.
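One common way to instantiate such a relation graph is to combine a dot-product appearance affinity with a Gaussian of pairwise distance and row-normalize the result; ARG's exact relation functions may differ, so the sketch below is illustrative only:

```python
import numpy as np

def relation_graph(appearance, positions, sigma=1.0):
    """Build an N x N actor relation graph from appearance and position.

    appearance : (N, D) actor appearance features
    positions  : (N, 2) actor center coordinates
    Appearance relation: dot-product similarity; position relation: Gaussian
    of pairwise distance; their product is row-normalized with a softmax.
    """
    app = appearance @ appearance.T                            # appearance term
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    pos = np.exp(-d2 / (2 * sigma ** 2))                       # position term
    logits = app * pos
    e = np.exp(logits - logits.max(axis=1, keepdims=True))     # stable softmax
    return e / e.sum(axis=1, keepdims=True)                    # rows sum to 1
```

The row-stochastic output can be used directly as the adjacency of a graph convolution over actor features, which is what makes appearance and position relations trainable end to end.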
Ranked #3 on Group Activity Recognition on Collective Activity
In recent years, with the development of the marine industry, the navigation environment has become more complicated.
Specifically, the Multimodal IRIS model consists of three modules, i.e., a multimodal feature learning module, an Interest-Related Network (IRN) module, and an item similarity recommendation module.
The sparsity and self-similarity of the image blocks are taken as the constraints.
We propose a sparse and low-rank reflection model for specular highlight detection and removal using a single input image.
Deep convolutional neural networks (CNNs) have dominated many computer vision domains owing to their strong ability to extract good features automatically.
Localization of the target object for data retrieval is a key issue in Intelligent and Connected Transportation Systems (ICTS).
In this paper, we introduce a novel end-to-end framework for multi-oriented scene text detection from an instance-aware semantic segmentation perspective.
Ranked #12 on Scene Text Detection on MSRA-TD500