In this work, we focus on imitator learning based on only one expert demonstration.
The capacity of a modern deep learning system to determine if a sample falls within its realm of knowledge is fundamental and important.
Addressing this, we study video captioning from a different perspective in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed.
Ranked #5 on Video Captioning on VATEX
In the domain-aware stage, we apply a low-cost prompt tuning paradigm to learn soft visual prompts from an in-domain dataset for equipping the pretrained models with object-level and scene-level cross-modal alignment in VLN tasks.
Diffusion models have emerged as potential tools to tackle the challenge of sparse-view CT reconstruction, displaying superior performance compared to conventional methods.
The key insight is that if the UDF is estimated correctly, the 3D points should be locally projected onto the underlying surface following the gradient of the UDF.
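The projection step described above can be sketched as follows. This is a minimal illustration only, assuming an analytic unsigned distance field (UDF) of a unit sphere in place of a learned one; the names `udf_sphere`, `numerical_gradient`, and `project_to_surface` are hypothetical and do not come from the paper.

```python
import math

def udf_sphere(p, radius=1.0):
    """Unsigned distance from point p to a sphere centred at the origin (assumed stand-in for a learned UDF)."""
    return abs(math.sqrt(sum(c * c for c in p)) - radius)

def numerical_gradient(f, p, eps=1e-5):
    """Central-difference gradient of f at p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad

def project_to_surface(f, p):
    """Move p against the normalized gradient by the distance f(p)."""
    d = f(p)
    g = numerical_gradient(f, p)
    norm = math.sqrt(sum(c * c for c in g))
    if norm == 0.0:
        # Gradient is undefined or zero (e.g. exactly on the surface); leave the point in place.
        return list(p)
    return [pc - d * gc / norm for pc, gc in zip(p, g)]

outside = project_to_surface(udf_sphere, [0.0, 2.0, 0.0])
inside = project_to_surface(udf_sphere, [0.5, 0.0, 0.0])
```

If the UDF is accurate, both projected points should land on the sphere, which is the self-consistency property the snippet refers to.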
For the first time, randomized optimization is made possible in neural tracking with several key designs to the learning process, enabling efficient and robust tracking even under fast camera motions.
The commonly adopted detect-then-match approach to registration struggles in cross-modality cases due to incompatible keypoint detection and inconsistent feature description.
They seek correspondences over downsampled superpoints, which are then propagated to dense points.
3D plane recovery from a single image can usually be divided into several subtasks of plane detection, segmentation, parameter estimation and possibly depth estimation.
In particular, we propose RayMVSNet which learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth.
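To make the zero-crossing idea concrete, here is a minimal sketch that locates the depth at which a sampled 1D field along one ray changes sign. It assumes the field is positive in front of the surface and non-positive behind it, and it uses linear interpolation between samples; both the sign convention and the function name `zero_crossing_depth` are assumptions, not details taken from the paper.

```python
def zero_crossing_depth(depths, values):
    """Return the interpolated depth of the first positive-to-non-positive sign change, or None."""
    for i in range(len(values) - 1):
        v0, v1 = values[i], values[i + 1]
        if v0 > 0.0 and v1 <= 0.0:
            # Linear interpolation between the two samples that bracket the crossing.
            t = v0 / (v0 - v1)
            return depths[i] + t * (depths[i + 1] - depths[i])
    return None

# Hypothetical samples along a single camera ray.
sample_depths = [1.0, 1.5, 2.0, 2.5]
field_values = [0.8, 0.3, -0.2, -0.6]
depth = zero_crossing_depth(sample_depths, field_values)
```

In this toy input the sign change lies between the samples at depths 1.5 and 2.0, so the estimate falls inside that interval.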
Surface reconstruction from raw point clouds, which is in high demand for modeling and rendering applications today, has been studied for decades in the computer graphics community.
We find that decoupling the diffusion process reduces the learning difficulty and the explicit transition probability improves the generative speed significantly.
Generative models trained with Differential Privacy (DP) are increasingly used to produce synthetic data while reducing privacy risks.
We show that if these auxiliary densities are constructed such that they overlap with $p$ and $q$, then a multi-class logistic regression allows for estimating $\log p/q$ on the domain of any of the $K+2$ distributions and resolves the distribution shift problems of the current state-of-the-art methods.
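The link between multi-class classification and the log ratio can be checked numerically. The sketch below assumes equal class priors and uses the Bayes-optimal class posterior as a stand-in for a trained multi-class logistic regression; the three Gaussians (with one auxiliary density `m`, so $K = 1$) are hypothetical choices made only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical densities: p, q, and one auxiliary density m that overlaps both.
DENSITIES = {"p": (-2.0, 1.0), "q": (2.0, 1.0), "m": (0.0, 2.0)}

def class_posteriors(x):
    """Bayes-optimal P(class | x) under equal class priors (stand-in for a trained classifier)."""
    likelihoods = {name: gaussian_pdf(x, mu, sd) for name, (mu, sd) in DENSITIES.items()}
    total = sum(likelihoods.values())
    return {name: value / total for name, value in likelihoods.items()}

def log_ratio_from_classifier(x):
    """log p(x)/q(x) recovered as a difference of log class posteriors."""
    post = class_posteriors(x)
    return math.log(post["p"]) - math.log(post["q"])

def true_log_ratio(x):
    return math.log(gaussian_pdf(x, *DENSITIES["p"])) - math.log(gaussian_pdf(x, *DENSITIES["q"]))

estimate = log_ratio_from_classifier(1.0)
reference = true_log_ratio(1.0)
```

The normalizing sum over all classes cancels in the difference, which is why the auxiliary class does not bias the recovered ratio in this idealized setting.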
Video super-resolution commonly uses a frame-wise alignment to support the propagation of information over time.
Ranked #1 on Video Super-Resolution on REDS4 - 4x upscaling
For re-rendering, we propose a differentiable specular rendering layer to render low-frequency non-Lambertian materials under various illuminations of spherical harmonics.
We propose Semantically-aware Object Coordinate Space (SOCS) built by warping-and-aligning the objects guided by a sparse set of keypoints with semantically meaningful correspondence.
We first design a local spatial consistency measure over the deformation graph of the point cloud, which evaluates the spatial compatibility only between the correspondences in the vicinity of a graph node.
To tackle the challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement.
We study the problem of reconstructing 3D feature curves of an object from a set of calibrated multi-view images.
In this work, we present Multi-Symmetry Ensembles (MSE), a framework for constructing diverse ensembles by capturing the multiplicity of hypotheses along symmetry axes, which explore the hypothesis space beyond stochastic perturbations of model weights and hyperparameters.
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
Light goods vehicles (LGV) used extensively in the last mile of delivery are one of the leading polluters in cities.
State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune the pre-trained model (e.g.
Among them, the differential Laplacian regularizer effectively alleviates the lack of smoothness in the implicit surface caused by deteriorating point cloud quality. Meanwhile, to reduce excessive smoothing at the edge regions of the implicit surface, we propose a dynamic edge extraction strategy for sampling near the sharp edges of the point cloud, which effectively prevents the Laplacian regularizer from smoothing all regions.
Extensive experiments on real-world scenarios demonstrate that our method achieves the best of both worlds in accuracy, efficiency, and generalization.
We study the problem of learning online packing skills for irregular 3D shapes, which is arguably the most challenging setting of bin packing problems.
Therefore, we propose a novel depth map fusion module to combine the advantages of estimations with multi-resolution inputs.
However, leveraging a 3D scene representation can be prohibitively impractical for policy learning in this floor-level task, due to low sample efficiency and high computational cost.
Given a few object manipulation demos, NIFT guides the generation of the interaction imitation for a new object instance by matching the Neural Interaction Template (NIT) extracted from the demos in the target Neural Interaction Field (NIF) defined for the new object.
We introduce a well-targeted down-sampling strategy that focuses more on edge area for efficient feature extraction of complex geometry.
One of the most important tasks in recommender systems is to predict the potential connection between two nodes under a specific edge type (i.e., relationship).
Seeing as a systematic outlier is a combination of patterns of a clean instance and systematic error patterns, our main insight is that inliers can be modelled by a smaller representation (subspace) in a model than outliers.
We propose a general, flexible, and scalable framework dpart, an open source Python library for differentially private synthetic data generation.
To resolve the sample efficiency issue in learning the high-dimensional and complex control of dexterous grasping, we propose an effective representation of grasping state characterizing the spatial interaction between the gripper and the target object.
However, adopting relations between all the object or patch proposals for detection is inefficient, and an imbalanced combination of local and global relations brings extra noise that could mislead the training.
A standard hardware bottleneck when training deep neural networks is GPU memory.
Context has proven to be one of the most important factors in object layout reasoning for 3D scene understanding.
In the field of 3D object detection, previous methods have taken advantage of context encoding, graph embedding, or explicit relation reasoning to extract relation context.
Such sparse and loose matching requires contextual features capturing the geometric structure of the point clouds.
Each level of the tree corresponds to an assembly of shape parts, represented as implicit functions, to reconstruct the input shape.
In particular, we impose a consistency regularization which enforces the outputs from each of the multiple layers to be consistent for the input image and its perturbed counterpart.
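A consistency term of this kind can be sketched as follows. The example assumes the per-layer outputs are already available as flat lists of floats and uses a mean squared difference averaged over layers; the squared-error choice and the name `consistency_loss` are assumptions, since the snippet does not state the exact distance used.

```python
def consistency_loss(outputs_clean, outputs_perturbed):
    """Average over layers of the mean squared difference between paired outputs."""
    layer_losses = []
    for clean, perturbed in zip(outputs_clean, outputs_perturbed):
        squared = [(a - b) ** 2 for a, b in zip(clean, perturbed)]
        layer_losses.append(sum(squared) / len(squared))
    return sum(layer_losses) / len(layer_losses)

# Hypothetical outputs from two layers for an image and its perturbed counterpart.
clean_outputs = [[0.0, 1.0], [1.0, 0.0, 0.5]]
perturbed_outputs = [[1.0, 1.0], [1.0, 0.0, 0.5]]
loss = consistency_loss(clean_outputs, perturbed_outputs)
```

The loss is zero exactly when every layer produces the same output for both inputs, so minimizing it pushes the network toward perturbation-invariant predictions.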
In this paper, we introduce a neural architecture, termed Box2Seg, to learn point-level semantics of 3D point clouds with bounding box-level supervision.
Weakly supervised learning can help local feature methods to overcome the obstacle of acquiring a large-scale dataset with densely labeled correspondences.
Ranked #1 on Camera Localization on Aachen Day-Night benchmark
In this paper, we propose a Bayesian-symbolic framework (BSP) for physical reasoning and learning that is close to human-level sample-efficiency and accuracy.
These approaches, however, are limited in their ability to capture the underlying neural dynamics (e.g. linear) and in their ability to relate the learned dynamics back to the observed behaviour (e.g. no time lag).
PCT is a full-fledged description of the state and action space of bin packing which can support packing policy learning based on deep reinforcement learning (DRL).
As such, estimating density ratios accurately using only samples from $p$ and $q$ is of high significance and has led to a flurry of recent work in this direction.
In this problem, the items are delivered to the agent without revealing the full sequence in advance.
We propose an efficient plug-and-play acceleration framework for semi-supervised video object segmentation by exploiting the temporal redundancies in videos presented by the compressed bitstream.
Traffic simulators act as an essential component in the operating and planning of transportation systems.
We propose to tackle the difficulties of fast-motion camera tracking in the absence of inertial measurements using random optimization, in particular, the Particle Filter Optimization (PFO).
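The general flavour of random (sampling-based) optimization can be illustrated with a small sketch. This is not the Particle Filter Optimization used in the paper: it is a generic sample-and-select loop with a shrinking search radius, and the 2D quadratic cost is a hypothetical stand-in for a camera pose objective. All names and parameter values are assumptions.

```python
import random

def random_optimize(cost, start, iterations=50, particles=64, sigma=1.0, decay=0.9, seed=0):
    """Sample particles around the current best estimate, keep the best, and shrink the spread."""
    rng = random.Random(seed)
    best = list(start)
    best_cost = cost(best)
    for _ in range(iterations):
        center = list(best)  # all particles of this iteration are drawn around the same center
        for _ in range(particles):
            candidate = [c + rng.gauss(0.0, sigma) for c in center]
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
        sigma *= decay
    return best, best_cost

def toy_cost(x):
    """Hypothetical objective with its minimum at (3, -2)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2

initial = [0.0, 0.0]
solution, final_cost = random_optimize(toy_cost, initial)
```

Because no gradient is required, a loop of this shape can still make progress when the objective is poorly behaved, which is one motivation for sampling-based search under large inter-frame motion.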
In this work, rather than defining a continuous or discrete kernel, we directly embed convolutional kernels into the learnable potential fields, giving rise to potential convolution.
Learning-based 3D shape segmentation is usually formulated as a semantic labeling problem, assuming that all parts of training shapes are annotated with a given set of tags.
According to the theory of geometric stability analysis, a minimal set of three planar/cylindrical patches is geometrically stable and determines the full 6DoFs of the object pose.
In this paper, we propose a selective sensing framework that adopts the novel concept of data-driven nonuniform subsampling to reduce the dimensionality of acquired signals while retaining the information of interest in a computation-free fashion.
As such, learning the laws is then reduced to symbolic regression and Bayesian inference methods are used to obtain the distribution of unobserved properties.
The masses of tetraquark states of all $qc\bar q \bar c$ and $cc\bar c \bar c$ quark configurations are evaluated in a constituent quark model, where the Cornell-like potential and one-gluon exchange spin-spin coupling are employed.
High Energy Physics - Phenomenology
We show that HPC constitutes a powerful point feature learning method with a rather compact set of only four types of geometric priors as kernels.
We present a novel attention-based mechanism to learn enhanced point features for point cloud processing tasks, e.g., classification and segmentation.
The success of the simulation strongly supports the ellipse packing hypothesis that was proposed to explain the dynamic behaviors of a trivalent 2D structure.
Biological Physics Adaptation and Self-Organizing Systems Cell Behavior
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures.
Distributed, Parallel, and Cluster Computing
We propose an end-to-end deep neural network which is able to predict both reflectional and rotational symmetries of 3D objects present in the input RGB-D image.
We introduce an end-to-end learnable technique to robustly identify feature edges in 3D point cloud data.
We solve a challenging yet practically useful variant of 3D Bin Packing Problem (3D-BPP).
Adding the attention module with a rectified linear unit (ReLU) results in an amplification of positive elements and a suppression of negative ones, both with learned, data-adaptive parameters.
We demonstrate these by capturing contextual information at patch, object and scene levels.
Online semantic 3D segmentation together with real-time RGB-D reconstruction poses special challenges, such as how to perform 3D convolution directly over the progressively fused 3D geometric data and how to smartly fuse information from frame to frame.
Experiment results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach and meanwhile further reduce the input data size.
Since DynamicPPL is a modular, stand-alone library, any probabilistic programming system written in Julia, such as Turing.jl, can use DynamicPPL to specify models and trace their model parameters.
To tackle intra-class shape variations, we learn canonical shape space (CASS), a unified representation for a large variety of instances of a certain object category.
Building on a further study of the low-rank subspace clustering (LRSC) and L2-graph subspace clustering algorithms, we propose an F-graph subspace clustering algorithm with a symmetric constraint (FSSC), which constructs a new objective function with a symmetric constraint based on the F-norm; its most significant advantage is that it yields a closed-form solution for the coefficient matrix.
The regularization maximizes the mutual information between navigation actions and visual observation transforms of an agent, thus promoting more informed navigation decisions.
To address this issue, we approach camera relocalization with a decoupled solution where feature extraction, coordinate regression, and pose estimation are performed separately.
We introduce PQ-NET, a deep neural network which represents and generates 3D shapes via sequential part assembly.
Transforming one probability distribution to another is a powerful tool in Bayesian inference and machine learning.
Stan's Hamiltonian Monte Carlo (HMC) has demonstrated remarkable sampling robustness and efficiency in a wide range of Bayesian inference problems through carefully crafted adaptation schemes to the celebrated No-U-Turn sampler (NUTS) algorithm.
In depth-sensing applications ranging from home robotics to AR/VR, it will be common to acquire 3D scans of interior spaces repeatedly at sparse time intervals (e.g., as part of regular daily use).
No-cloning theorem forbids perfect cloning of an unknown quantum state.
While current convolutional neural networks tend to extract global features and global semantic information in a scene, geo-spatial objects can be located anywhere in an aerial image scene and their spatial arrangement tends to be more complicated.
Exhaustive experiments indicate that the proposed method can detect building change types directly and outperform the current multi-index learning method.
In our method, the exploratory robot scanning is both driven by and targeted at the recognition and segmentation of semantic objects from the scene.
First, the latent distribution is conditioned on current observations and the target view, leading to a model-based, target-driven navigation.
Enlightened by the fact that 3D shape structure is characterized as part composition and placement, we propose to model 3D shape variations with a part-aware deep generative network, coined as PAGENet.
Determining the positions of neurons in an extracellular recording is useful for investigating functional properties of the underlying neural circuitry.
In this paper, we propose to re-examine the RL approaches through the lens of classic transportation theory.
Increasingly available city data and advanced learning techniques have empowered people to improve the efficiency of our city functions.
To enable cooperation of traffic signals, in this paper, we propose a model, CoLight, which uses graph attentional networks to facilitate communication.
Specifically, the temporal coherence branch, pretrained in an adversarial fashion from unlabeled video data, is designed to capture the dynamic appearance and motion cues of video sequences to guide object segmentation.
Ranked #2 on Semi-Supervised Video Object Segmentation on YouTube
While the part prior network can be trained with noisy and inconsistently segmented shapes, the final output of AdaCoSeg is a consistent part labeling for the input set, with each shape segmented into up to (a user-specified) K parts.
For the task of mobility analysis of 3D shapes, we propose joint analysis for simultaneous motion part segmentation and motion attribute estimation, taking a single 3D model as input.
Meanwhile, to increase the segmentation accuracy at each node, we enhance the recursive contextual feature with the shape feature extracted for the corresponding part.
Ranked #14 on 3D Part Segmentation on ShapeNet-Part (Class Average IoU metric)
The network may significantly alter the geometry and structure of the input parts and synthesize a novel shape structure based on the inputs, while adding or removing parts to minimize a structure plausibility loss.
We propose to generate part hypotheses from the components based on a hierarchical grouping strategy, and perform labeling on those part groups instead of directly on the components.
The multi-view deep neural network is perhaps the most successful approach in 3D shape classification.
The reason is that it finds the similar instances according to their features directly, which is usually impacted by the imperfect data, and thus returns sub-optimal results.
In this network, a Score Generation Unit is devised to evaluate the quality of each projected image with score vectors.
We propose a scalable Laplacian pyramid reconstructive adversarial network (LAPRAN) that enables high-fidelity, flexible and fast CS image reconstruction.
We present a generative neural network which enables us to generate plausible 3D indoor scenes in large quantities and varieties, easily and highly efficiently.
In this work, we take their insight of using kernels as fixed adversaries further and present a novel method for training deep generative models that does not involve saddlepoint optimization.
We present a semi-supervised co-analysis method for learning 3D shape styles from projected feature lines, achieving style patch localization with only weak supervision.
We propose to recover 3D shape structures from single RGB images, where structure refers to shape parts represented by cuboids and part relations encompassing connectivity and symmetry.
We introduce a novel RGB-D patch descriptor designed for detecting coplanar surfaces in SLAM reconstruction.
In this paper, we propose LCANet, an end-to-end deep neural network based lipreading system.
Ranked #2 on Lipreading on GRID corpus (mixed-speech)
Interpreting black box classifiers, such as deep networks, allows an analyst to validate a classifier before it is deployed in a high-stakes setting.
We introduce a novel neural network architecture for encoding and synthesis of 3D shapes, particularly their structures.
For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
Compressive sensing (CS) is a promising technology for realizing energy-efficient wireless sensors for long-term health monitoring.
This paper addresses the real-time encoding-decoding problem for high-frame-rate video compressive sensing (CS).
Active vision is inherently attention-driven: the agent actively selects views to attend to in order to quickly achieve the vision task while improving its internal representation of the scene being observed.