This paper describes AISP-SJTU’s submissions for the IWSLT 2022 Simultaneous Translation task.
Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain.
We present HandAvatar, a novel representation for hand animation and rendering, which can generate smoothly compositional geometry and self-occlusion-aware texture.
To ensure optimal consistency, the optimal node is required to be the unique STN.
Place recognition is a critical and challenging task for mobile robots, aiming to retrieve an image captured at the same place as a query image from a database.
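The retrieval formulation above can be sketched in a few lines. This is only an illustration, assuming each image has already been reduced to a fixed-length global descriptor; the descriptor values, the place names, and the helper names `cosine` and `retrieve` are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, database):
    """Return the key of the database descriptor most similar to the query."""
    return max(database, key=lambda name: cosine(query, database[name]))


# Hypothetical 3-D descriptors; real systems use much higher dimensions.
DATABASE = {
    "corridor": [1.0, 0.0, 0.0],
    "lobby": [0.0, 1.0, 0.0],
    "lab": [0.6, 0.8, 0.0],
}

best_match = retrieve([0.1, 0.9, 0.0], DATABASE)
```

Nearest-neighbour search by cosine similarity is one common choice here; the difficulty the sentence refers to lies in learning descriptors that stay similar under viewpoint and appearance change, which this sketch does not address.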
Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis; however, the requirement of accurate camera poses limits their application.
We present two versatile methods to generally enhance self-supervised monocular depth estimation (MDE) models.
In this paper, we propose a lightweight system, RDS-SLAM, based on ORB-SLAM2, which can accurately estimate poses and build semantic maps at object level for dynamic scenarios in real time using only one commonly used Intel Core i7 CPU.
The proposed sparse semantic map-based localization approach is robust against occlusion and long-term appearance changes in the environments.
In this paper, we redesign the patch-based triplet loss in MDE to alleviate the ubiquitous edge-fattening issue.
Ranked #1 on Unsupervised Monocular Depth Estimation on Kitti Raw
In this paper, we propose an efficient structure named Efficient Correspondence Transformer (ECO-TR) that finds correspondences in a coarse-to-fine manner, which significantly improves the efficiency of the functional correspondence model.
In this work, we propose a novel OWOD problem called Unknown-Classified Open World Object Detection (UC-OWOD).
However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs.
Recently, the structural reading comprehension (SRC) task on web pages has attracted increasing research interest.
Neural volume rendering enables photo-realistic renderings of a human performer in free-view, a critical task in immersive VR/AR applications.
We allow the effective combination of design experience from different sources, so as to create an effective search space containing a variety of TSF models to support different TSF tasks.
In this work, we propose a framework for single-view hand mesh reconstruction, which can simultaneously achieve high reconstruction accuracy, fast inference speed, and temporal coherence.
This paper studies the problem of hallucinated NeRF, i.e., recovering a realistic NeRF at a different time of day from a group of tourism images.
Due to the limited representational capacity of the joint Q value function, multi-agent reinforcement learning (MARL) methods with linear or monotonic value decomposition cannot ensure optimal consistency (i.e., the correspondence between the individual greedy actions and the maximal true Q value), leading to instability and poor coordination.
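A small matrix game illustrates how a linear decomposition can break optimal consistency. The sketch below is an assumption-laden toy: it fits an additive decomposition Q(a1, a2) ≈ q1(a1) + q2(a2) by uniform least squares (row and column means), which is a stand-in for a linear value decomposition and not the training procedure of any particular MARL method. The payoff values and function names are hypothetical.

```python
def additive_fit(q):
    """Uniform least-squares fit of q[i][j] by q1[i] + q2[j] (row/column means)."""
    n_rows, n_cols = len(q), len(q[0])
    grand = sum(sum(row) for row in q) / (n_rows * n_cols)
    q1 = [sum(row) / n_cols - grand / 2 for row in q]
    q2 = [sum(q[i][j] for i in range(n_rows)) / n_rows - grand / 2
          for j in range(n_cols)]
    return q1, q2


def argmax(values):
    """Index of the first maximal entry."""
    return max(range(len(values)), key=values.__getitem__)


def check_optimal_consistency(q):
    """Return (consistent, greedy_joint_action) for the additive fit of q."""
    q1, q2 = additive_fit(q)
    greedy = (argmax(q1), argmax(q2))
    true_max = max(max(row) for row in q)
    return q[greedy[0]][greedy[1]] == true_max, greedy


# Hypothetical non-monotonic payoff: the optimum (0, 0) is surrounded by penalties.
PAYOFF = [
    [8, -12, -12],
    [-12, 0, 0],
    [-12, 0, 0],
]

consistent, greedy_action = check_optimal_consistency(PAYOFF)
```

In this toy game the individual greedy actions select a joint action with payoff 0 instead of the true maximum of 8, because the penalties around the optimum drag down the first row and column averages.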
The experimental results show that our approach makes the interaction more efficient and safer.
Our experimental results show that the proposed method is able to perform high-quality restoration for unconstrained underwater images without any supervision.
Spoken dialogue systems such as Siri and Alexa bring great convenience to people's everyday lives.
Based on this observation, we propose adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths.
Thus, we propose a novel unsupervised FIQA method that incorporates Similarity Distribution Distance for Face Image Quality Assessment (SDD-FIQA).
In the root-relative mesh recovery task, we exploit semantic relations among joints to generate a 3D mesh from the extracted 2D cues.
In this paper, we introduce the task of structural reading comprehension (SRC) on the web.
Tests on AFLW2000-3D and BIWI show that our method runs in real time and outperforms state-of-the-art (SotA) face pose estimators.
Ranked #4 on Head Pose Estimation on AFLW2000
In this paper, we present S3ML, a secure serving system for machine learning inference.
Using a gating mechanism that discriminates unseen samples from seen samples, the generalized zero-shot learning (GZSL) problem can be decomposed into a conventional Zero-Shot Learning (ZSL) problem and a supervised classification problem.
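The decomposition can be sketched as a routing step in front of two separate predictors. The gate below is a simple confidence threshold on the seen-class softmax, which is an assumption made for illustration and not necessarily the gate used in the paper; the scores, labels, threshold, and function names are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_predict(seen_scores, unseen_scores, seen_labels, unseen_labels,
                  threshold=0.8):
    """Route a sample to the seen-class or the unseen-class predictor.

    Assumed gate: treat the sample as 'seen' when the supervised classifier
    is confident enough; otherwise fall back to the ZSL predictor.
    """
    probs = softmax(seen_scores)
    best_seen = max(range(len(probs)), key=probs.__getitem__)
    if probs[best_seen] >= threshold:
        return seen_labels[best_seen]  # supervised classification branch
    best_unseen = max(range(len(unseen_scores)), key=unseen_scores.__getitem__)
    return unseen_labels[best_unseen]  # conventional ZSL branch


SEEN = ["horse", "tiger"]
UNSEEN = ["zebra", "okapi"]

confident = gated_predict([5.0, 0.0], [0.2, 0.9], SEEN, UNSEEN)
uncertain = gated_predict([0.1, 0.0], [0.2, 0.9], SEEN, UNSEEN)
```

With a gate of this kind, each branch only has to solve its own sub-problem, so the overall accuracy depends heavily on how well the gate separates seen from unseen samples.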
According to our analysis, five key discoveries are reported: 1) domain quality has a negligible effect on within-domain convolutional representation and detection accuracy; 2) a low-quality domain leads to higher generalization ability in cross-domain detection; 3) a low-quality domain can hardly be well learned in a domain-mixed learning process; 4) restoration degrades recall efficiency and therefore cannot improve within-domain detection accuracy; 5) visual restoration is beneficial to detection in the wild by reducing the domain shift between training data and real-world scenes.
From a robotic perspective, recall continuity and localization stability are as important as accuracy, but average precision (AP) is insufficient to reflect detectors' performance across time.
As for temporal detection in videos, temporal refinement networks (TRNet) and temporal dual refinement networks (TDRNet) are developed by propagating the refinement information across time.
Moreover, we develop a creative temporal analysis unit, namely, attentional ConvLSTM (AC-LSTM), in which a temporal attention mechanism is specially tailored for background suppression and scale suppression while a ConvLSTM integrates attention-aware features across time.
More specifically, an underwater index is investigated to describe underwater properties, and a loss function based on the underwater index is designed to train the critic branch for underwater noise suppression.