Simulation results demonstrate that our method adequately addresses the uncertainties arising from RES and loads, mitigates the impact of cyber-attacks on the scheduling strategy, and ensures a stable demand supply for various energy sources.
Our approach differs in that it applies an adaptive average ensemble after training, which improves performance across the evaluation metrics.
It has been over six years since the Transformer architecture was first introduced.
To this end, we propose FedSN as a general FL framework to tackle the above challenges, and fully explore data diversity on LEO satellites.
Our simulation results confirm that HoloFed achieves a 57% lower positioning error variance compared to a beam-scanning baseline and can effectively adapt to diverse environments.
In both aspects, considering the inherent resource limitations at the edge, we discuss various cutting-edge techniques, including split learning/inference, parameter-efficient fine-tuning, quantization, and parameter-sharing inference, to facilitate the efficient deployment of LLMs.
Generating dialogue grounded in videos requires a high level of understanding and reasoning about the visual scenes in the videos.
To tackle this issue we propose a new approach for MAPF where agents are guided to their destination by following congestion-avoiding paths.
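One way to realize congestion-avoiding guidance is to penalize crowded cells during path search. The sketch below is a minimal illustration, not the paper's actual algorithm: it runs Dijkstra over a 4-connected grid where each cell's traversal cost grows with a hypothetical `congestion` map (e.g., how many other agents' planned paths cross that cell), so agents are steered around crowded regions. The function name, `alpha` weight, and grid encoding are all assumptions for illustration.

```python
import heapq

def congestion_aware_path(grid, congestion, start, goal, alpha=5.0):
    """Dijkstra on a 4-connected grid; entering a cell costs
    1 + alpha * congestion[cell], so crowded cells are avoided.

    grid[r][c] == 0 marks a free cell; congestion is a same-shaped
    map of (hypothetical) crowding levels.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1.0 + alpha * congestion[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(pq, (nd, (nr, nc)))
    # reconstruct the path from goal back to start
    path, cur = [], goal
    while cur != start:
        path.append(cur)
        cur = prev[cur]
    path.append(start)
    return path[::-1]

free = [[0] * 3 for _ in range(3)]
crowd = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
route = congestion_aware_path(free, crowd, (0, 0), (0, 2))
# the route detours around the congested cell (0, 1)
```

With `alpha=5.0`, the direct two-step route through the congested cell costs 7, while the four-step detour costs 4, so the detour wins.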
To overcome the challenge in labeling RF imaging given its human incomprehensible nature, OCHID-Fi employs a cross-modality and cross-domain training process.
Experimental results show that replacing the self-attention mechanism with the SHE evidently improves the performance of the Transformer, whereas the simplified versions of the SHE, i.e., the HE, the WE, and the ME, perform close to or better than the self-attention mechanism with lower computational and memory complexity.
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
However, the use of diffusion models to generate high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are essential.
The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks.
Ranked #2 on Graph Regression on PCQM4M-LSC (Validation MAE metric)
We hope this model can set a new baseline for generalist vision and language models.
Multi-Agent Path Finding (MAPF) is an important core problem for many new and emerging industrial applications.
2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao
Unlike existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions, significantly improving the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than two.
To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding.
In our framework, by making the best use of the hardware parameters of the sensor that captures real-world space images, we first develop a high-fidelity RSO simulator that can generate various realistic space images.
We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline.
Ranked #2 on Monocular Depth Estimation on SUN-RGBD
Metamaterial-based reconfigurable holographic surfaces (RHSs) have been proposed as novel cost-efficient antenna arrays, which are promising for improving the positioning and communication performance of integrated sensing and communications (ISAC) systems.
Object detection with on-board sensors (e.g., lidar, radar, and camera) plays a crucial role in autonomous driving (AD), and these sensors complement each other in modalities.
In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge.
We show that our model improves over general-domain and single-domain medical and legal language models when processing mixed-domain (personal injury) text.
Our FPGA implementation enables the real-time calcium image decoding with sub-ms processing latency for closed-loop feedback applications.
Self-supervised facial representation has recently attracted increasing attention due to its ability to perform face understanding without relying heavily on large-scale annotated datasets.
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.
Ranked #1 on Instance Segmentation on COCO test-dev (APS metric, using extra training data)
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF) which aims to mitigate the gap between features from different levels and form a comprehensive object representation to achieve better detection performance.
Motivated by the progress of visual-language research, we propose that pre-trained language models (e.g., CLIP) can facilitate animal pose estimation by providing rich prior knowledge for describing animal keypoints in text.
To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame, that is, the candidate region with the maximum response score will be selected as the tracking result of each frame.
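The greedy localization step described above can be sketched in a few lines: given a 2-D response map of matching scores for the current frame (a hypothetical array here, standing in for whatever a real tracker produces), simply take the argmax. The function name and toy scores are assumptions for illustration.

```python
import numpy as np

def greedy_localize(response_map):
    """Pick the candidate location with the maximum response score,
    i.e., the greedy per-frame choice described in the text."""
    idx = np.unravel_index(np.argmax(response_map), response_map.shape)
    return idx, response_map[idx]

# toy response map with its peak at row 1, column 2
scores = np.array([[0.1, 0.2, 0.3],
                   [0.0, 0.5, 0.9],
                   [0.4, 0.1, 0.2]])
loc, score = greedy_localize(scores)
# loc == (1, 2), score == 0.9
```

The greediness is exactly what such a paper would then argue against: the per-frame argmax ignores how the current choice constrains localization in later frames.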
This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT).
Ranked #4 on Semantic Segmentation on PASCAL Context
Point cloud segmentation is fundamental in understanding 3D environments.
Ranked #15 on Semantic Segmentation on S3DIS Area5
However, methods based on this technique ignore the burden placed on a single transformation matrix by the complex information contained in the data.
We observe that the prevailing set abstraction design for down-sampling points may maintain too much unimportant background information that can affect feature learning for detecting objects.
Then, a glimpse-based decoder is introduced to provide refined detection results based on both the glimpse features and the attention modeling outputs of the previous stage.
Ranked #1 on Object Detection on COCO (GFlops metric)
Whereas adversarial training can be useful against specific adversarial perturbations, it has also proven ineffective in generalizing to attacks that deviate from those used for training.
Crucial for healthcare and biomedical applications, respiration monitoring in practice often employs wearable sensors, which cause inconvenience due to their direct contact with the human body.
We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector).
Ranked #2 on Scene Text Detection on MSRA-TD500
Radio-Frequency (RF) based device-free Human Activity Recognition (HAR) rises as a promising solution for many applications.
Given the significant amount of time people spend in vehicles, health issues under driving conditions have become a major concern.
Dropout has been commonly used to quantify prediction uncertainty, i.e., the variations of model predictions on a given input example.
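The dropout-based uncertainty idea can be illustrated with a minimal Monte-Carlo sketch: keep dropout active at inference time, run several stochastic forward passes on the same input, and report the spread of the predictions. The linear model, weight vector `w`, and drop rate below are assumptions for illustration, not any specific paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, w, p=0.5, n_samples=100):
    """Monte-Carlo dropout sketch for a toy linear model:
    each forward pass drops weights independently with
    probability p, and the spread of the predictions serves
    as an uncertainty estimate."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(w.shape) >= p           # Bernoulli keep-mask
        preds.append(x @ (w * mask) / (1.0 - p))  # inverted-dropout scaling
    preds = np.array(preds)
    return preds.mean(), preds.std()  # predictive mean and uncertainty

x = np.ones(4)
w = np.array([0.1, 0.2, 0.3, 0.4])
mean, std = mc_dropout_predict(x, w)
```

Because of the inverted-dropout scaling, the predictive mean stays close to the deterministic output `x @ w`, while `std` captures how sensitive the prediction is to which units are dropped.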
In many practical scenarios of signal extraction from a nonlinear mixture, only one (signal) source is intended to be extracted.
To this end, we propose to decompose each video into a series of expression snippets, each of which contains a small number of facial movements, and attempt to augment the Transformer's ability for modeling intra-snippet and inter-snippet visual relations, respectively, obtaining the Expression snippet Transformer (EST).
Ranked #7 on Dynamic Facial Expression Recognition on DFEW
Different from visible cameras which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency.
Ranked #1 on Object Tracking on VisEvent
no code implementations • 30 Mar 2021 • Florian Laurent, Manuel Schneider, Christian Scheller, Jeremy Watson, Jiaoyang Li, Zhe Chen, Yi Zheng, Shao-Hung Chan, Konstantin Makhnev, Oleg Svidchenko, Vladimir Egorov, Dmitry Ivanov, Aleksei Shpilman, Evgenija Spirovska, Oliver Tanevski, Aleksandar Nikov, Ramon Grunder, David Galevski, Jakov Mitrovski, Guillaume Sartoretti, Zhiyao Luo, Mehul Damani, Nilabha Bhattacharya, Shivam Agarwal, Adrian Egli, Erik Nygren, Sharada Mohanty
However, the coordination of hundreds of agents in a real-life setting like a railway network remains challenging, and the Flatland environment used for the competition models these real-world properties in a simplified manner.
In this paper, we propose to introduce more dynamics by devising a dynamic attention-guided multi-trajectory tracking strategy.
(1) We divide the input image into small patches and adopt TIN, successfully transferring image style at arbitrarily high resolution.
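The patch-splitting step can be sketched simply: tile the image into non-overlapping patches, process each tile independently, and stitch the results back together. The sketch below assumes a single-channel image whose sides are divisible by the patch size (a real system would pad the borders); the function names are illustrative, not the paper's API.

```python
import numpy as np

def split_into_patches(image, patch):
    """Split an H x W array into non-overlapping patch x patch tiles,
    row-major order."""
    h, w = image.shape[:2]
    return [image[i:i + patch, j:j + patch]
            for i in range(0, h, patch)
            for j in range(0, w, patch)]

def reassemble(patches, h, w, patch):
    """Stitch (possibly processed) tiles back into the full image."""
    out = np.zeros((h, w), dtype=patches[0].dtype)
    k = 0
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            out[i:i + patch, j:j + patch] = patches[k]
            k += 1
    return out

img = np.arange(16).reshape(4, 4)
tiles = split_into_patches(img, 2)
restored = reassemble(tiles, 4, 4, 2)
```

Splitting keeps per-tile memory constant regardless of input resolution, which is what makes arbitrarily high-resolution style transfer feasible on fixed-size hardware.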
Action recognition, which is formulated as a task to identify various human actions in a video, has attracted increasing interest from computer vision researchers due to its importance in various applications.
In this paper, we give a mathematical formalization of Multi-Agent Path Finding for Car-Like robots (CL-MAPF) problem.
Despite the impressive prediction performance of deep neural networks (DNNs) in various domains, it is now well known that a set of DNN models trained with the same model specification and the same data can produce very different prediction results.
We introduce a novel neural network-based BRDF model and a Bayesian framework for object inverse rendering, i.e., joint estimation of reflectance and natural illumination from a single image of an object of known geometry.
This method can act as a plug-in for Fast Style Transfer without any modification to the network architecture.
Accurate knowledge of the distribution system topology and parameters is required to achieve good voltage controls, but this is difficult to obtain in practice.
Modern two-stage object detectors generally require excessively large models for their detection heads to achieve high accuracy.
We extend the classical result asserting that the twisting operator preserves certain Deligne--Lusztig character values for truncated formal power series; along the way we discuss some properties of centralisers.
This paper proposes a data-driven distributed voltage control approach based on the spectrum clustering and the enhanced multi-agent deep reinforcement learning (MADRL) algorithm.
More specifically, we propose to perceive texts from three levels of feature representations, i.e., character-, word- and global-level, and then introduce a novel text representation fusion technique to help achieve robust arbitrary text detection.
Ranked #1 on Scene Text Detection on ICDAR 2015
Human keypoint detection from a single image is very challenging due to occlusion, blur, illumination and scale variance.
Ranked #5 on Pose Estimation on COCO test-dev
Alternatively, to obtain much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes.
Human keypoint detection from a single image is very challenging due to occlusion, blur, illumination and scale variance of person instances.
Learning only one projection matrix from original samples to the corresponding binary labels is too strict and will consequently lose some intrinsic geometric structures of the data.
In this paper, we propose a non-negative representation based discriminative dictionary learning algorithm (NRDL) for multicategory face classification.
On one hand, the Fisher criterion improves the intra-class compactness of the relaxed labels during relaxation learning.
To solve above problems, we propose a low-rank discriminative least squares regression model (LRDLSR) for multi-class image classification.
We also review some popular network architectures which have been widely applied in the deep learning community.
Current two-stage object detectors, which consist of a region proposal stage and a refinement stage, may produce unreliable results due to ill-localized proposed regions.
We find that further improvements for correlation filter-based tracking can be made on estimating scales, applying part-based tracking strategy and cooperating with long-term tracking methods.
Variations in the appearance of a tracked object, such as changes in geometry/photometry, camera viewpoint, illumination, or partial occlusion, pose a major challenge to object tracking.
In this paper, a nonparametric maximum likelihood (ML) estimator for band-limited (BL) probability density functions (pdfs) is proposed.
Rodent hippocampal population codes represent important spatial information about the environment during navigation.