We propose a novel Latent Diffusion Transformer, namely Latte, for video generation.
Leveraging object information within scenes to enhance the distinguishability of feature representations has emerged as a key approach in this domain.
The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos.
Federated learning (FL) is a widely employed distributed paradigm for collaboratively training machine learning models from multiple clients without sharing local data.
2 code implementations • 26 Sep 2023 • Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu
To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model.
Ranked #9 on Text-to-Video Generation on UCF-101
Due to the high inter-class similarity caused by the complex composition and the co-existing objects across scenes, numerous studies have explored object semantic knowledge within scenes to improve scene recognition.
1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, we utilize a multi-scale approach to generate video-related descriptions.
Exploring the semantic context in scene images is essential for indoor scene recognition.
Despite the remarkable success of convolutional neural networks in various computer vision tasks, recognizing indoor scenes still presents a significant challenge due to their complex composition.
Ranked #1 on Scene Recognition on MIT Indoor Scenes (10-stage average accuracy metric)
Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL), yet it has been criticized for its learning inefficiency.
We compare the prediction performance for different intelligence measures based on static FC, dynamic FC, and region-level time series acquired from the Adolescent Brain Cognitive Development (ABCD) study involving close to 7000 individuals.
Our estimator is designed to minimize the $L_1$ norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution.
In this work, we propose FedSSO, a server-side second-order optimization method for federated learning (FL).
Traffic light recognition, as a critical component of the perception module of self-driving vehicles, plays a vital role in intelligent transportation systems.
Hence, previous methods optimize the compressed model layer-by-layer and try to make every layer have the same outputs as the corresponding layer in the teacher model, which is cumbersome.
This multi-scale architecture helps the decoder translate the discriminative representations learned by the encoders into images.
VCC takes advantage of distributions of local relationships of samples near the boundary of clusters, so that they can be properly separated and pulled to cluster centers to form compact clusters.
In deep learning-based local stereo matching methods, larger image patches usually bring better stereo matching accuracy.
no code implementations • 8 Dec 2020 • Angelo Ziletti, Christoph Berns, Oliver Treichel, Thomas Weber, Jennifer Liang, Stephanie Kammerath, Marion Schwaerzler, Jagatheswari Virayah, David Ruau, Xin Ma, Andreas Mattern
Millions of unsolicited medical inquiries are received by pharmaceutical companies every year.
Considering the visible artifacts in existing methods, we propose a contrastive style loss for style rendering that enforces similarity between the style of the rendered photo and that of the caricature, while simultaneously increasing its discrepancy from the photos.
In this paper, we introduce two new approximation properties for étale groupoids, almost elementariness and (ubiquitous) fiberwise amenability, inspired by Matui's and Kerr's notions of almost finiteness.
Operator Algebras • Dynamical Systems • 22A22 (primary); 46L35, 51F30, 37A55, 37B05 (secondary)
It is difficult for encoders to capture such powerful representations under this complex situation.
As there is significant interest in understanding the altered interactions between different brain regions that lead to neuro-disorders, it is important to develop data-driven methods that work with a population of graph data for traditional prediction tasks.
Visual loop closure detection, which can be considered as an image retrieval task, is an important problem in SLAM (Simultaneous Localization and Mapping) systems.
In this paper, we propose alleviating this problem through sampling only a small fraction of data for normalization at each iteration.
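The idea of normalizing with statistics from only a small sample can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes a batch-normalization-style computation over a flat list of activations, and the function name `sampled_normalize`, the `fraction` parameter, and the uniform random sampling scheme are all hypothetical choices made for the example.

```python
import random
from statistics import fmean


def sampled_normalize(batch, fraction=0.25, eps=1e-5, seed=0):
    """Normalize `batch` with mean/variance estimated from a random subset.

    Hypothetical sketch: the sampling scheme and the parameter names are
    assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    # Use at least two samples (when available) so the variance is meaningful.
    k = min(len(batch), max(2, int(len(batch) * fraction)))
    subset = rng.sample(batch, k)

    # Statistics come from the subset only, which is the cheaper step.
    mean = fmean(subset)
    var = fmean((x - mean) ** 2 for x in subset)

    # The estimated statistics are then applied to the whole batch.
    scale = (var + eps) ** 0.5
    return [(x - mean) / scale for x in batch], mean, var


activations = [float(i) for i in range(100)]
normalized, est_mean, est_var = sampled_normalize(activations, fraction=0.1)
```

With `fraction=1.0` the sketch reduces to ordinary normalization over the full batch; smaller fractions trade accuracy of the estimated statistics for less computation.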
Crossbar architecture based devices have been widely adopted in neural network accelerators by taking advantage of the high efficiency on vector-matrix multiplication (VMM) operations.
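To make the VMM claim concrete, the sketch below models an idealized crossbar in which each cell stores a conductance and each column current is the sum of row voltage times conductance (Ohm's law plus Kirchhoff's current law). It ignores wire resistance, noise, and quantization, and the function name `crossbar_vmm` is a hypothetical choice for this example.

```python
def crossbar_vmm(voltages, conductances):
    """Column currents of an idealized crossbar: I_j = sum_i V_i * G[i][j].

    Hypothetical sketch: assumes ideal devices with no parasitic effects.
    """
    rows, cols = len(conductances), len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("one input voltage per crossbar row is required")
    # Every column accumulates its current in parallel on real hardware;
    # here the same sum is computed sequentially.
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


currents = crossbar_vmm([1.0, 2.0], [[1.0, 0.0], [0.5, 2.0]])
```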