In the spatial transformer in particular, we design a dual-perspective spatial MHSA that integrates global tokens into the window-based attention.
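A minimal sketch of what integrating global tokens into window-based attention could look like, assuming a Swin-style non-overlapping window partition and a small set of learned global tokens prepended to every window's key/value set; the class and parameter names (`GlobalWindowAttention`, `num_global_tokens`) are illustrative, not the paper's API:

```python
import torch
import torch.nn as nn

class GlobalWindowAttention(nn.Module):
    """Window-based MHSA where every window also attends to shared global tokens."""
    def __init__(self, dim, num_heads=4, window_size=7, num_global_tokens=4):
        super().__init__()
        self.window_size = window_size
        # Learned global tokens shared across all windows (assumption).
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W assumed divisible by window_size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping ws x ws windows -> (B * n_windows, ws*ws, C).
        windows = (x.view(B, H // ws, ws, W // ws, ws, C)
                     .permute(0, 1, 3, 2, 4, 5)
                     .reshape(-1, ws * ws, C))
        # Broadcast the same global tokens to every window and prepend them to the
        # key/value sequence; queries remain the local window tokens.
        g = self.global_tokens.expand(windows.size(0), -1, -1)
        kv = torch.cat([g, windows], dim=1)
        out, _ = self.attn(query=windows, key=kv, value=kv)
        # Reverse the window partition back to (B, H, W, C).
        return (out.view(B, H // ws, W // ws, ws, ws, C)
                   .permute(0, 1, 3, 2, 4, 5)
                   .reshape(B, H, W, C))

# Example: a 14x14 feature map with 96 channels.
x = torch.randn(2, 14, 14, 96)
y = GlobalWindowAttention(96)(x)
print(y.shape)  # torch.Size([2, 14, 14, 96])
```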
Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs.
Drawing images of characters with desired poses is an essential but laborious task in anime production.
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full-precision performance.
Ranked #2 on Language Modelling on C4
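A minimal sketch of the core idea behind an Int8 linear layer, assuming symmetric absmax quantization with per-token activation scales and per-output-channel weight scales; the actual method additionally handles outlier feature dimensions in fp16, which is omitted here, and `absmax_quantize`/`int8_linear` are hypothetical helper names:

```python
import torch

def absmax_quantize(t, dim):
    # Symmetric absmax quantization to int8 along `dim`.
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x, weight):
    # x: (tokens, in_features) activations; weight: (out_features, in_features).
    xq, x_scale = absmax_quantize(x, dim=1)       # per-token scale
    wq, w_scale = absmax_quantize(weight, dim=1)  # per-output-channel scale
    # Integer matmul; real kernels accumulate in int32, emulated here with int64 on CPU.
    acc = xq.to(torch.int64) @ wq.to(torch.int64).t()
    # Dequantize with the outer product of the two scale vectors.
    return acc.to(torch.float32) * (x_scale * w_scale.t())

x = torch.randn(8, 512)
w = torch.randn(1024, 512)
out = int8_linear(x, w)
print(out.shape, (out - x @ w.t()).abs().mean())  # (8, 1024), small quantization error
```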
In this work, we investigate common issues with existing spatial encodings and propose a simple yet highly effective approach to modeling high-fidelity volumetric humans from sparse views.
Ranked #1 on Generalizable Novel View Synthesis on ZJU-MoCap
Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations.
In retrieval-augmented diffusion models (RDMs), a set of nearest neighbors is retrieved from an external database during training for each training instance, and the diffusion model is conditioned on these informative samples.
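A minimal sketch of this retrieval step, assuming precomputed embeddings of the external database, cosine-similarity nearest-neighbor lookup, and a denoiser that accepts the retrieved embeddings as conditioning; `retrieve_neighbors` and the toy tensors below are illustrative stand-ins, not the paper's components:

```python
import torch
import torch.nn.functional as F

def retrieve_neighbors(query_emb, database_emb, k=4):
    # query_emb: (B, D), database_emb: (N, D); cosine-similarity retrieval.
    q = F.normalize(query_emb, dim=-1)
    db = F.normalize(database_emb, dim=-1)
    sim = q @ db.t()                      # (B, N) similarities
    idx = sim.topk(k, dim=-1).indices     # (B, k) nearest-neighbor indices
    return database_emb[idx]              # (B, k, D) retrieved conditioning set

# Toy example under these assumptions.
B, D, N = 8, 512, 10_000
database_emb = torch.randn(N, D)          # precomputed embeddings of the external database
query_emb = torch.randn(B, D)             # embeddings of the current training batch

neighbors = retrieve_neighbors(query_emb, database_emb, k=4)
print(neighbors.shape)                    # torch.Size([8, 4, 512])
# The denoiser would then take (noisy x_t, timestep t, neighbors) and the usual
# diffusion objective would be computed conditioned on `neighbors`.
```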
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors with 30 FPS or higher on GPU V100.
Ranked #1 on Real-Time Object Detection on COCO
We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular Visual Odometry (VO).
Notably, StyleFaceV is capable of generating realistic $1024\times1024$ face videos even without high-resolution training videos.