Through extensive experiments on two mathematical reasoning benchmarks, GSM8K and MATH, we demonstrate the strong capabilities of our model.
The key idea is to eliminate unsafe visual representations from the model regardless of the text input.
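One generic way to remove a concept from a model's representations, regardless of the text input, is to project each visual embedding onto the subspace orthogonal to an "unsafe" direction. This is only an illustrative concept-erasure sketch under that assumption, not the paper's actual method; the direction `unsafe_dir` here is randomly generated for demonstration.

```python
import numpy as np

def remove_direction(h, u):
    # Project representation h onto the subspace orthogonal to direction u,
    # deleting the component associated with the unwanted concept.
    # Generic concept-erasure sketch; not the paper's actual procedure.
    u = u / np.linalg.norm(u)
    return h - (h @ u) * u

rng = np.random.default_rng(1)
unsafe_dir = rng.normal(size=8)   # hypothetical "unsafe" concept direction
h = rng.normal(size=8)            # hypothetical visual embedding

h_safe = remove_direction(h, unsafe_dir)
print(abs(h_safe @ unsafe_dir))   # ~0: no remaining component along unsafe_dir
```

After the projection, the edited embedding carries no component along the removed direction, independent of whatever prompt produced it.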
Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts.
Usually, correspondences are 2D-to-2D and the pose we estimate is defined only up to scale.
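The scale ambiguity follows from the epipolar constraint: 2D-to-2D correspondences only constrain x2ᵀ E x1 = 0 with E = [t]ₓ R, and scaling the translation t leaves E's null space unchanged. A minimal numerical sketch (with an assumed rotation, translation, and 3D point chosen for illustration):

```python
import numpy as np

def skew(t):
    # Cross-product (skew-symmetric) matrix: skew(t) @ v == np.cross(t, v).
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical relative pose: rotation about z, plus a translation.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.2, 0.1])

# One 3D point observed by both cameras (normalized image coordinates).
X = np.array([0.5, -0.3, 4.0])
x1 = X / X[2]                      # projection in camera 1
X2 = R @ X + t
x2 = X2 / X2[2]                    # projection in camera 2

E = skew(t) @ R                    # essential matrix
print(abs(x2 @ E @ x1))            # ~0: epipolar constraint satisfied

# Scaling t by any positive factor yields the same constraint, so the
# correspondences determine translation only up to scale.
E_scaled = skew(3.7 * t) @ R
print(abs(x2 @ E_scaled @ x1))     # still ~0
```

Both essential matrices explain the correspondence equally well, which is why metric scale cannot be recovered from 2D-to-2D matches alone.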
Furthermore, we find that spatial variance exists in LoFTR's fine correlation module, which harms matching accuracy.
Here we show that smaller LMs initialized from a subset of the layers of GPT-2-medium (355M) and GPT-2-large (770M) can match the validation loss of their larger counterparts trained from scratch for the same number of training steps on the OpenWebText dataset (9B tokens).
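One simple way to pick which teacher layers seed the smaller model is to take evenly spaced blocks from the pretrained stack. The uniform-spacing rule below is an assumption for illustration, not necessarily the selection scheme used in the work:

```python
def select_layer_indices(n_teacher_layers, n_student_layers):
    # Choose n_student_layers evenly spaced indices from the teacher's stack,
    # always including the first and last block. Uniform spacing is an
    # illustrative assumption, not the paper's confirmed rule.
    if n_student_layers == 1:
        return [0]
    step = (n_teacher_layers - 1) / (n_student_layers - 1)
    return [round(i * step) for i in range(n_student_layers)]

# GPT-2-medium has 24 transformer blocks; a hypothetical 12-layer student:
print(select_layer_indices(24, 12))
```

The selected blocks' weights would then be copied into the student before continuing training on the same token budget as the from-scratch baseline.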
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs.
In this paper, we propose Show-1, to our knowledge the first hybrid model that marries pixel-based and latent-based VDMs for text-to-video generation.
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.
We study the use of large language model-based agents for interacting with software via web browsers.