Instead of using readout tokens, radar representations contribute additional depth information to a monocular depth estimation model and improve performance.
Since the release of radar data in large-scale autonomous driving datasets, many works have proposed fusing radar data into monocular depth estimation models as an additional guidance signal.
We integrate sparse radar data into a monocular depth estimation model and introduce a novel preprocessing method that mitigates the sparseness and limited field of view of radar.
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.
Such a hypothesis implies that during the conversion phase, the latent codes and the converted features in VAE-based VC are in fact dependent on the source F0.
In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.