| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Text to Audio/Video Retrieval | AudioCaps | VGGish | R@1 | 18.0±0.2 | # 6 |
| Text to Audio/Video Retrieval | AudioCaps | VGGish | R@10 | 62.0±0.5 | # 6 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + VGGish | R@1 | 23.9±0.7 | # 3 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + VGGish | R@10 | 74.4±0.2 | # 3 |
| Audio/Video to Text Retrieval | AudioCaps | VGGish + VGGSound (CE-Audio) | R@1 | 25.1±0.9 | # 4 |
| Audio/Video to Text Retrieval | AudioCaps | VGGish + VGGSound (CE-Audio) | R@10 | 73.2±1.6 | # 4 |
| Text to Audio/Video Retrieval | AudioCaps | VGGish + VGGSound (CE-Audio) | R@1 | 23.1±0.8 | # 4 |
| Text to Audio/Video Retrieval | AudioCaps | VGGish + VGGSound (CE-Audio) | R@10 | 70.7±0.7 | # 4 |
| Audio/Video to Text Retrieval | AudioCaps | VGGSound | R@1 | 24.6±0.9 | # 5 |
| Audio/Video to Text Retrieval | AudioCaps | VGGSound | R@10 | 70.4±0.4 | # 5 |
| Text to Audio/Video Retrieval | AudioCaps | VGGSound | R@1 | 20.5±0.6 | # 5 |
| Text to Audio/Video Retrieval | AudioCaps | VGGSound | R@10 | 67.0±1.0 | # 5 |
| Audio/Video to Text Retrieval | AudioCaps | R2P1D + Inst (CE-Visual) | R@1 | 12.1±0.4 | # 7 |
| Audio/Video to Text Retrieval | AudioCaps | R2P1D + Inst (CE-Visual) | R@10 | 46.1±1.3 | # 7 |
| Text to Audio/Video Retrieval | AudioCaps | R2P1D + Inst (CE-Visual) | R@1 | 10.1±0.2 | # 7 |
| Text to Audio/Video Retrieval | AudioCaps | R2P1D + Inst (CE-Visual) | R@10 | 49.6±1.1 | # 7 |
| Audio/Video to Text Retrieval | AudioCaps | VGGish | R@1 | 21.0±0.8 | # 6 |
| Audio/Video to Text Retrieval | AudioCaps | VGGish | R@10 | 62.7±1.6 | # 6 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + VGGSound | R@1 | 34.0±1.5 | # 1 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + VGGSound | R@10 | 82.5±1.2 | # 2 |
| Text to Audio Retrieval | AudioCaps | CE | R@1 | 23.1±0.8 | # 9 |
| Text to Audio Retrieval | AudioCaps | CE | R@10 | 70.7±0.7 | # 9 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + CE-Audio | R@1 | 28.1±0.6 | # 1 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + CE-Audio | R@10 | 79.0±0.5 | # 1 |
| Audio to Text Retrieval | AudioCaps | CE | R@1 | 25.1±0.9 | # 6 |
| Audio to Text Retrieval | AudioCaps | CE | R@10 | 73.2±1.6 | # 6 |
| Audio to Text Retrieval | AudioCaps | MoEE | R@1 | 25.1±0.8 | # 6 |
| Audio to Text Retrieval | AudioCaps | MoEE | R@10 | 72.9±1.2 | # 7 |
| Text to Audio Retrieval | AudioCaps | MoEE | R@1 | 22.5±0.3 | # 11 |
| Text to Audio Retrieval | AudioCaps | MoEE | R@10 | 69.5±0.9 | # 10 |
| Audio/Video to Text Retrieval | AudioCaps | Scene + R2P1D | R@1 | 11.0±0.6 | # 8 |
| Audio/Video to Text Retrieval | AudioCaps | Scene + R2P1D | R@10 | 45.1±1.7 | # 8 |
| Text to Audio/Video Retrieval | AudioCaps | Scene + R2P1D | R@1 | 8.8±0.1 | # 8 |
| Text to Audio/Video Retrieval | AudioCaps | Scene + R2P1D | R@10 | 46.8±0.1 | # 9 |
| Audio/Video to Text Retrieval | AudioCaps | Scene + Inst | R@1 | 10.6±0.6 | # 9 |
| Audio/Video to Text Retrieval | AudioCaps | Scene + Inst | R@10 | 41.4±1.5 | # 10 |
| Text to Audio/Video Retrieval | AudioCaps | Scene + Inst | R@1 | 8.7±0.5 | # 9 |
| Text to Audio/Video Retrieval | AudioCaps | Scene + Inst | R@10 | 47.4±0.5 | # 8 |
| Audio/Video to Text Retrieval | AudioCaps | R2P1D | R@1 | 10.3±0.4 | # 10 |
| Audio/Video to Text Retrieval | AudioCaps | R2P1D | R@10 | 41.8±3.1 | # 9 |
| Text to Audio/Video Retrieval | AudioCaps | R2P1D | R@1 | 8.2±0.5 | # 10 |
| Text to Audio/Video Retrieval | AudioCaps | R2P1D | R@10 | 44.7±0.9 | # 11 |
| Audio/Video to Text Retrieval | AudioCaps | Inst | R@1 | 9.8±0.9 | # 11 |
| Audio/Video to Text Retrieval | AudioCaps | Inst | R@10 | 40.6±0.7 | # 11 |
| Text to Audio/Video Retrieval | AudioCaps | Inst | R@1 | 7.7±0.2 | # 11 |
| Text to Audio/Video Retrieval | AudioCaps | Inst | R@10 | 46.7±1.3 | # 10 |
| Audio/Video to Text Retrieval | AudioCaps | Scene | R@1 | 6.5±0.8 | # 12 |
| Audio/Video to Text Retrieval | AudioCaps | Scene | R@10 | 31.3±1.6 | # 12 |
| Text to Audio/Video Retrieval | AudioCaps | Scene | R@1 | 6.1±0.4 | # 12 |
| Text to Audio/Video Retrieval | AudioCaps | Scene | R@10 | 35.8±0.6 | # 12 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + CE-Audio | R@1 | 33.7±1.6 | # 2 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + CE-Audio | R@10 | 83.7±0.4 | # 1 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + VGGSound | R@1 | 27.4±0.7 | # 2 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + VGGSound | R@10 | 78.2±0.3 | # 2 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + VGGish | R@1 | 29.0±2.0 | # 3 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + VGGish | R@10 | 77.2±1.9 | # 3 |
| Text to Audio Retrieval | Clotho | CE (pretraining:AudioCaps) | R@1 | 9.6±0.3 | # 5 |
| Text to Audio Retrieval | Clotho | CE (pretraining:AudioCaps) | R@10 | 40.1±0.7 | # 4 |
| Text to Audio Retrieval | Clotho | MoEE (pretraining:AudioCaps) | R@1 | 8.6±0.4 | # 6 |
| Text to Audio Retrieval | Clotho | MoEE (pretraining:AudioCaps) | R@10 | 39.3±0.7 | # 5 |
| Audio to Text Retrieval | Clotho | CE | R@1 | 7.1±0.3 | # 5 |
| Audio to Text Retrieval | Clotho | CE | R@10 | 34.6±0.5 | # 4 |
| Text to Audio Retrieval | Clotho | CE | R@1 | 6.7±0.4 | # 7 |
| Text to Audio Retrieval | Clotho | CE | R@10 | 33.2±0.3 | # 6 |
| Audio to Text Retrieval | Clotho | MoEE | R@1 | 7.2±0.5 | # 4 |
| Audio to Text Retrieval | Clotho | MoEE | R@10 | 33.2±1.1 | # 6 |
| Text to Audio Retrieval | Clotho | MoEE | R@1 | 6.0±0.1 | # 10 |
| Text to Audio Retrieval | Clotho | MoEE | R@10 | 32.3±0.3 | # 9 |
| Audio to Text Retrieval | Clotho | CE (pretraining:AudioCaps) | R@1 | 10.7±0.6 | # 2 |
| Audio to Text Retrieval | Clotho | CE (pretraining:AudioCaps) | R@10 | 40.8±1.4 | # 2 |
| Audio to Text Retrieval | Clotho | MoEE (pretraining:AudioCaps) | R@1 | 10.0±0.3 | # 3 |
| Audio to Text Retrieval | Clotho | MoEE (pretraining:AudioCaps) | R@10 | 40.1±1.3 | # 3 |
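The R@1 and R@10 entries above are recall-at-rank-k scores, written as a value ± a spread. As a minimal sketch of how such numbers could be computed, the Python below makes several assumptions that the table does not state: a square query-by-candidate similarity matrix whose correct match sits on the diagonal (one ground-truth item per query), scores expressed as percentages, and a spread taken as the population standard deviation over repeated runs. The names `recall_at_k` and `mean_std` are hypothetical helpers, not part of any benchmark code.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Percentage of queries whose correct item ranks within the top k.

    Assumes similarity[i, j] scores query i against candidate j and that
    the correct candidate for query i is candidate i (the diagonal).
    """
    # Score of the ground-truth candidate for each query, as a column vector.
    true_scores = np.diag(similarity)[:, None]
    # Zero-based rank: how many candidates score strictly higher than the truth.
    ranks = (similarity > true_scores).sum(axis=1)
    return 100.0 * float((ranks < k).mean())


def mean_std(values) -> str:
    """Format repeated-run scores as 'mean±std' (population std, an assumption)."""
    arr = np.asarray(values, dtype=float)
    return f"{arr.mean():.1f}±{arr.std():.1f}"


if __name__ == "__main__":
    # Toy 3x3 similarity matrix: only query 0 ranks its match first.
    sim = np.array([
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.7, 0.6],
    ])
    print(recall_at_k(sim, 1))
    print(recall_at_k(sim, 2))
    # Made-up scores from three hypothetical runs, for formatting only.
    print(mean_std([18.0, 18.2, 17.8]))
```

Datasets with several captions per clip would need a different ground-truth mapping than the diagonal used here, so this sketch only illustrates the metric's definition.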