Learning to Reason: End-to-End Module Networks for Visual Question Answering

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" ...
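The decomposition idea in the abstract can be illustrated with a toy, purely symbolic sketch. It is not the paper's model: the scene representation and the module names `find`, `count`, and `equal` are hypothetical stand-ins chosen for this example, and the scene is a hand-written list of objects rather than an image.

```python
from typing import Dict, List

# Hypothetical symbolic scene: each object is a dict with a "shape" key.
Scene = List[Dict[str, str]]


def find(scene: Scene, shape: str) -> Scene:
    """Sub-problem 1: select the objects of the requested shape."""
    return [obj for obj in scene if obj["shape"] == shape]


def count(objects: Scene) -> int:
    """Sub-problem 2: count the selected objects."""
    return len(objects)


def equal(a: int, b: int) -> bool:
    """Sub-problem 3: compare two counts."""
    return a == b


def answer_equal_count(scene: Scene, shape_a: str, shape_b: str) -> str:
    """Compose the modules: equal(count(find(a)), count(find(b)))."""
    same = equal(count(find(scene, shape_a)), count(find(scene, shape_b)))
    return "yes" if same else "no"


scene: Scene = [
    {"shape": "ball"},
    {"shape": "box"},
    {"shape": "ball"},
    {"shape": "box"},
]
answer = answer_equal_count(scene, "ball", "box")
```

Each function handles one sub-problem, and the question is answered by composing them. A different question would reuse the same pieces in a different arrangement.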

ICCV 2017
TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK
Visual Question Answering | VQA v2 test-dev | N2NMN (ResNet-152, policy search) | Accuracy | 64.9 | # 19

Results from Other Papers


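TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK
Visual Dialog | Visual Dialog v1.0 test-std | NMN | NDCG (x 100) | 58.1 | # 38
Visual Dialog | Visual Dialog v1.0 test-std | NMN | MRR (x 100) | 58.8 | # 32
Visual Dialog | Visual Dialog v1.0 test-std | NMN | R@1 | 44.15 | # 39
Visual Dialog | Visual Dialog v1.0 test-std | NMN | R@5 | 76.88 | # 31
Visual Dialog | Visual Dialog v1.0 test-std | NMN | R@10 | 86.88 | # 30
Visual Dialog | Visual Dialog v1.0 test-std | NMN | Mean | 4.4 | # 27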
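
For readers unfamiliar with the ranking metrics listed for Visual Dialog, the following is a minimal sketch of how MRR, R@k, and mean rank are commonly computed from the rank of the ground-truth answer among the candidates. The function name `retrieval_metrics` and the example ranks are made up for illustration; this is not the benchmark's official evaluation code, and NDCG is left out because it also needs per-candidate relevance scores.

```python
from typing import Dict, Sequence


def retrieval_metrics(ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute ranking metrics from 1-based ground-truth ranks.

    ranks[i] is the position of the correct answer in the model's
    sorted candidate list for question i (1 = ranked first).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    n = len(ranks)
    metrics = {
        # Mean reciprocal rank, scaled by 100 as in the table.
        "MRR": 100.0 * sum(1.0 / r for r in ranks) / n,
        # Mean rank of the correct answer (lower is better).
        "Mean": sum(ranks) / n,
    }
    for k in ks:
        # Recall@k: percentage of questions whose correct answer is in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics


# Made-up ranks for five questions, for illustration only.
example_ranks = [1, 2, 4, 10, 20]
metrics = retrieval_metrics(example_ranks)
```

Higher is better for MRR and R@k, while a lower mean rank is better.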

Methods used in the Paper


METHOD | TYPE
Average Pooling | Pooling Operations
Residual Connection | Skip Connections
ReLU | Activation Functions
1x1 Convolution | Convolutions
Batch Normalization | Normalization
Bottleneck Residual Block | Skip Connection Blocks
Global Average Pooling | Pooling Operations
Residual Block | Skip Connection Blocks
Kaiming Initialization | Initialization
Max Pooling | Pooling Operations
Convolution | Convolutions
ResNet | Convolutional Neural Networks