In this paper, a novel BERT-based SLU model (WCN-BERT SLU) is proposed to jointly encode word confusion networks (WCNs) and the dialogue context.
In this paper, we focus on spoken language understanding (SLU) from unaligned data, where the annotation for each utterance is a set of act-slot-value triples.
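A minimal sketch of what such an unaligned annotation might look like as a data structure. This is not the paper's code; the utterance, act, slot, and value names below are hypothetical illustrations of the act-slot-value format.

```python
# Hedged sketch of the act-slot-value triple annotation format.
# Names and values here are hypothetical, not taken from the paper's dataset.
from typing import NamedTuple, Set

class Triple(NamedTuple):
    act: str    # dialogue act, e.g. "inform" or "request"
    slot: str   # semantic slot, e.g. "food"
    value: str  # slot value, e.g. "chinese"

# Hypothetical utterance: "I want cheap Chinese food"
annotation: Set[Triple] = {
    Triple("inform", "food", "chinese"),
    Triple("inform", "pricerange", "cheap"),
}

# The set is unaligned: no token spans link a triple to word positions.
for t in sorted(annotation):
    print(f"{t.act}({t.slot}={t.value})")
```

Because the triples form a set rather than a token-level alignment, the model must learn the mapping from the utterance to the full semantic frame without span supervision.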
The dataset contains a total of 470K human instances in the train and validation subsets, with an average of ~22.6 persons per image and various kinds of occlusion.