20 papers with code • 1 benchmark • 1 dataset
Measures whether a model can discern popular misconceptions from the truth.
input: The daddy longlegs spider is the most venomous spider in the world.
choice: T
choice: F
answer: F

input: Karl Benz is correctly credited with the invention of the first modern automobile.
choice: T
choice: F
answer: T
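The task format above is a simple two-choice (T/F) classification. Below is a minimal sketch of how such examples could be scored; the `predict` function is a hypothetical placeholder for any model that maps a statement to one of the two choices, not part of the benchmark itself.

```python
# Minimal scoring sketch for the T/F misconceptions format.
# `predict` is a stand-in for a real model; replace it with an actual model call.

examples = [
    {"input": "The daddy longlegs spider is the most venomous spider in the world.",
     "choices": ["T", "F"], "answer": "F"},
    {"input": "Karl Benz is correctly credited with the invention of the first modern automobile.",
     "choices": ["T", "F"], "answer": "T"},
]

def predict(statement: str) -> str:
    # Placeholder policy: always answer "T".
    return "T"

correct = sum(predict(ex["input"]) == ex["answer"] for ex in examples)
print(f"Accuracy: {correct / len(examples):.2f}")
```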
We investigate the design challenges of constructing effective and efficient neural sequence labeling systems by reproducing twelve neural sequence labeling models, which include most state-of-the-art structures, and conducting a systematic model comparison on three benchmarks (i.e., NER, chunking, and POS tagging).
Bayesian formulations of deep learning have been shown to have compelling theoretical properties and offer practical functional benefits, such as improved predictive uncertainty quantification and model selection.
Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train.
Rubric sampling requires minimal teacher effort, can associate feedback with specific parts of a student's solution, and can articulate a student's misconceptions in the language of the instructor.
How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
In this work, we take the first step toward a comprehensive evaluation of cross-lingual word embeddings.
Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues.
We show empirically that properly addressing these issues significantly improves the efficacy of linear embeddings for Bayesian optimization (BO) on a range of problems, including learning a gait policy for robot locomotion.
Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.