Additionally, our analysis uncovers semantic predispositions in LLMs and reveals the impact of recency bias on information presented in long contexts.
In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks.
Large-scale pre-training has driven progress in many areas of natural language processing, though little is understood about the design of pre-training datasets.
As Deep Learning (DL) training workloads grow in scale, both in compute resources and in training time, the likelihood of encountering in-training failures rises substantially, leading to lost work and wasted resources.
(b) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity than LSTMs.
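For context, "sensitivity" in these findings can be read as the standard Boolean-function notion; the definition below is supplied as background rather than quoted from the excerpt. The sensitivity of f : {0,1}^n → {0,1} at an input x counts the coordinates whose flip changes the output, and the average sensitivity is its mean over uniformly random inputs:

s(f, x) = \left|\{\, i \in [n] : f(x) \neq f(x^{\oplus i}) \,\}\right|,
\qquad
\mathrm{as}(f) = \mathbb{E}_{x \sim \{0,1\}^n}\!\left[ s(f, x) \right],

where x^{\oplus i} denotes x with its i-th bit flipped. Low-sensitivity functions are those whose outputs are robust to single-bit perturbations of the input.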
Compositional generalization is a fundamental trait in humans, allowing us to effortlessly combine known phrases to form novel sentences.
Since existing solvers achieve high performance on benchmark datasets for elementary-level math word problems (MWPs) containing one-unknown arithmetic word problems, such problems are often considered "solved", with the bulk of research attention moving to more complex MWPs.
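To make the problem class concrete, consider an illustrative one-unknown arithmetic word problem (constructed here for exposition, not taken from MAWPS or any other benchmark): "Tom had 3 apples and bought 5 more. How many apples does Tom have now?" The solver must map the text to a single equation in one unknown, x = 3 + 5, and report x = 8.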
We find that while recurrent models generalize nearly perfectly if the lengths of the training and test strings are drawn from the same range, they perform poorly when the test strings are longer.
Our analysis also provides insights into the role of the self-attention mechanism in modeling certain behaviors, and into the influence of positional encoding schemes on the learning and generalization abilities of the model.
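As a concrete illustration of this evaluation protocol, the sketch below shows one way to construct such a split; the parity task, the function name sample_parity_examples, and the length ranges are illustrative assumptions, not details taken from the paper.

import random

def sample_parity_examples(n_examples, min_len, max_len, seed=0):
    """Generate (bit string, parity label) pairs with lengths in [min_len, max_len]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examples):
        length = rng.randint(min_len, max_len)             # inclusive bounds
        bits = [rng.randint(0, 1) for _ in range(length)]
        data.append((bits, sum(bits) % 2))                 # label: parity of the bits
    return data

# In-distribution evaluation: train and test lengths share the same range.
train_data = sample_parity_examples(10_000, min_len=2, max_len=50, seed=0)
test_same = sample_parity_examples(1_000, min_len=2, max_len=50, seed=1)

# Length generalization: every test string is longer than any training string.
test_longer = sample_parity_examples(1_000, min_len=51, max_len=100, seed=2)

Under this split, the in-range test set measures in-distribution accuracy, while the longer-strings set isolates generalization to lengths never seen in training.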
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities.
Inducing diversity in the task of paraphrasing is an important problem in NLP with applications in data augmentation and conversational agents.