Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP

ALTA 2019 · Gaurav Arora, Afshin Rahimi, Timothy Baldwin ·

Catastrophic forgetting {---} whereby a model trained on one task is fine-tuned on a second, and in doing so, suffers a {``}catastrophic{''} drop in performance over the first task {---} is a hurdle in the development of better transfer learning techniques. Despite impressive progress in reducing catastrophic forgetting, we have limited understanding of how different architectures and hyper-parameters affect forgetting in a network. With this study, we aim to understand factors which cause forgetting during sequential training. Our primary finding is that CNNs forget less than LSTMs. We show that max-pooling is the underlying operation which helps CNNs alleviate forgetting compared to LSTMs. We also found that curriculum learning, placing a hard task towards the end of task sequence, reduces forgetting. We analysed the effect of fine-tuning contextual embeddings on catastrophic forgetting and found that using embeddings as feature extractor is preferable to fine-tuning in continual learning setup.

PDF Abstract