# Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP

Catastrophic forgetting, whereby a model trained on one task is fine-tuned on a second and in doing so suffers a "catastrophic" drop in performance on the first task, is a hurdle in the development of better transfer learning techniques. Despite impressive progress in reducing catastrophic forgetting, we have limited understanding of how different architectures and hyper-parameters affect forgetting in a network. With this study, we aim to understand the factors that cause forgetting during sequential training. Our primary finding is that CNNs forget less than LSTMs. We show that max-pooling is the underlying operation that helps CNNs alleviate forgetting compared to LSTMs. We also found that curriculum learning, in the form of placing a hard task towards the end of the task sequence, reduces forgetting. We analysed the effect of fine-tuning contextual embeddings on catastrophic forgetting and found that using embeddings as a feature extractor is preferable to fine-tuning in a continual learning setup.
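To make the definition above concrete, the following is a minimal sketch of one common way to quantify forgetting: the drop in accuracy on an earlier task after training on later ones. The abstract does not state the exact metric used in the study, so the function names, the accuracy-table layout, and the numbers below are illustrative assumptions, not results from the paper.

```python
def forgetting(acc_before, acc_after):
    """Drop in accuracy on an earlier task after training on a later task.

    acc_before: accuracy on the earlier task right after training on it.
    acc_after:  accuracy on the same task after training on the later task.
    """
    return acc_before - acc_after


def average_forgetting(acc):
    """Mean forgetting over all tasks except the last one in the sequence.

    acc[i][j] is the accuracy on task j measured after training on
    tasks 0..i in order (so row i has i + 1 entries). This layout is an
    assumption made for this sketch.
    """
    n = len(acc)
    if n < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = [forgetting(acc[j][j], acc[n - 1][j]) for j in range(n - 1)]
    return sum(drops) / len(drops)


# Made-up accuracies for a three-task sequence (not from the paper).
example_acc = [
    [0.90],
    [0.70, 0.80],
    [0.60, 0.70, 0.85],
]
example_result = average_forgetting(example_acc)
```

Under this measure, comparing two architectures means running the same task sequence with each and comparing the resulting averages; a lower value indicates less forgetting.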
