Big Bird: Transformers for Longer Sequences

In this posting, we will look at an NLP model called Big Bird. As the title suggests, it is a model built for longer sequences. Let's take a brief look at the ideas behind Big Bird.

1. Motivation

Pre-training a model like BERT requires a huge amount of data. However well known its performance may be, the model itself is so large that it is hard for ordinary researchers to even get hold of a GPU big enough to train it.

The reason so much data and such large models are needed is that the core building block of the Transformer family is multi-head self-attention, which is memory-intensive because it computes a score for every pair of tokens in the sequence. The longer the input, the more memory the model needs: this quadratic growth not only hurts the model's performance on very long sequences but also puts a hard limit on what can be loaded into GPU memory.
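
To see where the quadratic cost comes from, here is a minimal NumPy sketch of single-head scaled dot-product attention (not BERT's actual implementation): the score matrix alone has shape (n, n), so its memory footprint grows with the square of the sequence length.

import numpy as np

def full_attention(Q, K, V):
    # Single-head scaled dot-product attention over an (n, d) sequence.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n) score matrix: the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (n, d) output

# Quick usage check with a tiny sequence.
x = np.random.randn(8, 4)
print(full_attention(x, x, x).shape)               # (8, 4)

# Memory for the float32 score matrix alone, per attention head:
for n in (512, 4096):
    print(f"n={n}: ~{n * n * 4 / 2**20:.1f} MiB just for one head's scores")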

This is one of the problems with models like BERT. Do you remember that we cannot feed in a sentence with more than 512 tokens when using a BERT package? We definitely need a better model for analyzing longer sequences in a more computationally efficient way.
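
As an illustration (assuming the Hugging Face transformers package, which is one common way to use BERT), anything beyond the 512-token limit simply gets cut off:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "word " * 2000                 # far more than 512 tokens

# BERT checkpoints only have 512 position embeddings, so longer inputs
# must be truncated (or split into chunks by hand).
enc = tokenizer(long_text, truncation=True, max_length=512)
print(len(enc["input_ids"]))               # 512 -- everything after that is dropped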

The answer is Big Bird.

2. Model Architecture

Big Bird's attention consists of three components (random, local, and global), which together replace the quadratic attention operation with a linear one. The paper also proves two theoretical results showing that this method maintains the characteristics of the existing full attention. Big Bird shows outstanding performance, especially on long sequences, which is demonstrated on standard NLP tasks as well as on genomics tasks. To learn more about the theoretical proofs, check here: https://arxiv.org/pdf/2007.14062v1.pdf

Do you remember the Sparse Transformer that I introduced in an earlier posting? Doesn't the figure above look similar to it? That's right: Big Bird does not use full attention either, so it is a model that improves computational efficiency while performing similarly to full attention.

Big Bird's attention is a combination of the three structures shown above. In the paper, the attention structure is described from a graph perspective for the theoretical explanation and proofs, but I will explain it in a simpler way for easier understanding (a small mask-building sketch follows after the three patterns are combined below).

1) Random Attention: Each query attends to r randomly chosen keys; drawn as an attention map, this gives figure (a).

2) Window Attention: Each query attends only to neighboring tokens. It reflects the fact that, given linguistic structure, a token is influenced most by its adjacent tokens/words. This attention covers the tokens within a certain distance on each side and so emphasizes locality.

3) Global Attention: A global token is a token that attends to all tokens and is also attended to by all tokens. A bit tricky? It is not your fault that this is confusing; the concept of global attention is genuinely not easy to grasp at first. The authors of the paper classify global tokens into two variants: BigBird-ITC and BigBird-ETC.

In ITC (internal transformer construction), some of the existing tokens are made global; in ETC (extended transformer construction), extra tokens such as CLS are introduced and made global. So in short, global attention is the attention given to and by these designated global tokens.

** NOTE that Global Attention is the core of this model since Random + Window alone makes it difficult to reach BERT’s performance.

Big Bird combines all three attention patterns described above, which can be expressed as (d). As you can see in the picture, this attention is sparse: there are empty parts. An empty part means that the corresponding information is not used, so compared with full attention, which uses everything, many people may expect performance to degrade because of the unused information. However, the authors proved that this attention mechanism satisfies almost all the properties of full attention.
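
To make figures (a)-(d) concrete, here is a small NumPy sketch that builds boolean masks for the three patterns and takes their union. The values r, w, and g are illustrative only, and the sketch works at the token level, whereas the actual model operates on blocks of tokens for efficiency.

import numpy as np

def bigbird_style_mask(n, r=3, w=3, g=2, seed=0):
    # Toy token-level mask: True means query i may attend to key j.
    # r = random keys per query, w = one-sided window width,
    # g = number of global tokens (ITC-style: the first g tokens are made global).
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # (a) random attention: each query picks r random keys
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True

    # (b) window attention: each query attends to its w left/right neighbours
    for i in range(n):
        mask[i, max(0, i - w):min(n, i + w + 1)] = True

    # (c) global attention: global tokens attend to everyone and vice versa
    mask[:g, :] = True
    mask[:, :g] = True

    # (d) Big Bird = union of the three patterns
    return mask

m = bigbird_style_mask(16)
print(m.sum(), "of", m.size, "entries used")   # only a fraction of the full 16 x 16 grid

Each non-global row attends to at most r + (2w + 1) + g keys, so the total number of attended pairs grows roughly linearly in n rather than quadratically, which is exactly the efficiency gain described above.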

3. Evaluation results

Experiments on multiple NLP tasks showed that Big Bird performed better than other SOTA models. What was surprising is that most of the compared models were ensembles, while Big Bird was evaluated as a single model. In particular, the performance difference was most noticeable on tasks with longer inputs.
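
If you want to try Big Bird yourself, a pretrained checkpoint is available through the Hugging Face transformers library. The snippet below is a rough usage sketch (it assumes transformers, torch, and the google/bigbird-roberta-base checkpoint are available); attention_type="block_sparse" selects the random + window + global attention described above, while "original_full" would fall back to ordinary full attention.

import torch
from transformers import AutoTokenizer, BigBirdModel

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base",
                                     attention_type="block_sparse")

# A document far beyond BERT's 512-token limit, but within Big Bird's 4096.
text = "a very long document ... " * 300
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, seq_len, hidden_size)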

References:

  • https://arxiv.org/pdf/2007.14062v1.pdf
  • https://github.com/kakaobrain/nlp-paper-reading/blob/master/notes/Big_Bird.md
  • https://wain-together.tistory.com/7