Accelerating Transformer Computation with Butterfly Factorization

Main reference papers:

Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

What is the principle behind the algorithm, and what does the concrete dataflow look like?
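As a starting point for the dataflow question, here is a minimal sketch of how a product of butterfly factors is applied to a vector. It is not the papers' implementation: the function names and the `twiddles` layout are made up for illustration, and the Hadamard transform is used only because it is a well-known matrix with an exact butterfly factorization. Each of the log2(n) stages mixes pairs of entries that are `stride` apart with a 2x2 matrix, so the total work is O(n log n) instead of O(n^2).

```python
def butterfly_multiply(x, twiddles):
    """Apply log2(n) butterfly factors to the vector x.

    twiddles[s][p] is the 2x2 matrix ((a, b), (c, d)) used for the p-th
    pair at stage s (hypothetical layout, chosen for readability).
    """
    n = len(x)
    levels = n.bit_length() - 1
    assert 1 << levels == n, "n must be a power of two"
    y = list(x)
    for s in range(levels):
        stride = 1 << s
        out = [0.0] * n
        pair = 0
        for start in range(0, n, 2 * stride):
            for k in range(stride):
                i, j = start + k, start + k + stride
                (a, b), (c, d) = twiddles[s][pair]
                # One 2x2 "butterfly": entries i and j only talk to each other.
                out[i] = a * y[i] + b * y[j]
                out[j] = c * y[i] + d * y[j]
                pair += 1
        y = out
    return y


def hadamard_twiddles(n):
    """Fixed twiddles that make the butterfly product equal the Hadamard matrix."""
    levels = n.bit_length() - 1
    return [[((1.0, 1.0), (1.0, -1.0))] * (n // 2) for _ in range(levels)]


def hadamard_dense(n):
    """Dense Sylvester Hadamard matrix: H[i][j] = (-1)^popcount(i & j)."""
    return [[(-1.0) ** bin(i & j).count("1") for j in range(n)] for i in range(n)]


def dense_multiply(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]
```

In a learned butterfly layer the 2x2 entries would be trainable parameters rather than the fixed +1/-1 values used here.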

How is the algorithm implemented in code? Can it be applied to every matrix multiplication?

If every weight can be expressed with the following two formulas, how is the activation multiplied with them?
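One possible answer, as a sketch under stated assumptions: suppose the weight has the form W = gamma * B + (1 - gamma) * U V^T, with B block-sparse and U V^T low-rank. The exact formulas are not reproduced in these notes, so this form, the weighting `gamma`, and all names below are assumptions for illustration. The activation x is then multiplied with each term separately: only the stored blocks of B are touched, and the low-rank term is computed as (x U) V^T so that U V^T is never materialized.

```python
def matmul(A, B):
    """Plain dense matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def block_sparse_matmul(x, blocks, n_out, bs):
    """Compute x @ B where B is stored as {(block_row, block_col): bs x bs block}.

    Block (bi, bj) maps input features bi*bs.. to output features bj*bs..;
    blocks that are not stored are treated as zero and never visited.
    """
    out = [[0.0] * n_out for _ in x]
    for (bi, bj), blk in blocks.items():
        for r, row in enumerate(x):
            for q in range(bs):
                acc = 0.0
                for p in range(bs):
                    acc += row[bi * bs + p] * blk[p][q]
                out[r][bj * bs + q] += acc
    return out


def sparse_plus_lowrank(x, blocks, U, V, n_out, bs, gamma=0.5):
    """y = gamma * (x @ B) + (1 - gamma) * ((x @ U) @ V^T); gamma is an assumed mixing weight."""
    sparse_part = block_sparse_matmul(x, blocks, n_out, bs)
    low_rank_part = matmul(matmul(x, U), transpose(V))  # rank-r bottleneck first
    return [
        [gamma * s + (1 - gamma) * l for s, l in zip(rs, rl)]
        for rs, rl in zip(sparse_part, low_rank_part)
    ]
```

Here x has shape (batch, n_in), U has shape (n_in, r) and V has shape (n_out, r); the result matches multiplying x by the dense W, but without ever building W.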

How much of the computation do the flat block butterfly matrix and the low-rank matrix each account for?

Judging from this sentence, the split of computation between the two appears to be chosen subjectively?
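To make the question concrete, the following back-of-the-envelope sketch counts multiply-accumulates (MACs) per input row for each term. It assumes a square n x n weight, a flat block butterfly pattern in which block (i, j) is non-zero when i == j or i XOR j is a power of two up to `max_stride`, and a rank-`rank` low-rank term. The pattern definition and all names are assumptions for illustration, not taken from the paper's code. Under these assumptions the split is fixed entirely by the hyperparameters `block_size`, `max_stride`, and `rank`.

```python
def flat_butterfly_blocks(n_blocks, max_stride):
    """Set of non-zero block positions for the assumed flat butterfly pattern."""
    strides = []
    s = 1
    while s <= max_stride and s < n_blocks:
        strides.append(s)
        s *= 2
    blocks = set()
    for i in range(n_blocks):
        blocks.add((i, i))           # diagonal block
        for s in strides:
            blocks.add((i, i ^ s))   # partner block at distance s
    return blocks


def mac_split(n, block_size, max_stride, rank):
    """Return (sparse MACs, low-rank MACs, sparse fraction) per input row."""
    n_blocks = n // block_size
    sparse_macs = len(flat_butterfly_blocks(n_blocks, max_stride)) * block_size ** 2
    low_rank_macs = 2 * n * rank     # x @ U costs n*rank, then @ V^T costs rank*n
    total = sparse_macs + low_rank_macs
    return sparse_macs, low_rank_macs, sparse_macs / total


if __name__ == "__main__":
    print(mac_split(n=1024, block_size=32, max_stride=4, rank=64))
```

With n = 1024, block size 32, maximum stride 4, and rank 64, both terms cost 131072 MACs per row, so the split is exactly half and half; changing the rank or the stride shifts the balance.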

What are the limitations of the algorithm?

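The algorithm uses a cost model to estimate the cost of a computation. Because of memory coalescing (accessing a single memory cell costs as much as accessing a whole block of memory), the distribution of non-zero elements in a sparse matrix also determines how many memory accesses are needed to compute with it. Hence the paper's statement that "exploiting hardware locality is crucial to obtain speed up".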
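A toy version of this idea, as a sketch only: assume that touching any element of a `block_size` x `block_size` tile costs as much as loading the whole tile, so the memory cost of a sparsity pattern is the number of distinct tiles it touches times the tile size. This is a simplification made for illustration, not the paper's exact cost model, and the function names are hypothetical.

```python
def blocks_touched(nonzeros, block_size):
    """Number of distinct tiles that contain at least one non-zero (i, j)."""
    return len({(i // block_size, j // block_size) for i, j in nonzeros})


def memory_cost(nonzeros, block_size):
    """Elements loaded if every touched tile must be fetched in full."""
    return blocks_touched(nonzeros, block_size) * block_size ** 2


# Two 8x8 patterns with the same number of non-zeros (16 each).
aligned = [(i, j) for i in range(4) for j in range(4)]              # one dense 4x4 tile
scattered = [(i, j) for i in range(8) for j in (i, (i + 4) % 8)]    # two diagonals

if __name__ == "__main__":
    print(memory_cost(aligned, 4), memory_cost(scattered, 4))
```

Both patterns have 16 non-zeros, but the block-aligned one touches a single tile (16 elements loaded) while the scattered one touches all four tiles (64 elements loaded), which is why block-aligned sparsity is friendlier to the hardware.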

Environment setup

conda create -n fly python=3.8
conda activate fly
ssh-keygen -t rsa -b 4096 -C "yc17483@umac.mo"
ssh-agent bash
ssh-add ~/.ssh/id_rsa
ssh -T git@github.com
git clone git@github.com:JohnsonZ-microe/fly.git

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install torchtext
pip install munch
pip install einops
pip install timm
pip install hydra-core
pip install hydra-colorlog
pip install python-dotenv
pip install rich
pip install pytorch-lightning
pip install lightning-bolts
pip install scipy
pip install datasets
pip install wandb

## run imagenet_preprocess

mkdir -p checkpoints/t2tvit
cd checkpoints/t2tvit
wget https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.7_T2T_ViTt_14.pth.tar