| Model Name | Training Time | Date | Hardware | Data Size | Paper Link |
| --- | --- | --- | --- | --- | --- |
| Transformer | 12 hours | 2017.06 | 8 P100 GPUs | 37,000 tokens | http://arxiv.org/abs/1706.03762 |
| BERT | 81.4 hours | 2018.10 | 16 TPUs | 3.3B-word corpus | https://arxiv.org/abs/1810.04805 |
| BERT | 76 minutes | 2019.04 | 1024 TPU chips | 3.3B-word corpus | https://arxiv.org/abs/1904.00962 |
| XLNet | 2.5 days | 2019.06 | 512 TPU v3 chips | 32.89B subword pieces | https://arxiv.org/abs/1906.08237 |
| ResNet-50 | 2.2 minutes | 2018.11 | TPU v3 Pod | ImageNet | https://arxiv.org/abs/1811.06992 |
| ResNet-50 | 75 seconds | 2019.03 | 2048 V100 GPUs | ImageNet | https://arxiv.org/abs/1903.12650 |
| GPT | ~1 month | 2018.06 | 8 GPUs | BooksCorpus (800M words) | https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf |
| GPT-2 | ~1 week | 2019.02 | 256 Google Cloud TPU v3 cores | 23 million URLs, over 10 million HTML pages | https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf |
| BigGAN | 24-48 hours | 2018.09 | TPU v3 Pod | ImageNet | https://arxiv.org/abs/1809.11096 |
| ResNeXt-101 32x16d | 22 days | 2018.05 | 336 GPUs | 3.5 billion images | https://research.fb.com/wp-content/uploads/2018/05/exploring_the_limits_of_weakly_supervised_pretraining.pdf |
| RoBERTa | 1 day | 2019.07 | 1024 V100 GPUs | 4x XLNet, 40x BERT | https://arxiv.org/pdf/1907.11692.pdf |
| ELMo | 2 weeks | 2018.02 | 3 GTX 1080 GPUs | 5.5B tokens | https://arxiv.org/abs/1802.05365 |