aditya-malte/smallberta_pretraining.ipynb

Created February 22, 2020 13:41


smallBERTa_Pretraining.ipynb


carlstrath commented Jan 18, 2021

I get this error in line 20

TypeError Traceback (most recent call last)
in ()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
329 A path to the destination Tokenizer file
330 """
--> 331 return self._tokenizer.save(path, pretty)
332
333 def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool

julien-c commented Jan 18, 2021

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)

carlstrath commented Jan 29, 2021 • edited

Sorry to bother everyone again. I am now getting this error at ln 27:

python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory

Author

aditya-malte commented Jan 29, 2021

Hi @carlstrath,
(Sorry I’ve been a bit busy lately so wasn’t active).

This gist was made for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?

Meanwhile, I guess it’s about time now that I update this gist to reflect changes in the dependencies.
Thanks
Aditya

Author

aditya-malte commented Jan 29, 2021 • edited

Also, when cloning from GitHub, please make sure you use the v2.5.0 tag (https://github.com/huggingface/transformers/tree/v2.5.0) rather than the default branch. The gist is compatible with that version of transformers; the newer one probably doesn’t contain the required run_language_modeling.py file at the path the notebook expects.
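
For anyone who wants to script that step, here is a minimal sketch of cloning only the v2.5.0 tag and checking that the training script is where the notebook looks for it. The helper names (`clone_command`, `training_script`, `clone_and_check`) are made up for illustration and are not part of the gist; the script path is taken from the error message earlier in this thread, and the clone itself needs git and network access.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/huggingface/transformers"
TAG = "v2.5.0"  # the release the gist author says the notebook targets


def clone_command(dest, tag=TAG):
    """Build a git command that fetches only the tagged release (shallow clone)."""
    return ["git", "clone", "--depth", "1", "--branch", tag, REPO_URL, str(dest)]


def training_script(dest):
    """Path where the notebook expects the script (from the error message above)."""
    return Path(dest) / "examples" / "run_language_modeling.py"


def clone_and_check(dest):
    """Clone the tag into `dest` and fail early if the script is missing."""
    subprocess.run(clone_command(dest), check=True)  # raises if git fails
    script = training_script(dest)
    if not script.is_file():
        raise FileNotFoundError(f"{script} is missing; check the tag you cloned")
    return script
```

In Colab this would be called as `clone_and_check("/content/transformers")`, which matches the directory used in the error above.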

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example on training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.

https://github.com/sv-v5/train-roberta-ua

Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows on GPU and CPU.

Happy training

Author

aditya-malte commented Sep 11, 2021

Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now my Master’s), so your updated script is much appreciated.

mgiardinelli commented Oct 10, 2021

I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
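
If the notebook has to run against both older and newer tokenizers releases, one option is a small wrapper that picks whichever call the installed version supports. This is only a sketch: `save_vocab_files` is a hypothetical helper, not part of the gist or the library, and it assumes `tokenizer` is the trained tokenizer object from the notebook. It prefers `save_model(directory, prefix)` as reported above and falls back to the older two-argument `save(directory, prefix)` that the gist originally used.

```python
import os


def save_vocab_files(tokenizer, directory, prefix):
    """Write the tokenizer's vocabulary files under `directory`.

    Prefers save_model(directory, prefix), which newer tokenizers releases
    provide; falls back to the older save(directory, prefix) otherwise.
    """
    os.makedirs(directory, exist_ok=True)  # make sure the target directory exists
    if hasattr(tokenizer, "save_model"):
        return tokenizer.save_model(directory, prefix)
    # Older releases: save() took a directory and a name prefix.
    return tokenizer.save(directory, prefix)
```

Usage in the notebook would then be `save_vocab_files(tokenizer, "/content/models/smallBERTa", "smallBERTa")`. Checking for the method instead of catching `TypeError` avoids hiding unrelated errors raised while saving.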
