@aditya-malte
Created February 22, 2020 13:41
smallBERTa_Pretraining.ipynb
@carlstrath

I get this error in line 20

TypeError Traceback (most recent call last)
in ()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
329 A path to the destination Tokenizer file
330 """
--> 331 return self._tokenizer.save(path, pretty)
332
333 def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool

@julien-c

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)

@carlstrath

carlstrath commented Jan 29, 2021

Sorry to bother everyone again. I am now getting this error in ln27

python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory

@aditya-malte
Author

Hi @carlstrath,
(Sorry I’ve been a bit busy lately so wasn’t active).

This gist was written for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?

Meanwhile, I guess it’s about time now that I update this gist to reflect changes in the dependencies.
Thanks
Aditya

@aditya-malte
Author

aditya-malte commented Jan 29, 2021

Also, when cloning from git, please make sure you use the v2.5.0 tag of the GitHub repo (https://github.com/huggingface/transformers/tree/v2.5.0) instead. The gist is compatible with that version of huggingface transformers; the newer one probably doesn’t contain the required run_language_modeling file.
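
For anyone scripting that step, here is a minimal sketch of a clone pinned to the v2.5.0 tag. The helper names (`clone_command`, `clone_pinned`) and the default destination folder are assumptions for illustration, not part of the gist; `git clone --branch` accepts a tag name as well as a branch.

```python
import subprocess

REPO_URL = "https://github.com/huggingface/transformers.git"


def clone_command(tag="v2.5.0", dest="transformers"):
    """Build a shallow `git clone` command pinned to a single release tag."""
    # --branch also accepts tag names; --depth 1 skips the rest of the history.
    return ["git", "clone", "--depth", "1", "--branch", tag, REPO_URL, dest]


def clone_pinned(tag="v2.5.0", dest="transformers"):
    """Run the clone; raises subprocess.CalledProcessError if git fails."""
    subprocess.run(clone_command(tag, dest), check=True)
```

Calling `clone_pinned(dest="/content/transformers")` should then leave the script at the path the notebook expects, assuming the tag still exists upstream.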

@sv-v5

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example on training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.

https://github.com/sv-v5/train-roberta-ua

Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows on GPU and CPU.

Happy training
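
Two locations for the script come up in this thread: `examples/run_language_modeling.py` (the path the notebook calls) and `examples/legacy/run_language_modeling.py` (the path in the commit linked above). A small sketch that checks a local checkout for either one; `find_lm_script` and `CANDIDATES` are hypothetical names, and other releases may keep the script somewhere else entirely.

```python
from pathlib import Path

# The two relative paths mentioned in this thread; other releases may differ.
CANDIDATES = (
    "examples/run_language_modeling.py",
    "examples/legacy/run_language_modeling.py",
)


def find_lm_script(repo_root):
    """Return the first candidate path that exists under repo_root, else None."""
    root = Path(repo_root)
    for rel in CANDIDATES:
        candidate = root / rel
        if candidate.is_file():
            return candidate
    return None
```

If this returns `None` for your checkout, the clone is probably on a version that has moved or dropped the script.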

@aditya-malte
Author

Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now my Master’s), so your updated script is much appreciated.

@mgiardinelli

I had to update step #26 from tokenizer.save to tokenizer.save_model. FYI

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
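
Background on the TypeError earlier in the thread: the traceback shows the newer signature `save(self, path, pretty)`, so the second positional argument is read as the `pretty` flag and a string like "smallBERTa" cannot be converted to a bool. Below is a minimal sketch of a helper that works with either API. The name `save_vocab_files` is hypothetical, and the fallback assumes that older releases accept `save(directory, name)` the way the original notebook called it.

```python
import os


def save_vocab_files(tokenizer, directory, prefix):
    """Write the vocab/merges files using whichever API the tokenizer exposes."""
    # Create the target directory up front so the save call has somewhere to write.
    os.makedirs(directory, exist_ok=True)
    if hasattr(tokenizer, "save_model"):
        # Newer tokenizers: save_model(directory, prefix) writes the model files.
        return tokenizer.save_model(directory, prefix)
    # Assumption: older tokenizers, where save(directory, name) did the same job.
    return tokenizer.save(directory, prefix)
```

Usage would look like `save_vocab_files(tokenizer, "/content/models/smallBERTa", "smallBERTa")`. Pinning the library versions listed at the top of the gist avoids needing the branch at all.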
