
@aditya-malte
Created February 22, 2020 13:41
smallBERTa_Pretraining.ipynb
@jbmaxwell

First off, thanks so much for sharing this—it definitely helped me get a lot further along!
I was hoping to use my own tokenizer, though, so I'm guessing the only way would be to write the tokenizer, then replace the LineByLineTextDataset() call in load_and_cache_examples() with my custom dataset, yes?
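Something along these lines is what I have in mind (just a rough sketch; the class name and the assumption that the tokenizer exposes an encode() method returning token ids are mine, not from the gist):

import torch
from torch.utils.data import Dataset

class CustomLineDataset(Dataset):
    """Hypothetical drop-in replacement for LineByLineTextDataset."""

    def __init__(self, tokenizer, file_path, block_size=512):
        with open(file_path, encoding="utf-8") as f:
            lines = [l for l in f.read().splitlines() if l.strip()]
        # Tokenize every non-empty line up front and truncate to block_size ids.
        self.examples = [tokenizer.encode(line)[:block_size] for line in lines]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)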

@aditya-malte
Author

I think you mean a custom “Dataset Loader”, because the code above already uses a custom tokenizer.

@jbmaxwell

jbmaxwell commented Feb 24, 2020

I see what you mean—a custom parametrization of the BPE tokenizer. But my use case is very specialized (music), so I actually want a very specific tokenization. But yes, I may be able to specify it the way you've done here. I'll think more about that. Thanks!
P.S. I did get it running with the given tokenizer, so that's a huge step forward!

@aditya-malte
Author

You’re welcome :)
Also, do share this gist on your network.

@jbmaxwell

jbmaxwell commented Feb 24, 2020

I'm struggling with trying to use a fixed vocabulary. My vocab.txt (for music) is small, and I want to avoid wordpieces so that I don't have to predict multiple adjacent pieces/tokens to get a "complete" prediction/"word". So all I want to do is load a vocab.txt and tokenize. Super simple, but I can't find a way to do that.
(If I can't find a way to do this, I'll just settle for the BPE tokenizer and figure out a way around the problems when I deploy it.)
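In case it helps anyone later, one hedged option for a fixed vocabulary is a word-level model from the tokenizers library (a sketch only; the file path, the <unk> token, and the exact constructor signature depend on the tokenizers version installed):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a token -> id mapping from vocab.txt (one token per line).
with open("vocab.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]
vocab = {tok: idx for idx, tok in enumerate(tokens)}
vocab.setdefault("<unk>", len(vocab))  # make sure the unknown token has an id

tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()  # whitespace splitting only, no subword pieces

print(tokenizer.encode("C4 E4 G4").tokens)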

Curious: is there a simple way to load weights and continue training?

@mrm8488

mrm8488 commented Feb 28, 2020

Great work! I ran the Colab you provided (https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb) and got this error:

02/27/2020 17:23:08 - INFO - __main__ -   Training new model from scratch
02/27/2020 17:23:16 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name='./EsperBERTo', device=device(type='cuda'), do_eval=False, do_train=True, eval_all_checkpoints=False, eval_data_file=None, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=0.0001, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='./EsperBERTo-small-v1', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=16, save_steps=2000, save_total_limit=2, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='./EsperBERTo', train_data_file='./oscar.eo.txt', warmup_steps=0, weight_decay=0.0)
02/27/2020 17:23:16 - INFO - __main__ -   Loading features from cached file ./roberta_cached_lm_999999999998_oscar.eo.txt
Traceback (most recent call last):
  File "transformers/examples/run_language_modeling.py", line 799, in <module>
    main()
  File "transformers/examples/run_language_modeling.py", line 749, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "transformers/examples/run_language_modeling.py", line 245, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

@julien-c

@mrm8488 should be fixed now thanks to huggingface/blog#8

@008karan

008karan commented Mar 6, 2020

I want to train an ALBERT-like model; what changes do I need to make in run_pretraining.py?

@julien-c Since training an ALBERT-like model requires generating pre-training data, is that pre-training data generated during training itself?

@Nix07

Nix07 commented Mar 17, 2020

@jbmaxwell You can try other tokenizers like CharBPETokenizer, SentencePieceBPETokenizer, etc., to see if one of them works for you.

To load weights and continue training, you can use the model_name_or_path parameter and point it to the latest checkpoint.
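For example, something like this (a sketch only, mirroring the cmd-string style used elsewhere in this thread; checkpoint-2000 is a placeholder for whatever the latest checkpoint directory is, and weights_dir/train_path/eval_path are assumed to be the variables from the notebook):

cmd = '''python transformers/examples/run_language_modeling.py --output_dir {0}
 --model_type roberta
 --model_name_or_path {0}/checkpoint-2000
 --mlm
 --line_by_line
 --train_data_file {1}
 --eval_data_file {2}
 --tokenizer_name /content/models/smallBERTa
 --do_train
 --do_eval
 --num_train_epochs 1
 '''.format(weights_dir, train_path, eval_path)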

@PhilipMay

How do I have to preprocess the corpus when I want to train my own LM for RoBERTa? I think it must be one sentence per row, but does it need empty lines between documents? And is it OK to shuffle the text line by line?

@Shafi2016

I get the following error

  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 296, in <module>
    main()
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 188, in main
    config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
  File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py", line 217, in from_pretrained
    "in its name: {}".format(pretrained_model_name_or_path, ", ".join(CONFIG_MAPPING.keys()))
ValueError: Unrecognized model in /content/models/smallBERTa. Should have a model_type key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder

This happens after running the following command:

cmd = '''python /content/transformers/examples/language-modeling/run_language_modeling.py --output_dir {0}
--model_type roberta
--mlm
--train_data_file {1}
--eval_data_file {2}
--config_name /content/models/smallBERTa
--tokenizer_name /content/models/smallBERTa
--do_train
--line_by_line
--overwrite_output_dir
--do_eval
--block_size 256
--learning_rate 1e-4
--num_train_epochs 5
--save_total_limit 2
--save_steps 2000
--logging_steps 500
--per_gpu_eval_batch_size 32
--per_gpu_train_batch_size 32
--evaluate_during_training
--seed 42
'''.format(weights_dir, train_path, eval_path)

Please let me know how to fix this error.
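In case it's useful: judging from the error message, AutoConfig in newer transformers versions expects a model_type key in config.json. A minimal sketch of one possible fix (assuming the config directory from the command above):

import json

cfg_path = "/content/models/smallBERTa/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg.setdefault("model_type", "roberta")  # the key AutoConfig is asking for
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)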

@RobertHua96

I hope this isn't a silly question; I'm very new to NLP and AI in general. I find the advantages of a byte-level BPE ("bytepiece") encoder very enticing, and I'm hoping to continue pretraining DistilBERT on a custom corpus.

Is it possible to:

  1. Train that bytepiece encoder on the dataset
  2. Load it in with DistilBERT (from HF's checkpoint)
  3. Continue pretraining DistilBERT with that byte-level BPE tokenizer on the custom corpus?
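Roughly what I have in mind, as a sketch only (the model name, paths, and hyperparameters below are placeholders, and swapping the tokenizer means the pretrained input embeddings no longer line up with the new ids):

import os
from tokenizers import ByteLevelBPETokenizer
from transformers import DistilBertForMaskedLM

# 1. Train a byte-level BPE tokenizer on the custom corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=30_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")

# 2. Load the pretrained DistilBERT checkpoint from the Hub.
model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# 3. Make room for the new vocabulary before continuing MLM pretraining;
#    the input embeddings are effectively relearned for the new ids.
model.resize_token_embeddings(tokenizer.get_vocab_size())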

@NianzuMa

NianzuMa commented Aug 2, 2020

Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:

To the Tokenizer:
LM data in a directory containing all samples in separate *.txt files.

There is also this code snippet:

import os
from tqdm import tqdm

# `data` (presumably a pandas Series of text samples) and `txt_files_dir`
# are defined earlier in the gist.
i = 0
for row in tqdm(data.to_list()):
  file_name = os.path.join(txt_files_dir, str(i) + '.txt')
  try:
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)
  i += 1

What this does is write each sentence to its own .txt file, rather than putting all 200_000 sentences line by line in a single file.

In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec

the file oscar.eo.txt contains all sentences line by line in a single file.

I tried searching the documentation but have no clue which way is correct.

Is it necessary to split each sentence into its own file, which results in 200_000 files?

Thank you for your answer.
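For what it's worth, the tokenizer trainer just takes a list of file paths, so either layout appears to work (a sketch; paths and hyperparameters are placeholders):

import glob
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Either many small files (one sample per file, as in this gist) ...
paths = glob.glob("txt_files_dir/*.txt")
# ... or a single file with one sentence per line (as in the blog notebook):
# paths = ["oscar.eo.txt"]

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])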

@amazingsmash

I'm kind of new to this, but playing around with the code a bit, I noticed that the call tokenizer.save() should be changed to tokenizer.save_model().

Let me know whether my hunch is correct. :)

@carlstrath

I get this error in line 20

TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    329                 A path to the destination Tokenizer file
    330             """
--> 331             return self._tokenizer.save(path, pretty)
    332 
    333         def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool

@julien-c

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)
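A small sketch of the difference, as far as I understand it (paths are placeholders and the behaviour depends on the tokenizers version):

import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["train.txt"], vocab_size=5_000)

os.makedirs("models/smallBERTa", exist_ok=True)
tokenizer.save("models/smallBERTa/tokenizer.json")  # newer API: one self-contained JSON file
tokenizer.save_model("models/smallBERTa")           # writes vocab.json and merges.txt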

@carlstrath

carlstrath commented Jan 29, 2021

Sorry to bother everyone again. I am now getting this error at line 27:

python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory

@aditya-malte
Author

Hi @carlstrath,
(Sorry, I’ve been a bit busy lately, so I wasn’t active.)

This gist was made for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?

Meanwhile, I guess it’s about time I updated this gist to reflect the changes in the dependencies.
Thanks
Aditya

@aditya-malte
Author

aditya-malte commented Jan 29, 2021

Also, when cloning from git, please ensure you use the v2.5.0 GitHub repo instead (https://github.com/huggingface/transformers/tree/v2.5.0), as the gist is compatible with that version of the Hugging Face libraries; the newer one probably doesn’t contain the required run_language_modeling file.
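For example, pinning the clone to that tag (a sketch, in the same cmd-string style used in the notebook):

cmd = "git clone --branch v2.5.0 https://github.com/huggingface/transformers.git"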

@sv-v5

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example on training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.

https://github.com/sv-v5/train-roberta-ua

Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows, on GPU and CPU.

Happy training

@aditya-malte
Author

Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now with my Master’s), so your updated script is much appreciated.

@mgiardinelli

I had to update step #26 from tokenizer.save to tokenizer.save_model, FYI:

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
