
@aditya-malte
Created February 22, 2020 13:41
smallBERTa_Pretraining.ipynb
@jbmaxwell

First off, thanks so much for sharing this—it definitely helped me get a lot further along!
I was hoping to use my own tokenizer, though, so I'm guessing the only way would be to write the tokenizer, then replace the LineByLineTextDataset() call in load_and_cache_examples() with my custom dataset, yes?
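Something along these lines is what I have in mind (just a rough sketch; the class name and the assumption that the tokenizer exposes an encode() method returning token ids are mine, not from the gist):

import torch
from torch.utils.data import Dataset

class CustomLineDataset(Dataset):
    """Hypothetical drop-in replacement for LineByLineTextDataset."""

    def __init__(self, tokenizer, file_path, block_size=512):
        with open(file_path, encoding="utf-8") as f:
            lines = [l for l in f.read().splitlines() if l.strip()]
        # Tokenize every non-empty line up front and truncate to block_size ids.
        self.examples = [tokenizer.encode(line)[:block_size] for line in lines]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)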

@aditya-malte
Author

I think you mean a custom “Dataset Loader”, because the code above already uses a custom tokenizer.

@jbmaxwell

jbmaxwell commented Feb 24, 2020

I see what you mean—a custom parametrization of the BPE tokenizer. But my use case is very specialized (music), so I actually want a very specific tokenization. But yes, I may be able to specify it the way you've done here. I'll think more about that. Thanks!
P.S. I did get it running with the given tokenizer, so that's a huge step forward!

@aditya-malte
Author

You’re welcome :)
Also, do share this gist on your network.

@jbmaxwell

jbmaxwell commented Feb 24, 2020

I'm struggling with trying to use a fixed vocabulary. My vocab.txt (for music) is small, and I want to avoid wordpieces so that I don't have to predict multiple adjacent pieces/tokens to get a "complete" prediction/"word". So all I want to do is load a vocab.txt and tokenize. Super simple, but I can't find a way to do that.
(If I can't find a way to do this, I'll just settle for the BPE tokenizer and figure out a way around the problems when I deploy it.)
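In case it helps anyone later, one hedged option for a fixed vocabulary is a word-level model from the tokenizers library (a sketch only; the file path, the <unk> token, and the exact constructor signature depend on the tokenizers version installed):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a token -> id mapping from vocab.txt (one token per line).
with open("vocab.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]
vocab = {tok: idx for idx, tok in enumerate(tokens)}
vocab.setdefault("<unk>", len(vocab))  # make sure the unknown token has an id

tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()  # whitespace splitting only, no subword pieces

print(tokenizer.encode("C4 E4 G4").tokens)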

Curious: is there a simple way to load weights and continue training?

@mrm8488

mrm8488 commented Feb 28, 2020

Great work! I ran the Colab you provided (https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb) and got this error:

02/27/2020 17:23:08 - INFO - __main__ -   Training new model from scratch
02/27/2020 17:23:16 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name='./EsperBERTo', device=device(type='cuda'), do_eval=False, do_train=True, eval_all_checkpoints=False, eval_data_file=None, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=0.0001, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='./EsperBERTo-small-v1', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=16, save_steps=2000, save_total_limit=2, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='./EsperBERTo', train_data_file='./oscar.eo.txt', warmup_steps=0, weight_decay=0.0)
02/27/2020 17:23:16 - INFO - __main__ -   Loading features from cached file ./roberta_cached_lm_999999999998_oscar.eo.txt
Traceback (most recent call last):
  File "transformers/examples/run_language_modeling.py", line 799, in <module>
    main()
  File "transformers/examples/run_language_modeling.py", line 749, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "transformers/examples/run_language_modeling.py", line 245, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

@julien-c

@mrm8488 should be fixed now thanks to huggingface/blog#8

@008karan

008karan commented Mar 6, 2020

I want to train an ALBERT-like model; what changes do I need to make in run_pretraining.py?

@julien-c Since training an ALBERT-like model requires generating pre-training data, is that pre-training data generated during training itself?

@Nix07

Nix07 commented Mar 17, 2020

@jbmaxwell You can try other tokenizers like CharBPETokenizer, SentencePieceBPETokenizer, etc., to see if one of them works for you.

To load weights and continue training, you can use the model_name_or_path parameter and point it to the latest checkpoint.
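For example, something like this (a sketch only, mirroring the cmd-string style used elsewhere in this thread; checkpoint-2000 is a placeholder for whatever the latest checkpoint directory is, and weights_dir/train_path/eval_path are assumed to be the variables from the notebook):

cmd = '''python transformers/examples/run_language_modeling.py --output_dir {0}
 --model_type roberta
 --model_name_or_path {0}/checkpoint-2000
 --mlm
 --line_by_line
 --train_data_file {1}
 --eval_data_file {2}
 --tokenizer_name /content/models/smallBERTa
 --do_train
 --do_eval
 --num_train_epochs 1
 '''.format(weights_dir, train_path, eval_path)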

@PhilipMay

How do I have to preprocess the corpus when I want to train my own LM for RoBERTa? I think it must be one sentence per row, but does it need empty lines between documents? And is it OK to shuffle the text line by line?

@Shafi2016

I get the following error

  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 296, in <module>
    main()
  File "/content/transformers/examples/language-modeling/run_language_modeling.py", line 188, in main
    config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
  File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py", line 217, in from_pretrained
    "in its name: {}".format(pretrained_model_name_or_path, ", ".join(CONFIG_MAPPING.keys()))
ValueError: Unrecognized model in /content/models/smallBERTa. Should have a model_type key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder

This happens after running the following command:

cmd = '''python /content/transformers/examples/language-modeling/run_language_modeling.py --output_dir {0}
--model_type roberta
--mlm
--train_data_file {1}
--eval_data_file {2}
--config_name /content/models/smallBERTa
--tokenizer_name /content/models/smallBERTa
--do_train
--line_by_line
--overwrite_output_dir
--do_eval
--block_size 256
--learning_rate 1e-4
--num_train_epochs 5
--save_total_limit 2
--save_steps 2000
--logging_steps 500
--per_gpu_eval_batch_size 32
--per_gpu_train_batch_size 32
--evaluate_during_training
--seed 42
'''.format(weights_dir, train_path, eval_path)

Please let me know how to fix this error.
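In case it's useful: judging from the error message, AutoConfig in newer transformers versions expects a model_type key in config.json. A minimal sketch of one possible fix (assuming the config directory from the command above):

import json

cfg_path = "/content/models/smallBERTa/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg.setdefault("model_type", "roberta")  # the key AutoConfig is asking for
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)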

@RobertHua96

I hope this isn't a silly question; I'm very new to NLP and AI in general. I find the advantages of a byte-level BPE ("bytepiece") encoder very enticing, and I'm hoping to continue pretraining DistilBERT on a custom corpus.

Is it possible to:

  1. Train that bytepiece encoder on the dataset
  2. Load it in with DistilBERT (from HF's checkpoint)
  3. Continue pretraining DistilBERT with that byte-level BPE tokenizer on the custom corpus?
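Roughly what I have in mind, as a sketch only (the model name, paths, and hyperparameters below are placeholders, and swapping the tokenizer means the pretrained input embeddings no longer line up with the new ids):

import os
from tokenizers import ByteLevelBPETokenizer
from transformers import DistilBertForMaskedLM

# 1. Train a byte-level BPE tokenizer on the custom corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=30_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")

# 2. Load the pretrained DistilBERT checkpoint from the Hub.
model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# 3. Make room for the new vocabulary before continuing MLM pretraining;
#    the input embeddings are effectively relearned for the new ids.
model.resize_token_embeddings(tokenizer.get_vocab_size())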

@NianzuMa

NianzuMa commented Aug 2, 2020

Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:

To the Tokenizer:
LM data in a directory containing all samples in separate *.txt files.

There is also this code snippet:

import os
from tqdm import tqdm

# `data` (presumably a pandas Series of text samples) and `txt_files_dir`
# are defined earlier in the gist.
i = 0
for row in tqdm(data.to_list()):
  file_name = os.path.join(txt_files_dir, str(i) + '.txt')
  try:
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)
  i += 1

What this does is write each sentence to its own .txt file, rather than putting all 200_000 sentences line by line in a single file.

In contrast, in this tutorial: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=HOk4iZ9YZvec

the file oscar.eo.txt contains all sentences line by line in a single file.

I tried searching the documentation but have no clue which way is correct.

Is it necessary to split each sentence into its own file, which results in 200_000 files?

Thank you for your answer.
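For what it's worth, the tokenizer trainer just takes a list of file paths, so either layout appears to work (a sketch; paths and hyperparameters are placeholders):

import glob
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Either many small files (one sample per file, as in this gist) ...
paths = glob.glob("txt_files_dir/*.txt")
# ... or a single file with one sentence per line (as in the blog notebook):
# paths = ["oscar.eo.txt"]

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])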

@amazingsmash

I'm kind of new to this, but playing around with the code a bit, I noticed that the call tokenizer.save() should be changed to tokenizer.save_model().

Let me know whether my hunch is correct. :)

@carlstrath

I get this error in line 20

TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 tokenizer.save("/content/models/smallBERTa", "smallBERTa")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    329                 A path to the destination Tokenizer file
    330             """
--> 331             return self._tokenizer.save(path, pretty)
    332 
    333         def to_str(self, pretty: bool = False):

TypeError: Can't convert 'smallBERTa' to PyBool

@julien-c

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)
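A small sketch of the difference, as far as I understand it (paths are placeholders and the behaviour depends on the tokenizers version):

import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["train.txt"], vocab_size=5_000)

os.makedirs("models/smallBERTa", exist_ok=True)
tokenizer.save("models/smallBERTa/tokenizer.json")  # newer API: one self-contained JSON file
tokenizer.save_model("models/smallBERTa")           # writes vocab.json and merges.txt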

@carlstrath

carlstrath commented Jan 29, 2021

Sorry to bother everyone again. I am now getting this error at line 27:

python3: can't open file '/content/transformers/examples/run_language_modeling.py': [Errno 2] No such file or directory

@aditya-malte
Author

Hi @carlstrath,
(Sorry, I’ve been a bit busy lately, so I wasn’t active.)

This gist was made for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?

Meanwhile, I guess it’s about time I updated this gist to reflect the changes in the dependencies.
Thanks
Aditya

@aditya-malte
Author

aditya-malte commented Jan 29, 2021

Also, when cloning from git, please ensure you use the v2.5.0 GitHub repo instead (https://github.com/huggingface/transformers/tree/v2.5.0), as the gist is compatible with that version of the Hugging Face libraries; the newer one probably doesn’t contain the required run_language_modeling file.
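For example, pinning the clone to that tag (a sketch, in the same cmd-string style used in the notebook):

cmd = "git clone --branch v2.5.0 https://github.com/huggingface/transformers.git"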

@sv-v5

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post https://huggingface.co/blog/how-to-train. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example on training from scratch using transformers 4.10 and the updated run_language_modeling.py script (https://github.com/huggingface/transformers/blob/4a872caef4e70595202c64687a074f99772d8e92/examples/legacy/run_language_modeling.py) committed on Jun 25, 2021.

https://github.com/sv-v5/train-roberta-ua

Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows, on GPU and CPU.

Happy training

@aditya-malte
Author

Hi,
That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now with my Master’s), so your updated script is much appreciated.

@mgiardinelli

I had to update step #26 from tokenizer.save to tokenizer.save_model, FYI:

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")
