Comprehensive Language Model Fine Tuning, Part 1: 🤗 Datasets library [Updated]
Get your data ready to train with the 🤗 Datasets library, plus Datasets implementation tips and tricks
This post has been updated to show how to use HuggingFace's `normalizers` functions for your text pre-processing
In this post, I'll cover the following using the HuggingFace Datasets library:
- Loading data, single or multiple files, csv, txt or dataframes, train/test splits
- Processing data with 11 text processing functions
- Tokenizing data for use with MobileBERT
- Saving processed data to disk
- Datasets tips and tricks along the way
Note: Click the colab button to open this notebook in Google Colab and run it end to end. This script was written with Transformers 3.3.1, Datasets 1.1 and Pytorch 1.6
I would love to hear your feedback, what could have been written better or clearer, let me know what you think on twitter: @mcgenergy
!transformers-cli env
HuggingFace Datasets Library
Why Should I use this "Datasets" library?
Let's see what the docs have to say:
> - Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
> - Lightweight and fast with a transparent and pythonic API
> - Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
> - Smart caching: never wait for your data to process several times
> - 🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

You can browse the full set of datasets with the live 🤗 Datasets viewer
My fav
For me personally, I am irrationally fond of this library. It just has so many useful features for handling your text data! I have really enjoyed the speed of data processing, and the caching means that running your processing a second time is lightning fast! I've spent about 6 weeks working with it and I feel I've only scratched the surface of what it can do in some areas.
So, huge kudos to the team working on Datasets, the library and docs are now really great! But enough of what I think, let's get stuck into some data processing, woop woop!
Let's Go 🚦
Let's start our guide to using the Datasets library to get your data ready to train. Note that a couple of the examples in this post are taken from the 🤗 Datasets docs, because "why fix it if it ain't broke!".
To start, let's install the library with an easy-to-remember pip install:
!pip install datasets --upgrade
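The cells below lean on a handful of imports; roughly the following (my best guess at a consolidated list — the `emoji` and `unidecode` packages used by the processing functions further down also need a pip install):
import html
import re
import unicodedata

import numpy as np
import emoji
import unidecode

import datasets
from datasets import load_dataset, load_from_disk, ReadInstruction
from transformers import AutoTokenizer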
# Single file
dataset = load_dataset('text', data_files='my_file.txt')
# Multiple files
dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})
Alternatively we can split a single file ourselves. Let's grab some Shakespeare text from Andrej Karpathy. Because this is a single file, let's do an 80/20 train/test split.
# collapse-hide
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
We can see that after loading, this dataset contains a `DatasetDict` with a single key called `train`, which in turn holds a `Dataset` object with a single column called `text` and 32,777 rows of text:
#collapse-hide
full_ds = datasets.load_dataset('text', data_files='input.txt')
full_ds
We'll have to index into the dictionary with the `train` key and the name of the column we'd like to inspect, `text`:
full_ds['train'][:10]['text']
Tip: You can pass a `cache_dir` when loading a dataset if the default cache in your root directory has limited disk space. For example, when processing large files on Kaggle your working directory has a 5GB limit, while `../../tmp` has a much higher limit which you can use for your active session.
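For example, on Kaggle you could point the cache somewhere roomier like so:
# send the Datasets cache to a location with more disk space
dataset = load_dataset('text', data_files='my_file.txt', cache_dir='../../tmp')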
Loading only a small section of our data file
If we only want to take a small part of the dataset to enable us to develop rapidly, we can specify the number of rows we would like to load; let's take 400 rows for example. Here we use the `ReadInstruction` method; have a look through the docs for even more interesting ways to use it.
mini_ds = load_dataset('text', data_files='input.txt', split=ReadInstruction('train', from_=0, to=400, unit='abs'))
mini_ds
Since this is a single block of text, let's create an 80/20 train/test split for ourselves by specifying a `split` when loading the data, like so: `split=['train[:80%]']`. There are additional useful examples of splits, such as k-fold cross validation, in the docs here (a sketch follows the next code cell).
train_ds = datasets.load_dataset('text', data_files='input.txt', split=['train[:80%]'])[0]
val_ds = datasets.load_dataset('text', data_files='input.txt', split=['train[80%:]'])[0]
train_ds, val_ds
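For instance, the k-fold idea mentioned above can be sketched with the same percent-slicing syntax (the 20% fold size here is just an illustration):
# 5 validation folds of 20% each; '+' concatenates slices for the matching training folds
val_folds = datasets.load_dataset('text', data_files='input.txt',
                                  split=[f'train[{k}%:{k+20}%]' for k in range(0, 100, 20)])
train_folds = datasets.load_dataset('text', data_files='input.txt',
                                    split=[f'train[:{k}%]+train[{k+20}%:]' for k in range(0, 100, 20)])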
We can also grab a random subset of rows with `.select()`, which takes a list of indices:
# pick 50 random row indices and select just those rows
r = np.random.randint(0, len(full_ds['train']), 50).tolist()
rand_dataset = full_ds['train'].select(r)
rand_dataset
This covers some typical ways one might want to load data, however there are many more options to explore, including loading from pandas dataframes and creating your own loading script; see the docs for more.
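As a quick sketch of the dataframe route (the dataframe here is just a made-up illustration):
import pandas as pd
from datasets import Dataset

# hypothetical dataframe with a single "text" column
df = pd.DataFrame({'text': ['first citizen:', 'before we proceed any further, hear me speak.']})
df_ds = Dataset.from_pandas(df)
df_ds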
Processing our Data
[UPDATE] See the next section below for how to use the `normalizers` from the HuggingFace `tokenizers` library to do some of this pre-processing even faster!
Now we have data loaded, let's take a look at some processing options. `.map()` will be the main tool we'll use to apply processing functions to our text. Note there are additional modifications you can make, including shuffling and sorting with `.shuffle()` and `.sort()` respectively, but I'll leave those to you to explore in the docs.
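As a quick sketch (using the `train_ds` split loaded above):
# shuffle the rows with a fixed seed for reproducibility
shuffled_ds = train_ds.shuffle(seed=42)
# sort rows by the 'text' column
sorted_ds = train_ds.sort('text')
shuffled_ds['text'][:3], sorted_ds['text'][:3]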
The `map` Function
`map` applies a function to our dataset. Below you can see how to lowercase our data by passing the `lower_case` function to `map`. When applying `map` you can choose to feed your function a batch of items (with `batched=True`) or a single item. You can also adjust the batch size; the default is `1000`. Feeding batches can be handy when using functions like tokenizers that can efficiently process batches. Note that the structure of the processing functions below lets them handle either a batch of examples or a single example.
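The functions in this post lean on a small `_listify` helper that wraps a lone string in a list so the same code path handles both batched and non-batched examples; it isn't shown in the notebook, so here is an assumed minimal implementation:
# assumed helper: wrap a single string in a list, pass lists through unchanged
def _listify(o):
    return o if isinstance(o, list) else [o]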
def lower_case(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(e.lower())
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}
train_ds = train_ds.map(lower_case, batched=True)
print(' '.join(train_ds['text'][200:203]))
#collapse-hide
'''
Below are a selection of often useful processing functions to apply to your text.
As currently written, these functions require that your text column in your dataset
is called "text"
The functions are written to be able to deal with either a batch of samples
being passed or a single sample being passed.
Most pre-processing functions are taken from the covid-twitter-bert processing file, here:
https://github.com/digitalepidemiologylab/covid-twitter-bert/blob/d5a87550bb9d2424672d1ea56c84786f462321a3/utils/preprocess.py
or else from fastai's processing rules here:
https://docs.fast.ai/text.core#Preprocessing-rules
'''
# compile regexes
username_regex = re.compile(r'(^|[^@\w])@(\w{1,15})\b')
url_regex = re.compile(r'((www\.[^\s]+)|(https?://[^\s]+)|(http?://[^\s]+))')
control_char_regex = re.compile(r'[\r\n\t]+')
# Get unk character from your tokenizer of choice
tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased')
unk = tokenizer.special_tokens_map['unk_token']
# processing functions
def standardise_punc(example):
    # map curly quotes, the acute accent and dashes to plain ascii equivalents
    transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´""–-", u"'''\"\"--")])
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(e.translate(transl_table))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_control_char(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(re.sub(control_char_regex, ' ', e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_remaining_control_chars(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(''.join(ch for ch in e if unicodedata.category(ch)[0] != 'C'))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_multi_space(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(' '.join(e.split()))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_accented_characters(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(unidecode.unidecode(e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_unicode_symbols(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        # drop characters in the "Symbol, other" (So) unicode category
        tmp_ls.append(''.join(ch for ch in e if unicodedata.category(ch) != 'So'))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def lower_case(example):
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(e.lower())
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}
def replace_usernames(example):
    filler, tmp_ls = '<user>', []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        occ = e.count('@')
        for _ in range(occ):
            e = e.replace('@<user>', f'{filler}')
            e = re.sub(username_regex, filler, e)  # replace other user handles by filler
            e = e.replace(filler, f' {filler} ')  # add spaces around the filler
            e = ' '.join(e.split())  # and remove double spaces again
        tmp_ls.append(e)
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def replace_urls(example):
    filler, tmp_ls = '<url>', []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        occ = e.count('www.') + e.count('http:') + e.count('https:')
        for _ in range(occ):
            e = re.sub(url_regex, filler, e)  # replace urls by filler
            e = e.replace(filler, f' {filler} ')  # add spaces around the filler
            e = ' '.join(e.split())  # and remove double spaces again
        tmp_ls.append(e)
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def asciify_emojis(example):
    """
    Converts emojis into text aliases. E.g. 👍 becomes :thumbs_up:
    For a full list of text aliases see: https://www.webfx.com/tools/emoji-cheat-sheet/
    """
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append(emoji.demojize(e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def fix_html(example):
    "From fastai: 'Fix messy things we've seen in documents'"
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        e = e.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace('nbsp;', ' ').replace(
            '#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace('<br />', "\n").replace(
            '\\"', '"').replace('<unk>', unk).replace(' @.@ ', '.').replace(' @-@ ', '-').replace('...', ' …')
        tmp_ls.append(html.unescape(e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}
Tip: To keep your code a little cleaner you could compose your processing functions together, so that you only have to apply `map` once instead of calling it multiple times. In the example below I use the `compose` function from the `fastcore` library.
from fastcore.utils import compose  # compose is re-exported here by fastcore

# Let's add "yo!" to the beginning of each of our items
def add_yo(example):
    '''Add "yo! " to each example'''
    tmp_ls = []
    example['text'] = _listify(example['text'])
    for e in example['text']:
        tmp_ls.append('yo! ' + e)
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

# Compose our lower_case and add_yo functions
my_processing_funcs = compose(*[lower_case, add_yo])

# Apply both functions with map
train_ds = train_ds.map(my_processing_funcs, batched=True)

# We have lowercased and added "yo!" to each item in a single call to map!
train_ds['text'][200:203]
#collapse-hide
'''
Do processing of the train and validation set
'''
do_batched = True
train_ds = train_ds.map(fix_html, batched=do_batched)
train_ds = train_ds.map(lower_case, batched=do_batched)
train_ds = train_ds.map(standardise_punc, batched=do_batched)
train_ds = train_ds.map(remove_control_char, batched=do_batched)
train_ds = train_ds.map(remove_remaining_control_chars, batched=do_batched)
train_ds = train_ds.map(remove_multi_space, batched=do_batched)
train_ds = train_ds.map(remove_accented_characters, batched=do_batched)
train_ds = train_ds.map(remove_unicode_symbols, batched=do_batched)
train_ds = train_ds.map(replace_usernames, batched=do_batched)
train_ds = train_ds.map(replace_urls, batched=do_batched)
train_ds = train_ds.map(asciify_emojis, batched=do_batched) # 3-4x slower than the others
val_ds = val_ds.map(fix_html, batched=do_batched)
val_ds = val_ds.map(lower_case, batched=do_batched)
val_ds = val_ds.map(standardise_punc, batched=do_batched)
val_ds = val_ds.map(remove_control_char, batched=do_batched)
val_ds = val_ds.map(remove_remaining_control_chars, batched=do_batched)
val_ds = val_ds.map(remove_multi_space, batched=do_batched)
val_ds = val_ds.map(remove_accented_characters, batched=do_batched)
val_ds = val_ds.map(remove_unicode_symbols, batched=do_batched)
val_ds = val_ds.map(replace_usernames, batched=do_batched)
val_ds = val_ds.map(replace_urls, batched=do_batched)
val_ds = val_ds.map(asciify_emojis, batched=do_batched)
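Tip: if your installed version of Datasets supports the `num_proc` argument to `map` (more recent releases do), you can also parallelise these calls across processes, along the lines of:
# run the mapping function in several worker processes (adjust to your CPU count)
train_ds = train_ds.map(fix_html, batched=do_batched, num_proc=4)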
[UPDATE] Using `normalizers` from the `tokenizers` library for your preprocessing
The day I originally published this article, Sylvain Gugger at HuggingFace also tweeted that the `tokenizers` library had been updated, including updated docs:

> Well there go half of the processing functions I mentioned. Stoked about the Normalizers and Pre-Tokenizers though, was one of the things I thought was missing (or maybe it was there and I missed it) https://t.co/gSheva9I2p
>
> — Morgan McGuire (@mcgenergy) October 9, 2020
Available Normalizers
As of writing, the normalizers available according to the docs are:
- `NFD`, `NFKD`, `NFC`: NFD, NFKD and NFC unicode normalization algorithms **
- `Lowercase`: Replaces all uppercase to lowercase
- `Strip`: Removes all whitespace characters on the specified sides (left, right or both) of the input
- `StripAccents`: Removes all accent symbols in unicode (to be used with NFD for consistency)
- `Replace`: Replaces a custom string or regexp and changes it with given content

** I'm not familiar with these normalizers, but if it is any help, the documentation uses `NFD` in their BERT tokenizer example
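There's no `Replace` example in this post, but based on the description in the docs it should look roughly like this (the pattern and replacement below are just an illustration):
from tokenizers.normalizers import Replace

# replace a custom string with the given content
rep = Replace('...', '…')
rep.normalize_str('to be continued...')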
Applying a Normalizer to a string
We can apply a normalizer to a string by instantiating it and then calling `.normalize_str`, like so:
from tokenizers.normalizers import Lowercase
lc = Lowercase()
lc.normalize_str('ho wy iii KKKK')
tmp = train_ds.map(lambda e: {'text' : lc.normalize_str(e['text'])}, batched=False)
import tokenizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
# Compose our normalizers
normalizer = tokenizers.normalizers.Sequence([Lowercase(), NFD(), StripAccents()])
# Apply to string (example shamelessly copied from the tokenizers docs)
print(normalizer.normalize_str("Héllò hôw are ü?"))
# Apply to Dataset
tmp = train_ds.map(lambda e: {'text' : normalizer.normalize_str(e['text'])}, batched=False)
We can even attach this normalizer to our tokenizer!
tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased')
tokenizer.normalizer = normalizer
# note: with a "fast" (Rust-backed) tokenizer, to have the normalizer run as part of
# tokenization itself you would likely set it on the backend instead, e.g.
# tokenizer.backend_tokenizer.normalizer = normalizer
tokenizer.normalizer.normalize_str("Héllò hôw are ü?")
After processing our data with all the pre-processing/normalizer functions above (click the button to show all funcs used) we're now ready for tokenization!
Tokenization
Combining HuggingFace's "Fast" tokenizers with the Datasets library is a real dream, the speed is something else! Here we'll instantiate a tokenizer compatible with the MobileBERT model.
The `AutoTokenizer` class makes loading tokenizers super simple, removing the need to import the specific tokenizer class for each different model you use. `AutoModel` is the equivalent for model loading and we'll use that in the next part of this series.
tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased', return_dict=True)
`lambda` and `map`
Here we use a `lambda` function with `map` to apply the tokenizer to the train and validation sets. With HuggingFace tokenizers we have options such as adding padding, truncating the text, setting a `max_length` and more. We use `batched=True` to take full advantage of our tokenizer's ability to handle batches.
Tip: When training larger transformer models, I found truncating the training text and setting a max length to be really useful. It's worth experimenting with: if your text has very long sequences then truncation might degrade performance to an unacceptable level. In my case I was dealing with tweet data so I knew I wasn't chopping too much from my texts. I didn't truncate the validation text, as the evaluation phase is generally less memory intensive than the training phase, so the model could handle the full text. When pursuing this strategy you'll want to consider whether to validate against the full text or the truncated text.
Given the above, let's do our tokenization like so:
train_ds = train_ds.map(lambda e: tokenizer(e['text'], padding=False, truncation=True, max_length=200), batched=True)
val_ds = val_ds.map(lambda e: tokenizer(e['text'], padding=True, truncation=False), batched=True)
After tokenization, our tokenized data are all in lists. To be able to use them in our model we need to encode the data as either PyTorch or TensorFlow tensors. Here we convert the relevant columns to PyTorch tensors; we could set `type="tensorflow"` (or `"tf"`) if we were using TensorFlow instead. You can see we also specify only a subset of our columns, as that is all that is needed for training our model.
train_ds.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
val_ds.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
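As a quick sanity check, indexing into the formatted dataset should now return tensors for the columns we kept, something like:
# after set_format, the selected columns come back as torch tensors
sample = train_ds[0]
print(type(sample['input_ids']), sample['input_ids'].shape)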
Our data has been loaded, processed, tokenized and formatted, so you're good to go for training, right? Well, one more thing you might want to think about before jumping into your modelling is whether you need to use your data on different machines...
Saving and Loading Data
If you typically only use one machine consistently, there is probably no need to save your data, as Datasets keeps a cache of everything you have done to it.
However, if your processing takes a significant amount of time and you need to move your data between machines, for example if you are using Kaggle notebooks, then I recommend saving your data for easy loading like so:
train_ds.save_to_disk('20M_processed_tokenized_pt_train_dataset')
You can then easily load your data again like so:
train_ds = load_from_disk('20M_processed_tokenized_pt_train_dataset')
Ready to Train 🎉
Now that our data is loaded, processed, tokenized and formatted we are ready to train! Check out the next part in this series to see how we fine-tune our Transformer Language Model!
Coming Up in Post 2: Training your Language Model Transformer with 🤗 Trainer
Coming up in Post 2:
- Getting your data collator
- Setting up all Training Arguments
- Make sure Weights and Biases is tracking what you need
- Training a MobileBERT model
- Training on TPUs
- Saving your model
Thanks for Reading This Far 🙏
As always, I would love to hear your feedback, what could have been written better or clearer, you can find me on twitter: @mcgenergy