Text Data Cleanup - Dynamic Embedding Visualisation
Identify noisy text in a Machine Translation dataset through dynamic text embedding visualisation
In order for Machine Translation to be useful in the real world, we should strive to train it on high-quality translation data. This is doubly true for lower-resource languages such as Irish, where clean training data is relatively limited. In this article we will try to identify and remove clusters of dirty/noisy samples in our parallel dataset. The stages we will go through are:
- Generate embeddings from a pre-trained multi-lingual model, XLM-RoBERTa
- Visualise these embeddings using a dimensionality reduction technique, UMAP
- Identify clusters in a sample of the data that seem to be of low translation quality via Bokeh
- Remove similar samples from the main dataset via nmslib
A preview of the clusters we will find along the way: Arabic ⁉️, Lighting 💡, Website Footers 📇, Clothing and Jewellery 👑
Text Embeddings for Fun and Profit
Text embeddings can be incredibly useful for a variety of tasks. A couple of interesting examples include @er214's post and notebook demonstrating how they used GloVe word embeddings to create a spell-checker for their dataset. They also created a "pretentiousness" embedding to score news outlets 🤣. These examples used word embeddings, whereas we will generate embeddings for chunks of text, but the concept remains the same.
Pre-Trained XLM-RoBERTa Model
We derive our text embedding by passing a text sample through XLM-RoBERTa Large (XLM-R), a multilingual model which should work reasonably well for both Irish and English. For our embeddings we will extract the values in the final hidden layer. Note that some research done on BERT shows that by concatenating the final 4 layers of the model one gets even richer contextual embeddings. In the interest of simplicity we will stick to the final layer only for the moment, although using additional layers would be an interesting area of exploration!
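The notebook builds its own dataloaders for this, but as a rough sketch of the idea using the Hugging Face transformers library (the mean-pooling over non-padding tokens here is my assumption, not necessarily what the notebook does):

```python
# Minimal sketch: last-hidden-layer embeddings from XLM-R via Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large").eval()

texts = ["Dia duit, conas atá tú?", "Hello, how are you?"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)                      # out.last_hidden_state: (bs, seq_len, 1024)

# Mask out padding tokens, then mean-pool to one 1024-d vector per text (an assumption)
mask = batch["attention_mask"].unsqueeze(-1)  # (bs, seq_len, 1)
embs = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embs.shape)                             # torch.Size([2, 1024])
```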
Dimension Reduction with UMAP
Dimensionality reduction techniques attempt to find the latent features in your data. Uniform Manifold Approximation and Projection or "UMAP" is a dimension reduction technique published in 2018 by McInnes and Healy that can be used for visualisation of high dimensional data, in a similar manner to t-SNE. Its main advantage is that it is fast (faster than t-SNE when dealing with large datasets) and also better maintains the global structure of the data.
UMAP also has beautifully well-written documentation. To learn more, McInnes gives a helpful overview of UMAP at the SciPy 2018 conference, and Google's PAIR Code team has an excellent explanation complete with many different visualisations that aid in understanding.
Similarity Search
In order to calculate the similarity between our embeddings there are a multitude of tools and techniques we can try. In this article we will demonstrate both Scikit-Learn's `cosine_similarity` and nmslib's functionality. See the end of this article for the similarity search resources I found useful.
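As a quick taste of the Scikit-Learn route, here is a tiny sketch (with made-up arrays) of scoring one embedding against many with `cosine_similarity`:

```python
# Sketch: cosine similarity between one query embedding and a matrix of embeddings.
# The arrays here are random stand-ins for real XLM-R embeddings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

all_embs = np.random.rand(1000, 1024)   # stand-in for dataset embeddings
query_emb = np.random.rand(1, 1024)     # stand-in for one selected embedding

sims = cosine_similarity(query_emb, all_embs)[0]   # shape (1000,)
closest = np.argsort(-sims)[:5]                    # indices of the 5 most similar samples
print(closest, sims[closest])
```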
ParaCrawl Data
First let's load our raw ParaCrawl data. This data has been crawled from the internet and contains a few different types of artifacts, including:
- Non-latin scripts
- Non-alphanumeric characters (e.g. '©', '³', 'º')
- Text samples that are unlikely to have been translated to Irish by a human, including text from
- Porn sites
- Lighting equipment sites (who knows why?)
Let's see how much we can find by turning our samples into embeddings and using dimensionality reduction to visualise them. For fun, we'll label texts that contain "sex", "lighting" and Cyrillic characters like "и", "з" or "л" to see if they get clustered together when we visualise our embeddings later. We will also lowercase our entire dataset; many of the legal texts here are written fully in uppercase, but right now we care more about the content of the text than its style.
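As a rough idea of what this labelling could look like (the DataFrame, column name and labelling rules below are illustrative, not the notebook's exact code):

```python
# Sketch: crude labelling of samples we want to track in the visualisation.
import re
import pandas as pd

df = pd.DataFrame({"en": ["LED LIGHTING FIXTURES FOR SALE",
                          "Привет мир",
                          "The quick brown fox"]})   # stand-in for the real dataset

def label_text(text: str) -> str:
    text = text.lower()
    if re.search(r"[а-яё]", text):      # any Cyrillic character, e.g. "и", "з", "л"
        return "cyrillic"
    if "sex" in text:
        return "sex"
    if "lighting" in text:
        return "lighting"
    return "other"

df["en"] = df["en"].str.lower()          # lowercase the whole dataset
df["label"] = df["en"].map(label_text)
print(df["label"].value_counts())
```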
After creating our dataloaders you can see the English sample below, including the special tokens `<s>` and `</s>` used by XLM-R to denote the start and end of the sequence. You can also see all the padding needed for this sample.
samp_dls = get_dls(samp_texts, bs, sl, show=True)
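If you want to see where those tokens come from without the notebook's dataloader helpers, here is a small standalone sketch of the XLM-R tokenizer (the example sentences are arbitrary and the commented output is approximate):

```python
# Sketch: the XLM-R tokenizer adds <s> / </s> and pads shorter sequences.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-large")
batch = tok(["hello world", "a much longer english sentence to force some padding"],
            padding=True)
print(tok.convert_ids_to_tokens(batch["input_ids"][0]))
# roughly: ['<s>', '▁hello', '▁world', '</s>', '<pad>', '<pad>', ...]
```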
Now we're ready to retrieve our embeddings from the pre-trained multi-lingual model. We do this by simply passing our samples to the model and saving the output activations from the last layer. From this we can generate an embedding matrix of size (40000, 1024), which we can then pass to our dimensionality reduction algorithm.
Processing 40k samples with UMAP takes about 5 minutes. After a little testing with the `n_neighbors` and `min_dist` parameters I actually found the defaults work quite well, although it's always worth playing around with them and with the distance metric used for your particular dataset.
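For reference, a hedged sketch of what tweaking those parameters looks like; the embedding matrix here is random stand-in data and the values are just the library defaults plus a cosine metric to try:

```python
# Sketch: trying UMAP settings explicitly; the values shown are illustrative.
import numpy as np
import umap

embs = np.random.rand(2000, 1024).astype("float32")   # stand-in for the XLM-R embeddings

reducer = umap.UMAP(
    n_neighbors=15,      # default; larger values emphasise global structure
    min_dist=0.1,        # default; smaller values pack points more tightly
    metric="cosine",     # worth trying alongside the default "euclidean"
    random_state=42,
)
emb_2d = reducer.fit_transform(embs)                   # shape (2000, 2)
```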
Scaling your Embeddings
One thing to consider is whether you should scale your embeddings. The embeddings used here had a mean of 0 and a standard deviation of 0.5; normalising them to (0, 1) didn't seem to have much of an impact on the UMAP visualisation, so it is not done here. Scikit-Learn has a wide variety of scaling functions if you do need to scale your data.
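If you do decide to scale, a minimal sketch with Scikit-Learn's MinMaxScaler might look like this (random stand-in data):

```python
# Sketch: scaling embeddings to (0, 1) with Scikit-Learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

embs = np.random.randn(1000, 1024) * 0.5               # stand-in: mean ~0, std ~0.5
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(embs)
print(scaled.min(), scaled.max())                      # 0.0 1.0
```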
Remove similar samples from the entire dataset
We will remove items by:
- Visually identifying similar noisy embeddings
- Taking the average of these embeddings
- Calculating a distance between this average embedding and all of the embeddings in our entire dataset
- Removing embeddings that are within a certain distance of the average embedding
First we reduce the embedding space down from 1024 to 2 dimensions using UMAP, then we plot the result using the Bokeh visualisation library, which lets us dynamically select different regions of interest and interactively explore our dataset.
import umap
from bokeh.io import show
samp_mapper = umap.UMAP(random_state=42).fit(samp_embs.cpu().numpy())
nb_url = "http://0.0.0.0:8888"  # optional, depending on where you are running the notebook
show(plot_emb_app, notebook_url=nb_url)  # plot_emb_app: the interactive Bokeh app defined in the notebook
Below we can see how our data has been clustered together and, more interestingly, how we can select and extract datapoints of interest directly from the plot into a list for further processing with pandas, numpy etc.
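Since plot_emb_app appears to be a custom Bokeh app defined in the notebook, here is a minimal standalone sketch of just the plotting side: a scatter of the 2-D UMAP output with hover tooltips and a lasso tool. The data here is random stand-in data, and pulling the lasso selection back into Python additionally requires wrapping this in a Bokeh server app as the notebook does.

```python
# Sketch: a static version of the interactive embedding plot with hover and lasso tools.
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

emb_2d = np.random.rand(500, 2)                        # stand-in for the UMAP output
texts = [f"sample {i}" for i in range(500)]

source = ColumnDataSource(data=dict(x=emb_2d[:, 0], y=emb_2d[:, 1], text=texts))
p = figure(tools="pan,wheel_zoom,lasso_select,reset", title="UMAP projection")
p.circle("x", "y", source=source, size=4, alpha=0.6)
p.add_tools(HoverTool(tooltips=[("text", "@text")]))

output_notebook()
show(p)
```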
Can we identify suspect looking clusters of data? Yes!
We can see that the orange "islands" we have labelled all contain text related to "lighting", "LED", "lamps" etc., some of which is not even in English.
The sparse cluster of green and blue points and their neighbours is also full of language related to pornography.
We don't see as many Cyrillic points, however this was also the least common of our labels, comprising only 0.07% of our labels.
Other "islands" that can identified by hovering over them include:
- Arabic texts
- Website footers
- Text from jewellery sites (e.g. Pandora) and also clothing sites
It Worked!
Below you can see additional Arabic texts we have discovered in our main dataset:
I am fairly confident that none of the text in these islands contains valuable translations; they are likely the result of automated translations of suspect quality that we would like to remove.
Remove similar samples from the full dataset
To remove similar samples we will calculate a distance metric between each of our selected embeddings and each of the embeddings in the full dataset. Alternatively, we could calculate the "average embedding" of all of our selected datapoints and then calculate the distance between this average and each of the embeddings in the main dataset.
From experimentation I found that the first option is more effective at identifying more of the noisy data we are looking for. In addition, because our nmslib index is super fast at retrieval, there isn't any significant overhead to this approach over using the average embedding.
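To make the first option concrete, here is a rough sketch with stand-in arrays and an illustrative threshold; the average-embedding variant would simply replace noisy_embs with its row-wise mean:

```python
# Sketch: drop rows whose embedding is too close to *any* selected noisy embedding.
# Array sizes and the 0.9 threshold are illustrative, not the notebook's values.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

full_embs = np.random.rand(10_000, 1024)     # stand-in for the full-dataset embeddings
noisy_embs = np.random.rand(25, 1024)        # stand-in for the lasso-selected embeddings

sims = cosine_similarity(full_embs, noisy_embs)        # (10_000, 25)
too_close = sims.max(axis=1) > 0.9                     # similar to at least one noisy sample
clean_idx = np.where(~too_close)[0]
print(f"keeping {len(clean_idx)} of {len(full_embs)} samples")
```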
First of all, we'll need to generate embeddings for all of our 780k text samples. This might take a little while depending on the size of your dataset so kick off the extraction and grab a coffee.
We could stop at this point and continue to loop through selecting new datapoints of interest and removing them from our full dataset, but for fun let's look at another way to do similarity search, using nmslib.
Once we have all of our embeddings we can create our index with nmslib. This is the most time-consuming part of the process; it took 47 minutes for the index to be created in this example, although there are other nmslib settings that can reduce this to ~15 minutes.
We create the index for our entire dataset only once. Because we will likely have multiple queries, we do not want to re-create the index each time, as doing so would really slow down how quickly we can identify and remove low-quality data.
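A hedged sketch of what that nmslib workflow can look like (stand-in arrays and illustrative parameters; the notebook's own settings may differ):

```python
# Sketch: build an HNSW index over the full embedding matrix once,
# then query it with each selected noisy embedding.
import numpy as np
import nmslib

full_embs = np.random.rand(10_000, 1024).astype("float32")   # stand-in for the dataset
noisy_embs = np.random.rand(25, 1024).astype("float32")      # stand-in for selected points

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(full_embs)
index.createIndex({"post": 2}, print_progress=True)          # the slow, one-off step

# For each noisy embedding, fetch its 100 nearest neighbours in the full dataset
results = index.knnQueryBatch(noisy_embs, k=100, num_threads=4)
to_remove = {i for ids, dists in results for i in ids}
print(f"{len(to_remove)} candidate rows to drop")
```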
Samples retrieved for each of the selected clusters: Arabic, Lighting, Website Footers, Clothing and Jewellery
Summary
Hopefully this gives you a sense of how you can explore and clean up large, noisy text datasets. You can open this notebook on GitHub and try it on your own text dataset.
As always, I would love to hear your feedback on what could have been written better or more clearly; you can find me on Twitter: @mcgenergy
Appendix
Similarity Search Resources
Below are the resources that helped give me an overview of the similarity search tools and methods that are out there:
- nmslib is an "efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces". In testing, Ben Frederickson found that between nmslib, FAISS (Facebook) and annoy (Spotify), nmslib was the fastest nearest-neighbours library for CPU. Note that FAISS on GPU blazes past nmslib and annoy, but can also be difficult to set up.
- Milvus is an interesting library, open sourced in late 2019, which offers similarity search using FAISS, nmslib or annoy as well as GPU capability. If I had had more time I would have explored this further.
- ElasticSearch is yet another tool we could use to carry out the similarity search between our text embeddings.
- DAIR.ai recently posted a video called "101 Ways to Solve Search" by Pratik Bhavsar which is a great introduction to how search (including semantic search) works in general.
- models.pratik.ai is a really great flow-chart visualisation, which he tries to keep up to date, of the models and techniques available for semantic search and NLP in general.
- Spotify's annoy: @hmendonca's EDA notebook on kaggle is an excellent example showing how to use annoy to group and display similar images (faces in this case).