Text Data Cleanup - Dynamic Embedding Visualisation
Identify noisy text in a Machine Translation dataset through dynamic text embedding visualisation
In order for Machine Translation to be useful in the real world, we should strive to train it on high-quality translation data. This is doubly true for lower-resource languages such as Irish, where clean training data is relatively limited. In this article we will try to identify and remove clusters of dirty/noisy samples in our parallel dataset. The stages we will go through are:
- Generate embeddings from a pre-trained multi-lingual model, XLM-RoBERTa
- Visualise these embeddings using a dimensionality reduction technique, UMAP
- Identify clusters in a sample of the data that seem to be of low translation quality via Bokeh
- Remove similar samples from the main dataset via nmslib
A preview of the clusters we will find along the way: Arabic ⁉️, Lighting 💡, Website Footers 📇, Clothing and Jewellery 👑
Text Embeddings for Fun and Profit
Text embeddings can be incredibly useful for a variety of tasks. A couple of interesting examples include @er214's post and notebook demonstrating how they used GloVe word embeddings to create a spell-checker for their dataset. They also created a "pretentiousness" embedding to score news outlets 🤣. These examples used word embeddings, whereas we will generate embeddings for chunks of text, but the concept remains the same.
Pre-Trained XLM-RoBERTa Model
We derive our text embedding by passing a text sample through XLM-RoBERTa Large (XLM-R), a multilingual model which should work reasonably well for both Irish and English. For our embeddings we will extract the values in the final hidden layer. Note that some research done on BERT shows that by concatenating the final 4 layers of the model one gets even richer contextual embeddings. In the interest of simplicity we will stick to the final layer only for the moment, although using additional layers would be an interesting area of exploration!
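The notebook builds its own dataloaders for this, but as a rough sketch of the idea using the Hugging Face transformers library (the mean-pooling over non-padding tokens here is my assumption, not necessarily what the notebook does):

```python
# Minimal sketch: last-hidden-layer embeddings from XLM-R via Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large").eval()

texts = ["Dia duit, conas atá tú?", "Hello, how are you?"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)                      # out.last_hidden_state: (bs, seq_len, 1024)

# Mask out padding tokens, then mean-pool to one 1024-d vector per text (an assumption)
mask = batch["attention_mask"].unsqueeze(-1)  # (bs, seq_len, 1)
embs = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embs.shape)                             # torch.Size([2, 1024])
```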
Dimension Reduction with UMAP
Dimensionality reduction techniques attempt to find the latent features in your data. Uniform Manifold Approximation and Projection or "UMAP" is a dimension reduction technique published in 2018 by McInnes and Healy that can be used for visualisation of high dimensional data, in a similar manner to t-SNE. Its main advantage is that it is fast (faster than t-SNE when dealing with large datasets) and also better maintains the global structure of the data.
UMAP also has beautifully well-written documentation. To learn more, McInnes gives a helpful overview of UMAP at the SciPy 2018 conference, and Google's PAIR Code team has an excellent explanation complete with many different visualisations that aid in understanding.
Similarity Search
In order to calculate the similarity between our embeddings there are a multitude of tools and techniques we can try. In this article we will demonstrate both Scikit-Learn's `cosine_similarity` and nmslib's functionality. See the end of this article for the similarity search resources I found useful.
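As a quick taste of the Scikit-Learn route, here is a tiny sketch (with made-up arrays) of scoring one embedding against many with `cosine_similarity`:

```python
# Sketch: cosine similarity between one query embedding and a matrix of embeddings.
# The arrays here are random stand-ins for real XLM-R embeddings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

all_embs = np.random.rand(1000, 1024)   # stand-in for dataset embeddings
query_emb = np.random.rand(1, 1024)     # stand-in for one selected embedding

sims = cosine_similarity(query_emb, all_embs)[0]   # shape (1000,)
closest = np.argsort(-sims)[:5]                    # indices of the 5 most similar samples
print(closest, sims[closest])
```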
ParaCrawl Data
First let's load our raw ParaCrawl data. This data has been crawled from the internet and contains a few different types of artifacts, including:
- Non-latin scripts
- Non-alphanumeric characters (e.g. '©', '³', 'º')
- Text samples that are unlikely to have been translated to Irish by a human, including text from
- Porn sites
- Lighting equipment sites (who knows why?)
Let's see how much we can find by turning our samples into embeddings and using dimensionality reduction to visualise them. For fun, we'll label texts that contain "sex", "lighting" and Cyrillic characters like "и", "з" or "л" to see if they get clustered together when we visualise our embeddings later. We will also lowercase our entire dataset; many of the legal texts here are written fully in uppercase, but right now we care more about the content of the text than its style.
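As a rough idea of what this labelling could look like (the DataFrame, column name and labelling rules below are illustrative, not the notebook's exact code):

```python
# Sketch: crude labelling of samples we want to track in the visualisation.
import re
import pandas as pd

df = pd.DataFrame({"en": ["LED LIGHTING FIXTURES FOR SALE",
                          "Привет мир",
                          "The quick brown fox"]})   # stand-in for the real dataset

def label_text(text: str) -> str:
    text = text.lower()
    if re.search(r"[а-яё]", text):      # any Cyrillic character, e.g. "и", "з", "л"
        return "cyrillic"
    if "sex" in text:
        return "sex"
    if "lighting" in text:
        return "lighting"
    return "other"

df["en"] = df["en"].str.lower()          # lowercase the whole dataset
df["label"] = df["en"].map(label_text)
print(df["label"].value_counts())
```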
After creating our dataloaders you can see the English sample below, including the special tokens `<s>` and `</s>` used by XLM-R to denote the start and end of the sequence. You can also see all the padding needed for this sample.
samp_dls = get_dls(samp_texts, bs, sl, show=True)
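If you want to see where those tokens come from without the notebook's dataloader helpers, here is a small standalone sketch of the XLM-R tokenizer (the example sentences are arbitrary and the commented output is approximate):

```python
# Sketch: the XLM-R tokenizer adds <s> / </s> and pads shorter sequences.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-large")
batch = tok(["hello world", "a much longer english sentence to force some padding"],
            padding=True)
print(tok.convert_ids_to_tokens(batch["input_ids"][0]))
# roughly: ['<s>', '▁hello', '▁world', '</s>', '<pad>', '<pad>', ...]
```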
Now we're ready to retrieve our embeddings from the pre-trained multi-lingual model. We do this by simply passing our samples to the model and saving the output activations from the last layer. From this we can generate an embedding matrix of size (40000, 1024), which we can then pass to our dimensionality reduction algorithm.
Processing 40k samples with UMAP takes about 5 minutes. After a little testing with the `n_neighbors` and `min_dist` parameters I actually found the defaults work quite well, although it's always worth playing around with them and with the distance metric used for your particular dataset.
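For reference, a hedged sketch of what tweaking those parameters looks like; the embedding matrix here is random stand-in data and the values are just the library defaults plus a cosine metric to try:

```python
# Sketch: trying UMAP settings explicitly; the values shown are illustrative.
import numpy as np
import umap

embs = np.random.rand(2000, 1024).astype("float32")   # stand-in for the XLM-R embeddings

reducer = umap.UMAP(
    n_neighbors=15,      # default; larger values emphasise global structure
    min_dist=0.1,        # default; smaller values pack points more tightly
    metric="cosine",     # worth trying alongside the default "euclidean"
    random_state=42,
)
emb_2d = reducer.fit_transform(embs)                   # shape (2000, 2)
```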
Scaling your Embeddings
One thing to consider is whether you should scale your embeddings. The embeddings used here had a mean of 0 and a standard deviation of 0.5; normalising them to (0, 1) didn't seem to have much of an impact on the UMAP visualisation, so it is not done here. Scikit-Learn has a wide variety of scaling functions if you do need to scale your data.
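If you do decide to scale, a minimal sketch with Scikit-Learn's MinMaxScaler might look like this (random stand-in data):

```python
# Sketch: scaling embeddings to (0, 1) with Scikit-Learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

embs = np.random.randn(1000, 1024) * 0.5               # stand-in: mean ~0, std ~0.5
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(embs)
print(scaled.min(), scaled.max())                      # 0.0 1.0
```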
Remove similar samples from the entire dataset
We will remove items by:
- Visually identifying similar noisy embeddings
- Taking the average of these embeddings
- Calculating a distance between this average embedding and all of the embeddings in our entire dataset
- Removing embeddings that are within a certain distance of the average embedding
First we reduce the embedding space down from 1024 to 2 dimensions using UMAP, then we plot the result using the Bokeh visualisation library, which lets us dynamically select different regions of interest and interactively explore our dataset.
import umap
from bokeh.io import show
samp_mapper = umap.UMAP(random_state=42).fit(samp_embs.cpu().numpy())
nb_url = "http://0.0.0.0:8888"  # optional, depending on where you are running the notebook
show(plot_emb_app, notebook_url=nb_url)  # plot_emb_app: the interactive Bokeh app defined in the notebook
Below we can see how our data has been clustered together and, more interestingly, how we can select and extract datapoints of interest directly from the plot into a list for further processing with pandas, numpy etc.
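Since plot_emb_app appears to be a custom Bokeh app defined in the notebook, here is a minimal standalone sketch of just the plotting side: a scatter of the 2-D UMAP output with hover tooltips and a lasso tool. The data here is random stand-in data, and pulling the lasso selection back into Python additionally requires wrapping this in a Bokeh server app as the notebook does.

```python
# Sketch: a static version of the interactive embedding plot with hover and lasso tools.
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

emb_2d = np.random.rand(500, 2)                        # stand-in for the UMAP output
texts = [f"sample {i}" for i in range(500)]

source = ColumnDataSource(data=dict(x=emb_2d[:, 0], y=emb_2d[:, 1], text=texts))
p = figure(tools="pan,wheel_zoom,lasso_select,reset", title="UMAP projection")
p.circle("x", "y", source=source, size=4, alpha=0.6)
p.add_tools(HoverTool(tooltips=[("text", "@text")]))

output_notebook()
show(p)
```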
Can we identify suspect looking clusters of data? Yes!
We can see that the orange "islands" we have labelled all contain text related to "lighting", "LED", "lamps" etc., some of which is not even in English.
The sparse cluster of green and blue points and their neighbours is also full of language related to pornography.
We don't see as many Cyrillic points, however this was also the least common of our labels, comprising only 0.07% of our labels.
Other "islands" that can identified by hovering over them include:
- Arabic texts
- Website footers
- Text from jewellery sites (e.g. Pandora) and also clothing sites
It Worked!
Below you can see additional Arabic texts we have discovered in our main dataset:
I am fairly confident that none of the text in these islands contains valuable translations; they are likely the result of automated translations of suspect quality that we would like to remove.
Remove similar samples from the full dataset
To remove similar samples we will calculate a distance metric between each of our selected embeddings and each of the embeddings in the full dataset. Alternatively, we could calculate the "average embedding" of all of our selected datapoints and then calculate the distance between this average and each of the embeddings in the main dataset.
From experimentation I found that the first option is more effective at identifying more of the noisy data we are looking for. In addition, because our nmslib index is super fast at retrieval, there isn't any significant overhead to this approach over using the average embedding.
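To make the first option concrete, here is a rough sketch with stand-in arrays and an illustrative threshold; the average-embedding variant would simply replace noisy_embs with its row-wise mean:

```python
# Sketch: drop rows whose embedding is too close to *any* selected noisy embedding.
# Array sizes and the 0.9 threshold are illustrative, not the notebook's values.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

full_embs = np.random.rand(10_000, 1024)     # stand-in for the full-dataset embeddings
noisy_embs = np.random.rand(25, 1024)        # stand-in for the lasso-selected embeddings

sims = cosine_similarity(full_embs, noisy_embs)        # (10_000, 25)
too_close = sims.max(axis=1) > 0.9                     # similar to at least one noisy sample
clean_idx = np.where(~too_close)[0]
print(f"keeping {len(clean_idx)} of {len(full_embs)} samples")
```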
First of all, we'll need to generate embeddings for all of our 780k text samples. This might take a little while depending on the size of your dataset so kick off the extraction and grab a coffee.
We could stop at this point and continue to loop through selecting new datapoints of interest and removing them from our full dataset, but for fun let's look at another way to do similarity search, using nmslib.
Once we have all of our embeddings we can create our index with nmslib. This is the most time-consuming part of the process; it took 47 minutes for the index to be created in this example, although there are other nmslib settings that can reduce this to ~15 minutes.
We create the index for our entire dataset only once. Because we will likely have multiple queries, we do not want to re-create the index each time, as doing so would really slow down how quickly we can identify and remove low-quality data.
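A hedged sketch of what that nmslib workflow can look like (stand-in arrays and illustrative parameters; the notebook's own settings may differ):

```python
# Sketch: build an HNSW index over the full embedding matrix once,
# then query it with each selected noisy embedding.
import numpy as np
import nmslib

full_embs = np.random.rand(10_000, 1024).astype("float32")   # stand-in for the dataset
noisy_embs = np.random.rand(25, 1024).astype("float32")      # stand-in for selected points

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(full_embs)
index.createIndex({"post": 2}, print_progress=True)          # the slow, one-off step

# For each noisy embedding, fetch its 100 nearest neighbours in the full dataset
results = index.knnQueryBatch(noisy_embs, k=100, num_threads=4)
to_remove = {i for ids, dists in results for i in ids}
print(f"{len(to_remove)} candidate rows to drop")
```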
Samples retrieved for each of the selected clusters: Arabic, Lighting, Website Footers, Clothing and Jewellery
Summary
Hopefully this gives you a sense of how you can explore and clean up large, noisy text datasets. You can open this notebook on GitHub and try it on your own text dataset.
As always, I would love to hear your feedback on what could have been written better or more clearly; you can find me on Twitter: @mcgenergy
Appendix
Similarity Search Resources
Below are the resources that helped give me an overview of the similarity search tools and methods that are out there:
- nmslib is an "efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces". In testing, Ben Frederickson found that between nmslib, FAISS (Facebook) and annoy (Spotify), nmslib was the fastest nearest-neighbours library for CPU. Note that FAISS on GPU blazes past nmslib and annoy, but can also be difficult to set up.
- Milvus is an interesting library, open sourced in late 2019, which offers similarity search using FAISS, nmslib or annoy as well as GPU capability. If I had had more time I would have explored this further.
- ElasticSearch is yet another tool we could use to carry out the similarity search between our text embeddings.
- DAIR.ai recently posted a video called "101 Ways to Solve Search" by Pratik Bhavsar which is a great introduction to how search (including semantic search) works in general.
- models.pratik.ai is a really great flow-chart visualisation, which he tries to keep up to date, of the models and techniques available for semantic search and NLP in general.
- Spotify's annoy: @hmendonca's EDA notebook on kaggle is an excellent example showing how to use annoy to group and display similar images (faces in this case).