Introducing nlp.irish!

A collaborative source of Irish language dataset descriptions (with code) for the Irish NLP community

Jun 12, 2020 • Morgan McGuire • 1 min read

irish translation nmt mt nlp

tl;dr

Looking through papers to track down the Irish-English parallel corpora they used was a real pain, so I built nlp.irish to document where to find them and how to process them easily.

What?

The intention behind nlp.irish is to make NLP a little easier for folks new to working with Irish by documenting the datasets that are available, where to find them, and how to load them into a pandas DataFrame.
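To give a flavour of what "loading into pandas" can look like, here is a minimal sketch. It assumes a corpus that ships as two line-aligned plain-text files, one English and one Irish; the `load_parallel` helper, the `en`/`ga` column names and the sample sentences are made up for illustration and are not taken from the site.

```python
import io

import pandas as pd


def load_parallel(en_lines, ga_lines):
    """Pair line-aligned English and Irish sentences in a DataFrame.

    Hypothetical helper: both arguments are iterables of lines, such as
    open file handles, where line i of one is the translation of line i
    of the other.
    """
    en = [line.strip() for line in en_lines]
    ga = [line.strip() for line in ga_lines]
    if len(en) != len(ga):
        raise ValueError(f"Files are not aligned: {len(en)} vs {len(ga)} lines")
    df = pd.DataFrame({"en": en, "ga": ga})
    # Drop pairs where either side is empty after stripping whitespace
    df = df[(df["en"] != "") & (df["ga"] != "")].reset_index(drop=True)
    return df


# In-memory stand-ins for two aligned files (illustrative data only)
en_file = io.StringIO("Hello\nThank you\n")
ga_file = io.StringIO("Dia duit\nGo raibh maith agat\n")

df = load_parallel(en_file, ga_file)
print(df)
```

With real files you would pass two open file handles instead of the `StringIO` objects. Each corpus has its own format, so treat this as a starting point only.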

The site is hosted on GitHub, with the intention that it will grow through the collaborative effort of those working in Irish NLP.

Where?

🇮🇪 nlp.irish 🇮🇪

Why?

Irish is a low-resource language and every piece of data out there is valuable.

Current Data

At the time of writing, five commonly used Irish-English parallel corpora have been documented, with instructions on where to find them and code to process them:
- ParaCrawl, v6
- DGT-TM, DGT-Translation Memory
- DCEP, Digital Corpus of the European Parliament
- ELRC, European Language Resource Coordination
- Tatoeba
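
Parallel corpora are often distributed as TMX (Translation Memory eXchange) files, so a rough sketch of pulling sentence pairs out of one may be useful. This is an illustration under assumptions, not the code from the site: the `tmx_pairs` helper and the sample document are hypothetical, and real files can be large enough that a streaming parser is a better fit.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_pairs(tmx_text, src="en", tgt="ga"):
    """Return (source, target) sentence pairs from a TMX document string.

    Hypothetical helper: language codes are matched on their first two
    letters, so "EN-GB" counts as "en" and "GA-IE" as "ga".
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned segment
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        # Keep only units that contain both languages
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


# Tiny made-up TMX document for illustration
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en"><seg>Welcome</seg></tuv>
  <tuv xml:lang="ga"><seg>Fáilte</seg></tuv>
</tu>
</body></tmx>"""

pairs = tmx_pairs(SAMPLE_TMX)
print(pairs)
```

The resulting list of tuples can be passed to `pandas.DataFrame(pairs, columns=["en", "ga"])` if you want it in a DataFrame.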

Contributing

Contributing is as easy as submitting a pull request on GitHub. Alternatively, you can find me on Twitter at @mcgenergy and I can help update the site with your contribution.