Introducing nlp.irish!
A collaborative source of Irish language dataset descriptions (with code) for the Irish NLP community
tl;dr
- Digging through papers to track down the Irish-English parallel corpora they used was a real pain, so I built nlp.irish to document where to find them and how to process them easily.
What?
- The intention behind nlp.irish is to make NLP a little easier for folks new to working with Irish by documenting the datasets that are available, where to find them and how to load them into a pandas DataFrame.
- The site is hosted on GitHub, with the intention that it will grow through the collaborative effort of those working in Irish NLP.
Where?
- 🇮🇪 nlp.irish 🇮🇪
Why?
- Irish is a low-resource language and every piece of data out there is valuable.
Current Data
At the time of writing, five commonly used Irish-English parallel corpora have been documented, with instructions on where to find them and code to process them:
- ParaCrawl, v6
- DGT-TM, DGT-Translation Memory
- DCEP, Digital Corpus of the European Parliament
- ELRC, European Language Resource Coordination
- Tatoeba
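To give a flavour of what "processing" means here, below is a minimal sketch of loading a parallel corpus into a pandas DataFrame. It assumes the corpus has been downloaded as two line-aligned plain-text files, one sentence per line, which is a common distribution format but not necessarily the one each dataset above uses. The function name `load_parallel_corpus` and the file paths are hypothetical and not taken from the nlp.irish code.

```python
import pandas as pd


def load_parallel_corpus(ga_path, en_path):
    """Load two line-aligned text files into a DataFrame with 'ga' and 'en' columns.

    Assumption: line i of the Irish file is the translation of line i
    of the English file.
    """
    with open(ga_path, encoding="utf-8") as f:
        ga_lines = [line.rstrip("\n") for line in f]
    with open(en_path, encoding="utf-8") as f:
        en_lines = [line.rstrip("\n") for line in f]

    # Misaligned files would silently pair the wrong sentences, so fail early.
    if len(ga_lines) != len(en_lines):
        raise ValueError(
            f"Line counts differ: {len(ga_lines)} (ga) vs {len(en_lines)} (en)"
        )

    df = pd.DataFrame({"ga": ga_lines, "en": en_lines})

    # Light cleaning: trim whitespace, drop empty pairs and exact duplicates.
    df["ga"] = df["ga"].str.strip()
    df["en"] = df["en"].str.strip()
    df = df[(df["ga"] != "") & (df["en"] != "")]
    return df.drop_duplicates().reset_index(drop=True)
```

Calling `load_parallel_corpus("corpus.ga", "corpus.en")` (placeholder file names) would return one row per sentence pair. Corpora shipped in other formats, such as TMX, need a different parsing step before they reach this shape.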
Contributing
- Contributing is as easy as submitting a pull request on GitHub. Alternatively, you can find me on Twitter at @mcgenergy and I can help update the site with your contribution.