🇭🇺 Awesome NLP Resources for Hungarian

A curated list of free resources dedicated to Hungarian Natural Language Processing

Maintainers - György Orosz

Table of contents

Tools

Notations:

  • 👌 Easy to install and use

  • 🚀 Commercial-friendly license

  • 💯 Pretrained models are available or not needed

Word tokenization, sentence splitting

Morphology

PoS / Morphological taggers

Taggers / Chunkers

Pipelines with Hungarian NLP components

Syntactic parsers

Semantic analysis

Other

Language models

Word embeddings

Transformer models

Large Language models

General Multilingual Large Language models

Large Language models specifically developed for Hungarian language use-cases

LLM Benchmarks

Datasets

Corpora

  • The Hungarian National Corpus Portal lists 15 corpora with links to their main page (főoldal), concordance (kereső), registration page (regisztráció, if needed), and contact e-mail (kapcsolat).

Raw corpora

  • Hungarian Webcorpus: With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license.

  • Hungarian Webcorpus 2.0: The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words.

  • OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)

  • emLam: A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.

  • Leipzig corpora contain randomly selected sentences in the language of each corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.

  • web2corpus: Automatically created multilingual web corpus

  • CC-100: Monolingual Datasets from Web Crawl Data
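
Several of the raw corpora above are described by their word counts. As a rough illustration of how such figures can be estimated on a text sample, here is a minimal sketch. It assumes a simple regex-based notion of a "word"; the corpora listed here rely on their own tokenizers and filtering, so their published numbers will not match this method. `count_words` is a hypothetical helper, not part of any listed project.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of Unicode word characters.
# Real corpus statistics depend on each project's own tokenizer.
WORD_RE = re.compile(r"\w+")

def count_words(lines):
    """Return (total_tokens, unique_lowercased_tokens) for an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(token.lower() for token in WORD_RE.findall(line))
    return sum(counts.values()), len(counts)

# Tiny in-memory sample standing in for a corpus file.
sample = [
    "A magyar nyelv szép.",
    "A nyelv változik, a nyelv él.",
]
total, unique = count_words(sample)
print(total, unique)
```

For a real corpus file, the same function can be fed an open file handle, since it only iterates over lines.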

Annotated corpora

Parallel corpora

  • Hunglish Corpus: The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.

  • SzegedParallel: The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.

  • HunOr: A Hungarian-Russian parallel corpus comprising approximately 800 thousand words.

  • CoNLL 2017 Shared Task Hungarian data: Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl

  • CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian

  • TED talks transcripts parallel corpus: sentence-aligned TED talks, including Hungarian.

  • TaPaCo Corpus is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database

  • Duolingo STAPLE is a dataset of comprehensive accepted translations from English to 5 different languages, including Hungarian

  • PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages, including Hungarian

  • OpenSubtitles Corpus contains movie subtitles and alignments for 62 languages, including Hungarian

  • [OPUS Corpus](https://opus.nlpl.eu) is a growing collection of translated texts from the web

  • MASSIVE dataset is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation.

  • PWS is a parallel collection of the Winograd schemas in seven languages (including Hungarian)

  • [HunSimpleNews](https://huggingface.co/datasets/ELTE-DH/HunSimpleNews) is the first Hungarian text simplification corpus that includes the standard and simplified versions of whole documents.

  • HunSum-1 is a Hungarian-language dataset containing over 1.1M unique news articles with lead and other metadata. The dataset contains articles from 9 major Hungarian news websites.

  • HunSum-2-abstractive and HunSum-2-extractive are Hungarian-language datasets containing over 1.8M unique news articles with lead and other metadata. The datasets contain articles from 27 major Hungarian news websites.

  • parallelbible: The Parallel Bible Corpus is based on the historical text material of the Old Hungarian Corpus; its database contains all of the Old and Middle Hungarian Bible translations available in that corpus. The King James Bible and three Finnish translations are included in the database as well.
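
Sentence-aligned corpora such as those above are often distributed as one sentence pair per line. The following is a minimal sketch of reading such data. It assumes a tab-separated layout with the Hungarian sentence first and the English sentence second; this layout is an assumption for illustration, so check each corpus's own documentation for its actual format. `read_sentence_pairs` is a hypothetical helper.

```python
import csv
import io

def read_sentence_pairs(handle):
    """Yield (hungarian, english) tuples from tab-separated lines.

    Assumed layout: one pair per line, Hungarian first, English second.
    Lines without exactly two non-empty fields are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 2 and row[0].strip() and row[1].strip():
            yield row[0].strip(), row[1].strip()

# In-memory sample standing in for a corpus file; the last line is malformed.
sample = io.StringIO(
    "Jó reggelt!\tGood morning!\n"
    "Köszönöm szépen.\tThank you very much.\n"
    "hiányos sor\n"
)
pairs = list(read_sentence_pairs(sample))
for hu, en in pairs:
    print(f"{hu} -> {en}")
```

Skipping malformed lines instead of raising an error is a design choice that suits large, noisy web-derived corpora; stricter validation may be preferable for curated data.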

Linguistic resources

Linked Open Data

Geo data

Other

Academy

Journals

Conferences

Institutes

Learning resources

Books

Courses

Tutorials

Communities
