Awesome NLP Resources for Hungarian

Last updated 1 month ago

A curated list of free resources dedicated to Hungarian Natural Language Processing

Maintainers - György Orosz

Table of contents

  • Tools
  • Word tokenization, sentence splitting
  • Morphology
  • PoS / Morphological taggers
  • Taggers / Chunkers
  • Pipelines with Hungarian NLP components
  • Syntactic parsers
  • Semantic analysis
  • Other
  • Language models
  • Word embeddings
  • Transformer models
  • Large Language models
  • LLM Benchmarks
  • Datasets
  • Corpora
  • Raw corpora
  • Annotated corpora
  • Parallel corpora
  • Linguistic resources
  • Linked Open Data
  • Geo data
  • Speech related data
  • Other
  • Academy
  • Journals
  • Conferences
  • Institutes
  • Learning resources
  • Books
  • Courses
  • Tutorials
  • Communities
  • Other Hungarian related resource collections

Tools

Notations:

  • 👌 Easy to install and use

  • 🚀 Commercial-friendly license

  • 💯 Pretrained models are available or not needed

Word tokenization, sentence splitting

  • huntoken 👌🚀💯 Hungarian word and sentence splitter

  • quntoken 👌🚀💯 New Hungarian tokenizer based on quex and huntoken

Morphology

  • emMorph (Humor) 💯 Hungarian morphological analyzer based on Humor

  • emMorphPy 👌💯 A wrapper, a lemmatizer and REST API implemented in Python for the emMorph (Humor) Hungarian morphological analyzer

  • hunmorph 🚀💯 is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages.

  • hunmorph-foma 🚀💯 Hungarian morphological analyzer and generator based on hunmorph.

  • hunspell 👌🚀💯 is an open-source spell-checker, stemmer and morphological analyzer

  • lara-hungarian-nlp 👌🚀💯 LARA is a lightweight Python NLP library for ChatBots in Hungarian.

  • Lemmagen 👌🚀💯 project aims at providing a standardized open source multilingual platform for lemmatisation. (Python package for v3 | C# project for v3)

  • Simplemma 👌🚀💯 is a simple multilingual lemmatizer for Python

PoS / Morphological taggers

  • hunpos 👌🚀💯 Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants.

  • PurePos 👌🚀 Open source morphological tagger based on HunPos

  • purepos.py 👌🚀 Python wrapper for PurePos

Taggers / Chunkers

  • HunTag 👌🚀 A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models

  • HunTag3 👌🚀 Improved version of the original HunTag

  • SzegedNER 👌🚀💯 Named Entity Recognition tool for Hungarian and English

  • DBpedia Spotlight 👌🚀💯 DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text. (Docker image)

  • emBERT 👌🚀💯 is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package.

Pipelines with Hungarian NLP components

  • magyarlanc 👌💯 A toolkit for the basic linguistic processing of Hungarian

  • magyarlanc_spark 👌💯 Spark wrapper for magyarlanc

  • eszterland 👌💯 Clojurized access to magyarlanc

  • HuSpaCy 👌🚀💯 Industrial-strength Hungarian Natural Language Processing

  • huNLP 👌💯 An experimental unified Java and REST API for magyarlanc and szegedNER

  • hunlp-GATE 💯 GATE plugin containing Hungarian NLP tools as GATE processing resources

  • Trendminer Hungarian Processing Pipeline 🚀 Hungarian NLP pipeline for social media text analysis (TrendMiner project)

  • Google Syntaxnet 🚀💯 Neural Models of Syntax

  • UDPipe 👌🚀💯 is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files

  • polyglot 👌🚀💯 is a natural language pipeline that supports massive multilingual applications.

  • emtsv 👌💯 is a text processing system with inter-module communication via tsv + REST API

  • Stanza 👌🚀💯 is a Python NLP Library for Many Human Languages

  • spaCy StanfordNLP 👌🚀💯 wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline

  • trankit 👌🚀💯 A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Syntactic parsers

  • hunpars 🚀💯 A rule based Hungarian syntactical analyzer

  • HunParse 🚀💯 An NLTK-based parser using KR-style morphological annotation

  • Anagramma Parser: A parser based on psycholinguistic principles

  • benepar 👌🚀💯 A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.

Semantic analysis

  • SentimentAnalysisHUN 👌🚀💯 is an open-source sentiment analysis tool for the Hungarian language, written in Python.

  • hun-date-parser 👌🚀💯 A tool for extracting datetime intervals from Hungarian sentences and turning datetime objects into Hungarian text.

  • SZTAKI HunSum-1 models 👌🚀💯 mT5-small-HunSum-1, mT5-base-HunSum-1, Bert2Bert-HunSum-1

  • poltextLAB's models: emotion classification models using 6-label and 9-label codebooks.

Other

  • emLam 👌🚀💯 Preprocessing scripts for Hungarian Language Modeling

  • pywnxml 👌🚀💯 Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)

  • Hun-appointment-chatbot 👌🚀💯 A simple Hungarian chatbot for booking an appointment using the Rasa framework.

  • neural-punctuator 👌🚀💯 Automatic punctuation restoration with BERT models for English and Hungarian

  • hunaccent 👌🚀💯 Small Footprint Diacritic Restoration for Hungarian

  • Diacritics_restoration 🚀💯 Lightweight Diacritics Restoration with Dilated Convolutional Neural Networks

  • NYTK MT 👌🚀💯 NYTK Machine translation models

  • syntax-augmentation-nmt 🚀💯 Syntax-based data augmentation for Hungarian-English machine translation

  • anonymizer_hu 🚀💯 The Hungarian anonymization tool for CURLICAT

Language models

Word embeddings

  • FastText Wikipedia: pre-trained word vectors for 90 languages, trained on Wikipedia using fastText.

  • FastText Common Crawl & Wikipedia: pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model.

  • FastText_multilingual: Multilingual word vectors in 78 languages

  • polyglot vectors: polyglot embeddings on Wikipedia

  • wordvectors: Pre-trained word2vec and fastText word vectors on Wikipedia of 30+ languages

  • hunembed0.0: A word2vec word embedding trained on the concatenation of the Hungarian Webcorpus and the Hungarian National Corpus in 600 dimensions with a cut-off of 10 words.

  • Szeged word vectors: Word embeddings (word2vec & fastText) for Hungarian trained on 4.3 billion tokens

  • questions-words-hu: Hungarian analogical questions following Mikolov et al.

  • Conceptnet Numberbatch: multi- and cross-lingual semantic word embeddings

  • Multi-sense word embeddings

  • BytePair Embeddings: pretrained Subword Embeddings, downloadable in many formats

  • HuSpaCy 300d: 300d Floret embeddings trained on the Hungarian Webcorpus 2.0

  • HuSpaCy 100d: 100d Floret embeddings trained on the Hungarian Webcorpus 2.0

Transformer models

  • ELMo Representations: Deep contextualized word representation trained for many languages

  • huBERT: Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia

  • HIL* Transformer models: Pretrained transformer models provided by HILANCO

  • PULI-BERT-Large is a Hungarian BERT large model based on MegatronBERT

Large Language models

  • PULI-GPTrio is a Hungarian-English-Chinese trilingual GPT-NeoX model

  • PULI-GPT-3SX is a Hungarian GPT-NeoX model

  • SambaLingo-Hungarian-Base is a pretrained bilingual Hungarian and English model that adapts Llama-2-7b to Hungarian by training on 59 billion tokens from the Hungarian split of the Cultura-X dataset

  • SambaLingo-Hungarian-Chat is a human aligned chat model trained in Hungarian and English

  • PULI-GPT-2 is a Hungarian GPT-2 model

  • PULI-GPT-3SX is a Hungarian GPT-NeoX model (6.7 billion parameters)

LLM Benchmarks

  • HuLU evaluate is a library for evaluating and training language models on Hungarian tasks within the HuLU benchmark.

  • (M)MTEB is a dataset and leaderboard comparing 100+ text embedding models across 1000+ languages including Hungarian.

Datasets

Corpora

Raw corpora

  • National Corpus Portal: The Hungarian National Corpus Portal lists 15 corpora with links to their main page (főoldal), concordance (kereső), registration page (if needed, regisztráció), and contact e-mail (kapcsolat).

  • Hungarian Webcorpus: With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license.

  • Hungarian Webcorpus 2.0: The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words.

  • OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)

  • emLam: A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.

  • Leipzig corpora contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.

  • web2corpus: Automatically created multilingual web corpus

  • CC-100: Monolingual Datasets from Web Crawl Data

Annotated corpora

  • CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings: Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe, together with word embeddings of dimension 100 computed from lowercased texts by word2vec

  • OpinHuBank: OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian

  • HunEmPoli corpus was built using pre-agenda speeches of the Hungarian National Assembly (2014-2018) and consists of 764,008 tokens / 36,475 sentences. It has aspect-level emotion annotation with 39,840 identified emotions, and the keywords that evoked each emotion are also marked.

  • The Hungarian forum corpus for Opinion Mining: This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship.

  • Hungarian sentiment corpus (HuSent) is a deeply annotated Hungarian sentiment corpus. It is composed of Hungarian opinion texts written about different types of products, published on the homepage http://divany.hu/

  • Szeged Treebank: The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language

  • Szeged Dependency Treebank: The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank.

  • Universal Dependencies

  • Hungarian Named Entity Corpora: The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.

  • KorKor Pilotcorpus is a gold standard corpus consisting of multiple layers such as dependency parse and coreference annotations

  • NerKor is a gold standard named entity annotated corpus containing 1 million tokens.

  • NerKor 1.41e: A 1M+-token Hungarian named entity dataset with ~30 entity types derived from NYTK-NerKor

  • hunNERwiki: a silver standard corpus for Hungarian Named Entity Recognition

  • Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis

  • PrevCons is a database of 21K hapaxes of verbs with verbal prefixes

  • Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation

  • HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool.

  • HuLU: Hungarian Language Understanding Benchmark Kit

  • HuCOLA: Hungarian Corpus of Linguistic Acceptability

  • HuCoPA: Hungarian Choice of Plausible Alternatives Corpus

  • HuCommitmentBank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator.

  • HuSST: Hungarian version of the Sentiment Treebank

  • HuWNLI: Anaphora resolution datasets for Hungarian as an inference task

  • HuWS is the Hungarian set of the Winograd schemas

  • HuRTE is the Hungarian version of the Recognizing Textual Entailment datasets

  • HuRC: Hungarian Corpus for Reading Comprehension with Commonsense Reasoning

  • ELTE Poetry Corpus is a database of complete poems of 50 Hungarian canonical poets together with the sound devices of the poems and the grammatical features of words in XML format

  • ELTE Novel Corpus is a database of 400 Hungarian novels (with the annotation of structural units and the grammatical features of words in TEI XML format)

  • ELTE Drama Corpus is a database of 58 dramas (with the annotation of structural units and the grammatical features of words in TEI XML format)

  • HunSum-1 is a dataset containing over 1.1M unique news articles with lead and other metadata

  • HAPP is the Hungarian translation of the Definite Pronoun Resolution Dataset

Parallel corpora

  • [OPUS Corpus](https://opus.nlpl.eu) is a growing collection of translated texts from the web

  • [HunSimpleNews](https://huggingface.co/datasets/ELTE-DH/HunSimpleNews) is the first Hungarian text simplification corpus that includes the standard and simplified versions of whole documents.

  • Hunglish Corpus: The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.

  • SzegedParallel: The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.

  • HunOr: A Hungarian-Russian parallel corpus comprising approximately 800 thousand words.

  • CoNLL 2017 Shared Task Hungarian data: Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl

  • CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian

  • Hungarian-Russian Prisoner of War Database

  • TED talks transcripts parallel corpus: sentence aligned TED talks including Hungarian.

  • TaPaCo Corpus is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database

  • Duolingo STAPLE is a dataset of comprehensive accepted translations from English to 5 different languages, including Hungarian

  • PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages, including Hungarian

  • OpenSubtitles Corpus contains movie subtitles and alignments for 62 languages, including Hungarian

  • MASSIVE dataset is a parallel dataset of over 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation.

  • PWS is a parallel collection of the Winograd schemas in seven languages (including Hungarian)

  • HunSum-1 is a Hungarian-language dataset containing over 1.1M unique news articles with lead and other metadata. The dataset contains articles from 9 major Hungarian news websites.

  • HunSum-2-abstractive and HunSum-2-extractive are Hungarian-language datasets containing over 1.8M unique news articles with lead and other metadata. The dataset contains articles from 27 major Hungarian news websites.

  • parallelbible: The Parallel Bible Corpus is based on the historical text material of the Old Hungarian Corpus, as its database contains all of the Old and Middle Hungarian Bible translations which are available in this corpus. The King James Bible and three Finnish translations are included in the database as well.

Linguistic resources

  • morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.

  • huwn: Hungarian Wordnet

  • Hungarian Sentiment Lexicon: The dictionaries were manually created on the basis of Wordnet-Affect lexicons.

  • poltextLAB's sentiment lexicons: Highly accurate sentiment lexicons for analysing news data

  • 4lang: Concept dictionary using Eilenberg machines

  • Named Entity lists for Hungarian

  • Mazsola ISZ lists 500K verb frames extracted from the Mazsola database

  • Manocska merges verb frames from existing databases

  • PrevLex: List of phrasal verbs

  • panmorph: Tagsets and description of Hungarian morphological analysers.

  • hun_ner_checklist: CHECKLIST diagnostic test cases for Hungarian Named Entity Recognition

Linked Open Data

  • Wikipedia dumps

  • Wikidata dumps

  • DBPedia dumps

  • huwn.rdf: Hungarian WordNet in RDF format for the Linked Open Data cloud

  • Conceptnet: An open, multilingual knowledge graph (with partial Hungarian support)

Geo data

  • OpenStreetMap(OSM), Hungary (relevant keys: name, otherwise *name:hu)

  • Natural-earth-vector (name_hu imported from wikidata labels)

  • Who's On First is a gazetteer of places (with Hungarian administrative places)

Speech related data

  • Hungarian Single Speaker Speech Dataset

  • Mozilla Common Voice

Other

  • alpaca_hu_2k is the Hungarian translation of a subset of the Stanford Alpaca prompts.

Academy

Journals

  • Acta Cybernetica

Conferences

  • MSZNY: Conference on Hungarian Computational Linguistics (since 2003)

Institutes

  • Natural Language Processing Group of the Pázmány Péter Catholic University Faculty of Information Technology and Bionics

  • Department of Language Technology and Applied Linguistics, RIL-MTA

  • Human Language Technology Research Group of the Budapest University of Technology and Economics

  • Natural Language Processing Group of the Szeged University

  • BME - Laboratory of Speech Acoustics

Learning resources

Books

  • Szövegbányászat

  • Szövegbányászat és mesterséges intelligencia R-ben

  • Kvantitatív szövegelemzés és szövegbányászat a politikatudományban

Courses

  • NLP Courses by the University Of Szeged

  • NLP Courses by the HLT Group of the Budapest University of Technology

  • Mini NLP Course by the Center Of Digital Humanities

Tutorials

  • Tutorial on Text Mining for Hungarian

Communities

  • Kereső világ: Official blog of Precognox Inc.

  • Hungarian NLP Meetup

  • Deep Learning Reading Seminar Meetup

  • HuNLP Slack

Other Hungarian related resource collections

  • EENLP: The broad index of NLP resources for Eastern European languages.

  • European Language Grid

  • Hugging Face Datasets (filtered for Hungarian)