Hungary’s AI Quest: Safeguarding a Linguistic Island

At the World Artificial Intelligence Conference in Shanghai, Hungarian researcher Tamás Váradi, PhD, shared insights into Hungary's quest to preserve its unique linguistic heritage in an era dominated by AI. According to Váradi, Hungary—with its 10 million speakers and a language that doesn’t belong to the Indo-European family—is a true linguistic island in a sea of global languages. 🌍

Shifting its focus to neural deep-learning methods in the 2020s, the Hungarian Research Center for Linguistics has built the largest curated, cleaned, and deduplicated training corpus for Hungarian. The center rolled out its first native Hungarian models just two weeks before ChatGPT's breakout, at a time when GPT-3 had incorporated only 128 million Hungarian words, compared with the 32 billion words in the center's tailored corpus.
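To make the idea of a "curated, cleaned, and deduplicated" corpus a little more concrete, here is a minimal illustrative sketch of that kind of preprocessing. It is not the center's actual pipeline: the helper names clean_text and deduplicate, and the specific normalization rules, are assumptions chosen for the example.

```python
import hashlib
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, strip control characters, and collapse extra spaces."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("C") or ch in "\n\t"
    )
    return re.sub(r"[ \t]+", " ", text).strip()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each exactly repeated document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    raw_docs = [
        "A magyar nyelv  egyedülálló Európában.",  # "Hungarian is unique in Europe." (messy spacing)
        "A magyar nyelv egyedülálló Európában.",   # becomes an exact duplicate once cleaned
        "Tízmillió beszélője van.",                # "It has ten million speakers."
    ]
    cleaned = [clean_text(doc) for doc in raw_docs]
    print(deduplicate(cleaned))  # two unique sentences remain
```

Real corpus builds typically add steps such as language identification, boilerplate removal, and near-duplicate detection, but the principle is the same: every document is normalized once and counted once.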

Multilingual models soon reshaped the landscape. As Váradi explained, "When multilingual models came out in succession, particularly Meta's, we found that even though Hungarian data comprised only 0.006 percent, it scaled up to 40 billion words." The rapid advances by global leaders, including companies from the Chinese mainland and Meta, pose significant challenges for a team working with far more limited resources.

Despite these hurdles, Váradi remains confident. He stated, "Global models often lack the expertise and attention to individual language components," emphasizing that a carefully curated language database—sourced not only from the internet but also from libraries and repositories—gives them full control over representing Hungarian culture.

This story is a vibrant reminder that in our fast-evolving digital era, technology can bridge the gap and help preserve cultural heritage. Local expertise is key in safeguarding linguistic diversity. Stay curious and keep celebrating your roots! 🤖
