Natural Language Processing at Arcanum

In the past decade, we have seen the emerging demand to communicate with machines and systems surrounding us using natural language. Think of the virtual assistant of your smartphone or an advanced translation application. We want the software to understand us by brief sentences. Such systems are based on artificial intelligence with which machines can comprehend commands by humans on a higher level, thus providing us with more accurate and relevant answers.

BERT model – which is a natural language processing method – helps Arcanum users. In this article, we’re going to present our services that utilize the BERT model and the technology behind them.

BERT in a nutshell

This machine learning model was introduced by Google in 2018 with the aim of improving the way how a computer can transform natural human languages into mathematical expressions. By this, Google could better understand what people type in their search box. The significance of this model is that BERT can interpret words in their context.

As the name suggests, any machine learning model needs training before fulfilling a particular function. In our case, BERT had to be introduced to the unique Hungarian language. To achieve this, we trained the model on Arcanum’s proprietary corpus containing 10 billion Hungarian words . What the model does is it practically reads through the entire text of this massive corpus while finding mathematical connections and patterns between words in the sentences. This results in – after some fine-tuning – a fully trained language model that is capable of handling a wide range of tasks upon which we have built some of the most exciting features of the Arcanum Newspapers.

Named entity recognition

After analyzing what our users try to find in Arcanum Newspapers, we see that names related to persons, cities, and institutions are overrepresented. However, the spelling of proper names may change over time due to political or grammatical changes. Besides, a proper name can often refer to different entities, among which differentiating the search query is challenging. For example, typing in the name ‘Kossuth’ (famous politician in the 19 th century Hungary) brings results to both the personality and the many institutes, schools, and street names bearing his name. Or the Hungarian word ‘lenti’ can refer to a Hungarian town in Zala county, or it can be an adjective that refers to an object below something.

To develop a feature that recognizes proper names, we used the already introduced BERT model. We annotated a text corpus consisting of 8,000 paragraphs (c. 450 thousand words) by marking names in ten categories, such as proper names, institution names, location, title, event, etc. This hand-compiled database served as a training dataset for the BERT model to recognize names in texts.

1803. október 17-én Söjtörön született Deák Ferenc , és itt töltötte gyerekkorát. Pályája a reformkori ellenzéktől az 1867-es , a forradalmat és szabadságharcot lezáró kiegyezés létrehozásáig ívelt. Az 1832-36-os országgyűlésen Zala vármegye követe, 1848-ban az első felelős magyar kormány igazságügy-minisztere volt. A forradalom után visszavonult Kehidára , majd 1854-ben Pestre költözött, hogy létrehozza nagy művét - a kiegyezést. Szülőhelyén a kiállítás ezt az életutat tekinti át.

Figure 1. Sample from the name recognition training dataset

Our users can observe these names as suggestions when using the full-text search box. As soon as the user starts typing, the AI finds the matching names in the database and offers them as search suggestions based on relevancy. The system can handle typos and alternative spellings.

Figure 2. Variations to the former Vietnamese prime minister

Question answering

Programming computers to comprehend and answer questions that are formed using natural language has been a long-standing challenge for scientists and linguists. In 2018, Google launched Talk to Books , a service that can understand questions using AI and finds relevant answers in more than a hundred thousand books. Since this service is also based on the BERT model, we have been experimenting with a similar service for Arcanum Newspapers. For this, we have built our training dataset: we picked 100 well-written Hungarian Wikipedia pages, then we wrote 10,000 related questions using natural language. Finally, we marked the answers to those questions in the text. These questions and answers combined are the basis for the algorithm. As a result, the algorithm learned to understand questions in Hungarian and finds relevant answers in a database of any size.

Bronx

Bronx New York városának legészakibb kerülete, amely egybeesik Bronx megyével. A város öt kerülete közül ez az egyetlen, amelynek nagyobb része van szárazföldön, mint szigeten.

A 2010-es népszámlálás adatai szerint 1 385 108 lakosa volt. Ha minden kerület önálló városnak számítana, akkor Bronx volna a kilencedik legnépesebb amerikai város. A népességben az 1960-as években csökkenés mutatkozott, majd ez újra növekedésnek indult. A legmagasabb népességet 1950-ben számlálták.Bronx a negyedik legnépesebb New York öt kerülete közül, és az ötödik legnépesebb járás a New York-i agglomerációban. Bár a köznyelvben egyszerűen „The Bronx” a neve, a járás hivatalos nevében nincs névelő („The”).

Nevét a Bronx folyóról kapta, és mivel a folyókat az angol nyelvben általában névelővel használják, (pl. „the Hudson”) ez a járás nevében is benne maradt. A folyót egy svédről , Jonas Bronckról nevezték el, aki tengerészkapitány volt és 1641-ben egy 2 km² méretű birtoka volt a Harlem folyó és a Bronx (vagy akkori indián nevén Aquahung) folyó között.

  1. Mennyi a lakossága Bronx-nak?
  2. Melyik New York legészakibb kerülete?
  3. Hány kerülete van New Yorknak?
  4. Miről kapta nevét Bronx?
  5. Milyen nemzetiségű emberről kapta a nevét a Bronx folyó?
  6. Mi volt Jonas Bronck foglalkozása?
  7. Melyik Bronx legmagasabb épülete?

Figure 1. Sample from our question-answering training dataset

We aim to help users find relevant answers to their questions as simple as possible.

Try out our question answering service that has been built on the most important Hungarian
encyclopedias.


* Corpus is a linguistical term, it refers to the collection of actually occured written, or noted spoken lingual data.

Try it here

Millions of pages of domestic scientific and specialized journals, encyclopaedias, weekly and daily newspapers.

Let's see
Try it here

Millions of pages of domestic scientific and specialized journals, encyclopaedias, weekly and daily newspapers.

Let's see

Arcanum logo

Arcanum is an online publisher that creates massive structured databases of digitized cultural contents.

The Company Contact Press room

Languages