Smooth inverse frequency based text data selection for medical dictation

Bálint Domonkos and Mihajlik Péter: Smooth inverse frequency based text data selection for medical dictation. In: Magyar Számítógépes Nyelvészeti Konferencia, (17). pp. 233-242. (2021)


The under-resourced domain problem is significant in automatic speech recognition (ASR), especially for small languages such as Hungarian and in fields where data is often confidential, such as finance and medicine. We introduce a method that uses word embeddings and a smooth inverse frequency (SIF) based distance measure to filter public-domain web corpora. The selection of documents matching the (medical) domain can be scaled. The resulting text is used to train an augmented language model for a medical dictation system. We show that an appropriately scaled selection leads to the best performance of the ASR system compared with the baselines in which either no data augmentation was applied or all of the augmentation data was added.
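
The selection idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word vectors, the function names (sif_embedding, cosine, select_documents), the smoothing constant a, and the keep_fraction parameter are all assumptions made for illustration, and the common-component removal step of the original SIF formulation is omitted for brevity. Each document is embedded as an average of word vectors weighted by a / (a + p(w)), compared with an in-domain reference by cosine similarity, and the closest fraction of candidates is kept.

```python
import math
from collections import Counter


def sif_embedding(tokens, vectors, word_prob, a=1e-3):
    """Average the word vectors with SIF weights a / (a + p(w)).

    Out-of-vocabulary tokens are skipped. The common-component removal
    step of the original SIF method is intentionally left out here.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        if tok not in vectors:
            continue
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, x in enumerate(vectors[tok]):
            total[i] += weight * x
        count += 1
    return [x / count for x in total] if count else total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_documents(candidates, in_domain_tokens, vectors, word_prob,
                     keep_fraction):
    """Keep the fraction of candidates closest to the in-domain text."""
    target = sif_embedding(in_domain_tokens, vectors, word_prob)
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(sif_embedding(doc, vectors, word_prob), target),
        reverse=True,
    )
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:n_keep]


# Toy data (hypothetical): 2-dimensional "embeddings" and tiny documents.
vectors = {
    "patient": [1.0, 0.0], "diagnosis": [0.9, 0.1],
    "football": [0.0, 1.0], "goal": [0.1, 0.9], "the": [0.5, 0.5],
}
candidates = [
    ["the", "football", "goal"],
    ["the", "patient", "diagnosis"],
    ["the", "goal"],
]
in_domain = ["patient", "diagnosis", "the", "patient"]

# Unigram probabilities estimated from all available tokens.
counts = Counter(tok for doc in candidates + [in_domain] for tok in doc)
n_tokens = sum(counts.values())
word_prob = {w: c / n_tokens for w, c in counts.items()}

selected = select_documents(candidates, in_domain, vectors, word_prob, 0.34)
print(selected)
```

In this sketch, keep_fraction plays the role of the scalable selection: a small value keeps only the candidates nearest to the in-domain text, while 1.0 corresponds to adding all of the augmentation data.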

Item Type: Article
Heading title: Poster, laptop demonstration
Journal or Publication Title: Magyar Számítógépes Nyelvészeti Konferencia
Date: 2021
Volume: 17
ISBN: 978-963-306-781-9
Page Range: pp. 233-242
Language: English
Event Title: Magyar számítógépes nyelvészeti konferencia (17.) (2021) (Szeged)
Uncontrolled Keywords: Linguistics - computer applications
Additional Information: Bibliography: pp. 240-242 and in the footnotes; abstract in English
Subjects: 01. Natural sciences
01. Natural sciences > 01.02. Computer and information sciences
06. Humanities
06. Humanities > 06.02. Languages and Literature
Date Deposited: 2021. Sep. 28. 13:00
Last Modified: 2021. Sep. 28. 13:00
