A corpus-based word frequency list of Turkish : Evidence from the subcorpora of Turkish National Corpus project

Aksan, Yeşim and Yaldır, Yilmaz: A corpus-based word frequency list of Turkish : Evidence from the subcorpora of Turkish National Corpus project. Studia uralo-altaica, (49). pp. 47-57. (2012)

Article, study, work
altaica_049_047-057.pdf

Download (5MB)

Abstract

Word frequency studies have a central role in various disciplines, such as linguistics, cognitive psychology, natural language processing, and computational linguistics. Developments in computer technology and information processing help researchers compile comprehensive word lists from digitally constructed language corpora. Since Kucera and Francis's first corpus-based word frequency lists, derived from the Brown Corpus (1967), a variety of studies have been conducted on general or specialized corpora to obtain the rank-frequency order and distribution of words for different Indo-European languages (Johansson & Hofland 1989; Leech et al. 2001; Baroni et al. 2004; Ha et al. 2006; Davies & Gardner 2010). For Turkish, Goz's dictionary (2003), which is based on a 1 million-word general corpus, is the only work on word frequency. The lexical properties of Turkish in general, and of text collections representing different registers of Turkish in particular, still need to be described via corpus-based word frequency lists. With this need in mind, this study has two aims: (1) to produce word frequency lists of Turkish on the basis of two subcorpora, namely the Corpus of Contemporary Turkish Fiction and the Corpus of Contemporary Turkish News Texts, preparing frequency lists of both root types and word classes; and (2) to compare these two corpora by using frequency profiling information. This paper is organized as follows. First, we explain basic concepts and review the literature on word frequency studies. Then, we describe the construction of the two subcorpora used to derive the word lists and explain the tokenization steps and the root-type mapping scheme on which the token and root counts are based. Finally, we compare the rank-frequency and word-class lists of the Turkish Fiction and Turkish News Texts corpora.
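The two aims named in the abstract, building a rank-frequency list and comparing two corpora by frequency profiling, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say which comparison statistic was used, so the log-likelihood (G2) score here is an assumption, chosen because it is commonly used for frequency profiling. Tokenization is reduced to whitespace splitting, root-type mapping is omitted, and the two short token lists are made-up placeholders rather than corpus data.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are ordered stably.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def log_likelihood(count_a, total_a, count_b, total_b):
    """Log-likelihood (G2) of one word's counts in two corpora."""
    combined = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def compare_corpora(tokens_a, tokens_b):
    """Score every word by how strongly its frequency differs between corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocabulary = counts_a.keys() | counts_b.keys()
    scores = {
        word: log_likelihood(counts_a[word], total_a, counts_b[word], total_b)
        for word in vocabulary
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy token lists standing in for the fiction and news subcorpora.
fiction = "bir gün bir adam geldi ve bir şey dedi".split()
news = "bakan bir açıklama yaptı ve açıklama yayımlandı".split()

fiction_ranks = rank_frequency(fiction)
keyness = compare_corpora(fiction, news)
```

In this sketch a higher score means a word's relative frequency differs more between the two lists. A real study would count root types rather than surface forms, as the abstract describes.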

Item Type: Article
Event Title: 15th International Conference on Turkish Linguistics, Szeged, 2010
Journal or Publication Title: Studia uralo-altaica
Date: 2012
Volume: 49
Page Range: pp. 47-57
ISSN: 0133-4239
Language: English
Heading title: The Szeged Conference
Uncontrolled Keywords: Linguistics
Additional Information: Bibliography: pp. 56-57.
Date Deposited: 15 Oct 2016 14:09
Last Modified: 24 May 2018 09:43
URI: http://acta.bibl.u-szeged.hu/id/eprint/16689
