Graph-Based Turkish Text Normalization and Its Impact on Noisy Text Processing

dc.contributor.author Topçu, Berkay
dc.contributor.author Demir, Şeniz
dc.date.accessioned 2022-07-01T06:31:34Z
dc.date.available 2022-07-01T06:31:34Z
dc.date.issued 2022
dc.description.abstract User-generated texts on the web are freely available and lucrative sources of data for language technology researchers. Unfortunately, these texts are often dominated by informal writing styles, and the language used in user-generated content poses processing difficulties for natural language tools. The resulting performance drops and processing issues can be addressed either by adapting language tools to user-generated content or by normalizing noisy texts before they are processed. In this article, we propose a Turkish text normalizer that maps non-standard words to their appropriate standard forms using a graph-based methodology and a context-tailoring approach. Our normalizer benefits from both contextual and lexical similarities between normalization pairs as identified by a graph-based subnormalizer and a transformation-based subnormalizer. The performance of our normalizer is demonstrated on a tweet dataset in the most comprehensive intrinsic and extrinsic evaluations reported so far for Turkish. This is the first graph-based solution to Turkish text normalization with a novel context-tailoring approach, and it advances the state of the art by outperforming other publicly available normalizers. For the first time in the literature, we measure the extent to which the accuracy of a Turkish language processing tool is affected by normalizing noisy texts before they are processed. An analysis of these extrinsic evaluations, which cover more than one Turkish NLP task (i.e., part-of-speech tagging and dependency parsing), reveals that Turkish language tools are not robust to noisy texts and that a normalizer leads to remarkable performance improvements when used as a preprocessing tool in this morphologically rich language.
dc.identifier.citation Demir, S., & Topcu, B. (June 2022). Graph-based Turkish text normalization and its impact on noisy text processing. Engineering Science and Technology, an International Journal. pp.1-13. https://doi.org/10.1016/j.jestch.2022.101192
dc.identifier.doi 10.1016/j.jestch.2022.101192
dc.identifier.issn 2215-0986
dc.identifier.scopus 2-s2.0-85135925837
dc.identifier.uri https://doi.org/10.1016/j.jestch.2022.101192
dc.identifier.uri https://hdl.handle.net/20.500.11779/1794
dc.language.iso en
dc.publisher Elsevier
dc.relation.ispartof Engineering Science and Technology, an International Journal
dc.rights info:eu-repo/semantics/openAccess
dc.subject Noisy text
dc.subject Graph-based representation
dc.subject Turkish
dc.subject Text normalization
dc.title Graph-Based Turkish Text Normalization and Its Impact on Noisy Text Processing
dc.type Article
dspace.entity.type Publication
gdc.author.id Şeniz Demir / 0000-0003-4897-4616
gdc.author.institutional Demir, Şeniz
gdc.bip.impulseclass C4
gdc.bip.influenceclass C5
gdc.bip.popularityclass C4
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial true
gdc.description.department Faculty of Engineering, Department of Computer Engineering
gdc.description.endpage 13
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member
gdc.description.scopusquality Q1
gdc.description.startpage 1
gdc.description.volume 35
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q1
gdc.identifier.openalex W4283331272
gdc.identifier.wos WOS:000892526300014
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 5.0
gdc.oaire.influence 3.0557517E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Turkish
gdc.oaire.keywords Graph-based representation
gdc.oaire.keywords Text normalization
gdc.oaire.keywords Noisy text
gdc.oaire.keywords TA1-2040
gdc.oaire.keywords Engineering (General). Civil engineering (General)
gdc.oaire.popularity 6.0498984E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 0211 other engineering and technologies
gdc.oaire.sciencefields 0202 electrical engineering, electronic engineering, information engineering
gdc.oaire.sciencefields 02 engineering and technology
gdc.openalex.collaboration National
gdc.openalex.fwci 1.7241
gdc.openalex.normalizedpercentile 0.87
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 5
gdc.plumx.crossrefcites 5
gdc.plumx.mendeley 27
gdc.plumx.newscount 1
gdc.plumx.scopuscites 9
gdc.publishedmonth June
gdc.relation.journal Engineering Science and Technology, an International Journal
gdc.scopus.citedcount 12
gdc.virtual.author Demir, Şeniz
gdc.wos.citedcount 6
gdc.wos.publishedmonth June
gdc.yokperiod YÖK - 2021-22
relation.isAuthorOfPublication 93fa0200-13f7-446a-bdc2-118401cab062
relation.isAuthorOfPublication.latestForDiscovery 93fa0200-13f7-446a-bdc2-118401cab062
relation.isOrgUnitOfPublication 05ffa8cd-2a88-4676-8d3b-fc30eba0b7f3
relation.isOrgUnitOfPublication 0d54cd31-4133-46d5-b5cc-280b2c077ac3
relation.isOrgUnitOfPublication a6e60d5c-b0c7-474a-b49b-284dc710c078
relation.isOrgUnitOfPublication.latestForDiscovery 05ffa8cd-2a88-4676-8d3b-fc30eba0b7f3

Files

Original bundle

Name: 1-s2.0-S221509862200101X-main.pdf
Size: 1.03 MB
Format: Adobe Portable Document Format
Description: Full Text - Article

License bundle

Name: license.txt
Size: 1.44 KB
Format: Item-specific license agreed upon to submission