Graph-Based Turkish Text Normalization and Its Impact on Noisy Text Processing
| dc.contributor.author | Topçu, Berkay | |
| dc.contributor.author | Demir, Şeniz | |
| dc.date.accessioned | 2022-07-01T06:31:34Z | |
| dc.date.available | 2022-07-01T06:31:34Z | |
| dc.date.issued | 2022 | |
| dc.description.abstract | User generated texts on the web are freely-available and lucrative sources of data for language technology researchers. Unfortunately, these texts are often dominated by informal writing styles and the language used in user generated content poses processing difficulties for natural language tools. Experienced performance drops and processing issues can be addressed either by adapting language tools to user generated content or by normalizing noisy texts before being processed. In this article, we propose a Turkish text normalizer that maps non-standard words to their appropriate standard forms using a graph-based methodology and a context-tailoring approach. Our normalizer benefits from both contextual and lexical similarities between normalization pairs as identified by a graph-based subnormalizer and a transformation-based subnormalizer. The performance of our normalizer is demonstrated on a tweet dataset in the most comprehensive intrinsic and extrinsic evaluations reported so far for Turkish. In this article, we present the first graph-based solution to Turkish text normalization with a novel context-tailoring approach, which advances the state-of-the-art results by outperforming other publicly available normalizers. For the first time in the literature, we measure the extent to which the accuracy of a Turkish language processing tool is affected by normalizing noisy texts before being processed. An analysis of these extrinsic evaluations that focus on more than one Turkish NLP task (i.e., part-of-speech tagger and dependency parser) reveals that Turkish language tools are not robust to noisy texts and a normalizer leads to remarkable performance improvements once used as a preprocessing tool in this morphologically-rich language. | |
| dc.identifier.citation | Demir, S., & Topcu, B. (June 2022). Graph-based Turkish text normalization and its impact on noisy text processing. Engineering Science and Technology, an International Journal. pp.1-13. https://doi.org/10.1016/j.jestch.2022.101192 | |
| dc.identifier.doi | 10.1016/j.jestch.2022.101192 | |
| dc.identifier.issn | 2215-0986 | |
| dc.identifier.scopus | 2-s2.0-85135925837 | |
| dc.identifier.uri | https://doi.org/10.1016/j.jestch.2022.101192 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.11779/1794 | |
| dc.language.iso | en | |
| dc.publisher | Elsevier | |
| dc.relation.ispartof | Engineering Science and Technology, an International Journal | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.subject | Noisy text | |
| dc.subject | Graph-based representation | |
| dc.subject | Turkish | |
| dc.subject | Text normalization | |
| dc.title | Graph-Based Turkish Text Normalization and Its Impact on Noisy Text Processing | |
| dc.type | Article | |
| dspace.entity.type | Publication | |
| gdc.author.id | Şeniz Demir / 0000-0003-4897-4616 | |
| gdc.author.institutional | Demir, Şeniz | |
| gdc.bip.impulseclass | C4 | |
| gdc.bip.influenceclass | C5 | |
| gdc.bip.popularityclass | C4 | |
| gdc.coar.access | open access | |
| gdc.coar.type | text::journal::journal article | |
| gdc.collaboration.industrial | true | |
| gdc.description.department | Faculty of Engineering, Department of Computer Engineering | |
| gdc.description.endpage | 13 | |
| gdc.description.publicationcategory | Article - International Refereed Journal - Institutional Faculty Member | |
| gdc.description.scopusquality | Q1 | |
| gdc.description.startpage | 1 | |
| gdc.description.volume | 35 | |
| gdc.description.woscitationindex | Science Citation Index Expanded | |
| gdc.description.wosquality | Q1 | |
| gdc.identifier.openalex | W4283331272 | |
| gdc.identifier.wos | WOS:000892526300014 | |
| gdc.index.type | WoS | |
| gdc.index.type | Scopus | |
| gdc.oaire.accesstype | GOLD | |
| gdc.oaire.diamondjournal | false | |
| gdc.oaire.impulse | 5.0 | |
| gdc.oaire.influence | 3.0557517E-9 | |
| gdc.oaire.isgreen | true | |
| gdc.oaire.keywords | Turkish | |
| gdc.oaire.keywords | Graph-based representation | |
| gdc.oaire.keywords | Text normalization | |
| gdc.oaire.keywords | Noisy text | |
| gdc.oaire.keywords | TA1-2040 | |
| gdc.oaire.keywords | Engineering (General). Civil engineering (General) | |
| gdc.oaire.popularity | 6.0498984E-9 | |
| gdc.oaire.publicfunded | false | |
| gdc.oaire.sciencefields | 0211 other engineering and technologies | |
| gdc.oaire.sciencefields | 0202 electrical engineering, electronic engineering, information engineering | |
| gdc.oaire.sciencefields | 02 engineering and technology | |
| gdc.openalex.collaboration | National | |
| gdc.openalex.fwci | 1.7241 | |
| gdc.openalex.normalizedpercentile | 0.87 | |
| gdc.openalex.toppercent | TOP 10% | |
| gdc.opencitations.count | 5 | |
| gdc.plumx.crossrefcites | 5 | |
| gdc.plumx.mendeley | 27 | |
| gdc.plumx.newscount | 1 | |
| gdc.plumx.scopuscites | 9 | |
| gdc.publishedmonth | June | |
| gdc.relation.journal | Engineering Science and Technology, an International Journal | |
| gdc.scopus.citedcount | 12 | |
| gdc.virtual.author | Demir, Şeniz | |
| gdc.wos.citedcount | 6 | |
| gdc.wos.publishedmonth | June | |
| gdc.yokperiod | YÖK - 2021-22 | |
| relation.isAuthorOfPublication | 93fa0200-13f7-446a-bdc2-118401cab062 | |
| relation.isAuthorOfPublication.latestForDiscovery | 93fa0200-13f7-446a-bdc2-118401cab062 | |
| relation.isOrgUnitOfPublication | 05ffa8cd-2a88-4676-8d3b-fc30eba0b7f3 | |
| relation.isOrgUnitOfPublication | 0d54cd31-4133-46d5-b5cc-280b2c077ac3 | |
| relation.isOrgUnitOfPublication | a6e60d5c-b0c7-474a-b49b-284dc710c078 | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 05ffa8cd-2a88-4676-8d3b-fc30eba0b7f3 |
Files
Original bundle
- Name: 1-s2.0-S221509862200101X-main.pdf
- Size: 1.03 MB
- Format: Adobe Portable Document Format
- Description: Full Text - Article
