A Benchmark Dataset for Turkish Data-to-Text Generation

dc.contributor.author Demir, Şeniz
dc.contributor.author Öktem, Seza
dc.date.accessioned 2022-07-22T11:32:03Z
dc.date.available 2022-07-22T11:32:03Z
dc.date.issued 2022
dc.description.abstract In the last decades, data-to-text (D2T) systems that directly learn from data have gained a lot of attention in natural language generation. These systems need data with high quality and large volume, but unfortunately some natural languages suffer from the lack of readily available generation datasets. This article describes our efforts to create a new Turkish dataset (Tr-D2T) that consists of meaning representation and reference sentence pairs without fine-grained word alignments. We utilize Turkish web resources and existing datasets in other languages for producing meaning representations and collect reference sentences by crowdsourcing native speakers. We particularly focus on the generation of single-sentence biographies and dining venue descriptions. In order to motivate future Turkish D2T studies, we present detailed benchmarking results of different sequence-to-sequence neural models trained on this dataset. To the best of our knowledge, this work is the first of its kind that provides preliminary findings and lessons learned from the creation of a new Turkish D2T dataset. Moreover, our work is the first extensive study that presents generation performances of transformer and recurrent neural network models from meaning representations in this morphologically-rich language.
dc.description.sponsorship TUBITAK-ARDEB, Turkey [117E977]
dc.description.sponsorship Artun Burak Mecik; Batuhan Bilgin; TUBITAK-ARDEB, (117E977)
dc.description.sponsorship This work is supported by TUBITAK-ARDEB, Turkey under the grant number 117E977. The dataset is available for research purposes and non-commercial use. To obtain the dataset, you are required to send an email to the corresponding author, and agree to general terms and conditions for data usage according to TUBITAK Open Science Policy. The authors want to thank Uluc Furkan Vardar and Ilkay Tevfik Devran for implementing the XML parser and building input meaning representations, and Artun Burak Mecik, Batuhan Bilgin, and Volkan Ozer for delexicalizing the collected dataset.
dc.identifier.citation Demir, S., & Oktem, S. (16 July 2022). A benchmark dataset for Turkish data-to-text generation. Computer Speech & Language. pp.1-45. https://doi.org/10.1016/j.csl.2022.101433
dc.identifier.doi 10.1016/j.csl.2022.101433
dc.identifier.issn 0885-2308
dc.identifier.issn 1095-8363
dc.identifier.scopus 2-s2.0-85134849907
dc.identifier.uri https://doi.org/10.1016/j.csl.2022.101433
dc.identifier.uri https://hdl.handle.net/20.500.11779/1807
dc.language.iso en
dc.publisher Elsevier
dc.relation.ispartof Computer Speech & Language
dc.rights info:eu-repo/semantics/closedAccess
dc.subject Turkish
dc.subject Neural models
dc.subject Dining venue domain
dc.subject Biography domain
dc.subject Data-to-text generation
dc.subject Crowdsourcing
dc.title A Benchmark Dataset for Turkish Data-to-Text Generation
dc.type Article
dspace.entity.type Publication
gdc.author.id Şeniz Demir / 0000-0003-4897-4616
gdc.author.id Seza Öktem / 0000-0003-2885-7359
gdc.author.id Demir, Şeniz/0000-0003-4897-4616
gdc.author.institutional Demir, Şeniz
gdc.author.institutional Öktem, Seza
gdc.author.scopusid 57818047100
gdc.author.scopusid 14044928200
gdc.author.wosid Demir, Şeniz/AAB-5451-2021
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access metadata only access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department Faculty of Engineering, Department of Computer Engineering
gdc.description.departmenttemp [Demir, Seniz] MEF Univ, Dept Comp Engn, Istanbul, Turkiye; [Oktem, Seza] MEF Univ, Dept English Language Teaching, Istanbul, Turkiye
gdc.description.endpage 45
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member
gdc.description.scopusquality Q1
gdc.description.startpage 1
gdc.description.volume 77
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q2
gdc.identifier.openalex W4285606469
gdc.identifier.wos WOS:000834597200001
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.diamondjournal false
gdc.oaire.impulse 2.0
gdc.oaire.influence 2.6821256E-9
gdc.oaire.isgreen true
gdc.oaire.popularity 3.469167E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 0202 electrical engineering, electronic engineering, information engineering
gdc.oaire.sciencefields 02 engineering and technology
gdc.openalex.collaboration National
gdc.openalex.fwci 0.4138
gdc.openalex.normalizedpercentile 0.68
gdc.opencitations.count 2
gdc.plumx.mendeley 12
gdc.plumx.newscount 1
gdc.plumx.scopuscites 2
gdc.publishedmonth July
gdc.relation.journal Computer Speech & Language
gdc.scopus.citedcount 3
gdc.virtual.author Demir, Şeniz
gdc.wos.citedcount 3
gdc.wos.collaboration Not conducted with international collaboration - NO
gdc.wos.documenttype Article
gdc.wos.indexdate 2022
gdc.wos.publishedmonth July
gdc.yokperiod YÖK - 2021-22
relation.isAuthorOfPublication 93fa0200-13f7-446a-bdc2-118401cab062
relation.isAuthorOfPublication.latestForDiscovery 93fa0200-13f7-446a-bdc2-118401cab062
relation.isOrgUnitOfPublication 05ffa8cd-2a88-4676-8d3b-fc30eba0b7f3
relation.isOrgUnitOfPublication 0d54cd31-4133-46d5-b5cc-280b2c077ac3
relation.isOrgUnitOfPublication a6e60d5c-b0c7-474a-b49b-284dc710c078
relation.isOrgUnitOfPublication.latestForDiscovery 05ffa8cd-2a88-4676-8d3b-fc30eba0b7f3

Files

Original bundle

Name:
1-s2.0-S0885230822000614-main.pdf
Size:
1.46 MB
Format:
Adobe Portable Document Format
Description:
Full Text - Article