A Framework for Automatic Generation of Spoken Question-Answering Data

dc.contributor.author Manav, Y.
dc.contributor.author Menevşe, M.Ü.
dc.contributor.author Özgür, A.
dc.contributor.author Arısoy, Ebru
dc.date.accessioned 2023-10-18T12:13:23Z
dc.date.available 2023-10-18T12:13:23Z
dc.date.issued 2022
dc.description The authors would like to thank Şeniz Demir for providing the Turkish Wikipedia dataset, Emrah Budur for providing the English to Turkish machine translated SQuAD dataset and the anonymous reviewers for their valuable feedback.
dc.description.abstract This paper describes a framework to automatically generate a spoken question answering (QA) dataset. The framework consists of a question generation (QG) module to generate questions automatically from given text documents, a text-to-speech (TTS) module to convert the text documents into spoken form and an automatic speech recognition (ASR) module to transcribe the spoken content. The final dataset contains question-answer pairs for both the reference text and ASR transcriptions as well as the audio files corresponding to each reference text. For QG and ASR systems we used pre-trained multilingual encoder-decoder transformer models and fine-tuned these models using a limited amount of manually generated QA data and TTS-based speech data, respectively. As a proof of concept, we investigated the proposed framework for Turkish and generated the Turkish Question Answering (TurQuAse) dataset using Wikipedia articles. Manual evaluation of the automatically generated question-answer pairs and QA performance evaluation with state-of-the-art models on TurQuAse show that the proposed framework is efficient for automatically generating spoken QA datasets. To the best of our knowledge, TurQuAse is the first publicly available spoken question answering dataset for Turkish. The proposed framework can be easily extended to other languages where a limited amount of QA data is available. © 2022 Association for Computational Linguistics.
dc.identifier.citation Menevşe, M. Ü., Manav, Y., Arısoy, E., & Özgür, A. (2022, December). A Framework for Automatic Generation of Spoken Question-Answering Data. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 4659-4666).
dc.identifier.scopus 2-s2.0-85149897199
dc.identifier.uri https://hdl.handle.net/20.500.11779/1998
dc.language.iso en
dc.publisher Association for Computational Linguistics (ACL)
dc.rights info:eu-repo/semantics/closedAccess
dc.subject Speech module
dc.subject Turkish
dc.subject Speech-recognition modules
dc.subject Question-answer pairs
dc.subject Question answering
dc.subject Speech recognition
dc.subject Computational linguistics
dc.subject Audio files
dc.subject Text to speech
dc.subject Automatic speech recognition
dc.subject Text document
dc.subject Character recognition
dc.subject Automatic generation
dc.title A Framework for Automatic Generation of Spoken Question-Answering Data
dc.type Conference Object
dspace.entity.type Publication
gdc.author.institutional Arısoy, Ebru
gdc.author.institutional Arısoy Saraçlar, Ebru
gdc.coar.access metadata only access
gdc.coar.type text::conference output
gdc.description.department Faculty of Engineering, Department of Electrical and Electronics Engineering
gdc.description.endpage 4695
gdc.description.publicationcategory Conference Item - International - Institutional Faculty Member
gdc.description.scopusquality N/A
gdc.description.startpage 4688
gdc.description.wosquality N/A
gdc.publishedmonth December
gdc.relation.journal 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 -- 7 December 2022 through 11 December 2022 -- 186900
gdc.relation.journal Findings of the Association for Computational Linguistics: EMNLP 2022
gdc.scopus.citedcount 4
gdc.wos.publishedmonth December
gdc.wos.yokperiod YÖK - 2022-23
relation.isAuthorOfPublication 0b895153-5793-4e46-bc2f-06a28b30f531
relation.isAuthorOfPublication.latestForDiscovery 0b895153-5793-4e46-bc2f-06a28b30f531
relation.isOrgUnitOfPublication de19334f-6a5b-4f7b-9410-9433c48d1e5a
relation.isOrgUnitOfPublication 0d54cd31-4133-46d5-b5cc-280b2c077ac3
relation.isOrgUnitOfPublication a6e60d5c-b0c7-474a-b49b-284dc710c078
relation.isOrgUnitOfPublication.latestForDiscovery de19334f-6a5b-4f7b-9410-9433c48d1e5a

Files

Original bundle

Name: 2022.findings-emnlp.342.pdf
Size: 178.7 KB
Format: Adobe Portable Document Format
Description: Full Text - Article

License bundle

Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission
Description: