Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering

dc.contributor.author Arisoy, E.
dc.contributor.author Menevşe, M.U.
dc.contributor.author Manav, Y.
dc.contributor.author Özgür, A.
dc.date.accessioned 2025-12-05T17:08:11Z
dc.date.available 2025-12-05T17:08:11Z
dc.date.issued 2025
dc.description Meta en_US
dc.description.abstract Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise. © 2025 International Speech Communication Association. All rights reserved. en_US
dc.identifier.doi 10.21437/Interspeech.2025-1965
dc.identifier.isbn 9781713836902
dc.identifier.isbn 9781713820697
dc.identifier.isbn 9781605603162
dc.identifier.isbn 9781617821233
dc.identifier.isbn 9781604234497
dc.identifier.issn 1990-9772
dc.identifier.issn 2958-1796
dc.identifier.scopus 2-s2.0-105020060826
dc.identifier.uri https://doi.org/10.21437/Interspeech.2025-1965
dc.identifier.uri https://hdl.handle.net/20.500.11779/3141
dc.language.iso en
dc.publisher International Speech Communication Association
dc.relation.ispartof Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH -- 26th Interspeech Conference 2025 -- 2025-08-17 through 2025-08-21 -- Rotterdam -- 213554
dc.rights info:eu-repo/semantics/closedAccess
dc.subject Data Generation en_US
dc.subject Large Language Models en_US
dc.subject Spoken Question Answering en_US
dc.title Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
dc.type Conference Object
dspace.entity.type Publication
gdc.author.institutional Arısoy, Ebru
gdc.author.scopusid 14030977200
gdc.author.scopusid 58137783500
gdc.author.scopusid 57219551922
gdc.author.scopusid 56230487200
gdc.coar.access metadata only access
gdc.coar.type text::conference output
gdc.description.department Faculty of Engineering, Department of Electrical and Electronics Engineering
gdc.description.endpage 1777 en_US
gdc.description.publicationcategory Conference Item - International - Institutional Faculty Member
gdc.description.scopusquality N/A
gdc.description.startpage 1773 en_US
gdc.description.wosquality N/A
gdc.identifier.openalex W4415432975
gdc.index.type Scopus
gdc.openalex.fwci 0.0
gdc.openalex.normalizedpercentile 0.19
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 0
gdc.plumx.mendeley 1
gdc.plumx.scopuscites 0
gdc.publishedmonth August
gdc.scopus.citedcount 0
gdc.virtual.author Arısoy Saraçlar, Ebru
gdc.yokperiod YÖK - 2024-25
relation.isAuthorOfPublication 0b895153-5793-4e46-bc2f-06a28b30f531
relation.isAuthorOfPublication.latestForDiscovery 0b895153-5793-4e46-bc2f-06a28b30f531
relation.isOrgUnitOfPublication a6e60d5c-b0c7-474a-b49b-284dc710c078
relation.isOrgUnitOfPublication 0d54cd31-4133-46d5-b5cc-280b2c077ac3
relation.isOrgUnitOfPublication de19334f-6a5b-4f7b-9410-9433c48d1e5a
relation.isOrgUnitOfPublication.latestForDiscovery a6e60d5c-b0c7-474a-b49b-284dc710c078
