Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering

dc.contributor.author Arisoy, E.
dc.contributor.author Menevşe, M.U.
dc.contributor.author Manav, Y.
dc.contributor.author Özgür, A.
dc.date.accessioned 2025-12-05T17:08:11Z
dc.date.available 2025-12-05T17:08:11Z
dc.date.issued 2025
dc.description.abstract Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of the SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise. © 2025 International Speech Communication Association. All rights reserved. en_US
dc.identifier.doi 10.21437/Interspeech.2025-1965
dc.identifier.isbn 9781713836902
dc.identifier.isbn 9781713820697
dc.identifier.isbn 9781605603162
dc.identifier.isbn 9781617821233
dc.identifier.isbn 9781604234497
dc.identifier.issn 1990-9772
dc.identifier.issn 2958-1796
dc.identifier.scopus 2-s2.0-105020060826
dc.identifier.uri https://doi.org/10.21437/Interspeech.2025-1965
dc.identifier.uri https://hdl.handle.net/20.500.11779/3141
dc.language.iso en en_US
dc.publisher International Speech Communication Association en_US
dc.relation.ispartof Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH -- 26th Interspeech Conference 2025 -- 2025-08-17 through 2025-08-21 -- Rotterdam -- 213554 en_US
dc.rights info:eu-repo/semantics/closedAccess en_US
dc.subject Data Generation en_US
dc.subject Large Language Models en_US
dc.subject Spoken Question Answering en_US
dc.title Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
dc.type Conference Object en_US
dspace.entity.type Publication
gdc.author.scopusid 14030977200
gdc.author.scopusid 58137783500
gdc.author.scopusid 57219551922
gdc.author.scopusid 56230487200
gdc.description.department MEF University en_US
gdc.description.departmenttemp [Arisoy] Ebru, Department of Electrical and Electronic Engineering, MEF University, Istanbul, Turkey; [Menevşe] Merve Ünlü, Computer Engineering, Boğaziçi Üniversitesi, Bebek, Istanbul, Turkey; [Manav] Yusufcan, Allianz, Munich, Bayern, Germany; [Özgür] Arzucan, Computer Engineering, Boğaziçi Üniversitesi, Bebek, Istanbul, Turkey en_US
gdc.description.endpage 1777 en_US
gdc.description.publicationcategory Conference Object - International - Institutional Faculty Member en_US
gdc.description.scopusquality N/A
gdc.description.startpage 1773 en_US
gdc.description.wosquality N/A
relation.isOrgUnitOfPublication a6e60d5c-b0c7-474a-b49b-284dc710c078
relation.isOrgUnitOfPublication.latestForDiscovery a6e60d5c-b0c7-474a-b49b-284dc710c078
