Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
| dc.contributor.author | Arisoy, E. | |
| dc.contributor.author | Menevşe, M.U. | |
| dc.contributor.author | Manav, Y. | |
| dc.contributor.author | Özgür, A. | |
| dc.date.accessioned | 2025-12-05T17:08:11Z | |
| dc.date.available | 2025-12-05T17:08:11Z | |
| dc.date.issued | 2025 | |
| dc.description | Meta | en_US |
| dc.description.abstract | Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise. © 2025 International Speech Communication Association. All rights reserved. | en_US |
| dc.identifier.doi | 10.21437/Interspeech.2025-1965 | |
| dc.identifier.isbn | 9781713836902 | |
| dc.identifier.isbn | 9781713820697 | |
| dc.identifier.isbn | 9781605603162 | |
| dc.identifier.isbn | 9781617821233 | |
| dc.identifier.isbn | 9781604234497 | |
| dc.identifier.issn | 1990-9772 | |
| dc.identifier.issn | 2958-1796 | |
| dc.identifier.scopus | 2-s2.0-105020060826 | |
| dc.identifier.uri | https://doi.org/10.21437/Interspeech.2025-1965 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.11779/3141 | |
| dc.language.iso | en | en_US |
| dc.publisher | International Speech Communication Association | en_US |
| dc.relation.ispartof | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH -- 26th Interspeech Conference 2025 -- 2025-08-17 through 2025-08-21 -- Rotterdam -- 213554 | en_US |
| dc.rights | info:eu-repo/semantics/closedAccess | en_US |
| dc.subject | Data Generation | en_US |
| dc.subject | Large Language Models | en_US |
| dc.subject | Spoken Question Answering | en_US |
| dc.title | Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering | |
| dc.type | Conference Object | en_US |
| dspace.entity.type | Publication | |
| gdc.author.scopusid | 14030977200 | |
| gdc.author.scopusid | 58137783500 | |
| gdc.author.scopusid | 57219551922 | |
| gdc.author.scopusid | 56230487200 | |
| gdc.description.department | MEF University | en_US |
| gdc.description.departmenttemp | [Arisoy] Ebru, Department of Electrical and Electronic Engineering, MEF University, Istanbul, Turkey; [Menevşe] Merve Ünlü, Computer Engineering, Boğaziçi Üniversitesi, Bebek, Istanbul, Turkey; [Manav] Yusufcan, Allianz, Munich, Bayern, Germany; [Özgür] Arzucan, Computer Engineering, Boğaziçi Üniversitesi, Bebek, Istanbul, Turkey | en_US |
| gdc.description.endpage | 1777 | en_US |
| gdc.description.publicationcategory | Conference Item - International - Institutional Faculty Member | en_US |
| gdc.description.scopusquality | N/A | |
| gdc.description.startpage | 1773 | en_US |
| gdc.description.wosquality | N/A | |
| relation.isOrgUnitOfPublication | a6e60d5c-b0c7-474a-b49b-284dc710c078 | |
| relation.isOrgUnitOfPublication.latestForDiscovery | a6e60d5c-b0c7-474a-b49b-284dc710c078 |