Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
Date
2025
Publisher
International Speech Communication Association
Abstract
Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise. © 2025 International Speech Communication Association. All rights reserved.
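The abstract does not describe the generation pipeline itself; as a rough illustration of the general technique (LLM-drafted QA pairs used to augment a small annotated training set), the sketch below prompts a Hugging Face text-generation pipeline to produce one question-answer pair per passage. The model name, prompt wording, and output parsing are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of LLM-based QA-pair generation for data augmentation,
# assuming a Hugging Face text-generation pipeline. The model name, prompt,
# and parsing below are placeholders, not the configuration used in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def generate_qa_pair(context: str) -> dict:
    """Ask the LLM to write one factual question and its short answer for `context`."""
    prompt = (
        "Read the passage and write one factual question about it, "
        "followed by its short answer on a new line starting with 'Answer:'.\n\n"
        f"Passage: {context}\n\nQuestion:"
    )
    output = generator(prompt, max_new_tokens=64, do_sample=True)[0]["generated_text"]
    completion = output[len(prompt):]
    # Naive parsing: split the completion into question and answer parts.
    question, _, answer = completion.partition("Answer:")
    return {"context": context, "question": question.strip(), "answer": answer.strip()}

passages = ["The Interspeech 2025 conference was held in Rotterdam in August 2025."]
synthetic_data = [generate_qa_pair(p) for p in passages]
print(synthetic_data[0])
```

Synthetic pairs produced in this fashion would be mixed with the limited human-annotated data before fine-tuning the QA model; the 7.8%, 7.0%, and 2.7% figures in the abstract are relative F1 improvements over models trained on the annotated subsets alone.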
Keywords
Data Generation, Large Language Models, Spoken Question Answering
WoS Q
N/A
Scopus Q
N/A
Source
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 26th Interspeech Conference 2025, 2025-08-17 through 2025-08-21, Rotterdam, 213554
Start Page
1773
End Page
1777