Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering
Abstract
Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, and show performance gains on both text-based and spoken QA tasks. Experimental results on subsets of SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise.
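The abstract reports relative F1 improvements but does not spell out how they are computed. As a rough illustration only, the sketch below uses the token-overlap F1 commonly applied to SQuAD-style extractive QA, plus a relative-improvement helper. The function names, the simplified English-only normalization, and the numbers in the usage note are assumptions for illustration and are not taken from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: a simplified SQuAD-style normalization. The article removal
    is English-specific and would not suit a Turkish dataset as written.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent: 100 * (improved - baseline) / baseline."""
    return 100.0 * (improved - baseline) / baseline
```

Under this definition, a hypothetical baseline F1 of 50.0 that rises to 55.0 is a 10% relative improvement, even though the absolute gain is only 5 points. This is why relative figures such as those in the abstract should not be read as absolute F1 differences.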
Keywords
Spoken Question Answering, Large Language Models, Data Generation
Start Page
1773
End Page
1777