Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering

Abstract

Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models and show performance gains on both text-based and spoken QA tasks. Experimental results on subsets of SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on the restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise.
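
The abstract reports gains as relative F1 improvements, meaning the change in F1 divided by the baseline F1 rather than the difference in F1 points. A minimal sketch of that arithmetic follows, assuming the usual definition of relative improvement. The function name and the scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_improvement(baseline_f1: float, augmented_f1: float) -> float:
    """Return the gain of augmented_f1 over baseline_f1 as a percentage of the baseline."""
    if baseline_f1 <= 0:
        raise ValueError("baseline_f1 must be positive")
    return 100.0 * (augmented_f1 - baseline_f1) / baseline_f1


# Hypothetical scores for illustration only (not results from the paper).
baseline = 50.0   # F1 of a model trained only on the limited human-annotated data
augmented = 55.0  # F1 after adding LLM-generated question-answer pairs

gain = relative_improvement(baseline, augmented)
print(f"{gain:.1f}% relative improvement")
```

With these illustrative numbers, an absolute gain of 5 F1 points corresponds to a 10% relative improvement, which is why relative figures should be read alongside the baseline score.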

Keywords

Spoken Question Answering, Large Language Models, Data Generation

Start Page

1773

End Page

1777