Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering

Date

2025

Publisher

International Speech Communication Association

Abstract

Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of the SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise. © 2025 International Speech Communication Association. All rights reserved.
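
The record does not include the paper's implementation, but the augmentation approach the abstract describes can be illustrated with a minimal sketch: prompt an instruction-tuned LLM to produce question-answer pairs from passages, keep only pairs whose answer is an exact span (so SQuAD-style extractive training data can be built), and mix them with the limited human-annotated set. The prompt wording, the llm_generate() helper, and all names below are illustrative assumptions, not the authors' pipeline.

```python
# Hypothetical sketch of LLM-based QA-pair generation for data augmentation.
# The paper does not publish its prompts or pipeline; the prompt text and the
# llm_generate() placeholder below are assumptions for illustration only.

import json

PROMPT_TEMPLATE = (
    "Read the passage and write {n} question-answer pairs. "
    "Each answer must be an exact span copied from the passage. "
    'Return JSON: [{{"question": "...", "answer": "..."}}]\n\nPassage:\n{passage}'
)

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to any instruction-tuned LLM (API or local)."""
    raise NotImplementedError("Wire this to the LLM of your choice.")

def synth_qa_for_passage(passage: str, n_pairs: int = 3) -> list[dict]:
    """Generate SQuAD-style examples for one passage; drop pairs whose answer
    is not an exact span, so answer_start can be recovered for extractive QA."""
    raw = llm_generate(PROMPT_TEMPLATE.format(n=n_pairs, passage=passage))
    examples = []
    for pair in json.loads(raw):
        answer = pair["answer"].strip()
        start = passage.find(answer)
        if start == -1:  # paraphrased or hallucinated answer: discard
            continue
        examples.append({
            "context": passage,
            "question": pair["question"].strip(),
            "answers": {"text": [answer], "answer_start": [start]},
        })
    return examples

def build_training_set(human_examples: list[dict], passages: list[str]) -> list[dict]:
    """Mix the limited human-annotated data with LLM-generated examples."""
    synthetic = [ex for p in passages for ex in synth_qa_for_passage(p)]
    return human_examples + synthetic
```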

Keywords

Data Generation, Large Language Models, Spoken Question Answering

WoS Q

N/A

Scopus Q

N/A

Source

Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 26th Interspeech Conference, 17-21 August 2025, Rotterdam, 213554

Start Page

1773

End Page

1777