Large language model-based metric for generative question answering systems

Hazem Abdel Azim, Mohamed Tharwat Waheed, Ammar Mohammed

Abstract


Text generation has advanced rapidly in recent years, yet techniques for evaluating the performance and quality of the generated text lag behind. Traditionally, lexical metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), consensus-based image description evaluation (CIDEr), and F1 have been used, relying primarily on n-gram similarity. More recently, neural and machine-learning-based metrics, such as the bidirectional encoder representations from transformers (BERT) score, key phrase question answering (KPQA), and the learned evaluation metric for reading comprehension (LERC), a BERT-based supervised metric, have shown superior performance over traditional metrics, but they generalize poorly across domains and require massive amounts of human-labeled training data. The main contribution of the current research is to investigate the use of training-free large language models (LLMs) as scoring metrics, evaluators, and judges within a question-answering context, encompassing both closed- and open-QA scenarios. To validate this idea, we employ simple zero-shot prompting of Mixtral 8x7B, a popular and widely used open-source LLM, to score a variety of datasets and domains. The experimental results on ten benchmark datasets are compared against human judgments, revealing that, on average, simple LLM-based metrics outperform sophisticated state-of-the-art statistical and neural machine-learning-based metrics by 2-8 points on answer-pair scoring tasks and by up to 15 points on contrastive preferential tasks.
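
As a minimal sketch of the zero-shot Likert-scale scoring the abstract describes: the prompt wording, the 1-5 scale, and the llm_complete callable below are illustrative assumptions rather than the authors' exact setup; any chat-completion client wrapping Mixtral 8x7B could serve as the completion function.

```python
import re
from typing import Callable, Optional

# Illustrative zero-shot judging prompt; the exact wording and the 1-5
# Likert scale are assumptions, not the authors' published prompt.
PROMPT_TEMPLATE = (
    "You are an impartial judge of question-answering quality.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "On a Likert scale from 1 (incorrect) to 5 (fully correct and complete), "
    "rate the candidate answer. Reply with the number only."
)

def parse_likert_score(response: str) -> Optional[int]:
    """Extract the first digit in 1..5 from the model's reply."""
    match = re.search(r"[1-5]", response)
    return int(match.group()) if match else None

def score_answer(
    question: str,
    reference: str,
    candidate: str,
    llm_complete: Callable[[str], str],
) -> Optional[int]:
    """Score one question/reference/candidate triple with any
    text-completion callable (e.g., a client for Mixtral 8x7B)."""
    prompt = PROMPT_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )
    return parse_likert_score(llm_complete(prompt))
```

A contrastive (preferential) variant of the same idea would instead present two candidate answers in the prompt and ask the model which one it prefers, matching the abstract's second evaluation setting.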


Keywords


Evaluation metrics; Generative question answering; Large language models; Likert-scale scoring; Zero-shot prompting



DOI: http://doi.org/10.11591/ijai.v14.i1.pp151-158



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938 
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
