Digestly

Jan 3, 2025

Multilingual AI: Beyond English 🌍🚀

AI Tech
Microsoft Research: The talk focuses on accelerating multilingual retrieval, relevance, and generation evaluation systems, emphasizing the need for research beyond English to include more languages.

Microsoft Research - Accelerating Multilingual RAG Systems

The speaker discusses the importance of developing multilingual retrieval systems to serve the vast number of non-English speakers globally. They highlight the creation of the MIRACL dataset, which covers 18 languages and involves extensive human annotation for evaluating multilingual retrieval systems. The dataset aims to improve retrieval accuracy by providing high-quality training data. The talk also covers relevance assessment, where the goal is to determine whether retrieved documents can answer a given question; the speaker frames this as a binary classification task and measures errors with metrics such as hallucination rate. Finally, the generation part focuses on evaluating the quality of generated answers using a combination of heuristic features and LLM-as-a-judge, proposing a surrogate judge model to reduce cost and improve evaluation efficiency. The speaker emphasizes the need for multilingual research and the potential for cross-lingual applications.

Key Points:

  • Multilingual retrieval systems are crucial for non-English speakers, with only 4.6% of the world having English as their first language.
  • The MIRACL dataset provides high-quality training data for multilingual retrieval, covering 18 languages and involving extensive human annotation.
  • Relevance assessment involves evaluating if retrieved documents can answer questions, using metrics like hallucination rate to measure errors.
  • Generation evaluation combines heuristic features with LLM-as-a-judge signals in a surrogate judge model, improving cost efficiency and evaluation accuracy.
  • Future work includes expanding to cross-lingual settings and exploring multilingual data uniformity across different domains.

Details:

1. 🎤 Introduction and Speaker Background

  • The speaker will discuss accelerating multilingual RAG systems, focusing on retrieval, relevance, and generation evaluation, and begins by explaining how their background supports this topic.
  • The speaker completed their undergraduate studies at BITS Pilani, Goa, laying a foundation in technical education.
  • After completing their degree, they gained practical experience as a software engineer at Notes Cape, which honed their technical skills.
  • In 2019, the speaker transitioned to an NLP research assistant role at UKP Lab, aligning their career with their current focus on multilingual systems.
  • The speaker is currently in the fourth year of a PhD at the University of Waterloo, reflecting a deep engagement with this research area.
  • The speaker's internships at Google Research and Databricks provide industry insight and experience that inform their expertise in multilingual RAG systems.

2. 🌍 Global Language Demographics and Research Motivation

  • Of the roughly 7.2 billion people on earth, the survey cited covers about 6.3 billion.
  • 4.1 billion people natively speak one of the 23 most spoken languages.
  • Indic languages such as Hindi, Telugu, Tamil, Bengali, and Marathi account for a significant share.
  • Only 4.6% of global speakers have English as their first language.
  • Including both L1 and L2 speakers, English reaches 18.1% of global speakers.
  • The large majority of the world's population therefore speaks no English at all (see the quick check below).
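
To make these percentages concrete, here is a quick back-of-the-envelope check. The constants are the talk's own figures; the absolute counts are derived from them, not quoted from the talk.

```python
# Back-of-the-envelope check of the demographic figures quoted in the talk.
SURVEYED = 6.3e9           # people covered by the survey
L1_ENGLISH_SHARE = 0.046   # English as a first language (L1)
ANY_ENGLISH_SHARE = 0.181  # English as a first or second language (L1 + L2)

l1_english = SURVEYED * L1_ENGLISH_SHARE    # ~0.29 billion
any_english = SURVEYED * ANY_ENGLISH_SHARE  # ~1.14 billion
non_english = SURVEYED - any_english        # ~5.16 billion

print(f"English L1 speakers:    ~{l1_english / 1e9:.2f} B")
print(f"English L1+L2 speakers: ~{any_english / 1e9:.2f} B")
print(f"No English at all:      ~{non_english / 1e9:.2f} B")
```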

3. 🔍 Multilingual Research Focus and Objectives

3.1. Focus on Multilingual Research

3.2. Objectives and Implementation

4. 📊 Multilingual RAG Pipeline Overview

  • A typical multilingual RAG (Retrieval-Augmented Generation) pipeline involves retrieving top K documents from a corpus, such as Wikipedia, using a retriever model.
  • The generator model uses these documents along with the user's question to generate an answer, forming the RAG answer generation stage.
  • The pipeline includes three key stages: multilingual retrieval, relevance assessment, and generation.
  • Multilingual retrieval involves evaluating different retrieval systems on how well they can retrieve relevant documents across languages, which is crucial for accurate and comprehensive responses.
  • Relevance assessment evaluates if the retrieved documents are capable of answering the user's question effectively, ensuring high-quality input for the generation stage.
  • The generation stage assesses the quality of the answers, ensuring they are appropriately cited and fluent, which is essential for user trust and satisfaction. A minimal sketch of the full pipeline follows this list.
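
Below is a minimal, self-contained sketch of the retrieve-then-generate flow described above. It is illustrative only: the lexical overlap scorer stands in for a real multilingual dense retriever, the prompt builder stands in for an actual generator call, and none of these names come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Document:
    doc_id: str
    text: str

def retrieve(query: str, corpus: List[Document],
             score: Callable[[str, str], float], k: int = 3) -> List[Document]:
    """Return the top-k documents for a query under a pluggable scoring function."""
    return sorted(corpus, key=lambda d: score(query, d.text), reverse=True)[:k]

def overlap_score(query: str, text: str) -> float:
    """Toy lexical scorer: fraction of query tokens that appear in the document.
    A real multilingual system would use a dense retriever (embedding model)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def build_prompt(query: str, docs: List[Document]) -> str:
    """Assemble the generator input: question plus numbered context passages,
    so the generated answer can cite passages as [1], [2], ..."""
    context = "\n".join(f"[{i + 1}] {d.text}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer (with citations):"

# Usage with a toy multilingual corpus; the prompt would be sent to any LLM.
corpus = [Document("en-1", "The Eiffel Tower is in Paris."),
          Document("hi-1", "ताज महल आगरा में है।"),
          Document("en-2", "Paris is the capital of France.")]
top_k = retrieve("Where is the Eiffel Tower?", corpus, overlap_score, k=2)
print(build_prompt("Where is the Eiffel Tower?", top_k))
```

In a production system the scorer would be a retrieval model covering all target languages; how well such models do this across languages is exactly what the MIRACL evaluation below measures.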

5. 📚 Multilingual Retrieval Evaluation with the MIRACL Dataset

5.1. Multilingual Retrieval Dataset Overview

5.2. Dataset Construction and Scale

6. 🌐 MIRACL Dataset Construction, Phases, and Impact

6.1. MIRACL Dataset Construction: Phase 1 - Query Generation

6.2. MIRACL Dataset Construction: Phase 2 - Relevance Assessment

6.3. Impact of the MIRACL Dataset

7. 🧐 Relevance Assessment with NoMIRACL and Binary Classification

  • The research, published in the Findings of EMNLP 2024, centers on multilingual relevance assessment, aiming to determine whether LLMs can accurately assess document relevance through a binary classification approach.
  • A critical part of the study involves evaluating the top 10 annotated passages, dividing them into relevant and non-relevant categories to assess LLM performance.
  • The motivation behind this study is to address retrieval errors by ensuring LLMs can differentiate between relevant and non-relevant information, thereby prompting alternative actions when necessary.
  • An example highlighted is a query regarding the 'AC button' on a calculator, which led to irrelevant documents being retrieved.
  • The research question seeks to establish whether LLMs can comprehend relevance across multiple languages and effectively evaluate the top K passages for providing accurate answers.
  • The experimental setup asks the model for a binary decision to differentiate relevant from non-relevant documents.
  • Two key metrics introduced in the study are hallucination rate and relevant subset error rate.
  • The hallucination rate measures how often the LLM claims relevant information is present when it is shown the non-relevant subset, in which no passage answers the query.
  • The relevant subset error rate tracks how often the LLM fails to recognize relevance when it is shown the relevant subset, in which an answering passage is present. Both metrics are sketched in code after this list.
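
A minimal sketch of how these two rates can be computed, assuming each model judgment is reduced to a binary "relevant information present" prediction as described above; function and variable names are illustrative, not from the paper.

```python
from typing import List

def hallucination_rate(preds_on_nonrelevant: List[bool]) -> float:
    """Fraction of non-relevant subsets where the model wrongly claims an
    answer is present. Each entry is the model's binary 'answer present' call
    on a query whose passages contain no answer, so True is an error."""
    return sum(preds_on_nonrelevant) / len(preds_on_nonrelevant)

def relevant_subset_error_rate(preds_on_relevant: List[bool]) -> float:
    """Fraction of relevant subsets where the model misses the answer.
    Each entry is the 'answer present' call on a query whose passages do
    contain an answer, so False is an error."""
    return sum(not p for p in preds_on_relevant) / len(preds_on_relevant)

# Toy example: ten binary judgments for each subset type.
print(hallucination_rate(
    [True, False, False, True, False, False, False, False, False, False]))  # 0.2
print(relevant_subset_error_rate(
    [True, True, False, True, True, True, True, True, True, True]))         # 0.1
```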

8. 🔬 Experimental Setup and Findings in NoMIRACL

8.1. Non-Relevant Subset Utilization

8.2. Model Performance

8.3. Prompt Optimization

8.4. Context Length and Tokenization Challenges

9. 📝 Generation Evaluation with MIRAGE-Bench and Evaluation Approaches

9.1. Introduction to MIRAGE-Bench

9.2. Evaluation Approaches

9.3. Combining Evaluation Approaches

9.4. Inference and Practical Benefits

10. 🏅 Training Surrogate Judge Model for Multilingual RAG

  • A random forest model was selected for its simplicity and efficiency, allowing training to be completed in minutes using a CPU.
  • Due to initial data constraints, with only 50-100 query pairs available, bootstrapping was utilized to expand the dataset and improve variance estimation.
  • The Bradley-Terry model was used to assign scores by computing log odds from pairwise comparisons, providing a statistical foundation for ranking models.
  • To determine final model rankings, 200 tournaments were conducted on bootstrapped queries, with ranks averaged across all tournaments for a robust evaluation; the scoring step is sketched below.
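
Here is a small sketch of the Bradley-Terry scoring step, fit with the standard iterative (MM) update from pairwise outcomes. It is a simplified stand-in for the talk's pipeline: the toy tournament, function names, and the omission of bootstrapping and the random-forest feature judge are all my simplifications.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

def bradley_terry(pairwise_wins: List[Tuple[str, str]],
                  iters: int = 200) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the classic
    MM update, then return log-strengths (a log-odds scale). Assumes every
    model has at least one win and one loss, or the estimates degenerate."""
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # games played per unordered pair
    models = set()
    for winner, loser in pairwise_wins:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # MM update: wins_m / sum over opponents of n_mj / (p_m + p_j)
            denom = sum(games[frozenset((m, o))] / (strength[m] + strength[o])
                        for o in models if o != m)
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(new.values())
        strength = {m: s * len(models) / total for m, s in new.items()}  # renormalize
    return {m: math.log(s) for m, s in strength.items()}

# Toy tournament over three hypothetical RAG systems.
outcomes = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("C", "B")]
for model, score in sorted(bradley_terry(outcomes).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:+.3f}")
```

In the setup the talk describes, each "game" would be a judged preference between two systems' answers on the same bootstrapped query, and averaging ranks over 200 such tournaments smooths out bootstrap noise.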

11. 📈 MIRAGE-Bench Results and Model Rankings

11.1. Construction and Quality of MIRAGE-Bench

11.2. Evaluation Focus

11.3. Models Evaluated

11.4. Use of Instruct Versions

11.5. Arena-Based Results

12. 📊 Surrogate Judge, Feature Importance, and Fine-Tuning Experiments

12.1. Kendall Tau Correlation

12.2. Feature Importance in Surrogate Judge

12.3. Feature Combination Ablation

12.4. Exhaustive Comparison Experiment

13. 🔄 Summary, Key Takeaways, and Future Directions

  • The construction of the high-quality MIRACL dataset underscores the importance of data reusability: the same data underpins both NoMIRACL and MIRAGE-Bench.
  • Hybrid search techniques are currently at the forefront of multilingual retrieval, with open-source methods matching the performance of commercial APIs.
  • Recent advances in multilingual large language models (LLMs) show improved reasoning capabilities, with newer models outperforming older ones at identifying non-relevant passages.
  • Closed and large open-source models lead in relevance assessment and generation tasks, showcasing superior performance.
  • Cost-effectiveness is achieved through using surrogate judges for LLM rankings.
  • Fine-tuning smaller open-source LLMs can significantly boost their performance, emphasizing the potential of customization.
  • Investing in long-term projects and reusing research artifacts is crucial for sustainable progress.
  • Future initiatives aim to broaden the scope beyond Wikipedia to more realistic domains, despite challenges in finding uniform multilingual data.

14. 🗣️ Q&A Session and Closing Remarks

14.1. Cross-Lingual Information Retrieval

14.2. Cross-Lingual RAG Setup

14.3. Improving Multilingual Language Models

14.4. Investment and Benchmarking

14.5. Performance Variability Across Languages

14.6. Resource Availability and Model Performance