arXiv RAG v2


AI/ML Research Paper Search & Chat


arXiv RAG v2: Hybrid Retrieval Benchmark

This benchmark evaluates 10 retrieval configurations (5 base models, each with and without reranking) on 9,544 synthetic queries. It compares dense embeddings (BGE-M3, OpenAI 3-large), sparse retrieval, hybrid fusion strategies, and cross-encoder reranking for AI/ML research paper search.

Research Papers: 2,500
Benchmark Queries: 9,544

📊 Data Collection & Benchmark Design

Data Collection Pipeline

1. arXiv API Collection - 14 months (2025.01 to 2026.02), 7 categories (cs.CL, cs.AI, cs.LG, cs.CV, stat.ML, cs.IR, cs.NE)
2. NG Keyword Filtering - 1,756 keywords across 9 categories (biomedical, chemistry, robotics, etc.)
3. Gemini Classification - SUITABLE/NOT SUITABLE via gemini-3-flash-preview
4. Semantic Filtering - Cosine similarity ≥ 0.55 with anchor queries
5. Multi-Score Ranking - Citation + Recency + Semantic + Stratified sampling → 2,500 papers
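
Steps 4 and 5 can be sketched in a few lines. This is a minimal illustration, not the project's actual code: only the 0.55 threshold comes from the pipeline description. Taking the maximum similarity over all anchor queries, the weights in `multi_score`, and the assumption that each input score is already normalized to [0, 1] are hypothetical choices. Stratified sampling is left out.

```python
import math

SIM_THRESHOLD = 0.55  # threshold stated in the pipeline description


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_semantic_filter(paper_vec, anchor_vecs, threshold=SIM_THRESHOLD):
    """Keep a paper if its best similarity to any anchor query meets the threshold.

    Using the maximum over anchors is an assumption; a mean would also be plausible.
    """
    return max(cosine(paper_vec, a) for a in anchor_vecs) >= threshold


def multi_score(citation, recency, semantic, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three scores in [0, 1]. The weights are hypothetical."""
    w_c, w_r, w_s = weights
    return w_c * citation + w_r * recency + w_s * semantic


# Toy 2-d vectors stand in for real embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
kept = passes_semantic_filter([0.9, 0.1], anchors)
```

Papers that pass the filter would then be sorted by `multi_score` before sampling down to the target set size.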

Benchmark Query Generation

4 Query Styles × ~2,400 papers = ~9,600 queries
KEYWORD: 4-7 technical terms
NATURAL_SHORT: 6-12 word questions
NATURAL_LONG: 15-25 word research questions
CONCEPTUAL: Paraphrased, no acronyms
Hard Negative Mining: BGE-M3 embedding similarity search to identify "similar but different" papers for robust evaluation.
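
Hard negative mining as described above amounts to a nearest-neighbor search around each query's source paper. The sketch below uses toy 3-d vectors in place of BGE-M3 embeddings; the function name `mine_hard_negatives` and the cutoff `k` are hypothetical, and the page does not state how many negatives are kept per query.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(source_id, embeddings, k=3):
    """Return the ids of the k papers most similar to the source paper.

    The source paper itself is excluded, so every returned id is a
    "similar but different" candidate for that paper's queries.
    """
    source_vec = embeddings[source_id]
    candidates = [pid for pid in embeddings if pid != source_id]
    candidates.sort(key=lambda pid: cosine(source_vec, embeddings[pid]), reverse=True)
    return candidates[:k]


# Toy embeddings (assumption: real vectors would be 1024-d BGE-M3 outputs).
toy = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.9, 0.1, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [0.7, 0.7, 0.0],
}
hard_negatives = mine_hard_negatives("A", toy, k=2)
```

For a corpus of a few thousand papers this brute-force scan is fine; a larger corpus would normally use an approximate nearest-neighbor index instead.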

🔧 Model Configurations

Dense (BGE-M3): 1024d vectors
Sparse (BGE-M3): Lexical weights
Hybrid (BGE-M3): RRF: dense + sparse
OpenAI 3-large: 3072d vectors
Hybrid-3L: 3-large + sparse
+Rerank Variants: Each base model tested with BGE-reranker-v2-m3 cross-encoder (Top-20 → Rerank → Top-10)
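
The hybrid and rerank configurations can be illustrated with a short sketch. It shows Reciprocal Rank Fusion over two ranked lists, followed by the Top-20 → Rerank → Top-10 stage. The constant `k=60` is a commonly used RRF default, not a value stated on this page, and `score_fn` is a hypothetical stand-in for the cross-encoder; no real model is called.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d).

    `rankings` is a list of ranked id lists, best first. k=60 is an assumption.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def rerank(query, candidates, score_fn, first_stage_k=20, final_k=10):
    """Re-score the top `first_stage_k` candidates and keep the best `final_k`.

    `score_fn(query, doc_id)` is a placeholder for a cross-encoder score.
    """
    pool = candidates[:first_stage_k]
    return sorted(pool, key=lambda d: score_fn(query, d), reverse=True)[:final_k]


# Toy ranked lists standing in for dense and sparse retrieval output.
dense = ["p1", "p2", "p3"]
sparse = ["p3", "p1", "p4"]
fused = rrf_fuse([dense, sparse])

# Toy reranker: higher numeric suffix means higher relevance (illustration only).
docs = [f"doc{i}" for i in range(30)]
top10 = rerank("any query", docs, lambda q, d: int(d[3:]))
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is one reason it is a common choice for hybrid search.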

📈 Global Performance Overview

5 Metrics × 5 Models (Base + Reranker Gain/Loss)
Legend: Dense, Sparse, Hybrid, 3-large, Hybrid-3L, +Reranker Gain, +Reranker Loss

🔥 QueryType × Difficulty Heatmap


📊 Model Comparison Delta (MRR)

Delta values show the MRR difference relative to the baseline model (green = better, red = worse).

⚑ Reranker Impact Analysis

Avg NDCG@10 Gain
Max Gain (conceptual)
Avg Latency Overhead


⏱️ Search Latency

Detailed Results

Model MRR NDCG@5 NDCG@10 P@5 P@10 Latency
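
The ranking metrics in the table header can be computed per query as sketched below. This assumes binary relevance, with each query judged against a set of relevant paper ids (often a single source paper for synthetic queries); the page does not spell out the relevance scheme, so treat that as an assumption. The function names are illustrative.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant results."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs):
    """MRR over (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# One toy query whose single relevant paper is retrieved at rank 2.
ranked = ["a", "b", "c", "d"]
relevant = {"b"}
rr = reciprocal_rank(ranked, relevant)
p5 = precision_at_k(ranked, relevant, 5)
ndcg10 = ndcg_at_k(ranked, relevant, 10)
```

With a single relevant paper per query, P@k can never exceed 1/k, so MRR and NDCG are usually the more informative columns in that setting.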

💡 Insights & Production Recommendations

Key Findings


Production Recommendations

🏆 Best Quality
⚡ Best Speed
🎯 Best Balance
💰 Best ROI

Chat with Research Papers

Ask questions about AI/ML research papers

Hello! Please ask a question about AI/ML research papers. I can answer questions on topics such as the following:

Research Papers Dataset

# Title Date arXiv