arXiv RAG v2: Hybrid Retrieval Benchmark
A comprehensive evaluation of 10 retrieval configurations across 9,544 synthetic benchmark queries. This study explores the performance trade-offs between dense embeddings (BGE-M3, OpenAI 3-large), sparse retrieval, hybrid fusion strategies, and cross-encoder reranking for AI/ML research paper search.
2,500 Research Papers
9,544 Benchmark Queries
Data Collection & Benchmark Design
Data Collection Pipeline
1. arXiv API Collection - 14 months (2025.01 to 2026.02), 7 categories (cs.CL, cs.AI, cs.LG, cs.CV, stat.ML, cs.IR, cs.NE)
2. NG (exclusion) Keyword Filtering - 1,756 keywords across 9 categories (biomedical, chemistry, robotics, etc.)
3. Gemini Classification - SUITABLE/NOT SUITABLE via gemini-3-flash-preview
4. Semantic Filtering - Cosine similarity ≥ 0.55 with anchor queries
5. Multi-Score Ranking - Citation + Recency + Semantic + Stratified sampling → 2,500 papers
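The semantic filtering step (step 4) can be sketched as below. This is a minimal illustration, not the study's code: the source does not say how similarity to multiple anchor queries is aggregated, so taking the maximum over anchors is an assumption, and the names `cosine_similarity` and `passes_semantic_filter` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def passes_semantic_filter(paper_vec, anchor_vecs, threshold=0.55):
    """Keep a paper if its best similarity to any anchor query meets the threshold.

    Assumption: "best over anchors" aggregation; the threshold 0.55 is from the
    pipeline description.
    """
    best = max((cosine_similarity(paper_vec, a) for a in anchor_vecs), default=0.0)
    return best >= threshold
```

In practice the vectors would come from an embedding model; plain Python lists are used here only to keep the sketch dependency-free.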
Benchmark Query Generation
4 Query Styles × ~2,400 papers = ~9,600 queries
KEYWORD: 4-7 technical terms
NATURAL_SHORT: 6-12 word questions
NATURAL_LONG: 15-25 word research questions
CONCEPTUAL: Paraphrased, no acronyms
Hard Negative Mining: BGE-M3 embedding similarity search to identify "similar but different" papers for robust evaluation.
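A rough sketch of embedding-based hard negative mining follows. It assumes negatives are the papers most similar to the query's source paper, excluding that paper itself; the source does not specify the seed vector or the number of negatives, so `mine_hard_negatives`, `source_id`, and `k` are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mine_hard_negatives(source_id, paper_vecs, k=5):
    """Return the k papers most similar to the source paper, excluding itself.

    paper_vecs maps paper id -> embedding (e.g. BGE-M3 dense vectors).
    Ties are broken by paper id so the output is deterministic.
    """
    source_vec = paper_vecs[source_id]
    scored = [
        (_cosine(source_vec, vec), pid)
        for pid, vec in paper_vecs.items()
        if pid != source_id
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pid for _, pid in scored[:k]]
```

A brute-force scan like this is only reasonable for a small corpus; a vector index would normally replace the inner loop.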
Model Configurations
Dense (BGE-M3): 1024d vectors
Sparse (BGE-M3): Lexical weights
Hybrid (BGE-M3): RRF over dense + sparse
OpenAI 3-large: 3072d vectors
Hybrid-3L: 3-large + sparse
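The hybrid configurations fuse dense and sparse result lists with Reciprocal Rank Fusion (RRF). A minimal sketch is shown below; the smoothing constant `k=60` is the commonly used default and is an assumption here, since the benchmark's actual value is not stated. The function name `rrf_fuse` is hypothetical.

```python
def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Illustrative ids only, not benchmark data.
dense_hits = ["p1", "p2", "p3"]
sparse_hits = ["p2", "p4", "p1"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is one reason it is a common choice for hybrid search.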
+Rerank Variants:
Each base model is also tested with the BGE-reranker-v2-m3 cross-encoder (Top-20 → Rerank → Top-10)
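The Top-20 → Rerank → Top-10 flow can be sketched as follows. The scoring function is passed in as a parameter; `toy_overlap_score` is a hypothetical stand-in used only so the example runs without a model, and is not how BGE-reranker-v2-m3 scores pairs.

```python
def rerank_top_k(query, candidates, score_fn, first_stage_k=20, final_k=10):
    """Rescore the first-stage top candidates and keep the best final_k.

    candidates: documents in first-stage retrieval order.
    score_fn(query, doc) -> float; higher means more relevant. In the
    benchmark this role is played by a cross-encoder.
    """
    pool = candidates[:first_stage_k]
    scored = [(score_fn(query, doc), idx, doc) for idx, doc in enumerate(pool)]
    # Sort by score descending; ties keep the first-stage order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [doc for _, _, doc in scored[:final_k]]


def toy_overlap_score(query, doc):
    """Hypothetical scorer: fraction of query words that appear in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0
```

Documents ranked below position 20 by the first stage never reach the reranker, so the reranker can only reorder, not recover, what retrieval missed.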
Global Performance Overview
5 Metrics × 5 Models (Base + Reranker Gain/Loss)
[Chart series: Dense, Sparse, Hybrid, 3-large, Hybrid-3L; overlays: +Reranker Gain, +Reranker Loss]
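For reference, the three metric families reported here (MRR, NDCG@k, P@k) can be computed per query as in the sketch below. It assumes binary relevance (a paper is either relevant or not), which the source does not confirm; the function names are hypothetical. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0
```

With a single relevant paper per query, P@10 can never exceed 0.1, so P@k values should be read relative to that ceiling under this assumption.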
QueryType × Difficulty Heatmap
Model Comparison Delta (MRR)
Delta values show each model's MRR difference from the baseline model (green = better, red = worse).
Reranker Impact Analysis
Metrics: Avg NDCG@10 Gain, Max Gain (conceptual), Avg Latency Overhead
Search Latency
Detailed Results
| Model | MRR | NDCG@5 | NDCG@10 | P@5 | P@10 | Latency |
|---|---|---|---|---|---|---|
Insights & Production Recommendations
Key Findings
Production Recommendations
Best Quality
Best Speed
Best Balance
Best ROI
Chat with Research Papers
Ask questions about AI/ML research papers
Research Papers Dataset
| # | Title | Date | arXiv |
|---|---|---|---|