Completed: August 2025 – December 2025

SimRAG Reproduction Study

Reproduction study of the SimRAG paper, implementing similarity-based RAG with two-stage fine-tuning on consumer hardware and analyzing model capacity limitations and retriever-generator coupling.

Python · RAG · Qdrant · Sentence Transformers · Ollama · Purdue GenAI API · PyTorch · Docker · Poetry
View on GitHub

Research Approach

Reproduction study of the SimRAG paper exploring similarity-based Retrieval-Augmented Generation (RAG) techniques. Built a modular implementation to understand RAG fundamentals, fine-tuning concepts, and practical ML engineering workflows.

Documents → Embeddings → Vector DB → Stage 1 Fine-tuning → QA Generation → Stage 2 Fine-tuning → Evaluation

Focused on learning through implementation rather than just theoretical understanding.

Key Features

🔌

Provider-Agnostic Interface

Supports both local (Ollama) and cloud (Purdue GenAI) LLMs with automatic provider selection.
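
A minimal sketch of how such a provider-agnostic interface can look. The class names, model tag, and environment variable are illustrative, not the project's actual API; the local call uses Ollama's public HTTP endpoint.

```python
import os
from abc import ABC, abstractmethod

import requests

class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class OllamaProvider(LLMProvider):
    def generate(self, prompt: str) -> str:
        # Ollama's local HTTP API; the model tag is illustrative.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=120,
        )
        return resp.json()["response"]

class GenAIProvider(LLMProvider):
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("wrap the Purdue GenAI API here")

def select_provider() -> LLMProvider:
    # Automatic selection: prefer the cloud provider when credentials
    # exist, otherwise fall back to a local Ollama instance.
    if os.getenv("GENAI_API_KEY"):  # hypothetical variable name
        return GenAIProvider()
    return OllamaProvider()
```

Callers only depend on `LLMProvider.generate`, so swapping backends requires no change to the RAG pipeline itself.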

🔍

RAG Implementation

Sentence Transformers for embeddings, Qdrant vector storage, and context-aware question answering with source citations.
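
The retrieval step could look roughly like this, assuming a populated Qdrant collection named `docs` with `text` and `source` payload fields (all names illustrative):

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(host="localhost", port=6333)

def retrieve_context(question: str, top_k: int = 3) -> str:
    # Embed the question and pull the nearest chunks from Qdrant.
    hits = client.search(
        collection_name="docs",
        query_vector=embedder.encode(question).tolist(),
        limit=top_k,
    )
    # Prefix each chunk with its source so the LLM can cite it.
    return "\n".join(f"[{h.payload['source']}] {h.payload['text']}" for h in hits)
```

The returned context block is then embedded in the prompt sent to whichever LLM provider is selected.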

⚙️

Two-Stage Fine-Tuning

QLoRA fine-tuning: Stage 1 for instruction following, Stage 2 for domain adaptation with synthetic QA pairs.
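
A sketch of a QLoRA setup on the usual transformers/peft/bitsandbytes stack; the base model and LoRA hyperparameters are illustrative choices, not the study's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit NF4 so it fits in consumer VRAM.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",  # illustrative 1.5B base model
    quantization_config=bnb,
    device_map="auto",
)

# Attach low-rank adapters; only these weights are updated in training.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```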

🧪

Test Suite

Test suite with mocked external dependencies for reproducible testing and validation.
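
A representative test under pytest, with the vector store and LLM replaced by mocks so nothing touches the network; `rag_answer` is a stand-in for the pipeline's real entry point.

```python
from unittest.mock import MagicMock

def rag_answer(question, store, llm):
    # Stand-in for the pipeline's real entry point, inlined so the
    # test is self-contained.
    hits = store.search(question)
    context = " ".join(f"[{h['source']}] {h['text']}" for h in hits)
    return llm.generate(f"{context}\nQ: {question}")

def test_answer_cites_sources():
    store = MagicMock()
    store.search.return_value = [
        {"text": "SimRAG uses two-stage fine-tuning.", "source": "paper.pdf"}
    ]
    llm = MagicMock()
    llm.generate.return_value = "Two-stage fine-tuning [paper.pdf]."

    result = rag_answer("What does SimRAG do?", store, llm)

    assert "paper.pdf" in result  # citation survives the pipeline
    store.search.assert_called_once_with("What does SimRAG do?")
```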

Technical Details

Workflow

Two-stage fine-tuning process: instruction following, then domain adaptation with synthetic QA pairs.

Setup & Document Ingestion

Load documents, chunk text, generate embeddings, store in Qdrant.
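
Sketch of that ingestion path, assuming fixed-size character chunking with overlap and cosine distance (chunk sizes, file name, and collection name are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size windows with overlap so ideas aren't cut mid-thought.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
client = QdrantClient(host="localhost", port=6333)

if not client.collection_exists("docs"):
    client.create_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

text = open("paper.txt", encoding="utf-8").read()  # illustrative input
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=embedder.encode(c).tolist(),
                    payload={"text": c, "source": "paper.txt"})
        for i, c in enumerate(chunk(text))
    ],
)
```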

Stage 1: Instruction Following

QLoRA fine-tuning on general instructions (~4-6 hours).
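
Building on the QLoRA sketch above, Stage 1 can be run with trl's SFTTrainer; the toy dataset stands in for the general instruction set used in the actual run, and exact trainer arguments vary by trl version.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy stand-in for a general instruction-following dataset.
stage1_data = Dataset.from_list([
    {"text": "### Instruction: Explain RAG briefly.\n"
             "### Response: RAG retrieves relevant documents and "
             "conditions generation on them."},
])

trainer = SFTTrainer(
    model=model,  # the QLoRA-wrapped model from the earlier sketch
    train_dataset=stage1_data,
    args=SFTConfig(
        output_dir="stage1",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model("stage1")
```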

Generate QA Pairs

Create domain-specific training data from documents.
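
One way to produce that data is to prompt the LLM over each stored chunk and keep well-formed pairs. The prompt wording and the crude format filter here are illustrative; a stronger filter could, for example, check that the generated question retrieves its source chunk.

```python
QA_PROMPT = (
    "Write one factual question that the passage below answers, then the "
    "answer.\nPassage: {chunk}\nFormat: Q: ... A: ..."
)

def generate_qa_pairs(chunks, llm):
    pairs = []
    for c in chunks:
        out = llm.generate(QA_PROMPT.format(chunk=c))
        if "Q:" in out and "A:" in out:  # crude well-formedness filter
            q, a = out.split("A:", 1)
            pairs.append({
                "question": q.replace("Q:", "").strip(),
                "answer": a.strip(),
                "context": c,
            })
    return pairs
```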

Stage 2: Domain Adaptation

QLoRA fine-tuning on domain QA dataset (~30 minutes).
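
Stage 2 then reuses the Stage 1 recipe on the synthetic pairs; here, pairs produced by `generate_qa_pairs` above are flattened into the same text format (field names follow the earlier sketches).

```python
from datasets import Dataset

def to_text(pair: dict) -> dict:
    return {"text": f"### Context: {pair['context']}\n"
                    f"### Question: {pair['question']}\n"
                    f"### Answer: {pair['answer']}"}

stage2_data = Dataset.from_list([to_text(p) for p in qa_pairs])
# Feed stage2_data to the same SFTTrainer setup as Stage 1; on this
# hardware the run took roughly 30 minutes.
```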

Testing & Comparison

Compare baseline RAG vs fine-tuned RAG performance.
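
A sketch of the comparison harness: both pipelines are timed and scored the same way. Cosine similarity against reference answers is an illustrative proxy here; the study tracked context relevance and answer quality separately.

```python
import time
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate(answer_fn, questions, references):
    # answer_fn is either the baseline or the fine-tuned RAG pipeline.
    scores, latencies = [], []
    for q, ref in zip(questions, references):
        start = time.perf_counter()
        ans = answer_fn(q)
        latencies.append(time.perf_counter() - start)
        scores.append(util.cos_sim(scorer.encode(ans),
                                   scorer.encode(ref)).item())
    return sum(scores) / len(scores), sum(latencies) / len(latencies)

# Run once per pipeline and compare:
# evaluate(baseline_rag, questions, references)
# evaluate(finetuned_rag, questions, references)
```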

Implementation

Successfully trained and tested the model on personal hardware, demonstrating practical ML engineering skills.

Experiment framework for research reproducibility with logging, result tracking, and automated testing.

Impact & Results

Developed an understanding of RAG fundamentals and fine-tuning concepts through hands-on implementation.

Key Achievements

Designed provider-agnostic interface supporting both local (Ollama) and cloud (Purdue GenAI) LLMs with automatic provider selection

Built RAG system with Sentence Transformers for embeddings, Qdrant vector storage, and context-aware question answering

Implemented both synchronous and asynchronous API calls for flexible integration patterns (a sketch follows this list)

Trained and tested the model on personal hardware (RTX 3080, 10 GB VRAM). Results: context relevance unchanged (0.316), answer quality decreased 0.1-1.9%, and response time increased 52-53%. The regressions are attributed to model capacity limitations (1.5B parameters vs. the original paper's 8B/27B) and the lack of retriever fine-tuning

Created test suite with mocked external dependencies for reproducible testing
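
A sketch of the sync/async split referenced above, using httpx against Ollama's local endpoint (model tag and payload are illustrative):

```python
import httpx

URL = "http://localhost:11434/api/generate"  # Ollama's local endpoint
PAYLOAD = {"model": "llama3", "prompt": "Hello", "stream": False}

def generate_sync() -> str:
    # Blocking call for simple scripts and notebooks.
    return httpx.post(URL, json=PAYLOAD, timeout=120).json()["response"]

async def generate_async() -> str:
    # Non-blocking variant for concurrent batch workloads,
    # e.g. awaited inside asyncio.gather().
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(URL, json=PAYLOAD)
        return resp.json()["response"]
```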

Technical Highlights

  • Modular RAG implementation
  • Provider-agnostic LLM interface
  • QLoRA fine-tuning on consumer hardware
  • Vector storage with Qdrant
  • Experiment framework for reproducibility
  • Test suite with mocked dependencies