Fine-Tuning
1. Definition
- Fine-tuning = adapting a pre-trained model to a specific task or dataset by continuing training.
- Leverages knowledge already learned during pre-training → faster, more efficient training that requires smaller datasets.
2. Purpose
- Improve task-specific performance.
- Reduce training cost compared to training from scratch.
- Enable domain adaptation (e.g., medical, legal, finance).
3. Pre-requisites
- Pre-trained model (BERT, GPT, T5, etc.)
- Task-specific labeled dataset
- Compatible loss function (e.g., cross-entropy for classification)
4. Types of Fine-Tuning
a) Full Fine-Tuning
- All model parameters updated.
- Pros: maximum performance.
- Cons: computationally expensive, prone to overfitting, high storage.
b) Parameter-Efficient Fine-Tuning (PEFT)
- Only a small subset of parameters updated.
- Methods:
- LoRA (Low-Rank Adaptation) → insert trainable low-rank matrices into transformer layers.
- Adapters → small feed-forward layers added between transformer blocks.
- Prefix Tuning → learn task-specific prefix embeddings fed into attention layers.
- Pros: memory & compute efficient, prevents catastrophic forgetting.
c) Task-Specific Fine-Tuning
- Adapts model to one specific task (QA, summarization, classification).
- Can be full or PEFT.
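As a rough illustration of PEFT in practice, the sketch below attaches LoRA modules to a frozen causal LM with the Hugging Face `peft` library; the base model (`gpt2`), the target module name, and the hyperparameters are illustrative assumptions, not a recommended recipe.

```python
# Minimal LoRA/PEFT sketch with Hugging Face `peft`; model name, target
# module, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection; varies per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the inserted LoRA matrices are trainable
```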
5. Fine-Tuning Process
- Data Preparation
- Clean, labeled, balanced dataset.
- Split into training, validation, test sets.
- Model Selection
- Choose pre-trained model suitable for task.
- Adaptation
- Modify input/output layers if necessary (e.g., classifier head for classification).
- Training
- Use appropriate learning rate (often lower than pre-training).
- Optimizer: AdamW or variants.
- Loss: task-dependent (cross-entropy, MSE, etc.)
- Evaluation
- Metrics: task-dependent
- Classification → Accuracy, F1, Precision, Recall
- Generation → BLEU, ROUGE, METEOR
- Retrieval → Recall@k, MRR
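The steps above map fairly directly onto the Hugging Face `Trainer` API. The sketch below assumes a binary classification task on the public `imdb` dataset and a BERT base model; the dataset, hyperparameters, and output path are illustrative.

```python
# Illustrative fine-tuning sketch (data prep -> model selection -> adaptation
# -> training -> evaluation) using Hugging Face transformers and datasets.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"                       # model selection
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)                          # adaptation: new classifier head

dataset = load_dataset("imdb")                         # data preparation

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="ft-out",
    learning_rate=2e-5,                 # lower than pre-training learning rates
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,                  # regularization
)

trainer = Trainer(model=model, args=args,              # AdamW and cross-entropy by default
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()
print(trainer.evaluate())               # validation metrics (here: loss on the held-out split)
```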
6. Advantages
- Faster than training from scratch.
- Requires smaller datasets.
- Can adapt large general models to domain-specific tasks.
- Works with large LLMs using PEFT methods.
7. Challenges
- Overfitting with small datasets.
- Catastrophic forgetting of pre-trained knowledge.
- Requires careful learning rate & batch size tuning.
- May still need data augmentation for low-resource domains.
8. Best Practices
- Use lower learning rates for pre-trained layers, higher for new layers.
- Freeze some layers if data is small.
- Prefer PEFT for large LLMs to save GPU/memory.
- Monitor validation loss to prevent overfitting.
- Combine with regularization (dropout, weight decay).
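A minimal sketch of the first two practices (discriminative learning rates and partial freezing) in plain PyTorch; it assumes a BERT-style classification model like the one in the earlier sketch, with `.bert` and `.classifier` attributes.

```python
# Sketch: freeze lower encoder layers and use a lower LR for pre-trained
# weights than for the newly added head. Attribute names assume a
# BertForSequenceClassification-style model.
import torch

for layer in model.bert.encoder.layer[:6]:   # freeze lowest layers when data is small
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(
    [
        {"params": model.bert.parameters(), "lr": 1e-5},        # pre-trained encoder: low LR
        {"params": model.classifier.parameters(), "lr": 1e-3},  # new head: higher LR
    ],
    weight_decay=0.01,
)
```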
Fine-Tuning in RAG
1. What is Fine-Tuning
- Fine-tuning = adapting a pre-trained model to a specific domain/task using additional training data.
- In RAG, fine-tuning improves:
- Retriever (better document recall)
- Generator (better response style/accuracy)
- End-to-end RAG pipeline (retrieval + generation jointly).
2. Why Fine-Tuning in RAG
- Pre-trained retrievers may not work well on a domain-specific corpus (e.g., legal, medical, finance).
- LLMs may generate hallucinations or irrelevant answers.
- Fine-tuning ensures:
- Domain adaptation
- Higher precision in retrieval
- More faithful generation
3. Levels of Fine-Tuning
a) Retriever Fine-Tuning
- Uses contrastive learning:
- Positive pairs → (query, relevant doc)
- Negative pairs → (query, irrelevant doc)
- Optimizes embedding space → similar items closer.
- Methods:
- DPR (Dense Passage Retrieval) fine-tuning
- ColBERT, Contriever
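A minimal sketch of contrastive retriever fine-tuning using the sentence-transformers library and its MultipleNegativesRankingLoss; the base model and the single toy (query, positive, negative) triplet are illustrative assumptions.

```python
# Contrastive retriever fine-tuning sketch with sentence-transformers;
# the model name and the toy triplet are illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Each example: (query, relevant passage, hard-negative passage).
train_examples = [
    InputExample(texts=[
        "what interest applies to late payments?",
        "Clause 7.2: overdue invoices accrue interest at 4% per annum.",
        "Clause 2.1: the agreement commences on the effective date.",
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives + hard negatives

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```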
b) Generator Fine-Tuning
- Train LLM on (retrieved context + query → target answer).
- Types:
- Supervised Fine-Tuning (SFT) → on QA pairs.
- Instruction Fine-Tuning → align with prompt styles.
- Preference Optimization (RLHF, DPO) → better answer quality.
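Generator fine-tuning data is usually serialized into a single prompt/target sequence. The sketch below shows one way to build a causal-LM SFT example where the loss is masked on the prompt; the prompt template and field values are illustrative assumptions.

```python
# Sketch: turn a (retrieved context, query, gold answer) record into a
# causal-LM SFT example; the prompt template is an illustrative assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def build_sft_example(context, query, answer, max_length=512):
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(" " + answer + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + answer_ids)[:max_length]
    # -100 masks the prompt tokens so loss is computed only on the answer.
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}

example = build_sft_example(
    context="LoRA adds trainable low-rank matrices to frozen weights.",
    query="What does LoRA train?",
    answer="Only the low-rank matrices A and B.",
)
```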
c) End-to-End RAG Fine-Tuning
- Joint optimization of retriever and generator.
- Feedback loop: retrieval improves generation, generation guides retriever.
- Requires large compute + carefully curated data.
4. Fine-Tuning Approaches
a) Full Fine-Tuning
- Update all model parameters.
- High accuracy but very costly.
b) Parameter-Efficient Fine-Tuning (PEFT)
- Freeze most weights, update small modules.
- Methods:
- LoRA (Low-Rank Adaptation)
- Prefix Tuning
- Adapters
- Efficient and scalable.
5. Data Preparation
- High-quality domain-specific QA pairs or retrieval relevance labels.
- Dataset types:
- Retriever training → triplets (query, positive doc, negative doc).
- Generator training → (query + retrieved docs → gold answer).
- Ensure cleaning, deduplication, balanced coverage.
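The two dataset types above might look roughly like the records below; field names and contents are illustrative assumptions.

```python
# Illustrative RAG fine-tuning records (field names are assumptions).

# Retriever training: (query, positive doc, hard negative doc) triplet.
retriever_record = {
    "query": "What interest rate applies to late invoice payments?",
    "positive": "Clause 7.2: overdue invoices accrue interest at 4% per annum.",
    "negative": "Clause 2.1: the agreement commences on the effective date.",
}

# Generator training: query + retrieved docs mapped to a gold answer.
generator_record = {
    "query": "What interest rate applies to late invoice payments?",
    "retrieved_docs": [
        "Clause 7.2: overdue invoices accrue interest at 4% per annum.",
    ],
    "answer": "Late payments accrue interest at 4% per year under clause 7.2.",
}
```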
6. Evaluation of Fine-Tuning
a) Retriever Metrics
- Recall@k, MRR (Mean Reciprocal Rank), nDCG.
b) Generator Metrics
- BLEU, ROUGE, METEOR, BERTScore.
- Faithfulness & factuality tests.
c) End-to-End RAG
- Human evaluation on relevance, accuracy, hallucination rate.
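Recall@k and MRR are simple enough to compute directly. The sketch below assumes each query has a ranked list of retrieved document ids and a list of gold relevant ids; all names and the toy data are illustrative.

```python
# Sketch: Recall@k and MRR for a fine-tuned retriever; `ranked` holds each
# query's retrieved doc ids in rank order, `gold` the relevant ids.

def recall_at_k(ranked, gold, k=5):
    hits = sum(1 for r, g in zip(ranked, gold) if set(r[:k]) & set(g))
    return hits / len(ranked)

def mean_reciprocal_rank(ranked, gold):
    total = 0.0
    for r, g in zip(ranked, gold):
        for rank, doc_id in enumerate(r, start=1):
            if doc_id in g:
                total += 1.0 / rank
                break
    return total / len(ranked)

ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
gold = [["d1"], ["d5"]]
print(recall_at_k(ranked, gold, k=2))        # 0.5  (one of two queries hits in top 2)
print(mean_reciprocal_rank(ranked, gold))    # 0.25 (1/2 for query 1, 0 for query 2)
```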
7. Challenges
- Data scarcity (domain QA pairs hard to obtain).
- Catastrophic forgetting (losing general knowledge).
- Costly computation (GPU/TPU resources).
- Alignment issues (retriever pulls irrelevant docs, generator hallucinates).
8. Best Practices
- Start with PEFT methods (LoRA) before full fine-tuning.
- Use in-domain embeddings for retriever.
- Perform knowledge distillation (smaller model learns from larger).
- Evaluate iteratively (retriever and generator separately before joint tuning).
- Hybrid approach: prompt engineering + fine-tuning.
9. Alternatives to Fine-Tuning
- Prompt engineering (zero/few-shot).
- Retrieval customization (re-ranking, hybrid search).
- External memory augmentation.
Fine-Tuning in General vs Fine-Tuning in RAG
Fine-Tuning in General
- Adapting a pretrained model (like BERT, GPT, T5) to a specific dataset/task.
- Examples: text classification, summarization, translation.
- You fine-tune either the full model or use PEFT (LoRA, Adapters, Prefix Tuning).
- Output: model specialized for that task.
Fine-Tuning in RAG
- RAG has two components (retriever + generator). Fine-tuning can target one or both:
- Retriever Fine-Tuning
- Optimizes embeddings so relevant docs are closer to queries.
- Uses contrastive learning on (query, positive doc, negative doc).
- Generator Fine-Tuning
- Adapts LLM to generate answers using retrieved docs.
- Trained on (query + retrieved docs → gold answer).
- End-to-End Fine-Tuning
- Jointly optimizes retriever + generator in one loop.
Key Difference
- Normal Fine-Tuning
- Adapting one model (BERT, GPT) to a task.
- i.e., task adaptation of pretrained models.
- RAG Fine-Tuning
- Adapting multiple modules (retriever + generator), sometimes jointly, to improve end-to-end retrieval + answer generation.
- i.e., specialized fine-tuning of the retriever, the generator, or both, to reduce hallucination, improve retrieval quality, and adapt to domain-specific QA.
Parameter-Efficient Fine-Tuning (PEFT)
1. Definition
- PEFT = fine-tuning only a small subset of parameters in a pre-trained model while keeping most weights frozen.
- Goal: adapt large models efficiently in terms of memory, compute, and data.
- Essential for LLMs like GPT, LLaMA, T5, BERT.
2. Motivation
- Large models (billions of parameters) → full fine-tuning is expensive.
- PEFT allows task adaptation without updating all weights.
- Preserves pre-trained knowledge, prevents catastrophic forgetting.
- Works well with small datasets.
3. Core PEFT Methods
a) LoRA (Low-Rank Adaptation)
- Inserts low-rank trainable matrices into transformer layers.
- Original weights frozen; only low-rank matrices updated.
- Steps:
- Weight update ΔW ≈ A × B (low-rank)
- Forward pass: W + ΔW
- Backprop updates only A, B
- Pros: memory-efficient, good for domain/task-specific adaptation.
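To make the steps above concrete, here is a toy LoRA layer in plain PyTorch (written with ΔW = B·A); it is a conceptual sketch, not the peft library's implementation.

```python
# Toy LoRA layer: the frozen weight W is combined with a low-rank update
# ΔW = B @ A, and only A and B receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():            # freeze W (and bias)
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))  # ΔW starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Forward pass: W x + (alpha / r) * B A x
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))                      # same shape as a normal Linear
```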
b) Prefix Tuning
- Prepends task-specific learnable prefix vectors to the keys and values in attention layers.
- Frozen original model.
- Pros: extremely efficient, works in few-shot scenarios.
c) Adapters
- Small feed-forward layers inserted between transformer blocks.
- Only adapter layers trained; main model frozen.
- Pros: modular, easy to switch tasks by changing adapters.
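A bottleneck adapter can be sketched as a small down-project/up-project block with a residual connection; the hidden and bottleneck sizes below are illustrative.

```python
# Toy bottleneck adapter: down-project, nonlinearity, up-project, residual.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection preserves the frozen model's representation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
out = adapter(torch.randn(2, 16, 768))   # (batch, seq_len, hidden)
```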
d) Prompt Tuning
- Learn task-specific prompts for frozen LLM input.
- Pros: very low parameter count, suitable for few-shot or zero-shot.
e) T-Few
- Few-shot fine-tuning recipe built on the multi-task pre-trained T0 models.
- Uses the (IA)³ PEFT method: learned vectors that rescale attention keys/values and feed-forward activations.
- Multi-task pre-training supplies task-agnostic knowledge → generalizes to unseen tasks with only a few labeled examples.
- Pros: extremely data-efficient; reported to outperform few-shot in-context learning at much lower inference cost.
4. Advantages of PEFT
- Reduces trainable parameters → faster, lower memory.
- Preserves pre-trained knowledge.
- Scalable for multiple tasks → swap adapters/prefixes.
- Works with small datasets / few-shot learning.
- Avoids catastrophic forgetting.
5. Limitations
- Slightly lower performance than full fine-tuning in some tasks.
- Placement of adapters/prefixes affects results.
- Hyperparameter tuning required (rank in LoRA, prefix length).
- T-Few relies on large-scale multi-task pre-training (T0) to work well.
6. Workflow
- Select pre-trained model (GPT, LLaMA, BERT, T5).
- Choose PEFT method (LoRA, Adapters, Prefix, T-Few).
- Insert PEFT modules in transformer layers.
- Prepare task dataset (small data works).
- Train only PEFT parameters (small learning rate).
- Evaluate using task-specific metrics.
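The workflow above, expressed with the Hugging Face `peft` library: train only the LoRA parameters, save the small adapter, and later re-attach it to (or swap it on) the frozen base model. The base model and adapter path are illustrative assumptions, and the training loop is elided.

```python
# PEFT workflow sketch: attach LoRA, train only its parameters, save the
# adapter, and reload it onto the frozen base model later.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# ... train with a small learning rate; only the LoRA parameters get gradients ...

model.save_pretrained("adapters/legal-qa")       # stores only the adapter weights

# Later (or for a different task): load the frozen base once, attach an adapter.
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base, "adapters/legal-qa")
```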
7. Applications
- QA, summarization, classification, RAG pipelines.
- Multi-task adaptation with shared model (swap adapters or prefixes).
- Low-resource domains or few-shot learning.
- Enterprise: chatbots, domain-specific assistants.
8. Comparison of PEFT Methods
| Method | Trainable Params | Data Requirement | Use Case | Notes |
|---|---|---|---|---|
| LoRA | Low-rank matrices | Moderate | Task-specific | Efficient for large LLMs |
| Prefix Tuning | Prefix embeddings | Few-shot | Few-shot tasks | Very memory efficient |
| Adapters | Small FF layers | Moderate | Multi-task | Modular, can swap for tasks |
| Prompt Tuning | Input prompts | Few-shot / zero-shot | Minimal tuning | Extremely lightweight |
| T-Few | (IA)³ rescaling vectors | Very few | Multi-task, few-shot | Multi-task pre-training + PEFT, strong generalization |