Free and open-source AI models have matured to the point where you can build production-grade applications without relying on expensive API calls. This guide shows you how to get started, covers real-world use cases, and provides a complete architecture for building an AI-powered code review assistant.
## Quick Setup: Running Your First Local AI Model

### Step 1: Install Ollama

Ollama is the easiest way to run AI models locally.

```shell
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (PowerShell)
winget install Ollama.Ollama
```
### Step 2: Download a Model

```shell
# Pull a lightweight model (3B parameters)
ollama pull llama3.2

# Or a stronger coding model (16B parameters)
ollama pull deepseek-coder-v2

# For reasoning tasks
ollama pull deepseek-r1
```
### Step 3: Test It Out

```shell
# Interactive chat
ollama run llama3.2

# Or use the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in simple terms"
}'
```
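By default, `/api/generate` streams its reply as newline-delimited JSON objects, one chunk per line, with a final object whose `done` field is `true`. A small helper (illustrative; `join_stream` is not part of any library) can stitch the chunks back into the full response:

```python
import json

def join_stream(lines):
    """Concatenate the 'response' fields of Ollama's streamed JSON lines."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk carries timing stats, no more text
    return "".join(parts)

# Three chunks, shaped like the server streams them:
sample = [
    '{"model":"llama3.2","response":"Quantum ","done":false}',
    '{"model":"llama3.2","response":"computing...","done":false}',
    '{"model":"llama3.2","response":"","done":true}',
]
print(join_stream(sample))  # → Quantum computing...
```

If you'd rather skip streaming entirely, add `"stream": false` to the request body and the API returns a single JSON object instead.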
## Real-World Use Cases

### 1. Privacy-First Business Chatbot

- **Model:** LLaMA 3.1 70B or Mistral Large 3
- **Best For:** Companies handling sensitive customer data (healthcare, finance)
- **Why Local?:** HIPAA/GDPR compliance, zero data leakage

### 2. Code Generation & Review

- **Model:** DeepSeek-Coder V2 or Qwen3-Coder
- **Best For:** Automated code reviews, bug detection, documentation
- **Why Local?:** Proprietary code never leaves your infrastructure, unlimited usage

### 3. Document Q&A System

- **Model:** Qwen2.5-1M (1M-token context)
- **Best For:** Legal document analysis, research papers, knowledge bases
- **Why Local?:** Process entire documents without per-token API costs

### 4. Content Moderation Pipeline

- **Model:** Gemma 3 (4B) or Mistral 7B
- **Best For:** Real-time content filtering, spam detection
- **Why Local?:** Ultra-low latency, high throughput
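To make the moderation idea concrete: have the small model answer with a single label, then parse that label defensively. The prompt wording and the `parse_verdict` helper below are illustrative sketches, not library code; the key design choice is to fail closed when the model says anything unexpected.

```python
MODERATION_PROMPT = (
    "You are a content moderator. Reply with exactly one word: "
    "SAFE or UNSAFE.\n\nContent: {content}"
)

def parse_verdict(reply: str) -> bool:
    """Return True only if the model's first token is exactly SAFE.

    Anything else (empty output, hedging, 'UNSAFE: spam') fails closed,
    so malformed model output never lets content through.
    """
    stripped = reply.strip()
    token = stripped.split()[0].upper() if stripped else ""
    return token == "SAFE"

print(parse_verdict("SAFE"))          # True
print(parse_verdict("UNSAFE: spam"))  # False
print(parse_verdict(""))              # False
```

In a real pipeline you would format `MODERATION_PROMPT` with the user content, send it through Ollama's API, and route the post based on `parse_verdict`.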
## Deep Dive: Building an AI Code Review Assistant

Let's build a complete system that automatically reviews pull requests, detects bugs, suggests improvements, and generates documentation.

### Tech Stack
| Component | Technology | Purpose |
|---|---|---|
| AI Model | DeepSeek-Coder V2 (16B) | Code analysis & generation |
| Inference Server | Ollama | Model serving |
| Backend | Python + FastAPI | API & orchestration |
| Vector Database | ChromaDB | Code embedding search |
| Git Integration | PyGithub / GitLab API | PR monitoring |
| Queue | Redis + Celery | Async task processing |
| Frontend | React + TypeScript | Dashboard UI |
### System Architecture

```mermaid
flowchart TB
    PR[GitHub Pull Request] -->|1. Webhook| WH[FastAPI Handler]
    WH -->|2. Enqueue| Queue[Redis + Celery]
    subgraph "Processing"
        Queue --> Worker[Celery Worker]
        Worker --> VDB[(ChromaDB<br>Code Embeddings)]
        Worker -->|3. Analyze| LLM[Ollama<br>DeepSeek-Coder V2]
    end
    subgraph "AI Analysis"
        LLM -->|Bug Detection| Bugs[Security Issues<br>Logic Errors]
        LLM -->|Code Quality| Quality[Best Practices<br>Refactoring]
        LLM -->|Documentation| Docs[Auto-Generated<br>Comments]
    end
    subgraph "Output"
        Bugs --> Comment[Post PR Comment]
        Quality --> Comment
        Docs --> Comment
        Comment --> DB[(PostgreSQL<br>Review History)]
        Comment --> Dashboard[React Dashboard]
    end
    style LLM fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style VDB fill:#f3e5f5,stroke:#4a148c
    style Queue fill:#fff3e0,stroke:#e65100
```
### Implementation: Core Components

#### 1. Webhook Handler (FastAPI)

```python
from fastapi import FastAPI

from tasks import analyze_pr

app = FastAPI()

@app.post("/webhook/github")
async def github_webhook(payload: dict):
    # Only react to newly opened pull requests
    if payload.get("action") == "opened":
        pr_number = payload["pull_request"]["number"]
        repo = payload["repository"]["full_name"]
        # Queue async analysis on the Celery worker
        analyze_pr.delay(repo, pr_number)
        return {"status": "queued"}
    return {"status": "ignored"}
```

Note that the analysis is dispatched with `analyze_pr.delay(...)`: since `analyze_pr` is a Celery task, calling it through FastAPI's `BackgroundTasks` would run the model inference inside the web process instead of on the worker.
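One production detail the handler glosses over: GitHub signs every webhook delivery with an `X-Hub-Signature-256` header, an HMAC-SHA256 of the raw request body keyed by your webhook secret, and the payload shouldn't be trusted until that signature checks out. A minimal verification helper:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare GitHub's X-Hub-Signature-256 header against our own HMAC."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature_header)
```

In the FastAPI route you would read the raw bytes with `await request.body()`, call `verify_signature` with the header value, and reject the request with a 401 before ever parsing the JSON.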
#### 2. Code Analysis Worker (Celery + Ollama)

```python
import os

import ollama
from celery import Celery
from github import Auth, Github

celery = Celery('tasks', broker='redis://localhost:6379')
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

@celery.task
def analyze_pr(repo_name, pr_number):
    # Fetch the pull request
    g = Github(auth=Auth.Token(GITHUB_TOKEN))
    repo = g.get_repo(repo_name)
    pr = repo.get_pull(pr_number)

    # Get changed files
    files = pr.get_files()
    analysis_results = []

    for file in files:
        # Skip unsupported languages and files without a diff (e.g. binaries)
        if file.filename.endswith(('.py', '.js', '.ts', '.go')) and file.patch:
            # Analyze the diff with DeepSeek-Coder
            # (REVIEW_SYSTEM_PROMPT is defined in the prompts section below)
            result = ollama.chat(
                model='deepseek-coder-v2',
                messages=[
                    {'role': 'system', 'content': REVIEW_SYSTEM_PROMPT},
                    {'role': 'user',
                     'content': f"Review this code:\n```\n{file.patch}\n```"},
                ],
            )
            analysis_results.append({
                'file': file.filename,
                'review': result['message']['content'],
            })

    # Post a summary comment on the PR
    comment = format_review_comment(analysis_results)
    pr.create_issue_comment(comment)
    return analysis_results
```
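The worker calls a `format_review_comment` helper that the snippet leaves undefined. One possible implementation (illustrative, not from the original) collapses the per-file reviews into a single markdown comment:

```python
def format_review_comment(results: list[dict]) -> str:
    """Merge per-file review results into one markdown PR comment."""
    if not results:
        return "🤖 **AI Code Review**\n\nNo reviewable source changes in this PR."
    # One section per file, separated by horizontal rules
    sections = [f"### `{r['file']}`\n\n{r['review']}" for r in results]
    return "🤖 **AI Code Review**\n\n" + "\n\n---\n\n".join(sections)

comment = format_review_comment([
    {"file": "app.py", "review": "Looks good overall."},
])
```

Posting one merged comment instead of one comment per file keeps the PR timeline readable and stays well clear of GitHub's API rate limits.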
#### 3. Prompt Engineering: System Prompts

**Bug Detection Prompt:**

```
You are an expert code reviewer specializing in security and correctness.
Analyze the following code changes and identify:

1. Security vulnerabilities (SQL injection, XSS, authentication bypass)
2. Logic errors (off-by-one, race conditions, null pointer exceptions)
3. Performance issues (N+1 queries, unnecessary loops, memory leaks)
4. Edge cases not handled

For each issue found:
- Severity: Critical / High / Medium / Low
- Line number(s)
- Explanation
- Suggested fix

Format as markdown. Be concise and actionable.
```

**Code Quality Prompt:**

```
You are a senior software engineer reviewing code for best practices.
Evaluate the code for:

1. Readability and maintainability
2. Adherence to language conventions (PEP 8, ESLint, gofmt)
3. Proper error handling
4. Test coverage suggestions
5. Opportunities for refactoring

Provide specific, constructive feedback with examples.
```

**Documentation Generation Prompt:**

```
Generate clear, concise documentation for the following code:

1. Function/class summary (one line)
2. Parameters with types and descriptions
3. Return value(s)
4. Example usage
5. Any important notes or warnings

Format as a JSDoc comment or docstring, as appropriate for the language.
```
### Example Outputs

**Input Code (Python):**

```python
def process_payment(user_id, amount):
    user = db.query(f"SELECT * FROM users WHERE id = {user_id}")
    if user.balance >= amount:
        user.balance -= amount
        db.execute(f"UPDATE users SET balance = {user.balance} WHERE id = {user_id}")
        return True
    return False
```

**AI Review Output:**
## 🔴 Critical Issues
### SQL Injection Vulnerability
- **Lines:** 2, 4
- **Severity:** Critical
- **Issue:** Direct string interpolation creates SQL injection risk
- **Fix:**
```python
user = db.query("SELECT * FROM users WHERE id = ?", (user_id,))
db.execute("UPDATE users SET balance = ? WHERE id = ?",
           (user.balance, user_id))
```
### Race Condition
- **Lines:** 3-4
- **Severity:** High
- **Issue:** Balance check and update not atomic
- **Fix:** Use database transaction with row locking
## 🟡 Medium Issues
### Missing Error Handling
- Add try/except for database errors
- Validate `amount > 0`
- Handle case where user doesn't exist
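The race-condition fix deserves a concrete illustration. Instead of checking the balance and then updating it in two statements, a single conditional `UPDATE` makes the check and the debit one atomic step. A sketch using an in-memory SQLite database (production code would use the real schema, driver, and parameter style):

```python
import sqlite3

def process_payment(conn, user_id, amount):
    """Atomically debit the balance.

    The `balance >= ?` predicate in the UPDATE makes check-and-debit a
    single statement, so two concurrent payments cannot both pass a
    stale balance check.
    """
    if amount <= 0:
        raise ValueError("amount must be positive")
    with conn:  # wraps the statement in a transaction
        cur = conn.execute(
            "UPDATE users SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, user_id, amount),
        )
    # rowcount is 1 only if a row matched, i.e. funds were sufficient
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO users VALUES (1, 100.0)")
print(process_payment(conn, 1, 30.0))   # True  (balance now 70.0)
print(process_payment(conn, 1, 200.0))  # False (insufficient funds)
```

On databases with row locking (e.g. PostgreSQL), `SELECT ... FOR UPDATE` inside a transaction is the equivalent idiom when you genuinely need to read before writing.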
## Deployment Considerations

- **Hardware:** Minimum 16GB VRAM (GPU) or 32GB RAM (CPU) for DeepSeek-Coder V2 16B
- **Quantization:** Q5_K_M quantization roughly halves model size with minimal quality loss
- **Scaling:** Run multiple Ollama instances behind a load balancer for high-volume repos
- **Context Management:** For large PRs, analyze files in batches to stay within context limits
- **Caching:** Cache embeddings of unchanged files to speed up analysis
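The context-management point can be sketched as a simple batching helper. This is hypothetical glue code, and it assumes a rough average of 4 characters per token, which is in the right ballpark for source code:

```python
def batch_files(files, max_tokens=8000, chars_per_token=4):
    """Group (filename, patch) pairs so each batch fits a rough token budget.

    A single oversized patch still gets its own batch; a real system
    would split it further or skip it with a note in the review.
    """
    budget = max_tokens * chars_per_token
    batches, current, size = [], [], 0
    for name, patch in files:
        # Flush the current batch before it would overflow the budget
        if current and size + len(patch) > budget:
            batches.append(current)
            current, size = [], 0
        current.append((name, patch))
        size += len(patch)
    if current:
        batches.append(current)
    return batches

# Two small patches share a batch; the large one lands in its own
demo = [("a.py", "x" * 10_000), ("b.py", "x" * 10_000), ("c.py", "x" * 30_000)]
print([len(b) for b in batch_files(demo)])  # [2, 1]
```

Each batch then becomes one request to the model, so no single prompt blows past the context window.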
## Cost Comparison
| Approach | Setup Cost | Per-PR Cost | Monthly (100 PRs) |
|---|---|---|---|
| Local (Our Setup) | $500 (GPU server) | $0 | $0 |
| OpenAI GPT-4 | $0 | ~$0.50 | $50 |
| Claude Sonnet | $0 | ~$0.30 | $30 |
Against GPT-4 pricing, the $500 server breaks even at roughly 10 months; after that, the savings compound with every PR.
## More Example Use Cases

### Customer Support Automation

**Prompt Template:**

```
You are a helpful customer support agent for [Company Name].

Customer question: {question}

Relevant documentation: {retrieved_docs}

Provide a clear, friendly response. If you cannot answer,
suggest contacting human support.
```

### Data Extraction from Documents

**Prompt Template:**

```
Extract the following information from this invoice:
- Invoice number
- Date
- Vendor name
- Total amount
- Line items (description, quantity, price)

Format as JSON. If a field is missing, use null.

Invoice text:
{document_text}
```
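A practical wrinkle with the extraction template: even when told "Format as JSON", models often wrap the object in a code fence or add a sentence around it. A forgiving parser (a sketch; a real pipeline might also retry the request on failure) pulls out the first object before handing it to `json.loads`:

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Extract and parse the first {...} object in a model reply."""
    # DOTALL lets the match span multiple lines of pretty-printed JSON
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

reply = ('Sure! Here is the data:\n```json\n'
         '{"invoice_number": "INV-42", "total_amount": 19.99, "date": null}\n'
         '```')
data = parse_model_json(reply)
print(data["invoice_number"])  # INV-42
```

The greedy `\{.*\}` pattern assumes one JSON object per reply, which the template above encourages; nested or multiple objects would need a proper incremental parse.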
## Best Practices

- **Start Small:** Test with 7B models (Mistral, Gemma) before scaling to 70B+
- **Prompt Engineering:** Spend time crafting clear system prompts; they're your only "training"
- **Temperature Control:** Use 0.1-0.3 for factual tasks, 0.7-0.9 for creative tasks
- **Context Windows:** Truncate intelligently and keep the most relevant information
- **Monitoring:** Track response quality, latency, and resource usage
- **Fallbacks:** Have a backup plan (such as a cloud API) if local inference fails
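The temperature guidance maps onto the per-request `options` dict that Ollama accepts in both its REST API and Python client. A tiny preset helper (the helper itself and its threshold values are illustrative, taken from the guidance above) keeps the choice explicit per task:

```python
def sampling_options(task: str) -> dict:
    """Return Ollama sampling options for a task category."""
    presets = {
        "factual":  {"temperature": 0.2},  # tight, repeatable answers
        "creative": {"temperature": 0.8},  # more varied phrasing
    }
    return presets[task]

# Usage, assuming a running Ollama server and the `ollama` Python client:
# ollama.chat(model="llama3.2", messages=msgs,
#             options=sampling_options("factual"))
```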
## Getting Started Checklist
- ✅ Install Ollama
- ✅ Download a model suited to your task
- ✅ Test basic prompts interactively
- ✅ Build a simple REST API wrapper
- ✅ Integrate with your application
- ✅ Monitor performance and iterate on prompts
- ✅ Scale horizontally if needed
## References & Resources

- **Ollama:** documentation and guides in the GitHub repository
- **DeepSeek-Coder:** official website
- **LangChain:** Python framework for orchestrating LLM apps
- **ChromaDB:** vector database for AI applications
Last updated: February 7, 2026.