J.A.R.V.I.S Documentation
Just A Rather Very Intelligent System: hybrid LLM architecture, local-first memory, autonomous agentic workflows.
System Overview
Jarvis is an advanced AI assistant that combines a hybrid brain (Fast Reflex + Agentic Reasoning) with persistent vector memory, real-time voice, and a rich tool ecosystem. It runs primarily on local infrastructure, respects privacy, and delivers low-latency automation, from opening apps to generating deep research reports.
Fast Brain
Groq Llama 3.3-70B: instant commands, chat, file open, YouTube.
Agentic Brain
Gemini / Gemma: multi-step tasks, web research, email, image generation.
Persistent Memory
ChromaDB + Gemini Embeddings (768d) + user bio/mood tracking.
Hybrid Architecture
[Wake Word / Input]
        |
        v
+---------------+    Groq Router (Llama 3.1-8B)
| Intent Router |----------------------------+
+---------------+                            |
        | FAST (apps/chat/open)              | AGENTIC (email/research/write)
        v                                    v
+-----------------+               +------------------------+
| FAST BRAIN      |               | AGENTIC LOOP           |
| (Groq 70B)      |               | (Gemini/Gemma 31B)     |
| response +      |               | ReAct, max_steps=10    |
| apps/urls       |               | tool calls (search,    |
+-----------------+               | workspace, email)      |
        |                         +------------------------+
        |                                    |
        +-----------------+------------------+
                          |
                          v
              +------------------------+
              | EXECUTOR (sync/async)  |
              | open/close, web, image |
              | workspace, WhatsApp    |
              +------------------------+
The router classifies the intent of each command and routes it to the Fast Brain for low-latency responses or to the Agentic Loop for deep tasks. Ephemeral memory and tool outputs are shared across steps.
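The routing step can be sketched as follows. This is a minimal illustration, not the project's router: `route`, `keyword_classifier`, and the rule that unknown labels fall back to FAST are all assumptions made for the example, and the keyword classifier only stands in for the Groq router model.

```python
from typing import Callable

ROUTES = ("FAST", "AGENTIC")


def route(command: str, classify: Callable[[str], str]) -> str:
    """Return 'FAST' or 'AGENTIC' for a command.

    `classify` stands in for the router LLM call (hypothetical signature).
    Any label that is not recognised falls back to FAST, which is an
    assumption of this sketch rather than documented behaviour.
    """
    label = classify(command).strip().upper()
    return label if label in ROUTES else "FAST"


def keyword_classifier(command: str) -> str:
    """Toy replacement for the LLM router, for demonstration only."""
    agentic_hints = ("email", "research", "write")
    text = command.lower()
    return "AGENTIC" if any(hint in text for hint in agentic_hints) else "FAST"


if __name__ == "__main__":
    for cmd in ("open chrome", "research solid-state batteries and email me"):
        print(cmd, "->", route(cmd, keyword_classifier))
```

Normalising the label before comparing it keeps the dispatcher tolerant of stray whitespace or casing in a model's reply.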
Core Capabilities
Porcupine wake word + Groq Whisper-Large STT + Cartesia Sonic-3 streaming TTS (emotion-aware).
Virtual file system (Creations/Vault/Temp), fuzzy search, rename/move, auto registry.
FLUX (Together AI) + AI Horde editing, auto-saved to workspace.
Gmail API (send with attachments), Twilio WhatsApp with image compression.
Autonomous agent compiles reports from web (Tavily) + arXiv, then saves .md to Creations.
ChromaDB + Gemini embeddings, conversation summarisation, long-term user bio, mood tracking.
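The fuzzy search listed for the virtual file system could look roughly like the sketch below. It uses the standard library's `difflib` as a stand-in; the source does not say which matching technique Jarvis uses, and `fuzzy_find`, the sample file names, and the 0.6 cutoff are illustrative assumptions.

```python
import difflib
from typing import List, Optional


def fuzzy_find(query: str, names: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the name most similar to `query`, or None if nothing is close enough."""
    # Compare case-insensitively but return the name in its original casing.
    lowered = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


if __name__ == "__main__":
    files = ["quarterly_report.md", "holiday_photo.png", "notes.txt"]
    print(fuzzy_find("quartely report", files))
```

A cutoff keeps unrelated queries from silently resolving to the wrong file; the right value depends on how noisy the spoken or typed names are.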
Tech Stack
| Category | Technologies |
|---|---|
| LLMs | Groq (Llama 3.3-70B, Llama 3.1-8B), Gemini 2.0 Flash, Gemma-4-31B |
| Vector DB | ChromaDB, Gemini Embedding-2 (768d) |
| Voice | Porcupine, Groq Whisper-Large-V3, Cartesia Sonic-3, Edge TTS fallback |
| Tools | Tavily, arXiv, Together AI (FLUX), AI Horde, Twilio, Gmail API, PyWhatKit |
| Frontend UI | PyQt5 (agent panel + STT popup), Pystray, Rich terminal, custom typing popup |
| Infra | Python 3.10+, asyncio, threading, subprocess, FFmpeg (Edge TTS fallback) |
Installation & Environment
1. Clone & virtual environment
git clone https://github.com/thekaifansari01/jarvis-by-kaif-ansari.git
cd jarvis-by-kaif-ansari
python -m venv venv
venv\Scripts\activate # Windows
2. Install dependencies
pip install -r requirements.txt
3. Create .env file (required keys):
GROQ_API_KEY="gsk_..."
GEMINI_API_KEY="AIza..."
TOGETHER_AI="..." # FLUX
TAVILY_API_KEY="tvly-..."
TWILIO_ACCOUNT_SID="..."
TWILIO_AUTH_TOKEN="..."
TWILIO_FROM_NUMBER="..."
CARTESIA_API_KEY="..."
PICOVOICE_ACCESS_KEY="..." # optional, trial key works
4. Gmail OAuth: place credentials.json inside modules/emailManager/
5. Run
python main.py # voice mode + system tray
python main.py test_jarvis # text dev mode
Usage Modes
- Voice (default): Say "Jarvis", wait for the wake sound, then speak a command (Hinglish/English). The agent panel and STT popup appear.
- Text Dev Mode: python main.py test_jarvis (type commands directly, no tray, faster boot).
- Disable Tray: python main.py system_tray=no
- Exit: say "exit", "bye", or "stop", or right-click the tray icon and choose Exit.
Agentic Loop (ReAct)
For multi-step tasks (email, deep research, file operations, web search), Jarvis enters agentic mode: Gemini/Gemma plans, executes tools, observes results, adapts, and finally completes the task. The loop is budget-aware: max 10 steps, timeout 120s, and a retry limit.
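The plan, act, observe cycle with a step and time budget can be sketched as below. This is an illustrative outline under stated assumptions, not the project's loop: `run_agent`, the `Decision` tuple format, and `scripted_plan` are hypothetical, the planner callable stands in for the Gemini/Gemma call, and the retry limit is left out for brevity.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical decision format: ("tool", tool_name, argument) or ("finish", answer, "").
Decision = Tuple[str, str, str]
Planner = Callable[[str, List[str]], Decision]


def run_agent(task: str, plan: Planner, tools: Dict[str, Callable[[str], str]],
              max_steps: int = 10, timeout_s: float = 120.0) -> str:
    """Plan, act, observe, and repeat until the planner finishes or a budget runs out."""
    started = time.monotonic()
    observations: List[str] = []
    for _ in range(max_steps):
        if time.monotonic() - started > timeout_s:
            return "stopped: timeout"
        kind, name, arg = plan(task, observations)
        if kind == "finish":
            return name  # for a finish decision the second field carries the answer
        tool = tools.get(name)
        if tool is None:
            observations.append(f"error: unknown tool {name!r}")
            continue
        try:
            observations.append(tool(arg))
        except Exception as exc:  # feed failures back so the planner can adapt
            observations.append(f"error: {exc}")
    return "stopped: step budget exhausted"


def scripted_plan(task: str, observations: List[str]) -> Decision:
    """Stand-in for the LLM planner: search once, then finish."""
    if not observations:
        return ("tool", "search", task)
    return ("finish", f"answer based on: {observations[-1]}", "")


if __name__ == "__main__":
    demo_tools = {"search": lambda query: f"results for {query}"}
    print(run_agent("llm news", scripted_plan, demo_tools))
```

Returning tool errors as observations, instead of raising, lets the planner see what went wrong and choose a different step within the same budget.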
Tools accessible inside the agent loop:
Integrated Tools
SmartAppOpener: registry + fuzzy cache; opens apps, URLs, web shortcuts.
taskkill with fuzzy matching (exe names).
FLUX generation + AI Horde editing, automatically added to the workspace.
Tavily web + arXiv parallel search, returns structured content.
Twilio, auto-compresses images before upload, then shares a link.
Autonomous multi-turn research, with the final .md report saved to Creations.
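The parallel web and arXiv search mentioned above can be illustrated with `asyncio.gather`. The two search coroutines here are placeholders that return canned data; they are not the Tavily or arXiv client APIs, and `parallel_search` is a hypothetical name for this sketch.

```python
import asyncio
from typing import Dict, List

Hit = Dict[str, str]


async def search_web(query: str) -> List[Hit]:
    """Placeholder for a web search call (returns canned data)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"source": "web", "title": f"Web result for {query}"}]


async def search_arxiv(query: str) -> List[Hit]:
    """Placeholder for an arXiv search call (returns canned data)."""
    await asyncio.sleep(0.01)
    return [{"source": "arxiv", "title": f"Paper about {query}"}]


async def parallel_search(query: str) -> List[Hit]:
    """Run both searches concurrently and merge whatever succeeded."""
    results = await asyncio.gather(
        search_web(query), search_arxiv(query), return_exceptions=True
    )
    merged: List[Hit] = []
    for result in results:
        if isinstance(result, Exception):
            continue  # one failing backend should not sink the other
        merged.extend(result)
    return merged


if __name__ == "__main__":
    for hit in asyncio.run(parallel_search("transformers")):
        print(hit["source"], "-", hit["title"])
```

With `return_exceptions=True`, a failure in one backend is returned as a value, so the other backend's results still reach the caller.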
Memory & RAG
The ContextMemory class uses ChromaDB with Gemini embeddings (768d). It stores conversation history, RAG file chunks, user bio/preferences, and mood history. Conversations are summarised automatically using Groq 120B once the history exceeds 500 messages.
memory.get_relevant_context(query) injects the time, user facts, mood, recent chats, workspace status and RAG matches into the LLM prompt.
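The summarisation rule described above can be sketched in plain Python. `compact_history` and its `keep_recent` parameter are assumptions for illustration (the source only gives the 500-message limit), and the `summarise` callable stands in for the LLM call.

```python
from typing import Callable, List


def compact_history(messages: List[str], summarise: Callable[[List[str]], str],
                    limit: int = 500, keep_recent: int = 100) -> List[str]:
    """Replace older messages with one summary once the history exceeds `limit`.

    `keep_recent` (how many recent messages stay verbatim) is a hypothetical
    parameter; the source does not say how many are kept.
    """
    if len(messages) <= limit:
        return messages
    split = max(len(messages) - keep_recent, 0)
    older, recent = messages[:split], messages[split:]
    return [f"[summary] {summarise(older)}"] + recent


if __name__ == "__main__":
    history = [f"message {i}" for i in range(501)]
    compacted = compact_history(history, lambda old: f"{len(old)} older messages")
    print(len(compacted), compacted[0])
```

Keeping the most recent turns verbatim preserves short-range context, while the summary bounds how much the stored history can grow.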
Voice Pipeline
- Wake word: Porcupine ('jarvis'), on-device, low latency.
- STT: Groq Whisper-Large-V3 (primary) + Google Speech Recognition fallback, language set to 'hi' for Hinglish.
- TTS: Cartesia Sonic-3 (real-time streaming, emotion SSML) + Edge TTS as fallback. Emotion detection (cheerful, sad, thinking) auto-applies voice styles.
- UI: Live STT popup shows listening/transcribed status, plus a typing popup and an agent panel with step/thought animations.
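The primary-plus-fallback pattern used for both STT and TTS above can be sketched generically. `first_success` and the provider names in the usage example are hypothetical; the real pipeline would wrap the Groq, Google, Cartesia, and Edge TTS calls, which are not shown here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_success(providers: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each provider in order and return (name, result) from the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary_stt() -> str:
        raise ConnectionError("primary service unreachable")  # simulated outage

    def fallback_stt() -> str:
        return "open the browser"

    print(first_success([("primary", primary_stt), ("fallback", fallback_stt)]))
```

Returning the provider name alongside the result makes it easy to log when the fallback path was taken.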
Configuration (config.py)
GROQ_ROUTER_MODEL = "llama-3.1-8b-instant"
GROQ_FAST_MODEL = "llama-3.3-70b-versatile"
GEMINI_AGENT_MODEL = "gemma-4-31b-it"
GEMINI_EMBEDDING_MODEL = "gemini-embedding-2"
FLUX_IMAGE_MODEL = "black-forest-labs/FLUX.1-schnell"
EDGE_TTS_VOICE = "hi-IN-MadhurNeural"
AGENT_MAX_STEPS = 10
AGENT_TIMEOUT = 900
All API keys are loaded from .env, and the models are easily swappable.
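A startup check for the keys in .env might look like the sketch below. `missing_keys` and the `REQUIRED_KEYS` list are assumptions for illustration, built from the keys listed in the installation steps; the sketch reads an already-populated environment and does not parse the .env file itself.

```python
import os
from typing import List, Mapping, Sequence

# Keys taken from the installation section; treating all of them as mandatory
# is an assumption of this sketch.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "GEMINI_API_KEY",
    "TOGETHER_AI",
    "TAVILY_API_KEY",
    "CARTESIA_API_KEY",
)


def missing_keys(env: Mapping[str, str] = os.environ,
                 required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All required keys are set.")
```

Checking once at startup surfaces a missing key immediately, instead of as a failed API call in the middle of a voice session.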
Example Commands
Troubleshooting
- Wake word not working: Replace PICOVOICE_ACCESS_KEY with your own (free tier) or adjust the stt.py keyword.
- No audio after wake word: Check the microphone index and ensure PyAudio is installed (pip install pyaudio).
- Gmail authentication fails: Refresh the token and place a fresh credentials.json from Google Cloud Console.
- Image generation fails: Verify the Together AI API key and FLUX model quota.
- ChromaDB errors: Delete Data/jarvis_memory/chroma_db and restart (re-index).