v2.0 · Production Ready

J.A.R.V.I.S Documentation

Just A Rather Very Intelligent System — hybrid LLM architecture, local-first memory, autonomous agentic workflows.

Groq Llama 3.3-70B · Gemini 2.0 / Gemma-4-31B · ChromaDB + Embeddings · Multi‑modal (Image/WhatsApp/Email)

System Overview

Jarvis is an advanced AI assistant that combines a hybrid brain (Fast Reflex + Agentic Reasoning) with persistent vector memory, real‑time voice, and a rich tool ecosystem. It runs primarily on local infrastructure, respects privacy, and delivers low‑latency automation — from opening apps to generating deep research reports.

⚡

Fast Brain

Groq Llama 3.3-70B — instant commands, chat, file open, YouTube.

🧠

Agentic Brain

Gemini / Gemma — multi-step tasks, web research, email, image generation.

💾

Persistent Memory

ChromaDB + Gemini Embeddings (768d) + user bio/mood tracking.

Hybrid Architecture

[Wake Word / Input]
      │
      ▼
┌──────────────┐      Groq Router (Llama 3.1-8B)
│ Intent Router├───────────────────────┐
└──────────────┘                       │
            │ FAST (apps/chat/open)    │ AGENTIC (email/research/write)
            ▼                          ▼
   ┌─────────────────┐        ┌─────────────────────┐
   │   FAST BRAIN    │        │   AGENTIC LOOP      │
   │ (Groq 70B)      │        │ (Gemini/Gemma 31B)  │
   │ response +      │        │ ReAct, max_steps=10 │
   │ apps/urls       │        │ tool calls (search, │
   └─────────────────┘        │ workspace, email)   │
                               └─────────────────────┘
            │                          │
            └──────────┬───────────────┘
                       ▼
            ┌────────────────────────┐
            │   EXECUTOR (sync/async)│
            │ open/close, web, image │
            │ workspace, WhatsApp    │
            └────────────────────────┘

The router classifies the intent of each request and sends it either to the Fast Brain for low‑latency commands or to the Agentic Loop for multi‑step tasks. Ephemeral memory and tool outputs are shared across steps.
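The routing decision described above can be sketched as follows. This is a minimal illustration, not the project's router: the labels `fast` and `agentic`, the `route` function, and the keyword heuristic standing in for the Llama 3.1-8B call are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical route labels; the real router's output format is not shown here.
FAST = "fast"
AGENTIC = "agentic"


def keyword_classifier(text: str) -> str:
    """Stand-in for the LLM router call: a crude keyword heuristic."""
    agentic_hints = ("email", "research", "write", "report")
    lowered = text.lower()
    return AGENTIC if any(hint in lowered for hint in agentic_hints) else FAST


def route(
    text: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
) -> str:
    """Classify the request, then dispatch it to the matching handler."""
    label = classify(text)
    # Unknown labels fall back to the fast path so a malformed
    # router reply never leaves the user without a response.
    handler = handlers.get(label, handlers[FAST])
    return handler(text)


# Placeholder handlers; in the real system these would be the two brains.
handlers = {
    FAST: lambda t: f"[fast] {t}",
    AGENTIC: lambda t: f"[agentic] {t}",
}

print(route("open chrome", keyword_classifier, handlers))
print(route("send an email to Kaif", keyword_classifier, handlers))
```

Passing the classifier in as a function keeps the dispatch logic testable without any network call; swapping the heuristic for a real model call would not change `route`.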

Core Capabilities

🎤 Voice AI

Porcupine wake word + Groq Whisper‑Large STT + Cartesia Sonic‑3 streaming TTS (emotion-aware).

📁 Workspace Manager

Virtual file system (Creations/Vault/Temp), fuzzy search, rename/move, auto registry.

🖼️ Image Gen/Edit

FLUX (Together AI) + AI Horde editing, auto‑saved to workspace.

📧 Email & WhatsApp

Gmail API (send with attachments), Twilio WhatsApp with image compression.

🔍 Deep Research

Autonomous agent compiles reports from web (Tavily) + arXiv → saves .md to Creations.

🧠 RAG Memory

ChromaDB + Gemini embeddings, conversation summarisation, long‑term user bio, mood tracking.

Tech Stack

Category	Technologies
LLMs	Groq (Llama 3.3-70B, Llama 3.1-8B), Gemini 2.0 Flash, Gemma-4-31B
Vector DB	ChromaDB, Gemini Embedding-2 (768d)
Voice	Porcupine, Groq Whisper-Large-V3, Cartesia Sonic-3, Edge TTS fallback
Tools	Tavily, arXiv, Together AI (FLUX), AI Horde, Twilio, Gmail API, PyWhatKit
Frontend UI	PyQt5 (agent panel + STT popup), Pystray, Rich terminal, custom typing popup
Infra	Python 3.10+, asyncio, threading, subprocess, FFmpeg (Edge TTS fallback)

Installation & Environment

1. Clone & virtual environment

git clone https://github.com/thekaifansari01/jarvis-by-kaif-ansari.git
cd jarvis-by-kaif-ansari
python -m venv venv
venv\Scripts\activate  # Windows

2. Install dependencies

pip install -r requirements.txt

3. Create .env file (required keys):

GROQ_API_KEY="gsk_..."
GEMINI_API_KEY="AIza..."
TOGETHER_AI="..."    # FLUX
TAVILY_API_KEY="tvly-..."
TWILIO_ACCOUNT_SID="..."
TWILIO_AUTH_TOKEN="..."
TWILIO_FROM_NUMBER="..."
CARTESIA_API_KEY="..."
PICOVOICE_ACCESS_KEY="..."   # optional, trial key works

4. Gmail OAuth: place credentials.json inside modules/emailManager/

5. Run

python main.py           # voice mode + system tray
python main.py test_jarvis   # text dev mode

Usage Modes

🎙️ Voice (default): Say "Jarvis" → wake sound → speak command (Hinglish/English). Agent panel + STT popup appear.
⌨️ Text Dev Mode: python main.py test_jarvis → type commands directly, no tray, faster boot.
🖥️ Disable Tray: python main.py system_tray=no
Exit: say "exit", "bye" or "stop", or right-click the tray icon → Exit.
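The launch arguments listed above (`test_jarvis`, `system_tray=no`) could be interpreted along these lines. The `parse_args` function and the option names in the returned dictionary are hypothetical; how main.py actually reads its arguments is not shown in this document.

```python
from typing import Dict, List


def parse_args(argv: List[str]) -> Dict[str, bool]:
    """Translate positional launch arguments into boolean options."""
    opts = {"text_mode": False, "system_tray": True}
    for arg in argv:
        if arg == "test_jarvis":
            # Text dev mode: typed commands and no tray icon.
            opts["text_mode"] = True
            opts["system_tray"] = False
        elif arg.startswith("system_tray="):
            value = arg.split("=", 1)[1].lower()
            opts["system_tray"] = value not in ("no", "false", "0")
    return opts


print(parse_args(["system_tray=no"]))
```

In a real entry point this would be called as `parse_args(sys.argv[1:])`.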

Agentic Loop (ReAct)

For multi‑step tasks (email, deep research, file operations, web search) Jarvis enters agentic mode: Gemini/Gemma plans, executes tools, observes results, adapts, and finally completes. Budget‑aware: max 10 steps, timeout 120s, retry limit.

✅ Tools accessible inside agent loop:

search_actions (web/arxiv) · workspace_action · email_action · whatsapp_action · image_command · deep_research · apps_to_open/close
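A budgeted plan/act/observe loop of the kind described in this section might look like the sketch below. It is an illustration under stated assumptions: `run_agent`, the `Step` tuple format, and the scripted planner are invented for the example, and the planner stands in for the Gemini/Gemma call. The defaults mirror the budget quoted above (10 steps, 120 s).

```python
import time
from typing import Callable, Dict, List, Tuple

# A step is ("tool", tool_name, argument) or ("final", answer, "").
Step = Tuple[str, str, str]


def run_agent(
    task: str,
    plan: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
    timeout_s: float = 120.0,
) -> str:
    """Run plan -> act -> observe until a final answer or a budget limit."""
    observations: List[str] = []
    deadline = time.monotonic() + timeout_s
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            return "stopped: timeout"
        kind, name, arg = plan(task, observations)
        if kind == "final":
            return name
        tool = tools.get(name)
        if tool is None:
            # Feed the error back so the planner can adapt on the next step.
            observations.append(f"error: unknown tool {name!r}")
            continue
        try:
            observations.append(tool(arg))
        except Exception as exc:
            observations.append(f"error: {exc}")
    return "stopped: step budget exhausted"


def scripted_plan(task: str, observations: List[str]) -> Step:
    """Stand-in planner: search once, then finish with the result."""
    if not observations:
        return ("tool", "search", task)
    return ("final", f"done: {observations[-1]}", "")


tools = {"search": lambda q: f"results for {q}"}
print(run_agent("transformer agents", scripted_plan, tools))
```

Tool errors are recorded as observations instead of being raised, which lets the planner see the failure and choose a different action while the step budget still bounds the total work.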

Integrated Tools

open_any

SmartAppOpener: registry + fuzzy cache → opens apps, URLs, web shortcuts.

close_any

taskkill with fuzzy matching (exe names).

generate_image

FLUX generation + AI Horde editing; results are added to the workspace automatically.

search_hub

Tavily web + arXiv parallel search, returns structured content.

messenger (WhatsApp)

Twilio, auto‑compresses images before upload → link sharing.

deep_research

Autonomous multi‑turn research → final .md report saved to Creations.
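The fuzzy matching that open_any and close_any rely on can be approximated with the standard library. This sketch assumes a small hypothetical registry (`APP_REGISTRY`) and helper names (`resolve_app`, `build_close_command`); the project's SmartAppOpener may use a different matcher and data source. The command is only built here, not executed.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical registry mapping spoken names to executable names.
APP_REGISTRY: Dict[str, str] = {
    "chrome": "chrome.exe",
    "notepad": "notepad.exe",
    "spotify": "Spotify.exe",
    "vscode": "Code.exe",
}


def resolve_app(
    spoken: str,
    registry: Dict[str, str] = APP_REGISTRY,
    cutoff: float = 0.6,
) -> Optional[str]:
    """Return the executable for an exact or close-enough spoken name."""
    key = spoken.strip().lower()
    if key in registry:
        return registry[key]
    matches = difflib.get_close_matches(key, list(registry), n=1, cutoff=cutoff)
    return registry[matches[0]] if matches else None


def build_close_command(exe_name: str) -> List[str]:
    """Build (but do not run) a Windows taskkill command for an executable."""
    return ["taskkill", "/IM", exe_name, "/F"]


exe = resolve_app("crome")  # misheard "chrome"
if exe is not None:
    print(build_close_command(exe))
```

The `cutoff` value trades recall for safety: a lower value tolerates more transcription noise but raises the chance of closing the wrong application.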

Memory & RAG

The ContextMemory class uses ChromaDB with Gemini embeddings (768d). It stores conversation history, RAG file chunks, user bio/preferences, and mood history. Once the history exceeds 500 messages, it is summarised automatically using Groq 120B.

memory.get_relevant_context(query) → injects time, user facts, mood, recent chats, workspace status & RAG matches into LLM prompt.
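The retrieval step behind the RAG matches can be illustrated with a toy in-memory store. In the project ChromaDB performs the nearest-neighbour search over 768-dimensional Gemini embeddings; the 3-dimensional vectors, the `cosine` and `top_k` helpers, and the choice of cosine similarity below are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors standing in for real embeddings.
store = [
    ("User likes dark mode", [1.0, 0.0, 0.0]),
    ("Meeting at 2pm", [0.0, 1.0, 0.0]),
    ("Prefers Hinglish replies", [0.9, 0.1, 0.0]),
]

print(top_k([1.0, 0.0, 0.0], store, k=2))
```

The retrieved texts would then be concatenated with the time, user facts, mood, and recent chats to form the context injected into the prompt.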

Voice Pipeline

Wake word: Porcupine ('jarvis') on‑device, low latency.
STT: Groq Whisper‑Large‑V3 (primary) + Google Speech Recognition fallback, language set to 'hi' for Hinglish.
TTS: Cartesia Sonic‑3 (real‑time streaming, emotion SSML) + Edge TTS as fallback. Emotion detection (cheerful, sad, thinking) auto‑applies voice styles.
UI: Live STT popup shows listening/transcribed status, typing popup & agent panel with step/thought animations.
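The primary-plus-fallback pattern used for both STT and TTS can be sketched as a provider chain. The `first_success` helper and the two stub providers are hypothetical; they stand in for the Groq Whisper and Google Speech Recognition calls, whose real client code is not shown in this document.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[bytes], str]]


def first_success(providers: Sequence[Provider], audio: bytes) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, transcript)."""
    errors: List[str] = []
    for name, transcribe in providers:
        try:
            text = transcribe(audio)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # An empty transcript counts as a failure so the fallback gets a turn.
        errors.append(f"{name}: empty transcript")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_stub(audio: bytes) -> str:
    """Simulates the primary service being unreachable."""
    raise ConnectionError("service unavailable")


def fallback_stub(audio: bytes) -> str:
    """Simulates a working fallback service."""
    return "chrome khol de"


chain = [("primary", primary_stub), ("fallback", fallback_stub)]
print(first_success(chain, b""))
```

Collecting the per-provider errors makes the final exception useful for troubleshooting when every service in the chain fails.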

Configuration (config.py)

GROQ_ROUTER_MODEL = "llama-3.1-8b-instant"
GROQ_FAST_MODEL = "llama-3.3-70b-versatile"
GEMINI_AGENT_MODEL = "gemma-4-31b-it"
GEMINI_EMBEDDING_MODEL = "gemini-embedding-2"
FLUX_IMAGE_MODEL = "black-forest-labs/FLUX.1-schnell"
EDGE_TTS_VOICE = "hi-IN-MadhurNeural"
AGENT_MAX_STEPS = 10
AGENT_TIMEOUT = 900

All API keys are loaded from .env, and the models are easily swappable.
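One way to validate the environment and keep model names overridable is sketched below. The `REQUIRED_KEYS` subset and the `missing_keys` and `model_setting` helpers are assumptions for illustration; the project's config.py may load and check its values differently.

```python
import os
from typing import List, Mapping, Sequence

# Assumed subset of the keys listed in the installation section.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "GEMINI_API_KEY",
    "TOGETHER_AI",
    "TAVILY_API_KEY",
    "CARTESIA_API_KEY",
)


def missing_keys(
    env: Mapping[str, str],
    required: Sequence[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return the required keys that are absent or empty in the environment."""
    return [key for key in required if not env.get(key)]


def model_setting(name: str, default: str, env: Mapping[str, str] = os.environ) -> str:
    """Read a model name from the environment, falling back to the default."""
    return env.get(name, default)


absent = missing_keys(os.environ)
if absent:
    print("Missing keys:", ", ".join(absent))

print(model_setting("GROQ_FAST_MODEL", "llama-3.3-70b-versatile"))
```

Checking for missing keys at startup surfaces configuration problems before the first API call fails at runtime.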

Example Commands

🗣️ "Jarvis, chrome khol de" → opens Chrome

📧 "Kaif ko email bhejo subject meeting, body 2pm, attach report.md" → agentic email with attachment

🔬 "Deep research on transformer agents 2025 report bana" → initiates deep research → saves .md to Creations

🖼️ "Edit sunset.png, add a moon" → AI Horde img2img

📁 "Workspace list" / "open my_note.txt" → workspace manager

Troubleshooting

🔊 Wake word not working: Replace PICOVOICE_ACCESS_KEY with your own (free tier) or adjust the keyword in stt.py.
🎙️ No audio after wake word: Check the microphone index and make sure PyAudio is installed (pip install pyaudio).
📧 Gmail authentication fails: Refresh the token, or place a fresh credentials.json from Google Cloud Console in modules/emailManager/.
🖼️ Image generation fails: Verify Together AI API key & FLUX model quota.
🧠 ChromaDB errors: Delete Data/jarvis_memory/chroma_db and restart (re‑index).