v2.0 · Production Ready

J.A.R.V.I.S Documentation

Just A Rather Very Intelligent System: hybrid LLM architecture, local-first memory, autonomous agentic workflows.

Groq Llama 3.3-70B · Gemini 2.0 / Gemma-4-31B · ChromaDB + Embeddings · Multi-modal (Image/WhatsApp/Email)

System Overview

Jarvis is an advanced AI assistant that combines a hybrid brain (Fast Reflex + Agentic Reasoning) with persistent vector memory, real-time voice, and a rich tool ecosystem. It runs primarily on local infrastructure, respects privacy, and delivers low-latency automation, from opening apps to generating deep research reports.

⚡

Fast Brain

Groq Llama 3.3-70B: instant commands, chat, file open, YouTube.

🧠

Agentic Brain

Gemini / Gemma: multi-step tasks, web research, email, image generation.

💾

Persistent Memory

ChromaDB + Gemini Embeddings (768d) + user bio/mood tracking.

Hybrid Architecture

[Wake Word / Input]
      │
      ▼
┌──────────────┐      Groq Router (Llama 3.1-8B)
│ Intent Router├───────────────────────┐
└──────────────┘                       │
            │ FAST (apps/chat/open)    │ AGENTIC (email/research/write)
            ▼                          ▼
   ┌─────────────────┐        ┌─────────────────────┐
   │   FAST BRAIN    │        │   AGENTIC LOOP      │
   │ (Groq 70B)      │        │ (Gemini/Gemma 31B)  │
   │ response +      │        │ ReAct, max_steps=10 │
   │ apps/urls       │        │ tool calls (search, │
   └─────────────────┘        │ workspace, email)   │
                              └─────────────────────┘
            │                          │
            └──────────┬───────────────┘
                       ▼
            ┌────────────────────────┐
            │   EXECUTOR (sync/async)│
            │ open/close, web, image │
            │ workspace, WhatsApp    │
            └────────────────────────┘

The router classifies each request's intent and routes it to the Fast Brain for low-latency commands or to the Agentic Loop for deep tasks. Ephemeral memory and tool outputs are shared across steps.
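The routing step can be pictured with a small dispatch table. This is only a sketch: the keyword check stands in for the Llama 3.1-8B router call, and the names `classify_intent`, `fast_brain`, `agentic_loop`, and `route` are hypothetical, not taken from the Jarvis codebase.

```python
from typing import Callable, Dict

# Hypothetical keyword hints; the real router asks an LLM to classify intent.
AGENTIC_HINTS = ("email", "research", "write", "report")


def classify_intent(command: str) -> str:
    """Stand-in for the LLM router: return "agentic" or "fast"."""
    text = command.lower()
    return "agentic" if any(hint in text for hint in AGENTIC_HINTS) else "fast"


def fast_brain(command: str) -> str:
    """Placeholder for the low-latency path (apps, chat, open)."""
    return f"[fast] {command}"


def agentic_loop(command: str) -> str:
    """Placeholder for the multi-step path (email, research, write)."""
    return f"[agentic] {command}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "fast": fast_brain,
    "agentic": agentic_loop,
}


def route(command: str) -> str:
    """Classify the command, then dispatch to the matching brain."""
    return HANDLERS[classify_intent(command)](command)
```

Keeping the handlers in a dictionary means a third route could be added without touching `route` itself.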

Core Capabilities

🎤 Voice AI

Porcupine wake word + Groq Whisper-Large STT + Cartesia Sonic-3 streaming TTS (emotion-aware).

📁 Workspace Manager

Virtual file system (Creations/Vault/Temp), fuzzy search, rename/move, auto registry.

🖼️ Image Gen/Edit

FLUX (Together AI) + AI Horde editing, auto-saved to workspace.

📧 Email & WhatsApp

Gmail API (send with attachments), Twilio WhatsApp with image compression.

🔍 Deep Research

Autonomous agent compiles reports from web (Tavily) + arXiv and saves them as .md to Creations.

🧠 RAG Memory

ChromaDB + Gemini embeddings, conversation summarisation, long-term user bio, mood tracking.

Tech Stack

Category      Technologies
LLMs          Groq (Llama 3.3-70B, Llama 3.1-8B), Gemini 2.0 Flash, Gemma-4-31B
Vector DB     ChromaDB, Gemini Embedding-2 (768d)
Voice         Porcupine, Groq Whisper-Large-V3, Cartesia Sonic-3, Edge TTS fallback
Tools         Tavily, arXiv, Together AI (FLUX), AI Horde, Twilio, Gmail API, PyWhatKit
Frontend UI   PyQt5 (agent panel + STT popup), Pystray, Rich terminal, custom typing popup
Infra         Python 3.10+, asyncio, threading, subprocess, FFmpeg (Edge TTS fallback)

Installation & Environment

1. Clone & virtual environment

git clone https://github.com/thekaifansari01/jarvis-by-kaif-ansari.git
cd jarvis-by-kaif-ansari
python -m venv venv
venv\Scripts\activate  # Windows

2. Install dependencies

pip install -r requirements.txt

3. Create .env file (required keys):

GROQ_API_KEY="gsk_..."
GEMINI_API_KEY="AIza..."
TOGETHER_AI="..."    # FLUX
TAVILY_API_KEY="tvly-..."
TWILIO_ACCOUNT_SID="..."
TWILIO_AUTH_TOKEN="..."
TWILIO_FROM_NUMBER="..."
CARTESIA_API_KEY="..."
PICOVOICE_ACCESS_KEY="..."   # optional, trial key works

4. Gmail OAuth: place credentials.json inside modules/emailManager/

5. Run

python main.py           # voice mode + system tray
python main.py test_jarvis   # text dev mode

Usage Modes

  • 🎙️ Voice (default): Say "Jarvis", wait for the wake sound, then speak the command (Hinglish/English). The agent panel and STT popup appear.
  • ⌨️ Text Dev Mode: python main.py test_jarvis lets you type commands directly; no tray, faster boot.
  • 🖥️ Disable Tray: python main.py system_tray=no
  • Exit: say "exit", "bye", or "stop", or right-click the tray icon → Exit.
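The launch arguments above could be interpreted along these lines. This is a sketch under assumptions: `LaunchOptions` and `parse_launch_args` are hypothetical names, and how main.py actually reads its arguments is not shown in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchOptions:
    text_mode: bool = False   # True when "test_jarvis" is passed
    system_tray: bool = True  # False when "system_tray=no" is passed


def parse_launch_args(argv: List[str]) -> LaunchOptions:
    """Map the documented launch arguments onto a small options object."""
    opts = LaunchOptions()
    for arg in argv:
        if arg == "test_jarvis":
            opts.text_mode = True
            opts.system_tray = False  # text dev mode runs without the tray
        elif arg.lower() == "system_tray=no":
            opts.system_tray = False
    return opts
```

A caller would typically pass `sys.argv[1:]`, so that `python main.py test_jarvis` yields `text_mode=True`.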

Agentic Loop (ReAct)

For multi-step tasks (email, deep research, file operations, web search) Jarvis enters agentic mode: Gemini/Gemma plans, executes tools, observes results, adapts, and finally completes. The loop is budget-aware: at most 10 steps (AGENT_MAX_STEPS), a timeout (AGENT_TIMEOUT in config.py), and a retry limit.

✅ Tools accessible inside the agent loop:

search_actions (web/arxiv), workspace_action, email_action, whatsapp_action, image_command, deep_research, apps_to_open/close
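A budgeted plan/act/observe loop of this kind can be sketched as follows. The planner here is a plain function standing in for the Gemini/Gemma call, and `run_agent`, `demo_planner`, and the return strings are illustrative assumptions rather than the project's actual code. The retry limit is left out to keep the sketch short.

```python
import time
from typing import Callable, Dict, List, Tuple

# A planner returns (tool_name, tool_input), or ("finish", final_answer).
Planner = Callable[[str, List[str]], Tuple[str, str]]
Tool = Callable[[str], str]


def run_agent(task: str, planner: Planner, tools: Dict[str, Tool],
              timeout_s: float, max_steps: int = 10) -> str:
    """Run plan -> act -> observe until the planner finishes or a budget runs out."""
    observations: List[str] = []
    deadline = time.monotonic() + timeout_s
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            return "stopped: timeout"
        action, arg = planner(task, observations)  # plan
        if action == "finish":
            return arg
        tool = tools.get(action)
        if tool is None:
            # Feed the error back so the planner can adapt on the next step.
            observations.append(f"error: unknown tool {action!r}")
            continue
        observations.append(tool(arg))  # act and observe
    return "stopped: step budget exhausted"


def demo_planner(task: str, observations: List[str]) -> Tuple[str, str]:
    """Toy planner: search once, then answer from the last observation."""
    if not observations:
        return ("search", task)
    return ("finish", f"answer based on: {observations[-1]}")
```

Passing tool errors back as observations, instead of raising, lets the planner pick a different tool on the next step while still counting against the step budget.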

Integrated Tools

open_any

SmartAppOpener: registry + fuzzy cache; opens apps, URLs, and web shortcuts.

close_any

taskkill with fuzzy matching (exe names).
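Fuzzy matching a spoken app name against running executables can be approximated with the standard library's `difflib`. This is a sketch only: `match_process` and `build_taskkill_command` are hypothetical helpers, the 0.6 cutoff is an arbitrary choice, and the real close_any tool may match names differently.

```python
import difflib
from typing import List, Optional


def _stem(exe: str) -> str:
    """Lower-case the name and drop a trailing ".exe" for comparison."""
    name = exe.strip().lower()
    return name[:-4] if name.endswith(".exe") else name


def match_process(spoken_name: str, running: List[str],
                  cutoff: float = 0.6) -> Optional[str]:
    """Return the running executable closest to the spoken name, if any."""
    stems = {_stem(exe): exe for exe in running}
    hits = difflib.get_close_matches(_stem(spoken_name), list(stems),
                                     n=1, cutoff=cutoff)
    return stems[hits[0]] if hits else None


def build_taskkill_command(exe: str) -> List[str]:
    """Build (but do not run) a Windows taskkill call for one image name."""
    return ["taskkill", "/IM", exe, "/F"]
```

Returning the command as a list keeps it ready for `subprocess.run` without shell quoting, and keeps the sketch free of side effects.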

generate_image

FLUX generation + AI Horde editing; results are added to the workspace automatically.

search_hub

Tavily web + arXiv parallel search, returns structured content.
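Running two searches in parallel maps naturally onto `asyncio.gather`. In this sketch both backends are placeholders that return canned strings; the function names and the result shape are assumptions, and no Tavily or arXiv client is actually called.

```python
import asyncio
from typing import Dict, List


async def search_web(query: str) -> List[str]:
    """Placeholder for a Tavily request; returns a canned result."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"web result for {query}"]


async def search_arxiv(query: str) -> List[str]:
    """Placeholder for an arXiv request; returns a canned result."""
    await asyncio.sleep(0)
    return [f"arXiv result for {query}"]


async def search_hub(query: str) -> Dict[str, List[str]]:
    """Run both searches concurrently and return them keyed by source."""
    web, arxiv = await asyncio.gather(search_web(query), search_arxiv(query))
    return {"web": web, "arxiv": arxiv}
```

From synchronous code, `asyncio.run(search_hub("transformer agents"))` returns the combined dictionary; total latency is roughly that of the slower backend rather than the sum of both.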

messenger (WhatsApp)

Twilio; images are auto-compressed before upload and shared as a link.

deep_research

Autonomous multi-turn research; the final .md report is saved to Creations.

Memory & RAG

The ContextMemory class uses ChromaDB with Gemini embeddings (768d). It stores conversation history, RAG file chunks, user bio/preferences, and mood history. Conversations are summarised automatically with Groq 120B once the history exceeds 500 messages.

memory.get_relevant_context(query) injects the time, user facts, mood, recent chats, workspace status, and RAG matches into the LLM prompt.
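The idea behind retrieval and context injection can be shown without ChromaDB. The sketch below swaps in a plain cosine-similarity ranking over toy 2-dimensional vectors (the real system uses 768-dimensional Gemini embeddings), and `top_matches` and `build_context` are hypothetical names, not the ContextMemory API.

```python
import math
from datetime import datetime
from typing import List, Optional, Sequence, Tuple

Chunk = Tuple[str, Sequence[float]]  # (text, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec: Sequence[float], chunks: List[Chunk],
                k: int = 2) -> List[str]:
    """Return the k stored texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_context(query_vec: Sequence[float], chunks: List[Chunk],
                  user_facts: List[str], mood: str, recent: List[str],
                  now: Optional[datetime] = None) -> str:
    """Assemble the block of context that is prepended to the LLM prompt."""
    now = now or datetime.now()
    return "\n".join([
        f"Time: {now:%Y-%m-%d %H:%M}",
        "User facts: " + "; ".join(user_facts),
        f"Mood: {mood}",
        "Recent chat: " + " | ".join(recent),
        "Relevant memory: " + " | ".join(top_matches(query_vec, chunks)),
    ])
```

A vector database performs the ranking step with an index instead of a full sort, but the shape of the result, a handful of nearest chunks merged with user state, is the same.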

Voice Pipeline

  • Wake word: Porcupine ('jarvis'), on-device, low latency.
  • STT: Groq Whisper-Large-V3 (primary) + Google Speech Recognition fallback, language set to 'hi' for Hinglish.
  • TTS: Cartesia Sonic-3 (real-time streaming, emotion SSML) + Edge TTS as fallback. Emotion detection (cheerful, sad, thinking) auto-applies voice styles.
  • UI: Live STT popup shows listening/transcribed status; a typing popup and agent panel show step/thought animations.
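The primary-plus-fallback pattern used for STT (and similarly for TTS) can be sketched as an ordered list of engines tried in turn. The engines are passed in as plain callables, so nothing here calls Groq or Google; `transcribe_with_fallback` is a hypothetical helper, not the project's function.

```python
from typing import Callable, List, Sequence, Tuple

# (engine name, function that turns raw audio bytes into text)
Engine = Tuple[str, Callable[[bytes], str]]


def transcribe_with_fallback(audio: bytes,
                             engines: Sequence[Engine]) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, transcript) from the first success."""
    errors: List[str] = []
    for name, engine in engines:
        try:
            text = engine(audio)
        except Exception as exc:  # any provider failure triggers the fallback
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty transcript")
    raise RuntimeError("all STT engines failed: " + "; ".join(errors))
```

Treating an empty transcript as a failure lets the fallback engine take over when the primary returns nothing usable instead of raising.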

Configuration (config.py)

GROQ_ROUTER_MODEL = "llama-3.1-8b-instant"
GROQ_FAST_MODEL = "llama-3.3-70b-versatile"
GEMINI_AGENT_MODEL = "gemma-4-31b-it"
GEMINI_EMBEDDING_MODEL = "gemini-embedding-2"
FLUX_IMAGE_MODEL = "black-forest-labs/FLUX.1-schnell"
EDGE_TTS_VOICE = "hi-IN-MadhurNeural"
AGENT_MAX_STEPS = 10
AGENT_TIMEOUT = 900

All API keys are loaded from .env, and models can be swapped by editing these constants.
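A startup check for the required .env keys might look like the sketch below. The key names come from the installation section; the `missing_keys` helper is hypothetical, and it assumes the .env file has already been loaded into the process environment by whatever loader the project uses.

```python
import os
from typing import List, Mapping, Optional

# Keys listed as required in the installation section.
# PICOVOICE_ACCESS_KEY is documented as optional, so it is not checked here.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "GEMINI_API_KEY",
    "TOGETHER_AI",
    "TAVILY_API_KEY",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_FROM_NUMBER",
    "CARTESIA_API_KEY",
)


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are absent or blank in the environment."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_keys()` before starting the voice loop gives one clear message about absent keys instead of a failure deep inside a tool call.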

Example Commands

🗣️ "Jarvis, chrome khol de" ("Jarvis, open Chrome") → opens Chrome
📧 "Kaif ko email bhejo subject meeting, body 2pm, attach report.md" ("Send Kaif an email, subject meeting, body 2pm, attach report.md") → agentic email with attachment
🔬 "Deep research on transformer agents 2025 report bana" ("Do deep research on transformer agents 2025 and make a report") → initiates deep research → saves .md to Creations
🖼️ "Edit sunset.png, add a moon" → AI Horde img2img
📁 "Workspace list" / "open my_note.txt" → workspace manager

Troubleshooting

  • 🔊 Wake word not working: Replace PICOVOICE_ACCESS_KEY with your own (free tier) or adjust the keyword in stt.py.
  • 🎙️ No audio after wake word: Check the microphone index and ensure PyAudio is installed (pip install pyaudio).
  • 📧 Gmail authentication fails: Refresh the token or place a fresh credentials.json from Google Cloud Console in modules/emailManager/.
  • 🖼️ Image generation fails: Verify the Together AI API key and FLUX model quota.
  • 🧠 ChromaDB errors: Delete Data/jarvis_memory/chroma_db and restart (this re-indexes the memory).