Casper the friendly ghost on a Raspberry Pi 5

When I got a Raspberry Pi 5 as a Christmas present, I stared at the box for a while thinking about what to do with it. Instead of letting it gather dust on my shelf, I decided to breathe some life into it and name it after my childhood idol: Casper, the friendly ghost.

I've been learning Chinese, and honestly, the daily Anki routine was getting old - read English, answer in Chinese, repeat. I wanted something more dynamic, more unexpected. That's when it clicked: AI + Anki integration could give me real practice sessions. So I built it.

But once it worked, it felt... limited. A bit stupid, even. So I added Perplexity for research queries. Then calendar management seemed like a natural fit. Then task execution. The real pain point? Constantly tweaking prompts to make Casper correct my Chinese mistakes, keep responses short, remember my preferences - all that stuff. That's when I realized I needed to give it memory.

What started as a simple Chinese practice buddy evolved into a full-fledged personal AI agent. And it cost me about $200 to build. Here's how I did it.

What Can Casper Do?

Small demo of Casper in action

Anki Practice Sessions

In the morning, I just say "Hey Casper, let's do Anki." For those unfamiliar, Anki is spaced repetition software - basically flashcards on steroids. It's my #1 memorization tool for everything from basic AI concepts to Chinese vocabulary. The problem with traditional Anki is you end up memorizing the exact phrasing of your own questions. Ask yourself "What's the Chinese word for apple?" enough times and you're just pattern matching, not actually learning.

Anki flashcard flow

Casper fixes this. It asks me about words in different ways and paraphrases questions. The small display shows characters and pinyin when I need visual backup, but the interaction stays fresh.

Research While You Work

This is where Perplexity integration shines. I can ask Casper to look something up mid-conversation: "Hey, what's the deal with constitutional AI?" It searches, reads me a summary, and here's the cool part - I can tell it to create an Anki card about it right there. Or I'll ask it to research something in the background while I'm coding. "Find me recent papers on inference optimization" - and ten minutes later it has a summary ready.

Calendar & Tasks

Plain English commands: "Schedule a tennis practice session for this Friday at 9am." Done. But the real flexibility comes from the Telegram bot and webhooks I set up. Sometimes I'm not near the Pi. So I text Casper: "Add dentist appointment Friday 2pm" or send a voice message: "Remind me to review the deployment docs tomorrow." Commands in text or audio, from anywhere. The to-do list grows, meetings get scheduled, and I don't have to break flow.

Telegram bot: Text and voice commands to manage tasks and calendar
Google Calendar: Event created via voice message through Telegram

Memory That Learns

After every conversation, Casper stores a short summary and extracts facts - about me, my preferences, context from what we talked about. Before the agent runs, it grabs related memories and appends them to the prompt. My user profile gets built up over time and fed right into the context.

What this means in practice: Casper knows I want short answers. It remembers my Chinese level. It picked up that I prefer corrections to be gentle but direct. My preferences aren't hardcoded - they're learned and adapted. The prompt gets steered by accumulated context instead of me saying "keep it brief" every damn time.

Hardware Setup

The build is straightforward. Here's what you need:

  • Raspberry Pi 5 - 4GB RAM, 32GB SD card
  • Display - Small 128x96 pixel screen (just enough for context)
  • Audio - Anker PowerConf A3301 (you can go cheaper with any basic speaker + mic combo)
  • Internet - Stable connection with 8Mbps+ for Gemini Live API and other API calls

You can squeeze everything in for around $200, depending on what you already have and what deals you find.

Casper's Brain: Main Components

Casper's architecture: Three interfaces sharing one brain

Three Parallel Interfaces

Casper runs three interfaces simultaneously - voice assistant, Telegram bot, and webhook server. All three share the same Memory Manager instance and MCP Registry. This means learning propagates across interfaces. A preference you mention via Telegram gets injected into the voice assistant's context. A task you schedule through the webhook shows up in all interfaces.

The state manager orchestrates everything using asyncio. Voice runs in the main loop (blocking on wake word detection or active conversation), while Telegram and webhook run as background tasks:

webhook_task = asyncio.create_task(webhook_server.start())
telegram_task = asyncio.create_task(telegram_bot.start())
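Filled out a little, the wiring looks roughly like this. It's a sketch of the pattern rather than Casper's exact code - voice_assistant and its two methods are illustrative names for the voice loop described above:

import asyncio

async def main():
    # Telegram and webhook run as background tasks; voice owns the main loop.
    webhook_task = asyncio.create_task(webhook_server.start())
    telegram_task = asyncio.create_task(telegram_bot.start())
    try:
        while True:
            await voice_assistant.wait_for_wake_word()   # blocks until "Casper" is heard
            await voice_assistant.run_conversation()     # Gemini Live session
    finally:
        webhook_task.cancel()
        telegram_task.cancel()

asyncio.run(main())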

Voice Pipeline: Vosk + Gemini Live

Voice starts with wake word detection. I'm using Vosk's vosk-model-small-en-us-0.15 - a 40MB model that runs on CPU. Audio comes in at 16kHz, processed in 512-frame chunks. The recognizer checks both final and partial results, so detection happens fast (~500ms after you say "Casper").
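A minimal version of that wake-word loop - assuming PyAudio for capture and the Vosk model folder sitting next to the script - looks something like this:

import json
import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=512)

def heard_wake_word() -> bool:
    data = stream.read(512, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        text = json.loads(recognizer.Result()).get("text", "")
    else:
        text = json.loads(recognizer.PartialResult()).get("partial", "")
    return "casper" in text

while not heard_wake_word():
    pass  # keep listening until "Casper" shows up in a final or partial result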

Once triggered, Gemini Live API takes over. This is the gemini-2.5-flash-native-audio-preview model that handles STT, LLM, and TTS in one unified pipeline. Audio streams bidirectionally over WebSocket. Input is 16kHz PCM, output is 24kHz. First response token hits around 800ms, which is fast enough for natural conversation.

The tricky part: ending sessions. Gemini Live doesn't have a built-in "goodbye" detector. So I added a custom function declaration called session_end. The model learns to call it when the conversation wraps up. When that function fires, I set a flag that breaks the conversation loop and returns to wake word detection.
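The declaration itself is tiny. Here's roughly what it looks like with the google-genai SDK - the handler and the state object are illustrative, not Casper's exact code:

from google.genai import types

# A parameter-less tool the model can call when the user says goodbye.
session_end_tool = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="session_end",
        description="Call this when the user ends or clearly wraps up the conversation.",
    )
])

def handle_tool_call(tool_call, state):
    # Runs inside the Live receive loop: flip the flag, fall back to wake word detection.
    for call in tool_call.function_calls:
        if call.name == "session_end":
            state.conversation_active = False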

Agent API for Text

Telegram and webhook don't use Live API. They use the standard Gemini 1.5 Flash API (gemini-1.5-flash). Why separate? Live API is optimized for streaming audio with low latency. The text API is more stable for request/response patterns and supports audio file uploads (for Telegram voice messages) without streaming overhead.

Both APIs support function calling. The agent builds a chat session with system instruction, conversation history, and available tool declarations. When the model responds with function calls, I execute them via the Task Executor, send results back, and the model generates the final response.
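A rough sketch of that round trip with the google-genai SDK - system_prompt, history, tool_declarations, and task_executor stand in for the pieces described above:

from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

async def handle_message(user_text: str) -> str:
    chat = client.chats.create(
        model="gemini-1.5-flash",
        history=history,  # prior turns, if any
        config=types.GenerateContentConfig(
            system_instruction=system_prompt,  # profile + relevant facts baked in
            tools=[types.Tool(function_declarations=tool_declarations)],
        ),
    )
    response = chat.send_message(user_text)
    while response.function_calls:  # the model asked for one or more tools
        parts = []
        for call in response.function_calls:
            result = await task_executor.execute(call.name, dict(call.args))
            parts.append(types.Part.from_function_response(
                name=call.name, response={"result": result}))
        response = chat.send_message(parts)  # hand results back for the final answer
    return response.text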

Sessions auto-reset after 20 messages to prevent context overflow. When that happens, I store a summary in memory and keep the last 4 messages for continuity.
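In sketch form, with summarize and the memory call standing in for the real thing:

MAX_MESSAGES = 20
KEEP_LAST = 4

async def maybe_reset(history: list) -> list:
    if len(history) < MAX_MESSAGES:
        return history
    summary = await summarize(history)        # one extra Gemini call
    await memory.store_conversation(summary)  # persisted to ChromaDB
    return history[-KEEP_LAST:]               # keep the tail for continuity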

MCP Registry: The Middleware Layer

Claude Desktop has native MCP support. Gemini doesn't. But MCP servers are standardized - they use JSON-RPC 2.0 over stdio, they have well-defined schemas, and they're reusable. So I built a middleware layer that bridges the gap.

At startup, the MCP Registry reads mcp_servers.json, spawns each server as a subprocess (stdin/stdout pipes), sends the initialization handshake, and calls tools/list to discover available tools. For each tool, I translate the JSON Schema into Gemini's function declaration format:

types.FunctionDeclaration(
    name=tool_name,
    description=tool_def["description"],
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={...}  # Converted from JSON Schema
    )
)

When Gemini calls a function, the registry routes it to the right MCP server, sends a tools/call request via JSON-RPC, parses the response (usually JSON in a content array), and formats it back for Gemini. The whole translation happens transparently - Gemini just sees function declarations, MCP servers just see JSON-RPC.
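Stripped of the init handshake, id matching, and error handling, the tools/call hop is just newline-delimited JSON-RPC over the subprocess pipes - a simplified sketch:

import asyncio
import itertools
import json

request_ids = itertools.count(1)

async def call_tool(proc: asyncio.subprocess.Process, name: str, arguments: dict):
    request = {
        "jsonrpc": "2.0",
        "id": next(request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    proc.stdin.write((json.dumps(request) + "\n").encode())
    await proc.stdin.drain()

    reply = json.loads(await proc.stdout.readline())
    # MCP wraps results in a content array; the text entry is what goes back to Gemini.
    return reply["result"]["content"][0]["text"]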

Adding a new tool means editing the config file. No code changes. The registry handles discovery and routing automatically.

Memory System

Here's how the agent prompt gets structured:

You are Casper, a helpful AI assistant...

// User profile - built and updated over time
{user_profile}
- Prefers short, concise answers
- Learning Chinese (HSK2 level)
- Wants gentle but direct corrections

// Relevant facts - top 5-10 semantically closest to the current question
{relevant_facts}
- Practices Chinese flashcards daily (similarity: 0.89)
- Plays tennis every Friday at 9am (similarity: 0.76)

Memory runs on ChromaDB with local embeddings. Every 10 messages, conversations get stored and facts get extracted. The extraction happens async - I fire off a Gemini API request asking it to pull structured facts from the conversation. Each fact gets a category (habit, preference, learning, information), confidence score, scope, and tags.
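For reference, an extracted fact ends up shaped roughly like this - the field names follow the description above, the concrete class is my own sketch:

from dataclasses import dataclass, field

@dataclass
class Fact:
    text: str               # "Plays tennis every Friday at 9am"
    category: str           # habit | preference | learning | information
    confidence: float       # how sure the extractor is about this fact
    scope: str              # e.g. global vs. conversation-specific
    tags: list[str] = field(default_factory=list)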

Facts get embedded using all-MiniLM-L6-v2 from Sentence Transformers. It produces 384-dimensional embeddings and runs on CPU in about 50ms per query. The embeddings go into ChromaDB's facts collection with metadata filtering support.

Before any agent interaction, I do semantic search: generate a query embedding, run cosine similarity search in ChromaDB, filter by distance < 1.6, and grab the top 10 results. These facts get appended to the system instruction. That's how context accumulates without manually repeating preferences.
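Put together, the retrieval step is only a few lines with chromadb and sentence-transformers. The paths and collection name match what's described here; the function itself is a sketch:

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings, CPU-friendly
store = chromadb.PersistentClient(path="data/chroma")
facts = store.get_or_create_collection("facts")

def relevant_facts(query: str, k: int = 10, max_distance: float = 1.6) -> list[str]:
    embedding = embedder.encode(query).tolist()
    results = facts.query(query_embeddings=[embedding], n_results=k)
    # Keep only facts closer than the distance cutoff.
    return [doc for doc, dist in zip(results["documents"][0], results["distances"][0])
            if dist < max_distance]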

The same memory instance is shared across all three interfaces. ChromaDB is persistent (stored in data/chroma/), so facts survive restarts. Embeddings happen on the Pi, no API calls needed.

Background Task Execution

Webhook and Telegram both go through the Task Executor. When a tool call comes in, the executor checks if it's an MCP tool, a display operation, or a memory operation, then routes accordingly. Everything's async, so I can fire off multiple tool calls in parallel and await results.
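The dispatch itself is simple; the attribute and method names below are illustrative:

import asyncio

async def execute_tool(name: str, args: dict):
    # MCP tools go through the registry; display and memory operations stay local.
    if name in mcp_registry.tool_names:
        return await mcp_registry.call(name, args)
    if name.startswith("display_"):
        return await display.handle(name, args)
    return await memory.handle(name, args)

async def execute_all(calls: list[tuple[str, dict]]):
    # Fire every pending tool call concurrently and gather the results.
    return await asyncio.gather(*(execute_tool(n, a) for n, a in calls))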

Task history is tracked in memory (ring buffer, last 100 tasks) with timestamps and execution times. Useful for debugging and seeing what the agent's been up to.

Where This Goes Next

I started this whole thing just to make Anki reviews less boring - add some speaking and listening, make the questions less predictable. But it grew. Now it's a voice assistant that remembers conversations, manages my calendar, creates flashcards on the fly, and answers research questions.

There's a lot of hype around agents lately. Projects like OpenClaw are showing what's possible with browser automation through their browser tool - clicking buttons, filling forms, navigating websites, running actual errands. Buying stuff, selling stuff, interacting with web apps on your behalf. I tried using OpenClaw while building Casper, but it was pretty expensive to run. Some questions can trigger tasks that burn through around $50 for default operations. In comparison, Casper costs me around $1-3 a day with frequent use. Still, some of the concepts are cool. OpenClaw also has a "heartbeat" feature where the agent runs periodic background checks and can proactively reach out when something needs attention - instead of waiting for you to ask.

The middleware setup I built here (MCP Registry) means I could plug in those kinds of tools pretty easily. Want to add web search? Just install the Perplexity MCP server via npm install @perplexity-ai/mcp-server, add it to the config, and it's available to Casper.

I think we're not far from Jarvis-level assistants. The pieces exist - natural language understanding, function calling, memory systems, multi-modal inputs. It's just a matter of wiring them together and making them reliable enough to actually use daily.

References