muesli
Voice notes are clunky. You record your thoughts, but finding what you said last week requires scrubbing through hours of audio. And if you're watching a tutorial video or in a meeting, good luck capturing both your commentary and what you're hearing.
I wanted something better: an app that transcribes everything in real-time, understands what you're saying through semantic search, and keeps your notes organized as simple markdown files.
the problem: voice notes without context
Voice notes are powerful for quick capture, but terrible for retrieval. You know you recorded something about "authentication implementation" three weeks ago, but which recording was it?
And there's another limitation: most voice note apps only capture your microphone. What if you're watching an educational video and want to transcribe what the speaker is saying? Or you're in a video call and want to capture both sides of the conversation?
Traditional apps can't help here. Muesli can.
dual audio capture
The main idea behind Muesli is capturing two audio streams simultaneously:
Your microphone - Your thoughts, questions, and commentary using the Web Audio API with echo cancellation and noise suppression.
System audio - Everything playing through your speakers: video tutorials, meetings, podcasts, even background music. On macOS, this uses a custom Swift binary leveraging ScreenCaptureKit to capture the audio loopback.
Both streams are transcribed separately and displayed side-by-side in a chat-style interface. You can see what you said and what you heard, timestamped and organized.
This isn't just a gimmick - it completely changes what voice notes can do. Capture a coding tutorial video while adding your own notes. Record a meeting with your commentary in one column and the other participants in another. Listen to a podcast while voice-journaling your reactions.
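The side-by-side view boils down to merging two timestamped streams into one chronological list. A minimal sketch of that merge in TypeScript (the types and names here are illustrative, not Muesli's actual code):

```typescript
interface TranscriptEntry {
  source: "mic" | "system"; // which stream the text came from
  timestamp: number;        // ms since recording started
  text: string;
}

// Merge two already-ordered streams into one chronological list,
// so the UI can render mic and system entries side by side.
function mergeStreams(
  mic: TranscriptEntry[],
  system: TranscriptEntry[]
): TranscriptEntry[] {
  const merged: TranscriptEntry[] = [];
  let i = 0;
  let j = 0;
  while (i < mic.length || j < system.length) {
    const takeMic =
      j >= system.length ||
      (i < mic.length && mic[i].timestamp <= system[j].timestamp);
    merged.push(takeMic ? mic[i++] : system[j++]);
  }
  return merged;
}
```

Because each stream is already ordered by time, a single linear pass is enough; no sorting of the combined list is needed.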
real-time transcription
Muesli uses DeepL's Voice API for streaming transcription. Audio is captured in 100ms chunks, preprocessed to 16-bit PCM at 16kHz, and sent via WebSocket to DeepL's servers.
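Web Audio hands you Float32 samples in the range [-1, 1], so each chunk has to be converted to 16-bit integers before it goes over the wire. A sketch of that conversion step (the function name is mine, not from Muesli's source):

```typescript
// Convert Web Audio Float32 samples (range [-1, 1]) to 16-bit PCM,
// the sample format sent to the transcription service.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so loud input can't overflow the int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically: int16 spans [-32768, 32767].
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

The asymmetric scaling matters: naively multiplying by 32768 would overflow on a full-scale positive sample.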
The transcription appears in real-time as you speak. No "processing..." spinner, no waiting - just text flowing onto the screen as you talk.
The WebSocket connection handles both microphone and system audio simultaneously, with separate sessions for each stream. DeepL provides language detection, punctuation, capitalization, and distinguishes between interim and final results.
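The interim/final distinction maps naturally onto a small state update: interim results overwrite a provisional line, and a final result commits it. A sketch of that reducer, with an assumed message shape (this is not DeepL's actual payload format):

```typescript
interface TranscriptState {
  finalized: string[]; // committed segments
  interim: string;     // provisional text, replaced on each update
}

// Apply one streaming result: interim results overwrite the
// provisional line; a final result commits it and clears the interim.
function applyResult(
  state: TranscriptState,
  text: string,
  isFinal: boolean
): TranscriptState {
  return isFinal
    ? { finalized: [...state.finalized, text], interim: "" }
    : { ...state, interim: text };
}
```

Rendering `finalized` plus `interim` on every update is what produces the "text flowing onto the screen" effect without flicker or duplication.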
markdown storage
Every note is saved as a plain markdown file with YAML frontmatter:
```markdown
---
id: note-abc123
createdAt: 2024-09-15T10:30:00Z
updatedAt: 2024-09-15T11:45:00Z
---

# Meeting Notes: Product Sync

## User Transcript

[10:30:15] We need to prioritize the authentication flow
[10:31:22] The deadline is next Friday

## System Transcript

[10:30:45] [Video audio] "Best practices for OAuth 2.0..."
```
This design choice is deliberate. Your notes are:
- Human-readable in any text editor
- Version-controllable with Git
- Portable across machines
- Long-lasting (markdown will outlive any proprietary format)
Files are written atomically to prevent corruption, and stored locally in ~/Documents/Muesli/.
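The atomic write is the classic write-then-rename pattern. A minimal sketch in Node (the helper name is mine; Muesli's actual implementation may differ):

```typescript
import { writeFileSync, renameSync } from "node:fs";
import { join, dirname, basename } from "node:path";

// Write the note to a temporary sibling file, then atomically rename
// it over the target. A crash mid-write leaves the old file intact
// instead of a half-written markdown file.
function writeNoteAtomically(path: string, content: string): void {
  const tmp = join(dirname(path), `.${basename(path)}.tmp`);
  writeFileSync(tmp, content, "utf8");
  renameSync(tmp, path); // rename is atomic on POSIX filesystems
}
```

The temp file lives in the same directory as the target so the rename never crosses filesystem boundaries, which would break its atomicity.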
semantic search with embeddings
Here's where Muesli gets interesting. Instead of keyword matching, it uses vector embeddings for semantic search.
When you create a note, the transcript text is sent to OpenAI's embedding API, generating a 1536-dimensional vector representation. This vector captures the meaning of the text, not just the words.
Search for "authentication discussion" and Muesli finds transcripts containing "OAuth", "login flow", "user credentials" - even if those exact words aren't in your query. It understands semantic similarity.
The embeddings are stored in a local LibSQL database (a SQLite fork with native vector support), keeping searches near-instant even across hundreds of notes. The implementation uses Mastra's RAG toolkit, which handles the vector-similarity calculations so the app never touches the math directly.
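Under the hood, ranking notes against a query comes down to cosine similarity between embedding vectors. Mastra and LibSQL handle this inside the app, but the core calculation looks roughly like this (a sketch, not the library's code):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1.0 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored note embeddings against a query embedding
// and return the ids of the k closest notes.
function topMatches(
  query: number[],
  notes: { id: string; embedding: number[] }[],
  k: number
): string[] {
  return notes
    .map(n => ({ id: n.id, score: cosineSimilarity(query, n.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(n => n.id);
}
```

In practice the vectors are 1536-dimensional and the database does the scan, but the ranking principle is the same.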
ai-powered features
Beyond transcription, Muesli includes AI capabilities powered by OpenAI:
Summarization - Generate a concise summary of your transcript with streaming responses. Perfect for long meetings or multi-hour recording sessions.
AI Chat - Ask questions about your notes. "What did I decide about the authentication approach?" The AI searches your embeddings and provides context-aware answers.
Tool Calling - The chat interface can invoke semantic search automatically, pulling relevant transcript segments to answer your questions.
All AI prompts are stored as markdown files in the app resources, making it easy to customize behavior without touching code.
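Prompts-as-files implies a small loader that reads a template and fills in runtime values. A sketch of what that might look like (the file layout and placeholder syntax here are assumptions, not Muesli's actual convention):

```typescript
import { readFileSync } from "node:fs";
import { join } from "node:path";

// Load a prompt template from the resources directory and substitute
// {{placeholders}} with runtime values. Unknown keys become "".
function loadPrompt(
  resourcesDir: string,
  name: string,
  vars: Record<string, string>
): string {
  const template = readFileSync(join(resourcesDir, `${name}.md`), "utf8");
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => vars[key] ?? "");
}
```

Editing a `.md` file and restarting the app is then all it takes to change the AI's behavior.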
building it
Muesli went from first commit to production-ready in six weeks. The rapid development was possible thanks to choosing the right tools:
Electron with electron-vite - Fast HMR during development, cross-platform builds, and modern patterns with context bridge for secure IPC.
React 19 with TypeScript - Type-safe components with Tailwind CSS for styling and Radix UI for accessible primitives.
Mastra RAG - Embeddings and semantic search without ML knowledge. Just a few function calls to integrate vector search.
DeepL Voice API - Streaming transcription via WebSocket, no need to build audio processing pipelines from scratch.
The tech stack accelerated development dramatically. Features like embeddings and streaming AI that would take weeks to implement were working in hours.
cross-platform deployment
Muesli ships on macOS, Windows, and Linux with platform-specific optimizations:
macOS gets native system audio capture via a custom Swift binary using ScreenCaptureKit. This provides the best quality loopback audio capture.
Windows and Linux use the getDisplayMedia API with audio, which requires the user to pick an audio source to share but works without special permissions.
All builds include:
- Code signing and notarization (macOS)
- Auto-update system checking every 15 minutes
- Universal binaries supporting both Intel and Apple Silicon (macOS)
The CI/CD pipeline builds and signs releases automatically, pushing updates to GitLab's package registry for distribution.
lessons from rapid development
Building Muesli taught important lessons:
Start with core value - Day one had transcription working, and everything else built on that foundation. Too many projects polish infrastructure before delivering a single feature.
Modern tooling accelerates - electron-vite, Shadcn UI components, Vercel AI SDK, Mastra RAG - these tools saved weeks of development time.
Real-time feels magical - Watching your words appear on screen as you speak creates a sense of the app "understanding" you.
Markdown lasts - Proprietary formats lock your data. Markdown keeps it accessible forever.
what's next
Muesli works well, but there's room for evolution:
Collaborative features - The markdown format is Git-friendly. Imagine syncing notes across devices or sharing transcripts with your team.
Enhanced AI - Local LLMs via Ollama for privacy-conscious users. Automated tagging and categorization. Meeting action item extraction.
Better search - Date filters, speaker identification, regex support for power users. Export search results.
Integrations - Calendar integration to auto-record scheduled meetings. Export to Notion or Obsidian. Podcast hosting platform integration.
reflection
Muesli solves a real problem: voice notes are terrible for retrieval, and most apps can't capture system audio. By combining dual audio capture, real-time transcription, and semantic search, it transforms voice notes from a dumping ground into a searchable knowledge base.
For me, Muesli represents what's possible with modern tooling and clear focus. You don't need months of development to build sophisticated applications. Choose the right abstractions, ship early, iterate based on usage.
Voice notes shouldn't require scrubbing through audio files. With Muesli, they don't have to.
Tech Stack: Electron 37, React 19, TypeScript, DeepL Voice API, OpenAI, Mastra RAG, LibSQL Vector DB, Tailwind CSS, electron-vite
Timeline: 188 commits over 6 weeks (September-October 2024)
Platforms: macOS (native system audio), Windows, Linux