Nomos — A Personal AI Assistant That Reads Your Notes

Status: active
Tech: Python · FastAPI · PostgreSQL · pgvector · Kotlin · Jetpack Compose · DeepInfra · MCP

Most AI assistants start every conversation from zero. Nomos doesn’t. It treats your Obsidian vault — your notes about projects, people, goals, half-finished ideas — as long-term memory, embeds it into a vector index, and pulls relevant context into every conversation.

Ask “what should I focus on this week?” and the answer comes from your actual notes, your calendar, and recent articles from feeds you care about. Ask “what was that thing I read last month about edge AI?” and it can find it. The assistant remembers things you tell it (persistent memory across sessions), can write notes back into the vault, and runs entirely on infrastructure you control — only model inference (LLM, embeddings, speech-to-text, text-to-speech) touches a third party.

Why this exists

I keep most of what matters to me in plain markdown in an Obsidian vault. The frustrating part of using mainstream AI assistants is having to brief them every time. Nomos is the version of “an assistant that knows me” where “me” is defined by my notes, not by a profile some company keeps about me.

The bonus: building it forced a real exploration of agentic loops, MCP, low-latency voice pipelines, and what privacy-by-default actually looks like in practice.

How it works

Vault as memory. Nomos watches the vault, chunks each note, embeds the chunks with BAAI/bge-large-en-v1.5, and stores them in PostgreSQL with pgvector + an HNSW index. Reindexing is incremental — unchanged chunks are skipped, so a 5000-note vault doesn’t get re-embedded on every change.
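
A minimal sketch of that pass, shown with a synchronous psycopg-style connection for brevity (the real backend is async SQLAlchemy); the chunker, embed_batch, and the chunks table are illustrative, not the actual schema:

```python
import hashlib

def chunk_note(text: str, size: int = 1200) -> list[str]:
    # Naive fixed-size chunking; a real splitter would be heading-aware.
    return [text[i:i + size] for i in range(0, len(text), size)]

def reindex_note(conn, embed_batch, note_path: str, text: str) -> None:
    """Re-embed only the chunks whose content hash changed since the last pass."""
    stale = []
    for idx, chunk in enumerate(chunk_note(text)):
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        hit = conn.execute(
            "SELECT 1 FROM chunks WHERE note_path = %s AND idx = %s AND hash = %s",
            (note_path, idx, digest),
        ).fetchone()
        if hit is None:                        # new or modified chunk
            stale.append((idx, chunk, digest))
    if not stale:
        return                                 # unchanged note: zero embedding calls
    vectors = embed_batch([chunk for _, chunk, _ in stale])  # one provider call
    for (idx, chunk, digest), vec in zip(stale, vectors):
        conn.execute(
            "INSERT INTO chunks (note_path, idx, hash, text, embedding) "
            "VALUES (%s, %s, %s, %s, %s) "
            "ON CONFLICT (note_path, idx) DO UPDATE SET "
            "hash = EXCLUDED.hash, text = EXCLUDED.text, embedding = EXCLUDED.embedding",
            (note_path, idx, digest, chunk, vec),
        )
```

Retrieval then goes through pgvector's HNSW index, built once with something like CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops).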

Agentic chat. A loop on the backend runs the LLM with six tools available:

  • search_notes — semantic search over the vault
  • read_note — fetch a specific note’s full text
  • write_note — create or append to a note
  • get_news — fetch RSS articles, scored against your interest vector
  • suggest_calendar_event — suggest a calendar block based on context
  • update_memory — append durable facts to _nomos/memory.md, persisted across sessions

The model decides which tool to call, and when. The frontend gets the entire trace as Server-Sent Events: thinking tokens, tool calls, tool results, the final answer. Nothing is hidden.
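
Condensed, that loop looks roughly like this; an OpenAI-style tool-calling response is assumed, llm.chat and the message plumbing are illustrative, but the event names match what the client receives:

```python
import json
from typing import AsyncIterator

async def agent_loop(llm, tools: dict, messages: list) -> AsyncIterator[dict]:
    """Run the model until it stops requesting tools, yielding one event per step."""
    while True:
        reply = await llm.chat(messages, tools=tools)  # hypothetical client wrapper
        if reply.tool_calls:
            for call in reply.tool_calls:
                yield {"event": "tool_call",
                       "data": json.dumps({"name": call.name, "args": call.args})}
                result = await tools[call.name](**call.args)
                yield {"event": "tool_result", "data": json.dumps(result)}
                messages.append({"role": "tool", "name": call.name,
                                 "content": json.dumps(result)})
            continue  # hand the results back and let the model decide again
        yield {"event": "token", "data": reply.text}  # final answer (streamed per token in practice)
        return
```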

Voice mode. Push-to-talk on the Android app. The pipeline is Groq Whisper-large-v3-turbo for STT, Groq GPT-OSS 20B for the LLM (fast, cheap inference), and DeepInfra Kokoro-82M for TTS. Cost lands around $0.0024/minute and perceived latency stays under a second, partly through routing choices and partly by playing filler audio while tool calls resolve.
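
The filler trick is essentially a timeout race. In this sketch, transcribe, run_agent, synthesize, play, FILLER_CLIP, and the ~700 ms threshold are all placeholders; the timing logic is the point:

```python
import asyncio

async def answer_voice(audio: bytes) -> None:
    text = await transcribe(audio)                       # STT (Groq Whisper-large-v3-turbo)
    answer_task = asyncio.create_task(run_agent(text))   # LLM plus any tool calls
    # If the agent isn't done quickly (a tool call is probably in flight),
    # bridge the gap with a pre-synthesized filler clip instead of dead air.
    done, _ = await asyncio.wait({answer_task}, timeout=0.7)
    if not done:
        await play(FILLER_CLIP)                          # "let me check that..."
    answer = await answer_task
    await play(await synthesize(answer))                 # TTS (DeepInfra Kokoro-82M)
```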

Android app. Native Kotlin + Jetpack Compose. Hilt for DI, Ktor for SSE streaming, native SpeechRecognizer and CalendarContract. The app handles its own input modes (text + voice) and renders the streaming events as they arrive.

Stack

Backend — FastAPI + uvicorn + gunicorn on Python 3.12. SQLAlchemy 2.0 (async) + asyncpg. PostgreSQL with pgvector. APScheduler for the periodic vault re-indexing. sse-starlette for the streaming protocol.
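
The scheduler wiring is a few lines; a sketch assuming an async reindex_vault job, with an illustrative interval:

```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()
# Frequent runs stay cheap because the hash check skips unchanged chunks.
scheduler.add_job(reindex_vault, "interval", minutes=10, id="vault-reindex")
scheduler.start()
```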

Models (via DeepInfra and Groq)

  • LLM: Nemotron-3-Super-120B-A12B (chat), Groq GPT-OSS 20B (voice mode)
  • Embeddings: BAAI/bge-large-en-v1.5
  • STT: Groq Whisper-large-v3-turbo
  • TTS: DeepInfra Kokoro-82M

MCP integration — recent work adopts the Model Context Protocol for knowledge-graph extraction over the vault.
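
For a sense of the shape this takes, the official MCP Python SDK lets a vault capability be exposed as a tool in a few lines; the extraction helpers here are hypothetical:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("nomos-vault")  # illustrative server name

@mcp.tool()
def extract_entities(note_path: str) -> list[dict]:
    """Pull (subject, relation, object) triples out of one note."""
    text = read_note(note_path)         # hypothetical vault accessor
    return run_kg_extraction(text)      # hypothetical extraction step

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```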

Android — Kotlin, Jetpack Compose, Material 3, Hilt, Ktor, WorkManager.

Sync — Syncthing keeps the Obsidian vault in sync between devices and the VPS.

Deploy — Docker on a Hetzner VPS, behind Coolify + Traefik, with Let’s Encrypt.

Things I’m proud of

Voice mode under a second. Multi-provider orchestration tuned for cost and perceived latency, not just one or the other. Filler audio bridges the awkward silence during tool calls so the experience stays conversational instead of feeling mechanical.

SSE-first streaming with structured events. Every step the agent takes is a typed event the client can render — thinking, tool_call, tool_result, token, error. No black-box “loading…” spinners.
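
Wiring that into FastAPI is direct with sse-starlette; a sketch assuming an events(question) generator like the agent loop above:

```python
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.get("/chat")
async def chat(q: str):
    # Each yielded {"event": ..., "data": ...} dict becomes one SSE frame,
    # so the client can switch rendering on the event type.
    return EventSourceResponse(events(q))
```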

Privacy by design. The vault stays on your devices and your VPS. The only data that leaves is individual prompts and embedding requests, sent to the inference providers, with nothing beyond an API key tying calls together.

Persistent memory that’s just a markdown file. _nomos/memory.md is editable in Obsidian. The assistant updates it as a tool call. You can read what it remembers about you. You can delete entries. You can grep it.
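
Because memory is just a file, the tool reduces to an append; the vault root here is illustrative:

```python
from datetime import date
from pathlib import Path

MEMORY = Path("vault/_nomos/memory.md")  # vault root is illustrative

def update_memory(fact: str) -> str:
    """Append one durable fact as a dated bullet; Obsidian picks it up on sync."""
    with MEMORY.open("a", encoding="utf-8") as f:
        f.write(f"- {date.today().isoformat()}: {fact}\n")
    return f"remembered: {fact}"
```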

Incremental reindexing. When the vault changes, only modified chunks get re-embedded. Saves both time and inference cost on a vault that grows daily.

Status

v1 is running. Indexing, six-tool agentic chat, SSE streaming, Android app with voice + text, calendar suggestions, persistent memory — all working.

v2 roadmap: actual calendar event creation (not just suggestion), session history, push notifications, a web clipper for capturing articles into the vault, and contact enrichment.

The primary interface is the Android app. A web chat client is on the list.

GitHub: github.com/IgnacioLD/nomos.