Glosso Studio — Phoneme-Level Pronunciation Training, Offline

Status: active
Stack: Kotlin Multiplatform · Jetpack Compose · ONNX Runtime · wav2vec 2.0 · Room · Koin · Ktor

Most pronunciation apps run your speech through speech-to-text and tell you whether the words match. That tells you very little about how you actually sound. Glosso Studio runs your speech through a phonetic recognizer — a wav2vec 2.0 acoustic model, on-device via ONNX Runtime — and gives you feedback at the level of individual phonemes.

You read a sentence. The app shows you which phonemes you nailed, which ones drifted, and exactly where in the word you mispronounced. Over time it tracks your weak phonemes and surfaces them in spaced-repetition drills. There’s a structured curriculum across six difficulty levels, from Beginner to Mastery.

It runs offline. The ML inference, the audio processing, the curriculum — all on the device. No account, no cloud round-trip, no telemetry.

Why this exists

Adult language learners have a pronunciation problem. Apps either ignore it (Duolingo, Babbel) or treat it superficially. The few that try (“does the speech-to-text match the target text?”) give you a thumbs-up for slurring through a sentence in a way no native speaker would.

Phonetic analysis is the right tool, and the models are now small enough to run on a phone. Glosso turns that tractability into a usable product.

How it works

Speech in. The user records a sentence. The audio is downsampled to 8 kHz and passed through a custom MFCC (Mel-Frequency Cepstral Coefficient) implementation — RMS-normalized to a target value of ~3000 to keep input consistent regardless of how loudly the user spoke.
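The normalization step can be sketched as follows. This is a minimal illustration in Python rather than the app's Kotlin; the function name is invented, and only the ~3000 RMS target comes from the description above.

```python
def rms_normalize(samples, target_rms=3000.0):
    """Scale PCM samples so their RMS energy hits a fixed target,
    keeping the feature extractor's input level consistent
    regardless of how loudly the user spoke."""
    if not samples:
        return samples
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    if rms == 0.0:
        return samples  # pure silence: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]
```

A quiet recording (RMS 100) and a loud one (RMS 10000) both come out at RMS 3000, so the MFCC stage sees comparable energy either way.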

Phoneme recognition. The audio is fed into a single multilingual wav2vec 2.0 model running through ONNX Runtime — one model handles every supported language. The output is a sequence of IPA phonemes for the utterance.
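wav2vec 2.0 recognizers are typically fine-tuned with a CTC head, so turning the model's frame-level output into a phoneme string looks roughly like this. An illustrative sketch, not the app's actual decoder: the blank id and the tiny vocabulary are invented.

```python
BLANK = 0  # CTC blank token id (assumed)

def ctc_greedy_decode(logits, id_to_phoneme):
    """Collapse per-frame CTC logits into a phoneme sequence:
    take the argmax of each frame, merge consecutive repeats,
    and drop blank tokens."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    phonemes, prev = [], None
    for tok in best:
        if tok != prev and tok != BLANK:
            phonemes.append(id_to_phoneme[tok])
        prev = tok
    return phonemes
```

Six frames of "cat" (with one repeat and one blank in the middle) collapse to the three IPA symbols k, æ, t.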

Comparison. The phonemes the user produced are aligned against the target sentence’s expected phonemes. Mismatches are scored, and per-phoneme accuracy is rolled up into a score for the sentence, the word, and the individual phonemes themselves.
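The comparison step is essentially an edit-distance alignment between the two phoneme sequences. A Levenshtein-style sketch of the idea (illustrative Python, not the app's scoring code):

```python
def align_phonemes(expected, produced):
    """Align produced phonemes against the target sequence.
    Returns one (expected, produced-or-None, matched) tuple per
    expected phoneme; None marks a phoneme the user skipped."""
    n, m = len(expected), len(produced)
    # dp[i][j] = min edits to align expected[:i] with produced[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == produced[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion (phoneme skipped)
                           dp[i][j - 1] + 1)         # insertion (extra phoneme)
    out, i, j = [], n, m  # trace back to label each expected phoneme
    while i > 0:
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if expected[i - 1] == produced[j - 1] else 1):
            out.append((expected[i - 1], produced[j - 1], expected[i - 1] == produced[j - 1]))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            out.append((expected[i - 1], None, False))
            i -= 1
        else:
            j -= 1  # extra produced phoneme, not tied to a target slot
    return out[::-1]

def sentence_score(alignment):
    return sum(1 for _, _, ok in alignment if ok) / len(alignment)
```

Saying /kɛt/ for "cat" scores 2/3: the vowel is flagged as a substitution while k and t match, which is exactly the per-phoneme granularity the color coding needs.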

Feedback. The user sees their score and the IPA transcription with per-phoneme color coding, plus the option to replay the full target pronunciation from the TTS reference.

Spaced repetition. Weak phonemes are tracked in Room (ReviewEntity / ReviewDao) and surfaced in drills using the SM-2 algorithm. Streaks, mastery levels, and progress per language are local-only.
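The SM-2 update itself is compact. A sketch of the standard algorithm; the app's ReviewEntity fields and exact tuning may differ:

```python
def sm2_review(quality, reps, interval, ef):
    """One SM-2 update. quality: 0-5 recall grade. Returns the next
    (reps, interval_days, ef). A failed review (quality < 3) resets
    the repetition count so the item comes back the next day."""
    if quality < 3:
        return 0, 1, ef
    ef = max(1.3, ef + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    reps += 1
    if reps == 1:
        interval = 1
    elif reps == 2:
        interval = 6
    else:
        interval = round(interval * ef)
    return reps, interval, ef
```

Three perfect reviews of a new item (starting at the conventional ease factor of 2.5) schedule it at 1, 6, then 17 days out; one failure pulls it back to tomorrow.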

Stack

Mobile — Kotlin Multiplatform. Compose for the Android UI. Koin for DI. Ktor for HTTP. Room for the local database (currently at schema v8 with full migration history).

ML/Audio

  • ONNX Runtime for on-device inference
  • A single multilingual wav2vec 2.0 acoustic model covering all five supported languages (English, French, Spanish, German, Latin)
  • Custom MFCC implementation, no third-party DSP library
  • Custom speech controller with RMS normalization
  • ~318 MB of ONNX weights, stored via Git LFS

TTS layer

  • Qwen3-TTS (Apache-2.0) for primary reference audio
  • Piper TTS for language-specific voices
  • Inworld TTS 1.5 Max via DeepInfra for the Spanish “Diego” voice

Marketing site — glosso-studio-website, a Hugo static site (Blowfish theme) at glossostudio.com.

Distribution — Play Store, F-Droid, and GitLab Releases. Fully free on every channel — no paywall, no in-app purchases. AGPLv3 license. F-Droid metadata in fastlane format.

Things I’m proud of

Cloud-quality phonetic analysis, offline. wav2vec 2.0 + ONNX Runtime gets you per-phoneme accuracy on a mid-range Android phone. No server, no API calls, no rate limits. This is the genuine differentiator from every other pronunciation app on the market.

Dynamic asset loading. The curriculum DBs would balloon the APK if shipped together. Instead, the base APK is small, and language-specific curriculum assets download on demand from the GitLab Package Registry the first time a user opens that level.
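The on-demand pattern is simple: check the local cache, fetch from the registry only on a miss. An illustrative sketch with an injected fetch callable (the real app uses Ktor against the GitLab Package Registry; the file naming here is invented):

```python
import os

def ensure_curriculum(lang, cache_dir, fetch):
    """Return the path to a language's curriculum DB, downloading it
    via `fetch` only if it isn't cached yet."""
    path = os.path.join(cache_dir, f"curriculum-{lang}.db")
    if not os.path.exists(path):
        data = fetch(lang)          # e.g. HTTP GET against the registry
        tmp = path + ".part"
        with open(tmp, "wb") as f:  # write-then-rename: no torn files
            f.write(data)
        os.replace(tmp, path)
    return path
```

The write-then-rename step matters on mobile: an interrupted download must never leave a half-written database where the app expects a valid one.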

Real audio pipeline. RMS normalization, custom MFCC at 8 kHz, support for multiple playback speeds, phoneme-based drills. None of this is off-the-shelf.

SM-2 spaced repetition with weak-phoneme tracking. Not just sentence-level review — the database tracks individual phonemes the user struggles with and biases drill selection toward them.
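Biasing drill selection toward weak phonemes can be as simple as weighting each candidate sentence by the error rates of the phonemes it contains. A sketch of that idea (names, and the choice to treat unseen phonemes as fully unknown, are mine):

```python
def drill_weight(sentence_phonemes, phoneme_accuracy):
    """Score a candidate drill sentence by how many weak phonemes it
    exercises: each distinct phoneme contributes its error rate
    (1 - accuracy); unseen phonemes count as fully unknown."""
    return sum(1.0 - phoneme_accuracy.get(p, 0.0) for p in set(sentence_phonemes))

def pick_drills(candidates, phoneme_accuracy, k=2):
    """candidates: {sentence: phoneme list}. Returns the k sentences
    that best target the user's weak phonemes."""
    ranked = sorted(candidates,
                    key=lambda s: drill_weight(candidates[s], phoneme_accuracy),
                    reverse=True)
    return ranked[:k]
```

A user who struggles with θ and ɹ gets "thrill"-style sentences ahead of ones built from phonemes they already score well on.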

F-Droid clean. Reproducible builds, no proprietary SDKs, AGPLv3, full fastlane metadata. Reviewed and accepted into F-Droid.

Status

Live and shipping: v2.2.6 (versionCode 2207), available on the Play Store, F-Droid, and via GitLab Releases, fully free on every channel under AGPLv3.

Five languages today: English (US/GB), French, Spanish, German, Latin. The Latin one is a small joy.

GitLab: shirobyte421/glosso-studio. GitHub mirror at IgnacioLD/glosso-studio.