Speak mk1: Multimodal Speech Therapy

Building a multimodal Mamba-attention hybrid model from scratch for speech therapy, enabling real-time articulation feedback and error detection using only a webcam and microphone.

Overview

I am building a Mamba-attention hybrid encoder and decoder from scratch for speech therapy. This is not fine-tuning and not a wrapper around an API: every component is trained end to end on my RTX 4060 laptop.
The system has three components I am building in parallel:
The first is a custom Mamba SSM-based audio encoder trained on LibriSpeech with multi-task phonological heads for voicing, manner, place of articulation, and correctness detection. The encoder uses a BLIP-2-style Q-Former to bridge audio representations into the language model.
The second is a video pipeline using MediaPipe FaceLandmarker to isolate and analyze oral region motion frame by frame, extracting articulatory features like tongue tip position, mouth opening geometry, and lip protrusion in real time from a standard webcam.
The third component is SpeakMK1LLM, which serves as the project’s core reasoning engine. To balance architectural research with deployment stability in the current phase, I have developed two iterations of this model:
The Hybrid Prototype: A custom 70M-parameter Mamba-attention hybrid model designed for efficient sequence modeling. It was trained on a four-stage curriculum: general pretraining on TinyStories, domain adaptation on CHILDES, clinical knowledge injection from PubMed Central, and final instruction tuning.
The Deployment Model: A fine-tuned Gemma 3n E4B, which leverages the same four-stage curriculum to adapt its large-scale reasoning capabilities to the specific nuances of Speech-Language Pathology (SLP).
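The four-stage curriculum shared by both models can be written down as an ordered list of stages. The sketch below is only an illustration of that ordering, not the project's training code: the `Stage` class, the `run_curriculum` helper, the stage keys, and the `instruction-data` corpus label are hypothetical names, and the actual training step is left as a callback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str    # short stage key (hypothetical naming)
    corpus: str  # dataset used in this stage
    goal: str    # what the stage is meant to add


# Stage order as described in the text. The corpus label for the final
# stage is a placeholder; the source does not name the instruction data.
CURRICULUM = [
    Stage("pretrain", "TinyStories", "general pretraining"),
    Stage("domain_adapt", "CHILDES", "domain adaptation"),
    Stage("clinical", "PubMed Central", "clinical knowledge injection"),
    Stage("instruct", "instruction-data", "instruction tuning"),
]


def run_curriculum(stages, train_stage, state=None):
    """Run stages in order, passing the state from each stage to the next."""
    for stage in stages:
        state = train_stage(state, stage)
    return state


if __name__ == "__main__":
    # Stand-in for a real training step: just record which stages ran.
    def record_stage(state, stage):
        return (state or []) + [stage.name]

    print(run_curriculum(CURRICULUM, record_stage))
```

Threading a single state through the stages mirrors the idea that each stage starts from the checkpoint the previous one produced.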

The entire stack requires only a webcam and a microphone, which is the point: making clinical-quality articulation feedback accessible without specialized hardware or a therapist present in the room.
For the demo I would show the live pipeline taking a child’s speech, flagging a specific phoneme error, and generating a graded corrective prompt in real time, alongside the clinician dashboard that logs session data for remote review.
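As a rough illustration of the mouth opening geometry mentioned for the video pipeline, the sketch below computes a scale-invariant opening ratio from four 2D lip landmarks. It assumes the landmarks have already been extracted per frame (for example by a face landmark model); the function name, the argument names, and the choice of this particular ratio are assumptions for illustration, not the project's actual feature set.

```python
import math


def _dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mouth_opening_ratio(upper_lip, lower_lip, left_corner, right_corner):
    """Vertical lip gap divided by mouth width.

    Dividing by the width makes the value roughly independent of how far
    the speaker sits from the camera. Returns 0.0 for a degenerate width.
    """
    width = _dist(left_corner, right_corner)
    if width == 0:
        return 0.0
    return _dist(upper_lip, lower_lip) / width


def opening_track(frames):
    """Compute the ratio for each frame.

    `frames` is an iterable of (upper, lower, left, right) point tuples.
    """
    return [mouth_opening_ratio(*points) for points in frames]


if __name__ == "__main__":
    # Two synthetic frames in pixel coordinates: mouth closed, then open.
    frames = [
        ((50, 50), (50, 50), (30, 50), (70, 50)),
        ((50, 40), (50, 60), (30, 50), (70, 50)),
    ]
    print(opening_track(frames))
```

A per-frame track like this could be compared against the expected opening pattern for a target phoneme, though how the project actually uses such features is not specified here.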

Links

Tech stack