Record yourself speaking. Receive an improved version of what you said, played back immediately in your own cloned voice.
Most speech coaching tools come with friction: multiple sessions, external feedback, and a gap between saying something and understanding how it could have landed differently. Better Words compresses that gap into a single interaction: record yourself speaking, receive an improved version of what you said, and hear it played back in your own cloned voice.
Not a coaching note. Not a transcript with edits. Your words, restructured, in your voice. Used daily for several months, the tool began to feel less like software and more like a mirror held slightly ahead of where you are.
What happens when you hear an elevated version of your speech, immediately, in your own voice?
One button. No menus. No dashboard. The entire workflow – recording, processing, rewriting, voice generation, playback – had to live inside a single object. If the interface became cluttered, the habit would break.
The gesture language follows from the constraint. Tap to pause, tap again to resume – keeping the experience flexible without adding controls. Holding the button signals intent: it turns red, microcopy appears ("Hold to send"), preventing accidental submission while still keeping the interface minimal. When processing finishes, the button expands into a full-screen feedback space. The result feels like an event, not output.
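The gesture language above can be read as a small state machine. The sketch below is one possible reading, not the tool's actual code: the state names, the `OneButton` class, and the one-second `send_after` threshold are all assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RECORDING = auto()
    PAUSED = auto()
    ARMED = auto()       # button held: turns red, shows "Hold to send"
    PROCESSING = auto()
    RESULT = auto()      # button expanded into the full-screen feedback space


class OneButton:
    """Hypothetical model of the single-button gestures."""

    def __init__(self, send_after: float = 1.0):
        # send_after is an assumed hold duration in seconds.
        self.state = State.IDLE
        self.send_after = send_after
        self._before_hold = State.IDLE

    def tap(self) -> None:
        # Tap starts or resumes recording; tap again pauses.
        if self.state in (State.IDLE, State.PAUSED):
            self.state = State.RECORDING
        elif self.state is State.RECORDING:
            self.state = State.PAUSED
        # Taps in any other state are ignored.

    def hold_start(self) -> None:
        # Holding only means something once there is a recording.
        if self.state in (State.RECORDING, State.PAUSED):
            self._before_hold = self.state
            self.state = State.ARMED

    def hold_end(self, held_seconds: float) -> None:
        if self.state is not State.ARMED:
            return
        if held_seconds >= self.send_after:
            self.state = State.PROCESSING   # intentional submission
        else:
            self.state = self._before_hold  # released early: nothing is sent

    def processing_done(self) -> None:
        if self.state is State.PROCESSING:
            self.state = State.RESULT
```

Releasing early returns the button to whatever it was doing before the hold, which is the guard against accidental submission described above. Every transition hangs off the same object, so no extra control is needed.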
An elevation slider lets you move between "close to original" and "more sophisticated", keeping the output within your own register rather than replacing your voice with someone else's idea of better. Voice cloning requires trust before it can work. Before first use, you choose between cloning your own voice or using a synthetic one. Clear privacy messaging explains how audio is processed and deleted.
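One way the elevation slider could feed the rewriting step is by turning its position into an instruction for the rewriter. This is a minimal sketch under assumptions: the function name `rewrite_instruction`, the three tiers, and the instruction wording are hypothetical and not taken from the tool.

```python
def rewrite_instruction(elevation: float) -> str:
    """Map a slider position in [0, 1] to a rewrite instruction (hypothetical tiers)."""
    level = min(1.0, max(0.0, elevation))  # clamp out-of-range input
    if level < 1 / 3:
        style = "Stay close to the original: fix only filler words and false starts."
    elif level < 2 / 3:
        style = "Tighten sentence structure while keeping the speaker's vocabulary."
    else:
        style = "Restructure freely and reach for more precise vocabulary."
    # The same closing constraint applies at every level of the slider.
    return style + " Keep the speaker's own register and meaning."
```

The closing constraint is appended at every level, so even the most elevated setting is asked to stay inside the speaker's register.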
Better Words was built entirely using AI-assisted workflows – iterating in code and interface at the same time, without separating design from development. No handoff. No mockups passed to an engineer. The constraint shaped the tool and the tool sharpened the constraint, each iteration tightening what the single button was allowed to mean.
Underneath the interface, a retrieval-augmented generation (RAG) layer lets the tool reference stored speech patterns and personal context when generating improvements. That layer doesn't appear in the demo, but it's where the harder design questions live: what gets surfaced depends entirely on how well the retrieval layer understands what you actually meant. The boundary between relevant context and hallucination is not a technical problem with a clean solution. It is a design problem about trust, and about how much of yourself you want the system to remember.
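To make the relevance boundary concrete, here is a toy retrieval sketch. It is not the tool's retrieval layer: it swaps in plain word-overlap cosine similarity where a real system would likely use embeddings, and the function names, the `min_score` threshold of 0.2, and the sample memory entries are all assumptions.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    # Bag-of-words stand-in for an embedding (an assumption, not the real layer).
    return Counter(re.findall(r"[a-z']+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, memory: list[str], k: int = 2, min_score: float = 0.2) -> list[str]:
    """Return up to k stored notes whose similarity to the query clears min_score."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(note)), note) for note in memory]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in kept[:k]]


# Hypothetical stored context.
memory = [
    "I tend to open talks with a question",
    "My team ships a payments API",
    "I say 'basically' too often when nervous",
]
```

The design choice sits in `min_score`: when nothing clears the threshold, `retrieve` returns an empty list, so the rewriter works from the recording alone instead of from a weak match. Where that threshold belongs is the trust question described above, not something the code can settle.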