AI summaries are table stakes. Persistent speaker identity is the moat.
Every meeting tool has AI summaries now. The question that actually matters: does your tool know who's talking — across every meeting you've ever recorded?
The summary problem
In 2026, every meeting transcription tool has AI summaries. Otter has them. Fireflies has them. Granola has them. Fathom has them. Zoom has them built in. It's the feature every pitch deck leads with, and it's the feature that matters least.
Why? Because summaries are a commodity. You can take any transcript and feed it to Claude, ChatGPT, or a local model. The output is roughly equivalent regardless of which tool generated the transcript. The AI summary is a thin layer on top of the real work — capturing the audio, identifying the speakers, and producing an accurate transcript.
The hard problem isn't summarization. The hard problem is identity.
The identity problem
Here's the question that separates useful meeting tools from toy demos: does the tool know that "Sarah" in your October standup is the same Sarah in your March performance review?
Most meeting tools don't. They identify speakers within a single meeting session — "Speaker 1," "Speaker 2," maybe with names if the meeting platform provides them. But they start from zero every time. Each meeting is an island. The tool has no memory of who it's heard before.
This means you can't ask: "What did Sarah say about the API migration across all our standups?" You can't query: "Show me every meeting where Marcus and I discussed hiring." You can't build a longitudinal view of conversations with a specific person because the tool doesn't know that the same person appears across your meetings.
Persistent speaker identity — the ability to recognize and name speakers across sessions, over time, without manual labeling — is the feature that turns a transcription tool into a knowledge system.
How Transcripted solves this
Transcripted uses WeSpeaker, a speaker verification framework, to generate 256-dimensional voice embeddings for every speaker in every meeting. In plain English: it creates a mathematical fingerprint of each person's voice.
Here's what that means in practice:
Meeting 1: You record a standup. Transcripted identifies three speakers. It doesn't know their names yet, so it labels them Speaker 1, Speaker 2, and Speaker 3. Behind the scenes, it saves a voice embedding for each one.
Meeting 2: You record another meeting. Two of the speakers are the same people from Meeting 1. Transcripted compares the voice embeddings and matches them — Speaker 1 from today is the same as Speaker 2 from yesterday. If you've named them in the app, those names carry forward automatically.
Meeting 50: You've had dozens of meetings over several months. Transcripted recognizes your regular collaborators instantly. Sarah is Sarah. Marcus is Marcus. New speakers are flagged as unknown and get their own embeddings for future matching.
All of this happens locally. The voice embeddings are stored on your Mac. No voice data is sent to a server. The matching algorithm runs on your machine every time you record a meeting.
256 dimensions of voice
A voice embedding is a vector — a list of 256 numbers — that captures the characteristics of a person's voice. Not what they said, but how they sound. Pitch patterns, timbre, cadence, resonance. The things that make your voice recognizably yours.
WeSpeaker's embedding model is trained on large speaker-verification corpora spanning thousands of speakers and millions of utterances. It learns to map each voice into a 256-dimensional space where similar voices are close together and different voices are far apart. When Transcripted compares two embeddings, it's measuring the distance between two points in this space. If the distance is small enough, it's the same person.
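The geometry is easy to demonstrate with toy vectors. The sketch below uses random 256-dimensional vectors as stand-in embeddings (real embeddings come from a trained model, not random noise): a slightly perturbed copy of a vector stays close to it, while an independent vector lands nearly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

base = rng.normal(size=256)                     # stand-in for one voice's embedding
same = base + 0.1 * rng.normal(size=256)        # same voice, slight session variation
other = rng.normal(size=256)                    # a different voice entirely

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cos(base, same)    # close to 1: recognized as the same speaker
sim_other = cos(base, other)  # near 0: different speakers
```

In 256 dimensions, two unrelated vectors are almost always nearly orthogonal, which is what makes a simple distance threshold workable.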
This approach is robust against background noise, different microphones, varying audio quality, and even mild changes in someone's voice (like having a cold). It's the same fundamental technology used in speaker verification systems at a much larger scale — adapted for local, on-device use.
Why this matters for AI agents
Here's where persistent speaker identity goes from "nice feature" to "architectural advantage." AI agents are starting to use meeting transcripts as context. Claude Desktop can read your files. Cursor can search your documents. Custom agents can query your meeting history for context when generating responses.
But the quality of an agent's response depends entirely on the quality of the context it receives. If your transcripts say "Speaker 1" and "Speaker 2," the agent can't answer "what did Sarah say about the roadmap?" It doesn't know who Sarah is. It can't connect statements across meetings because it has no concept of persistent identity.
With Transcripted, every transcript has named, identified speakers. An agent reading your meeting folder can:
Answer cross-meeting questions. "What has Sarah said about the API migration over the last three months?" The agent finds every meeting where Sarah spoke and extracts relevant statements. This only works because "Sarah" is consistently labeled across all transcripts.
Build relationship context. "Summarize my last five meetings with Marcus." The agent finds meetings where both you and Marcus were present, in chronological order, and synthesizes the arc of your conversations.
Detect patterns. "Has anyone raised concerns about the timeline more than once?" The agent can track who raised what, when, and whether the concern was addressed in a later meeting — because it knows the speakers are the same people.
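The cross-meeting queries above reduce to simple filters once speakers are consistently named. Here's a minimal sketch; the transcript structure and the `statements_by` helper are illustrative assumptions, not Transcripted's actual file format or API.

```python
# Hypothetical speaker-labeled transcripts, one dict per meeting.
meetings = [
    {"date": "2026-01-12", "segments": [
        ("Sarah", "The API migration is blocked on auth."),
        ("Marcus", "I can help unblock that."),
    ]},
    {"date": "2026-02-03", "segments": [
        ("Sarah", "API migration is done; auth was the fix."),
    ]},
]

def statements_by(speaker: str, keyword: str = ""):
    """Yield (date, text) for everything a named speaker said, across all meetings."""
    for meeting in meetings:
        for who, text in meeting["segments"]:
            if who == speaker and keyword.lower() in text.lower():
                yield meeting["date"], text

sarah_on_migration = list(statements_by("Sarah", keyword="migration"))
```

The query only returns both statements because "Sarah" is the same label in both transcripts. With per-session labels like "Speaker 1," the same filter would silently miss half her statements.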
This is the difference between a transcript archive and a queryable meeting memory. Identity is the key that unlocks it.
No other local tool does this
There are other local transcription tools. Whisper can run on your Mac. MacWhisper wraps it in a nice UI. Various open source projects use different speech-to-text models locally.
None of them do persistent speaker identification. They might do basic diarization — splitting audio into segments by speaker — but they don't learn voices across sessions. They don't build a persistent identity model. Each recording starts from scratch.
Transcripted is, as far as I'm aware, the only local macOS tool that combines high-accuracy transcription (Parakeet TDT V3) with persistent speaker identification (WeSpeaker embeddings) in a way that builds a cumulative understanding of who you talk to.
Cloud tools could theoretically do this — they have the compute and the data. But they haven't prioritized it because their business model is built around summaries and integrations, not around building a persistent local knowledge graph of your professional conversations.
The moat isn't the summary
The meeting transcription space is crowded. Everyone has AI summaries. Everyone has integrations. Everyone has a clean UI and a "just works" pitch.
The moat is in the data layer. Can the tool answer questions about people, not just meetings? Can it connect a conversation from October to a decision in March? Can it give an AI agent enough context to be genuinely useful about your work relationships and recurring discussions?
Persistent speaker identity is what makes that possible. It's the difference between a tool that generates meeting notes and a tool that understands your professional life.
Transcripted builds that understanding locally, privately, and cumulatively — one meeting at a time.
Try Transcripted free
Free forever. No account. No cloud. Works on M1 through M5.