Can Built-In Speech-to-Text Replace Specialist Assistive Technology?


Built-In Speech-to-Text Tools Are Not the Same as Accessible Support

Speech-to-text features are now embedded into many of the devices and platforms people use every day. Phones, laptops, browsers, meeting platforms and operating systems often include some form of dictation, captioning or transcription by default. As these features have become more visible and widely available, a broader assumption has emerged that speech-to-text is now a solved problem.

But availability is not the same as accessibility.

Built-in tools can be genuinely useful. For some people, in some situations, they may work perfectly well. However, tools designed as mainstream convenience features are not necessarily designed to provide consistent, reliable access for people who depend on speech-to-text as part of how they work, study, communicate or process information.

That distinction matters.

Speech-to-text performance can vary significantly depending on:

accents and regional speech patterns
speech clarity, pacing and hesitations
background noise and room acoustics
specialist or technical terminology
multiple speakers and overlapping speech
microphones, audio routing and listening devices

In many cases, these tools work well in ideal conditions, but become substantially less reliable in real-world environments.

For users who rely on speech-to-text occasionally, this may simply be frustrating. For users who depend on it to write, follow conversations, participate in meetings, reduce cognitive load, or access spoken information, inconsistency becomes a genuine accessibility issue.

A transcript that requires constant correction, captions that lag behind conversation, or dictation that repeatedly misunderstands speech patterns can increase cognitive effort rather than reduce it.

There is also an increasing shift within mainstream tools away from transparent, user-led correction and toward AI-driven prediction, rewriting and content generation. While these features may improve convenience for some users, they are not necessarily aligned with accessibility, independent communication, or individual support needs.

The question is therefore not whether built-in speech-to-text tools exist, or whether they can sometimes be helpful. The real question is whether mainstream tools can reasonably be assumed to provide reliable, consistent and equitable access across the wide range of people, environments and communication styles that speech-to-text users represent.

Current evidence suggests that assumption cannot safely be made.

Why Speech-to-Text Isn’t Uniform

Speech-to-text is often discussed as if it is a single, standard capability. In reality, it is a broad category of technologies that can behave very differently depending on the tool being used, the speaker, the environment, and the task being performed. Dictating a sentence into a quiet laptop is not the same as following live captions in a busy lecture theatre. Reading prepared text aloud is not the same as speaking naturally, with pauses, hesitations, interruptions, corrections, or changes in pace.

That distinction matters because speech-to-text accuracy is highly context-dependent. A tool that performs well in controlled conditions may behave very differently in real-world environments. Research comparing modern speech recognition systems has consistently found significantly lower error rates for prepared or semi-structured speech than for spontaneous speech, which is the kind of speech people naturally use in classrooms, meetings, seminars and conversations. In one benchmark study (CEASR), the best-performing commercial system recorded a word error rate (WER) of 4.43% on read-aloud speech, rising to 24.9% on spontaneous speech. Across commercial systems overall, average WER increased from 11.6% on read-aloud and semi-spontaneous speech to 35.5% on spontaneous speech. Put simply, the same broad category of technology can produce roughly three to nearly six times as many errors once speech becomes more natural and less controlled.
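For readers unfamiliar with the metric, WER is conventionally calculated as the number of word substitutions, deletions and insertions needed to turn a tool's output into the correct transcript, divided by the number of words in the correct transcript. The Python sketch below illustrates that standard definition only. It is not the scoring code used by the benchmark quoted above, the function name and example sentences are invented for illustration, and real evaluations usually also normalise punctuation and spelling before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra word in the output (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented sentences for illustration, not output from any tool discussed here.
reference = "the lecture starts at nine in room four"
hypothesis = "the lecture start at nine in room for"

# Two substituted words out of eight reference words.
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # prints "WER: 25.00%"
```

Because inserted words count as errors, WER can exceed 100% when a tool produces more wrong words than the speaker actually said. A low WER also does not guarantee that the most important words were captured correctly.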

Academic and workplace environments rarely involve perfect speech in perfect conditions. Lectures, seminars, tutorials, placements and online sessions frequently involve:

fast or overlapping speech
multiple speakers
subject-specific terminology
unclear audio
accents and dialects
acronyms, names and technical language
unfinished sentences and self-correction
background noise and variable acoustics

Research suggests these conditions can significantly affect accuracy. Studies examining speech recognition in noisy classroom environments have reported substantial drops in performance across mainstream systems, with one reporting mean WERs of 84% to 95% across the three systems it tested. That illustrates the gap between controlled demonstrations and realistic usage conditions.

It is also important to separate availability from functional equivalence. Mainstream built-in tools such as Apple Dictation, Windows Voice Access, Google Voice Typing, browser transcription tools, and live captions within meeting platforms are widely available and may be genuinely useful for many people. However, availability alone does not demonstrate that all speech-to-text tools provide the same level of access or reliability.

Research has repeatedly shown that speech recognition performance varies across different speaker groups and communication styles. Studies have identified substantially higher error rates for accented speech, non-native English speakers, Black speakers, people who stutter, and some d/Deaf and hard-of-hearing users. These differences are not necessarily caused by isolated technical faults, but by the reality that speech recognition systems are trained on particular datasets, speech patterns and assumptions about how people speak.

This does not mean mainstream tools are inherently poor, or that specialist assistive technologies are unaffected by the same technical challenges. Accent, background noise, speech variation and specialist vocabulary can affect any speech-to-text system. The important distinction is that specialist assistive technologies are typically designed, configured and supported with accessibility and reliability as core priorities, rather than as general convenience features for broad consumer use.

That difference matters because accessibility is not determined only by whether text appears on a screen. It depends on whether the output is accurate, timely and reliable enough for someone to:

follow spoken information
participate in discussion
produce written work
reduce cognitive load
communicate independently
access education or work effectively

A transcript that requires constant correction, captions that lag behind conversation, or dictation that repeatedly misunderstands speech patterns can increase effort rather than reduce it.

There is also a growing shift within mainstream speech and writing tools toward AI-driven prediction, rewriting and content generation. While these features may improve convenience for some users, they are not necessarily designed around accessibility, independent communication, or the highly individual support needs of disabled users.

The key issue is therefore not whether built-in speech-to-text tools exist, or whether they can sometimes be helpful. It is whether all speech-to-text tools can reasonably be assumed to provide the same level of access, reliability and usability across the wide range of people, environments and communication styles that exist in real-world education and work settings.

Current evidence suggests they cannot.

Concern	Statistic	What this tells us	Source
Speech patterns	The best commercial ASR system had a 4.43% WER on read-aloud speech, compared with 24.9% WER on spontaneous speech.	Speech-to-text performs much better when speech is controlled than when it reflects natural conversation.	CEASR benchmark
Speech patterns	Across commercial ASR systems, average WER rose from 11.6% on read-aloud and semi-spontaneous speech to 35.5% on spontaneous speech.	Accuracy varies significantly depending on how naturally someone is speaking.	CEASR benchmark
Speech patterns	Apple research improved WER for people who stutter from 25.4% to 9.9% after tuning and dysfluency refinement.	Speech variation can significantly affect ASR accuracy, and performance can fall sharply unless the technology is specifically adapted.	Apple research on stuttered speech recognition
Environment	In noisy authentic classrooms, one study reported mean WERs of 84% for Google, 91% for Rev and 95% for Watson.	Real-world classroom noise can make ASR unreliable.	Educational Data Mining study
Environment	In noisy classroom recordings, one study reported word-accuracy scores of 56% for Google and 8% for Microsoft.	Built-in tools can perform very differently under difficult real-world audio conditions.	Blanchard et al., 2015
Accent bias	Five major ASR systems averaged a WER of 35% for Black speakers compared with 19% for white speakers.	ASR bias across speaker groups is documented across multiple systems.	Stanford-led study on racial disparities in ASR
Accent bias	ASR WER can be up to seven times greater for accented speech compared with standard British English.	Regional accent bias is directly relevant in a UK education context.	University of Birmingham research
Accent bias	Research across 191 language backgrounds found a 10–15% mean WER gap between first-language and second-language English speech.	Non-native English speakers may experience less reliable speech-to-text performance.	Hollands et al., Interspeech 2022
Technical vocabulary	Speechmatics reported 93% general accuracy, 7% word error rate, 96% medical keyword recall and a 4% keyword error rate in medical speech-to-text benchmarking.	Specialist systems can focus on the critical terms that matter most, not just overall accuracy.	Speechmatics medical speech-to-text benchmarking
Technical vocabulary	In spoken clinical questions, one system improved by 36% after domain adaptation, reaching a final WER of 26.7%.	Specialist language models can materially improve recognition of technical vocabulary.	JAMIA study on spoken clinical questions
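To make the size of these gaps easier to compare, the short Python sketch below works through the arithmetic for four of the rows above. The figures are taken directly from the table. The helper names are invented for illustration. Each comparison stays within a single study, because results from different studies use different recordings and scoring rules and cannot safely be compared with each other.

```python
def times_higher(baseline: float, comparison: float) -> float:
    """How many times larger the comparison error rate is than the baseline."""
    return comparison / baseline


def relative_reduction(before: float, after: float) -> float:
    """Share of the original errors removed, as a fraction between 0 and 1."""
    return (before - after) / before


# CEASR benchmark: read-aloud vs spontaneous speech (WER in percent).
best_system = times_higher(4.43, 24.9)
commercial_average = times_higher(11.6, 35.5)

# Apple research on stuttered speech: WER before and after adaptation (percent).
stutter_gain = relative_reduction(25.4, 9.9)

# Stanford-led study: white vs Black speakers (35% vs 19%, written as fractions;
# the ratio is the same in either unit).
speaker_gap = times_higher(0.19, 0.35)

print(f"Best system, spontaneous vs read-aloud: {best_system:.1f}x")  # 5.6x
print(f"Commercial average, spontaneous vs prepared: {commercial_average:.1f}x")  # 3.1x
print(f"Errors removed by adapting to stuttered speech: {stutter_gain:.0%}")  # 61%
print(f"Black vs white speakers: {speaker_gap:.1f}x")  # 1.8x
```

Seen this way, moving from prepared to spontaneous speech multiplied errors by roughly three to nearly six times, while targeted adaptation for stuttered speech removed about three in five errors.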

Evidence

Research shows that speech-to-text accuracy can vary significantly depending on who is speaking, where they are speaking, how they speak and what they are speaking about. To show what this looks like in practice, we tested a range of built-in and specialist speech-to-text tools across scenarios that reflect common access challenges: speech variation, background noise, regional and Deaf accents, and technical vocabulary.

These tests are not intended to be a formal benchmark or a universal ranking of every tool. They are practical, side-by-side examples of how different speech-to-text tools can behave under the same conditions. In each recording, the tools process the same speech at the same time, making it possible to compare not only whether words are recognised, but whether the final output captures the intended meaning, remains readable, and is usable for the student.

Speech patterns

Speech disorder: Stuttering

In the stuttering test, the difference between the built-in tool and the specialist assistive technology was substantial.

Apple Dictation struggled to produce a consistently understandable transcript. Repeated sounds, disrupted phrasing and speech variation frequently resulted in output that was difficult to follow, requiring significant interpretation and correction from the user.

The specialist assistive technology, by comparison, produced a clearer, more coherent and substantially more accurate transcript that remained usable as written communication. Sentence structure, meaning and intent were retained far more effectively, despite the same speech patterns and recording conditions.

The distinction here is important. This is not simply a marginal improvement in accuracy or convenience. It is the difference between output that can realistically support independent communication and output that cannot.

Stuttering - Apple vs Paid for STT | Auto-punctuation is on for both tools

Speech disorder: Cluttering

In the cluttering test, the difference between the built-in tool and the specialist assistive technology was even more pronounced.

Apple Dictation captured isolated words and short phrases, but the transcript quickly became fragmented and difficult to follow. Important context, sentence structure and meaning were frequently lost, resulting in output that would require substantial interpretation and rewriting before it could realistically be used.

The specialist assistive technology captured substantially more of the speech accurately and preserved the speaker’s narrative and intended meaning far more effectively, despite the fast pace and disrupted speech patterns. The resulting transcript remained coherent and usable as written communication.

For users relying on dictation as part of accessible communication, this distinction is critical. The difference is not between “perfect” and “imperfect” transcription. It is the difference between output that remains usable and output that does not.

Cluttering - Apple vs Paid for STT | Auto-punctuation is on for both tools

Environment

Background noise

In the first background noise test, both tools captured much of the intended speech, but the outputs were not identical. Apple Dictation produced a more compressed transcript, while TalkType captured more of the speech and presented it in a more readable structure.

Background noise - Apple vs TalkType | Auto-punctuation is on for both tools

In the second test, where the background voice became more dominant, the difference between “works” and “works reliably” became clearer. Apple Dictation missed more of the opening context and produced a shorter output. TalkType captured more of the intended speech overall.

Background noise (dominant) - Apple vs TalkType | Auto-punctuation is on for both tools

Accent bias

Spontaneous Scottish accent (light)

In the spontaneous light Scottish accent test, the difference between the two outputs became clear once the speech became more natural and conversational.

The built-in dictation struggled to consistently preserve meaning and sentence structure, with the transcript becoming increasingly compressed and unreliable as the speech became less scripted. Important context and phrasing were lost, reducing the usability of the output as written communication.

The specialist assistive technology retained far more of the speaker’s flow, structure and intended meaning, producing a transcript that remained clear and usable despite the accent, pace and spontaneity of the speech.

This distinction matters because real-world dictation rarely involves perfectly prepared sentences. Students often dictate while thinking, revising ideas, asking questions or capturing thoughts in real time. In those situations, the difference between “roughly recognisable” output and genuinely usable written communication becomes highly significant.

Scottish accent (spontaneous) - Paid for STT vs Windows | Auto-punctuation is on for both tools

Spontaneous Scottish accent (heavy)

In the spontaneous heavy Scottish accent test, the difference between the two outputs became more pronounced.

Both tools were affected by the stronger accent and natural conversational delivery. However, the specialist assistive technology retained substantially more of the speaker’s intended meaning, sentence structure and detail, producing a transcript that remained largely understandable and usable as written communication.

The built-in dictation tool captured parts of the speech, but missed significantly more information and produced a much shorter, less coherent transcript overall. Important context and meaning were lost, making the output considerably harder to use without substantial interpretation or correction.

This distinction matters because accessibility is not determined by whether isolated words are recognised. For speech-to-text to support independent study or communication, the output must preserve enough meaning, structure and context to remain genuinely usable.

Scottish accent (heavy) - Paid for STT vs Windows | Auto-punctuation is on for both tools

Welsh accent

In the Welsh accent test, both tools recognised parts of the speech, but the outputs were clearly not equivalent.

Google Docs Voice Typing identified some individual words and phrases, including the speaker’s name and references to Abergavenny, but lost important context and meaning within the transcript. For example, it failed to correctly capture the phrase “my favourite judge on The Voice is Tom Jones”, missing both the structure and key meaning of the sentence.

The specialist assistive technology captured substantially more of the spoken content accurately and preserved far more of the speaker’s intended meaning and sentence structure, resulting in a transcript that remained coherent and usable as written communication.

This distinction matters because speech-to-text is not simply about recognising occasional keywords. For someone relying on dictation to study, communicate or produce written work, the output must preserve enough meaning, context and continuity to be genuinely usable without substantial correction or reconstruction.

Welsh accent - Google vs Paid for STT | Auto-punctuation is off for both tools since Google Docs does not provide this feature.

Deaf accent

In the Deaf accent test, the difference between the two outputs was particularly pronounced.

Google Docs Voice Typing missed much of the opening speech and produced a significantly shorter transcript overall, losing important context and narrative detail. Large sections of meaning were either missed entirely or became difficult to interpret from the output alone.

The specialist assistive technology captured substantially more of the speech from the outset and preserved far more of the speaker’s intended meaning and flow. This included key details about communication being a mixture of English and ASL, as well as the speaker reflecting on how they sound when speaking aloud.

This distinction is important because speech-to-text systems are often trained primarily on the speech of hearing people and on more conventional speech delivery. For users whose speech differs from those assumptions, the difference between partial recognition and genuinely usable communication can be substantial.

Deaf accent - Google vs Paid for STT | Auto-punctuation is off for both tools since Google Docs does not provide this feature.

Technical vocabulary

Medical terminology with Northern accent

In this test, the difference between the two outputs was substantial. Apple Dictation misrecognised “lupus erythematosus” as “Blue is everything mitosis”, fundamentally changing the meaning of the opening sentence. It also failed to accurately capture several specialist medical terms, including “autoantibodies”, “pancreatitis”, “avascular necrosis”, “antiphospholipid syndrome” and “pseudopseudohypoparathyroidism”. As a result, large parts of the transcript became unreliable as accurate written communication.

The specialist assistive technology captured the medical terminology far more accurately and preserved the meaning and structure of the passage to a substantially greater extent.
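A term-level comparison like the one described above can be expressed as keyword recall: the share of critical terms that appear correctly in the transcript. The Python sketch below is one simple way to calculate it, under stated assumptions. The function names are invented for illustration, the transcript is an invented example rather than the output of either tool tested, and the matching is exact after lowercasing and removing punctuation. It is not the method used in any benchmark quoted in this article.

```python
import re


def normalise(text: str) -> str:
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_recall(key_terms: list[str], transcript: str) -> tuple[float, list[str]]:
    """Return the share of key terms found in the transcript, plus the missed terms."""
    if not key_terms:
        raise ValueError("at least one key term is required")
    # Pad with spaces so that terms only match on whole-word boundaries.
    padded = f" {normalise(transcript)} "
    found = [term for term in key_terms if f" {normalise(term)} " in padded]
    missed = [term for term in key_terms if term not in found]
    return len(found) / len(key_terms), missed


# Key terms taken from the medical passage described above.
key_terms = [
    "lupus erythematosus",
    "autoantibodies",
    "pancreatitis",
    "avascular necrosis",
]

# Invented transcript for illustration only.
transcript = (
    "Lupus erythematosus can involve auto antibodies, "
    "pancreatitis and a vascular necrosis."
)

recall, missed = keyword_recall(key_terms, transcript)
print(f"Keyword recall: {recall:.0%}")  # prints "Keyword recall: 50%"
print("Missed terms:", missed)
```

In this invented example, most of the ordinary words are correct, so overall accuracy would look reasonable, yet half of the clinically important terms are wrong. That is why keyword-level measures matter when the subject is specialist vocabulary.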

The test also highlighted an important functional difference beyond raw transcription accuracy. Apple Dictation treated spoken formatting instructions such as “bullet point” as dictated text, reflecting its limited built-in command support. The specialist assistive technology correctly interpreted the instruction as a formatting command and structured the content accordingly. This distinction matters because effective speech-to-text support is not only about converting speech into words. For many users, it is also about producing structured, usable written communication in real time, particularly when working with specialist or technical language.

Medical terminology with Northern accent - Apple vs Paid for STT | Auto-punctuation is on for both tools

Conclusion

Built-in speech-to-text tools have improved, and for some people they can be genuinely useful. But usefulness is not the same as reliable access, and the evidence and testing in this article demonstrate why built-in tools should not be assumed to be functionally equivalent to specialist assistive technology.

Speech-to-text performance is highly dependent on context. Accuracy changes based on the speaker, accent, speech pattern, environment, audio quality and subject matter. A tool that performs well with clear, prepared speech in quiet conditions may behave very differently when faced with spontaneous conversation, regional accents, specialist terminology, speech variation or noisy real-world environments.

That distinction matters because students do not study in perfect conditions. They dictate while thinking, revising ideas, asking questions, participating in seminars, attending placements and working with technical language. In those situations, the difference between “partially recognisable” output and genuinely usable written communication becomes highly significant.

The testing in this article repeatedly showed the same pattern:

built-in tools often captured parts of the speech
specialist assistive technology preserved substantially more meaning, structure and context
the resulting output was materially more usable with significantly less correction or interpretation required

For users who rely on speech-to-text as part of accessible communication, this is not simply a matter of convenience or preference. It is the difference between technology that reduces barriers and technology that creates additional effort at the point support is needed most.

Availability alone therefore cannot be treated as evidence of equivalence. The fact that dictation, captioning or transcription features exist within mainstream devices and platforms does not mean they will provide reliable access across the wide range of environments, communication styles and support needs that exist in education and work.

The key question is not whether built-in speech-to-text tools exist. It is whether they can consistently provide accurate, usable and dependable access for people with disabilities in higher education.