As Covid wreaked havoc worldwide, public health information had to be made available at scale using technology. World leaders and health organizations had to get the word out on measures to avoid spreading the virus. Access to information was the need of the hour. And as the world increasingly turned to the virtual sphere to stay connected, video captioning became imperative.
In the world of captioning, speech recognition technology leverages Artificial Intelligence (AI) to scale the captioning process. It ticks two “scale” boxes: it saves time, and it cuts costs. AI programs are trained to produce high-quality transcription from speech. Unlike human effort, this approach is scalable: stenography and voice writing depend on the capacity of individual people. Although human transcription and captioning are still more accurate than automation, Automated (or Automatic) Speech Recognition, known as ASR, is what drives this process at scale.
ASR has had a significant impact on how speech is converted to text. Not only has it reduced the cost of production, but it has also sped up the process. This seemed an unattainable goal only a couple of decades ago, but today it is mainstream and used extensively in the media and entertainment industry.
What is ASR?
Before building a program, an engineer first breaks the underlying process down to understand it at the component level. Traditional ASR technology has three components. First, an acoustic model predicts phonemes (the smallest units of sound that distinguish one word from another); it is trained on short audio inputs so that it learns to recognize them. Second, a lexicon, or vocabulary, maps words to their phonemes, and the algorithm consults it alongside the acoustic output. Third, an overarching language model brings the two together, stringing words into likely sequences to produce the recognized text.
In a nutshell: machines are trained to recognize patterns in speech and language, and then parse that information to arrive at a textual output that is as close to human output as possible.
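The way the three components fit together can be sketched in a few lines of Python. This is a toy illustration, not how any production ASR engine (SyncWords' included) is implemented: the phoneme strings, the lexicon entries, the bigram scores, and the function names `score` and `decode` are all invented for the example, and the acoustic step is assumed to have already turned audio into one phoneme string per word.

```python
from itertools import product

# All data below is hypothetical and exists only to illustrate the idea.

# 1) Acoustic model output: one phoneme string per word-sized chunk of audio.
ACOUSTIC_OUTPUT = ["AY", "W AA N T", "T UW", "G OW"]

# 2) Lexicon: phoneme string -> candidate words (homophones share an entry).
LEXICON = {
    "AY": ["I", "eye"],
    "W AA N T": ["want"],
    "T UW": ["to", "two", "too"],
    "G OW": ["go"],
}

# 3) Language model: toy bigram log-probabilities; "<s>" marks sentence start.
BIGRAM_LOGP = {
    ("<s>", "I"): -0.5,
    ("I", "want"): -0.7,
    ("want", "to"): -0.4,
    ("want", "two"): -3.0,
    ("to", "go"): -0.6,
}
UNSEEN_LOGP = -5.0  # penalty for word pairs the toy model has never seen


def score(words):
    """Sum bigram log-probabilities over a candidate word sequence."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += BIGRAM_LOGP.get((previous, word), UNSEEN_LOGP)
        previous = word
    return total


def decode(phoneme_chunks):
    """Pick the word sequence the language model finds most likely."""
    candidates = [LEXICON[chunk] for chunk in phoneme_chunks]
    best = max(product(*candidates), key=score)
    return " ".join(best)


print(decode(ACOUSTIC_OUTPUT))
```

The acoustic output alone cannot tell "to", "two", and "too" apart; it is the language model that prefers "want to go". Real systems search over acoustic and language scores jointly rather than enumerating every combination as this sketch does.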
Limitations to ASR
But ASR is not a perfect technology. Its accuracy depends on many factors, including audio quality, speaker accents, and overlapping speech. Repetition and redundant speech are another source of error. Machines do not fully comprehend the speech fillers that have evolved alongside language and culture as part of the human thought process.
The most common errors in ASR fall into one of these buckets:
- Errors in recognizing speakers, especially in the case of multiple speakers
- False starts and speech fillers – All those “ahs,” “ums,” and “mm-hmms” that we use in conversation.
- Overlapping speech and background noise
- Poor audio quality
ASR also faces challenges when a speaker corrects themselves mid-sentence. A human captioner or transcriber will recognize these gaps and use judgment to render the text in a comprehensible format that mirrors the speaker’s intentions. Speech recognition technology still has some distance to go in discerning these patterns of speech and understanding context.
Advantages of ASR
Having said that, ASR can simplify captioning and transcription when cost and time are deciding factors. One effective workaround for the accuracy issue is to add an editing layer between the raw speech recognition output and the final transcript. And, as with most things in the technological sphere, accuracy rates can be expected to improve in newer iterations of the technology.
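One small, automated piece of such an editing layer might be a clean-up pass over the raw ASR text. The sketch below is only an assumption about what that could look like: the filler list and the function name `clean_transcript` are hypothetical, and a pass like this complements a human editor rather than replacing one.

```python
import re

# Hypothetical filler list; a real editing workflow would tune this.
FILLERS = ("ah", "um", "uh", "mm-hmm")

# Match a whole filler word, an optional trailing comma or period,
# and any whitespace that follows it.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)

# Match a word immediately repeated one or more times ("we we need").
_REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def clean_transcript(text):
    """Strip filler words and collapse immediate word repetitions."""
    cleaned = _FILLER_RE.sub("", text)
    cleaned = _REPEAT_RE.sub(r"\1", cleaned)
    # Normalize any leftover runs of whitespace.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(clean_transcript("um so we uh ship the captions today"))
print(clean_transcript("we we need um captions"))
```

A rule-based pass like this is deliberately blunt. It would also collapse a legitimate repetition such as "had had", and it cannot resolve a mid-sentence self-correction, which is why human judgment still matters in the editing step.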
And irrespective of these limitations, ASR plays a role in captioning, especially for live videos that don’t enjoy the luxury of ample production time. Caption providers work with human captioners for live events and also recognize the role that ASR can play in live video streaming instances.
Another role that AI plays in captioning and transcribing is Machine Translation (MT), which is the need of the hour in the world of localization. The irony here is that localization has paved the way for globalization, and vice versa. AI-powered translation opens up captions to non-native and non-English speakers, letting them take in the content more easily and in a language that feels closer to home. During the pandemic, adding translation to captions for live and online events made it possible for participants worldwide to engage with the content. Consumption of content became more accessible and inclusive through AI.
How SyncWords Leverages AI
While AI provides the scale and cost-effectiveness needed to generate captions, SyncWords’ unique approach leverages human input in critical phases of the project to boost accuracy, which is the key factor in driving customer satisfaction. For on-demand/pre-recorded captions, SyncWords’ proprietary AI technology syncs the media very accurately with the transcript, and by using transcripts generated by trained professionals, SyncWords produces accurately timed and worded captions. SyncWords also offers captions from ASR transcripts for customers who want captions generated quickly and affordably and are OK with using ASR-generated text.
For Live Captions, SyncWords offers both Human and ASR outputs; however, for Live Translations we encourage customers to use Human captions as the source and power the live translations to 100+ languages using AI translation.
In the words of SyncWords co-founder Ashish Shah: “SyncWords’ core technologies are powered using its proprietary Machine Learning technology and infrastructure. Using Artificial Intelligence in combination with automation, tools, and human services has reduced the time to generate captions and subtitles from a few days to just a few minutes. This hybrid approach has helped our customers immensely and increased their output and accuracy of captions.”
Artificial intelligence has made it possible to program machines with multiple rules while building out algorithms for technologies such as ASR and MT. In the last decade or two, we have seen many artificial intelligence platforms and services emerge, like Siri, Alexa, Cortana, chatbots, and Google speech-to-text. Add to this personalized search results and those prompted email responses (creepy sometimes!) that simplify the world of business communication. To get the best results when captioning live events or on-demand videos, it’s best to combine human and AI effort, so that the accuracy of one output feeds into the next.