Introduction 

With the rise of multilingual content, accurate and efficient transcription services have become a necessity for organizations dealing with audio and video content featuring a diverse range of languages, dialects, and accents.

The Multi-Language Audio Transcription feature helps to address language barriers by enabling the transcription of audio content in different languages. 

Concept 

VIDIZMO uses Amazon Transcribe to transcribe audio and video files within the portal when the AWS Indexer is utilized. With recent enhancements, transcription in VIDIZMO has been significantly improved.


The platform now features automated language detection within media files. This advancement ensures accurate transcription even in scenarios where speakers switch between languages during a conversation or when multiple participants speak different languages. 


Learn more about multilingual transcription services provided by AWS Transcribe here. 


VIDIZMO offers users three options for Transcription Language Mode: Specific Language, Auto Detect, and Auto-Detect Multi-Language, allowing for greater customization according to user needs. 


Specific Language: Use this mode when you are certain of the language spoken in the media files being uploaded to the portal. Users specify the desired language, and transcriptions are then generated exclusively in that language.


Auto Detect: If you are unsure which language your media file contains, this mode automatically identifies the primary language spoken and generates the transcript in that language. To improve transcription accuracy, you can manually specify two or more expected languages; however, specifying more than five is not recommended, as it may overwhelm the system. If uncertain, you can leave this field empty.


Auto-Detect Multi-Language: Use this mode if your media file includes multiple languages. It automatically identifies the languages spoken in your media file and generates a multilingual transcript accordingly. To improve transcription accuracy, you can manually specify two or more expected languages; however, do not specify more than five, as it may overwhelm the system. If uncertain, you can leave this field empty.
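The guidance shared by the two auto-detect modes above (specify at least two language hints, at most five, or leave the field empty) can be sketched as a small validation helper. The function name is hypothetical and shown only to illustrate the rule.

```python
# Hypothetical helper mirroring the guidance above: when hinting languages
# for Auto Detect or Auto-Detect Multi-Language, specify at least two and
# at most five language codes, or leave the list empty.
def validate_language_hints(language_codes):
    """Return the hints unchanged if they satisfy the guidance, else raise."""
    if not language_codes:
        return []  # leaving the field empty is allowed
    if len(language_codes) < 2:
        raise ValueError("Specify at least two dominant languages, or none.")
    if len(language_codes) > 5:
        raise ValueError("Specifying more than five languages is not recommended.")
    return list(language_codes)
```

For example, `validate_language_hints(["en-US", "fr-FR"])` passes, while a single hint or six hints raises an error.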

Language Support  

Because VIDIZMO leverages Amazon Transcribe, Multi-Language Audio Recognition and Transcription is enabled for all 14 supported languages across various AWS Regions, including US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Tokyo), Africa (Cape Town), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), South America (São Paulo), and AWS GovCloud (US-West). Refer to the documentation.

Language Identification Accuracy 

To enhance language identification, you can provide a list of languages that you anticipate might be present in your media. By specifying language options, you limit Amazon Transcribe to consider only those languages you designate when matching your audio to the appropriate language. This approach accelerates language identification and enhances the accuracy of assigning the correct language dialect. Refer here for more information. 
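At the API level, this corresponds to Amazon Transcribe's language identification settings, which can be supplied when starting a transcription job (for example, via boto3's start_transcription_job). The sketch below only builds the request parameters rather than calling AWS; the job name and media URI are placeholders.

```python
# Sketch of the parameters Amazon Transcribe accepts for language
# identification. Job name and media URI below are placeholders.
def build_language_id_request(media_uri, language_options=None, multi_language=False):
    params = {
        "TranscriptionJobName": "example-job",   # placeholder name
        "Media": {"MediaFileUri": media_uri},
    }
    if multi_language:
        params["IdentifyMultipleLanguages"] = True  # multilingual transcript
    else:
        params["IdentifyLanguage"] = True           # single dominant language
    if language_options:
        # Restricting the candidate set speeds up identification and improves
        # the accuracy of dialect assignment, as described above.
        params["LanguageOptions"] = list(language_options)
    return params
```

Passing the resulting dictionary to an actual transcription job would additionally require AWS credentials and an S3-hosted media file.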


Word/Character Limits

The Multi-Language Audio recognition functionality within VIDIZMO operates on audio data without imposing a character limit on the transcribed output. This means that audio files of varying lengths can be processed and transcribed without encountering character count restrictions. 

However, it's essential to be aware of file limitations: Amazon Transcribe can only process audio files up to four hours in length (or a maximum file size of 2 GB).
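The two limits above can be checked before upload with a simple guard. The function name and constants are illustrative, derived directly from the limits stated in this section.

```python
# Pre-upload check against the documented limits: four hours of audio
# or a maximum file size of 2 GB. Helper name is illustrative.
MAX_FILE_BYTES = 2 * 1024**3        # 2 GB
MAX_DURATION_SECONDS = 4 * 3600     # four hours

def within_upload_limits(size_bytes, duration_seconds):
    """Return True if the file fits within both documented limits."""
    return size_bytes <= MAX_FILE_BYTES and duration_seconds <= MAX_DURATION_SECONDS
```

Determining a file's actual duration requires inspecting the media itself; this sketch assumes both values are already known.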


Minimum Speech Duration

The accuracy of the generated transcript is directly influenced by the minimum duration of spoken language provided. Generally, a minimum of 30 seconds of speech is recommended for optimal performance. Shorter durations may lead to decreased accuracy due to the system having less speech context to analyze. 


Language Mismatch

When the Transcription Language Mode is set to a specific language and a user inputs the incorrect language in the AWS Indexer, the system will still generate a transcript by translating the spoken language in the audio into the selected language. This means that even if the audio contains multiple spoken languages, the transcript will be produced in the chosen language.


If the user selects Auto Detect or Auto-Detect Multi-Language as the Transcription Language Mode, or does not specify the language input (or provides an incorrect one), the system will generate a transcript based on the dominant language detected by default.


Typically, the system analyzes the first three seconds of speech to make an initial guess. The system evaluates the phonetic and structural similarities between the provided language codes and the actual language spoken in the audio. For example, if the audio is in French (fr-FR) but you provided Spanish (es-ES) and German (de-DE) codes, Spanish will be a closer match than German due to the shared pronunciation and vocabulary stemming from their Latin roots.


Supported File Formats

The supported media formats are AMR, FLAC, M4A, MP3, MP4, Ogg, WebM, and WAV.
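A minimal pre-upload check against this format list might look like the following; the helper name is hypothetical.

```python
# Check a filename's extension against the supported media formats
# listed above. Helper name is illustrative.
SUPPORTED_FORMATS = {"amr", "flac", "m4a", "mp3", "mp4", "ogg", "webm", "wav"}

def is_supported_format(filename):
    """Return True if the file extension is one of the supported formats."""
    if "." not in filename:
        return False
    ext = filename.rsplit(".", 1)[-1].lower()
    return ext in SUPPORTED_FORMATS
```

Note that extension checks are only a first-pass filter; the actual container and codec still determine whether the file processes successfully.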


Adding Custom Vocabulary

If a media file contains spoken words the system cannot detect accurately, you can create custom vocabularies to improve transcription accuracy for those specific words. These may include domain-specific terms, proper nouns, brand names, acronyms, or other words the system misrecognizes. You can create up to 100 custom vocabularies in your account. When defining a custom vocabulary, you can use either a list or a table format.

However, please note that the maximum size of a custom vocabulary file is 50 KB. Once you have created a custom vocabulary in the AWS Management Console, you can reference it by name in the AWS Indexer within the VIDIZMO portal.
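As a sketch, the table format is a tab-separated file with Phrase, SoundsLike, IPA, and DisplayAs columns, and the 50 KB limit above can be enforced when building it. The helper name and the sample entry are illustrative, not part of VIDIZMO's API.

```python
# Build a custom vocabulary in Amazon Transcribe's tab-separated table
# format and enforce the 50 KB size limit noted above. Helper name and
# entries are illustrative.
MAX_VOCAB_BYTES = 50 * 1024  # 50 KB limit

def build_vocabulary_table(entries):
    """entries: list of (phrase, sounds_like, ipa, display_as) tuples."""
    lines = ["Phrase\tSoundsLike\tIPA\tDisplayAs"]
    for phrase, sounds_like, ipa, display_as in entries:
        lines.append(f"{phrase}\t{sounds_like}\t{ipa}\t{display_as}")
    table = "\n".join(lines) + "\n"
    if len(table.encode("utf-8")) > MAX_VOCAB_BYTES:
        raise ValueError("Custom vocabulary exceeds the 50 KB limit.")
    return table
```

For example, `build_vocabulary_table([("VIDIZMO", "vih-diz-moh", "", "VIDIZMO")])` produces a small table; the resulting text would then be uploaded via the AWS Management Console.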


To learn more about custom vocabularies, consult the documentation.


Supported Audio Channels

VIDIZMO supports both single and dual audio channels for generating transcriptions. The system can transcribe multiple audio channels separately within a single file. This feature is beneficial for call recordings, where an agent and a caller are usually recorded on separate channels that are later merged into one audio file. While VIDIZMO can handle both mono and stereo audio formats, it's important to note that Amazon Transcribe does not currently support files with more than two audio channels. Uploading files with more than two channels may result in overlapping timestamps for each channel when individuals speak over each other.  
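The two-channel constraint above can be checked locally for WAV files using Python's standard-library wave module; the helper names below are illustrative, and the in-memory stereo clip exists only to demonstrate the check.

```python
import io
import wave

def channel_count(wav_bytes):
    """Return the number of audio channels in a WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnchannels()

def is_transcribable(wav_bytes):
    """Per the note above, Amazon Transcribe supports at most two channels."""
    return channel_count(wav_bytes) <= 2

# Build a short silent stereo (dual-channel) clip in memory to illustrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(2)        # dual-channel, e.g. agent and caller
    wav.setsampwidth(2)        # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 2 * 16000)  # one second of silence
stereo_wav = buf.getvalue()
```

Non-WAV containers (MP3, MP4, and so on) would need a media-inspection library instead of the wave module.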


Output and Handling

VIDIZMO's transcription feature supports multilingual audio that contains punctuation and numbers. The system automatically adds appropriate punctuation marks, such as commas, periods, question marks, and exclamation points, to the transcribed text for all supported languages. Additionally, it capitalizes words appropriately for languages that use case-sensitive writing systems. For most languages, except English and German, numbers spoken in the audio are transcribed into their written word forms (e.g., "twenty-three" instead of "23"). However, for English and German, the system handles numbers differently depending on the context. With automatic punctuation and context-aware number handling, VIDIZMO's transcription delivers transcripts that closely resemble manual transcriptions.


To enable this feature, refer to our documentation on How to Configure AWS Indexer App.


On-Demand Processing 

Users have the option to selectively process media files of their choice through On-Demand processing. By choosing "Process" from the media file options, users can enable Multi-Language Audio Recognition and Transcription and select the specific AI Insights they wish to generate for a media file.


To learn how to enable it using the process modal, refer to our document “Multi-Language Audio and Transcription with On-Demand Processing.”