Introduction  

  

The VIDIZMO Audio Indexer enables users to generate audio-based AI insights, such as transcriptions and translations, for both audio and video files. You can either specify the language or let the application auto-detect it. This article explains how these AI insights are generated and how the application works.


Concept  


When you process an audio or video file with Auto-Detect selected, the AI model set up in the VIDIZMO Audio Indexer app predicts the spoken language from the file's audio. Once the language is detected, the app generates the transcript in that language for the entire playback duration of the audio or video file.


The accuracy of transcriptions or translations for a language is reflected in that language's word error rate (WER), which measures word-level errors relative to a reference transcript. A lower WER indicates that the transcriptions generated for that language are more likely to be accurate. Your application can also affect the accuracy of the generated transcriptions.

 

Here is a list of languages supported by the VIDIZMO Audio Indexer, along with their respective WER when the Large-V1 model is used. The performance of the Audio Indexer and the accuracy of its Insights vary depending on the model size. The following sections of this article provide more information about the models. The languages in this list are arranged from most accurate (lowest WER) to least accurate (highest WER).


 
Serial No.   Language         WER (%)
1            Spanish          3.5
2            Italian          4.2
3            English          4.5
4            Portuguese       4.8
5            German           5.5
6            Japanese         6.4
7            Russian          6.4
8            Polish           7.2
9            French           7.7
10           Catalan          8.0
11           Dutch            8.3
12           Indonesian       8.5
13           Turkmen          9.4
14           Turkish          9.4
15           Malay            10.2
16           Ukrainian        10.3
17           Swedish          10.5
18           Vietnamese       10.7
19           Norwegian        11.4
20           Finnish          12.2
21           Thai             13.2
22           Korean           15.2
23           Romanian         15.4
24           Slovak           15.7
25           Tagalog          15.8
26           Croatian         16.7
27           Danish           16.8
28           Czech            17.4
29           Arabic           18.1
30           Bulgarian        18.4
31           Greek            18.7
32           Galician         19.0
33           Chinese          19.6
34           Macedonian       20.6
35           Tamil            20.6
36           Bosnian          20.7
37           Hungarian        21.0
38           Urdu             25.0
39           Estonian         25.5
40           Hindi            26.9
41           Slovenian        27.8
42           Latvian          28.3
43           Azerbaijani      28.7
44           Serbian          29.2
45           Hebrew           30.2
46           Lithuanian       35.2
47           Persian          36.1
48           Welsh            36.6
49           Afrikaans        42.6
50           Icelandic        43.0
51           Marathi          43.7
52           Kazakh           43.8
53           Māori            45.7
54           Swahili          47.9
55           Nepali           52.2
56           Armenian         53.7
57           Belarusian       56.6
58           Kannada          69.8
59           Tajik            74.5
60           Occitan          75.9
61           Lingala          76.8
62           Maltese          80.5
63           Luxembourgish    86.5
64           Hausa            87.0
65           Javanese         87.0
66           Pashto           92.7
67           Uzbek            93.3
68           Khmer            96.0
69           Georgian         100.5
70           Telugu           100.6
71           Malayalam        101.4
72           Lao              101.6
73           Punjabi          102.8
74           Somali           103.5
75           Gujarati         103.9
76           Bengali          104.9
77           Assamese         105.6
78           Mongolian        106.2
79           Yoruba           111.7
80           Amharic          129.3
81           Shona            130.0
82           Sindhi           177.9


In addition to the languages above, the VIDIZMO Audio Indexer can generate insights for some additional, rarer languages. However, because training data for these languages is scarce, their estimated Word Error Rate (WER) is high, and the results may be unusual or incomplete.


  • Bashkir
  • Tibetan 
  • Breton
  • Basque
  • Faroese 
  • Hawaiian
  • Haitian Creole
  • Latin
  • Malagasy
  • Burmese
  • Norwegian Nynorsk
  • Sanskrit
  • Sinhalese
  • Albanian
  • Sundanese
  • Tatar
  • Yiddish
  • Cantonese

  

The Audio Indexer application only processes videos that contain audio. If speech is detected in that audio, the app generates transcripts accordingly. When the input is a video, the VIDIZMO Audio Indexer separates the audio track from the video and then runs transcription on the audio alone.
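The article does not say which tool performs the audio separation. As a minimal sketch only, assuming ffmpeg is installed and that a mono 16 kHz WAV file is a suitable target (both are assumptions, not documented VIDIZMO behavior), the step could look like this. The function name is hypothetical.

```python
from typing import List


def build_audio_extraction_command(video_path: str, audio_path: str) -> List[str]:
    """Build an ffmpeg command that keeps only the audio track of a video."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # input video
        "-vn",             # drop the video stream, keep audio only
        "-ac", "1",        # downmix to mono (assumed target format)
        "-ar", "16000",    # resample to 16 kHz (assumed target format)
        audio_path,
    ]


command = build_audio_extraction_command("lecture.mp4", "lecture.wav")
print(" ".join(command))

# To actually run it (requires ffmpeg on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.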

  

When the VIDIZMO Audio Indexer has Transcriptions in its Insights, you can't use the AWS Indexer App or Azure Video Analyzer ARM for transcriptions. Likewise, you can't activate the Audio Indexer app if either the AWS Indexer App or Azure Video Analyzer ARM is enabled with the option to generate Transcriptions.


  • Translation Concept


If the user has also opted for Translation, the VIDIZMO Audio Indexer app translates the speech from the language detected in the audio or video file. These translations appear in the Transcription pane of your file's playback page. If both Transcription and Translation are selected, the Audio Indexer generates both Insights, and the user can choose which one to view from the Transcription pane. As of now, the VIDIZMO Audio Indexer can only translate into English.


If an audio or video file doesn't have any transcriptions, processing it with the Translation Insight generates the English translations, regardless of the spoken language in it. Portal content that already has transcriptions, either from another indexing application or the user uploading a .vtt file, can still be processed for the English translations. 

 
The features offered by the VIDIZMO Audio Indexer app utilize AI processing as a consumption metric for your VIDIZMO Account. To learn how you can view consumption reports, refer to Consumption Reports for SaaS Deployment Overview.

  

Processing

  

To generate accurate transcriptions or translations for your audio and video files, the Audio Indexer aims to minimize substitutions, insertions, and deletions, which reduces the overall Word Error Rate (WER). The lower the WER of a transcription, the more accurate it is.
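WER is conventionally computed as (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference transcript. The sketch below shows that standard calculation using a word-level edit distance. It is an illustration of the metric, not the Audio Indexer's own evaluation code, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * (S + D + I) / N."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 1))
```

Because insertions count as errors but do not add to N, a transcript with many extra words can score above 100%, which is why some languages in the list above have a WER greater than 100.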

  

In the first step of the indexing process, a feature extractor converts the raw audio from your audio or video files into a log-Mel spectrogram. An encoder then maps these spectrogram features to a sequence of encoder hidden states. Finally, a decoder acting as an internal language model (LM) predicts the text tokens autoregressively, one at a time, conditioned on the encoder hidden states and on the tokens it has already produced.
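The token-by-token decoding step described above can be pictured with a toy greedy loop. This is a sketch under stated assumptions: `toy_scores` is a hypothetical stand-in for a trained decoder, the "encoder states" are plain strings rather than real hidden-state vectors, and greedy selection is only one possible decoding strategy. None of these names come from the VIDIZMO product.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence marker

ScoreFn = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def greedy_decode(encoder_states: Sequence[str],
                  next_token_scores: ScoreFn,
                  max_tokens: int = 20) -> List[str]:
    """Generate tokens one at a time, feeding earlier tokens back in."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        # Each prediction depends on the encoder output AND the tokens so far.
        scores = next_token_scores(encoder_states, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


def toy_scores(encoder_states: Sequence[str],
               previous: Sequence[str]) -> Dict[str, float]:
    """Toy decoder: emits one word per encoder state, then ends the sequence."""
    position = len(previous)
    if position >= len(encoder_states):
        return {EOS: 1.0}
    return {encoder_states[position]: 0.9, EOS: 0.1}


print(greedy_decode(["hello", "world"], toy_scores))
```

The important property is the feedback loop: the list of already generated tokens is passed back into every scoring call, which is what makes the decoding autoregressive.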


You can also configure the application to automatically process any audio or video file uploaded or added to your Portal via other means. Learn how to do this configuration and more by visiting: Configuring VIDIZMO Audio Indexer for Insights


Conclusion  

 

The VIDIZMO Audio Indexer aims to generate reliable, accurate transcriptions, translations, and AI insights for all your audio and video files. You can configure the application to automatically generate insights for files that you add to the Portal, using the same configured settings. If you are unsatisfied with the results, you can also manually create and regenerate transcriptions with different parameters from your Portal.

  

For a hands-on guide to the transcription generation process, visit How to Generate Insights using VIDIZMO Audio Indexer