Acoustic Model and Language Model

In summary, the conversation is discussing an exercise involving a vowel (V), a feature vector (O), and two models: an acoustic model (P AM) and a language model (P LM). The goal is to find a vowel (V) that will maximize the probability of V given O (P(V|O)), using the log likelihoods provided in a table. The specific steps for finding this vowel are unclear, as well as the role of the numbers in the log table. It is recommended to seek clarification from an instructor.
  • #1
nao113
68
13
Homework Statement
Suppose 𝑉 is a vowel and 𝑂 is a feature vector.
Suppose that 𝑃 AM (𝑂|𝑉) is an acoustic model and 𝑃 𝐿M (𝑉) is a language model. Obtain a vowel 𝑉 that maximizes 𝑃(𝑉|𝑂) when the acoustic and language model log likelihoods are given in the following table.
Relevant Equations
W: a vowel v (v ∊ {a,i,u,e,o})
O: a feature vector
Question:
Screenshot 2023-04-25 at 19.26.03.png


My Answer:
WhatsApp Image 2023-04-25 at 19.32.30.jpeg


Is it correct? Thank you
 
Last edited by a moderator:
Physics news on Phys.org
  • #2
nao113 said:
Homework Statement: Suppose 𝑉 is a vowel and 𝑂 is a feature vector.
Suppose that 𝑃 AM (𝑂|𝑉) is an acoustic model and 𝑃 𝐿M (𝑉) is a language model. Obtain a vowel 𝑉 that maximizes 𝑃(𝑉|𝑂) when the acoustic and language model log likelihoods are given in the following table.
Relevant Equations: W: a vowel v (v ∊ {a,i,u,e,o})
O: a feature vector

Question:
View attachment 325473

My Answer:
View attachment 325474

Is it correct? Thank you
No idea without some more context.
Is P(V|O) a conditional probability?
What does argmax mean?
How did you go from ##P(V|O)## to ##\frac{P(O|V)P(V)}{P(O)}## in the 2nd line of your work and similar for the 3rd line?
What role do the numbers in the log table play?
 
  • #3
This is the reference that I got, I don t know about what argmax mean here, so I assumed it has the same meaning as log e (P(V|O)).
Screenshot 2023-04-26 at 17.05.46.png

Screenshot 2023-04-26 at 17.06.12.png

Screenshot 2023-04-26 at 17.05.55.png
 
  • #4
What you've posted so far doesn't give any definition of "argmax". In your work that you showed in post #1, you added the numbers in the first row of the table to get one sum, and then added the numbers in the second row to get another sum. You then multiplied the two sums.

Given that I know nothing more about this than what you posted, I think your work is incorrect. My guess, and this is only a guess, is that to maximize ##P(O|W)P(O)## what you need to do is to look at the five separate products of the numbers in the five columns, and pick whichever one is the largest. You might get better advice by contacting your instructor.
 

FAQ: Acoustic Model and Language Model

What is an acoustic model in speech recognition?

An acoustic model in speech recognition is a statistical representation of the relationship between audio signals and the phonetic units of speech. It is used to convert raw audio data into a sequence of phonemes, which are the basic sound units of a language. The model is typically trained using a large dataset of audio recordings and their corresponding transcriptions.

What is a language model in speech recognition?

A language model in speech recognition is a statistical tool that predicts the probability of a sequence of words. It helps in determining the likelihood of a given word sequence, thereby improving the accuracy of transcription by considering the context of the words. Language models are trained on large text corpora and can be based on various techniques, such as n-grams, neural networks, or transformers.

How do acoustic models and language models work together in speech recognition systems?

In speech recognition systems, the acoustic model and language model work together to convert spoken language into text. The acoustic model first processes the audio signal to identify phonetic units. These phonetic units are then passed to the language model, which uses contextual information to construct the most likely sequence of words. This collaboration helps in improving the overall accuracy and coherence of the recognized text.

What are the common techniques used to train acoustic models?

Common techniques used to train acoustic models include Hidden Markov Models (HMMs), Gaussian Mixture Models (GMMs), and more recently, deep learning methods such as Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). These techniques involve training on large datasets of audio recordings paired with their transcriptions to learn the mapping between audio features and phonetic units.

What types of language models are commonly used in speech recognition?

Common types of language models used in speech recognition include n-gram models, which predict the next word based on the previous n words, and neural network-based models, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformer models. These advanced models can capture long-range dependencies and provide more accurate predictions by considering a larger context of the word sequence.

Similar threads

Replies
9
Views
2K
Replies
17
Views
1K
Replies
1
Views
1K
Replies
2
Views
809
Back
Top