Is this a good word embedding model for text analysis?

  • #1
Trollfaz
I am trying to build an AI to group texts into topics using k-means clustering, so I must embed each text as a vector. Everyone knows that the conventional way is to first tokenize the text into words, stripping all the punctuation characters. This gives me a list of words. Here is what I plan to do with each word:
1) Convert it to lowercase.
2) One-hot encode each character, so the inputs are 26-dimensional vectors/arrays.
3) Feed the arrays/vectors into an RNN sequentially, so that the order of the characters is known.
Okay, here is the mathematical description. Before processing the nth character, the RNN is in state ##s_{n-1}##. The one-hot encoded vector of the nth character is ##c_n##.
$$s_0=0$$
$$s_n=s_{n-1}+Wc_n$$
Here ##W## is a 26 by 26 transformation matrix. Return ##s_n## when all characters have been processed. If one wants to emphasize the significance of capital letters, one can assign a value of 2 instead of 1 at the corresponding index of the one-hot encoded vector.
But my main problem is how to get ##W##: can it be random, or must the values be adjusted through training?
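In code, the scheme described above might look like the following minimal NumPy sketch. The random initialization of `W` here is exactly the open question, not a recommendation:

```python
import numpy as np

# W is the 26x26 transformation matrix from the description above.
# Whether a random W suffices is the question being asked; here it is
# simply drawn from a standard normal distribution for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((26, 26))

def one_hot(ch: str) -> np.ndarray:
    """One-hot encode a lowercase letter a-z as a 26-dim vector (c_n)."""
    v = np.zeros(26)
    v[ord(ch) - ord("a")] = 1.0
    return v

def embed_word(word: str) -> np.ndarray:
    """Apply the update s_0 = 0, s_n = s_{n-1} + W @ c_n over the word."""
    s = np.zeros(26)
    for ch in word.lower():
        if ch.isalpha():
            s = s + W @ one_hot(ch)
    return s

print(embed_word("hello").shape)  # -> (26,)
```

One consequence worth noting: because the update is purely linear, the final state equals ##W\sum_n c_n##, so the result is unchanged if the characters are reordered. A standard RNN avoids this by passing ##s_{n-1}## through a nonlinearity and a recurrent weight matrix at each step.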
 
  • #2
Trollfaz said:
Is this a good word embedding model for text analysis?
No, it's not a word embedding model at all - your model is character based, not word based.

Trollfaz said:
Everyone knows that the conventional way is to first tokenize the text into words, stripping all the punctuation characters.
Does everyone know that? I think I could probably find an exception.

Trollfaz said:
This is what I plan to do with each word...
There is no tokenization into words in the process you describe. There are many descriptions of tokenization on the interweb, e.g. https://medium.com/@james.moody/preparing-text-for-rnn-language-modeling-in-python-6474c3eae66e
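For comparison, a bare-bones word tokenizer (a minimal illustration only, not the method from the linked article) can be as simple as:

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Hello, world!"))  # -> ['hello', 'world']
```

Real tokenizers handle contractions, numbers, and subword units, but even this toy version shows the word-level step that is missing from the character-based scheme in #1.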
 