Is this a good word embedding model for text analysis?

  • #1
Trollfaz
I am trying to build an AI to group texts into topics using k-means clustering, so I must embed each text as a vector. Everyone knows that the conventional way is to first tokenize the text into words, stripping all the punctuation characters. This gives me a list of words. Here is what I plan to do with each word:
1) Convert it to lowercase.
2) One-hot encode each character, so the inputs are 26-dimensional vectors/arrays.
3) Feed the arrays/vectors into an RNN sequentially, so that the order of the characters is known.
Okay, here is the mathematical description. Before processing the nth character, the RNN is in state ##s_{n-1}##. The one-hot encoded vector of the nth character is ##c_n##.
$$s_0=0$$
$$s_n=s_{n-1}+Wc_n$$
Here ##W## is a 26 by 26 transformation matrix. Return ##s_n## when all characters have been processed. If one wants to emphasize the significance of capital letters, one can assign a value of 2 instead of 1 at the corresponding index of the one-hot encoded vector.
But my main problem is how to get ##W##: can it be random, or must the values be adjusted through training?
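In code, the scheme described above might look like the following minimal NumPy sketch. The random initialization of `W` here is exactly the open question, not a recommendation:

```python
import numpy as np

# W is the 26x26 transformation matrix from the description above.
# Whether a random W suffices is the question being asked; here it is
# simply drawn from a standard normal distribution for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((26, 26))

def one_hot(ch: str) -> np.ndarray:
    """One-hot encode a lowercase letter a-z as a 26-dim vector (c_n)."""
    v = np.zeros(26)
    v[ord(ch) - ord("a")] = 1.0
    return v

def embed_word(word: str) -> np.ndarray:
    """Apply the update s_0 = 0, s_n = s_{n-1} + W @ c_n over the word."""
    s = np.zeros(26)
    for ch in word.lower():
        if ch.isalpha():
            s = s + W @ one_hot(ch)
    return s

print(embed_word("hello").shape)  # -> (26,)
```

One consequence worth noting: because the update is purely linear, the final state equals ##W\sum_n c_n##, so the result is unchanged if the characters are reordered. A standard RNN avoids this by passing ##s_{n-1}## through a nonlinearity and a recurrent weight matrix at each step.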
 
  • #2
Trollfaz said:
Is this a good word embedding model for text analysis?
No, it's not a word embedding model at all - your model is character based, not word based.

Trollfaz said:
Everyone knows that the conventional way is to first tokenize the text into words, stripping all the punctuation characters.
Does everyone know that? I think I could probably find an exception.

Trollfaz said:
This is what I plan to do with each word...
There is no tokenization into words in the process you describe. There are many descriptions of tokenization on the interweb, e.g. https://medium.com/@james.moody/preparing-text-for-rnn-language-modeling-in-python-6474c3eae66e
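For comparison, a bare-bones word tokenizer (a minimal illustration only, not the method from the linked article) can be as simple as:

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Hello, world!"))  # -> ['hello', 'world']
```

Real tokenizers handle contractions, numbers, and subword units, but even this toy version shows the word-level step that is missing from the character-based scheme in #1.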
 