A walkthrough of the word2vec basic example in Python and TensorFlow

Understanding embeddings in TensorFlow

You will get a better understanding by going through the content below, but first we need to cover some key concepts.

Step 1: we need to convert the text into words, each with an id assigned to it. For example, we have "word_ids" defined, which contains "we" "are" "the" "world" - shape 4, with 4 integers.

Next, we map "word_ids" to vectors using "tf.nn.embedding_lookup". The result, "embedded_word_ids", has shape (4, embedding_size) and represents what these words mean in a specific context, given our vocabulary.

word_embeddings = tf.get_variable("word_embeddings",
    [vocabulary_size, embedding_size])
embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)
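Under the hood, an embedding lookup is just row selection: it picks the rows of the embedding matrix that match the given ids. A minimal NumPy sketch of that behaviour (the toy sizes are my own, not from the original program):

```python
import numpy as np

vocabulary_size = 4      # "we", "are", "the", "world"
embedding_size = 3       # tiny size, just for illustration

# Each row of the matrix is the vector for one word id.
word_embeddings = np.random.uniform(-1.0, 1.0,
                                    (vocabulary_size, embedding_size))

word_ids = np.array([0, 1, 2, 3])        # "we are the world"

# Same effect as tf.nn.embedding_lookup(word_embeddings, word_ids):
embedded_word_ids = word_embeddings[word_ids]

print(embedded_word_ids.shape)           # one vector per word: (4, 3)
```

So the lookup of 4 ids returns 4 vectors, one per word.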

This is basically the same program as the word2vec basic TensorFlow example, but I used a smaller dataset - just 100 distinct words (which will be discussed later).

Please note, the processing here is exactly the same as in a Gensim tutorial here. It helps you see how we build vectors from words, check whether a search string appears, and measure how similar words are.

The aim is to match words to a context. For example,

Example 1 :- Cooling Orange Drink - "drink" here means a juice

Example 2 :- Eric drinks beer - "drink" here is the act of drinking

If we are trying to find a product, example 1 is what we would like to identify.

It is worth having a look at this sample program to see how this works. After selecting "fruit and juice" or "king and queen" as the context, click "next" to see how the vectors are generated. It's pretty cool.

The file has a data size of 144.

If you run this code, you will see the following :

The count variable stores this data. Think of it as a word-count array: you can see that the word "the" is found 9 times, "in" occurs 6 times, and so on.

Most common words (+UNK) [['UNK', 3], ('the', 9), ('in', 6), ('he', 6), ('and', 4)]
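A count list like this can be reproduced with collections.Counter, similar to how the example's build_dataset works. A sketch with my own toy text (not the blog's dataset):

```python
import collections

words = ("the quick fox the lazy dog the end in he in he in "
         "he and and and and").split()

vocabulary_size = 5                       # UNK + the 4 most common words
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))

# Every word outside the top (vocabulary_size - 1) is folded into UNK.
dictionary = {word: i for i, (word, _) in enumerate(count)}
unk_count = sum(1 for w in words if w not in dictionary)
count[0][1] = unk_count

print(count)
```

Note that, just as in the output above, the UNK entry is a list (its count is patched in afterwards) while the rest are tuples.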

Sample data [18, 8, 1, 86, 2, 1, 85, 6, 62, 18] ['jeremy', 'is', 'the', 'best', 'in', 'the', 'world', 'of', 'soccer.', 'jeremy']

In the sample data above, the variable called data stores index references into your reverse_dictionary. reverse_dictionary maps every index back to a word in your dataset.

reverse_dictionary looks like this :-

{0: 'UNK', 1: 'the', 2: 'in', 3: 'he', 4: 'and', 5: 'to', 6: 'of', 7: 'his', 8: 'is', 9: 'went', 10: 'Real', 11: 'a', 12: 'As', 13: 'down.', 14: 'become', 15: 'up', 16: 'eventually', 17: 'more', 18: 'jeremy', 19: 'was', 20: 'turn', 21: 'soccer', 22: 'talent', 23: 'way.', 24: 'grew', 25: 'clubs', 26: 'for', 27: 'Manchester', 28: 'interest', 29: 'kuala', 30: 'scorer', 31: 'league', 32: 'good', 33: 'dollars,', 34: 'debut,', 35: 'double', 36: 'town', 37: 'call', 38: 'Madrid.', 39: 'shown', 40: 'United', 41: 'most', 42: 'age', 43: 'price', 44: 'Madrid,', 45: 'soon,', 46: 'leading', 47: 'sold', 48: 'goals,', 49: 'master', 50: 'it', 51: 'that', 52: 'skills.', 53: 'time.', 54: 'After', 55: 'Many', 56: 'coach', 57: 'lumpur', 58: 'at', 59: 'not', 60: '18,', 61: 'bidding', 62: 'soccer.', 63: 'player', 64: 'Saleh.', 65: 'league.', 66: 'million', 67: 'premier', 68: 'one', 69: 'small', 70: 'making', 71: 'At', 72: 'by', 73: 'where', 74: 'score', 75: 'were', 76: 'too', 77: 'leader.', 78: 'The', 79: 'Dollah', 80: 'study', 81: 'born', 82: 'identified', 83: 'He', 84: 'bid', 85: 'world', 86: 'best', 87: 'come', 88: 'expensive', 89: 'stuff', 90: 'great', 91: 'grew,', 92: 'offer', 93: 'Liverpool', 94: 'first', 95: 'played', 96: 'malaysia', 97: 'natural', 98: '100', 99: 'him.'}

If you look at the "data" variable above, which I shall copy and paste here for easy reference :-

Sample data [18, 8, 1, 86, 2, 1, 85, 6, 62, 18] - each index maps to an entry in your reverse_dictionary variable.
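You can verify the mapping yourself with an excerpt of the reverse_dictionary shown above:

```python
# Excerpt of the reverse_dictionary printed earlier.
reverse_dictionary = {18: 'jeremy', 8: 'is', 1: 'the', 86: 'best',
                      2: 'in', 85: 'world', 6: 'of', 62: 'soccer.'}

data = [18, 8, 1, 86, 2, 1, 85, 6, 62, 18]

# Map each index back to its word.
words = [reverse_dictionary[i] for i in data]
print(words)
# ['jeremy', 'is', 'the', 'best', 'in', 'the', 'world', 'of', 'soccer.', 'jeremy']
```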

Ok, so far so good. Next, we need to understand that we feed random data into a variable called "valid_examples" using np.random.choice. The valid_examples are generated randomly:

valid_examples = np.random.choice(valid_window, valid_size, replace=False)

It will generate something like :- [ 3  5  8  1  4 11 10  2  6  9  7 12  0] and it is different every time.
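To make the output repeatable for this walkthrough, you can fix the random seed (the seed and window size below are my own choices, not from the original):

```python
import numpy as np

valid_window = 13   # toy value: sample ids only from the first 13 words
valid_size = 13

np.random.seed(0)   # fixed seed purely so the result is reproducible here
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

print(valid_examples)   # a shuffled permutation of 0..12, no repeats
```

replace=False is what guarantees every id appears at most once.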

The learning process

How do we get the following output :

Nearest to he: eventually, best, Many, bid, to, jeremy, and, player,

Nearest to his: league, Liverpool, bid, Dollah, Many, first, grew,, premier,

Nearest to As: making, study, Saleh., it, up, 18, call, 100

We compute a similarity score for every word and take the words with the highest values.
NumPy's argsort() makes this easy: it returns the indices that would sort an array, in ascending order by default (so the code negates the similarities to put the highest first). Because argsort() returns indices rather than values, you can look each index up in reverse_dictionary to find which words are closest.

For example,

Nearest to he: eventually, best, Many, bid, to, jeremy, and, player,

After calling argsort() on the similarity row for "he", you get a list of indices like :-

[ 3 16 86 ........] 

If you look these up in the reverse_dictionary, you get

3 - he 
16 - eventually
86 - best 

(Index 3 is "he" itself - a word is always most similar to itself, which is why the first index is skipped and the nearest list starts with "eventually".)
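The whole nearest-neighbour step can be sketched in plain NumPy. The similarity values and the four-word vocabulary below are made up for illustration:

```python
import numpy as np

reverse_dictionary = {0: 'he', 1: 'eventually', 2: 'best', 3: 'Many'}

# One row of the similarity matrix: how similar 'he' is
# to every word in the vocabulary (toy values).
sim = np.array([1.0, 0.8, 0.6, 0.1])

top_k = 2
# argsort sorts ascending, so negate to put the largest first;
# skip position 0 because a word is always closest to itself.
nearest = (-sim).argsort()[1:top_k + 1]

print([reverse_dictionary[i] for i in nearest])
# ['eventually', 'best']
```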


An embedding, in its simplest form, is a data structure that maps words to vectors. The vectors here are numbers that try to describe something about each word. The example below shows an embedding:

king : (0.99, 0.243, 0.10) 
servant : (0.01, 0.2, 0.45) 

These numbers make the words suitable for a machine learning model or classifier to learn from and update. It is also useful to inspect the embedding output, as it tells us what we have learned. For example:

king : (castle : 0.9, dragon : 0.6, crown : 0.8, floor : 0.2)

In this example, we learn that "king" is most closely related to "castle" (0.9), then "crown" (0.8), then "dragon" (0.6), and only weakly related to "floor" (0.2).
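"Closeness" between two embedding vectors is typically measured with cosine similarity. A quick NumPy sketch, reusing the made-up king/servant vectors from above:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors after normalising each to unit length.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king    = np.array([0.99, 0.243, 0.10])
servant = np.array([0.01, 0.2,   0.45])

print(cosine_similarity(king, king))     # 1.0: identical direction
print(cosine_similarity(king, servant))  # much lower: different directions
```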

So what does an embedding look like in this program?

First of all, the embedding here is a representation of how close a word is to the OTHER words in your dataset. Being a representation, its first dimension must match your dataset vocabulary (so that every sample can eventually be matched). If you set the embedding too small, you won't get the complete picture. Since my dataset is only 100 words, it's pretty limited.

vocabulary_size is 100, as defined earlier.

# define our embedding

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Look up our embedding from train_input
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

Here embeddings has a shape of 100 x 128 (vocabulary_size x embedding_size); embed picks out only the rows for the current training batch, so its shape is batch_size x 128.

Next, in the code below, we define our nce_weights.

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))

nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

We define the NCE loss, which our gradient descent optimizer will minimize:

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

Starting from line 218 onwards, we begin training. So what do we expect to see? Every word in our sample gets evaluated. For example, given this sample data :

['jeremy', 'is', 'the', 'best', 'in', 'the', 'world', 'of', 'soccer.', 'jeremy']

each word, starting with "jeremy", gets its cosine similarity against every word in the vocabulary computed, which looks something like this :-

[-5.68465097e-04  1.42556906e-01 -2.13468680e-03  9.79696051e-04
 -4.15585376e-03  3.45100686e-02  7.32974112e-02  1.18374474e-01
  5.04199043e-02 -1.00000012e+00  1.93326920e-02 -1.51704356e-01
  3.00122537e-02 -3.77981998e-02  7.34577999e-02 -1.36730686e-01
 -1.66732311e-01 -1.13654330e-01 -8.84845504e-04  3.10374331e-02
 -2.32265756e-01  5.24142049e-02 -3.00262541e-01 -6.34672940e-02
  2.78285332e-02 -1.01595841e-01 -3.91205885e-02 -9.69273672e-02
  6.87757581e-02  8.64563975e-03 -3.42060663e-02  2.50732958e-01
  3.48346261e-03  1.52630433e-01 -1.93282276e-01 -1.09572411e-01
 -8.20828602e-02 -1.37085319e-01 -7.82828256e-02  5.01848906e-02
 -1.01905800e-01 -1.04515530e-01 -4.82192412e-02  1.68212857e-02
 -1.98008884e-02 -1.16640903e-01 -3.84130292e-02 -1.12949029e-01
 -1.06050819e-01 -1.23330392e-02  9.10504907e-02 -3.33651006e-02
 -4.24655154e-02 -1.15328999e-02 -5.77907786e-02  9.77605656e-02
  8.10332969e-02  6.07379079e-02  2.02641532e-01 -1.51244719e-02
  4.14249226e-02 -1.14626959e-02 -6.36482537e-02 -1.35519411e-02
  9.30523202e-02  3.21833454e-02 -6.66338503e-02  1.20710367e-02
 -6.63076853e-03 -1.86744314e-02  1.22547029e-02 -2.33999155e-02
 -5.84852956e-02 -1.43135965e-01 -4.12122272e-02  4.74856272e-02
  6.39823750e-02  2.35483702e-02  5.01024872e-02 -2.40460993e-03
 -1.90819666e-01  5.63637652e-02 -4.26209439e-03 -1.62980296e-02
  3.03985439e-02  3.69692743e-02 -7.73100853e-02 -1.10782385e-02
 -2.14466602e-02  1.37837037e-01  4.03530151e-02 -5.49760759e-02
 -3.53281312e-02 -1.06219426e-01  8.35022479e-02 -3.85019407e-02
  2.11433484e-03 -7.62748420e-02  4.74403687e-02  7.15355203e-02]

Subsequently, we take the highest (closest matched) values and print the top 8 closest matches, as shown below :-

Nearest to he: eventually, best, Many, bid, to, jeremy, and, player,

Final results are stored in

final_embeddings = normalized_embeddings.eval()
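The normalization step divides each row of the embedding matrix by its L2 norm, so that cosine similarity reduces to a plain dot product. A NumPy equivalent of what normalized_embeddings computes (random matrix standing in for the trained one):

```python
import numpy as np

embeddings = np.random.uniform(-1.0, 1.0, (100, 128))

# Divide each row by its L2 norm (length).
norm = np.sqrt(np.square(embeddings).sum(axis=1, keepdims=True))
final_embeddings = embeddings / norm

# Every row now has unit length.
print(np.linalg.norm(final_embeddings, axis=1)[:3])
```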

We plot the result using t-SNE, passing in the "final_embeddings" variable; the generated image is saved to disk, and you can open it from your user directory.
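The plotting step boils down to projecting the 128-dimensional vectors down to 2-D with t-SNE. A minimal sketch using scikit-learn (assumed to be installed; a small random matrix stands in for the real final_embeddings, and the perplexity value is my own choice):

```python
import numpy as np
from sklearn.manifold import TSNE

final_embeddings = np.random.rand(20, 128)   # stand-in for trained embeddings

# Reduce 128 dimensions down to 2 for plotting.
tsne = TSNE(n_components=2, perplexity=5, init='pca', random_state=0)
low_dim = tsne.fit_transform(final_embeddings)

print(low_dim.shape)   # (20, 2): one x/y point per word, ready to scatter-plot
```

Each 2-D point can then be scattered with matplotlib and labelled with its word from reverse_dictionary.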

Some notes : why do we need embeddings?
When interpreting words, we use large corpora, and raw text is huge. So we convert it into a multidimensional numeric representation to make it easier to work with.

If you need to understand what a specific function in word2vec basic does, please refer to here.

