word2vec tensorflow example : word embeddings function reference
Walking through the word2vec basic example (word2vec_basic.py from the TensorFlow tutorials) :-
Quoting from s0urcer:
The idea of skip-gram is comparing words by their contexts. So we consider words equal if they appear in equal contexts. The first layer of the NN represents the words' vector encodings (basically what is called embeddings). The second layer represents context. Every time we take just one row (Ri) of the first layer (because the input vector always looks like 0, ..., 0, 1, 0, ..., 0) and multiply it by all columns of the second layer (Cj, j = 1..num of words) and that product will be the output of the NN. We train the neural network to have maximum output components Ri * Cj if words i and j appear nearby (in the same context) often. During each cycle of training we tune only one Ri (again because of the way input vectors are chosen) and all Cj, j = 1..num of words. When training ends we toss the matrix of the second layer because it represents context. We use only the matrix of the first layer, which represents the vector encoding of the words.
read_data(filename)
Reads the zip file and returns its contents as a list of words (the raw text is split on whitespace). The first words look like this:
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing']
Data size 17005207
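For reference, a sketch of read_data close to the tutorial's version (it assumes the text8 zip archive that the example downloads):

import zipfile
import tensorflow as tf

def read_data(filename):
  """Extract the first file inside a zip archive as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data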
build_dataset()
count.extend(collections.Counter(words).most_common(n_words - 1)) - this command scans the words and creates a ranking by frequency of use. For example, on the 100-word sample above it outputs the following, showing that the word 'the' is used 9 times.
Counter({'the': 9, 'of': 4, 'a': 4, 'anarchism': 3, 'used': 3, 'is': 3, 'as': 3, 'and': 2, 'are': 2, 'revolution': 2, 'to': 2, 'term': 2, 'that': 2, 'any': 1, 'abolished': 1, 'defined': 1, 'abuse': 1, 'organization': 1, 'describe': 1, 'violent': 1, 'pejorative': 1, 'archons': 1, 'belief': 1, 'including': 1, 'up': 1, 'without': 1, 'in': 1, 'from': 1, 'has': 1, 'self': 1, 'should': 1, 'although': 1, 'be': 1, 'originated': 1, 'anarchists': 1, 'derived': 1, 'it': 1, 'taken': 1, 'positive': 1, 'still': 1, 'there': 1, 'destroy': 1, 'political': 1, 'working': 1, 'unnecessary': 1, 'act': 1, 'society': 1, 'differing': 1, 'word': 1, 'class': 1, 'french': 1, 'culottes': 1, 'english': 1, 'by': 1, 'against': 1, 'king': 1, 'rulers': 1, 'been': 1, 'early': 1, 'label': 1, 'also': 1, 'whilst': 1, 'radicals': 1, 'greek': 1, 'diggers': 1, 'ruler': 1, 'sans': 1, 'means': 1, 'philosophy': 1, 'way': 1, 'first': 1, 'chief': 1})
Also builds up a dict called "dictionary" that maps each word to its rank, starting from 1 for the most frequent word (index 0 is reserved for 'UNK'). For example
"the"- 1
"of" - 2
"a" - 3
Next it also creates a list called "data" that stores, for each word in the text, its index in the dictionary above (any word outside the vocabulary maps to 0, the index of 'UNK').
Sample output for reversed_dictionary - this maps each index back to its word, with indices ordered by highest frequency of usage.
{0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'one', 5: 'in', 6: 'a', 7: 'to', 8: 'zero', 9: 'nine', 10: 'two', 11: 'is', 12: 'as', 13: 'eight', 14: 'for', 15: 's', 16: 'five', 17: 'three', 18: 'was', 19: 'by', 20: 'that', 21: 'four', 22: 'six', 23: 'seven', 24: 'with', 25: 'on', 26: 'are', 27: 'it', 28: 'from', 29: 'or', 30: 'his', 31: 'an', 32: 'be', 33: 'this', 34: 'which', 35: 'at', 36: 'he', 37: 'also'}
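A sketch of build_dataset close to the tutorial's version, tying the pieces above together:

import collections

def build_dataset(words, n_words):
  """Process raw words into (data, count, dictionary, reversed_dictionary)."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = {}
  for word, _ in count:
    dictionary[word] = len(dictionary)  # rank by frequency; 'UNK' gets 0
  data = []
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary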
generate_batch()
generate_batch() produces the skip-gram training pairs: each entry of batch holds the index of a center word, and the matching row of labels holds the index of one of its context words, picked from within skip_window positions on either side. A sketch close to the tutorial's version is below :-
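The toy data indices here are illustrative stand-ins for the real corpus:

import collections
import random
import numpy as np

data = [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156]  # word indices from build_dataset (illustrative)
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window          # start at the center word
    targets_to_avoid = [skip_window]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]   # center word
      labels[i * num_skips + j, 0] = buffer[target]    # one of its context words
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)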
References to methods used.
Common methods used in the code are :-
tf.reduce_mean
Computes the mean of elements across dimensions of a tensor, reducing those dimensions. With no axis argument it averages over all elements.
Examples :-
x = tf.constant([[1., 1.], [2., 2.]])
tf.reduce_mean(x) # 1.5
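Passing an axis argument reduces along just that dimension (same x as above):
tf.reduce_mean(x, 0) # [1.5, 1.5]
tf.reduce_mean(x, 1) # [1., 2.]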
tf.nn.nce_loss
Computes and returns the noise-contrastive estimation (NCE) loss. Instead of a full softmax over the entire vocabulary, it draws a few random "noise" words per example and trains the model to tell the true context word apart from the noise, which is far cheaper to compute.
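A minimal sketch of how the example wires up the loss (the sizes follow the tutorial's defaults; treat them as assumptions):

import math
import tensorflow as tf

vocabulary_size = 50000
embedding_size = 128
batch_size = 128
num_sampled = 64  # negative (noise) words sampled per batch

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=vocabulary_size))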
tf.train.GradientDescentOptimizer
Implements plain gradient descent with a fixed learning rate; the example uses it to minimize the NCE loss above.
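A minimal usage sketch, with loss as defined in the nce_loss sketch above (the 1.0 learning rate follows the tutorial):

optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)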
tf.nn.embedding_lookup
This function looks up items within a tensor of params. It is similar to using an index to retrieve values from an array.
When the params tensor has more than one dimension, the ids only refer to the top (first) dimension, so each id selects a whole row. Maybe it's obvious to most people, but I had to run code like the following to understand that:
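A minimal sketch of such a check (the params values are made up for illustration):

import numpy as np
import tensorflow as tf

params = tf.constant(np.arange(12).reshape(3, 4))  # shape (3, 4)
ids = tf.constant([2, 0])

with tf.Session() as sess:
  print(sess.run(tf.nn.embedding_lookup(params, ids)))
# [[ 8  9 10 11]
#  [ 0  1  2  3]]   <- each id picks a whole row (the top dimension)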
tf.truncated_normal - gets random values from a normal (Gaussian) distribution that has been truncated: both tails of the distribution are chopped off, so any value more than two standard deviations from the mean is dropped and re-drawn.
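A minimal sketch (the mean and stddev are illustrative; the printed values are random):

import tensorflow as tf

# Values come from N(0, 0.1^2); anything beyond 2 standard deviations
# from the mean is re-drawn, so both tails are chopped off.
t = tf.truncated_normal([5], mean=0.0, stddev=0.1)
with tf.Session() as sess:
  print(sess.run(t))  # e.g. [ 0.04 -0.11  0.02  0.15 -0.07 ] (random)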
tf.random_uniform
Gets random values from a uniform distribution. The shape parameter controls the form of the output; for example, a shape of [5] gives an array of 5 random values, as shown in the code below.
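A minimal sketch (the range values are illustrative; the printed values are random):

import tensorflow as tf

# The shape argument [5] yields an array of 5 values drawn
# uniformly from the half-open range [-1.0, 1.0).
u = tf.random_uniform([5], minval=-1.0, maxval=1.0)
with tf.Session() as sess:
  print(sess.run(u))  # e.g. [ 0.32 -0.87  0.11  0.94 -0.25 ] (random)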
Other references that might be helpful are:
https://deeplearning4j.org/word2vec