Embeddings with Word2vec
v = vector_representation (a.k.a. word embedding)
v(king) - v(man) + v(woman) = v(queen)
Word embeddings are like a dictionary with space. Woah.
Word2vec is a family of models for generating word embeddings. The idea was first published in 2013, in a Google paper by Mikolov et al. (source). It was a step forward in the evolution of natural language processing: Word2vec represents human words in a computer-friendly way, more so than any of its predecessors. Using clever training techniques, a shallow neural network learns to map words to points in a vector space.
Word2vec_model(word) => vectorized embedding
# similar to
dictionary(key) => value
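To make the analogy concrete, here is a minimal sketch using the gensim library (the library choice, the toy corpus, and the parameter values are my own assumptions for illustration, not something from the paper):

# A minimal sketch using gensim's Word2Vec; toy corpus and parameters are illustrative.
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens; a real corpus would be far larger.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)

# The trained model behaves like a dictionary: word in, vector out.
vector = model.wv["king"]   # a 50-dimensional numpy array
print(vector.shape)         # (50,)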
During the training phase, the network learns word "meanings" from context (the surrounding words) in the training data. The knowledge of the network is stored in the structure of the word space it constructs. A feature of its learning strategy is to put like with like: cosine similarity and Euclidean distance (both measures of how close two vectors are) end up reflecting similarity between words. This similarity is multidimensional, both semantic and syntactic. This is a powerful way to represent knowledge.
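A quick sketch of those two measures, using made-up vectors in place of learned embeddings:

import numpy as np

# Two made-up vectors standing in for learned word embeddings.
v_cat = np.array([0.9, 0.1, 0.3])
v_dog = np.array([0.8, 0.2, 0.4])

# Cosine similarity: 1.0 means the vectors point in the same direction.
cosine = np.dot(v_cat, v_dog) / (np.linalg.norm(v_cat) * np.linalg.norm(v_dog))

# Euclidean distance: 0.0 means the vectors occupy the same point.
euclidean = np.linalg.norm(v_cat - v_dog)

print(cosine, euclidean)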
It was revolutionary. The models aren't just useful for text; the same principle proves useful with any sequential data. Text just happens to be the most common data format that fits this blueprint. Word2vec opened new possibilities for NLP (natural language processing) and beyond, especially in the field of recommendation systems.
"The Word2vec model is in many ways the dawn of modern-day NLP." source
The challenge of encoding text
With computers, representing images, sound, and video is pretty straightforward. Representing text isn't. The complexity of language is vast; there is so much information in all of its messy little details. How can you define text in a way that a computer can understand? There is a legacy of methods that predate word embeddings.
Older approaches
- One-hot encoding:
- each word is a vector of all 0's with a single 1 at that word's index
- treats each word as an atomic unit
- no notion of word similarity
- Bag of words:
- ignores context and word order
- represents a document as the count of occurrences of each word from the set of unique words (the vocabulary)
- some notion of similarity: documents with similar bags of words are close in Euclidean space
- Bag of N-grams:
- instead of a bag of single words, it's a bag of word chunks
- the vocabulary is made of chunks of N consecutive words
- some notion of context: captures word order and association
There are some flaws shared by these models (illustrated in the sketch after this list):
- sparse vectors (mostly filled with 0's)
- don't represent similarity between words well
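A tiny example makes these flaws visible; the three-word vocabulary and the document are assumptions for illustration only. With a realistic vocabulary of tens of thousands of words, these vectors would be almost entirely zeros.

# Illustrating sparsity and missing similarity with a toy vocabulary.
vocab = ["king", "queen", "banana"]

# One-hot encoding: each word is a vector of zeros with a single 1.
one_hot = {word: [1 if i == j else 0 for j in range(len(vocab))]
           for i, word in enumerate(vocab)}
print(one_hot["king"])    # [1, 0, 0]
print(one_hot["queen"])   # [0, 1, 0]  -- no more similar to "king" than to "banana"

# Bag of words: count each vocabulary word's occurrences in a document.
document = ["king", "queen", "king"]
bag = [document.count(word) for word in vocab]
print(bag)                # [2, 1, 0]  -- word order and context are lost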
The solution
The Word2vec paper proposes two new models. Both are shallow feed-forward neural networks: a single projection layer sits between input and output. Their goal is to take a corpus of text and output word embeddings. The models are based on distributional similarity and the distributional hypothesis (a small training sketch follows the list below):
A word can be understood from the context it appears in.
Words that occur in similar contexts must have similar meanings.
- Continuous bag of words model (CBOW):
- let the position of the target word be t
- given the words at [t-2, t-1, t+1, t+2], predict the word that belongs at t
- learns to predict the center word based on the surrounding words
- Continuous skip-gram model:
- the inverse of CBOW
- let the position of the target word be t
- given the word at t, predict the words at [t-2, t-1, t+1, t+2]
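Here is the promised sketch. In gensim (an assumption on my part; any Word2vec implementation works the same way), the two architectures are selected with the sg flag. The corpus and parameters are toy values, not a tuned setup.

from gensim.models import Word2Vec

# Tokenized toy corpus, reusing the style of the earlier sketch.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

# sg=0 -> CBOW: predict the center word from its context window.
cbow_model = Word2Vec(corpus, sg=0, vector_size=50, window=2, min_count=1)

# sg=1 -> skip-gram: predict the context window from the center word.
skipgram_model = Word2Vec(corpus, sg=1, vector_size=50, window=2, min_count=1)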
Results
The coolest thing about word embeddings is that relationships between words are modeled in meaningful ways. The Word2vec paper gives the example:
v = vector_representation
v(king) - v(man) + v(woman) = v(queen)
This is incredible. Words can be represented meaningfully to a computer! This approach solves the problems the older methods had, and accordingly, Word2vec produces better results than anything before it. Not only that, but these models are computationally cheap, meaning they can be trained on massive datasets in a relatively short time. Since 2013, other models have been built to generate word embeddings. Ultimately, what I think was groundbreaking about Word2vec isn't the technical implementation; it's that this is when word embeddings were born.
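The analogy can be queried directly from a trained model. The sketch below assumes a gensim model trained on a corpus large enough to contain these words (the toy corpora above would not cut it); most_similar performs the vector arithmetic and returns the nearest neighbors by cosine similarity.

# Assumes `model` is a gensim Word2Vec model trained on a sufficiently large corpus.
# most_similar adds the "positive" vectors, subtracts the "negative" ones,
# and returns the nearest words by cosine similarity.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [("queen", 0.7...)] on a well-trained model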
An interesting aside...
There's a memorization technique called the "Method of loci", better known today as a "Memory Palace". Practiced in ancient Greece and Rome, the technique has you associate pieces of knowledge with specific locations in a visualized space, your memory palace. Knowledge gets tied to location, just like the intuition behind word vectors! Pretty cool stuff.