
Commit f214c37

Compute Word Embeddings for sentiment analysis
1 parent 940b3cd commit f214c37

51 files changed, +2,346,096 -0 lines changed

Compute_Word_Embeddings_for_sentiment_analysis/III_Training_the_CBOW_model.ipynb (+1,239 lines; large diff not rendered by default)

Compute_Word_Embeddings_for_sentiment_analysis/II_continuous_bag-of-words_model_architecture.ipynb (+696 lines; large diff not rendered by default)
@@ -0,0 +1,382 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word Embeddings: Hands On\n",
"\n",
"In previous lecture notebooks you saw all the steps needed to train the CBOW model. This notebook will walk you through how to extract the word embedding vectors from a model.\n",
"\n",
"Let's dive into it!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from utils2 import get_dict"
]
},
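{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note: `get_dict` comes from the course-provided `utils2` module, which is not shown in this notebook. If you do not have `utils2`, the next cell is a minimal stand-in sketch. It assumes indices are assigned to the unique words in alphabetical order, which matches the word order printed later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional stand-in for utils2.get_dict (assumption: indices follow alphabetical order)\n",
"def get_dict_standin(corpus_words):\n",
"    # Collect the unique words and sort them alphabetically\n",
"    vocabulary = sorted(set(corpus_words))\n",
"    # Map each word to an index, and build the reverse mapping\n",
"    word2Ind = {word: index for index, word in enumerate(vocabulary)}\n",
"    Ind2word = {index: word for index, word in enumerate(vocabulary)}\n",
"    return word2Ind, Ind2word"
]
},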
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, you will be provided with some variables needed for the next steps, which should be familiar by now. A trained CBOW model will also be simulated: the corresponding weights and biases are provided below."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Define the tokenized version of the corpus\n",
"words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']\n",
"\n",
"# Define V. Remember this is the size of the vocabulary\n",
"V = 5\n",
"\n",
"# Get 'word2Ind' and 'Ind2word' dictionaries for the tokenized corpus\n",
"word2Ind, Ind2word = get_dict(words)\n",
"\n",
"\n",
"# Define first matrix of weights\n",
"W1 = np.array([[ 0.41687358, 0.08854191, -0.23495225, 0.28320538, 0.41800106],\n",
" [ 0.32735501, 0.22795148, -0.23951958, 0.4117634 , -0.23924344],\n",
" [ 0.26637602, -0.23846886, -0.37770863, -0.11399446, 0.34008124]])\n",
"\n",
"# Define second matrix of weights\n",
"W2 = np.array([[-0.22182064, -0.43008631, 0.13310965],\n",
" [ 0.08476603, 0.08123194, 0.1772054 ],\n",
" [ 0.1871551 , -0.06107263, -0.1790735 ],\n",
" [ 0.07055222, -0.02015138, 0.36107434],\n",
" [ 0.33480474, -0.39423389, -0.43959196]])\n",
"\n",
"# Define first vector of biases\n",
"b1 = np.array([[ 0.09688219],\n",
" [ 0.29239497],\n",
" [-0.27364426]])\n",
"\n",
"# Define second vector of biases\n",
"b2 = np.array([[ 0.0352008 ],\n",
" [-0.36393384],\n",
" [-0.12775555],\n",
" [-0.34802326],\n",
" [-0.07017815]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Extracting word embedding vectors\n",
"\n",
"Once you have finished training the neural network, you have three options to get word embedding vectors for the words of your vocabulary, based on the weight matrices $\mathbf{W_1}$ and/or $\mathbf{W_2}$.\n",
"\n",
"### Option 1: extract embedding vectors from $\mathbf{W_1}$\n",
"\n",
"The first option is to take the columns of $\mathbf{W_1}$ as the embedding vectors of the words of the vocabulary, using the same order of the words as for the input and output vectors.\n",
"\n",
"> Note: in this practice notebook the values of the word embedding vectors are meaningless after a single iteration with just one training example, but here's how you would proceed after the training process is complete.\n",
"\n",
"For example, $\mathbf{W_1}$ is this matrix:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.41687358, 0.08854191, -0.23495225, 0.28320538, 0.41800106],\n",
" [ 0.32735501, 0.22795148, -0.23951958, 0.4117634 , -0.23924344],\n",
" [ 0.26637602, -0.23846886, -0.37770863, -0.11399446, 0.34008124]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Print W1\n",
"W1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first column, which is a 3-element vector, is the embedding vector of the first word of your vocabulary. The second column is the word embedding vector for the second word, and so on.\n",
"\n",
"The first, second, etc. words are ordered as follows."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"am\n",
"because\n",
"happy\n",
"i\n",
"learning\n"
]
}
],
"source": [
"# Print corresponding word for each index within vocabulary's range\n",
"for i in range(V):\n",
"    print(Ind2word[i])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the word embedding vectors corresponding to each word are:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"am: [0.41687358 0.32735501 0.26637602]\n",
"because: [ 0.08854191 0.22795148 -0.23846886]\n",
"happy: [-0.23495225 -0.23951958 -0.37770863]\n",
"i: [ 0.28320538 0.4117634 -0.11399446]\n",
"learning: [ 0.41800106 -0.23924344 0.34008124]\n"
]
}
],
"source": [
"# Loop through each word of the vocabulary\n",
"for word in word2Ind:\n",
"    # Extract the column corresponding to the index of the word in the vocabulary\n",
"    word_embedding_vector = W1[:, word2Ind[word]]\n",
"    # Print word alongside word embedding vector\n",
"    print(f'{word}: {word_embedding_vector}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: extract embedding vectors from $\mathbf{W_2}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second option is to transpose $\mathbf{W_2}$ and take its columns as the word embedding vectors, just as you did for $\mathbf{W_1}$."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.22182064, 0.08476603, 0.1871551 , 0.07055222, 0.33480474],\n",
" [-0.43008631, 0.08123194, -0.06107263, -0.02015138, -0.39423389],\n",
" [ 0.13310965, 0.1772054 , -0.1790735 , 0.36107434, -0.43959196]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Print transposed W2\n",
"W2.T"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"am: [-0.22182064 -0.43008631 0.13310965]\n",
"because: [0.08476603 0.08123194 0.1772054 ]\n",
"happy: [ 0.1871551 -0.06107263 -0.1790735 ]\n",
"i: [ 0.07055222 -0.02015138 0.36107434]\n",
"learning: [ 0.33480474 -0.39423389 -0.43959196]\n"
]
}
],
"source": [
"# Loop through each word of the vocabulary\n",
"for word in word2Ind:\n",
"    # Extract the column corresponding to the index of the word in the vocabulary\n",
"    word_embedding_vector = W2.T[:, word2Ind[word]]\n",
"    # Print word alongside word embedding vector\n",
"    print(f'{word}: {word_embedding_vector}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 3: extract embedding vectors from $\mathbf{W_1}$ and $\mathbf{W_2}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The third option, which is the one you will use in this week's assignment, uses the average of $\mathbf{W_1}$ and $\mathbf{W_2^\top}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Calculate the average of $\mathbf{W_1}$ and $\mathbf{W_2^\top}$, and store the result in `W3`.**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.09752647, 0.08665397, -0.02389858, 0.1768788 , 0.3764029 ],\n",
" [-0.05136565, 0.15459171, -0.15029611, 0.19580601, -0.31673866],\n",
" [ 0.19974284, -0.03063173, -0.27839106, 0.12353994, -0.04975536]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute W3 as the average of W1 and W2 transposed\n",
"W3 = (W1+W2.T)/2\n",
"\n",
"# Print W3\n",
"W3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Expected output:\n",
"\n",
"    array([[ 0.09752647, 0.08665397, -0.02389858, 0.1768788 , 0.3764029 ],\n",
"           [-0.05136565, 0.15459171, -0.15029611, 0.19580601, -0.31673866],\n",
"           [ 0.19974284, -0.03063173, -0.27839106, 0.12353994, -0.04975536]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extracting the word embedding vectors works just like the two previous options, by taking the columns of the matrix you've just created."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"am: [ 0.09752647 -0.05136565 0.19974284]\n",
"because: [ 0.08665397 0.15459171 -0.03063173]\n",
"happy: [-0.02389858 -0.15029611 -0.27839106]\n",
"i: [0.1768788 0.19580601 0.12353994]\n",
"learning: [ 0.3764029 -0.31673866 -0.04975536]\n"
]
}
],
"source": [
"# Loop through each word of the vocabulary\n",
"for word in word2Ind:\n",
"    # Extract the column corresponding to the index of the word in the vocabulary\n",
"    word_embedding_vector = W3[:, word2Ind[word]]\n",
"    # Print word alongside word embedding vector\n",
"    print(f'{word}: {word_embedding_vector}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you know three different options for getting the word embedding vectors from a model!"
]
},
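{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional recap (not part of the original lecture notebook), the next cell sketches how the three options could be packaged into a single helper. The function name `extract_embeddings` and its `option` parameter are illustrative choices, not course-provided code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative helper combining the three extraction options (not part of the course utilities)\n",
"def extract_embeddings(W1, W2, word2Ind, option=3):\n",
"    # Option 1: columns of W1; option 2: columns of W2 transposed; option 3: their average\n",
"    if option == 1:\n",
"        W = W1\n",
"    elif option == 2:\n",
"        W = W2.T\n",
"    else:\n",
"        W = (W1 + W2.T) / 2\n",
"    # Map each word to its embedding column\n",
"    return {word: W[:, index] for word, index in word2Ind.items()}\n",
"\n",
"# Example: reproduce the option-3 vectors computed above\n",
"extract_embeddings(W1, W2, word2Ind, option=3)"
]
},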
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How this practice relates to and differs from the upcoming graded assignment\n",
"\n",
"- After extracting the word embedding vectors, you will use principal component analysis (PCA) to visualize the vectors, which will enable you to perform an intrinsic evaluation of the quality of the vectors, as explained in the lecture."
]
},
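{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of that visualization step, the next cell is a minimal sketch of projecting the embedding vectors to two dimensions with PCA and plotting them. It is illustrative only: it assumes scikit-learn and matplotlib are available and uses scikit-learn's `PCA`, whereas the graded assignment has you implement PCA yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (assumes scikit-learn and matplotlib; the assignment uses your own PCA)\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Stack the embedding vectors as rows: one row per word of the vocabulary\n",
"embeddings = np.array([W3[:, word2Ind[word]] for word in word2Ind])\n",
"\n",
"# Project the 3-dimensional vectors down to 2 dimensions\n",
"embeddings_2d = PCA(n_components=2).fit_transform(embeddings)\n",
"\n",
"# Plot each word at its 2-D coordinates\n",
"plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])\n",
"for word, (x, y) in zip(word2Ind, embeddings_2d):\n",
"    plt.annotate(word, (x, y))\n",
"plt.show()"
]
},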
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Congratulations on finishing all lecture notebooks for this week!** \n",
"\n",
"You're now ready to take on this week's assignment!\n",
"\n",
"**Keep it up!**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

0 commit comments