Pretraining GloVe
In this section, we will train a GloVe model.
First, import the packages and modules required for the experiment.
from collections import defaultdict
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx, cpu
from mxnet.gluon import nn
import random
Preprocessing Dataset
We will train the GloVe model on the PTB dataset.
First, we read the PTB dataset, build a vocabulary from its words, and map each token to an index to construct the corpus.
sentences = d2l.read_ptb()
vocab = d2l.Vocab(sentences, min_freq=10)
corpus = [vocab[line] for line in sentences]
Construct Cooccurrence Counts
Let the word-word cooccurrence counts be denoted by \(X\), whose entries \(x_{ij}\) tabulate the number of times word \(j\) occurs in the context of word \(i\).
Next, we define the following function to extract all the central target words and their context words. It uses a decreasing weighting function, so that word pairs that are \(d\) words apart contribute \(1/d\) to the total count. This is one way to account for the fact that very distant word pairs are expected to contain less relevant information about the words' relationship to one another.
def get_coocurrence_counts(corpus, window_size):
    cooccurence_counts = defaultdict(float)
    for line in corpus:
        # Each sentence needs at least 2 words to form a
        # "central target word - context word" pair
        if len(line) < 2:
            continue
        for i in range(len(line)):  # Context window centered at i
            left_indices = list(range(max(0, i - window_size), i))
            right_indices = list(range(i + 1,
                                       min(len(line), i + 1 + window_size)))
            left_context = [line[idx] for idx in left_indices]
            right_context = [line[idx] for idx in right_indices]
            # Reverse the left context so that distance grows away from i
            for distance, word in enumerate(left_context[::-1]):
                cooccurence_counts[line[i], word] += 1 / (distance + 1)
            for distance, word in enumerate(right_context):
                cooccurence_counts[line[i], word] += 1 / (distance + 1)
    # Flatten the dictionary into (center, context, count) triples
    cooccurence_counts = [(word[0], word[1], count)
                          for word, count in cooccurence_counts.items()]
    return cooccurence_counts
We create an artificial dataset containing two sentences of 5 and 2 words, respectively. Assume the maximum context window is 4. Then, we print the cooccurrence counts of all the central target words and context words.
tiny_dataset = [list(range(5)), list(range(5, 7))]
print('dataset', tiny_dataset)
for center, context, coocurrence in get_coocurrence_counts(tiny_dataset, 4):
    print('center: %s, context: %s, coocurrence: %.2f' %
          (center, context, coocurrence))
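As a cross-check on the \(1/d\) weighting, the following dependency-free sketch recomputes the counts for a single five-word sentence. The helper name `harmonic_counts` is hypothetical (it is not part of d2l), and it assumes the same symmetric window as the function above.

```python
from collections import defaultdict


def harmonic_counts(sentence, window_size):
    """Accumulate 1/d for every (center, context) pair inside the window."""
    counts = defaultdict(float)
    for i, center in enumerate(sentence):
        lo = max(0, i - window_size)
        hi = min(len(sentence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                # Words that are d = |i - j| positions apart contribute 1/d
                counts[center, sentence[j]] += 1 / abs(i - j)
    return dict(counts)


counts = harmonic_counts([0, 1, 2, 3, 4], window_size=4)
print(counts[0, 1], counts[0, 4], counts[4, 0])
```

Adjacent words contribute 1, while the two words at opposite ends of the sentence (four positions apart) contribute only 0.25 in each direction.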
We set the maximum context window size to 5. The following extracts all the central target words and their context words in the dataset and calculates their cooccurrence counts.
coocurrence_matrix = get_coocurrence_counts(corpus, 5)
'# center-context pairs: %d' % len(coocurrence_matrix)
Putting All Things Together
Last, we define the load_data_ptb_glove function, which reads the PTB dataset and returns the data loader together with the vocabulary.
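def load_data_ptb_glove(batch_size, window_size):
    num_workers = d2l.get_dataloader_workers()
    sentences = d2l.read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=5)
    corpus = [vocab[line] for line in sentences]
    coocurrence_matrix = get_coocurrence_counts(corpus, window_size)
    # Wrap the (center, context, count) triples in a dataset
    # (gluon.data.ArrayDataset is assumed here)
    dataset = gluon.data.ArrayDataset(coocurrence_matrix)
    data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True,
                                      num_workers=num_workers)
    return data_iter, vocab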
batch_size, window_size = 1024, 10
data_iter, vocab = load_data_ptb_glove(batch_size, window_size)
Let’s print the first minibatch of the data iterator.
names = ['center', 'context', 'Cooccurence']
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break  # Only inspect the first minibatch
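Conceptually, a data loader over (center, context, count) triples shuffles the triples and regroups each minibatch column-wise. The pure-Python sketch below illustrates that idea under simplifying assumptions: `iter_batches` is a hypothetical helper, not the Gluon API, and it yields tuples rather than tensors.

```python
import random


def iter_batches(triples, batch_size, shuffle=True, seed=0):
    """Yield (centers, contexts, counts) columns of up to batch_size rows."""
    order = list(range(len(triples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        rows = [triples[k] for k in order[start:start + batch_size]]
        # Transpose the list of rows into three parallel columns
        centers, contexts, counts = zip(*rows)
        yield centers, contexts, counts


toy_triples = [(0, 1, 1.0), (1, 0, 1.0), (0, 2, 0.5), (2, 0, 0.5), (1, 2, 1.0)]
batches = list(iter_batches(toy_triples, batch_size=2))
for centers, contexts, counts in batches:
    print(len(centers), len(contexts), len(counts))
```

With five triples and a batch size of 2, the last minibatch holds a single row, which is why every column in a batch shares the same leading dimension.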
The GloVe Model
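In section 15.1, we introduced the goal of GloVe, which is to minimize the loss function

\[\sum_{i\in\mathcal{V}} \sum_{j\in\mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log\,x_{ij}\right)^2,\]

where \(\mathbf{v}_i\) is the central target word vector of word \(i\), \(\mathbf{u}_j\) is the context word vector of word \(j\), \(b_i\) and \(c_j\) are scalar bias terms, and \(h\) is a weighting function.
We will implement the GloVe model by implementing each part of this loss function.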
Weight Function
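GloVe introduced a weighting function \(h(x_{ij})\) into the loss function: \(h(x) = (x/x_{max})^\alpha\) if \(x < x_{max}\), and \(h(x) = 1\) otherwise.
We implement the weighting function \(h(x_{ij})\). Since \(x_{ij} < x_{max}\) is equivalent to \((x_{ij}/x_{max})^\alpha < 1\) for \(\alpha > 0\), we can compute the power and clip it at 1, which gives the following implementation.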
def compute_weight(x, x_max=30, alpha=0.75):
    w = (x / x_max) ** alpha
    return np.minimum(w, 1)
The following prints the weights of the first five cooccurrence counts of the artificial dataset when \(x_{max}\) is set to 2 and \(\alpha\) to 0.75.
for center, context, coocurrence in get_coocurrence_counts(tiny_dataset, 4)[:5]:
    print('center: %s, context: %s, coocurrence: %.2f, weight: %.2f' %
          (center, context, coocurrence,
           compute_weight(coocurrence, x_max=2, alpha=0.75)))
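To see how the clipping behaves without any array library, here is a scalar sketch of the same weighting function. The name `glove_weight` is hypothetical, and the inputs are arbitrary illustrative counts.

```python
def glove_weight(x, x_max=30.0, alpha=0.75):
    """Scalar version of the clipped power-law weight h(x)."""
    return min((x / x_max) ** alpha, 1.0)


# With x_max = 2, counts below 2 are down-weighted and the rest saturate at 1
for x in (0.5, 1.0, 2.0, 10.0):
    print(x, round(glove_weight(x, x_max=2.0), 4))
```

The weight grows monotonically with the count and is capped at 1, so very frequent pairs cannot dominate the loss, while rare pairs contribute proportionally less.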
Bias Term
GloVe has two scalar model parameters for each word \(w_i\): the bias terms \(b_i\) (for central target words) and \(c_i\) (for context words). The bias terms can be implemented with an embedding layer. The weight of this embedding layer is a matrix whose number of rows is the dictionary size (input_dim) and whose number of columns is one.
We set the dictionary size to 20.
embed_bias = nn.Embedding(input_dim=20, output_dim=1)
The input of the embedding layer is the index of the word. When we enter the index \(i\) of a word, the embedding layer returns the \(i\) th row of the weight value as its bias term.
embed_bias.initialize()
x = np.array([1, 2, 3])
embed_bias(x)
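The lookup performed by such a layer amounts to selecting rows of a table. The following sketch mimics it with plain Python lists; the table values are arbitrary random numbers and `lookup` is a hypothetical helper, so this only illustrates the indexing, not the Gluon implementation.

```python
import random

rng = random.Random(0)
vocab_size = 20
# One scalar bias per word, stored as a (vocab_size x 1) table
bias_table = [[rng.uniform(-0.1, 0.1)] for _ in range(vocab_size)]


def lookup(table, indices):
    """Return the rows of `table` selected by `indices`."""
    return [table[i] for i in indices]


biases = lookup(bias_table, [1, 2, 3])
print(biases)
```

Each index selects exactly one row, so three indices yield three one-element rows, matching an output of shape (3, 1).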
GloVe Model Forward Calculation
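In the forward calculation, the input of the GloVe model contains the central target word index center and the context word index context. The center variable has the shape (batch size,), and the context variable has the same shape (batch size,). These two variables are first transformed from word indexes to word vectors by the word embedding layers. The loss of each pair is then computed from the inner product of the two vectors, the two bias terms, and the logarithm of the cooccurrence count.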
def GloVe(center, context, coocurrence, embed_v, embed_u,
          bias_v, bias_u, x_max, alpha):
    # Shape of v: (batch_size, embed_size)
    v = embed_v(center)
    # Shape of u: (batch_size, embed_size)
    u = embed_u(context)
    # Shape of b: (batch_size,)
    b = bias_v(center).squeeze()
    # Shape of c: (batch_size,)
    c = bias_u(context).squeeze()
    # Shape of embed_products: (batch_size,)
    embed_products = npx.batch_dot(np.expand_dims(v, 1),
                                   np.expand_dims(u, 2)).squeeze()
    # Shape of distance_expr: (batch_size,)
    distance_expr = np.power(embed_products + b +
                             c - np.log(coocurrence), 2)
    # Shape of weight: (batch_size,). Pass x_max and alpha through so that
    # the hyperparameters given to GloVe are actually used
    weight = compute_weight(coocurrence, x_max, alpha)
    return weight * distance_expr
Verify that the output shape is (batch size,).
embed_word = nn.Embedding(input_dim=20, output_dim=4)
embed_word.initialize()
GloVe(np.ones((2)), np.ones((2)), np.ones((2)), embed_word, embed_word,
      embed_bias, embed_bias, x_max=2, alpha=0.75).shape
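For a single center-context pair, the quantity returned by the forward calculation can be worked out by hand. The sketch below does so in plain Python; `glove_pair_loss` is a hypothetical name and the vectors, biases, and counts are made-up numbers chosen so that the arithmetic is easy to follow.

```python
import math


def glove_pair_loss(v, u, b, c, x, x_max=100.0, alpha=0.75):
    """Weighted squared error h(x) * (u.v + b + c - log x)^2 for one pair."""
    dot = sum(vi * ui for vi, ui in zip(v, u))
    weight = min((x / x_max) ** alpha, 1.0)
    return weight * (dot + b + c - math.log(x)) ** 2


v, u = [1.0, 2.0], [0.5, 0.25]  # inner product u.v = 1.0

# Near-perfect fit: u.v + b + c = 2.0 matches log(x) = 2.0
fit = glove_pair_loss(v, u, b=0.5, c=0.5, x=math.exp(2.0))
# Poor fit: prediction 1.0 versus log(1) = 0, with the weight saturated at 1
misfit = glove_pair_loss(v, u, b=0.0, c=0.0, x=1.0, x_max=1.0)
print(fit, misfit)
```

The loss is essentially zero when the inner product plus the two biases reproduces the log count, and it grows quadratically with the mismatch otherwise.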
The GloVe function above returns the weighted squared error of each center-context pair, so it also serves as the loss function when we train the word embedding model.
Initializing Model Parameters
We construct the embedding layers of words and additional biases, and set the hyperparameter word vector dimension embed_size to 100.
embed_size = 100
net = nn.Sequential()
# net[0], net[1]: center and context word vectors;
# net[2], net[3]: center and context bias terms
net.add(nn.Embedding(input_dim=len(vocab), output_dim=embed_size),
        nn.Embedding(input_dim=len(vocab), output_dim=embed_size),
        nn.Embedding(input_dim=len(vocab), output_dim=1),
        nn.Embedding(input_dim=len(vocab), output_dim=1))
The training function is defined below.
def train(net, data_iter, lr, num_epochs, x_max, alpha, ctx=d2l.try_gpu()):
    net.initialize(ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'AdaGrad',
                            {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # loss_sum, num_tokens
        for i, batch in enumerate(data_iter):
            center, context, coocurrence = [
                data.as_in_context(ctx) for data in batch]
            with autograd.record():
                l = GloVe(center, context, coocurrence.astype('float32'),
                          net[0], net[1], net[2], net[3], x_max, alpha)
            # Backpropagate and update the parameters; without these two
            # calls the model would never be trained
            l.backward()
            trainer.step(l.size)
            metric.add(l.sum(), l.size)
            if (i+1) % 50 == 0:
                print('loss %.3f, %d tokens/sec on %s ' % (
                    metric[0]/metric[1], metric[1]/timer.stop(), ctx))
Now, we can train a GloVe model.
lr, num_epochs = 0.1, 5
x_max, alpha = 100, 0.75
train(net, data_iter, lr, num_epochs, x_max, alpha)
Applying the GloVe Model
The GloVe model generates two sets of word vectors, embed_v and embed_u. embed_v and embed_u are equivalent and differ only as a result of their random initializations; the two sets of vectors should perform equivalently. Generally, we choose to use the sum embed_v + embed_u as our word vectors.
After training the GloVe model, we can still represent similarity in meaning between words based on the cosine similarity of two word vectors.
def get_similar_tokens(query_token, k, embed_v, embed_u):
    # Use the sum of the two sets of vectors as the word vectors
    W = embed_v.weight.data() + embed_u.weight.data()
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = np.dot(W, x) / np.sqrt(np.sum(W * W, axis=1) * np.sum(x * x) + 1e-9)
    topk = npx.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print('cosine sim=%.3f: %s' % (cos[i], (vocab.idx_to_token[i])))
get_similar_tokens('chip', 3, net[0], net[1])
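The ranking step can be illustrated on a toy vocabulary without a trained model. In the sketch below, the two-dimensional vectors are made up purely for illustration and `most_similar` is a hypothetical helper; it sums the two vector sets per token and ranks the other tokens by cosine similarity.

```python
import math


def cosine(a, b, eps=1e-9):
    """Cosine similarity with a small constant for numerical stability."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b) + eps)


def most_similar(query, k, center_vecs, context_vecs):
    """Rank tokens by cosine similarity of the summed word vectors."""
    W = {tok: [p + q for p, q in zip(center_vecs[tok], context_vecs[tok])]
         for tok in center_vecs}
    scores = [(cosine(W[query], vec), tok)
              for tok, vec in W.items() if tok != query]
    return sorted(scores, reverse=True)[:k]


# Made-up vectors: 'chip' and 'intel' point in similar directions
center_vecs = {'chip': [1.0, 0.0], 'intel': [0.9, 0.1], 'bank': [0.0, 1.0]}
context_vecs = {'chip': [0.8, 0.2], 'intel': [1.0, 0.0], 'bank': [0.1, 0.9]}
ranking = most_similar('chip', 2, center_vecs, context_vecs)
print(ranking)
```

Tokens whose summed vectors point in nearly the same direction as the query rank first, regardless of vector length, which is the property cosine similarity is chosen for.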
- We can pretrain a GloVe model.
