Commit 47e5d8f

Text Summarization using Transformer
1 parent 97e5803 commit 47e5d8f

14 files changed: +4102 −0 lines changed

Text Summarization using Transformer/III_Text_summarizer.ipynb

+2,959 lines (large diff not rendered by default)
@@ -0,0 +1,309 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Transformer Decoder\n",
"\n",
"In this notebook, you'll explore the transformer decoder and how to implement it with Trax.\n",
"\n",
"## Background\n",
"\n",
"In the last lecture notebook, you saw how to translate the mathematics of attention into NumPy code. Here, you'll see how multi-head causal attention fits into a GPT-2 transformer decoder, and how to build one with Trax layers. In the assignment notebook, you'll implement causal attention from scratch, but here, you'll use the handy-dandy `tl.CausalAttention()` layer.\n",
"\n",
"The schematic below illustrates the components and flow of a transformer decoder. Note that while the algorithm diagram flows from the bottom to the top, the overview and the Trax layer code that follows are written top-down.\n",
"\n",
"<img src=\"transformer_decoder_lnb_figs/C4_W2_L6_transformer-decoder_S01_transformer-decoder.png\" width=\"1000\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"\n",
"import time\n",
"import numpy as np\n",
"import gin\n",
"\n",
"import textwrap\n",
"wrapper = textwrap.TextWrapper(width=70)\n",
"\n",
"import trax\n",
"from trax import layers as tl\n",
"from trax.fastmath import numpy as jnp\n",
"\n",
"# to print the entire np array\n",
"np.set_printoptions(threshold=sys.maxsize)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentence gets embedded, add positional encoding\n",
"Embed the words, then create vectors representing each word's position in each sentence $\\in \\{ 0, 1, 2, \\ldots , K\\}$ = `range(max_len)`, where `max_len` = $K+1$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def PositionalEncoder(vocab_size, d_model, dropout, max_len, mode):\n",
"    \"\"\"Returns a list of layers that:\n",
"    1. takes a block of text as input,\n",
"    2. embeds the words in that text, and\n",
"    3. adds positional encoding,\n",
"       i.e. associates a number in range(max_len) with\n",
"       each word in each sentence of embedded input text\n",
"\n",
"    The input is a list of tokenized blocks of text.\n",
"\n",
"    Args:\n",
"        vocab_size (int): vocab size.\n",
"        d_model (int): depth of embedding.\n",
"        dropout (float): dropout rate (how much to drop out).\n",
"        max_len (int): maximum symbol length for positional encoding.\n",
"        mode (str): 'train' or 'eval'.\n",
"    \"\"\"\n",
"    # Embedding inputs and positional encoder\n",
"    return [\n",
"        # Add embedding layer of dimension (vocab_size, d_model)\n",
"        tl.Embedding(vocab_size, d_model),\n",
"        # Use dropout with rate and mode specified\n",
"        tl.Dropout(rate=dropout, mode=mode),\n",
"        # Add positional encoding layer with maximum input length and mode specified\n",
"        tl.PositionalEncoding(max_len=max_len, mode=mode)]"
]
},
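{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A quick sanity check (added sketch, not part of the original lecture flow):* wrap the layer list returned by `PositionalEncoder` in `tl.Serial`, initialize it on a toy batch of token ids, and confirm the output shape is `(batch, seq_len, d_model)`. All sizes below are arbitrary toy values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: exercise PositionalEncoder on a toy batch (toy sizes; 'eval' mode so dropout is a no-op)\n",
"pos_enc = tl.Serial(PositionalEncoder(vocab_size=100, d_model=16, dropout=0.1, max_len=32, mode='eval'))\n",
"toy_tokens = np.arange(8, dtype=np.int32).reshape(2, 4)  # (batch=2, seq_len=4) token ids\n",
"pos_enc.init(trax.shapes.signature(toy_tokens))\n",
"print(pos_enc(toy_tokens).shape)  # expect (2, 4, 16)"
]
},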
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-head causal attention\n",
"\n",
"The layers and array dimensions involved in multi-head causal attention (which looks at previous words in the input text) are summarized in the figure below:\n",
"\n",
"<img src=\"transformer_decoder_lnb_figs/C4_W2_L5_multi-head-attention_S05_multi-head-attention-concatenation_stripped.png\" width=\"1000\"/>\n",
"\n",
"`tl.CausalAttention()` does all of this for you! You might be wondering, though, whether you need to pass in your input text 3 times, since for causal attention, the queries Q, keys K, and values V all come from the same source. Fortunately, `tl.CausalAttention()` handles this as well by making use of the [`tl.Branch()`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#module-trax.layers.combinators) combinator layer. In general, each branch within a `tl.Branch()` layer performs parallel operations on copies of the layer's inputs. For causal attention, each branch (representing Q, K, and V) applies a linear transformation (i.e. a dense layer without a subsequent activation) to its copy of the input, then splits that result into heads. You can see the syntax for this in the screenshot from the `trax.layers.attention.py` [source code](https://github.com/google/trax/blob/master/trax/layers/attention.py) below:\n",
"\n",
"<img src=\"transformer_decoder_lnb_figs/use-of-tl-Branch-in-tl-CausalAttention.png\" width=\"500\"/>"
]
},
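{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sketch (not the actual `tl.CausalAttention()` implementation):* the masking idea is easier to see in plain NumPy. Below is single-head scaled dot-product attention with a lower-triangular mask, so position $i$ can only attend to positions $0, \\ldots, i$. Array sizes are arbitrary, and the multi-head splitting and the Q/K/V dense layers are omitted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: single-head causal (masked) scaled dot-product attention in plain NumPy\n",
"def causal_attention_sketch(q, k, v):\n",
"    # q, k, v: (seq_len, d_head); in self-attention all three come from the same input\n",
"    seq_len, d_head = q.shape\n",
"    scores = q @ k.T / np.sqrt(d_head)            # (seq_len, seq_len) similarity scores\n",
"    mask = np.tril(np.ones((seq_len, seq_len)))   # 1s at and below the diagonal\n",
"    scores = np.where(mask == 1, scores, -1e9)    # block attention to future positions\n",
"    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))\n",
"    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax\n",
"    return weights @ v                            # (seq_len, d_head) attended values\n",
"\n",
"x = np.random.rand(5, 4)                          # toy 'embeddings': seq_len=5, d_head=4\n",
"print(causal_attention_sketch(x, x, x).shape)     # (5, 4)"
]
},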
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feed-forward layer\n",
"* Typically ends with a ReLU activation, but we'll leave open the possibility of a different activation\n",
"* Most of the parameters are here"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def FeedForward(d_model, d_ff, dropout, mode, ff_activation):\n",
"    \"\"\"Returns a list of layers that implements a feed-forward block.\n",
"\n",
"    The input is an activation tensor.\n",
"\n",
"    Args:\n",
"        d_model (int): depth of embedding.\n",
"        d_ff (int): depth of feed-forward layer.\n",
"        dropout (float): dropout rate (how much to drop out).\n",
"        mode (str): 'train' or 'eval'.\n",
"        ff_activation (function): the non-linearity in feed-forward layer.\n",
"\n",
"    Returns:\n",
"        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.\n",
"    \"\"\"\n",
"\n",
"    # Create feed-forward block (list) with two dense layers with dropout and input normalized\n",
"    return [\n",
"        # Normalize layer inputs\n",
"        tl.LayerNorm(),\n",
"        # Add first feed-forward (dense) layer (don't forget to set the correct value for n_units)\n",
"        tl.Dense(d_ff),\n",
"        # Add activation function passed in as a parameter (you need to call it!)\n",
"        ff_activation(),  # generally ReLU\n",
"        # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)\n",
"        tl.Dropout(rate=dropout, mode=mode),\n",
"        # Add second feed-forward layer (don't forget to set the correct value for n_units)\n",
"        tl.Dense(d_model),\n",
"        # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)\n",
"        tl.Dropout(rate=dropout, mode=mode)\n",
"    ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decoder block\n",
"Here, we return a list containing two residual blocks. The first wraps around the causal attention layer, whose inputs are normalized and which applies dropout regularization internally. The second wraps around the feed-forward block. You may notice that the second call to `tl.Residual()` doesn't include a normalization layer before the feed-forward block. That's because the normalization layer is already included inside the feed-forward block."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def DecoderBlock(d_model, d_ff, n_heads,\n",
"                 dropout, mode, ff_activation):\n",
"    \"\"\"Returns a list of layers that implements a Transformer decoder block.\n",
"\n",
"    The input is an activation tensor.\n",
"\n",
"    Args:\n",
"        d_model (int): depth of embedding.\n",
"        d_ff (int): depth of feed-forward layer.\n",
"        n_heads (int): number of attention heads.\n",
"        dropout (float): dropout rate (how much to drop out).\n",
"        mode (str): 'train' or 'eval'.\n",
"        ff_activation (function): the non-linearity in feed-forward layer.\n",
"\n",
"    Returns:\n",
"        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.\n",
"    \"\"\"\n",
"\n",
"    # Add list of two Residual blocks: the attention with normalization and dropout, and the feed-forward block\n",
"    return [\n",
"        tl.Residual(\n",
"            # Normalize layer input\n",
"            tl.LayerNorm(),\n",
"            # Add causal attention (d_model is the feature depth of Q, K, and V)\n",
"            tl.CausalAttention(d_model, n_heads=n_heads, dropout=dropout, mode=mode)\n",
"        ),\n",
"        tl.Residual(\n",
"            # Add feed-forward block\n",
"            # We don't need to normalize the layer inputs here. The feed-forward block takes care of that for us.\n",
"            FeedForward(d_model, d_ff, dropout, mode, ff_activation)\n",
"        ),\n",
"    ]"
]
},
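{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sanity check (a sketch with arbitrary toy sizes, not part of the original lecture):* because both halves of the block are residual connections, a single decoder block maps activations of shape `(batch, seq_len, d_model)` to the same shape. The check below uses `mode='eval'` so dropout is inactive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: run one decoder block on dummy activations (toy sizes; d_model must be divisible by n_heads)\n",
"block = tl.Serial(DecoderBlock(d_model=16, d_ff=32, n_heads=2,\n",
"                               dropout=0.1, mode='eval', ff_activation=tl.Relu))\n",
"dummy_activations = np.random.rand(2, 6, 16).astype(np.float32)  # (batch, seq_len, d_model)\n",
"block.init(trax.shapes.signature(dummy_activations))\n",
"print(block(dummy_activations).shape)  # residual blocks preserve shape: (2, 6, 16)"
]
},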
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The transformer decoder: putting it all together\n",
"## A.k.a. repeat N times, dense layer and softmax for output"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def TransformerLM(vocab_size=33300,\n",
"                  d_model=512,\n",
"                  d_ff=2048,\n",
"                  n_layers=6,\n",
"                  n_heads=8,\n",
"                  dropout=0.1,\n",
"                  max_len=4096,\n",
"                  mode='train',\n",
"                  ff_activation=tl.Relu):\n",
"    \"\"\"Returns a Transformer language model.\n",
"\n",
"    The input to the model is a tensor of tokens. (This model uses only the\n",
"    decoder part of the overall Transformer.)\n",
"\n",
"    Args:\n",
"        vocab_size (int): vocab size.\n",
"        d_model (int): depth of embedding.\n",
"        d_ff (int): depth of feed-forward layer.\n",
"        n_layers (int): number of decoder layers.\n",
"        n_heads (int): number of attention heads.\n",
"        dropout (float): dropout rate (how much to drop out).\n",
"        max_len (int): maximum symbol length for positional encoding.\n",
"        mode (str): 'train', 'eval' or 'predict'; predict mode is for fast inference.\n",
"        ff_activation (function): the non-linearity in feed-forward layer.\n",
"\n",
"    Returns:\n",
"        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens\n",
"        to activations over a vocab set.\n",
"    \"\"\"\n",
"\n",
"    # Create stack (list) of n_layers decoder blocks with the necessary parameters\n",
"    decoder_blocks = [\n",
"        DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation) for _ in range(n_layers)]\n",
"\n",
"    # Create the complete model as written in the figure\n",
"    return tl.Serial(\n",
"        # Use teacher forcing (feed output of previous step to current step)\n",
"        tl.ShiftRight(mode=mode),\n",
"        # Add embedding inputs and positional encoder\n",
"        PositionalEncoder(vocab_size, d_model, dropout, max_len, mode),\n",
"        # Add decoder blocks\n",
"        decoder_blocks,\n",
"        # Normalize layer\n",
"        tl.LayerNorm(),\n",
"\n",
"        # Add dense layer of vocab_size (since we need to select a word from the vocabulary)\n",
"        # (a.k.a. the logits layer; LogSoftmax below turns the logits into log-probabilities)\n",
"        tl.Dense(vocab_size),\n",
"        # Get log-probabilities with LogSoftmax\n",
"        tl.LogSoftmax()\n",
"    )"
]
},
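{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sketch (not part of the original lecture):* instantiating a small model and printing it is a cheap way to confirm the layer stack matches the figure: `ShiftRight`, embedding plus positional encoding, the decoder blocks, `LayerNorm`, `Dense`, and `LogSoftmax`. The hyperparameters below are toy values, not the defaults used for training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: build a tiny TransformerLM and inspect its structure (toy hyperparameters)\n",
"small_lm = TransformerLM(vocab_size=100, d_model=16, d_ff=32, n_layers=2,\n",
"                         n_heads=2, dropout=0.1, max_len=64, mode='eval')\n",
"print(small_lm)  # prints the Serial combinator with the nested layers listed in order"
]
},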
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concluding remarks\n",
"\n",
"In this week's assignment, you'll see how to train a transformer decoder on the [cnn_dailymail](https://www.tensorflow.org/datasets/catalog/cnn_dailymail) dataset, available from TensorFlow Datasets (`tfds`). Because training such a model from scratch is time-intensive, you'll use a pre-trained model to summarize documents later in the assignment. Due to time and storage concerns, we will also not train the decoder on a different summarization dataset in this lab. If you have the time and space, we encourage you to explore the other [summarization](https://www.tensorflow.org/datasets/catalog/overview#summarization) datasets at TensorFlow Datasets. Which of them might suit your purposes better than the `cnn_dailymail` dataset? Where else can you find datasets for text summarization models?"
]
},
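{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sketch:* one way you might stream `cnn_dailymail` through `trax.data` is shown below. This is an assumption-laden example, not the assignment's data pipeline: the feature keys `('article', 'highlights')` follow the TFDS catalog entry, and you may need to set `data_dir` or pre-download the dataset depending on your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (assumptions noted above): stream (article, highlights) pairs from cnn_dailymail\n",
"train_stream_fn = trax.data.TFDS('cnn_dailymail',\n",
"                                 keys=('article', 'highlights'),  # (input document, reference summary)\n",
"                                 train=True)\n",
"article, highlights = next(train_stream_fn())\n",
"print(wrapper.fill(str(article)[:300]))"
]
}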
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
