|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# The Transformer Decoder\n", |
| 8 | + "\n", |
| 9 | + "In this notebook, you'll explore the transformer decoder and how to implement it with Trax. \n", |
| 10 | + "\n", |
| 11 | + "## Background\n", |
| 12 | + "\n", |
| 13 | + "In the last lecture notebook, you saw how to translate the mathematics of attention into NumPy code. Here, you'll see how multi-head causal attention fits into a GPT-2 transformer decoder, and how to build one with Trax layers. In the assignment notebook, you'll implement causal attention from scratch, but here, you'll exploit the handy-dandy `tl.CausalAttention()` layer.\n", |
| 14 | + "\n", |
| 15 | + "The schematic below illustrates the components and flow of a transformer decoder. Note that while the algorithm diagram flows from the bottom to the top, the overview and the Trax layer code that follows are written top-down.\n", |
| 16 | + "\n", |
| 17 | + "<img src=\"transformer_decoder_lnb_figs/C4_W2_L6_transformer-decoder_S01_transformer-decoder.png\" width=\"1000\"/>" |
| 18 | + ] |
| 19 | + }, |
| 20 | + { |
| 21 | + "cell_type": "markdown", |
| 22 | + "metadata": {}, |
| 23 | + "source": [ |
| 24 | + "## Imports" |
| 25 | + ] |
| 26 | + }, |
| 27 | + { |
| 28 | + "cell_type": "code", |
| 29 | + "execution_count": 1, |
| 30 | + "metadata": {}, |
| 31 | + "outputs": [ |
| 32 | + { |
| 33 | + "name": "stdout", |
| 34 | + "output_type": "stream", |
| 35 | + "text": [ |
| 36 | + "INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 \n" |
| 37 | + ] |
| 38 | + } |
| 39 | + ], |
| 40 | + "source": [ |
| 41 | + "import sys\n", |
| 42 | + "import os\n", |
| 43 | + "\n", |
| 44 | + "import time\n", |
| 45 | + "import numpy as np\n", |
| 46 | + "import gin\n", |
| 47 | + "\n", |
| 48 | + "import textwrap\n", |
| 49 | + "wrapper = textwrap.TextWrapper(width=70)\n", |
| 50 | + "\n", |
| 51 | + "import trax\n", |
| 52 | + "from trax import layers as tl\n", |
| 53 | + "from trax.fastmath import numpy as jnp\n", |
| 54 | + "\n", |
| 55 | + "# to print the entire np array\n", |
| 56 | + "np.set_printoptions(threshold=sys.maxsize)" |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "markdown", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "## Embed the sentence and add positional encoding\n", |
| 64 | + "Embed the words, then create vectors representing each word's position in each sentence: each position is in $\{ 0, 1, 2, \ldots , K\} =$ `range(max_len)`, where `max_len` $= K+1$." |
| 65 | + ] |
| 66 | + }, |
| 67 | + { |
| 68 | + "cell_type": "code", |
| 69 | + "execution_count": 2, |
| 70 | + "metadata": {}, |
| 71 | + "outputs": [], |
| 72 | + "source": [ |
| 73 | + "def PositionalEncoder(vocab_size, d_model, dropout, max_len, mode):\n", |
| 74 | + " \"\"\"Returns a list of layers that: \n", |
| 75 | + " 1. takes a block of text as input, \n", |
| 76 | + " 2. embeds the words in that text, and \n", |
| 77 | + " 3. adds positional encoding, \n", |
| 78 | + " i.e. associates a number in range(max_len) with \n", |
| 79 | + " each word in each sentence of embedded input text \n", |
| 80 | + " \n", |
| 81 | + " The input is a list of tokenized blocks of text\n", |
| 82 | + " \n", |
| 83 | + " Args:\n", |
| 84 | + " vocab_size (int): vocab size.\n", |
| 85 | + " d_model (int): depth of embedding.\n", |
| 86 | + " dropout (float): dropout rate (how much to drop out).\n", |
| 87 | + " max_len (int): maximum symbol length for positional encoding.\n", |
| 88 | + " mode (str): 'train' or 'eval'.\n", |
| 89 | + " \"\"\"\n", |
| 90 | + " # Embedding inputs and positional encoder\n", |
| 91 | + " return [ \n", |
| 92 | + " # Add embedding layer of dimension (vocab_size, d_model)\n", |
| 93 | + " tl.Embedding(vocab_size, d_model), \n", |
| 94 | + " # Use dropout with rate and mode specified\n", |
| 95 | + " tl.Dropout(rate=dropout, mode=mode), \n", |
| 96 | + " # Add positional encoding layer with maximum input length and mode specified\n", |
| 97 | + " tl.PositionalEncoding(max_len=max_len, mode=mode)] " |
| 98 | + ] |
| 99 | + }, |
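| 100 | + { |
| 101 | + "cell_type": "markdown", |
| 102 | + "metadata": {}, |
| 103 | + "source": [ |
| 104 | + "The optional cell below is a quick sketch (not part of the lecture flow) of one way to try `PositionalEncoder` out: wrap the returned list in `tl.Serial`, initialize it from the input signature, and pass in a small batch of token ids. The vocabulary size, `d_model`, `max_len`, and the token ids themselves are made-up values chosen only for illustration." |
| 105 | + ] |
| 106 | + }, |
| 107 | + { |
| 108 | + "cell_type": "code", |
| 109 | + "execution_count": null, |
| 110 | + "metadata": {}, |
| 111 | + "outputs": [], |
| 112 | + "source": [ |
| 113 | + "# Illustrative only: toy sizes and token ids, not the assignment settings\n", |
| 114 | + "toy_encoder = tl.Serial(\n", |
| 115 | + "    PositionalEncoder(vocab_size=100, d_model=16, dropout=0.0, max_len=32, mode='eval')\n", |
| 116 | + ")\n", |
| 117 | + "\n", |
| 118 | + "# One \"sentence\" of 5 token ids (integers smaller than vocab_size)\n", |
| 119 | + "toy_tokens = np.array([[3, 7, 12, 4, 9]], dtype=np.int32)\n", |
| 120 | + "\n", |
| 121 | + "# Initialize weights from the input signature, then apply the layers\n", |
| 122 | + "toy_encoder.init(trax.shapes.signature(toy_tokens))\n", |
| 123 | + "toy_embeddings = toy_encoder(toy_tokens)\n", |
| 124 | + "\n", |
| 125 | + "# Each token id becomes a d_model-dimensional vector with position information added\n", |
| 126 | + "print('input shape: ', toy_tokens.shape)\n", |
| 127 | + "print('output shape:', toy_embeddings.shape)" |
| 128 | + ] |
| 129 | + }, |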
| 100 | + { |
| 101 | + "cell_type": "markdown", |
| 102 | + "metadata": {}, |
| 103 | + "source": [ |
| 104 | + "## Multi-head causal attention\n", |
| 105 | + "\n", |
| 106 | + "The layers and array dimensions involved in multi-head causal attention (which looks at previous words in the input text) are summarized in the figure below: \n", |
| 107 | + "\n", |
| 108 | + "<img src=\"transformer_decoder_lnb_figs/C4_W2_L5_multi-head-attention_S05_multi-head-attention-concatenation_stripped.png\" width=\"1000\"/>\n", |
| 109 | + "\n", |
| 110 | + "`tl.CausalAttention()` does all of this for you! You might be wondering, though, whether you need to pass in your input text 3 times, since for causal attention, the queries Q, keys K, and values V all come from the same source. Fortunately, `tl.CausalAttention()` handles this as well by making use of the [`tl.Branch()`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#module-trax.layers.combinators) combinator layer. In general, each branch within a `tl.Branch()` layer performs parallel operations on copies of the layer's inputs. For causal attention, each branch (representing Q, K, and V) applies a linear transformation (i.e. a dense layer without a subsequent activation) to its copy of the input, then splits that result into heads. You can see the syntax for this in the screenshot from the `trax.layers.attention.py` [source code](https://github.com/google/trax/blob/master/trax/layers/attention.py) below: \n", |
| 111 | + "\n", |
| 112 | + "<img src=\"transformer_decoder_lnb_figs/use-of-tl-Branch-in-tl-CausalAttention.png\" width=\"500\"/>" |
| 113 | + ] |
| 114 | + }, |
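| 115 | + { |
| 116 | + "cell_type": "markdown", |
| 117 | + "metadata": {}, |
| 118 | + "source": [ |
| 119 | + "The optional cell below is a small sketch (not part of the lecture code) that instantiates `tl.CausalAttention()` on its own, prints it so you can see the sublayers it is assembled from (compare with the `tl.Branch()` call in the source-code screenshot above), and runs a random batch of activations through it. The sizes used (`d_feature=512`, `n_heads=8`, one sequence of 8 positions) are arbitrary choices for this demo." |
| 120 | + ] |
| 121 | + }, |
| 122 | + { |
| 123 | + "cell_type": "code", |
| 124 | + "execution_count": null, |
| 125 | + "metadata": {}, |
| 126 | + "outputs": [], |
| 127 | + "source": [ |
| 128 | + "# Illustrative only: arbitrary sizes for a standalone causal attention layer\n", |
| 129 | + "causal_attn = tl.CausalAttention(d_feature=512, n_heads=8, mode='eval')\n", |
| 130 | + "\n", |
| 131 | + "# Printing the layer shows the sublayers it is assembled from\n", |
| 132 | + "# (compare with the tl.Branch() call in the source-code screenshot above)\n", |
| 133 | + "print(causal_attn)\n", |
| 134 | + "\n", |
| 135 | + "# Random activations standing in for embedded, positionally encoded tokens\n", |
| 136 | + "x = np.random.uniform(size=(1, 8, 512)).astype(np.float32)\n", |
| 137 | + "\n", |
| 138 | + "# Initialize weights from the input signature, then apply the layer\n", |
| 139 | + "causal_attn.init(trax.shapes.signature(x))\n", |
| 140 | + "y = causal_attn(x)\n", |
| 141 | + "print('input shape: ', x.shape)\n", |
| 142 | + "print('output shape:', y.shape)  # same shape as the input" |
| 143 | + ] |
| 144 | + }, |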
| 115 | + { |
| 116 | + "cell_type": "markdown", |
| 117 | + "metadata": {}, |
| 118 | + "source": [ |
| 119 | + "## Feed-forward layer \n", |
| 120 | + "* Typically ends with a ReLU activation, but we'll leave open the possibility of a different activation\n", |
| 121 | + "* Most of the model's parameters are here (a rough count follows the code cell below)" |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "code", |
| 126 | + "execution_count": 3, |
| 127 | + "metadata": {}, |
| 128 | + "outputs": [], |
| 129 | + "source": [ |
| 130 | + "def FeedForward(d_model, d_ff, dropout, mode, ff_activation):\n", |
| 131 | + " \"\"\"Returns a list of layers that implements a feed-forward block.\n", |
| 132 | + "\n", |
| 133 | + " The input is an activation tensor.\n", |
| 134 | + "\n", |
| 135 | + " Args:\n", |
| 136 | + " d_model (int): depth of embedding.\n", |
| 137 | + " d_ff (int): depth of feed-forward layer.\n", |
| 138 | + " dropout (float): dropout rate (how much to drop out).\n", |
| 139 | + " mode (str): 'train' or 'eval'.\n", |
| 140 | + " ff_activation (function): the non-linearity in feed-forward layer.\n", |
| 141 | + "\n", |
| 142 | + " Returns:\n", |
| 143 | + " list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.\n", |
| 144 | + " \"\"\"\n", |
| 145 | + " \n", |
| 146 | + " # Create feed-forward block (list) with two dense layers with dropout and input normalized\n", |
| 147 | + " return [ \n", |
| 148 | + " # Normalize layer inputs\n", |
| 149 | + " tl.LayerNorm(), \n", |
| 150 | + " # Add first feed forward (dense) layer (don't forget to set the correct value for n_units)\n", |
| 151 | + " tl.Dense(d_ff), \n", |
| 152 | + " # Add activation function passed in as a parameter (you need to call it!)\n", |
| 153 | + " ff_activation(), # Generally ReLU\n", |
| 154 | + " # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)\n", |
| 155 | + " tl.Dropout(rate=dropout, mode=mode), \n", |
| 156 | + " # Add second feed forward layer (don't forget to set the correct value for n_units)\n", |
| 157 | + " tl.Dense(d_model), \n", |
| 158 | + " # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)\n", |
| 159 | + " tl.Dropout(rate=dropout, mode=mode) \n", |
| 160 | + " ]" |
| 161 | + ] |
| 162 | + }, |
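| 163 | + { |
| 164 | + "cell_type": "markdown", |
| 165 | + "metadata": {}, |
| 166 | + "source": [ |
| 167 | + "To make the claim that most of the parameters live in the feed-forward blocks concrete, the optional cell below does a rough, back-of-the-envelope count for one decoder block, assuming standard dense layers with biases and the default sizes used by `TransformerLM` further down (`d_model=512`, `d_ff=2048`). It ignores layer-norm parameters and the embedding table, so treat it as an estimate rather than an exact count." |
| 168 | + ] |
| 169 | + }, |
| 170 | + { |
| 171 | + "cell_type": "code", |
| 172 | + "execution_count": null, |
| 173 | + "metadata": {}, |
| 174 | + "outputs": [], |
| 175 | + "source": [ |
| 176 | + "# Rough parameter count for one decoder block (illustration only)\n", |
| 177 | + "d_model_demo, d_ff_demo = 512, 2048\n", |
| 178 | + "\n", |
| 179 | + "# Feed-forward block: Dense(d_ff) then Dense(d_model), each with weights and biases\n", |
| 180 | + "ff_params = (d_model_demo * d_ff_demo + d_ff_demo) + (d_ff_demo * d_model_demo + d_model_demo)\n", |
| 181 | + "\n", |
| 182 | + "# Causal attention: roughly four d_model x d_model projections (Q, K, V, output) with biases\n", |
| 183 | + "attn_params = 4 * (d_model_demo * d_model_demo + d_model_demo)\n", |
| 184 | + "\n", |
| 185 | + "print(f'feed-forward parameters per block: {ff_params:,}')\n", |
| 186 | + "print(f'attention parameters per block:    {attn_params:,}')" |
| 187 | + ] |
| 188 | + }, |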
| 163 | + { |
| 164 | + "cell_type": "markdown", |
| 165 | + "metadata": {}, |
| 166 | + "source": [ |
| 167 | + "## Decoder block\n", |
| 168 | + "Here, we return a list containing two residual blocks. The first wraps around the causal attention layer, whose inputs are normalized and to which we apply dropout regularization. The second wraps around the feed-forward block. You may notice that the second call to `tl.Residual()` doesn't apply a normalization layer before the feed-forward block; that's because the feed-forward block already begins with `tl.LayerNorm()`." |
| 169 | + ] |
| 170 | + }, |
| 171 | + { |
| 172 | + "cell_type": "code", |
| 173 | + "execution_count": 4, |
| 174 | + "metadata": {}, |
| 175 | + "outputs": [], |
| 176 | + "source": [ |
| 177 | + "def DecoderBlock(d_model, d_ff, n_heads,\n", |
| 178 | + " dropout, mode, ff_activation):\n", |
| 179 | + " \"\"\"Returns a list of layers that implements a Transformer decoder block.\n", |
| 180 | + "\n", |
| 181 | + " The input is an activation tensor.\n", |
| 182 | + "\n", |
| 183 | + " Args:\n", |
| 184 | + " d_model (int): depth of embedding.\n", |
| 185 | + " d_ff (int): depth of feed-forward layer.\n", |
| 186 | + " n_heads (int): number of attention heads.\n", |
| 187 | + " dropout (float): dropout rate (how much to drop out).\n", |
| 188 | + " mode (str): 'train' or 'eval'.\n", |
| 189 | + " ff_activation (function): the non-linearity in feed-forward layer.\n", |
| 190 | + "\n", |
| 191 | + " Returns:\n", |
| 192 | + " list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.\n", |
| 193 | + " \"\"\"\n", |
| 194 | + " \n", |
| 195 | + " # Add list of two Residual blocks: the attention with normalization and dropout and feed-forward blocks\n", |
| 196 | + " return [\n", |
| 197 | + " tl.Residual(\n", |
| 198 | + " # Normalize layer input\n", |
| 199 | + " tl.LayerNorm(), \n", |
| 200 | + " # Add causal attention \n", |
| 201 | + "        tl.CausalAttention(d_model, n_heads=n_heads, dropout=dropout, mode=mode) \n", |
| 202 | + " ),\n", |
| 203 | + " tl.Residual(\n", |
| 204 | + " # Add feed-forward block\n", |
| 205 | + " # We don't need to normalize the layer inputs here. The feed-forward block takes care of that for us.\n", |
| 206 | + " FeedForward(d_model, d_ff, dropout, mode, ff_activation)\n", |
| 207 | + " ),\n", |
| 208 | + " ]" |
| 209 | + ] |
| 210 | + }, |
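| 211 | + { |
| 212 | + "cell_type": "markdown", |
| 213 | + "metadata": {}, |
| 214 | + "source": [ |
| 215 | + "As a quick optional check, the cell below wraps a single `DecoderBlock` in `tl.Serial` and prints it, so you can see the two `tl.Residual` blocks (causal attention, then feed-forward) laid out as Trax layers. The sizes passed in are arbitrary for this demo." |
| 216 | + ] |
| 217 | + }, |
| 218 | + { |
| 219 | + "cell_type": "code", |
| 220 | + "execution_count": null, |
| 221 | + "metadata": {}, |
| 222 | + "outputs": [], |
| 223 | + "source": [ |
| 224 | + "# Illustrative only: inspect the structure of one decoder block (arbitrary sizes)\n", |
| 225 | + "single_block = tl.Serial(\n", |
| 226 | + "    DecoderBlock(d_model=512, d_ff=2048, n_heads=8,\n", |
| 227 | + "                 dropout=0.1, mode='train', ff_activation=tl.Relu)\n", |
| 228 | + ")\n", |
| 229 | + "print(single_block)" |
| 230 | + ] |
| 231 | + }, |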
| 211 | + { |
| 212 | + "cell_type": "markdown", |
| 213 | + "metadata": {}, |
| 214 | + "source": [ |
| 215 | + "## The transformer decoder: putting it all together\n", |
| 216 | + "## A.k.a. repeat the decoder block N times, then add a dense layer and log-softmax for the output" |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": 5, |
| 222 | + "metadata": {}, |
| 223 | + "outputs": [], |
| 224 | + "source": [ |
| 225 | + "def TransformerLM(vocab_size=33300,\n", |
| 226 | + " d_model=512,\n", |
| 227 | + " d_ff=2048,\n", |
| 228 | + " n_layers=6,\n", |
| 229 | + " n_heads=8,\n", |
| 230 | + " dropout=0.1,\n", |
| 231 | + " max_len=4096,\n", |
| 232 | + " mode='train',\n", |
| 233 | + " ff_activation=tl.Relu):\n", |
| 234 | + " \"\"\"Returns a Transformer language model.\n", |
| 235 | + "\n", |
| 236 | + " The input to the model is a tensor of tokens. (This model uses only the\n", |
| 237 | + " decoder part of the overall Transformer.)\n", |
| 238 | + "\n", |
| 239 | + " Args:\n", |
| 240 | + " vocab_size (int): vocab size.\n", |
| 241 | + " d_model (int): depth of embedding.\n", |
| 242 | + " d_ff (int): depth of feed-forward layer.\n", |
| 243 | + " n_layers (int): number of decoder layers.\n", |
| 244 | + " n_heads (int): number of attention heads.\n", |
| 245 | + " dropout (float): dropout rate (how much to drop out).\n", |
| 246 | + " max_len (int): maximum symbol length for positional encoding.\n", |
| 247 | + " mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.\n", |
| 248 | + " ff_activation (function): the non-linearity in feed-forward layer.\n", |
| 249 | + "\n", |
| 250 | + " Returns:\n", |
| 251 | + " trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens\n", |
| 252 | + " to activations over a vocab set.\n", |
| 253 | + " \"\"\"\n", |
| 254 | + " \n", |
| 255 | + " # Create stack (list) of decoder blocks with n_layers with necessary parameters\n", |
| 256 | + " decoder_blocks = [ \n", |
| 257 | + " DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation) for _ in range(n_layers)] \n", |
| 258 | + "\n", |
| 259 | + " # Create the complete model as written in the figure\n", |
| 260 | + " return tl.Serial(\n", |
| 261 | + "        # Shift the targets one position to the right for teacher forcing (predict each token from the previous ones)\n", |
| 262 | + " tl.ShiftRight(mode=mode), \n", |
| 263 | + " # Add embedding inputs and positional encoder\n", |
| 264 | + " PositionalEncoder(vocab_size, d_model, dropout, max_len, mode),\n", |
| 265 | + " # Add decoder blocks\n", |
| 266 | + " decoder_blocks, \n", |
| 267 | + " # Normalize layer\n", |
| 268 | + " tl.LayerNorm(), \n", |
| 269 | + "\n", |
| 270 | + "        # Add dense layer with vocab_size units to produce a score (logit) for each word in the vocabulary\n", |
| 271 | + "        # (a.k.a. the logits layer; no activation here -- the LogSoftmax below converts logits to log-probabilities)\n", |
| 272 | + " tl.Dense(vocab_size), \n", |
| 273 | + "        # Get log-probabilities with LogSoftmax\n", |
| 274 | + " tl.LogSoftmax() \n", |
| 275 | + " )" |
| 276 | + ] |
| 277 | + }, |
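| 278 | + { |
| 279 | + "cell_type": "markdown", |
| 280 | + "metadata": {}, |
| 281 | + "source": [ |
| 282 | + "Finally, the optional cell below builds a scaled-down `TransformerLM` and prints it, just to inspect the full stack: shift-right, embedding plus positional encoding, the repeated decoder blocks, a final layer norm, the dense logits layer, and log-softmax. The reduced sizes (2 layers, `d_model=128`, `d_ff=256`, 4 heads) are arbitrary, chosen only to keep the printout short." |
| 283 | + ] |
| 284 | + }, |
| 285 | + { |
| 286 | + "cell_type": "code", |
| 287 | + "execution_count": null, |
| 288 | + "metadata": {}, |
| 289 | + "outputs": [], |
| 290 | + "source": [ |
| 291 | + "# Illustrative only: a small model so the printed structure stays readable\n", |
| 292 | + "small_lm = TransformerLM(vocab_size=1000, d_model=128, d_ff=256,\n", |
| 293 | + "                         n_layers=2, n_heads=4, dropout=0.1,\n", |
| 294 | + "                         max_len=256, mode='eval')\n", |
| 295 | + "print(small_lm)" |
| 296 | + ] |
| 297 | + }, |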
| 278 | + { |
| 279 | + "cell_type": "markdown", |
| 280 | + "metadata": {}, |
| 281 | + "source": [ |
| 282 | + "## Concluding remarks\n", |
| 283 | + "\n", |
| 284 | + "In this week's assignment, you'll see how to train a transformer decoder on the [cnn_dailymail](https://www.tensorflow.org/datasets/catalog/cnn_dailymail) dataset, available from TensorFlow Datasets (TFDS). Because training such a model from scratch is time-intensive, you'll use a pre-trained model to summarize documents later in the assignment. Due to time and storage constraints, we also won't train the decoder on a different summarization dataset in this lab. If you have the time and space, we encourage you to explore the other [summarization](https://www.tensorflow.org/datasets/catalog/overview#summarization) datasets at TensorFlow Datasets. Which of them might suit your purposes better than the `cnn_dailymail` dataset? Where else can you find datasets for text summarization models?" |
| 285 | + ] |
| 286 | + } |
| 287 | + ], |
| 288 | + "metadata": { |
| 289 | + "kernelspec": { |
| 290 | + "display_name": "Python 3", |
| 291 | + "language": "python", |
| 292 | + "name": "python3" |
| 293 | + }, |
| 294 | + "language_info": { |
| 295 | + "codemirror_mode": { |
| 296 | + "name": "ipython", |
| 297 | + "version": 3 |
| 298 | + }, |
| 299 | + "file_extension": ".py", |
| 300 | + "mimetype": "text/x-python", |
| 301 | + "name": "python", |
| 302 | + "nbconvert_exporter": "python", |
| 303 | + "pygments_lexer": "ipython3", |
| 304 | + "version": "3.7.6" |
| 305 | + } |
| 306 | + }, |
| 307 | + "nbformat": 4, |
| 308 | + "nbformat_minor": 4 |
| 309 | +} |