This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Commit dcec284

dirkgr, epwalsh, jacob-morrison, nelson-liu, and AkshitaB committed
T5 (#4969)
* Formatting
* New activation functions
* Makes position embeddings optional in the transformer embeddings
* Adds T5
* Various fixes to make this start up
* Share weights
* Adds one test that passes, and one test that fails
* use min_value_of_dtype in apply_mask
* fixes, add beam search
* encoder fixes
* fix
* fix beam search
* fix tests
* rename to just 'T5'
* fix initialization from pretrained
* add Model, DatasetReader, and Predictor
* remove useless dataset reader
* move high-level pieces to allennlp-models
* revert predictor changes
* remove unneeded hidden_size
* remove stray comment
* bool masks
* CHANGELOG
* fix test file name
* revert other change
* revert other change
* Distributed training with gradient accumulation (#5100)
* Fixes distributed training with gradient accumulation
* Fix in case we don't do anything in a batch group
* Test for the problematic condition
* Formatting
* More formatting
* Changelog
* Fix another test
* Fix even more tests
* Fixes one more test
* I can fix these tests all day.
* Add link to gallery and demo in README (#5103)
* Add link to gallery in README
* Update README.md
* try emojis Is this overkill?
* Adding a metadata field to the basic classifier (#5104)
* Adding metadata parameter to BasicClassifier
* Fix
* Updating the changelog
* reformatting
* updating parameter type
* fixing import
  Co-authored-by: Dirk Groeneveld <[email protected]>
* additional W&B params (#5114)
* additional W&B params
* add wandb_kwargs
* fix
* fix docs
* Add eval_mode argument to pretrained transformer embedder (#5111)
* Add eval_mode argument to pretrained transformer embedder
* Edit changelog entry
* Lint
* Update allennlp/modules/token_embedders/pretrained_transformer_embedder.py
* Apply suggestions from code review
  Co-authored-by: Evan Pete Walsh <[email protected]>
  Co-authored-by: Evan Pete Walsh <[email protected]>
* specify 'truncation' to avoid transformers warning (#5120)
* specify 'truncation' to avoid transformers warning
* Update docs
* Remove `stride` param
* Update CHANGELOG.md
  Co-authored-by: Dirk Groeneveld <[email protected]>
* Predicting with a dataset reader on a multitask model (#5115)
* Create a way to use allennlp predict with a dataset and a multitask model
* Fix type ignoration
* Changelog
* Fix to the predictor
* fix bug with interleaving dataset reader (#5122)
* fix bug with interleaving dataset reader
* more tests
* Update allennlp/data/dataset_readers/interleaving_dataset_reader.py
* Update allennlp/data/dataset_readers/interleaving_dataset_reader.py
* remove jsonpickle from dependencies (#5121)
  Co-authored-by: Dirk Groeneveld <[email protected]>
* Update docstring for basic_classifier (#5124)
* improve error message from Registrable class (#5125)
  Co-authored-by: Akshita Bhagia <[email protected]>
* Prepare for release v2.3.0
* fix docs CI
* Take the number of runs in the test for distributed metrics (#5127)
* Take the number of runs in the test for distributed metrics
* Changelog
* Add influence functions to interpret module (#4988)
* creating a new functionality to fields and instances to support outputting instances to json files
* creating tests for the new functionality
* fixing docs
* Delete __init__.py
* Delete influence_interpreter.py
* Delete use_if.py
* Delete simple_influence_test.py
* fixing docs
* finishing up SimpleInfluence
* passing lint
* passing format
* making small progress in coding
* Delete fast_influence.py Submit to the wrong branch
* Delete faiss_utils.py wrong branch
* Delete gpt2_bug.py not sure why it's included
* Delete text_class.py not sure why it's included
* adding test file
* adding testing files
* deleted unwanted files
* deleted unwanted files and rearrange test files
* small bug
* adjust function call to save instance in json
* Update allennlp/interpret/influence_interpreters/influence_interpreter.py
  Co-authored-by: Evan Pete Walsh <[email protected]>
* Update allennlp/interpret/influence_interpreters/influence_interpreter.py
  Co-authored-by: Evan Pete Walsh <[email protected]>
* Update allennlp/interpret/influence_interpreters/influence_interpreter.py
  Co-authored-by: Evan Pete Walsh <[email protected]>
* move some documentation of parameters to base class
* delete one comment
* delete one deprecated abstract method
* changing interface
* formatting
* formatting err
* passing mypy
* passing mypy
* passing mypy
* passing mypy
* passing integration test
* passing integration test
* adding a new option to the do-all function
* modifying the callable function to the interface
* update API, fixes
* doc fixes
* add `from_path` and `from_archive` methods
* fix docs, improve logging
* add test
* address @matt-gardner's comments
* fixes to documentation
* update docs
  Co-authored-by: Evan Pete Walsh <[email protected]>
  Co-authored-by: Evan Pete Walsh <[email protected]>
* Update CONTRIBUTING.md (#5133)
* Update CONTRIBUTING.md
* updated changelog
  Co-authored-by: Akshita Bhagia <[email protected]>
  Co-authored-by: Arjun Subramonian <[email protected]>
* fix #5132 (#5134)
* fix
* Prepare for release v2.3.1
* Fairness Metrics (#5093)
* Added three definitions of fairness
* Updated CHANGELOG
* Added DemographicParityWithoutGroundTruth and finished tests
* finished refactoring Independence, Separation, and Sufficiency to accumulate
* added distributed functionality to Independence, Sufficiency, and Separation
* Finished aggregate and distributed functionality for DemographicParityWithoutGroundTruth
* fixed GPU and doc issues
* fixed GPU and doc issues
* fixed GPU and doc issues
* fixed GPU issues
* fixed GPU issues
* added init file
* fixed typo
* minor docstring changes
* minor changes to docstring
* Added simple explanations of fairness metrics to docstrings
* Further vectorized all metric implementations
* Fixed device issue
  Co-authored-by: Arjun Subramonian <[email protected]>
  Co-authored-by: Akshita Bhagia <[email protected]>
  Co-authored-by: Dirk Groeneveld <[email protected]>
* fix cached_path for hub downloads (#5141)
* fix cached_path for hub downloads
* fix test name
* fix type hint
* Update allennlp/common/file_utils.py
  Co-authored-by: Lysandre Debut <[email protected]>
  Co-authored-by: Lysandre Debut <[email protected]>
* fix
* fix

Co-authored-by: epwalsh <[email protected]>
Co-authored-by: Evan Pete Walsh <[email protected]>
Co-authored-by: Jacob Morrison <[email protected]>
Co-authored-by: Nelson Liu <[email protected]>
Co-authored-by: Akshita Bhagia <[email protected]>
Co-authored-by: Leo Liu <[email protected]>
Co-authored-by: ArjunSubramonian <[email protected]>
Co-authored-by: Arjun Subramonian <[email protected]>
Co-authored-by: Arjun Subramonian <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
1 parent cd96669 commit dcec284

File tree

9 files changed: +1521 -34 lines changed


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## Unreleased
 
+### Added
+
+- Added a T5 implementation to `modules.transformers`.
+
 ### Fixed
 
 - Fixed `cached_path()` for "hf://" files.

allennlp/modules/transformer/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -140,3 +140,4 @@ def forward(self, token_ids: torch.LongTensor, mask: torch.BoolTensor):
 
 from allennlp.modules.transformer.bimodal_attention import BiModalAttention
 from allennlp.modules.transformer.bimodal_encoder import BiModalEncoder
+from allennlp.modules.transformer.t5 import T5
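
With `T5` now exported from `allennlp.modules.transformer`, the stack can be imported directly. Below is a minimal, hedged sketch of loading it from a pretrained checkpoint; it assumes `T5` subclasses `TransformerModule` and therefore inherits the `from_pretrained_module` classmethod shown later in this commit, and the "t5-small" checkpoint name is only illustrative.

    # Sketch only: assumes T5 inherits TransformerModule.from_pretrained_module and
    # that the (illustrative) "t5-small" Hugging Face checkpoint is available.
    from allennlp.modules.transformer import T5

    t5 = T5.from_pretrained_module("t5-small")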

allennlp/modules/transformer/t5.py

Lines changed: 1264 additions & 0 deletions
Large diffs are not rendered by default.

allennlp/modules/transformer/transformer_embeddings.py

Lines changed: 11 additions & 12 deletions
@@ -119,14 +119,14 @@ def __init__(
         dropout: float = 0.1,
         output_size: Optional[int] = None,
     ):
-
         embedding_dict = {}
 
         word_embeddings = torch.nn.Embedding(vocab_size, embedding_size, padding_idx=pad_token_id)
         embedding_dict["word_embeddings"] = word_embeddings
 
-        position_embeddings = torch.nn.Embedding(max_position_embeddings, embedding_size)
-        embedding_dict["position_embeddings"] = position_embeddings
+        if max_position_embeddings > 0:
+            position_embeddings = torch.nn.Embedding(max_position_embeddings, embedding_size)
+            embedding_dict["position_embeddings"] = position_embeddings
 
         if type_vocab_size > 0:
             token_type_embeddings = torch.nn.Embedding(type_vocab_size, embedding_size)
@@ -163,16 +163,15 @@ def forward( # type: ignore
 
         embedding_inputs = [input_ids]
 
-        if position_ids is None:
-            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
-            position_ids = position_ids.unsqueeze(0).expand(input_shape)
-
-        embedding_inputs.append(position_ids)
-
-        if token_type_ids is None:
-            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
+        if "position_embeddings" in self.embeddings:
+            if position_ids is None:
+                position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
+                position_ids = position_ids.unsqueeze(0).expand(input_shape)
+            embedding_inputs.append(position_ids)
 
-        if len(self.embeddings) == 3:
+        if "token_type_embeddings" in self.embeddings:
+            if token_type_ids is None:
+                token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
             embedding_inputs.append(token_type_ids)
 
         embeddings = super().forward(*embedding_inputs)
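
The net effect of this change: when `max_position_embeddings` is zero or negative, no absolute position-embedding table is created and `position_ids` are never consulted in `forward()`, which is what T5 needs since it relies on relative position biases instead. A rough sketch, assuming the class in this file is `TransformerEmbeddings` and that the constructor keywords match the ones visible in the hunk above (the sizes are placeholders):

    import torch
    from allennlp.modules.transformer.transformer_embeddings import TransformerEmbeddings

    # With max_position_embeddings=0 (and type_vocab_size=0), only word embeddings are built.
    embeddings = TransformerEmbeddings(
        vocab_size=32128,
        embedding_size=512,
        pad_token_id=0,
        max_position_embeddings=0,
        type_vocab_size=0,
    )
    out = embeddings(torch.tensor([[3, 25, 7, 0]]))  # position_ids are simply skipped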

allennlp/modules/transformer/transformer_module.py

Lines changed: 12 additions & 9 deletions
@@ -1,4 +1,4 @@
-from typing import Optional, Dict, Union, List
+from typing import Optional, Dict, Union, List, Any
 import logging
 import inspect
 
@@ -32,23 +32,26 @@ def __init__(self, *args, **kwargs):
     def _get_mapping(
         cls,
         pretrained_module: Optional[torch.nn.Module] = None,
-        source="huggingface",
+        source: str = "huggingface",
         mapping: Optional[Dict[str, str]] = None,
     ):
         """
         Returns the mapping to be used, based on the optional `pretrained_module`.
         If `pretrained_module` is not given, the default module-level mapping is returned.
         """
         combined_mapping = {}
-        if "huggingface" in source:
+        if "huggingface" == source:
             combined_mapping.update(cls._huggingface_mapping)
         if mapping is not None:
             combined_mapping.update(mapping)
         return combined_mapping
 
     @classmethod
     def _get_mapped_submodules(
-        cls, pretrained_module, source="huggingface", mapping: Optional[Dict[str, str]] = None
+        cls,
+        pretrained_module: torch.nn.Module,
+        source: str = "huggingface",
+        mapping: Optional[Dict[str, str]] = None,
     ):
         """
         Subclasses overload this method, and provide appropriate name mapping based on the source.
@@ -64,7 +67,7 @@ def _get_mapped_submodules(
 
     def _construct_default_mapping(
         self,
-        pretrained_module,
+        pretrained_module: torch.nn.Module,
         source: str = "huggingface",
         mapping: Optional[Dict[str, str]] = None,
     ):
@@ -127,10 +130,10 @@ def _load_from_pretrained_module(
     def _get_input_arguments(
         cls,
         pretrained_module: torch.nn.Module,
-        source="huggingface",
+        source: str = "huggingface",
         mapping: Optional[Dict[str, str]] = None,
         **kwargs,
-    ):
+    ) -> Dict[str, Any]:
         """
         Constructs the arguments required for instantiating an object of this class, using
         the values from `pretrained_module`.
@@ -142,7 +145,7 @@ def get_relevant_module(
         cls,
         pretrained_module: Union[str, torch.nn.Module],
         relevant_module: Optional[Union[str, List[str]]] = None,
-        source="huggingface",
+        source: str = "huggingface",
         mapping: Optional[Dict[str, str]] = None,
     ):
         """
@@ -187,7 +190,7 @@ def get_relevant_module(
     def from_pretrained_module(
         cls,
        pretrained_module: Union[str, torch.nn.Module],
-        source="huggingface",
+        source: str = "huggingface",
         mapping: Optional[Dict[str, str]] = None,
         **kwargs,
     ):
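
Besides the type annotations, the one behavioral change here is that `source` must now equal "huggingface" exactly rather than merely contain it. A hedged sketch of how the `source` and `mapping` parameters are meant to be used, going through a concrete subclass (`SelfAttention` and the model name come from elsewhere in the library; the extra `mapping` entry is purely illustrative):

    from allennlp.modules.transformer import SelfAttention

    # `source` must now match "huggingface" exactly for the default HF name mapping to apply;
    # `mapping` entries (hypothetical here) are merged on top of the class's _huggingface_mapping.
    module = SelfAttention.from_pretrained_module(
        "bert-base-cased",
        source="huggingface",
        mapping={"layer": "layers"},  # hypothetical override
    )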

allennlp/modules/transformer/util.py

Lines changed: 84 additions & 8 deletions
@@ -1,6 +1,8 @@
-from typing import Union
+from typing import Union, Tuple
 import torch
 
+from allennlp.nn.util import min_value_of_dtype
+
 
 def apply_mask(
     values: torch.FloatTensor, mask: Union[torch.BoolTensor, torch.IntTensor, torch.FloatTensor]
@@ -13,13 +15,87 @@ def apply_mask(
     mask : `torch.BoolTensor`
         Shape `batch_size x target_seq_len` OR `batch_size x 1 x 1 x target_seq_len`
     """
-    if len(mask.shape) == 2:
-        # We create a 4D attention mask from a 2D tensor mask.
+    # We create a 4D attention mask from a 2D or 3D tensor mask.
+    if mask.dim() == 2:
         # The shape is `batch_size x 1 x 1 x target_seq_len` which is broadcast
         # to `batch_size x num_attention_heads x source_seq_len x target_seq_len`
-        mask = mask.unsqueeze(1).unsqueeze(2)
-    # `mask==1` to convert float tensors.
-    mask = (
-        ~(mask == 1)
-    ) * -10e5  # -10e5 to ensure that the model also works in half-precision mode.
+        mask = mask[:, None, None, :]
+    elif mask.dim() == 3:
+        mask = mask[:, None, :, :]
+    mask = mask.to(values.dtype)
+    mask = (1.0 - mask) * min_value_of_dtype(values.dtype)
     return values + mask
+
+
+def get_extended_attention_mask(
+    attention_mask: torch.Tensor,
+    input_shape: Tuple[int, ...],
+    dtype: torch.dtype,
+    is_decoder: bool = False,
+) -> torch.Tensor:
+    """
+    Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
+
+    # Parameters
+
+    attention_mask : `torch.Tensor`
+        Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
+    input_shape : `Tuple[int, ...]`
+        The shape of the input to the model.
+    dtype : `torch.dtype`
+        The datatype of the resulting mask.
+    is_decoder : `bool`, optional (default = `False`)
+        If this is for a decoder stack.
+
+    # Returns
+
+    `torch.Tensor`
+        The extended attention mask, with the same dtype as `attention_mask.dtype`.
+    """
+    # Adapted from https://github.com/huggingface/transformers/blob/
+    # 4c32f9f26e6a84f0d9843fec8757e6ce640bb44e/src/transformers/modeling_utils.py#L221.
+
+    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+    # ourselves in which case we just need to make it broadcastable to all heads.
+    if attention_mask.dim() == 3:
+        extended_attention_mask = attention_mask[:, None, :, :]
+    elif attention_mask.dim() == 2:
+        # Provided a padding mask of dimensions [batch_size, seq_length]
+        # - if the model is a decoder, apply a causal mask in addition to the padding mask
+        # - if the model is an encoder, make the mask broadcastable to
+        #   `(batch_size, num_heads, seq_length, seq_length)`
+        if is_decoder:
+            batch_size, seq_length = input_shape
+            seq_ids = torch.arange(seq_length, device=attention_mask.device)
+            causal_mask = (
+                seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
+            )
+            # in case past_key_values are used we need to add a prefix ones mask to the causal mask
+            # causal and attention masks must have same type with pytorch version < 1.3
+            causal_mask = causal_mask.to(attention_mask.dtype)
+
+            if causal_mask.shape[1] < attention_mask.shape[1]:
+                prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
+                causal_mask = torch.cat(
+                    [
+                        torch.ones(
+                            (batch_size, seq_length, prefix_seq_len),
+                            device=attention_mask.device,
+                            dtype=causal_mask.dtype,
+                        ),
+                        causal_mask,
+                    ],
+                    axis=-1,
+                )
+
+            extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
+        else:
+            extended_attention_mask = attention_mask[:, None, None, :]
+    else:
+        raise ValueError(
+            "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
+                input_shape, attention_mask.shape
+            )
+        )
+
+    return extended_attention_mask
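
In short, `apply_mask` now accepts 3D masks and builds the additive mask from `min_value_of_dtype` instead of the hard-coded `-10e5`, and `get_extended_attention_mask` ports the Hugging Face helper that combines padding and causal masks. A small usage sketch of the two functions as defined above (shapes chosen arbitrarily):

    import torch
    from allennlp.modules.transformer.util import apply_mask, get_extended_attention_mask

    scores = torch.randn(2, 4, 5, 5)                         # batch x heads x query_len x key_len
    mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # 1 = attend, 0 = ignore
    masked_scores = apply_mask(scores, mask)                 # masked slots get the dtype's minimum value

    # For a decoder, the padding mask is combined with a lower-triangular causal mask.
    extended = get_extended_attention_mask(mask, (2, 5), torch.float32, is_decoder=True)
    print(extended.shape)  # torch.Size([2, 1, 5, 5])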

allennlp/nn/activations.py

Lines changed: 24 additions & 1 deletion
@@ -5,7 +5,7 @@
 [PyTorch activations](https://pytorch.org/docs/master/nn.html#non-linear-activations).
 Here we provide a thin wrapper to allow registering them and instantiating them `from_params`.
 
-The available activation functions are
+The available activation functions include
 
 * "linear"
 * ["mish"](https://arxiv.org/abs/1908.08681)
@@ -27,6 +27,8 @@
 * ["selu"](https://pytorch.org/docs/master/nn.html#torch.nn.SELU)
 """
 
+import math
+
 import torch
 
 from allennlp.common import Registrable
@@ -86,3 +88,24 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
 class SwishActivation(Activation):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         return x * torch.sigmoid(x)
+
+
+@Activation.register("gelu_new")
+class GeluNew(Activation):
+    """
+    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also
+    see the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
+    """
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return (
+            0.5
+            * x
+            * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
+        )
+
+
+@Activation.register("gelu_fast")
+class GeluFast(Activation):
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))
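
Both additions are tanh-based GELU approximations; "gelu_fast" just folds sqrt(2 / pi) ≈ 0.7978845608 into a constant. A quick sketch of using them through the registry, assuming `Activation.by_name` works here as it does for the other registered activations:

    import torch
    from allennlp.nn.activations import Activation

    gelu_new = Activation.by_name("gelu_new")()
    gelu_fast = Activation.by_name("gelu_fast")()

    x = torch.randn(3)
    print(gelu_new(x), gelu_fast(x))  # the two approximations agree to within rounding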

tests/modules/transformer/self_attention_test.py

Lines changed: 3 additions & 4 deletions
@@ -4,10 +4,9 @@
 
 from allennlp.common import Params
 from allennlp.common import cached_transformers
-from allennlp.common.testing import assert_equal_parameters
-
+from allennlp.common.testing import assert_equal_parameters, AllenNlpTestCase
 from allennlp.modules.transformer import SelfAttention
-from allennlp.common.testing import AllenNlpTestCase
+from allennlp.nn.util import min_value_of_dtype
 
 from transformers.models.bert.configuration_bert import BertConfig
 from transformers.models.bert.modeling_bert import BertSelfAttention
@@ -160,7 +159,7 @@ def test_loading_from_pretrained_weights_using_model_name(self, pretrained_name)
             )[0]
         else:
             # The attn_mask is processed outside the self attention module in HF bert models.
-            attention_mask = (~(attention_mask == 1)) * -10e5
+            attention_mask = (~(attention_mask == 1)) * min_value_of_dtype(hidden_states.dtype)
         torch.manual_seed(1234)
         hf_output = pretrained_module.forward(hidden_states, attention_mask=attention_mask)[0]
 
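
The test mirrors the library change: `-10e5` is not representable in float16, which is precisely why `min_value_of_dtype` was introduced. A tiny illustration:

    import torch
    from allennlp.nn.util import min_value_of_dtype

    print(torch.tensor(-10e5, dtype=torch.float16))  # tensor(-inf, dtype=torch.float16); overflows fp16
    print(min_value_of_dtype(torch.float16))         # about -65504.0, still representable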
