This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Commit b48347b

Merge remote-tracking branch 'origin/master' into vision

2 parents: 81892db + b7cec51

File tree: 20 files changed, +260 / -456 lines

.github/workflows/master.yml (+6 -2)

@@ -151,8 +151,7 @@ jobs:
     - name: Clean up
       if: always()
       run: |
-        pip uninstall --yes allennlp
-        pip uninstall --yes allennlp_models
+        pip uninstall --yes allennlp allennlp-models
 
   # Builds package distribution files for PyPI.
   build_package:
@@ -252,6 +251,11 @@ jobs:
     - name: Install core package
      run: |
        pip install $(ls dist/*.whl)
+        # TODO(epwalsh): In PyTorch 1.7, dataclasses is an unconditional dependency, when it should
+        # only be a conditional dependency for Python < 3.7.
+        # This has been fixed on PyTorch master branch, so we should be able to
+        # remove this check with the next PyTorch release.
+        pip uninstall -y dataclasses
 
     - name: Pip freeze
      run: |

.github/workflows/pull_request.yml (+1 -2)

@@ -159,5 +159,4 @@ jobs:
     - name: Clean up
       if: always()
       run: |
-        pip uninstall --yes allennlp
-        pip uninstall --yes allennlp_models
+        pip uninstall --yes allennlp allennlp-models

CHANGELOG.md (+17 -4)

@@ -43,6 +43,16 @@ data loaders. Those are coming soon.
 
 ## Unreleased (1.x branch)
 
+### Fixed
+
+- Fixed the computation of saliency maps in the Interpret code when using mismatched indexing.
+  Previously, we would compute gradients from the top of the transformer, after aggregation from
+  wordpieces to tokens, which gives results that are not very informative. Now, we compute gradients
+  with respect to the embedding layer, and aggregate wordpieces to tokens separately.
+
+
+## [v1.2.0](https://github.com/allenai/allennlp/releases/tag/v1.2.0) - 2020-10-29
+
 ### Changed
 
 - Enforced stricter typing requirements around the use of `Optional[T]` types.
@@ -58,6 +68,11 @@ data loaders. Those are coming soon.
 
 - Made it possible to instantiate `TrainerCallback` from config files.
 - Fixed the remaining broken internal links in the API docs.
+- Fixed a bug where Hotflip would crash with a model that had multiple TokenIndexers and the input
+  used rare vocabulary items.
+- Fixed a bug where `BeamSearch` would fail if `max_steps` was equal to 1.
+- Fixed `BasicTextFieldEmbedder` to not raise ConfigurationError if it has embedders that are empty and not in input
+
 
 ## [v1.2.0rc1](https://github.com/allenai/allennlp/releases/tag/v1.2.0rc1) - 2020-10-22
 
@@ -87,10 +102,7 @@ data loaders. Those are coming soon.
 - Added logging for the main process when running in distributed mode.
 - Added a `TrainerCallback` object to support state sharing between batch and epoch-level training callbacks.
 - Added support for .tar.gz in PretrainedModelInitializer.
-- Added classes: `nn/samplers/samplers.py` with `MultinomialSampler`, `TopKSampler`, and `TopPSampler` for
-  sampling indices from log probabilities
-- Made `BeamSearch` registrable.
-- Added `top_k_sampling` and `type_p_sampling` `BeamSearch` implementations.
+- Made `BeamSearch` instantiable `from_params`.
 - Pass `serialization_dir` to `Model` and `DatasetReader`.
 - Added an optional `include_in_archive` parameter to the top-level of configuration files. When specified, `include_in_archive` should be a list of paths relative to the serialization directory which will be bundled up with the final archived model from a training run.
 
@@ -142,6 +154,7 @@ data loaders. Those are coming soon.
 - Fixed `allennlp.nn.util.add_sentence_boundary_token_ids()` to use `device` parameter of input tensor.
 - Be sure to close the TensorBoard writer even when training doesn't finish.
 - Fixed the docstring for `PyTorchSeq2VecWrapper`.
+- Fix intra word tokenization for `PretrainedTransformerTokenizer` when disabling fast tokenizer.
 
 
 ## [v1.1.0](https://github.com/allenai/allennlp/releases/tag/v1.1.0) - 2020-09-08

Dockerfile (+5)

@@ -21,6 +21,11 @@ WORKDIR /stage/allennlp
 # Install the wheel of AllenNLP.
 COPY dist dist/
 RUN pip install $(ls dist/*.whl)
+# TODO(epwalsh): In PyTorch 1.7, dataclasses is an unconditional dependency, when it should
+# only be a conditional dependency for Python < 3.7.
+# This has been fixed on PyTorch master branch, so we should be able to
+# remove this check with the next PyTorch release.
+RUN pip uninstall -y dataclasses
 
 # Copy wrapper script to allow beaker to run resumable training workloads.
 COPY scripts/ai2_internal/resumable_train.sh /stage/allennlp

README.md (+2)

@@ -97,6 +97,8 @@ And others on the [AI2 AllenNLP blog](https://medium.com/ai2-blog/allennlp/home)
 
 AllenNLP requires Python 3.6.1 or later. The preferred way to install AllenNLP is via `pip`. Just run `pip install allennlp` in your Python environment and you're good to go!
 
+> ⚠️ If you're using Python 3.7 or greater, you should ensure that you don't have the PyPI version of `dataclasses` installed after running the above command, as this could cause issues on certain platforms. You can quickly check this by running `pip freeze | grep dataclasses`. If you see something like `dataclasses=0.6` in the output, then just run `pip uninstall -y dataclasses`.
+
 If you need pointers on setting up an appropriate Python environment or would like to install AllenNLP using a different method, see below.
 
 We support AllenNLP on Mac and Linux environments. We presently do not support Windows but are open to contributions.

allennlp/data/tokenizers/pretrained_transformer_tokenizer.py (-1)

@@ -351,7 +351,6 @@ def _intra_word_tokenize(
             return_tensors=None,
             return_offsets_mapping=False,
             return_attention_mask=False,
-            return_token_type_ids=False,
         )
         wp_ids = wordpieces["input_ids"]

allennlp/interpret/attackers/hotflip.py (+1 -1)

@@ -361,7 +361,7 @@ def _first_order_taylor(self, grad: numpy.ndarray, token_idx: torch.Tensor, sign
             # This happens when we've truncated our fake embedding matrix. We need to do a dot
             # product with the word vector of the current token; if that token is out of
             # vocabulary for our truncated matrix, we need to run it through the embedding layer.
-            inputs = self._make_embedder_input([self.vocab.get_token_from_index(token_idx)])
+            inputs = self._make_embedder_input([self.vocab.get_token_from_index(token_idx.item())])
             word_embedding = self.embedding_layer(inputs)[0]
         else:
             word_embedding = torch.nn.functional.embedding(
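A note on the fix above: the vocabulary lookup expects a plain Python `int`, and a zero-dimensional `torch.Tensor` does not behave like one when used as a dictionary key, which is why the call now goes through `.item()`. A minimal, standalone sketch of the failure mode (the dict below is just a stand-in for a vocabulary's index-to-token map, not AllenNLP code):

```python
import torch

# Stand-in for a vocabulary's index-to-token mapping, keyed by plain Python ints.
index_to_token = {0: "[PAD]", 1: "the", 2: "cat"}

token_idx = torch.tensor(1)  # zero-dimensional tensor, e.g. produced by tensor indexing or argmax

# A tensor hashes by identity rather than by value, so the lookup misses the int key.
print(index_to_token.get(token_idx))         # None
print(index_to_token.get(token_idx.item()))  # "the"
```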

allennlp/interpret/saliency_interpreters/integrated_gradient.py (+22 -8)

@@ -2,6 +2,7 @@
 from typing import List, Dict, Any
 
 import numpy
+import torch
 
 from allennlp.common.util import JsonDict, sanitize
 from allennlp.data import Instance
@@ -38,7 +39,7 @@ def saliency_interpret_from_json(self, inputs: JsonDict) -> JsonDict:
 
         return sanitize(instances_with_grads)
 
-    def _register_forward_hook(self, alpha: int, embeddings_list: List):
+    def _register_hooks(self, alpha: int, embeddings_list: List, token_offsets: List):
         """
         Register a forward hook on the embedding layer which scales the embeddings by alpha. Used
         for one term in the Integrated Gradients sum.
@@ -50,15 +51,23 @@ def _register_forward_hook(self, alpha: int, embeddings_list: List):
         def forward_hook(module, inputs, output):
             # Save the input for later use. Only do so on first call.
             if alpha == 0:
-                embeddings_list.append(output.squeeze(0).clone().detach().numpy())
+                embeddings_list.append(output.squeeze(0).clone().detach())
 
             # Scale the embedding by alpha
             output.mul_(alpha)
 
-        # Register the hook
+        def get_token_offsets(module, inputs, outputs):
+            offsets = util.get_token_offsets_from_text_field_inputs(inputs)
+            if offsets is not None:
+                token_offsets.append(offsets)
+
+        # Register the hooks
+        handles = []
         embedding_layer = util.find_embedding_layer(self.predictor._model)
-        handle = embedding_layer.register_forward_hook(forward_hook)
-        return handle
+        handles.append(embedding_layer.register_forward_hook(forward_hook))
+        text_field_embedder = util.find_text_field_embedder(self.predictor._model)
+        handles.append(text_field_embedder.register_forward_hook(get_token_offsets))
+        return handles
 
     def _integrate_gradients(self, instance: Instance) -> Dict[str, numpy.ndarray]:
         """
@@ -67,18 +76,21 @@ def _integrate_gradients(self, instance: Instance) -> Dict[str, numpy.ndarray]:
         ig_grads: Dict[str, Any] = {}
 
         # List of Embedding inputs
-        embeddings_list: List[numpy.ndarray] = []
+        embeddings_list: List[torch.Tensor] = []
+        token_offsets: List[torch.Tensor] = []
 
         # Use 10 terms in the summation approximation of the integral in integrated grad
         steps = 10
 
         # Exclude the endpoint because we do a left point integral approximation
         for alpha in numpy.linspace(0, 1.0, num=steps, endpoint=False):
+            handles = []
             # Hook for modifying embedding value
-            handle = self._register_forward_hook(alpha, embeddings_list)
+            handles = self._register_hooks(alpha, embeddings_list, token_offsets)
 
             grads = self.predictor.get_gradients([instance])[0]
-            handle.remove()
+            for handle in handles:
+                handle.remove()
 
             # Running sum of gradients
             if ig_grads == {}:
@@ -93,6 +105,8 @@ def _integrate_gradients(self, instance: Instance) -> Dict[str, numpy.ndarray]:
 
         # Gradients come back in the reverse order that they were sent into the network
         embeddings_list.reverse()
+        token_offsets.reverse()
+        embeddings_list = self._aggregate_token_embeddings(embeddings_list, token_offsets)
 
         # Element-wise multiply average gradient by the input
         for idx, input_embedding in enumerate(embeddings_list):
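The change above swaps a single embedding-layer hook for a pair of forward hooks whose handles are removed together after each gradient pass. Below is a minimal, self-contained sketch of that register-then-remove pattern on a toy PyTorch module; the model and names here are illustrative and not part of the AllenNLP API:

```python
import torch

class ToyModel(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.embedding = torch.nn.Embedding(10, 4)
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.linear(self.embedding(token_ids)).sum()

model = ToyModel()
captured_embeddings = []

def make_hook(alpha: float):
    # Forward hook: save a detached copy of the embedding output, then scale it in place,
    # mirroring how the interpreter perturbs the input once per alpha step.
    def hook(module, inputs, output):
        captured_embeddings.append(output.detach().clone())
        output.mul_(alpha)
    return hook

handles = [model.embedding.register_forward_hook(make_hook(alpha=0.5))]
loss = model(torch.tensor([[1, 2, 3]]))
loss.backward()
for handle in handles:
    handle.remove()  # remove every handle so later forward passes are unaffected
```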

allennlp/interpret/saliency_interpreters/saliency_interpreter.py (+32)

@@ -1,5 +1,11 @@
+from typing import List
+
+import numpy
+import torch
+
 from allennlp.common import Registrable
 from allennlp.common.util import JsonDict
+from allennlp.nn import util
 from allennlp.predictors import Predictor
 
 
@@ -30,3 +36,29 @@ def saliency_interpret_from_json(self, inputs: JsonDict) -> JsonDict:
         `{grad_input_1: ..., grad_input_2: ... }`.
         """
         raise NotImplementedError("Implement this for saliency interpretations")
+
+    @staticmethod
+    def _aggregate_token_embeddings(
+        embeddings_list: List[torch.Tensor], token_offsets: List[torch.Tensor]
+    ) -> List[numpy.ndarray]:
+        if len(token_offsets) == 0:
+            return [embeddings.numpy() for embeddings in embeddings_list]
+        aggregated_embeddings = []
+        # NOTE: This is assuming that embeddings and offsets come in the same order, which may not
+        # be true. But, the intersection of using multiple TextFields with mismatched indexers is
+        # currently zero, so we'll delay handling this corner case until it actually causes a
+        # problem. In practice, both of these lists will always be of size one at the moment.
+        for embeddings, offsets in zip(embeddings_list, token_offsets):
+            span_embeddings, span_mask = util.batched_span_select(embeddings.contiguous(), offsets)
+            span_mask = span_mask.unsqueeze(-1)
+            span_embeddings *= span_mask  # zero out paddings
+
+            span_embeddings_sum = span_embeddings.sum(2)
+            span_embeddings_len = span_mask.sum(2)
+            # Shape: (batch_size, num_orig_tokens, embedding_size)
+            embeddings = span_embeddings_sum / torch.clamp_min(span_embeddings_len, 1)
+
+            # All the places where the span length is zero, write in zeros.
+            embeddings[(span_embeddings_len == 0).expand(embeddings.shape)] = 0
+            aggregated_embeddings.append(embeddings.numpy())
+        return aggregated_embeddings
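The helper above turns wordpiece-level embeddings back into token-level embeddings by averaging each token's span of wordpieces, using `batched_span_select` to handle batching and padding. The same idea, written out naively for a single unbatched example (toy shapes and offsets, purely illustrative):

```python
import torch

# Wordpiece embeddings for one sequence: (num_wordpieces=5, embedding_size=3).
wordpiece_embeddings = torch.arange(15.0).view(5, 3)

# Inclusive (start, end) wordpiece offsets per original token:
# token 0 -> wordpieces 0-1, token 1 -> wordpiece 2, token 2 -> wordpieces 3-4.
offsets = [(0, 1), (2, 2), (3, 4)]

# Average each token's span of wordpieces into a single token embedding.
token_embeddings = torch.stack(
    [wordpiece_embeddings[start : end + 1].mean(dim=0) for start, end in offsets]
)
print(token_embeddings.shape)  # torch.Size([3, 3]) -- one embedding per original token
```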

allennlp/interpret/saliency_interpreters/simple_gradient.py (+25 -11)

@@ -2,6 +2,7 @@
 
 from typing import List
 import numpy
+import torch
 
 from allennlp.common.util import JsonDict, sanitize
 from allennlp.interpret.saliency_interpreters.saliency_interpreter import SaliencyInterpreter
@@ -21,44 +22,57 @@ def saliency_interpret_from_json(self, inputs: JsonDict) -> JsonDict:
         """
         labeled_instances = self.predictor.json_to_labeled_instances(inputs)
 
-        # List of embedding inputs, used for multiplying gradient by the input for normalization
-        embeddings_list: List[numpy.ndarray] = []
-
         instances_with_grads = dict()
         for idx, instance in enumerate(labeled_instances):
+            # List of embedding inputs, used for multiplying gradient by the input for normalization
+            embeddings_list: List[torch.Tensor] = []
+            token_offsets: List[torch.Tensor] = []
+
             # Hook used for saving embeddings
-            handle = self._register_forward_hook(embeddings_list)
+            handles = self._register_hooks(embeddings_list, token_offsets)
             grads = self.predictor.get_gradients([instance])[0]
-            handle.remove()
+            for handle in handles:
+                handle.remove()
 
             # Gradients come back in the reverse order that they were sent into the network
             embeddings_list.reverse()
+            token_offsets.reverse()
+            embeddings_list = self._aggregate_token_embeddings(embeddings_list, token_offsets)
+
             for key, grad in grads.items():
                 # Get number at the end of every gradient key (they look like grad_input_[int],
                 # we're getting this [int] part and subtracting 1 for zero-based indexing).
                 # This is then used as an index into the reversed input array to match up the
                 # gradient and its respective embedding.
                 input_idx = int(key[-1]) - 1
                 # The [0] here is undo-ing the batching that happens in get_gradients.
-                emb_grad = numpy.sum(grad[0] * embeddings_list[input_idx], axis=1)
+                emb_grad = numpy.sum(grad[0] * embeddings_list[input_idx][0], axis=1)
                 norm = numpy.linalg.norm(emb_grad, ord=1)
                 normalized_grad = [math.fabs(e) / norm for e in emb_grad]
                 grads[key] = normalized_grad
 
             instances_with_grads["instance_" + str(idx + 1)] = grads
         return sanitize(instances_with_grads)
 
-    def _register_forward_hook(self, embeddings_list: List):
+    def _register_hooks(self, embeddings_list: List, token_offsets: List):
         """
         Finds all of the TextFieldEmbedders, and registers a forward hook onto them. When forward()
         is called, embeddings_list is filled with the embedding values. This is necessary because
         our normalization scheme multiplies the gradient by the embedding value.
         """
 
         def forward_hook(module, inputs, output):
-            embeddings_list.append(output.squeeze(0).clone().detach().numpy())
+            embeddings_list.append(output.squeeze(0).clone().detach())
 
-        embedding_layer = util.find_embedding_layer(self.predictor._model)
-        handle = embedding_layer.register_forward_hook(forward_hook)
+        def get_token_offsets(module, inputs, outputs):
+            offsets = util.get_token_offsets_from_text_field_inputs(inputs)
+            if offsets is not None:
+                token_offsets.append(offsets)
 
-        return handle
+        # Register the hooks
+        handles = []
+        embedding_layer = util.find_embedding_layer(self.predictor._model)
+        handles.append(embedding_layer.register_forward_hook(forward_hook))
+        text_field_embedder = util.find_text_field_embedder(self.predictor._model)
+        handles.append(text_field_embedder.register_forward_hook(get_token_offsets))
+        return handles
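For reference, the saliency score `SimpleGradient` assigns to each token is the gradient-times-embedding dot product, L1-normalized over the tokens so the scores sum to one. A small numpy sketch of just that normalization step, on made-up arrays:

```python
import numpy

# Toy per-token gradients and (aggregated) embeddings: (num_tokens=3, embedding_size=4).
grad = numpy.random.rand(3, 4)
embeddings = numpy.random.rand(3, 4)

emb_grad = numpy.sum(grad * embeddings, axis=1)  # one dot product per token
norm = numpy.linalg.norm(emb_grad, ord=1)        # L1 norm across tokens
saliency = numpy.abs(emb_grad) / norm            # each token's share of the total attribution
print(saliency, saliency.sum())                  # scores sum to ~1.0
```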

allennlp/modules/text_field_embedders/basic_text_field_embedder.py (+17 -2)

@@ -9,6 +9,7 @@
 from allennlp.modules.text_field_embedders.text_field_embedder import TextFieldEmbedder
 from allennlp.modules.time_distributed import TimeDistributed
 from allennlp.modules.token_embedders.token_embedder import TokenEmbedder
+from allennlp.modules.token_embedders import EmptyEmbedder
 
 
 @TextFieldEmbedder.register("basic")
@@ -53,18 +54,32 @@ def get_output_dim(self) -> int:
     def forward(
         self, text_field_input: TextFieldTensors, num_wrapping_dims: int = 0, **kwargs
     ) -> torch.Tensor:
-        if self._token_embedders.keys() != text_field_input.keys():
+        if sorted(self._token_embedders.keys()) != sorted(text_field_input.keys()):
             message = "Mismatched token keys: %s and %s" % (
                 str(self._token_embedders.keys()),
                 str(text_field_input.keys()),
             )
-            raise ConfigurationError(message)
+            embedder_keys = set(self._token_embedders.keys())
+            input_keys = set(text_field_input.keys())
+            if embedder_keys > input_keys and all(
+                isinstance(embedder, EmptyEmbedder)
+                for name, embedder in self._token_embedders.items()
+                if name in embedder_keys - input_keys
+            ):
+                # Allow extra embedders that are only in the token embedders (but not input) and are empty to pass
+                # config check
+                pass
+            else:
+                raise ConfigurationError(message)
 
         embedded_representations = []
         for key in self._ordered_embedder_keys:
             # Note: need to use getattr here so that the pytorch voodoo
             # with submodules works with multiple GPUs.
             embedder = getattr(self, "token_embedder_{}".format(key))
+            if isinstance(embedder, EmptyEmbedder):
+                # Skip empty embedders
+                continue
             forward_params = inspect.signature(embedder.forward).parameters
             forward_params_values = {}
             missing_tensor_args = set()
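The new key check above can be summarized as: the embedder keys may be a strict superset of the input keys, but only if every extra embedder is an `EmptyEmbedder`. Here is a standalone sketch of just that set logic, with stand-in classes rather than the real AllenNLP embedders:

```python
class EmptyEmbedder:
    """Stand-in for allennlp.modules.token_embedders.EmptyEmbedder."""

class RealEmbedder:
    """Stand-in for any embedder that actually produces vectors."""

def keys_are_compatible(token_embedders: dict, text_field_input: dict) -> bool:
    embedder_keys = set(token_embedders)
    input_keys = set(text_field_input)
    if embedder_keys == input_keys:
        return True
    # Tolerate extra embedder keys only when every extra embedder is empty.
    return embedder_keys > input_keys and all(
        isinstance(token_embedders[name], EmptyEmbedder)
        for name in embedder_keys - input_keys
    )

inputs = {"tokens": {"token_ids": None}}
assert keys_are_compatible({"tokens": RealEmbedder(), "unused": EmptyEmbedder()}, inputs)
assert not keys_are_compatible({"tokens": RealEmbedder(), "extra": RealEmbedder()}, inputs)
```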
