This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Commit 55cfb47 (1 parent: 990c9c1)

The truncation setting doesn't do anything anymore (#4672)

* The truncation setting doesn't do anything anymore
* Changelog

2 files changed: +7 −11 lines

CHANGELOG.md (+2 lines)

```diff
@@ -30,6 +30,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - `transformers` dependency updated to version 3.1.0.
 - When `cached_path` is called on a local archive with `extract_archive=True`, the archive is now extracted into a unique subdirectory of the cache root instead of a subdirectory of the archive's directory. The extraction directory is also unique to the modification time of the archive, so if the file changes, subsequent calls to `cached_path` will know to re-extract the archive.
+- Removed the `truncation_strategy` parameter to `PretrainedTransformerTokenizer`. The way we're calling the tokenizer, the truncation strategy takes no effect anyways.
 
 ### Fixed
 
@@ -46,6 +47,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Fixed a bug in our doc building script where markdown links did not render properly
 if the "href" part of the link (the part inside the `()`) was on a new line.
 
+
 ## [v1.1.0](https://github.com/allenai/allennlp/releases/tag/v1.1.0) - 2020-09-08
 
 ### Fixed
```
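The rationale behind that changelog entry is that `PretrainedTransformerTokenizer.tokenize()` only ever encodes a single sequence (its docstring, visible in the diff below, says as much), so the strategies that distinguish a first from a second sequence all collapse to the same behavior and the `truncation_strategy` choice is moot. A minimal sketch of that claim, not part of the commit, assuming a recent `transformers` release whose `encode_plus` accepts a `truncation` argument:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "a long sentence " * 50  # long enough to force truncation at max_length=32

# With only one input sequence, 'longest_first' and 'only_first' truncate the
# same way, and 'only_second' has nothing to act on, so the strategy is irrelevant.
longest_first = tokenizer.encode_plus(
    text, max_length=32, truncation="longest_first", add_special_tokens=False
)["input_ids"]
only_first = tokenizer.encode_plus(
    text, max_length=32, truncation="only_first", add_special_tokens=False
)["input_ids"]

assert longest_first == only_first
```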

allennlp/data/tokenizers/pretrained_transformer_tokenizer.py (+5 −11 lines)

```diff
@@ -44,13 +44,6 @@ class PretrainedTransformerTokenizer(Tokenizer):
     stride : `int`, optional (default=`0`)
         If set to a number along with max_length, the overflowing tokens returned will contain some tokens
         from the main sequence returned. The value of this argument defines the number of additional tokens.
-    truncation_strategy : `str`, optional (default=`'longest_first'`)
-        String selected in the following options:
-        - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length
-          starting from the longest one at each token (when there is a pair of input sequences)
-        - 'only_first': Only truncate the first sequence
-        - 'only_second': Only truncate the second sequence
-        - 'do_not_truncate': Do not truncate (raise an error if the input sequence is longer than max_length)
     tokenizer_kwargs: `Dict[str, Any]`, optional (default = `None`)
         Dictionary with
         [additional arguments](https://github.com/huggingface/transformers/blob/155c782a2ccd103cf63ad48a2becd7c76a7d2115/transformers/tokenization_utils.py#L691)
@@ -63,7 +56,6 @@ def __init__(
         add_special_tokens: bool = True,
         max_length: Optional[int] = None,
         stride: int = 0,
-        truncation_strategy: str = "longest_first",
         tokenizer_kwargs: Optional[Dict[str, Any]] = None,
     ) -> None:
         if tokenizer_kwargs is None:
@@ -82,7 +74,6 @@ def __init__(
         self._add_special_tokens = add_special_tokens
         self._max_length = max_length
         self._stride = stride
-        self._truncation_strategy = truncation_strategy
 
         self._tokenizer_lowercases = self.tokenizer_lowercases(self.tokenizer)
 
@@ -230,12 +221,15 @@ def tokenize(self, text: str) -> List[Token]:
         """
         This method only handles a single sentence (or sequence) of text.
         """
+        max_length = self._max_length
+        if max_length is not None and self._add_special_tokens:
+            max_length -= self.num_special_tokens_for_sequence()
+
         encoded_tokens = self.tokenizer.encode_plus(
             text=text,
             add_special_tokens=False,
-            max_length=self._max_length,
+            max_length=max_length,
             stride=self._stride,
-            truncation=self._truncation_strategy if self._max_length is not None else False,
             return_tensors=None,
             return_offsets_mapping=self.tokenizer.is_fast,
             return_attention_mask=False,
```
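The net effect of the last hunk: `encode_plus` is still called with `add_special_tokens=False` (AllenNLP inserts the special tokens itself afterwards), so the length budget passed to HuggingFace now has the count of those future special tokens subtracted from it instead of relying on a truncation strategy. Below is a self-contained sketch of that call pattern, not the AllenNLP implementation itself; it swaps AllenNLP's `num_special_tokens_for_sequence()` helper for the HuggingFace equivalent `num_special_tokens_to_add()` and passes `truncation` explicitly, which the commit does not:

```python
from typing import Any, Dict, Optional

from transformers import AutoTokenizer, PreTrainedTokenizerBase


def encode_reserving_special_tokens(
    tokenizer: PreTrainedTokenizerBase,
    text: str,
    max_length: Optional[int] = None,
    stride: int = 0,
    will_add_special_tokens: bool = True,
) -> Dict[str, Any]:
    """Encode `text` without special tokens, leaving room for them in max_length."""
    if max_length is not None and will_add_special_tokens:
        # Reserve space for the [CLS]/[SEP] (or <s>/</s>) tokens the caller adds later,
        # mirroring the commit's max_length adjustment.
        max_length -= tokenizer.num_special_tokens_to_add(pair=False)

    return tokenizer.encode_plus(
        text=text,
        add_special_tokens=False,           # the caller handles special tokens itself
        max_length=max_length,
        stride=stride,
        truncation=max_length is not None,  # explicit here so the sketch truncates; the commit omits this argument
        return_tensors=None,
        return_offsets_mapping=tokenizer.is_fast,
        return_attention_mask=False,
    )


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoded = encode_reserving_special_tokens(
        tok, "AllenNLP wraps HuggingFace tokenizers.", max_length=10
    )
    print(encoded["input_ids"])  # at most 10 - 2 = 8 ids for a BERT-style tokenizer
```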
