Module tokenizers::processors::template
§Template Processing
Provides a way to specify templates in order to add the special tokens to each input sequence as relevant.
§Example
Let’s take the BERT tokenizer as an example. It uses two special tokens to delimit each sequence. [CLS] is always placed at the beginning of the first sequence, and [SEP] is added at the end of both the first sequence and the pair sequence. The final result looks like this:
- Single sequence:
[CLS] Hello there [SEP]
- Pair sequences:
[CLS] My name is Anthony [SEP] What is my name? [SEP]
With the following type ids:
[CLS]   ...   [SEP]   ...   [SEP]
  0      0      0      1      1
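As a rough illustration of the layout above, the following standalone sketch builds the token list and the type ids by hand. It does not use the tokenizers crate: the function name `bert_layout` and the pre-split word inputs are assumptions made only for this example.

```rust
/// Hypothetical helper: lays out BERT-style special tokens around one or two
/// already-split sequences and returns (tokens, type_ids).
fn bert_layout(first: &[&str], second: Option<&[&str]>) -> (Vec<String>, Vec<u32>) {
    let mut tokens = Vec::new();
    let mut type_ids = Vec::new();

    // [CLS], the first sequence, and its closing [SEP] all get type id 0.
    tokens.push("[CLS]".to_string());
    type_ids.push(0);
    for word in first {
        tokens.push((*word).to_string());
        type_ids.push(0);
    }
    tokens.push("[SEP]".to_string());
    type_ids.push(0);

    // The pair sequence and its closing [SEP] get type id 1.
    if let Some(second) = second {
        for word in second {
            tokens.push((*word).to_string());
            type_ids.push(1);
        }
        tokens.push("[SEP]".to_string());
        type_ids.push(1);
    }

    (tokens, type_ids)
}
```

Calling `bert_layout(&["Hello", "there"], None)` yields the single-sequence form, and passing a second slice yields the pair form with type id 1 on the trailing part.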
So, we can define a TemplateProcessing that will achieve this result:
use tokenizers::processors::template::TemplateProcessing;

let template = TemplateProcessing::builder()
    // The template when we only have a single sequence:
    .try_single(vec!["[CLS]", "$0", "[SEP]"]).unwrap()
    // Same as:
    .try_single("[CLS] $0 [SEP]").unwrap()
    // The template when we have both sequences:
    .try_pair(vec!["[CLS]:0", "$A:0", "[SEP]:0", "$B:1", "[SEP]:1"]).unwrap()
    // Same as:
    .try_pair("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1").unwrap()
    // Or:
    .try_pair("[CLS] $0 [SEP] $B:1 [SEP]:1").unwrap()
    // The list of special tokens used by each sequence
    .special_tokens(vec![("[CLS]", 1), ("[SEP]", 0)])
    .build()
    .unwrap();
In this example, each input sequence is identified using a $ construct. This identifier lets us specify each input sequence, and the type_id to use. When nothing is specified, it uses the default values. Here are the different ways to specify it:
- Specifying the sequence, with default type_id == 0: $A or $B
- Specifying the type_id with default sequence == A: $0, $1, $2, …
- Specifying both: $A:0, $B:1, …
The same construct is used for special tokens: <identifier>(:<type_id>)?.
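To make the piece syntax concrete, here is a minimal standalone parser sketch for the forms listed above. It is not the crate's implementation: the names `SequenceId`, `ParsedPiece`, and `parse_piece` are hypothetical, and the sketch assumes that a special token identifier never contains a `:` other than the one introducing its type_id.

```rust
#[derive(Debug, PartialEq)]
enum SequenceId {
    A,
    B,
}

#[derive(Debug, PartialEq)]
enum ParsedPiece {
    Sequence { id: SequenceId, type_id: u32 },
    SpecialToken { id: String, type_id: u32 },
}

/// Hypothetical parser for `$A`, `$B`, `$<n>`, `$A:<n>`, `$B:<n>`
/// and `<identifier>(:<type_id>)?`.
fn parse_piece(s: &str) -> Result<ParsedPiece, String> {
    // Split off the optional ":<type_id>" suffix.
    let (ident, explicit_type_id) = match s.rsplit_once(':') {
        Some((head, tail)) => {
            let type_id = tail
                .parse::<u32>()
                .map_err(|_| format!("invalid type_id in {:?}", s))?;
            (head, Some(type_id))
        }
        None => (s, None),
    };

    if let Some(name) = ident.strip_prefix('$') {
        match (name, explicit_type_id) {
            // Sequence given explicitly; type_id defaults to 0.
            ("A", t) => Ok(ParsedPiece::Sequence { id: SequenceId::A, type_id: t.unwrap_or(0) }),
            ("B", t) => Ok(ParsedPiece::Sequence { id: SequenceId::B, type_id: t.unwrap_or(0) }),
            // `$<n>`: only the type_id is given; the sequence defaults to A.
            (n, None) => n
                .parse::<u32>()
                .map(|type_id| ParsedPiece::Sequence { id: SequenceId::A, type_id })
                .map_err(|_| format!("invalid sequence identifier in {:?}", s)),
            (_, Some(_)) => Err(format!("unsupported form in this sketch: {:?}", s)),
        }
    } else if ident.is_empty() {
        Err(format!("empty identifier in {:?}", s))
    } else {
        // Special token; type_id defaults to 0.
        Ok(ParsedPiece::SpecialToken { id: ident.to_string(), type_id: explicit_type_id.unwrap_or(0) })
    }
}
```

Under these assumptions, `parse_piece("$1")` resolves to sequence A with type_id 1, while `parse_piece("[SEP]:1")` resolves to the special token `[SEP]` with type_id 1.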
Warning: You must ensure that you are giving the correct tokens/ids as these will be added to the Encoding without any further check. If the given ids correspond to something totally different in a Tokenizer using this PostProcessor, it might lead to unexpected results.
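Since the processor does not perform this check itself, one option is to verify the pairs before building the template. The sketch below is an assumption about how such a check could look, not part of the crate: `mismatched_special_tokens` is a hypothetical helper, and the `vocab` map stands in for whatever vocabulary the Tokenizer actually uses.

```rust
use std::collections::HashMap;

/// Hypothetical pre-flight check: returns the special tokens whose id is
/// missing from `vocab` or differs from the id stored there.
fn mismatched_special_tokens(
    vocab: &HashMap<String, u32>,
    special_tokens: &[(&str, u32)],
) -> Vec<String> {
    special_tokens
        .iter()
        // Keep only the pairs that the vocabulary does not confirm.
        .filter(|(token, id)| vocab.get(*token) != Some(id))
        .map(|(token, _)| token.to_string())
        .collect()
}
```

An empty result means every (token, id) pair agrees with the vocabulary; any returned token should be corrected before it is passed to special_tokens.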
Structs§
- SpecialToken: Represents a bunch of tokens to be used in a template. Usually, special tokens have only one associated id/token, but in some cases it can be useful to have multiple ids/tokens.
- Template: A Template represents a Vec<Piece>.
- TemplateProcessing: This PostProcessor takes care of processing each input Encoding by applying the corresponding template, before merging them in the final Encoding.
- TemplateProcessingBuilder: Builder for TemplateProcessing.
- Tokens: A bunch of SpecialToken represented by their ID. Internally, Tokens is a HashMap<String, SpecialToken> and can be built from a HashMap or a Vec<SpecialToken>.
Enums§
- Piece: Represents the different kinds of pieces that constitute a template. It can be either the input sequence or a SpecialToken.
- Sequence: Represents any sequences received as input of the PostProcessor.
- TemplateProcessingBuilderError: Error type for TemplateProcessingBuilder.