Trait tokenizers::tokenizer::PreTokenizer
pub trait PreTokenizer {
    // Required method
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>;
}
The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of those substrings relative to the NormalizedString. In some cases, the PreTokenizer might need to modify the given NormalizedString so that the offsets, and the mapping back to the original string, can be tracked in full.
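As an illustration of the contract described above, here is a minimal, self-contained sketch of a whitespace-splitting pre-tokenizer. It does not use the crate's real types: the `PreTokenizedString` and `Result` defined here are simplified stand-ins, and `SpaceSplitter` and `splits_with_offsets` are hypothetical names introduced only for this example. The trait shape matches the signature shown on this page.

```rust
// Stand-in for the crate's `Result` alias (assumption for this sketch).
type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Simplified stand-in for the crate's `PreTokenizedString`: the original
/// text plus the current splits, stored as (start, end) byte ranges.
pub struct PreTokenizedString {
    original: String,
    splits: Vec<(usize, usize)>,
}

impl PreTokenizedString {
    /// Starts with a single split covering the whole input.
    pub fn new(s: &str) -> Self {
        Self { original: s.to_string(), splits: vec![(0, s.len())] }
    }

    /// Hypothetical accessor: each split with its offsets in the original.
    pub fn splits_with_offsets(&self) -> Vec<(&str, (usize, usize))> {
        self.splits
            .iter()
            .map(|&(a, b)| (&self.original[a..b], (a, b)))
            .collect()
    }
}

/// Same shape as the trait documented on this page.
pub trait PreTokenizer {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>;
}

/// Hypothetical pre-tokenizer: splits on whitespace and drops the whitespace.
pub struct SpaceSplitter;

impl PreTokenizer for SpaceSplitter {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        let mut new_splits = Vec::new();
        for &(start, end) in &pretokenized.splits {
            let piece = &pretokenized.original[start..end];
            let mut word_start: Option<usize> = None;
            for (i, c) in piece.char_indices() {
                if c.is_whitespace() {
                    // Close the current word, if any, at this whitespace.
                    if let Some(ws) = word_start.take() {
                        new_splits.push((start + ws, start + i));
                    }
                } else if word_start.is_none() {
                    word_start = Some(i);
                }
            }
            // Flush a word that runs to the end of the piece.
            if let Some(ws) = word_start {
                new_splits.push((start + ws, end));
            }
        }
        pretokenized.splits = new_splits;
        Ok(())
    }
}
```

Storing every split as a byte range into the untouched original is what keeps the offsets recoverable after splitting. The real `PreTokenizedString` tracks this through `NormalizedString` alignments rather than plain ranges.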
Required Methods
fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>
Implementors
impl PreTokenizer for PreTokenizerWrapper
impl PreTokenizer for BertPreTokenizer
impl PreTokenizer for ByteLevel
As a PreTokenizer, ByteLevel is in charge of transforming all the Unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.
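To make "byte-level counterparts" concrete, the sketch below builds a GPT-2 style byte-to-character table and applies it to a string. It assumes ByteLevel uses this GPT-2 mapping. The function names `bytes_to_chars` and `byte_level` are hypothetical and are not the crate's API, and the regex-based splitting step is left out.

```rust
use std::collections::HashMap;

/// Builds a GPT-2 style byte -> char table (assumed mapping for this sketch).
/// Printable bytes map to the code point with the same value; all other
/// bytes are assigned consecutive code points starting at U+0100.
pub fn bytes_to_chars() -> HashMap<u8, char> {
    let mut table = HashMap::new();
    let mut next = 0u32;
    for b in 0u8..=255 {
        let printable = matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
        let c = if printable {
            b as char
        } else {
            // 256 + next stays well below the surrogate range, so this is valid.
            let shifted = char::from_u32(256 + next).unwrap();
            next += 1;
            shifted
        };
        table.insert(b, c);
    }
    table
}

/// Rewrites every UTF-8 byte of `s` as its visible stand-in character.
pub fn byte_level(s: &str) -> String {
    let table = bytes_to_chars();
    s.bytes().map(|b| table[&b]).collect()
}
```

Under this mapping a space (byte 0x20) becomes `Ġ` (U+0120), and a multi-byte character such as `é` (bytes 0xC3 0xA9) becomes the two characters `Ã©`. Every possible input byte therefore has a printable representation.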