Trait tokenizers::tokenizer::PreTokenizer
pub trait PreTokenizer {
// Required method
fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>;
}
The PreTokenizer is in charge of the pre-segmentation step. It splits the given string
into multiple substrings, keeping track of the offsets of those substrings relative to the
NormalizedString. On some occasions, the PreTokenizer might need to modify the given
NormalizedString so that the offsets and the mapping to the original string can be
tracked in full.
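To illustrate what "splitting while keeping track of offsets" means, here is a minimal, std-only sketch. It does not use the crate's PreTokenizedString or NormalizedString types; the Piece struct and the whitespace_pre_tokenize function are hypothetical names introduced only for this example, and splitting on whitespace is an assumed rule chosen for simplicity.

```rust
/// Hypothetical stand-in for one pre-tokenized substring: the text plus
/// its byte offsets `[start, end)` in the original input.
#[derive(Debug, PartialEq)]
struct Piece<'a> {
    text: &'a str,
    start: usize,
    end: usize,
}

/// Simplified illustration of a pre-segmentation step (not the crate's API):
/// split on whitespace and record where each substring came from.
fn whitespace_pre_tokenize(input: &str) -> Vec<Piece<'_>> {
    let mut pieces = Vec::new();
    // Byte index where the current non-whitespace run began, if any.
    let mut run_start: Option<usize> = None;

    for (i, ch) in input.char_indices() {
        if ch.is_whitespace() {
            // A whitespace character closes the current run.
            if let Some(s) = run_start.take() {
                pieces.push(Piece { text: &input[s..i], start: s, end: i });
            }
        } else if run_start.is_none() {
            run_start = Some(i);
        }
    }

    // Flush a run that extends to the end of the input.
    if let Some(s) = run_start {
        pieces.push(Piece { text: &input[s..], start: s, end: input.len() });
    }
    pieces
}
```

Because each Piece carries its offsets, a caller can always recover the substring from the original input as `&input[piece.start..piece.end]`, which is the property the trait's description asks implementors to preserve.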
Required Methods
fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>
Implementors
impl PreTokenizer for PreTokenizerWrapper
impl PreTokenizer for BertPreTokenizer
impl PreTokenizer for ByteLevel
As a PreTokenizer, ByteLevel is in charge of transforming all the Unicode characters into
their byte-level counterparts. It also splits the input according to the configured regex.
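The byte-level transformation can be sketched as follows. This is a std-only illustration that assumes the GPT-2 style byte-to-character table (printable bytes map to themselves, all others are shifted to code points starting at U+0100); the function names bytes_to_chars and to_byte_level are hypothetical, and the regex-based splitting is deliberately left out.

```rust
use std::collections::HashMap;

/// Build a byte -> visible-character table, assuming the GPT-2 style scheme:
/// bytes that are already printable keep their own code point, and every
/// other byte is assigned the next free code point starting at 256.
fn bytes_to_chars() -> HashMap<u8, char> {
    let mut table = HashMap::new();
    let mut next = 0u32; // how many non-printable bytes have been remapped so far

    for b in 0u32..=255 {
        let printable = (33..=126).contains(&b)
            || (161..=172).contains(&b)
            || (174..=255).contains(&b);
        let code = if printable {
            b
        } else {
            let shifted = 256 + next;
            next += 1;
            shifted
        };
        table.insert(b as u8, char::from_u32(code).expect("valid scalar value"));
    }
    table
}

/// Replace every UTF-8 byte of `input` with its table character.
fn to_byte_level(input: &str) -> String {
    let table = bytes_to_chars();
    input.bytes().map(|b| table[&b]).collect()
}
```

Under this assumed table, a space (byte 32) becomes `Ġ` (U+0120), and a multi-byte character such as `é` (bytes 0xC3 0xA9) becomes the two characters `Ã©`, so every input byte ends up as exactly one visible character.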