Struct tokenizers::pre_tokenizers::byte_level::ByteLevel
#[non_exhaustive]
pub struct ByteLevel {
    pub add_prefix_space: bool,
    pub trim_offsets: bool,
    pub use_regex: bool,
}
Provides all the steps needed to handle BPE tokenization at the byte level. It takes care of the processing required to transform a UTF-8 string before and after the BPE model does its job.
Fields (Non-exhaustive)
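This struct is marked as non-exhaustive. Non-exhaustive structs could have additional fields added in future. Therefore, non-exhaustive structs cannot be constructed in external crates using the traditional Struct { .. } syntax; cannot be matched against without a wildcard ..; and struct update syntax will not work.
add_prefix_space: bool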
Whether to add a leading space to the first word. This allows the leading word to be treated just like any other word.
trim_offsets: bool
Whether the post-processing step should trim offsets to avoid including whitespace.
use_regex: bool
Whether to use the standard GPT2 regex for whitespace splitting. Set it to false if you want to use your own splitting.
Implementations
Trait Implementations
impl Decoder for ByteLevel
As a Decoder, ByteLevel is in charge of converting any byte-level characters to their Unicode counterpart, before merging everything back into a single String. This decoder consumes the tokens and merges them in one step, because a single decoded token might be a byte sequence that is not representable as a String.
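The merge-then-decode behaviour described above can be illustrated with a standalone, std-only sketch. It is not the crate's implementation: it assumes the GPT-2 byte-to-character table (printable bytes map to themselves, all other bytes map to code points starting at U+0100), and the names chars_to_bytes and decode_tokens are hypothetical. The fallback to a token's raw UTF-8 bytes for unmapped characters is also an assumption made for this sketch.

```rust
use std::collections::HashMap;

/// Builds the reverse of the GPT-2 style table: stand-in `char` -> original byte.
/// (Assumed scheme; hypothetical helper, not part of the crate's public API.)
fn chars_to_bytes() -> HashMap<char, u8> {
    let mut table = HashMap::with_capacity(256);
    let mut next = 0u32; // counts bytes that need a substitute code point
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code = if printable {
            b
        } else {
            let c = 256 + next;
            next += 1;
            c
        };
        let ch = char::from_u32(code).expect("code point is a valid scalar value");
        table.insert(ch, b as u8);
    }
    table
}

/// Maps every token back to raw bytes, concatenates them, and only then
/// converts to a `String`, so multi-byte characters split across tokens survive.
fn decode_tokens(tokens: &[&str]) -> String {
    let table = chars_to_bytes();
    let mut bytes: Vec<u8> = Vec::new();
    for token in tokens {
        // `None` if any char of the token is missing from the table.
        let mapped: Option<Vec<u8>> =
            token.chars().map(|c| table.get(&c).copied()).collect();
        match mapped {
            Some(decoded) => bytes.extend(decoded),
            // Assumption for this sketch: keep unmapped tokens as their UTF-8 bytes.
            None => bytes.extend_from_slice(token.as_bytes()),
        }
    }
    String::from_utf8_lossy(&bytes).into_owned()
}
```

For instance, the character é is the two bytes 0xC3 0xA9, which appear at the byte level as Ã and ©. Neither byte is valid UTF-8 on its own, but decode_tokens(&["Ã", "©"]) joins the bytes first and so can recover the original character.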
impl<'de> Deserialize<'de> for ByteLevel
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
impl From<ByteLevel> for DecoderWrapper
impl From<ByteLevel> for PostProcessorWrapper
impl From<ByteLevel> for PreTokenizerWrapper
impl PartialEq for ByteLevel
impl PostProcessor for ByteLevel
As a PostProcessor, ByteLevel is in charge of trimming the offsets if necessary.
fn added_tokens(&self, _is_pair: bool) -> usize
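To make the idea of offset trimming concrete, here is a simplified, std-only sketch. It is not the crate's algorithm: it works directly on the original text rather than on byte-level tokens, ignores the add_prefix_space interaction, and the function name trim_offsets is hypothetical. Offsets are assumed to be byte offsets that fall on character boundaries.

```rust
/// Shrinks a `(start, end)` byte span of `text` so that it no longer covers
/// leading or trailing whitespace. Simplified illustration only.
fn trim_offsets(text: &str, (start, end): (usize, usize)) -> (usize, usize) {
    let span = &text[start..end];
    // Number of whitespace bytes at each end of the span.
    let leading = span.len() - span.trim_start().len();
    let trailing = span.len() - span.trim_end().len();
    // Clamp so that an all-whitespace span collapses to an empty span.
    let new_start = (start + leading).min(end);
    let new_end = (end - trailing).max(new_start);
    (new_start, new_end)
}
```

With the text "Hello world", a token covering " world" has offsets (5, 11). After trimming, the span becomes (6, 11), which covers only "world".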
impl PreTokenizer for ByteLevel
As a PreTokenizer, ByteLevel is in charge of transforming all the Unicode characters into their byte-level counterpart. It also splits the input according to the configured regex.
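The byte-level transformation can be sketched in isolation with the standard library alone. This is an illustration under stated assumptions, not the crate's code: it assumes the GPT-2 byte-to-character table, skips the regex splitting step entirely, and uses the hypothetical names bytes_to_chars and to_byte_level. The rule of adding the prefix space only when the input does not already start with one is likewise an assumption of this sketch.

```rust
/// Builds a 256-entry table mapping each byte to a printable `char`,
/// following the GPT-2 scheme (assumed here): printable bytes map to
/// themselves, the rest are assigned code points from U+0100 upward.
fn bytes_to_chars() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next = 0u32; // counts bytes that need a substitute code point
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code = if printable {
            b
        } else {
            let c = 256 + next;
            next += 1;
            c
        };
        table[b as usize] = char::from_u32(code).expect("code point is a valid scalar value");
    }
    table
}

/// Rewrites `input` so that every UTF-8 byte becomes one printable `char`.
/// Regex-based splitting is intentionally left out of this sketch.
fn to_byte_level(input: &str, add_prefix_space: bool) -> String {
    let table = bytes_to_chars();
    let mut text = String::new();
    // Assumption: only add the space if the input does not already start with one.
    if add_prefix_space && !input.starts_with(' ') {
        text.push(' ');
    }
    text.push_str(input);
    text.bytes().map(|b| table[b as usize]).collect()
}
```

Under this mapping a space (byte 0x20) becomes Ġ, so to_byte_level("Hello world", false) yields "HelloĠworld", and enabling add_prefix_space turns "Hello" into "ĠHello", which lets the first word look like any other space-prefixed word.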