Struct tokenizers::pre_tokenizers::byte_level::ByteLevel

#[non_exhaustive]
pub struct ByteLevel {
    pub add_prefix_space: bool,
    pub trim_offsets: bool,
    pub use_regex: bool,
}

Provides the steps needed to handle BPE tokenization at the byte level. It takes care of all the processing required to transform a UTF-8 string before and after the BPE model does its job.

Fields (Non-exhaustive)§

This struct is marked as non-exhaustive
Non-exhaustive structs could have additional fields added in future. Therefore, non-exhaustive structs cannot be constructed in external crates using the traditional Struct { .. } syntax; cannot be matched against without a wildcard ..; and struct update syntax will not work.
§add_prefix_space: bool

Whether to add a leading space to the first word, so that the leading word is treated just like any other word.

§trim_offsets: bool

Whether the post-processing step should trim offsets to avoid including whitespace.

§use_regex: bool

Whether to use the standard GPT2 regex for whitespace splitting. Set it to false if you want to use your own splitting.

Implementations§

impl ByteLevel

pub fn new(add_prefix_space: bool, trim_offsets: bool, use_regex: bool) -> Self

pub fn alphabet() -> HashSet<char>

pub fn add_prefix_space(self, v: bool) -> Self

pub fn trim_offsets(self, v: bool) -> Self

pub fn use_regex(self, v: bool) -> Self
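Because the struct is non-exhaustive, code outside the crate cannot use a struct literal and has to go through `new`, `Default`, or the chainable setters listed above. The sketch below shows that calling pattern on a local stand-in type (`ByteLevelSketch`, a hypothetical name) that mirrors the signatures on this page so that it compiles without the crate. The setter bodies are an assumption about how such by-value setters are usually written, not the crate's source.

```rust
/// Hypothetical stand-in mirroring the public fields of `ByteLevel`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ByteLevelSketch {
    pub add_prefix_space: bool,
    pub trim_offsets: bool,
    pub use_regex: bool,
}

impl ByteLevelSketch {
    /// Same parameter order as the `new` signature listed on this page.
    pub fn new(add_prefix_space: bool, trim_offsets: bool, use_regex: bool) -> Self {
        Self { add_prefix_space, trim_offsets, use_regex }
    }

    /// Each setter takes `self` by value and returns it, so calls can be chained.
    #[must_use]
    pub fn add_prefix_space(mut self, v: bool) -> Self {
        self.add_prefix_space = v;
        self
    }

    #[must_use]
    pub fn trim_offsets(mut self, v: bool) -> Self {
        self.trim_offsets = v;
        self
    }

    #[must_use]
    pub fn use_regex(mut self, v: bool) -> Self {
        self.use_regex = v;
        self
    }
}
```

With the real type, the equivalent chain would read `ByteLevel::default().add_prefix_space(false)`. The default field values are not stated on this page, so check them before relying on `default()`.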

Trait Implementations§

impl Clone for ByteLevel

fn clone(&self) -> ByteLevel

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for ByteLevel

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Decoder for ByteLevel

As a Decoder, ByteLevel is in charge of converting any byte-level characters to their Unicode counterpart, before merging everything back into a single String. This decoder consumes the tokens and merges them in one step, because a single decoded token might be a byte sequence that cannot be represented as a String on its own.

fn decode_chain(&self, tokens: Vec<String>) -> Result<Vec<String>>

fn decode(&self, tokens: Vec<String>) -> Result<String>
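The reason for merging before decoding can be seen in a small standard-library sketch. It assumes the byte-to-character table published with GPT-2 (this page does not confirm that `ByteLevel` uses exactly this table), and `char_to_byte_table` and `decode_merged` are hypothetical helper names, not crate functions.

```rust
use std::collections::HashMap;

/// Inverse of the GPT-2 style byte-to-char table (char -> original byte).
fn char_to_byte_table() -> HashMap<char, u8> {
    let mut table = HashMap::with_capacity(256);
    let mut next_extra = 0u32; // counts bytes that are not kept as-is
    for b in 0u32..=255 {
        // Printable Latin-1 ranges keep their own code point.
        let keep = matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF);
        let code = if keep {
            b
        } else {
            // Every other byte is shifted to 256, 257, ... in byte order.
            let c = 256 + next_extra;
            next_extra += 1;
            c
        };
        table.insert(char::from_u32(code).expect("valid code point"), b as u8);
    }
    table
}

/// Concatenates all tokens, maps chars back to bytes, then decodes once.
/// Chars outside the table are skipped in this sketch.
fn decode_merged(tokens: &[&str]) -> String {
    let table = char_to_byte_table();
    let bytes: Vec<u8> = tokens
        .iter()
        .flat_map(|t| t.chars())
        .filter_map(|c| table.get(&c).copied())
        .collect();
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Under this table, the two tokens `Ã` and `©` stand for the bytes 0xC3 and 0xA9. Decoded together they give `é`, whereas `Ã` decoded alone is an incomplete UTF-8 sequence and comes out as the replacement character.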

impl Default for ByteLevel

fn default() -> Self

Returns the “default value” for a type.

impl<'de> Deserialize<'de> for ByteLevel

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

impl From<ByteLevel> for DecoderWrapper

fn from(from: ByteLevel) -> Self

Converts to this type from the input type.

impl From<ByteLevel> for PostProcessorWrapper

fn from(from: ByteLevel) -> Self

Converts to this type from the input type.

impl From<ByteLevel> for PreTokenizerWrapper

fn from(from: ByteLevel) -> Self

Converts to this type from the input type.

impl PartialEq for ByteLevel

fn eq(&self, other: &ByteLevel) -> bool

This method tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl PostProcessor for ByteLevel

As a PostProcessor, ByteLevel is in charge of trimming the offsets if necessary.

fn added_tokens(&self, _is_pair: bool) -> usize

Returns the number of tokens that will be added during the processing step.

fn process_encodings(&self, encodings: Vec<Encoding>, _add_special_tokens: bool) -> Result<Vec<Encoding>>

Processes any number of encodings and returns a series of encodings (it might merge them).

fn process(&self, encoding: Encoding, pair_encoding: Option<Encoding>, add_special_tokens: bool) -> Result<Encoding>

Processes both encodings and returns a new merged one.
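The offset trimming that ByteLevel performs as a PostProcessor can be pictured with a simplified sketch. It assumes that a space shows up in a byte-level token as `Ġ` (U+0120, as in the GPT-2 table) and that each space accounts for one offset unit. `trim_token_offsets` is a hypothetical helper, and the sketch ignores any interaction with `add_prefix_space`, so it is an illustration of the idea rather than the crate's logic.

```rust
/// Byte-level stand-in for the space byte in the GPT-2 style table (U+0120).
/// Assumption: the real pre-tokenizer uses the same marker.
const SPACE_MARKER: char = 'Ġ';

fn is_space(c: &char) -> bool {
    *c == SPACE_MARKER || c.is_whitespace()
}

/// Shrinks `(start, end)` so it no longer covers the leading and trailing
/// spaces of `token`. The result never becomes an inverted range.
fn trim_token_offsets(token: &str, offsets: (usize, usize)) -> (usize, usize) {
    let leading = token.chars().take_while(is_space).count();
    let trailing = token.chars().rev().take_while(is_space).count();
    // Move the start forward past leading spaces, but not beyond the end.
    let start = (offsets.0 + leading).min(offsets.1);
    // Move the end back past trailing spaces, but not before the new start.
    let end = offsets.1.saturating_sub(trailing).max(start);
    (start, end)
}
```

For the input `Hello world`, the token `Ġworld` with offsets `(5, 11)` is trimmed to `(6, 11)`, so the span covers `world` without the preceding space.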

impl PreTokenizer for ByteLevel

As a PreTokenizer, ByteLevel is in charge of transforming all the Unicode characters into their byte-level counterpart. It also splits the input according to the configured regex.

fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>
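The byte-level transformation described for the PreTokenizer can be illustrated with the byte-to-character table published with GPT-2. The sketch below uses only the standard library. `byte_to_char_table` and `to_byte_level` are hypothetical helper names, and the assumption that `ByteLevel` uses exactly this table is not confirmed by this page. The regex splitting step is left out.

```rust
use std::collections::HashMap;

/// Builds a GPT-2 style table: every byte value gets one printable char.
fn byte_to_char_table() -> HashMap<u8, char> {
    let mut table = HashMap::with_capacity(256);
    let mut next_extra = 0u32; // counts bytes that are not kept as-is
    for b in 0u32..=255 {
        // Printable Latin-1 ranges map to the code point with the same value.
        let keep = matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF);
        let code = if keep {
            b
        } else {
            // Control bytes, space, and the like are shifted to 256, 257, ...
            let c = 256 + next_extra;
            next_extra += 1;
            c
        };
        table.insert(b as u8, char::from_u32(code).expect("valid code point"));
    }
    table
}

/// Rewrites a UTF-8 string as one table character per byte.
fn to_byte_level(text: &str) -> String {
    let table = byte_to_char_table();
    text.bytes().map(|b| table[&b]).collect()
}
```

With this table, `to_byte_level("Hello world")` yields `HelloĠworld`: the space byte 0x20 is the 33rd shifted byte and lands on U+0120. A multi-byte character such as `é` becomes two characters, `Ã©`, one per UTF-8 byte.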

impl Serialize for ByteLevel

fn serialize<__S>(&self, __serializer: __S) -> Result<__S::Ok, __S::Error>
where __S: Serializer,

Serialize this value into the given Serde serializer.

impl Copy for ByteLevel

impl Eq for ByteLevel

impl StructuralPartialEq for ByteLevel

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Pointable for T

const ALIGN: usize = _

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,