Struct tokenizers::tokenizer::pre_tokenizer::PreTokenizedString
pub struct PreTokenizedString { /* private fields */ }
The PreTokenizedString is in charge of splitting an underlying string, keeping track of offsets while doing so, and providing ways to normalize and tokenize these splits.
Once everything has been normalized and tokenized, the PreTokenizedString is able to build an Encoding with all the relevant offsets and word ids, relative to the original string.
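A minimal end-to-end sketch of that flow, using the methods documented below. The closure signatures, the NormalizedString helpers (split, lowercase, get), the OffsetType::Byte variant, and the top-level re-exports in the imports are assumptions drawn from this page and the rest of the crate's documentation, not a verbatim recipe:

```rust
use tokenizers::{OffsetType, PreTokenizedString, Result, SplitDelimiterBehavior, Token};

fn main() -> Result<()> {
    // Wrap the raw input; all offsets stay relative to this original string.
    let mut pretokenized = PreTokenizedString::from("Hello there, World!");

    // 1. Split every current substring on whitespace, dropping the whitespace.
    pretokenized.split(|_, normalized| {
        normalized.split(char::is_whitespace, SplitDelimiterBehavior::Removed)
    })?;

    // 2. Normalize the splits that have no tokens attached yet.
    pretokenized.normalize(|normalized| {
        normalized.lowercase();
        Ok(())
    })?;

    // 3. Attach a Token to every split (dummy id 0 here; a real model
    //    would look the value up in its vocabulary).
    pretokenized.tokenize(|normalized| {
        let value = normalized.get().to_string();
        let len = value.len();
        Ok(vec![Token::new(0, value, (0, len))])
    })?;

    // 4. Build the Encoding, letting the splits define the word ids.
    let encoding = pretokenized.into_encoding(None, 0, OffsetType::Byte)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```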
Implementations
impl PreTokenizedString
pub fn split<F, U, R>(&mut self, split_fn: F) -> Result<()>
Split the PreTokenizedString by providing a split_fn in charge of splitting each substring (NormalizedString) into multiple parts.
split_fn takes a NormalizedString and returns an iterator over the produced NormalizedString. split_fn is free to modify these NormalizedString as relevant, as long as it respects the constraint stated below.
There is only one constraint that MUST be respected: the produced NormalizedString, if combined back together, must have the same original string as the one given to split_fn. Concretely, this means that for offset tracking to work as expected, split_fn must produce “splits” of the original string.
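A sketch of one possible split_fn, assuming NormalizedString::split and SplitDelimiterBehavior work as in the crate's own pre-tokenizers. It isolates ASCII punctuation into its own splits while keeping every character, so the produced pieces still cover the original string:

```rust
use tokenizers::{PreTokenizedString, Result, SplitDelimiterBehavior};

// Hypothetical helper: every current substring gets re-split so that each
// punctuation character becomes its own split (closure signature assumed:
// it receives the split index and the NormalizedString to subdivide).
fn isolate_punctuation(pretokenized: &mut PreTokenizedString) -> Result<()> {
    pretokenized.split(|_idx, normalized| {
        normalized.split(
            |c: char| c.is_ascii_punctuation(),
            SplitDelimiterBehavior::Isolated,
        )
    })
}
```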
pub fn normalize<F>(&mut self, normalize: F) -> Result<()>
Normalize all the splits that do not have attached Tokens, using the provided normalize function.
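A sketch of a normalize call, assuming the closure receives an &mut NormalizedString and that NormalizedString exposes an nfc() helper as used by the crate's Unicode normalizers:

```rust
use tokenizers::{PreTokenizedString, Result};

// Apply Unicode NFC composition to every split that has no tokens yet.
fn nfc_normalize(pretokenized: &mut PreTokenizedString) -> Result<()> {
    pretokenized.normalize(|normalized| {
        normalized.nfc();
        Ok(())
    })
}
```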
pub fn tokenize<F>(&mut self, tokenize: F) -> Result<()>
Tokenize all the splits that do not have attached Tokens, using the provided tokenize function.
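A sketch of a tokenize closure backed by a toy, hypothetical vocabulary. Token::new(id, value, offsets) is assumed, with offsets relative to the split being tokenized:

```rust
use std::collections::HashMap;
use tokenizers::{PreTokenizedString, Result, Token};

// Attach one Token per split by looking its text up in a toy vocabulary.
// Unknown splits get id 0, and the token's offsets span the whole split.
fn tokenize_with_vocab(
    pretokenized: &mut PreTokenizedString,
    vocab: &HashMap<String, u32>,
) -> Result<()> {
    pretokenized.tokenize(|normalized| {
        let value = normalized.get().to_string();
        let id = vocab.get(&value).copied().unwrap_or(0);
        let len = value.len();
        Ok(vec![Token::new(id, value, (0, len))])
    })
}
```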
pub fn into_encoding(
    self,
    word_idx: Option<u32>,
    type_id: u32,
    offset_type: OffsetType
) -> Result<Encoding>
Transform the current PreTokenizedString into an Encoding.
If a word_idx is provided, any word in the generated Encoding will be set to this value. This is generally used with pre-tokenized input that does not need the PreTokenizedString to generate word ids.
This method will fail if some splits do not have associated Tokens.
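A sketch of the pre-tokenized-input case described above, where the caller already knows which word this string belongs to and passes it as word_idx (here word 0). OffsetType::Byte is an assumed variant name:

```rust
use tokenizers::{Encoding, OffsetType, PreTokenizedString, Result};

// Consume a PreTokenizedString whose splits already carry Tokens.
// Every generated token is assigned to word 0, gets type id 0, and
// its offsets are reported in bytes. Fails if any split has no tokens.
fn encode_single_word(pretokenized: PreTokenizedString) -> Result<Encoding> {
    pretokenized.into_encoding(Some(0), 0, OffsetType::Byte)
}
```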
pub fn get_splits(
    &self,
    offset_ref: OffsetReferential,
    offset_type: OffsetType
) -> Vec<(&str, Offsets, &Option<Vec<Token>>)>
Returns a list of splits, each of them being a slice of the normalized string, the associated offsets (in either the original or the normalized referential), as well as the potential tokens.
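A sketch that dumps the current splits for inspection, assuming OffsetReferential::Original and OffsetType::Byte are the variants for "offsets against the original string, in bytes":

```rust
use tokenizers::{OffsetReferential, OffsetType, PreTokenizedString};

// Print every split with its offsets in the original string (in bytes),
// together with whatever tokens have been attached so far, if any.
fn dump_splits(pretokenized: &PreTokenizedString) {
    for (text, (start, end), tokens) in
        pretokenized.get_splits(OffsetReferential::Original, OffsetType::Byte)
    {
        println!("{text:?} @ {start}..{end}: {tokens:?}");
    }
}
```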
Trait Implementations
impl Clone for PreTokenizedString
fn clone(&self) -> PreTokenizedString
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for PreTokenizedString
impl From<&str> for PreTokenizedString
impl From<NormalizedString> for PreTokenizedString
fn from(s: NormalizedString) -> Self
impl From<String> for PreTokenizedString
impl PartialEq for PreTokenizedString
fn eq(&self, other: &PreTokenizedString) -> bool
This method tests for self and other values to be equal, and is used by ==.