Trait aho_corasick::automaton::Automaton
source · pub unsafe trait Automaton: Sealed {
Show 26 methods
// Required methods
fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>;
fn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID;
fn is_special(&self, sid: StateID) -> bool;
fn is_dead(&self, sid: StateID) -> bool;
fn is_match(&self, sid: StateID) -> bool;
fn is_start(&self, sid: StateID) -> bool;
fn match_kind(&self) -> MatchKind;
fn match_len(&self, sid: StateID) -> usize;
fn match_pattern(&self, sid: StateID, index: usize) -> PatternID;
fn patterns_len(&self) -> usize;
fn pattern_len(&self, pid: PatternID) -> usize;
fn min_pattern_len(&self) -> usize;
fn max_pattern_len(&self) -> usize;
fn memory_usage(&self) -> usize;
fn prefilter(&self) -> Option<&Prefilter>;
// Provided methods
fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError> { ... }
fn try_find_overlapping(
&self,
input: &Input<'_>,
state: &mut OverlappingState
) -> Result<(), MatchError> { ... }
fn try_find_iter<'a, 'h>(
&'a self,
input: Input<'h>
) -> Result<FindIter<'a, 'h, Self>, MatchError>
where Self: Sized { ... }
fn try_find_overlapping_iter<'a, 'h>(
&'a self,
input: Input<'h>
) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>
where Self: Sized { ... }
fn try_replace_all<B>(
&self,
haystack: &str,
replace_with: &[B]
) -> Result<String, MatchError>
where Self: Sized,
B: AsRef<str> { ... }
fn try_replace_all_bytes<B>(
&self,
haystack: &[u8],
replace_with: &[B]
) -> Result<Vec<u8>, MatchError>
where Self: Sized,
B: AsRef<[u8]> { ... }
fn try_replace_all_with<F>(
&self,
haystack: &str,
dst: &mut String,
replace_with: F
) -> Result<(), MatchError>
where Self: Sized,
F: FnMut(&Match, &str, &mut String) -> bool { ... }
fn try_replace_all_with_bytes<F>(
&self,
haystack: &[u8],
dst: &mut Vec<u8>,
replace_with: F
) -> Result<(), MatchError>
where Self: Sized,
F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool { ... }
fn try_stream_find_iter<'a, R: Read>(
&'a self,
rdr: R
) -> Result<StreamFindIter<'a, Self, R>, MatchError>
where Self: Sized { ... }
fn try_stream_replace_all<R, W, B>(
&self,
rdr: R,
wtr: W,
replace_with: &[B]
) -> Result<()>
where Self: Sized,
R: Read,
W: Write,
B: AsRef<[u8]> { ... }
fn try_stream_replace_all_with<R, W, F>(
&self,
rdr: R,
wtr: W,
replace_with: F
) -> Result<()>
where Self: Sized,
R: Read,
W: Write,
F: FnMut(&Match, &[u8], &mut W) -> Result<()> { ... }
}
Expand description
A trait that abstracts over Aho-Corasick automata.
This trait primarily exists for niche use cases such as:
- Using an NFA or DFA directly, bypassing the top-level
AhoCorasick
searcher. Currently, these includenoncontiguous::NFA
,contiguous::NFA
anddfa::DFA
. - Implementing your own custom search routine by walking the automaton yourself. This might be useful for implementing search on non-contiguous strings or streams.
For most use cases, it is not expected that users will need
to use or even know about this trait. Indeed, the top level
AhoCorasick
searcher does not expose any details
about this trait, nor does it implement it itself.
Note that this trait defines a number of default methods, such as
Automaton::try_find
and Automaton::try_find_iter
, which implement
higher level search routines in terms of the lower level automata API.
Sealed
Currently, this trait is sealed. That means users of this crate can write generic routines over this trait but cannot implement it themselves. This restriction may be lifted in the future, but sealing the trait permits adding new required methods in a backwards compatible fashion.
Special states
This trait encodes a notion of “special” states in an automaton. Namely, a state is treated as special if it is a dead, match or start state:
- A dead state is a state that cannot be left once entered. All transitions on a dead state lead back to itself. The dead state is meant to be treated as a sentinel indicating that the search should stop and return a match if one has been found, and nothing otherwise.
- A match state is a state that indicates one or more patterns have
matched. Depending on the
MatchKind
of the automaton, a search may stop once a match is seen, or it may continue looking for matches until it enters a dead state or sees the end of the haystack. - A start state is a state that a search begins in. It is useful to know
when a search enters a start state because it may mean that a prefilter can
be used to skip ahead and quickly look for candidate matches. Unlike dead
and match states, it is never necessary to explicitly handle start states
for correctness. Indeed, in this crate, implementations of
Automaton
will only treat start states as “special” when a prefilter is enabled and active. Otherwise, treating it as special has no purpose and winds up slowing down the overall search because it results in ping-ponging between the main state transition and the “special” state logic.
Since checking whether a state is special by doing three different
checks would be too expensive inside a fast search loop, the
Automaton::is_special
method is provided for quickly checking whether
the state is special. The Automaton::is_dead
, Automaton::is_match
and
Automaton::is_start
predicates can then be used to determine which kind
of special state it is.
Panics
Most of the APIs on this trait should panic or give incorrect results
if invalid inputs are given to it. For example, Automaton::next_state
has unspecified behavior if the state ID given to it is not a valid
state ID for the underlying automaton. Valid state IDs can only be
retrieved in one of two ways: calling Automaton::start_state
or calling
Automaton::next_state
with a valid state ID.
Safety
This trait is not safe to implement so that code may rely on the correctness of implementations of this trait to avoid undefined behavior. The primary correctness guarantees are:
Automaton::start_state
always returns a valid state ID or an error or panics.Automaton::next_state
, when given a valid state ID, always returns a valid state ID for all values ofanchored
andbyte
, or otherwise panics.
In general, the rest of the methods on Automaton
need to uphold their
contracts as well. For example, Automaton::is_dead
should only returns
true if the given state ID is actually a dead state.
Note that currently this crate does not rely on the safety property defined here to avoid undefined behavior. Instead, this was done to make it possible to do in the future.
Example
This example shows how one might implement a basic but correct search
routine. We keep things simple by not using prefilters or worrying about
anchored searches, but do make sure our search is correct for all possible
MatchKind
semantics. (The comments in the code below note the parts
that are needed to support certain MatchKind
semantics.)
use aho_corasick::{
automaton::Automaton,
nfa::noncontiguous::NFA,
Anchored, Match, MatchError, MatchKind,
};
// Run an unanchored search for 'aut' in 'haystack'. Return the first match
// seen according to the automaton's match semantics. This returns an error
// if the given automaton does not support unanchored searches.
fn find<A: Automaton>(
aut: A,
haystack: &[u8],
) -> Result<Option<Match>, MatchError> {
let mut sid = aut.start_state(Anchored::No)?;
let mut at = 0;
let mut mat = None;
let get_match = |sid, at| {
let pid = aut.match_pattern(sid, 0);
let len = aut.pattern_len(pid);
Match::new(pid, (at - len)..at)
};
// Start states can be match states!
if aut.is_match(sid) {
mat = Some(get_match(sid, at));
// Standard semantics require matches to be reported as soon as
// they're seen. Otherwise, we continue until we see a dead state
// or the end of the haystack.
if matches!(aut.match_kind(), MatchKind::Standard) {
return Ok(mat);
}
}
while at < haystack.len() {
sid = aut.next_state(Anchored::No, sid, haystack[at]);
if aut.is_special(sid) {
if aut.is_dead(sid) {
return Ok(mat);
} else if aut.is_match(sid) {
mat = Some(get_match(sid, at + 1));
// As above, standard semantics require that we return
// immediately once a match is found.
if matches!(aut.match_kind(), MatchKind::Standard) {
return Ok(mat);
}
}
}
at += 1;
}
Ok(mat)
}
// Show that it works for standard searches.
let nfa = NFA::new(&["samwise", "sam"]).unwrap();
assert_eq!(Some(Match::must(1, 0..3)), find(&nfa, b"samwise")?);
// But also works when using leftmost-first. Notice how the match result
// has changed!
let nfa = NFA::builder()
.match_kind(MatchKind::LeftmostFirst)
.build(&["samwise", "sam"])
.unwrap();
assert_eq!(Some(Match::must(0, 0..7)), find(&nfa, b"samwise")?);
Required Methods§
sourcefn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>
fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>
Returns the starting state for the given anchor mode.
Upon success, the state ID returned is guaranteed to be valid for this automaton.
Errors
This returns an error when the given search configuration is not supported by the underlying automaton. For example, if the underlying automaton only supports unanchored searches but the given configuration was set to an anchored search, then this must return an error.
sourcefn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID
fn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID
Performs a state transition from sid
for byte
and returns the next
state.
anchored
should be Anchored::Yes
when executing an anchored
search and Anchored::No
otherwise. For some implementations of
Automaton
, it is required to know whether the search is anchored
or not in order to avoid following failure transitions. Other
implementations may ignore anchored
altogether and depend on
Automaton::start_state
returning a state that walks a different path
through the automaton depending on whether the search is anchored or
not.
Panics
This routine may panic or return incorrect results when the given state ID is invalid. A state ID is valid if and only if:
- It came from a call to
Automaton::start_state
, or - It came from a previous call to
Automaton::next_state
with a valid state ID.
Implementations must treat all possible values of byte
as valid.
Implementations may panic on unsupported values of anchored
, but are
not required to do so.
sourcefn is_special(&self, sid: StateID) -> bool
fn is_special(&self, sid: StateID) -> bool
Returns true if the given ID represents a “special” state. A special state is a dead, match or start state.
Note that implementations may choose to return false when the given ID corresponds to a start state. Namely, it always correct to treat start states as non-special. Implementations must return true for states that are dead or contain matches.
This has unspecified behavior when given an invalid state ID.
sourcefn is_dead(&self, sid: StateID) -> bool
fn is_dead(&self, sid: StateID) -> bool
Returns true if the given ID represents a dead state.
A dead state is a type of “sink” in a finite state machine. It corresponds to a state whose transitions all loop back to itself. That is, once entered, it can never be left. In practice, it serves as a sentinel indicating that the search should terminate.
This has unspecified behavior when given an invalid state ID.
sourcefn is_match(&self, sid: StateID) -> bool
fn is_match(&self, sid: StateID) -> bool
Returns true if the given ID represents a match state.
A match state is always associated with one or more pattern IDs that
matched at the position in the haystack when the match state was
entered. When a match state is entered, the match semantics dictate
whether it should be returned immediately (for MatchKind::Standard
)
or if the search should continue (for MatchKind::LeftmostFirst
and
MatchKind::LeftmostLongest
) until a dead state is seen or the end of
the haystack has been reached.
This has unspecified behavior when given an invalid state ID.
sourcefn is_start(&self, sid: StateID) -> bool
fn is_start(&self, sid: StateID) -> bool
Returns true if the given ID represents a start state.
While it is never incorrect to ignore start states during a search (except for the start of the search of course), knowing whether one has entered a start state can be useful for certain classes of performance optimizations. For example, if one is in a start state, it may be legal to try to skip ahead and look for match candidates more quickly than would otherwise be accomplished by walking the automaton.
Implementations of Automaton
in this crate “unspecialize” start
states when a prefilter is not active or enabled. In this case, it
is possible for Automaton::is_special(sid)
to return false while
Automaton::is_start(sid)
returns true.
This has unspecified behavior when given an invalid state ID.
sourcefn match_kind(&self) -> MatchKind
fn match_kind(&self) -> MatchKind
Returns the match semantics that this automaton was built with.
sourcefn match_len(&self, sid: StateID) -> usize
fn match_len(&self, sid: StateID) -> usize
Returns the total number of matches for the given state ID.
This has unspecified behavior if the given ID does not refer to a match state.
sourcefn match_pattern(&self, sid: StateID, index: usize) -> PatternID
fn match_pattern(&self, sid: StateID, index: usize) -> PatternID
Returns the pattern ID for the match state given by sid
at the
index
given.
Typically, index
is only ever greater than 0
when implementing an
overlapping search. Otherwise, it’s likely that your search only cares
about reporting the first pattern ID in a match state.
This has unspecified behavior if the given ID does not refer to a match state, or if the index is greater than or equal to the total number of matches in this match state.
sourcefn patterns_len(&self) -> usize
fn patterns_len(&self) -> usize
Returns the total number of patterns compiled into this automaton.
sourcefn pattern_len(&self, pid: PatternID) -> usize
fn pattern_len(&self, pid: PatternID) -> usize
Returns the length of the pattern for the given ID.
This has unspecified behavior when given an invalid pattern
ID. A pattern ID is valid if and only if it is less than
Automaton::patterns_len
.
sourcefn min_pattern_len(&self) -> usize
fn min_pattern_len(&self) -> usize
Returns the length, in bytes, of the shortest pattern in this automaton.
sourcefn max_pattern_len(&self) -> usize
fn max_pattern_len(&self) -> usize
Returns the length, in bytes, of the longest pattern in this automaton.
sourcefn memory_usage(&self) -> usize
fn memory_usage(&self) -> usize
Returns the heap memory usage, in bytes, used by this automaton.
sourcefn prefilter(&self) -> Option<&Prefilter>
fn prefilter(&self) -> Option<&Prefilter>
Returns a prefilter, if available, that can be used to accelerate searches for this automaton.
The typical way this is used is when the start state is entered during a search. When that happens, one can use a prefilter to skip ahead and look for candidate matches without having to walk the automaton on the bytes between candidates.
Typically a prefilter is only available when there are a small (<100) number of patterns built into the automaton.
Provided Methods§
sourcefn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>
fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>
Executes a non-overlapping search with this automaton using the given configuration.
See
AhoCorasick::try_find
for more documentation and examples.
sourcefn try_find_overlapping(
&self,
input: &Input<'_>,
state: &mut OverlappingState
) -> Result<(), MatchError>
fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState ) -> Result<(), MatchError>
Executes a overlapping search with this automaton using the given configuration.
See
AhoCorasick::try_find_overlapping
for more documentation and examples.
sourcefn try_find_iter<'a, 'h>(
&'a self,
input: Input<'h>
) -> Result<FindIter<'a, 'h, Self>, MatchError>where
Self: Sized,
fn try_find_iter<'a, 'h>( &'a self, input: Input<'h> ) -> Result<FindIter<'a, 'h, Self>, MatchError>where Self: Sized,
Returns an iterator of non-overlapping matches with this automaton using the given configuration.
See
AhoCorasick::try_find_iter
for more documentation and examples.
sourcefn try_find_overlapping_iter<'a, 'h>(
&'a self,
input: Input<'h>
) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>where
Self: Sized,
fn try_find_overlapping_iter<'a, 'h>( &'a self, input: Input<'h> ) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>where Self: Sized,
Returns an iterator of overlapping matches with this automaton using the given configuration.
See
AhoCorasick::try_find_overlapping_iter
for more documentation and examples.
sourcefn try_replace_all<B>(
&self,
haystack: &str,
replace_with: &[B]
) -> Result<String, MatchError>where
Self: Sized,
B: AsRef<str>,
fn try_replace_all<B>( &self, haystack: &str, replace_with: &[B] ) -> Result<String, MatchError>where Self: Sized, B: AsRef<str>,
Replaces all non-overlapping matches in haystack
with
strings from replace_with
depending on the pattern that
matched. The replace_with
slice must have length equal to
Automaton::patterns_len
.
See
AhoCorasick::try_replace_all
for more documentation and examples.
sourcefn try_replace_all_bytes<B>(
&self,
haystack: &[u8],
replace_with: &[B]
) -> Result<Vec<u8>, MatchError>where
Self: Sized,
B: AsRef<[u8]>,
fn try_replace_all_bytes<B>( &self, haystack: &[u8], replace_with: &[B] ) -> Result<Vec<u8>, MatchError>where Self: Sized, B: AsRef<[u8]>,
Replaces all non-overlapping matches in haystack
with
strings from replace_with
depending on the pattern that
matched. The replace_with
slice must have length equal to
Automaton::patterns_len
.
See
AhoCorasick::try_replace_all_bytes
for more documentation and examples.
sourcefn try_replace_all_with<F>(
&self,
haystack: &str,
dst: &mut String,
replace_with: F
) -> Result<(), MatchError>where
Self: Sized,
F: FnMut(&Match, &str, &mut String) -> bool,
fn try_replace_all_with<F>( &self, haystack: &str, dst: &mut String, replace_with: F ) -> Result<(), MatchError>where Self: Sized, F: FnMut(&Match, &str, &mut String) -> bool,
Replaces all non-overlapping matches in haystack
by calling the
replace_with
closure given.
See
AhoCorasick::try_replace_all_with
for more documentation and examples.
sourcefn try_replace_all_with_bytes<F>(
&self,
haystack: &[u8],
dst: &mut Vec<u8>,
replace_with: F
) -> Result<(), MatchError>where
Self: Sized,
F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool,
fn try_replace_all_with_bytes<F>( &self, haystack: &[u8], dst: &mut Vec<u8>, replace_with: F ) -> Result<(), MatchError>where Self: Sized, F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool,
Replaces all non-overlapping matches in haystack
by calling the
replace_with
closure given.
See
AhoCorasick::try_replace_all_with_bytes
for more documentation and examples.
sourcefn try_stream_find_iter<'a, R: Read>(
&'a self,
rdr: R
) -> Result<StreamFindIter<'a, Self, R>, MatchError>where
Self: Sized,
fn try_stream_find_iter<'a, R: Read>( &'a self, rdr: R ) -> Result<StreamFindIter<'a, Self, R>, MatchError>where Self: Sized,
Returns an iterator of non-overlapping matches with this automaton from the stream given.
See
AhoCorasick::try_stream_find_iter
for more documentation and examples.
sourcefn try_stream_replace_all<R, W, B>(
&self,
rdr: R,
wtr: W,
replace_with: &[B]
) -> Result<()>where
Self: Sized,
R: Read,
W: Write,
B: AsRef<[u8]>,
fn try_stream_replace_all<R, W, B>( &self, rdr: R, wtr: W, replace_with: &[B] ) -> Result<()>where Self: Sized, R: Read, W: Write, B: AsRef<[u8]>,
Replaces all non-overlapping matches in rdr
with strings from
replace_with
depending on the pattern that matched, and writes the
result to wtr
. The replace_with
slice must have length equal to
Automaton::patterns_len
.
See
AhoCorasick::try_stream_replace_all
for more documentation and examples.
sourcefn try_stream_replace_all_with<R, W, F>(
&self,
rdr: R,
wtr: W,
replace_with: F
) -> Result<()>where
Self: Sized,
R: Read,
W: Write,
F: FnMut(&Match, &[u8], &mut W) -> Result<()>,
fn try_stream_replace_all_with<R, W, F>( &self, rdr: R, wtr: W, replace_with: F ) -> Result<()>where Self: Sized, R: Read, W: Write, F: FnMut(&Match, &[u8], &mut W) -> Result<()>,
Replaces all non-overlapping matches in rdr
by calling the
replace_with
closure given and writing the result to wtr
.
See
AhoCorasick::try_stream_replace_all_with
for more documentation and examples.