Struct regex_automata::dfa::onepass::DFA

pub struct DFA { /* private fields */ }

A one-pass DFA for executing a subset of anchored regex searches while resolving capturing groups.

A one-pass DFA can be built from an NFA that is one-pass. An NFA is one-pass when there is never any ambiguity about how to continue a search. For example, a*a is not one-pass because during a search, it’s not possible to know whether to continue matching the a* or to move on to the single a. However, a*b is one-pass, because for every byte in the input, it’s always clear when to move on from a* to b.
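
The ambiguity described above can be illustrated with a small standalone model. This is a sketch only, and it is not how this crate checks the one-pass property: it assumes a toy pattern language in which a pattern is a concatenation of byte ranges, each optionally starred. The names Elem and toy_is_one_pass are hypothetical and exist only for this illustration. The toy calls a pattern one-pass when, at every position, the elements that could consume the next byte have pairwise disjoint byte ranges.

```rust
/// One element of a toy pattern: an inclusive byte range, optionally starred.
/// (Hypothetical type for illustration; not part of regex-automata.)
#[derive(Clone, Copy, Debug)]
pub struct Elem {
    pub lo: u8,
    pub hi: u8,
    pub star: bool,
}

impl Elem {
    /// An element matching exactly one byte value.
    pub fn byte(b: u8) -> Elem {
        Elem { lo: b, hi: b, star: false }
    }

    /// An element matching any byte.
    pub fn any() -> Elem {
        Elem { lo: 0x00, hi: 0xFF, star: false }
    }

    /// The same element, repeated zero or more times.
    pub fn starred(mut self) -> Elem {
        self.star = true;
        self
    }

    /// True when the two byte ranges share at least one byte.
    fn overlaps(&self, other: &Elem) -> bool {
        self.lo <= other.hi && other.lo <= self.hi
    }
}

/// Toy one-pass check for a concatenation of (optionally starred) byte ranges.
pub fn toy_is_one_pass(pattern: &[Elem]) -> bool {
    for start in 0..pattern.len() {
        // Skip over the run of starred elements beginning at `start`. Each of
        // them may match zero bytes, so the search could be in any of them.
        let mut end = start;
        while end < pattern.len() && pattern[end].star {
            end += 1;
        }
        // Also include the first non-starred element after the run, if any.
        let end = (end + 1).min(pattern.len());
        let frontier = &pattern[start..end];
        // Any shared byte between two candidates means the next input byte
        // does not uniquely determine how to continue.
        for (i, a) in frontier.iter().enumerate() {
            for b in &frontier[i + 1..] {
                if a.overlaps(b) {
                    return false;
                }
            }
        }
    }
    true
}
```

In this model, toy_is_one_pass(&[Elem::byte(b'a').starred(), Elem::byte(b'a')]) returns false, which corresponds to a*a, while replacing the second element with Elem::byte(b'b') returns true, which corresponds to a*b.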

Only anchored searches are supported

In this crate, especially for DFAs, unanchored searches are implemented by treating the pattern as if it had a (?s-u:.)*? prefix. While the prefix is one-pass on its own, adding anything after it, e.g., (?s-u:.)*?a will make the overall pattern not one-pass. Why? Because the (?s-u:.) matches any byte, and there is therefore ambiguity as to when the prefix should stop matching and something else should start matching.

Therefore, one-pass DFAs do not support unanchored searches. Combined with the fact that many regexes are simply not one-pass, this means that one-pass DFAs have limited utility. With that said, when a one-pass DFA can be used, it can potentially provide a dramatic speed up over alternatives like the BoundedBacktracker and the PikeVM. In particular, a one-pass DFA is the only DFA capable of reporting the spans of matching capturing groups.

To clarify, when we say that unanchored searches are not supported, what that actually means is:

The high level routines, DFA::is_match and DFA::captures, always do anchored searches.
Since iterators are most useful in the context of unanchored searches, there is no DFA::captures_iter method.
For lower level routines like DFA::try_search, an error will be returned if the given Input is configured to do an unanchored search or to search for an invalid pattern ID. (Note that an Input is configured to do an unanchored search by default, so passing an Input built with Input::new and no further configuration is guaranteed to return an error.)

Other limitations

In addition to the configurable heap limit and the requirement that a regex pattern be one-pass, there are some other limitations:

There is an internal limit on the total number of explicit capturing groups that appear across all patterns. It is somewhat small and there is no way to configure it. If your pattern(s) exceed this limit, then building a one-pass DFA will fail.
If the number of patterns exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you’re unlikely to hit it.
If the total number of states exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you’re unlikely to hit it.

Other examples of regexes that aren’t one-pass

One particularly unfortunate example is that enabling Unicode can cause regexes that were one-pass to no longer be one-pass. Consider the regex (?-u)\w*\s for example. It is one-pass because there is no overlap between the ASCII definitions of \w and \s. But \w*\s (i.e., with Unicode enabled) is not one-pass because \w and \s get translated to UTF-8 automata. And while the codepoints in \w and \s do not overlap, the underlying UTF-8 encodings do. Indeed, because of the overlap between UTF-8 automata, the use of Unicode character classes will tend to vastly increase the likelihood of a regex not being one-pass.
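
The overlap between UTF-8 encodings can be seen using only the standard library. The sketch below is an illustration, not something taken from this crate: the helper names utf8_lead_byte and share_lead_byte are hypothetical, the two codepoints mentioned afterwards are chosen for this example, and char::is_alphabetic and char::is_whitespace are used only as rough stand-ins for \w and \s.

```rust
/// Returns the first byte of the UTF-8 encoding of `c`.
/// (Hypothetical helper for illustration.)
pub fn utf8_lead_byte(c: char) -> u8 {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes()[0]
}

/// True when two distinct codepoints begin with the same byte in UTF-8.
/// A byte-at-a-time automaton that has seen only that first byte cannot
/// yet tell which of the two codepoints it is reading.
pub fn share_lead_byte(a: char, b: char) -> bool {
    a != b && utf8_lead_byte(a) == utf8_lead_byte(b)
}
```

For instance, U+00AA (FEMININE ORDINAL INDICATOR, alphabetic) encodes as the bytes C2 AA and U+00A0 (NO-BREAK SPACE, whitespace) encodes as C2 A0, so share_lead_byte('\u{00AA}', '\u{00A0}') returns true. The ASCII pair 'a' and ' ' are single bytes with different values, so share_lead_byte('a', ' ') returns false.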

How does one know if a regex is one-pass or not?

At the time of writing, the only way to know is to try and build a one-pass DFA. The one-pass property is checked while constructing the DFA.

This does mean that you might potentially waste some CPU cycles and memory by optimistically trying to build a one-pass DFA. But this is currently the only way. In the future, building a one-pass DFA might be able to use some heuristics to detect common violations of the one-pass property and bail more quickly.

Resource usage

Unlike a general DFA, a one-pass DFA has stricter bounds on its resource usage. Namely, construction of a one-pass DFA has a time and space complexity of O(n), where n ~ nfa.states().len(). (A general DFA’s time and space complexity is O(2^n).) This smaller time bound is achieved because there is at most one DFA state created for each NFA state. If additional DFA states would be required, then the pattern is not one-pass and construction will fail.

Note though that currently, this DFA uses a fully dense representation. This means that while its space complexity is no worse than an NFA, it may in practice use more memory because of higher constant factors. The reason for this trade-off is two-fold. Firstly, a dense representation makes the search faster. Secondly, the bigger an NFA, the more unlikely it is to be one-pass. Therefore, most one-pass DFAs are pretty small in practice.
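
To get a feel for the constant factors of a dense representation, a rough estimate can be computed from the number of states and the stride. The sketch below is back-of-the-envelope only: dense_table_bytes is a hypothetical helper, the stride is taken to be 2 raised to stride2, and the size of a single transition is a caller-supplied assumption rather than a documented guarantee. For an exact figure, use DFA::memory_usage.

```rust
/// Estimates the size in bytes of a dense transition table with `state_len`
/// rows of `2^stride2` transitions each. Returns None on arithmetic overflow.
/// (Hypothetical helper; `bytes_per_transition` is an assumption supplied by
/// the caller, not a value taken from the crate.)
pub fn dense_table_bytes(
    state_len: usize,
    stride2: u32,
    bytes_per_transition: usize,
) -> Option<usize> {
    // Every state reserves a full row of `stride` slots, whether or not
    // each slot corresponds to a transition that is actually used.
    let stride = 1usize.checked_shl(stride2)?;
    state_len
        .checked_mul(stride)?
        .checked_mul(bytes_per_transition)
}
```

Under these assumptions, 10 states with a stride2 of 9 and 8 bytes per transition come to 10 * 512 * 8 = 40960 bytes, even if most slots in each row are never taken.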

Example

This example shows that the one-pass DFA implements Unicode word boundaries correctly while simultaneously reporting spans for capturing groups that participate in a match. (This is the only DFA that implements full support for Unicode word boundaries.)

use regex_automata::{dfa::onepass::DFA, Match, Span};

let re = DFA::new(r"\b(?P<first>\w+)[[:space:]]+(?P<last>\w+)\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

re.captures(&mut cache, "Шерлок Холмс", &mut caps);
assert_eq!(Some(Match::must(0, 0..23)), caps.get_match());
assert_eq!(Some(Span::from(0..12)), caps.get_group_by_name("first"));
assert_eq!(Some(Span::from(13..23)), caps.get_group_by_name("last"));

Example: iteration

Unlike other regex engines in this crate, this one does not provide iterator search functions. This is because a one-pass DFA only supports anchored searches, and so iterator functions are generally not applicable.

However, if you know that all of your matches are directly adjacent, then an iterator can be used. The util::iter::Searcher type can be used for this purpose:

use regex_automata::{
    dfa::onepass::DFA,
    util::iter::Searcher,
    Anchored, Input, Span,
};

let re = DFA::new(r"\w(\d)\w")?;
let (mut cache, caps) = (re.create_cache(), re.create_captures());
let input = Input::new("a1zb2yc3x").anchored(Anchored::Yes);

let mut it = Searcher::new(input).into_captures_iter(caps, |input, caps| {
    Ok(re.try_search(&mut cache, input, caps)?)
}).infallible();
let caps0 = it.next().unwrap();
assert_eq!(Some(Span::from(1..2)), caps0.get_group(1));

Implementations

impl DFA

pub fn new(pattern: &str) -> Result<DFA, BuildError>

pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError>

pub fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError>

pub fn always_match() -> Result<DFA, BuildError>

pub fn never_match() -> Result<DFA, BuildError>

pub fn config() -> Config

pub fn builder() -> Builder

pub fn create_captures(&self) -> Captures

pub fn create_cache(&self) -> Cache

pub fn reset_cache(&self, cache: &mut Cache)

pub fn get_config(&self) -> &Config

pub fn get_nfa(&self) -> &NFA

pub fn pattern_len(&self) -> usize

pub fn state_len(&self) -> usize

pub fn alphabet_len(&self) -> usize

pub fn stride2(&self) -> usize

pub fn stride(&self) -> usize

pub fn memory_usage(&self) -> usize

impl DFA

pub fn is_match<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I ) -> bool

pub fn find<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I ) -> Option<Match>

pub fn captures<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, caps: &mut Captures )

pub fn try_search( &self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures ) -> Result<(), MatchError>

pub fn try_search_slots( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>] ) -> Result<Option<PatternID>, MatchError>

Trait Implementations

impl Clone for DFA

fn clone(&self) -> DFA

fn clone_from(&mut self, source: &Self)

impl Debug for DFA

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Auto Trait Implementations

impl RefUnwindSafe for DFA

impl Send for DFA

impl Sync for DFA

impl Unpin for DFA

impl UnwindSafe for DFA

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> ToOwned for T where T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>