Search for regex matches in &[u8] haystacks.
This module provides a nearly identical API via Regex to the one found in
the top-level of this crate. There are two important differences:
- Matching is done on &[u8] instead of &str. Additionally, Vec<u8> is used where String would have been used in the top-level API.
- Unicode support can be disabled even when disabling it would result in matching invalid UTF-8 bytes.
Example: match null terminated string
This shows how to find all null-terminated strings in a slice of bytes. This works even if a C string contains invalid UTF-8.
use regex::bytes::Regex;
let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let hay = b"foo\x00qu\xFFux\x00baz\x00";
// Extract all of the strings without the NUL terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(hay)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);
Example: selectively enable Unicode support
This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file):
use regex::bytes::Regex;
let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes regardless of whether they were
// valid UTF-8.
let (_, [title]) = re.captures(hay).unwrap().extract();
assert_eq!(title, b"\xE2\x98\x83");
// We can UTF-8 decode the title now. And the unwrap here
// is correct because the existence of a match guarantees
// that `title` is valid UTF-8.
let title = std::str::from_utf8(title).unwrap();
assert_eq!(title, "☃");
In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is guaranteed to be valid UTF-8.
Syntax
The supported syntax is pretty much the same as the syntax for Unicode regular expressions with a few changes that make sense for matching arbitrary bytes:
- The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in “ASCII compatible” mode.
- In ASCII compatible mode, Unicode character classes are not allowed. Literal Unicode scalar values outside of character classes are allowed.
- In ASCII compatible mode, Perl character classes (\w, \d and \s) revert to their typical ASCII definition. \w maps to [[:word:]], \d maps to [[:digit:]] and \s maps to [[:space:]].
- In ASCII compatible mode, word boundaries use the ASCII compatible \w to determine whether a byte is a word byte or not.
- Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is the Unicode codepoint U+00FF that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation when enabled.
- In ASCII compatible mode, . matches any byte except for \n. When the s flag is additionally enabled, . matches any byte.
Performance
In general, one should expect performance on &[u8] to be roughly similar to
performance on &str.
Structs
- A low level representation of the byte offsets of each capture group.
- An iterator over all non-overlapping capture matches in a haystack.
- An iterator over the names of all capture groups in a regex.
- Represents the capture groups for a single match.
- Represents a single match of a regex in a haystack.
- An iterator over all non-overlapping matches in a haystack.
- A helper type for forcing literal string replacement.
- A compiled regular expression for searching Unicode haystacks.
- A configurable builder for a Regex.
- Match multiple, possibly overlapping, regexes in a single search.
- A configurable builder for a RegexSet.
- A by-reference adaptor for a Replacer.
- A set of matches returned by a regex set.
- An owned iterator over the set of matches from a regex set.
- A borrowed iterator over the set of matches from a regex set.
- An iterator over all substrings delimited by a regex match.
- An iterator over at most N substrings delimited by a regex match.
- An iterator over all group matches in a Captures value.
Traits
- A trait for types that can be used to replace matches in a haystack.