Regular Expressions Primer

The Regular Expressions Primer is a tutorial for those completely new to regular expressions. To familiarize you with regular expressions, this primer starts with the simple building blocks of the syntax and, through examples, builds up to expressions useful for solving real everyday problems. The primer later discusses how to search for and replace text with regular expression syntax.

While the examples presented in the primer are generic in structure and syntax, there are minor usage differences among the Komodo supported languages (Python, Perl, Tcl, etc.). These differences are relevant when "Python Regular Expressions" are specified for use in Komodo's Find dialog box and Open/Find Toolbar. See More Regex Resources for information on Python regular expressions.

Regular expressions are embedded in programs to parse text. For example, a Python program might contain a regular expression as follows:

    import re
    n = re.compile(r'\bw[a-z]*', re.IGNORECASE)
    print(n.findall('will match all words beginning with the letter w.'))

An advanced Python regular expression embedded in a program:

    # Generate statement parsing regexes.
    # Note: "cgs" is defined elsewhere in the full program.
    stmts = [r'#\s*(?P<op>if|elif|ifdef|ifndef)\s+(?P<expr>.*?)',
             r'#\s*(?P<op>else|endif)',
             r'#\s*(?P<op>error)\s+(?P<error>.*?)',
             r'#\s*(?P<op>define)\s+(?P<var>[^\s]*?)(\s+(?P<val>.+?))?',
             r'#\s*(?P<op>undef)\s+(?P<var>[^\s]*?)']
    patterns = [r'^\s*%s\s*%s\s*%s\s*$'
                % (re.escape(cg[0]), stmt, re.escape(cg[1]))
                for cg in cgs for stmt in stmts]
    stmtRes = [re.compile(p) for p in patterns]

In this example, regular expressions are used within various statements. See the Python Tutorial for the full program where this regular expression is used.

Komodo's Rx Toolkit is used to build and evaluate regular expressions. See Using Rx Toolkit for more information.
About Regular Expressions

Regular expressions are used to describe patterns of characters that match
against text strings. They can be used as a tool to search for and replace text,
manipulate data, or test for a certain condition in a string of characters.
Many everyday tasks can be accomplished with regular expressions, such as
checking for the occurrence of a specific word or phrase in the body of an
e-mail message, or finding specific file types, such as
About Regex Syntax

Regular expressions use syntax elements composed of alphanumeric characters
and symbols. For example, the regex

There are many flavors and types of regular expression syntax. These variations are found in various tools, languages, and operating systems. For example, Perl, Python, grep, sed, vi, and Unix all use variations on standard regex syntax. This primer focuses on standard regex patterns not tied to a specific language or tool. This standard syntax can later be applied to the specific language, tool, or application of your choice.
Building Simple Patterns

Complete regular expressions are constructed using characters as small building block units. Each building block is in itself simple, but since these units can be combined in an infinite number of ways, knowing how to combine them to achieve a goal takes some practice. This section shows you how to build regexes through examples ranging from the simple to the more complex.
Matching Simple Strings

The simplest and most common type of regex is an alphanumeric string that matches itself, called a "literal text match". A literal text regex matches anywhere along a string. For example, a literal string matches itself when placed alone, and at the beginning, middle, or end of a larger string. Literal text matches are case sensitive.

Using regexes to search for simple strings.

Example 1: Search for the string "at".
Example 2: Search for the string "email".
Example 3: Search for the alphanumeric string "abcdE567".
Note: Regular expressions are case sensitive unless case is deliberately modified.
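The literal matches described above can be tried in Python, one of the languages the primer mentions. This is a minimal sketch: the search terms come from the examples, while the sample sentences and variable names are assumptions made up for illustration.

```python
import re

# Hypothetical sample strings; only the search term "at" comes from the examples.
samples = ["at", "The cat sat", "Attention"]

# A literal regex matches anywhere in the string: alone, at the start,
# in the middle, or at the end.
found_at = [text for text in samples if re.search("at", text)]
print(found_at)

# Literal matches are case sensitive unless case is deliberately modified,
# here with the re.IGNORECASE flag.
case_sensitive = re.search("email", "Send an Email today")
case_insensitive = re.search("email", "Send an Email today", re.IGNORECASE)
print(case_sensitive, case_insensitive.group())

# Mixed letters and digits are matched literally as well.
alnum = re.search("abcdE567", "id=abcdE567;")
print(alnum.group())
```

"Attention" is not reported because its only "at" begins with an upper-case "A".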
Searching with Wildcards

In the previous examples, regular expressions are constructed with literal characters that match themselves. There are other characters in regex syntax that match in a more generalized way. These are called "metacharacters". Metacharacters do not match themselves, but rather perform a specific task when used in a regular expression. One such metacharacter is the dot ".", or wildcard. When used in a regular expression, the wildcard can match any single character.

Using the wildcard to match any character.

Example 1: Use a wildcard to search for any one character before the string "ubject:".
Example 2: Use three dots "
Example 3: Use several wildcards to match characters throughout a string.
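The wildcard behavior can be sketched in Python as follows. The pattern ".ubject:" follows Example 1; the other patterns and all sample strings are assumptions chosen for illustration. Note that in Python the dot does not match a newline unless the re.DOTALL flag is set.

```python
import re

# One wildcard before "ubject:" accepts any single leading character.
subject = re.compile(r".ubject:")
upper = subject.search("Subject: hello")
lower = subject.search("subject: hello")
missing = subject.search("ubject: hello")  # no character before "ubject:"

# Three dots require exactly three characters of any kind.
three = re.search(r"...", "ab")  # too short, so no match

# Wildcards can appear anywhere in a pattern.
words = re.findall(r"b.t", "bat bit b t but")

print(upper.group(), lower.group(), missing, three, words)
```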
Searching for Special Characters

In regular expression syntax, most non-alphanumeric characters are
treated as special characters. These characters, called "metacharacters", include
asterisks, question marks, dots, slashes, etc. To search for
a metacharacter without using its special attribute, precede it with a
backslash "\" to change it into a literal character. For example, to build a
regex to search for a

Precede the following metacharacters with a backslash "\" to search for them as literal characters:
Using the backslash "\" to escape special characters in a regular expression.

Example 1: Escape the dollar sign "$" to find the alphanumeric string "$100".
Example 2: Use the dot "." as a literal character to find a file called "email.txt".
Example 3: Escape the backslash "\" character to search for a Windows file.
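A short Python sketch of escaping is shown here. The strings "$100" and "email.txt" come from the examples; the Windows path and the surrounding sample text are hypothetical. Python's re.escape function is included as one way to escape a literal automatically.

```python
import re

# Escaped dollar sign: matches the literal text "$100".
price = re.search(r"\$100", "Total: $100 due")

# An unescaped dot is a wildcard, so it also accepts "emailXtxt".
loose = re.search(r"email.txt", "emailXtxt")
# An escaped dot only accepts a real ".".
strict = re.search(r"email\.txt", "emailXtxt")

# A literal backslash is written as "\\" in the pattern.
# The path below is a made-up example.
path = re.search(r"c:\\readme\.txt", r"open c:\readme.txt now")

# re.escape adds the backslashes for you.
escaped = re.escape("$100")

print(price.group(), bool(loose), strict, path.group(), escaped)
```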
Ranges and Repetition

Regex syntax includes metacharacters which specify the number of times a particular character or string must match. This group of metacharacters is called "quantifiers"; they influence the quantity of matches found. Quantifiers act on the element immediately preceding them, which could be a digit, a letter, a space, or another metacharacter such as the dot ".". This section demonstrates how quantifiers search using ranges and repetition.
Ranges, {min, max}

Ranges are considered a "counting quantifier" in regular expressions. This is
because they specify the minimum number of matches to find and the
maximum number of matches to allow. Use ranges in regex searches when a bound,
or a limit, should be placed on search results. For example, the range

Using ranges to identify search patterns.

Example 1: Match the preceding "0" at least 3 times with a maximum of 5 times.
Example 2: Using the "." wildcard to match any character sequence two or three characters long.
Example 3: Match the preceding "e" exactly twice.
Example 4: Match the preceding "w" exactly three times.
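The range quantifiers in these examples can be sketched in Python. The patterns are reconstructions from the example descriptions and the sample strings are assumptions, so treat this as an illustration rather than the primer's original patterns.

```python
import re

# {3,5}: at least three and at most five zeros in a row.
zeros = re.findall(r"0{3,5}", "10 1000 1000000")

# .{2,3}: any two or three characters; fullmatch requires the whole string to fit.
two_ok = re.fullmatch(r".{2,3}", "ab")
four_bad = re.fullmatch(r".{2,3}", "abcd")

# {2} and {3}: an exact count.
double_e = re.search(r"e{2}", "feed")
triple_w = re.search(r"w{3}", "www.example.com")

print(zeros, bool(two_ok), four_bad, double_e.group(), triple_w.group())
```

In the run of six zeros, the quantifier takes five and the single zero left over is too short to match again.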
Repetition, ?*+

Unlike range quantifiers, the repetition quantifiers (question mark "?", asterisk "*", and plus "+") have few limits when performing regex searches. This is significant because these quantifiers settle for the minimum number of required matches, but always attempt to match as many times as possible, up to the maximum allowed. For example, the question mark "?" matches the preceding character 0 or 1 times, the asterisk "*" matches the preceding character 0 or more times, and the plus "+" matches the preceding character 1 or more times. Use repetition quantifiers when the number of repetitions in the text is not known in advance.

Using repetition to search for repeated characters with few limits.

Example 1: Use "?" to match the "u" character 0 or 1 times.
Example 2: Use "*" to match the preceding item 0 or more times; use "." to match any character.
Example 3: Use "+" to match the preceding "5" at least once.
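The three repetition quantifiers can be compared side by side in Python. The patterns and sample strings below are assumptions that follow the example descriptions; the second case also shows the "match as many times as possible" behavior described above.

```python
import re

# "?": the "u" may appear zero times or once, but not twice.
spellings = re.findall(r"colou?r", "color colour colouur")

# ".*": any character, zero or more times. The match is greedy, so it
# runs to the last "z" in the string rather than the first.
greedy = re.search(r"a.*z", "a to z and back to z").group()

# "+": the "5" must appear at least once.
fives = re.findall(r"5+", "15 155 1")

print(spellings, greedy, fives)
```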
Quantifier Summary

The following table defines the various regex quantifiers. Note that each quantifier has its own minimum and maximum number of matches that it requires or allows.
Using Conditional Expressions

Conditional expressions help qualify and restrict regex searches, increasing the probability of a desirable match. The vertical bar "|" symbol, meaning "OR", places a condition on the regex to search for either one character in a string or another. Because the regex has a list of alternate choices to evaluate, this regex technique is called "alternation".

To search for either one character or another, insert a vertical bar "|" between the desired characters.

Example 1: Use "|" to alternate a search for various spellings of a string.
Example 2: Use "|" to alternate a search for either email or Email or EMAIL or e-mail.
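Alternation as described in Example 2 might look like this in Python. The four alternatives come from the example text; the sample sentence is made up, and "eMail" is added only to show a spelling that none of the alternatives cover.

```python
import re

# Each alternative is tried in turn at every position in the string.
email_variants = re.compile(r"email|Email|EMAIL|e-mail")

hits = email_variants.findall("email Email EMAIL e-mail eMail")
print(hits)
```

When the list of spellings differs only by case, a case-insensitive search is usually shorter than spelling out every alternative.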
Grouping Similar Items in Parentheses

Use parentheses to enclose a group of related search elements. Parentheses
limit the scope of alternation and create substrings to enhance searches with
metacharacters. For example, use parentheses to group the
expression

Using parentheses to group regular expressions.

Example 1: Use parentheses and a range quantifier to find instances of the string "abcabcabc".
Example 2: Use parentheses to limit the scope of alternative matches on the words gray and grey.
Example 3: Use parentheses and "|" to locate past correspondence in a mail-filtering program. This regex finds a 'To:' or a 'From:' line followed by a space and then either the word 'Smith' or the word 'Chan'.
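The grouping examples can be sketched in Python as below. The patterns are reconstructed from the example descriptions, and the sample mail header lines are hypothetical.

```python
import re

# A quantifier after a group repeats the whole group.
triple = re.fullmatch(r"(abc){3}", "abcabcabc")

# Parentheses limit the alternation to the single differing letter.
colors = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray grey griy")]

# Two groups: the header name, then a space, then the surname.
mail = re.compile(r"(To:|From:) (Smith|Chan)")
from_chan = mail.search("From: Chan")
cc_smith = mail.search("Cc: Smith")  # "Cc:" is not one of the alternatives

print(bool(triple), colors, from_chan.groups(), cc_smith)
```

In Python, each parenthesized group is also captured, which is why from_chan.groups() returns the header and the name separately.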
Matching Sequences

You can build a regular expression to match a sequence of characters. These sequences, called "character classes", simply place a set of characters side-by-side within square brackets "[]". A character class matches any single character from the set it contains. An item in a character class can be either an ordinary character, representing itself, or a metacharacter, performing a special function. This primer covers how to build simple character classes, prevent matches with character classes, and construct compound character classes with metacharacters.
Building Simple Character Classes

The most basic type of character class is a set of characters placed
side-by-side within square brackets "[]". For example, the regular expression

Note: When placed inside a character class, the hyphen "-" metacharacter denotes a continuous sequence of letters or numbers in a range. For example, [a-d] is a range of letters denoting the continuous sequence of a, b, c, and d. When a hyphen is otherwise used in a regex, it matches a literal hyphen.

Using simple character classes to perform regex searches.

Example 1: Use a character class to match all cases of the letter "s".
Example 2: Use a character class to limit the scope of alternative matches on the words gray and grey.
Example 3: Use a character class to match any one digit in the list.
Example 4: To simplify the previous example, use a hyphen "-" within a character class to denote a range for matching any one digit in the list.
Example 5: Use a hyphen "-" within a character class to denote an alphabetic range for matching various words ending in "mail".
Example 6: Match any three or more digits listed in the character class.
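The character class examples can be tried in Python with the sketch below. The patterns follow the example descriptions and the sample strings are assumptions.

```python
import re

# [Ss]: either case of the letter "s".
s_letters = re.findall(r"[Ss]", "Sassy")

# gr[ae]y: a class in place of the alternation (a|e).
colors = re.findall(r"gr[ae]y", "gray grey groy")

# [0123456789] and [0-9] are equivalent; each matches one digit.
listed = re.findall(r"[0123456789]", "a1b22")
ranged = re.findall(r"[0-9]", "a1b22")

# [a-z]mail: any one lower-case letter followed by "mail".
mails = re.findall(r"[a-z]mail", "email gmail Email")

# A class can take a quantifier: three or more digits in a row.
long_numbers = re.findall(r"[0-9]{3,}", "12 345 67890")

print(s_letters, colors, ranged, mails, long_numbers)
```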
Preventing Matches with Character Classes

Previous examples used character classes to specify exact sequences to
match. Character classes can also be used to prevent, or negate, matches with
undesirable strings. To prevent a match, use a leading caret "^"
(meaning NOT), within square brackets,

Note: The caret symbol must be the first character within the square brackets to negate a character class.

Using character classes to prevent a sequence from matching.

Example 1: Prevent a match on any numeric string. Use the "*" to match an item 0 or more times.
Example 2: Search for a text file beginning with any character not a lower-case letter.
Example 3: Prevent a match on the numbers "10" and "12".
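Negated character classes can be sketched in Python as follows. The original patterns for these examples are not shown here, so the ones below are assumptions that fit the descriptions; the file names and number list are made up.

```python
import re

# [^0-9]*: zero or more characters, none of which is a digit.
no_digits = re.fullmatch(r"[^0-9]*", "abc")
has_digit = re.fullmatch(r"[^0-9]*", "abc1")

# A .txt file name whose first character is not a lower-case letter.
upper_file = re.fullmatch(r"[^a-z].*\.txt", "Readme.txt")
lower_file = re.fullmatch(r"[^a-z].*\.txt", "readme.txt")

# 1[^02]: a "1" followed by anything except "0" or "2", so "10" and "12" are skipped.
teens = re.findall(r"1[^02]", "10 11 12 13")

# A caret that is not first in the brackets is just a literal caret.
literal_caret = re.findall(r"[0-9^]", "2^3")

print(bool(no_digits), has_digit, bool(upper_file), lower_file, teens, literal_caret)
```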
Compound Character Classes

Character classes are a versatile tool when combined with various pieces of the regex syntax. Compound character classes can help clarify and define sophisticated searches, test for certain conditions in a program, and filter wanted e-mail from spam. This section uses compound character classes to build meaningful expressions with the regex syntax.

Using compound character classes with the regex syntax.

Example 1: Find a partial e-mail address. Use a character class to denote a match for any number between 0 and 9. Use a range to restrict the number of times a digit matches.
Example 2: Search an HTML file to find each instance of a header tag. Allow matches on whitespace after the tag but before the ">".
Example 3: Match a regular 7-digit phone number. Prevent the digit "0" from leading the string.
Example 4: Match a valid web-based protocol. Escape the two front slashes.
Example 5: Match a valid e-mail address.
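A few of the compound examples can be approximated in Python. Every pattern below is an assumption written to fit the descriptions, not the primer's original. In particular, the e-mail pattern is a deliberately simplified sketch and does not cover every valid address. Python patterns are not delimited by slashes, so "/" needs no escaping here.

```python
import re

# Header tags <h1> to <h6> in either case, with optional whitespace before ">".
header = re.compile(r"<[Hh][1-6]\s*>")
headers = header.findall("<h1>Title</h1><H2 >Sub</H2><p>")

# Seven-digit phone number (3 digits, hyphen, 4 digits) that cannot start with 0.
phone = re.compile(r"[1-9][0-9]{2}-[0-9]{4}")
good_phone = phone.fullmatch("555-1234")
bad_phone = phone.fullmatch("055-1234")

# A few common web protocols followed by "://".
protocol = re.match(r"(http|https|ftp)://", "https://www.example.com")

# Simplified e-mail shape: name, "@", then at least two dot-separated labels.
email = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")
good_email = email.fullmatch("jane.doe@example.com")
bad_email = email.fullmatch("jane@example")

print(headers, bool(good_phone), bad_phone, protocol.group(1), bool(good_email), bad_email)
```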
Character Class Summary

The following table defines various character class sequences. Use these alphanumeric patterns to simplify your regex searches.
Matching Locations within a String

At times, the pattern to be matched appears at either the very beginning or
end of a string. In these cases, use a caret "^" to match a desired pattern at
the beginning of a string, and a dollar sign "$" for the end of the string. For
example, the regular expression

Note: The caret "^" has many meanings in regular expressions.
Its function is determined by its context. The caret can be used as an anchor to
match patterns at the beginning of a string, for example:

Using anchors to match at the beginning or end of a string.

Example 1: Use "$" to match the ".com" pattern at the end of a string.
Example 2: Use "^" to match "inter" at the beginning of a string, "$" to match "ion" at the end of a string, and ".*" to match any number of characters within the string.
Example 3: Use "^" inside parentheses to match "To" and "From" at the beginning of the string.
Example 4: Performing the same search as #3, place the caret "^" outside the parentheses this time for similar results.
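The anchor examples can be sketched in Python. The patterns follow the example descriptions; the sample strings are assumptions.

```python
import re

# "$" anchors ".com" to the end of the string.
ends_com = re.search(r"\.com$", "www.example.com")
not_end = re.search(r"\.com$", "www.example.com/index")

# "^inter" at the start, "ion$" at the end, ".*" for anything in between.
whole = re.search(r"^inter.*ion$", "interaction")
offset = re.search(r"^inter.*ion$", "an interaction")

# The caret inside each alternative, or once outside the group:
# both forms accept the same lines.
inside = re.search(r"(^To:|^From:)", "To: Smith")
outside = re.search(r"^(To:|From:)", "From: Chan")
mid_line = re.search(r"^(To:|From:)", "Reply To: Smith")

print(bool(ends_com), not_end, bool(whole), offset,
      inside.group(1), outside.group(1), mid_line)
```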
Searching and Replacing

Regular expressions are often used to search and replace text strings. Up until this point, the preceding examples have centered on matching a string using regex syntax. This section examines the search and replace operation as a prominent feature of regular expressions and solves standard problems using the substitution syntax. As with building regular expressions, there are many variations on substitution syntax depending on the language used. This primer focuses on general search and replace syntax. This standard syntax can later be applied to the specific language, tool, or application of your choice.
Building Simple Substitution Searches

Substitution searches search for and replace a pattern of text. Substitutions
are performed using the

For example:

Use the

Example 1: Search for the string "email" and replace it with "e-mail".
Example 2: Search for an old domain name and replace it with the new domain name. Using regex syntax, escape "." and "/" characters.
Example 3: Search for a single string starting with any lowercase letter and ending with "mail". Replace the string with "Email".
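In Python, search and replace is done with the re.sub function rather than a substitution operator. The sketch below applies it to the three examples; the domain names and sample sentences are hypothetical.

```python
import re

# Example 1: replace "email" with "e-mail".
hyphenated = re.sub(r"email", "e-mail", "Check your email")

# Example 2: replace an old domain with a new one. The dots in the pattern
# are escaped; "/" needs no escaping in Python. Both domains are made up.
moved = re.sub(r"http://old\.example\.com",
               "http://new.example.com",
               "Visit http://old.example.com/docs")

# Example 3: any lower-case letter followed by "mail" becomes "Email".
# re.sub replaces every occurrence by default; count=1 limits it to the first.
capitalized = re.sub(r"[a-z]mail", "Email", "gmail", count=1)

print(hyphenated, moved, capitalized)
```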
Modifying Substitution Searches

The previous substitution examples focused on small searches, such as
replacing a single lower-case word in a single line of text. Extend the
scope and flexibility of substitution searches through the use of
modifiers. The modifier parameter is appended to the end of the

Use the modifier "i" to ignore case in alphabetic searches, "m" to allow multiple lines in a string, "s" to treat a pattern as a single line, "x" to allow for whitespace and comments, and "g" to search globally for all occurrences of the pattern in a file and not just the first instance found.

Use various modifiers with the

Example 1: Using the "g" modifier, search globally through all ".htm" instances in a file and replace them with ".html". Using "$", only substitute the ".htm" string when it appears at the end of a line. An example file where this substitution succeeds:

    /manual/mod_python/pythonapi.htm
    /manual/mod_python/more-comp.htm
    /manual/mod_python/overview.htm
Example 2: Using the "g" modifier, remove all html tags in a file and replace the tags with an empty string.
Example 3: Perform a case insensitive search for various instances of "login" and replace with the string "password".
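Python expresses these modifiers as flags rather than trailing letters: re.IGNORECASE for "i", re.MULTILINE for "m", re.DOTALL for "s", and re.VERBOSE for "x". There is no "g" flag because re.sub already replaces every occurrence. The sketch below applies this to the three examples; the patterns and the HTML and login sample strings are assumptions.

```python
import re

# Example 1: the file listing from the text, as one multi-line string.
listing = ("/manual/mod_python/pythonapi.htm\n"
           "/manual/mod_python/more-comp.htm\n"
           "/manual/mod_python/overview.htm")
# re.MULTILINE lets "$" match at the end of every line, not just the last.
renamed = re.sub(r"\.htm$", ".html", listing, flags=re.MULTILINE)

# Example 2: remove anything that looks like a tag, "<" up to the next ">".
stripped = re.sub(r"<[^>]*>", "", "<p>Hello <b>world</b></p>")

# Example 3: case-insensitive replacement.
swapped = re.sub(r"login", "password", "Login LOGIN login", flags=re.IGNORECASE)

print(renamed)
print(stripped, "|", swapped)
```

The tag pattern is a simple approximation; it is not a full HTML parser and will misbehave on input such as a ">" inside an attribute value.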
Substitution Modifier Summary

The following table defines various modifiers for the substitution operator. Modifiers change how a match is performed. Use these modifiers to expand the scope and versatility of your substitutions.
More Regex Resources

Internet Web Sites:

Beginner:
Intermediate:
Advanced: