Python Regex, short for “Regular Expression”, is a sequence of characters that defines a search pattern. This search pattern is then used to perform operations on strings, such as “find” or “find and replace”.
In short, Regex can be used to check if a string contains a specified string pattern.
Python's regex library, re, is part of the standard library, so there is nothing to install. There is no need to run pip install; re ships with every Python installation.
Simply import the library into your code using the following command.
import re
If the above command runs without any errors, the library is ready to use.
(Meta) Special Characters
Special characters (also called Meta characters) are used to symbolize regex patterns or formats. Below is a list of them, and what they represent in regular expressions.
Character | Description |
---|---|
[] | Represents a set of characters |
. | Any character (Except for new line characters) |
+ | One or more occurrences |
* | Zero or more occurrences |
? | Zero or one occurrence |
{} | A specified number of occurrences |
^ | Starts with… |
$ | Ends with… |
\| | Either/Or |
re.match Function
To demonstrate the use of the above characters, we’ll be using the re.match function. This function takes two inputs: the pattern, and the string on which the pattern is to be applied.
re.match(pattern, string)
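If the pattern matches at the beginning of the string, a match object is returned; otherwise None is returned. Note that re.match only checks the start of the string, so a pattern that appears later in the string does not count as a match.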
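The start-of-string behaviour of re.match is easy to miss, so here is a minimal sketch of it. The sample string and variable names are made up purely for illustration.

```python
import re

# Hypothetical sample string, chosen only for illustration.
text = "Hello World"

at_start = re.match("Hello", text)   # pattern is at the beginning
in_middle = re.match("World", text)  # pattern appears later in the string

print(at_start.group())  # Hello
print(in_middle)         # None
```

Since a failed match returns None, it is usually worth checking the result before calling .group() on it.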
re.findall Function
Another function we’ll be using is the re.findall function. Like the previous one, this function also takes two inputs: the pattern, and the string on which the pattern is to be applied.
re.findall(pattern, string)
This function returns a list of all the matches found within the string. There is also the re.search function, a variation which returns only the first match (as a match object), or None if nothing matches.
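To make the difference concrete, the following sketch runs re.findall and re.search with the same pattern on the same string. The sample text is an assumption made for this example only.

```python
import re

# Hypothetical sample string, chosen only for illustration.
text = "cat bat rat"

all_matches = re.findall("[a-z]at", text)  # list of every match
first_match = re.search("[a-z]at", text)   # match object for the first match only

print(all_matches)          # ['cat', 'bat', 'rat']
print(first_match.group())  # cat
```

Unlike re.match, re.search scans the whole string, so the first match does not have to sit at the very beginning.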
(Meta) Special Characters in Regex
Regular Expressions (Regex) in Python have a large number of special characters used to create search queries. You can use these characters in all kinds of ways, even chaining together several of them to create a complex query. We’ll be explaining each of them character by character below.
Character: []
This is the simplest, but also the most commonly used, character in Python regular expressions. It can be used in a variety of different ways, producing a different result each time.
It can be used to locate single characters (1st and 3rd examples) as well as locate between a range of characters (2nd example). Lastly, you can also pass several characters at once to locate them all (4th example).
Sample = "Hello World"
Sample2 = "This website is called CodersLegacy"
print(re.findall("[a]", Sample))
print(re.findall("[a-z]", Sample))
print(re.findall("[a]", Sample2))
print(re.findall("[aoiue]", Sample2))
#OUTPUT
[]
['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
['a', 'a']
['i', 'e', 'i', 'e', 'i', 'a', 'e', 'o', 'e', 'e', 'a']
Note that ‘H’ and ‘W’ were excluded in the second example because they are capital letters, and the space was excluded because it is not a letter at all.
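If the capital letters should be included as well, one option is to put more than one range inside the same set. The sketch below reuses the "Hello World" string from above; the variable names are just for illustration.

```python
import re

sample = "Hello World"

lower_only = re.findall("[a-z]", sample)     # lowercase letters only
upper_only = re.findall("[A-Z]", sample)     # uppercase letters only
any_letter = re.findall("[A-Za-z]", sample)  # both ranges in one set

print(lower_only)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
print(upper_only)  # ['H', 'W']
print(any_letter)  # ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
```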
Character: .
This character isn’t really used alone; it’s meant to be used alongside other characters in a multi-character search query. It acts as a placeholder that stands for any single character, as you’ll see in the examples below.
sample = "House Home Whole Hole Hope"
result1 = re.findall("H..e", sample)
result2 = re.findall("H...e", sample)
print(result1)
print(result2)
#OUTPUT
['Home', 'Hole', 'Hope']
['House']
This character is also called the “wildcard”.
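As the table at the top notes, the wildcard matches any character except a newline. Here is a small sketch of that exception, using a made-up string that contains a line break.

```python
import re

# Hypothetical sample: "a" and "c" separated by a letter, a newline, and a hyphen.
sample = "abc a\nc a-c"

result = re.findall("a.c", sample)
print(result)  # ['abc', 'a-c']
```

The middle candidate is skipped because the character between a and c is a newline.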
Character: +
The + character matches one or more occurrences of the element before it. It adds flexibility when you don’t want to “hard-code” the exact number of repetitions, and is best combined with other characters to form a more complex search pattern.
import re
sample = "a aa aaa b"
result1 = re.findall("a+", sample)
result2 = re.findall("a", sample)
print(result1)
print(result2)
#OUTPUT
['a', 'aa', 'aaa']
['a', 'a', 'a', 'a', 'a', 'a']
We gave an example without the + character for the sake of reference.
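A common way to combine + with other characters is to place it after a set. The sketch below pulls whole numbers out of a sentence; the sentence itself is invented for this example.

```python
import re

# Hypothetical sample sentence, chosen only for illustration.
sample = "Order 66 shipped 1024 items in 3 boxes"

numbers = re.findall("[0-9]+", sample)  # runs of one or more digits
digits = re.findall("[0-9]", sample)    # every digit on its own

print(numbers)  # ['66', '1024', '3']
print(digits)   # ['6', '6', '1', '0', '2', '4', '3']
```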
Character: *
This character, like the previous +, allows us to bring some flexibility into our searches. It matches zero or more occurrences of the element before it. Using it with the . character means that there may or may not be a character in between h and t, and that there may also be more than one.
sample = "hat ht hoot"
result = re.findall("h.*t", sample)
print(result)
#OUTPUT
['hat ht hoot']
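As you can see, findall returns a single match that spans the whole string instead of three separate words. The * character is greedy, so .* consumes as many characters as it can (spaces included) and the match only ends at the last t. Tested against each word on its own, h.*t matches hat, ht and hoot alike, even though they have different middle characters as well as a different number of middle characters.
If we were to replace * with + here, the word ht on its own would no longer match, because + requires at least one character between h and t.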
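Because .* is greedy, it stretches across the spaces in the sample and returns one long match. The sketch below shows two ways to look at the words individually: testing each word separately with re.fullmatch, and using the non-greedy form .*? (which matches as few characters as possible). Both approaches are illustrations added here, not part of the original example.

```python
import re

sample = "hat ht hoot"
words = sample.split()  # ['hat', 'ht', 'hoot']

# Test each word on its own against the whole pattern.
star_words = [w for w in words if re.fullmatch("h.*t", w)]
plus_words = [w for w in words if re.fullmatch("h.+t", w)]

# Non-greedy version: stop at the first t instead of the last one.
lazy_matches = re.findall("h.*?t", sample)

print(star_words)    # ['hat', 'ht', 'hoot']
print(plus_words)    # ['hat', 'hoot']
print(lazy_matches)  # ['hat', 'ht', 'hoot']
```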
Character: ?
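Similar to the * character, but it matches at most one occurrence of the element before it. If we run the code from the previous example with ? in place of *, hoot no longer matches because it has two characters between h and t: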
sample = "hat ht hoot"
result = re.findall("h.?t", sample)
print(result)
#OUTPUT
['hat', 'ht']
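The ? character is often placed after a single literal character to make it optional. A minimal sketch, using an invented string with two spellings of the same word:

```python
import re

# Hypothetical sample, chosen only for illustration.
sample = "color colour colouur"

result = re.findall("colou?r", sample)  # the "u" may appear zero or one time
print(result)  # ['color', 'colour']
```

The last word is not matched because it contains two u characters, which is more than ? allows.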
Character: {}
sample1 = "elephant"
sample2 = "lion"
sample3 = "cat"
print(re.findall("[a-z]{4,}", sample1))
print(re.findall("[a-z]{4,}", sample2))
print(re.findall("[a-z]{4,}", sample3))
#OUTPUT
['elephant']
['lion']
[]
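Curly braces specify how many times the element before them must repeat: {n} means exactly n times, {n,} means n or more times, and {n,m} means between n and m times. The above code will only trigger a match if the word contains 4 or more lowercase letters in a row.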
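The other two forms of the curly braces can be sketched the same way. The sample string of digits below is made up for illustration.

```python
import re

# Hypothetical sample, chosen only for illustration.
sample = "12 345 6789"

exactly_three = re.findall("[0-9]{3}", sample)   # exactly 3 digits in a row
two_to_three = re.findall("[0-9]{2,3}", sample)  # between 2 and 3 digits in a row

print(exactly_three)  # ['345', '678']
print(two_to_three)   # ['12', '345', '678']
```

Note that 6789 only contributes 678: once three digits are consumed, the single 9 left over is too short to match either pattern.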
Character: ^
The ^ character anchors the pattern to the start of the string. The first and second examples match strings that start with an uppercase and a lowercase letter respectively. The third example matches a string that starts with a digit. The .* is just there to return the rest of the string.
sample1 = "Lion"
sample2 = "elephant"
sample3 = "123pop"
print(re.findall("^[A-Z].*", sample1))
print(re.findall("^[a-z].*", sample2))
print(re.findall("^[0-9].*", sample3))
#OUTPUT
['Lion']
['elephant']
['123pop']
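To see what ^ changes, it helps to run the same pattern with and without it. The sample string in this sketch is an assumption made for the example.

```python
import re

# Hypothetical sample, chosen only for illustration.
sample = "123pop and 456"

anywhere = re.findall("[0-9]+", sample)         # digits anywhere in the string
at_start = re.findall("^[0-9]+", sample)        # digits only at the very start
letters_start = re.findall("^[a-z]+", sample)   # string does not start with a letter

print(anywhere)       # ['123', '456']
print(at_start)       # ['123']
print(letters_start)  # []
```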
Character: $
The opposite of the ^ character, the $ character is used to locate/match strings which end with a certain character or combination of characters.
sample1 = "Toilet"
sample2 = "Bedroom"
sample3 = "Balcony"
sample4 = "Livingroom"
print(re.findall("room$", sample1))
print(re.findall("room$", sample2))
print(re.findall("room$", sample3))
print(re.findall("room$", sample4))
#OUTPUT
[]
['room']
[]
['room']
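The | character from the table at the top (either/or) combines naturally with $. As a rough sketch, the pattern below accepts either of two endings; the second ending, cony, is chosen here only so that Balcony matches.

```python
import re

samples = ["Toilet", "Bedroom", "Balcony", "Livingroom"]

# Match strings ending in "room" OR ending in "cony".
results = [re.findall("room$|cony$", s) for s in samples]

print(results)  # [[], ['room'], ['cony'], ['room']]
```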
This marks the end of the Python Regex MetaCharacters Tutorial.