In Python regex, a few metacharacters behave slightly differently when the input spans multiple lines. In this tutorial we will explore these differences and explain how to properly apply regular expressions to multiline text.
Regular Expressions with Multiline Text
Normally, the . character does not match the newline character. This causes problems when we are dealing with multiline strings. We can see this problem in the output of the code below.
import re
mystring=\
"""<p>This is a paragraph</p>
<p>This is also
a random paragraph</p>
"""
print(re.findall("<p>.*?</p>", mystring))
['<p>This is a paragraph</p>']
The purpose of the above code was to identify all the <p> tags and print them out along with their contents. Unfortunately, only the first one is being printed.
This is because our second pair of <p> tags spans multiple lines. By default, the . character does not match newline characters, so the pattern cannot reach the closing </p> tag, giving us an incomplete result.
We can fix this problem by passing the re.DOTALL flag, which makes the . character match any character, including the newline (by default, the newline is the only character it excludes). We have made this change in the code below.
mystring=\
"""<p>This is a paragraph</p>
<p>This is also
a random paragraph</p>
"""
print(re.findall("<p>.*?</p>", mystring, re.DOTALL))
['<p>This is a paragraph</p>', '<p>This is also\na random paragraph</p>']
We now have the correct output!
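As a small aside, the same behavior can be switched on in a couple of equivalent ways. The sketch below uses a shortened, hypothetical sample string (not the one from the tutorial) and compares the default behavior with re.DOTALL, its short alias re.S, and the inline (?s) flag placed at the start of the pattern.

```python
import re

# Hypothetical sample text, shortened from the tutorial's example.
text = "<p>one</p>\n<p>two\nlines</p>\n"
pattern = "<p>.*?</p>"

# Default: . stops at newlines, so the second paragraph is missed.
default_matches = re.findall(pattern, text)

# Three equivalent ways of letting . match newlines as well.
dotall_matches = re.findall(pattern, text, re.DOTALL)
short_flag_matches = re.findall(pattern, text, re.S)       # re.S is an alias of re.DOTALL
inline_flag_matches = re.findall("(?s)" + pattern, text)   # inline flag inside the pattern

print(default_matches)   # ['<p>one</p>']
print(dotall_matches)    # ['<p>one</p>', '<p>two\nlines</p>']
```

The inline form can be handy when the pattern is stored as a plain string and there is no convenient place to pass a flags argument.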
There is one more point of interest here. We need to use a non-greedy approach to pattern matching. By default, quantifiers such as * are greedy and match as much text as possible, so we add the ? special character after the quantifier to make it match as little as possible. The code below shows what happens without it.
mystring=\
"""<p>This is a paragraph</p>
<p>This is also
a random paragraph</p>
"""
print(re.findall("<p>.*</p>", mystring, re.DOTALL))
['<p>This is a paragraph</p>\n<p>This is also\na random paragraph</p>']
Here the output list does not have two elements. Instead, we have one large match running from the first <p> to the last </p>. This is because the greedy .* connects the <p> tag of the first paragraph to the </p> tag of the second paragraph.
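The difference between greedy and non-greedy matching is not specific to multiline text. As a minimal sketch, the example below uses a hypothetical single-line string with <b> tags to compare the two forms side by side.

```python
import re

# Hypothetical single-line sample, used only to isolate greedy vs non-greedy.
text = "<b>bold</b> and <b>more bold</b>"

# Greedy: .* grabs as much as possible, so one match spans both tag pairs.
greedy = re.findall("<b>.*</b>", text)

# Non-greedy: .*? stops at the first closing tag it can reach.
lazy = re.findall("<b>.*?</b>", text)

print(greedy)  # ['<b>bold</b> and <b>more bold</b>']
print(lazy)    # ['<b>bold</b>', '<b>more bold</b>']
```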
The MULTILINE Flag
Another special regex flag we might need is re.MULTILINE. Sometimes we may wish to treat each line in our text as a separate entity. When we apply a regex pattern with the re.MULTILINE flag, the ^ and $ metacharacters match at the start and end of each line, rather than only at the start and end of the text as a whole.
This changes the results of patterns that use the ^ and $ metacharacters. In the code below, for example, we only get one match. This is actually correct, but chances are that what we want is something a bit different.
mystring="""This is some random text.
Hello World.
This is Goodbye.
"""
print(re.findall("^This.*", mystring))
['This is some random text.']
If we use the Multiline flag, we get the following output.
mystring="""This is some random text.
Hello World.
This is Goodbye.
"""
print(re.findall("^This.*", mystring, re.MULTILINE))
['This is some random text.', 'This is Goodbye.']
Here we have both lines that start with “This” printed out. This is because ^ now matches at the start of each of the three lines, and the first and third lines match the rest of the pattern.
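The $ metacharacter is affected in the same way. As a rough sketch, the example below reuses the text from above with a hypothetical pattern, \w+\.$, which picks out the last word of a line together with its trailing period.

```python
import re

text = "This is some random text.\nHello World.\nThis is Goodbye.\n"

# Without re.MULTILINE, $ matches only at the end of the string
# (or just before a single trailing newline).
last_only = re.findall(r"\w+\.$", text)

# With re.MULTILINE, $ also matches just before every newline.
every_line = re.findall(r"\w+\.$", text, re.MULTILINE)

print(last_only)   # ['Goodbye.']
print(every_line)  # ['text.', 'World.', 'Goodbye.']
```

If a pattern needs both behaviors at once, the flags can be combined with the | operator, for example re.MULTILINE | re.DOTALL.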