The problem with these regular expressions is the .* operator. The * quantifier is greedy, meaning it tries to match as many characters as possible in the string.
Whenever possible, construct a regular expression that has a stopping criterion. For example, can a URL or an email address contain white space? If it cannot, the stopping criterion is the whitespace character. Alternatively, if an email or URL can only contain letters, numbers and a few characters (., -, _), you can match everything until you find a character that is not one of those.
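To make the difference concrete, here is a minimal sketch comparing a greedy .* pattern with one that stops at whitespace. The sample line and both patterns are made up for illustration; they are not the expressions from the question.

```python
import re

# Hypothetical sample line, not taken from the question.
sample = "contact: john@example.com and mary@example.org today"

# Greedy: .* on both sides of @ swallows as much of the line as it can.
greedy = re.search(r'.*@.*', sample).group(0)

# Bounded: \S+ stops at the first whitespace character on each side.
bounded = re.findall(r'\S+@\S+', sample)

print(greedy)   # the whole line
print(bounded)  # only the two addresses
```

The greedy version returns the entire line as a single match, while the bounded version isolates each address because whitespace acts as the stopping criterion.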
Let's define that an email has only letters, numbers and a few characters (., -, _) and has @ in the middle. A regular expression that fully validates email addresses is much more complex than this, but this one accepts 98% of existing emails.
import re
mail_regex = re.compile(r'([a-z0-9_.-]+@[a-z0-9_.-]+)', re.IGNORECASE)
In this regular expression we have two parts. The first accepts one or more characters from a to z, digits, and the three special characters we defined. We then expect an @ character, followed by the second part, which accepts the same characters as the first.
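As a quick check of how the two parts behave, the sketch below runs the same pattern over a made-up sentence (the addresses are hypothetical, not from the question).

```python
import re

mail_regex = re.compile(r'([a-z0-9_.-]+@[a-z0-9_.-]+)', re.IGNORECASE)

# Hypothetical sample text used only for illustration.
sample = "Write to John.Doe@Example.com or support_team@example.org."

# findall returns the text captured by the single group for each match.
found = mail_regex.findall(sample)
print(found)
```

Note that the second address comes back with the sentence's final period attached, because . is one of the accepted characters. If that matters for your text, you may want to strip trailing punctuation from each match afterwards.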
Matching a URL works the same way; the difference is that our anchor is at the beginning of the text (http://, www or [).
url_regex = re.compile(r'((http://|www|\[)[a-z0-9_.-]+]?)', re.IGNORECASE)
In this regular expression, we first check whether the text begins with http://, www or [. If so, we look for letters, numbers, and the other allowed characters. The only difference here is that we also optionally match a closing ] at the end, in case the URL is surrounded by brackets.
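A small sketch of the three anchors in action, using lines like the ones in the output below plus one line ("plain text", a made-up control) that should be left alone:

```python
import re

url_regex = re.compile(r'((http://|www|\[)[a-z0-9_.-]+]?)', re.IGNORECASE)

samples = [
    "http://www.google.com.br",  # anchored by http://
    "www.facebook.com",          # anchored by www
    "[www.amazon.com]",          # anchored by [, closing ] consumed by ]?
    "plain text",                # no anchor, so nothing is replaced
]

replaced = [url_regex.sub('URL', s) for s in samples]
print(replaced)
```

The alternation is tried left to right, so a line starting with http://www. is matched through the http:// branch and the www. that follows is consumed by the character class.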
Finally, running these expressions on the text you posted, we get the following result:
print(mail_regex.sub('E-MAIL', text))
http://www.google.com.br
http://www.twitter.com
[image.jpeg]
www.facebook.com
[www.amazon.com]
E-MAIL
E-MAIL
And for the URLs:
print(url_regex.sub('URL', text))
URL
URL
URL
URL
URL
[email protected]
[email protected]