REGEX to remove sites, emails [closed]

0

I want to remove website addresses, email addresses, and the like from a text.

url_regex = re.compile(r'(?i)(<|\[)?(https?|www)?(.*)?\.(.*){2,4}')
mail_regex = re.compile(r'(?i)(<|\[)?@(.*)\.(.*){2,3}')

In this way, I could remove for example:

  

http://www.google.com.br

http://www.twitter.com

[image.jpeg]

www.facebook.com

[www.amazon.com]

[email protected]

[email protected]

...

When tested on a text, these regexes match the whole text and not just the site / email addresses.

    
asked by anonymous 11.07.2018 / 15:59

1 answer

1

The problem with both regular expressions is the .* they contain. The . matches any character (except a newline) and the * quantifier is greedy, meaning it will try to match as many characters as possible in the string.
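To see the greediness in action, the question's url_regex can be run on a single line. This is only a small sketch; the sample sentence is made up for illustration and is not part of the question.

```python
import re

# The url_regex from the question, unchanged.
url_regex = re.compile(r'(?i)(<|\[)?(https?|www)?(.*)?\.(.*){2,4}')

# Hypothetical sample line (not from the question).
text = "Visit www.facebook.com for more info"

match = url_regex.search(text)

# The greedy .* swallows everything before the last dot,
# and the trailing (.*) swallows everything after it.
print(repr(match.group(0)))  # 'Visit www.facebook.com for more info'
```

The match covers the whole line, not just the address, which is the behavior described in the question.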

Ideally, whenever possible, build a regular expression that has a stop criterion. For example, can a URL or an email address contain whitespace? If not, the stop criterion is the whitespace character. Alternatively, an email or URL can only contain letters, digits and a few other characters (., -, _), so you can match everything until you find a character that is not one of those.
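The difference a stop criterion makes can be sketched as follows. The sample sentence and the variable names are assumptions chosen for illustration only.

```python
import re

# Hypothetical sample sentence (not from the question).
text = "Write to someone@example.com or call us."

# No stop criterion: .* runs to the end of the line.
greedy = re.search(r'@.*', text).group(0)

# Stop criterion: only letters, digits, '_', '.' and '-' are allowed,
# so the match ends at the first character outside that set (the space).
bounded = re.search(r'@[a-z0-9_.-]+', text, re.IGNORECASE).group(0)

print(repr(greedy))   # '@example.com or call us.'
print(repr(bounded))  # '@example.com'
```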

Let's define that an email has only letters, digits and a few other characters (., -, _), with an @ in the middle. A regular expression that fully validates email addresses is much more complex than this, but this one accepts about 98% of existing emails.

import re

mail_regex = re.compile(r'([a-z0-9_.-]+@[a-z0-9_.-]+)', re.IGNORECASE)

This regular expression has two parts. The first accepts one or more characters from a to z, digits, and the three special characters defined above. Then comes a literal @, followed by the second part, which accepts the same characters as the first.
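As a rough check of the email pattern, the sketch below runs it over a made-up sentence; the addresses are placeholders, not data from the question.

```python
import re

mail_regex = re.compile(r'([a-z0-9_.-]+@[a-z0-9_.-]+)', re.IGNORECASE)

# Hypothetical sample sentence with two placeholder addresses.
text = "Contact alice@example.com or Bob.Smith@mail.example.org."

# With a single capturing group, findall returns the captured strings.
found = mail_regex.findall(text)
print(found)  # ['alice@example.com', 'Bob.Smith@mail.example.org.']
```

Note that the second result keeps the sentence's final period, because . is one of the accepted characters. This is one of the cases where this simple pattern is looser than a real validator.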

Matching a URL works the same way; the difference is that our anchor is at the beginning of the address (http://, www or [).

url_regex = re.compile(r'((http://|www|\[)[a-z0-9_.-]+]?)', re.IGNORECASE)

In this regular expression, we first check whether the address starts with http://, www or [. If so, we then match letters, digits and the other allowed characters. The only other difference is that we optionally match a closing ] at the end, in case the URL is surrounded by brackets.
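A small sketch of how the URL pattern behaves on individual lines from the question. The https address at the end is an added, hypothetical input used to show a limit of the pattern as written.

```python
import re

url_regex = re.compile(r'((http://|www|\[)[a-z0-9_.-]+]?)', re.IGNORECASE)

# Lines taken from the question's sample text.
lines = [
    "http://www.google.com.br",
    "[www.amazon.com]",
    "[image.jpeg]",
    "www.facebook.com",
]

# Each line is matched in full, including the optional closing bracket.
matched = [url_regex.search(line).group(0) for line in lines]
print(matched == lines)  # True

# Hypothetical extra input: the pattern only knows http://, so an https
# address without www is not matched at all.
https_match = url_regex.search("https://example.com")
print(https_match)  # None
```

If https links matter, replacing http:// with https?:// in the pattern is one possible adjustment.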

Finally, running these expressions on the text you posted, we get the following result:

print(mail_regex.sub('E-MAIL', text))
http://www.google.com.br

http://www.twitter.com

[image.jpeg]

www.facebook.com

[www.amazon.com]

E-MAIL

E-MAIL

And for the URLs:

print(url_regex.sub('URL', text))
URL

URL

URL

URL

URL

[email protected]

[email protected]
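Since the goal in the question is to remove the addresses rather than label them, both substitutions can be chained. The following is only a sketch under a few assumptions: strip_addresses is a hypothetical helper name, the sample sentence is made up, and the URL pattern uses https?:// instead of http:// so that https links are covered as well.

```python
import re

MAIL_REGEX = re.compile(r'[a-z0-9_.-]+@[a-z0-9_.-]+', re.IGNORECASE)
# Assumption: https?:// instead of http://, so https links are removed too.
URL_REGEX = re.compile(r'(https?://|www|\[)[a-z0-9_.-]+\]?', re.IGNORECASE)


def strip_addresses(text):
    """Hypothetical helper: remove email addresses first, then URLs."""
    # Emails go first, because the '[' alternative of the URL pattern
    # would otherwise eat the local part of a bracketed address.
    without_mail = MAIL_REGEX.sub('', text)
    return URL_REGEX.sub('', without_mail)


# Hypothetical sample sentence (not from the question).
sample = "See https://www.example.com or write to someone@example.com now"
cleaned = strip_addresses(sample)
print(repr(cleaned))  # 'See  or write to  now'
```

The removed addresses leave double spaces behind; whether to collapse them depends on what the cleaned text is used for.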
    
11.07.2018 / 18:24