Regular expression to filter URLs


I have a web crawler that extracts links from websites, but it captures many links that I do not need.

I would like a regular expression to filter the links found, so that only links like these would pass:

link

link

asked by anonymous 15.07.2017 / 00:45

1 answer


There are two ways to do what you described in the question: you can capture only the links that do not contain the characters you mentioned ( # and ! ), or you can match the links that do contain them and discard those.

As I do not know how your application works, I'll leave both ways here. Since you did not mention the RegEx flavor or the language it will be used in, I will assume something like the PCRE flavor (PHP).

If you want to identify strings that CONTAIN # or !, use this pattern:

(?=.*#|.*!)(.*)

To get the result you want, you should find all matches of this expression and discard them; here is a test I did to better visualize the result.
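As a sketch of this first approach, here is a small Python example (Python's `re` uses the same lookahead syntax as PCRE; the sample URLs are made up for illustration). Links that match the pattern contain # or ! and get discarded:

```python
import re

# Positive-lookahead pattern from the answer: succeeds only if the
# string contains "#" or "!" somewhere; "(.*)" then captures the line.
pattern = re.compile(r"(?=.*#|.*!)(.*)")

# Hypothetical links a crawler might have collected.
links = [
    "http://example.com/page",
    "http://example.com/page#section",
    "http://example.com/!special",
]

# These are the links to discard: each one matched, so it has "#" or "!".
matched = [link for link in links if pattern.match(link)]
print(matched)
```

Everything that did *not* match is what you keep.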

If you want to identify the strings that DO NOT contain the pattern:

((?!.*#|.*!)(.*))

In this case you should keep only the matches and discard the strings that were not captured; here is another test, but with that expression.
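The second approach, sketched the same way (same made-up sample links): the negative lookahead rejects any string containing # or !, so only the clean links match at all:

```python
import re

# Negative-lookahead pattern: the match fails outright when the string
# contains "#" or "!", so only clean links produce a match.
pattern = re.compile(r"((?!.*#|.*!)(.*))")

# Hypothetical links a crawler might have collected.
links = [
    "http://example.com/page",
    "http://example.com/page#section",
    "http://example.com/!special",
]

# Keep only the links that matched.
clean = [link for link in links if pattern.match(link)]
print(clean)
```

This version is usually the more convenient one for a crawler, since the matches are exactly the links you want to keep.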

Explanation:

The two regexes work similarly, but one uses a positive lookahead ( (?= ) and the other a negative lookahead ( (?! ). A lookahead is a zero-width assertion: it checks whether a pattern exists ahead in the string without consuming any characters. The condition inside the lookahead ( .*#|.*! ) says that there may be any run of letters, numbers, or symbols, followed by a # or a ! .

Finally, ( .* ) captures the whole string: in the positive version it captures the strings that do contain # or !, and in the negative version only the strings that do not.
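A quick Python illustration of the zero-width point above, using a minimal made-up string: the lookahead checks the condition but consumes nothing, so the capture group still starts at the beginning of the string:

```python
import re

# "(?=.*#)" asserts a "#" exists ahead without consuming characters,
# so "(.*)" still captures from position 0 — the entire string.
m = re.match(r"(?=.*#)(.*)", "a#b")
print(m.group(1))
```

The captured group is the full string "a#b", not just the part after the # that the lookahead inspected.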

17.07.2017 / 17:52