I have a web crawler that collects links from websites, but it captures many links that I do not need.
I would like a regular expression to filter the links found, so that only links like these pass.
There are a couple of ways to do what you describe in the question: you can match only the links that do not contain the characters you mentioned (! and #), or you can match the links that do contain them and discard those.
Since I do not know how your application works, I'll show both approaches. You did not mention the regex flavor or the language it will be used in, so I'll assume something like PCRE (PHP).
If you want to identify the strings that CONTAIN # or !, use this pattern:
(?=.*#|.*!)(.*)
To get the result you want, identify all matches of this expression and discard them. Here is a test I did to better visualize the result.
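As a rough sketch of this first approach, the snippet below applies the pattern to a list of links and drops every link that matches. It uses Python rather than PHP, on the assumption that the lookahead syntax behaves the same in Python's `re` module for this pattern. The function name `drop_unwanted` and the sample URLs are hypothetical, chosen only for illustration.

```python
import re

# Pattern from the answer: the lookahead succeeds when '#' or '!' appears
# somewhere ahead on the line; (.*) then captures the line itself.
HAS_UNWANTED = re.compile(r"(?=.*#|.*!)(.*)")

def drop_unwanted(links):
    """Return only the links that do NOT match the pattern."""
    # re.match tries the pattern at the start of each link.
    return [link for link in links if not HAS_UNWANTED.match(link)]

# Hypothetical sample links, for illustration only.
sample = [
    "http://example.com/page",
    "http://example.com/page#section",
    "http://example.com/#!/app",
]

print(drop_unwanted(sample))  # ['http://example.com/page']
```

Because the matches are the links to throw away, the filter keeps a link only when `match` returns `None`.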
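If you want to identify the strings that DO NOT contain those characters, use this pattern. Note that a negative lookahead is written (?! with no = after it, and that the ^ anchor is needed so the engine cannot start the match later in the string, past the # or !:
^(?!.*#|.*!)(.*)$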
In this case, keep only the matches and discard the strings that were not matched. Here is another test, this time with that expression.
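The second approach can be sketched the same way, this time keeping the matches. The snippet assumes the negative lookahead is written `(?!` and anchored with `^` and `$`, and that the crawler's links arrive as one link per line; the `re.MULTILINE` flag, the function name `keep_clean`, and the sample text are assumptions made for this illustration, written in Python rather than PHP.

```python
import re

# Negative lookahead, anchored per line: a line matches only when neither
# '#' nor '!' appears anywhere on it.
NO_UNWANTED = re.compile(r"^(?!.*#|.*!)(.*)$", re.MULTILINE)

def keep_clean(text):
    """Return the lines of `text` that contain neither '#' nor '!'."""
    # findall returns the content of group 1 for every match;
    # empty lines are filtered out.
    return [line for line in NO_UNWANTED.findall(text) if line]

# Hypothetical crawler output, one link per line.
found = "\n".join([
    "http://example.com/page",
    "http://example.com/page#section",
    "http://example.com/#!/app",
    "http://example.com/about",
])

print(keep_clean(found))
# ['http://example.com/page', 'http://example.com/about']
```

Here nothing has to be discarded afterwards: every match is already a link worth keeping.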
Explanation:
The two regexes work similarly, but one uses a positive lookahead (?=) and the other a negative lookahead (?!).
A lookahead is a zero-width assertion: it checks whether a given pattern can be matched from the current position, without consuming any characters. A positive lookahead succeeds when the pattern is found, and a negative lookahead succeeds when it is not.
Inside the lookahead is the condition to check (.*#|.*!): any sequence of characters (letters, digits or symbols) followed by # or !.
At the end there is (.*), which captures the whole string. With the positive lookahead this happens only when the string contains # or !; with the negative lookahead it happens only when the string contains neither.
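To make the explanation concrete, here is a small Python experiment (again assuming Python's `re` treats these lookaheads like PCRE does). It shows that a lookahead on its own matches an empty span, that the following (.*) still starts capturing at the beginning of the string, and what changes when an = is placed after (?!. The test string `page#top` is made up for the example.

```python
import re

text = "page#top"

# A lookahead alone consumes nothing: the match is empty.
empty = re.match(r"(?=.*#)", text)
print(empty.span())  # (0, 0)

# Because nothing was consumed, (.*) captures from position 0.
captured = re.match(r"(?=.*#)(.*)", text)
print(captured.group(1))  # page#top

# "(?!=" is a negative lookahead whose inner pattern starts with a
# literal '=', so it does not reject a string that contains '#'.
with_equals = bool(re.match(r"^(?!=.*#)(.*)$", text))
without_equals = bool(re.match(r"^(?!.*#)(.*)$", text))
print(with_equals, without_equals)  # True False
```

The last two lines are the reason the negative lookahead should be written without the = sign: with it, the assertion only fails for strings that begin with a literal =.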