Extract the URL from links with a regular expression

4

I'm trying to extract only the URL from a link, and only when the link contains the [monitory] tag.

The expression I use is this:

(?=<a.*\[monitory\].*href=["|'][http:|https:]?[\/\/]).*?["|']>

For example, from a link like this:

<a [monitory] href="http://www.google.com">Google</a>

I want to extract only the address:

http://www.google.com
    
asked by anonymous 16.11.2018 / 00:32

2 answers

5

First, let's look at some details of your regex, along with suggestions for improving it.

Using .* is always tempting, but "dangerous", since it means "zero or more occurrences of any character". Also, the quantifier is greedy, meaning it will try to grab as many characters as possible.

This means that if your string has two links on the same line, the first one is ignored. For example, if the string is:

<a [monitory] href="http://www.link1.com"><a [monitory] href="http://www.link2.com">

Only the http://www.link2.com address is captured, since .* takes as many characters as possible (including the entire "link1.com" part).
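
This can be checked with a short script. The sketch below is in Python (an assumption, since the question does not name a language) and uses a variant of the question's pattern with a capture group in place of the lookahead, so that the URL can be read back:

```python
import re

# The two-link string from the example above.
text = ('<a [monitory] href="http://www.link1.com">'
        '<a [monitory] href="http://www.link2.com">')

# Greedy variant (illustrative): both .* grab as much as they can.
greedy = re.compile(
    r"""<a.*\[monitory\].*href=["|']([http:|https:]?[\/\/]?.*?)["|']>"""
)

greedy_urls = greedy.findall(text)
print(greedy_urls)  # ['http://www.link2.com']
```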

To make the quantifier lazy (non-greedy), put ? right after the * :

<a.*?\[monitory\].*?href=["|']([http:|https:]?[\/\/]?.*?)["|']>

Now .*? takes only the minimum necessary, so both "link1" and "link2" are captured by the regex.
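
A minimal Python sketch of the lazy version (Python is an assumption here, as the question does not say which language is in use), run against the same two-link string:

```python
import re

# The two-link string from the earlier example.
text = ('<a [monitory] href="http://www.link1.com">'
        '<a [monitory] href="http://www.link2.com">')

# Lazy pattern: every .* has become .*? and stops as early as possible.
lazy = re.compile(
    r"""<a.*?\[monitory\].*?href=["|']([http:|https:]?[\/\/]?.*?)["|']>"""
)

lazy_urls = lazy.findall(text)
print(lazy_urls)  # ['http://www.link1.com', 'http://www.link2.com']
```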

Another detail is that ["|'] is a character class, that is, it matches any one of the characters inside the square brackets. So this expression means the character " or the character | or the character ' . This means that the string could have | in place of the quotation marks:

<a [monitory] href=|http://www.teste.com|>

The regex would still accept it.

If you want to accept only " or ' , remove the | from the brackets: ["'] .
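
The difference between the two character classes can be seen in isolation. This is a reduced Python sketch that keeps only the href part of the pattern (the shortened patterns are mine, for illustration):

```python
import re

# With | inside the brackets, the pipe is accepted as a delimiter.
with_pipe = re.compile(r"""href=["|'](.*?)["|']""")
# Without it, only the two kinds of quotation marks are accepted.
without_pipe = re.compile(r"""href=["'](.*?)["']""")

sample = '<a [monitory] href=|http://www.teste.com|>'

pipe_match = with_pipe.search(sample)
no_pipe_match = without_pipe.search(sample)

print(pipe_match.group(1))  # http://www.teste.com
print(no_pipe_match)        # None
```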

Similarly, [\/\/] means the character / or the character / (that is, repeating the same character inside the brackets is redundant, and in some languages it is even an error). As a result, the regex accepts a URL with only one slash ( http:/www.teste.com ).

If you want exactly two occurrences of / , remove the brackets: \/\/ .
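
A small Python sketch of the same point, using shortened patterns of my own that keep only the protocol and slashes:

```python
import re

# [\/\/] is a character class, so it matches at most ONE slash here.
one_slash = re.compile(r'https?:[\/\/]?www')
# Without brackets, \/\/ requires two consecutive slashes.
two_slashes = re.compile(r'https?:\/\/www')

malformed = 'http:/www.teste.com'    # only one slash
wellformed = 'http://www.teste.com'  # two slashes

print(bool(one_slash.match(malformed)))     # True
print(bool(two_slashes.match(malformed)))   # False
print(bool(two_slashes.match(wellformed)))  # True
```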

The [http:|https:]? part is a character class too: it matches at most one of the characters h, t, p, s, : or | , not the words http: or https: , so it should also be taken out of the brackets. In fact, the regex only works because both this and [\/\/] are followed by a ? , which makes them optional, and after them comes .*? , which matches any characters. To understand this better, put parentheses around each of them and look at what each one captures.

To accept http or https, just use https? : the s? part makes the letter s optional. The regex then becomes:
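
Here is one way to run that experiment, sketched in Python (my choice of language). Each of the three pieces gets its own group:

```python
import re

link = '<a [monitory] href="http://www.google.com">Google</a>'

# Same lazy pattern, but with one capture group per piece.
split = re.compile(
    r"""<a.*?\[monitory\].*?href=["|']([http:|https:]?)([\/\/]?)(.*?)["|']>"""
)

pieces = split.search(link).groups()
print(pieces)  # ('h', '', 'ttp://www.google.com')
```

The first class consumes only the letter h, the second consumes nothing, and .*? picks up the rest of the URL.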

<a.*?\[monitory\].*?href=["'](https?:\/\/.*?)["']>
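
A minimal Python sketch of this version (Python and the helper name first_url are assumptions for illustration):

```python
import re

url_re = re.compile(r"""<a.*?\[monitory\].*?href=["'](https?:\/\/.*?)["']>""")

def first_url(html):
    """Return the first captured URL, or None when nothing matches."""
    m = url_re.search(html)
    return m.group(1) if m else None

http_url = first_url('<a [monitory] href="http://www.google.com">Google</a>')
https_url = first_url("<a [monitory] href='https://www.google.com'>Google</a>")
ftp_url = first_url('<a [monitory] href="ftp://www.google.com">Google</a>')

print(http_url)   # http://www.google.com
print(https_url)  # https://www.google.com
print(ftp_url)    # None
```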

Note also that this regex only works if [monitory] comes before href , and if there is no space between the quotation mark that closes href and the > . You can improve it a little more by replacing .*? with \s+ (one or more whitespace characters) and, at the end, putting \s* before the closing > (zero or more whitespace characters):

<a\s+\[monitory\]\s+href=["'](https?:\/\/.*?)["']\s*>
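
The sketch below tries this stricter pattern in Python (the language and the helper name first_url are assumptions) on a few variations:

```python
import re

strict_re = re.compile(
    r"""<a\s+\[monitory\]\s+href=["'](https?:\/\/.*?)["']\s*>"""
)

def first_url(html):
    """Return the first captured URL, or None when nothing matches."""
    m = strict_re.search(html)
    return m.group(1) if m else None

# Extra whitespace between the parts and before > is tolerated.
spaced = first_url('<a  [monitory]\thref="http://www.google.com"  >Google</a>')
# href before [monitory] is not matched.
swapped = first_url('<a href="http://www.google.com" [monitory]>Google</a>')
# Another attribute between [monitory] and href is not matched either.
extra_attr = first_url('<a [monitory] class="x" href="http://www.google.com">')

print(spaced)      # http://www.google.com
print(swapped)     # None
print(extra_attr)  # None
```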

Note that there is no end to this, because HTML tags are more complex than they appear. If you are sure your strings always have this format and there are no other variations, a regex does the job. But if there are more cases ( href before [monitory] , other attributes, URLs with other protocols such as ftp, gopher or mailto, or simply localhost , etc.), you will have to update the regex.

Using .* causes invalid URLs such as http:///#@@#@#@ or even http:// to be accepted. If you really want to validate any URL, you'll end up with monstrous expressions, and at that point something so complicated is no longer worth using.

Regex, while very cool, is not the best tool for parsing and manipulating HTML. It may be worth trying more suitable tools.

I understand that your regex worked, but the challenge with a regex is not just making it match the valid cases; it is also making it reject the invalid ones.
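
To illustrate, the Python sketch below feeds those two strings to the stricter pattern. The has_host check built on urllib.parse.urlsplit is my own addition, shown only as one possible lightweight sanity check after extraction, not as full URL validation:

```python
import re
from urllib.parse import urlsplit

strict_re = re.compile(
    r"""<a\s+\[monitory\]\s+href=["'](https?:\/\/.*?)["']\s*>"""
)

def first_url(html):
    """Return the first captured URL, or None when nothing matches."""
    m = strict_re.search(html)
    return m.group(1) if m else None

def has_host(url):
    """Hypothetical extra check: the URL must have a non-empty host part."""
    return bool(urlsplit(url).netloc)

checks = {}
for href in ('http://www.google.com', 'http:///#@@#@#@', 'http://'):
    url = first_url(f'<a [monitory] href="{href}">')
    # The regex accepts all three; only the first one has a host.
    checks[url] = has_host(url)

print(checks)
# {'http://www.google.com': True, 'http:///#@@#@#@': False, 'http://': False}
```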
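
As a rough illustration of the parser route, here is a sketch using Python's standard html.parser module. It assumes [monitory] always appears as a bare attribute inside the tag, which the parser then reports under the name '[monitory]'; the class name MonitoryLinks and the sample links are made up for the example:

```python
from html.parser import HTMLParser

class MonitoryLinks(HTMLParser):
    """Collect href values of <a> tags that carry a bare [monitory] attribute."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes get None.
        attr_map = dict(attrs)
        if tag == 'a' and '[monitory]' in attr_map and attr_map.get('href'):
            self.urls.append(attr_map['href'])

parser = MonitoryLinks()
parser.feed(
    '<a [monitory] href="http://www.google.com">Google</a>'
    '<a href="http://www.example.com">Other</a>'
    '<a class="x" href="https://www.link2.com" [monitory]>Link 2</a>'
)
parser.close()

print(parser.urls)  # ['http://www.google.com', 'https://www.link2.com']
```

With this approach the order of the attributes and the presence of other attributes no longer matter.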

    
16.11.2018 / 01:12
2

Looking at another question asked on Stack Overflow, I understood one thing.

Parentheses ( ) serve to capture, so I modified my expression to:

<a.*\[monitory\].*href=["|']([http:|https:]?[\/\/]?.*?)["|']>

And now it worked.
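
For reference, this is how the captured group can be read back. The sketch is in Python, which is an assumption since the language in use is not stated:

```python
import re

pattern = re.compile(
    r"""<a.*\[monitory\].*href=["|']([http:|https:]?[\/\/]?.*?)["|']>"""
)
match = pattern.search('<a [monitory] href="http://www.google.com">Google</a>')

whole = match.group(0)  # everything the pattern matched
url = match.group(1)    # only what the parentheses captured

print(whole)  # <a [monitory] href="http://www.google.com">
print(url)    # http://www.google.com
```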

You can find the other question at: Regular expression for paste part of text

    
16.11.2018 / 00:32