Regular expression for URLs with a dynamic middle part

0

I have an HTML file containing URLs that follow this pattern: https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010 In general terms, the pattern is protocol://domain/dynamic-string-000-0000-000

I want to extract every link that follows this pattern, so I wrote the following regex: (https\:\/\/?)www\.olympikus\.com\.br\/(.*)\-[A-Z0-9]{3}-[A-Z0-9]{4}-[A-Z0-9]{3}

Unfortunately, the match starts at the first protocol://domain/ and ends at the last possible match of -000-0000-000, returning one long string in between because of (.*). I cannot get the dynamic part of the URL right.

How should I write this regex so that it returns each link separately?

I'm currently using egrep in the terminal, but JavaScript examples are welcome because I intend to write a crawler in that language on Node.js.
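
The over-matching described above can be reproduced in Node.js with a minimal sketch. The `line` value is a hypothetical one-line HTML fragment with two shortened product links (not taken from the real page); the regex is the one from the question, written as a JavaScript literal.

```javascript
// The question's regex, unchanged apart from JavaScript literal syntax.
const greedy = /(https\:\/\/?)www\.olympikus\.com\.br\/(.*)\-[A-Z0-9]{3}-[A-Z0-9]{4}-[A-Z0-9]{3}/g;

// Hypothetical single line of HTML holding two product links (assumed data).
const line =
  '<a href="https://www.olympikus.com.br/tenis-a-D22-1131-010">A</a> ' +
  '<a href="https://www.olympikus.com.br/tenis-b-D22-1132-020">B</a>';

// (.*) is greedy: it runs to the end of the line and backtracks only as far
// as the last "-XXX-XXXX-XXX" suffix, so both links end up in a single match.
const matches = line.match(greedy) || [];
console.log(matches.length); // 1
console.log(matches[0]);
```

Because both links sit on the same line, the single match starts at the first `https://` and ends at the code of the second link, swallowing the markup in between.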

    
asked by anonymous 28.04.2018 / 23:42

2 answers

0

Assuming the variable part consists only of lowercase letters, digits, and hyphens, replace (.*) with [a-z0-9\-]+, which should solve the problem.
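
As a rough illustration of this suggestion in Node.js, the sketch below applies the adjusted pattern to a hypothetical HTML fragment. The second URL and the variable names (`productUrl`, `html`, `links`) are made up for the example; only the first URL comes from the question.

```javascript
// Question's pattern with (.*) replaced by the suggested character class.
const productUrl = /https:\/\/www\.olympikus\.com\.br\/[a-z0-9\-]+-[A-Z0-9]{3}-[A-Z0-9]{4}-[A-Z0-9]{3}/g;

// Hypothetical HTML: two product links on the same line (second URL is invented).
const html =
  '<a href="https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010">A</a> ' +
  '<a href="https://www.olympikus.com.br/tenis-olympikus-proof-feminino-D22-1132-020">B</a>';

// The class cannot match quotes, spaces, or angle brackets, so each match
// stays inside a single URL instead of spanning the markup between links.
const links = html.match(productUrl) || [];
console.log(links.length); // 2
console.log(links);
```

Note that the class is lowercase-only; if a slug may contain uppercase or accented letters, the class would need to be widened accordingly.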

    
29.04.2018 / 05:33
0

Regex

The regex would be: ((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*) The demo on Regex101 lets you inspect it more closely.

Code

Regex101 Example

Return Group 2

const regex = /((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)/gm;
// Template literal (backticks) so the test string can span several lines
const str = `http://dominio.do/strig-dinâmica-000-0000-000
https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010
ftp://dominio.br/strig-dinâmica-000-0000-000
dominio.c/strig-dinâmica-000-0000-000`;
const subst = '$2';

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

Example from Stack Overflow in English

Code by a since-deleted user, which can be seen here

It returns the entire string

var regex = /((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)/g;

// Template literal (backticks) so the test string can span several lines
var input = `http://dominio.do/strig-dinâmica-000-0000-000
https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010
ftp://dominio.br/strig-dinâmica-000-0000-000
dominio.c/strig-dinâmica-000-0000-000`;

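// Declare match explicitly instead of relying on an implicit global
var match;
while ((match = regex.exec(input)) !== null) {
    document.write(match[0] + "<br/>"); // match[0] is the entire matched string
}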

Debug

The Debuggex visualization can be seen at the link; together with the Regex101 demo, it helps in understanding the regex.

Explanation:

((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)

  • 1st Capture Group - ((?:https|http|ftp)?:\/\/)?
    • Quantifier ? - Matches between zero and one times, as many times as possible, giving back as needed (greedy)
    • Non-capturing group - (?:https|http|ftp)?
      • Quantifier ? - Matches between zero and one times, as many times as possible, giving back as needed (greedy)
      • Alternatives - the options separated by |, which acts as a boolean OR.
        • 1st Alternative - https matches the characters https literally
        • 2nd Alternative - http matches the characters http literally
        • 3rd Alternative - ftp matches the characters ftp literally
    • : matches the character : literally
    • \/ matches the character / literally (it appears twice, for //)
  • 2nd Capture Group - ([^\/,\s]+\.[^\/,\s]+?)
    • [^\/,\s]+ - Matches a single character not present in the set
      • Quantifier + - Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
      • \/ matches the character / literally
      • , matches the character , literally
      • \s matches any whitespace character (equivalent to [\r\n\t\f\v ])
    • \. matches the character . literally
    • [^\/,\s]+? - Matches a single character not present in the set
      • Quantifier +? - Matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
      • \/ matches the character / literally
      • , matches the character , literally
      • \s matches any whitespace character (equivalent to [\r\n\t\f\v ])
  • Positive Lookahead (?=\/|,|\s|$|\?|#)
    • Alternatives - the options separated by |, which acts as a boolean OR.
      • 1st alternative \/ matches the character / literally
      • 2nd alternative , matches the character , literally
      • 3rd alternative \s matches any whitespace character (equivalent to [\r\n\t\f\v ])
      • 4th alternative $ asserts the position at the end of a line
      • 5th alternative \? matches the character ? literally
      • 6th alternative # matches the character # literally
  • 3rd Capture Group - (.*)
    • .* matches any character (except for line terminators)
    • Quantifier * - Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  

The second group is the one that "matters": it holds the host (domain) part of the link. If you want the entire string, use group 0.
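
To make the group numbering concrete, here is a small Node.js sketch that lists each group per match. It assumes a Node.js version that supports `String.prototype.matchAll`; the property names (`full`, `protocol`, `host`, `rest`) are labels chosen for this example, not part of the regex.

```javascript
const regex = /((?:https|http|ftp)?:\/\/)?([^\/,\s]+\.[^\/,\s]+?)(?=\/|,|\s|$|\?|#)(.*)/gm;

// Two of the sample lines used above, joined with a newline.
const input = [
  'https://www.olympikus.com.br/tenis-olympikus-flower-415-feminino-cinza-D22-1131-010',
  'ftp://dominio.br/strig-dinâmica-000-0000-000',
].join('\n');

// matchAll requires the g flag and yields one match object per link.
const rows = [...input.matchAll(regex)].map((m) => ({
  full: m[0],     // group 0: the entire matched string
  protocol: m[1], // group 1: "https://", "ftp://", or undefined when absent
  host: m[2],     // group 2: the domain
  rest: m[3],     // group 3: everything after the domain, up to the end of the line
}));

console.log(rows);
```

For the crawler use case in the question, group 0 is probably the more useful value, since it keeps the full URL rather than only the domain.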

    
30.04.2018 / 15:01