"Permissive" regular expression to detect allowed extensions and hosts

4

I have a list of links:

https://www.exemplo.com/
https://www.exemplo.com/home/
https://www.exemplo.com/logo.png
https://intranet.exemplo.com/
https://admin.exemplo.com/login
https://www.exemplo.com/sobre/
https://www.exemplo.com/shell.php.log
https://www.exemplo.com/background.jpg

And I want to identify links that start with

https://www.exemplo.com/

and do not end with jpg or png

In this case, the following URLs would be blocked:

    https://www.exemplo.com/logo.png
    https://www.exemplo.com/background.jpg
    https://intranet.exemplo.com/
    https://admin.exemplo.com/login
    
asked by anonymous 18.06.2016 / 22:31

2 answers

1

Then we have:

urls = ["https://www.exemplo.com/", "https://www.exemplo.com/home/", "https://www.exemplo.com/logo.png", "https://intranet.exemplo.com/", "https://admin.exemplo.com/login", "https://www.exemplo.com/sobre/", "https://www.exemplo.com/shell.php.log", "https://www.exemplo.com/background.jpg"]

We will filter those that end with png / jpg or that do not have the "www".

With regex in Python:

import re

# Compile once, outside the loop; raw strings keep the backslashes intact.
img = re.compile(r'^.*\.(jpg|png)$', re.IGNORECASE)  # ends with .jpg or .png
www = re.compile(r'^.*?//www\.')                     # host starts with "www."

bloqueados = []
for url in urls:
    if img.match(url) or not www.match(url):
        bloqueados.append(url)
print(bloqueados) # ['https://www.exemplo.com/logo.png', 'https://intranet.exemplo.com/', 'https://admin.exemplo.com/login', 'https://www.exemplo.com/background.jpg']

Or, as a list comprehension:

import re
bloqueados = [url for url in urls if re.search(r'\.(jpg|png)$', url, re.IGNORECASE) or re.search(r'//www\.', url) is None]
print(bloqueados) # ['https://www.exemplo.com/logo.png', 'https://intranet.exemplo.com/', 'https://admin.exemplo.com/login', 'https://www.exemplo.com/background.jpg']
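The two checks can also be folded into a single "permissive" pattern that matches only the allowed URLs. This is a minimal sketch, not part of the original answer: the name PERMITIDA is hypothetical, and it assumes the allowed prefix is exactly https://www.exemplo.com/ and that the extension is whatever the URL ends with (no query string). The negative lookahead rejects any URL whose remainder ends in .jpg or .png.

```python
import re

urls = [
    "https://www.exemplo.com/",
    "https://www.exemplo.com/home/",
    "https://www.exemplo.com/logo.png",
    "https://intranet.exemplo.com/",
    "https://admin.exemplo.com/login",
    "https://www.exemplo.com/sobre/",
    "https://www.exemplo.com/shell.php.log",
    "https://www.exemplo.com/background.jpg",
]

# Hypothetical name. The prefix must match literally (dots escaped);
# (?!...) fails the match when the rest of the URL ends in .jpg or .png.
PERMITIDA = re.compile(
    r'^https://www\.exemplo\.com/(?!.*\.(?:jpg|png)$)',
    re.IGNORECASE,
)

permitidas = [url for url in urls if PERMITIDA.match(url)]
bloqueados = [url for url in urls if not PERMITIDA.match(url)]

print(permitidas)
print(bloqueados)
```

Because the host is part of the pattern, intranet.exemplo.com and admin.exemplo.com are rejected without a separate "www" test.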

For this simple case, though, I would not use regex. I would write:

bloqueados = [url for url in urls if url.endswith(('.png', '.jpg')) or not url.startswith('https://www.')]
print(bloqueados) # ['https://www.exemplo.com/logo.png', 'https://intranet.exemplo.com/', 'https://admin.exemplo.com/login', 'https://www.exemplo.com/background.jpg']
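If the links may carry a query string or fragment (for example logo.png?v=2), comparing the end of the raw string is no longer enough. One possible alternative, sketched here under assumptions, is to parse the URL with the standard library and test the host and the path separately. The names HOST_PERMITIDO, EXTENSOES_BLOQUEADAS and permitida are hypothetical, and requiring the https scheme is an assumption taken from the examples in the question.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical names; values taken from the question.
HOST_PERMITIDO = 'www.exemplo.com'
EXTENSOES_BLOQUEADAS = {'.jpg', '.png'}

def permitida(url):
    """Return True when the URL is on the allowed host and is not an image."""
    partes = urlsplit(url)
    # Assumption: only https links on the exact host are allowed.
    if partes.scheme != 'https' or partes.hostname != HOST_PERMITIDO:
        return False
    # The extension comes from the path only, so "?v=2" does not interfere.
    extensao = posixpath.splitext(partes.path)[1].lower()
    return extensao not in EXTENSOES_BLOQUEADAS

print(permitida('https://www.exemplo.com/home/'))         # True
print(permitida('https://www.exemplo.com/logo.png?v=2'))  # False
print(permitida('https://intranet.exemplo.com/'))         # False
```

It is longer than the one-liner, but the host comparison is exact rather than a substring test.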

With regex in JavaScript:

var bloqueados = [];
// Iterate over the values; for...in walks the keys and is not meant for arrays.
urls.forEach(function (url) {
    // Blocked if it ends in .jpg/.png or if the host does not start with "www."
    if (/\.(jpg|png)$/i.test(url) || !/\/\/www\./.test(url)) {
        bloqueados.push(url);
    }
});
console.log(bloqueados); // ["https://www.exemplo.com/logo.png", "https://intranet.exemplo.com/", "https://admin.exemplo.com/login", "https://www.exemplo.com/background.jpg"]

Without regex in JavaScript:

var bloqueados = [];
urls.forEach(function (url) {
    // Text after the last dot, taken as the extension.
    var ext = url.split('.').pop();
    if (ext == 'png' || ext == 'jpg' || url.indexOf("//www.") < 0) {
        bloqueados.push(url);
    }
});
console.log(bloqueados); // ["https://www.exemplo.com/logo.png", "https://intranet.exemplo.com/", "https://admin.exemplo.com/login", "https://www.exemplo.com/background.jpg"]
    
18.06.2016 / 23:05
2

Another simple way to get the same result is JavaScript's native Array.prototype.filter(). For example:

function filtrarUrls(lista) {
    var base = 'https://www.exemplo.com/';

    return lista
        // keep only URLs that start with the base (index 0, not just "contains")
        .filter((url) => url.indexOf(base) === 0)
        // drop URLs that end in .jpg or .png (dot escaped, anchored at the end)
        .filter((url) => !/\.(jpg|png)$/i.test(url));
}

And to use the function:

var urls = ['https://url1.com', 'https://url2.com', ...];

var urlsFiltradas = filtrarUrls(urls); // returns an array with only the URLs that passed the filter
    
19.06.2016 / 00:09