Validate URL with regex

1

I have a validation class and I need a method to validate URLs, but the filter_var function fails to validate some of them.

An example with 3 URLs:

The URL is complete and the function returns TRUE:

#1 'http://www.youtube.com' | string(22) "..."

The URL is invalid and yet the function returns TRUE:

#2 'tp://www.youtube.com' | string(20) "..."

The URL returns FALSE:

#3 'youtube.com' | bool(false)

I do not know whether the failure involves the HTTP|HTTPS protocol; I have not tested that yet. The URL validation rules I have seen are huge and I do not fully understand them all.

I thought of using preg_match before filter_var to check for the protocol with the regex "/(http|https):\/\/(.*?)$/i".

My fear is that this also fails. Does anyone have a simple suggestion for this impasse, other than a complex regex?
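To make the idea concrete, a rough sketch of the combination I have in mind (whether it holds up is exactly my doubt):

// Check the protocol with the regex first, then let filter_var decide.
$url = 'http://www.youtube.com';
if (preg_match("/(http|https):\/\/(.*?)$/i", $url)
    && filter_var($url, FILTER_VALIDATE_URL) !== false) {
    echo "Looks valid: $url";
}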

asked by anonymous 26.08.2014 / 10:40

3 answers

4

There is no single regular expression that validates URLs, and the blame lies partly with the RFC(s). The same goes for any data whose format depends on one.

The PHP filter functions do follow the required specifications, but they do not cover every case; for the others, in order to avoid false positives, you narrow the restriction to what you need through configuration flags, which gives you the flexibility each case requires.

For future reference: by default, if you omit the second argument, filter_var just treats the data as an ordinary string.
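For illustration, compare (outputs in the comments):

// No second argument: FILTER_DEFAULT applies and the value passes through untouched.
var_dump(filter_var('youtube.com'));                      // string(11) "youtube.com"
// With the filter, actual URL validation happens.
var_dump(filter_var('youtube.com', FILTER_VALIDATE_URL)); // bool(false)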

In your case, since you did not show how you are calling it, I imagine you are doing this:

filter_var( 'http://www.youtube.com', FILTER_VALIDATE_URL );

The first URL is valid because it contains the main elements of a URL: the scheme, the domain, and the TLD.

The second URL also validates because it too has the three basic components, even though one of them is wrong.

For the second URL to also return FALSE, you would need to combine the filter above with the FILTER_FLAG_SCHEME_REQUIRED flag.
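For reference, the flag is passed as filter_var's third argument; note that on recent PHP versions FILTER_FLAG_SCHEME_REQUIRED is implied by FILTER_VALIDATE_URL anyway (and deprecated as of PHP 7.3), so the result of this call can vary by version:

filter_var('tp://www.youtube.com', FILTER_VALIDATE_URL, FILTER_FLAG_SCHEME_REQUIRED);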

The third URL is valid to the user and to the browser, but it does not pass per the RFC, because it lacks one of the basic components required by the specification.
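Putting the three cases side by side, the outputs match the ones reported in the question:

var_dump(filter_var('http://www.youtube.com', FILTER_VALIDATE_URL)); // string(22) "http://www.youtube.com"
var_dump(filter_var('tp://www.youtube.com', FILTER_VALIDATE_URL));   // string(20) "tp://www.youtube.com"
var_dump(filter_var('youtube.com', FILTER_VALIDATE_URL));            // bool(false)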

What you could do, as with everything that comes from the user, is sanitize the URL before even validating it. Some things that come to mind (there is a sketch in code right after this list):

  • Check that there are no broken schemes, as in the second URL, and correct them, either by removing or by repairing them when and if possible
  • Add the default http:// to the beginning of the URL if it is missing (or if it was removed by the previous step); after all, an FTP or HTTPS URL (or ED2K, Magnet, torrent...) that does not have its specific prefix will not be treated specially anyway
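A minimal sketch of those two steps; the helper name and the exact repair rules are mine, not a definitive implementation:

// Sanitize before validating: drop broken schemes, then add a default one.
function sanitizeUrl(string $url): string
{
    $url = trim($url);

    // 1. Remove any scheme that is not one we accept (e.g. the broken 'tp://').
    $url = preg_replace('#^(?!(?:https?|ftp)://)[a-z\d+.-]*://#i', '', $url);

    // 2. Add the default http:// when no scheme is left.
    if (!preg_match('#^(?:https?|ftp)://#i', $url)) {
        $url = 'http://' . $url;
    }

    return $url;
}

var_dump(filter_var(sanitizeUrl('tp://www.youtube.com'), FILTER_VALIDATE_URL)); // string(22) "http://www.youtube.com"
var_dump(filter_var(sanitizeUrl('youtube.com'), FILTER_VALIDATE_URL));          // string(18) "http://youtube.com"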

And always warn the user, with a hint in the GUI, about the expected format. If they type it wrong and the system cannot fix it, validation fails; they were warned, and they will have to fill it all in again.

26.08.2014 / 13:11
3

Your second example is a valid URL! URLs have the general format:

scheme://host/path/resource

http and https are just two examples of schemes (sometimes called "protocols"). Others would be ftp, file... Nothing prevents someone from creating a tp scheme, so the validator accepted your second example.

If you want to restrict the scheme to http and https, I suggest testing it right after filter_var:

strpos($url, "http:") === 0 || strpos($url, "https:") === 0

(Note: why not just test whether the prefix is http? Because that would accept URLs such as httpabc://...)
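Combining the two tests into a small helper (the function name is mine, just for illustration):

function isHttpUrl(string $url): bool
{
    // Valid per filter_var AND restricted to the http/https schemes.
    return filter_var($url, FILTER_VALIDATE_URL) !== false
        && (strpos($url, "http:") === 0 || strpos($url, "https:") === 0);
}

var_dump(isHttpUrl('http://www.youtube.com')); // bool(true)
var_dump(isHttpUrl('tp://www.youtube.com'));   // bool(false)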

26.08.2014 / 12:48
3

I use this regex and am satisfied with the results:

preg_match("%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu", $url)

You can find the regex at link
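In case it helps, a quick usage check against the question's examples (the expected results follow from the pattern, which demands an http, https, or ftp scheme):

$pattern = "%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu";

var_dump((bool) preg_match($pattern, 'http://www.youtube.com')); // bool(true)
var_dump((bool) preg_match($pattern, 'tp://www.youtube.com'));   // bool(false)
var_dump((bool) preg_match($pattern, 'youtube.com'));            // bool(false)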

26.08.2014 / 12:56