There is no single regular expression that validates URLs, and part of the blame lies with the RFC(s) themselves. The same goes for any kind of data whose format depends on one.
PHP's filter functions do follow the relevant specifications, but they do not cover every case; for the remaining ones, in order to avoid false positives, you narrow the validation as needed through configuration flags, which gives you the flexibility each case requires.
For future reference: by default, if you omit the second argument, filter_var() simply treats the data as an ordinary string.
In your case, since you did not show how you are calling it, I imagine you are doing this:
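To illustrate (a minimal sketch; the example strings are arbitrary):

```php
<?php
// With no filter argument, filter_var() falls back to FILTER_DEFAULT,
// which returns the value as an ordinary, unfiltered string.
var_dump(filter_var('not a url at all'));                           // string(16) "not a url at all"

// Only when a filter such as FILTER_VALIDATE_URL is passed does real
// validation happen: the value comes back on success, FALSE on failure.
var_dump(filter_var('not a url at all', FILTER_VALIDATE_URL));      // bool(false)
var_dump(filter_var('http://www.youtube.com', FILTER_VALIDATE_URL)); // string(22) "http://www.youtube.com"
```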
filter_var( 'http://www.youtube.com', FILTER_VALIDATE_URL );
The first URL is valid because it contains the main elements of a URL: the scheme, the domain, and the TLD.
The second one also validates because it too has those three basic components, even if one of them is malformed.
For the second URL to also return FALSE, you would need to combine FILTER_VALIDATE_URL with the FILTER_FLAG_SCHEME_REQUIRED flag.
The third URL is valid to the user and to the browser, but it does not pass the RFC, because it is missing one of the basic components required by the specification.
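A sketch of what that combination could look like; the malformed URL below is only a hypothetical stand-in for your second URL:

```php
<?php
// Hypothetical stand-in for the second (malformed) URL from the question.
$broken = 'http//www.youtube.com';

// Plain validation, as in the call above.
var_dump(filter_var($broken, FILTER_VALIDATE_URL));

// Same filter combined with the flag; according to the reasoning above,
// this is the combination that should reject the broken scheme.
// (On PHP 7.3+ the flag is reported as deprecated, since the scheme
// requirement became implicit in FILTER_VALIDATE_URL itself.)
var_dump(filter_var($broken, FILTER_VALIDATE_URL, FILTER_FLAG_SCHEME_REQUIRED));
```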
What you can do, as with anything that comes from the user, is sanitize the URL before even validating it. A few things that come to mind:
- Check that there are no broken schemes, as in the second URL, and correct them, either by removing or repairing them when and if possible
- Prepend the default scheme to the URL if it is missing (or if it was removed in the previous step); after all, an FTP or HTTPS URL (or ED2K, magnet, torrent...) that does not carry its specific prefix will not get special treatment anyway (both steps are sketched below)
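A rough sketch of those two steps, assuming http:// as the default scheme; sanitize_url() is a hypothetical helper, not a built-in:

```php
<?php
// Hypothetical helper: sanitize a user-supplied URL before validating it.
// Assumes "http://" as the default scheme when none is present.
function sanitize_url(string $url): string
{
    $url = trim($url);

    // 1. Repair obviously broken scheme separators such as
    //    "http:/example.com" or "http//example.com".
    $url = preg_replace('~^(https?):?/{1,2}~i', '$1://', $url);

    // 2. Prepend the default scheme when the URL has none at all.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }

    return $url;
}

// Usage: sanitize first, then validate as before.
$clean = sanitize_url('www.youtube.com');           // "http://www.youtube.com"
var_dump(filter_var($clean, FILTER_VALIDATE_URL));  // string(22) "http://www.youtube.com"
```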
And always warn the user, with a hint in the GUI, about the expected link format. If they type it wrong and the system cannot fix it, validation fails; they were warned, and they will have to fill it in again.