A solution that comes close to covering every case is much more complex than a regex.
Unfortunately I could not make this any friendlier and the final code is a bit confusing, but I think it will still give you the idea, and I'll explain the whole process.
Advantages (compared with this answer):
- It supports more kinds of domains, such as floripa.br or adult.ht.
- It supports public subdomains such as <seusite>.blogspot.com and even <seusite>.s3.amazonaws.com and the like.
Requirements:
You do not need any extension, plugin or framework. You only need to download the public list of all domains / TLDs, available at https://publicsuffix.org/list/public_suffix_list.dat, and set the location of the file in the file() call in the code.
This document is updated periodically.
Code:
function pegaNome($url)
{
    $url = parse_url($url, PHP_URL_HOST);
    if (empty($url)) {
        return false;
    }

    // Labels tried in place of a "*" wildcard
    $generico = ['com', 'org', 'net', 'edu', 'gov', 'mil'];

    // Download: https://publicsuffix.org/list/public_suffix_list.dat
    $lista = file('public_suffix_list.dat.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lista === false) {
        return false; // the list could not be read
    }
    $lista = array_merge($lista, ['*']);

    $dominio = explode('.', $url);
    $dominioTamanho = count($dominio) - 1; // index of the last label
    $encontrado = [];

    foreach ($lista as $tld) {
        $tld = trim($tld);
        // Skip blank lines, comments ("// ...") and exception rules ("!...")
        if ($tld === '' || $tld[0] === '!' || $tld[0] === '/') {
            continue;
        }

        $correto = 0;
        $partes = explode('.', $tld);
        $partesTamanho = count($partes);

        foreach ($partes as $i => $pedaco) {
            // Position in the host that lines up with this part of the rule
            $indice = $dominioTamanho - $partesTamanho + $i + 1;
            if (!isset($dominio[$indice])) {
                break;
            }
            // The comparison must happen before wrapping the label in an array,
            // otherwise "*" is never recognised
            $aceitos = $pedaco === '*' ? $generico : [$pedaco];
            $correto += (int)in_array($dominio[$indice], $aceitos, true);
        }

        // Every part of the rule matched: remember how many labels it has
        if ($correto === $partesTamanho) {
            $encontrado[] = $correto;
        }
    }

    if (!empty($encontrado)) {
        rsort($encontrado); // longest match first
        foreach ($encontrado as $encontro) {
            if (!empty($dominio[$dominioTamanho - $encontro])) {
                return $dominio[$dominioTamanho - $encontro];
            }
        }
    }

    return $url;
}
Explanations:
File:
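The file has four kinds of lines (ignoring blank ones):
- !tld
- *.tld
- // tld
- tld
The code above ignores // tld lines, which are comments, as well as !tld lines, which the list's format uses to mark exceptions to a *.tld rule (this code does not handle them).
A *.tld line stands for second-level suffixes, which in most cases are things like net.tld or com.tld; the code approximates the wildcard with the $generico list.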
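To make the four kinds of lines concrete, here is a minimal sketch that labels a line of the list. The function name classificaLinha() and the labels it returns are hypothetical and are not part of the code above; the sample lines are the placeholders used in this explanation.

```php
<?php
// Hypothetical helper (not part of pegaNome): labels one line of the list
// with the kind of entry it represents.
function classificaLinha(string $linha): string
{
    $linha = trim($linha);
    if ($linha === '') {
        return 'blank';
    }
    if (strpos($linha, '//') === 0) {
        return 'comment';   // "// tld"
    }
    if ($linha[0] === '!') {
        return 'exception'; // "!tld"
    }
    if (strpos($linha, '*.') === 0) {
        return 'wildcard';  // "*.tld"
    }
    return 'plain';         // "tld"
}

foreach (['!tld', '*.tld', '// tld', 'tld'] as $linha) {
    echo $linha, ' => ', classificaLinha($linha), "\n";
}
```

pegaNome() only acts on the 'wildcard' and 'plain' kinds and skips the others.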
Checks:
When you ask it to check a URL, for example https://seusite.blogspot.com, the code does exactly the following:
- Uses PHP_URL_HOST to get seusite.blogspot.com.
- Splits seusite.blogspot.com into seusite, blogspot and com.
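These two steps can be tried on their own. The sketch below uses only parse_url() and explode() with the example URL from above; the variable $semEsquema is introduced here just for illustration.

```php
<?php
// Step 1: extract the host from the example URL
$host = parse_url('https://seusite.blogspot.com', PHP_URL_HOST);
echo $host, "\n"; // seusite.blogspot.com

// Step 2: split the host into its labels
$dominio = explode('.', $host);
print_r($dominio); // seusite, blogspot, com

// Without a scheme the string is parsed as a path, so there is no host
$semEsquema = parse_url('seusite.blogspot.com', PHP_URL_HOST);
var_dump($semEsquema); // NULL
```

The last call shows why pegaNome() returns false when the input has no scheme such as https://.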
Then it needs to find out which suffix your site's domain uses:
- Checks whether the last element equals ac: com != ac
- Checks whether the last two elements equal com.ac, that is:
  - Compares the penultimate element with com: blogspot != com
  - Compares the last element with ac: com != ac
This is repeated for every line of the file.
At some point it will do exactly this:
- Checks whether the last two elements equal blogspot.com:
  - Compares the penultimate element with blogspot: blogspot == blogspot
  - Compares the last element with com: com == com
Then it runs $encontrado[] = $correto, which stores the value 2, the number of parts that the suffix has (.blogspot.com = 2, .net = 1, .a.b.c = 3).
For this same domain, one of the later comparisons will be:
- Checks whether the last element equals com: com === com
This also stores the value 1 in $encontrado.
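The comparisons described above can be sketched in isolation. This is a simplified version, assuming plain rules only (no * wildcard); the helper name contaPartes() is hypothetical, and the four rules are the ones used in this walkthrough rather than the real file.

```php
<?php
// Hypothetical helper: returns the number of labels in $regra when every
// label matches the end of $dominio, or 0 when the rule does not match.
function contaPartes(array $dominio, string $regra): int
{
    $partes = explode('.', trim($regra));
    $inicio = count($dominio) - count($partes);
    if ($inicio < 0) {
        return 0; // the rule has more labels than the host
    }
    foreach ($partes as $i => $pedaco) {
        if ($dominio[$inicio + $i] !== $pedaco) {
            return 0;
        }
    }
    return count($partes);
}

$dominio = ['seusite', 'blogspot', 'com'];
$encontrado = [];
foreach (['ac', 'com.ac', 'blogspot.com', 'com'] as $regra) {
    $n = contaPartes($dominio, $regra);
    if ($n > 0) {
        $encontrado[] = $n;
    }
}
print_r($encontrado); // holds 2 (blogspot.com) and 1 (com)
```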
Result:
At the end it takes the largest number in $encontrado and uses it to get the name of the domain.
So if seusite.blogspot.com has 2 as its highest value in $encontrado, the name is simply $dominio[count($dominio)-2-1], that is, $dominio[0] (seusite).
So why build an array? Because someone might pass https://blogspot.com, which also matches both rules; however, count($dominio)-2-1 would then be -1, which is not a valid index. So it moves on to the next match found, in this case .com, and returns blogspot, usually.
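The fallback can be seen in a small sketch that mirrors the final loop of pegaNome(). The helper name escolheNome() is hypothetical, and the match counts are passed in by hand instead of being computed from the list.

```php
<?php
// Hypothetical helper mirroring the final loop of pegaNome(): tries the
// longest match first and skips any match that leaves no label before it.
function escolheNome(array $dominio, array $encontrado): ?string
{
    $ultimo = count($dominio) - 1; // index of the last label
    rsort($encontrado);
    foreach ($encontrado as $encontro) {
        if (!empty($dominio[$ultimo - $encontro])) {
            return $dominio[$ultimo - $encontro];
        }
    }
    return null;
}

// seusite.blogspot.com: the 2-part match wins, index 2 - 2 = 0
echo escolheNome(['seusite', 'blogspot', 'com'], [1, 2]), "\n"; // seusite

// blogspot.com: index 1 - 2 = -1 does not exist, so the 1-part match is used
echo escolheNome(['blogspot', 'com'], [1, 2]), "\n"; // blogspot
```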