How to get the name of the site?

2

Imagine a scenario in which I have only URLs as follows, registered in my database:

https://www.google.com
https://www.facebook.com
https://www.youtube.com
https://www.twitter.com

Given that every URL will have this form, how can I extract the site name?

For example, using a regex: when I call a method and pass it https://www.google.com, it should return only the string Google.

    
asked by anonymous 22.08.2017 / 21:38

3 answers

1
function nome_dominio($url)
{
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    $nome = explode('.',$regs['domain']);
    return ucfirst($nome[0]); // capitalize the first letter
  }
  return false;
}

// Examples (both return Google):
echo nome_dominio("https://mail.google.com"); // returns Google
echo nome_dominio("https://google.com"); // returns Google
    
22.08.2017 / 22:01
1

Here is an example of regex in JavaScript:

var urls = [
  'https://www.google.com',
  'https://www.facebook.com',
  'https://www.youtube.com',
  'https://www.twitter.com'
];

var $saida = document.getElementById("saida");

urls.forEach(function(url) {
  var nome_site = /(https:\/\/www\.)([^.]+)(.*)/.exec(url)[2];
  nome_site = nome_site.charAt(0).toUpperCase() + nome_site.slice(1);
  $saida.value = $saida.value + "\n" + url + ': ' + nome_site;
});

textarea {
  height: 200px;
  width: 100%;
}

<textarea id="saida"></textarea>
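
As a variant of the snippet above, the host can be read with the standard URL constructor instead of matching the whole URL with a regex. This is only a sketch: siteName is a hypothetical helper, and it assumes URLs shaped like the ones in the question (an optional www label followed by the site name), so https://mail.google.com would yield Mail.

```javascript
// Hypothetical helper, not part of the original answer.
function siteName(url) {
  // The URL constructor throws a TypeError on malformed input.
  const host = new URL(url).hostname; // e.g. "www.google.com"
  const labels = host.split(".");
  // Skip a leading "www" label when there is one, then take the next label.
  const first = labels[0] === "www" && labels.length > 1 ? labels[1] : labels[0];
  return first.charAt(0).toUpperCase() + first.slice(1);
}

const urls = [
  "https://www.google.com",
  "https://www.facebook.com",
  "https://www.youtube.com",
  "https://www.twitter.com"
];

for (const url of urls) {
  console.log(url + ": " + siteName(url));
}
```

Unlike the regex, this also accepts http:// URLs and hosts without the www prefix.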
    
22.08.2017 / 22:01
1

A solution that comes close to covering every case is much more complex than a regex.

Unfortunately, I could not make this any friendlier, and the final code is somewhat confusing, but I think it still conveys the idea, and I explain the whole process below.

Advantages (over the regex-based answer):

  • It supports many more kinds of domains, such as floripa.br or adult.ht.

  • It supports public subdomains such as <seusite>.blogspot.com and even <seusite>.s3.amazonaws.com and the like.

Requirements:

You do not need any extension, plugin, or framework. You only need to download the public list of all domain suffixes / TLDs, available at https://publicsuffix.org/list/public_suffix_list.dat, and set the file's location on the line that calls file().

This document is updated periodically.

Code:

function pegaNome($url)
{

    // Extract only the host, e.g. "seusite.blogspot.com".
    $url = parse_url($url, PHP_URL_HOST);
    if (empty($url)) {
        return false;
    }

    // Labels that a "*" in the list may stand for (a simplification).
    $generico = ['com', 'org', 'net', 'edu', 'gov', 'mil'];

    // Download: https://publicsuffix.org/list/public_suffix_list.dat
    $lista = array_filter(file('public_suffix_list.dat.txt'));
    $lista = array_merge($lista, ['*']); // fallback rule for generic TLDs

    $dominio = explode('.', $url);
    $dominioTamanho = count($dominio) - 1;

    $encontrado = [];

    foreach ($lista as $tld) {

        // Skip exception rules (!), comments (//) and blank lines.
        if (!in_array(substr($tld, 0, 1), ['!', '/', "\n"], true)) {

            $correto = 0;
            $partes = explode('.', $tld);
            $partesTamanho = count($partes);

            foreach ($partes as $i => $pedaco) {

                // The rule has more labels than the host, so it cannot match.
                if (!isset($dominio[$dominioTamanho - $partesTamanho + $i + 1])) {
                    break;
                }

                // A "*" label stands for any of the generic labels above.
                $pedaco = trim($pedaco);
                $pedaco = $pedaco === '*' ? $generico : [$pedaco];

                $correto += (int)(in_array($dominio[$dominioTamanho - $partesTamanho + $i + 1], $pedaco, true));

            }

            // Every label of the rule matched the end of the host.
            if ($correto === $partesTamanho) {
                $encontrado[] = $correto;
            }

        }

    }

    if (!empty($encontrado)) {
        // Try the longest matching suffix first.
        rsort($encontrado);

        foreach ($encontrado as $encontro) {
            if (!empty($dominio[$dominioTamanho - $encontro])) {
                return $dominio[$dominioTamanho - $encontro];
            }
        }

    }

    return $url;

}

Explanations:

File:

The file contains four kinds of lines (ignoring blank ones):

!tld
*.tld
// tld
tld

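The code above ignores both // tld lines, which are comments, and !tld lines, whose exact purpose I do not know (in the Public Suffix List format, a leading ! marks an exception to a wildcard rule).

A *.tld line means that, in most cases, the suffix will be something like net.tld or com.tld.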
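
The four kinds of lines can be told apart with a few prefix checks. The sketch below is in JavaScript rather than PHP; classifyLine is a hypothetical helper, and the sample lines are the placeholders listed above, not real entries from the list.

```javascript
// Hypothetical helper: classifies one line of the suffix list.
function classifyLine(line) {
  const text = line.trim();
  if (text === "") return { kind: "blank" };
  if (text.startsWith("//")) return { kind: "comment" };
  if (text.startsWith("!")) {
    // Exception rule: keep the labels without the leading "!".
    return { kind: "exception", labels: text.slice(1).split(".") };
  }
  const labels = text.split(".");
  return { kind: labels.includes("*") ? "wildcard" : "plain", labels };
}

for (const line of ["!tld", "*.tld", "// tld", "tld", ""]) {
  console.log(JSON.stringify(line), "->", classifyLine(line).kind);
}
```

The PHP function treats the comment, exception and blank kinds the same way: it skips them.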

Checks:

When you ask it to check a URL, for example https://seusite.blogspot.com, the function does exactly the following:

  • Uses PHP_URL_HOST to get seusite.blogspot.com.
  • Splits seusite.blogspot.com into seusite, blogspot and com.

Then it needs to find out which suffix your site uses:

  • Checks whether the last element equals ac: com != ac
  • Checks whether the last two elements equal com.ac, that is:
    • Compares the penultimate element with com: blogspot != com
    • Compares the last element with ac: com != ac

This is repeated for each line of the file.

At some point it will do exactly this:

  • Checks whether the last two elements equal blogspot.com:
    • Compares the penultimate element with blogspot: blogspot == blogspot
    • Compares the last element with com: com == com

It then runs $encontrado[] = $correto, which stores the value 2, the number of parts the matched suffix has (.blogspot.com = 2, .net = 1, .a.b.c = 3).

For this same domain, one of the later comparisons will be:

  • Checks whether the last element equals com: com === com

This also stores the value 1 in $encontrado.
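
The comparison walked through above can be sketched as a small JavaScript function. matchLength is a hypothetical name, the four rules are only the ones used in this walkthrough, and wildcard (*) rules are deliberately left out to keep it short.

```javascript
// Hypothetical helper: returns the number of labels in `rule` when every
// label matches the end of `host`, or 0 when the rule does not apply.
function matchLength(host, rule) {
  const dominio = host.split(".");
  const partes = rule.split(".");
  // A rule with more labels than the host can never match.
  if (partes.length > dominio.length) return 0;
  const offset = dominio.length - partes.length;
  const ok = partes.every((pedaco, i) => pedaco === dominio[offset + i]);
  return ok ? partes.length : 0;
}

const host = "seusite.blogspot.com";
const rules = ["ac", "com.ac", "blogspot.com", "com"];

// Keep only the rules that matched, as $encontrado does in the PHP code.
const encontrado = rules.map((rule) => matchLength(host, rule)).filter((n) => n > 0);
console.log(encontrado); // [ 2, 1 ]
```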

Result:

At the end, we take the largest number in $encontrado and use it to locate the domain name.

So if the largest value in $encontrado for seusite.blogspot.com is 2, the name is simply $dominio[count($dominio)-2-1].

So why build an array? Because someone might pass https://blogspot.com, which also matches both rules; in that case count($dominio)-2-1 would be -1. The code then falls back to the next match, in this case .com, and normally returns blogspot.
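
This fallback can be illustrated with a short JavaScript sketch. pickName is a hypothetical helper that mirrors the final loop of the PHP function; it assumes the host has already been split into labels and the match lengths have already been collected.

```javascript
// Hypothetical helper: picks the label just before the longest usable suffix.
function pickName(dominio, encontrado) {
  // Longest suffix first, like rsort() in the PHP code.
  const sorted = [...encontrado].sort((a, b) => b - a);
  for (const n of sorted) {
    const label = dominio[dominio.length - 1 - n];
    // A negative index yields undefined, so the next match is tried.
    if (label) return label;
  }
  // Nothing usable: return the host unchanged.
  return dominio.join(".");
}

console.log(pickName(["seusite", "blogspot", "com"], [1, 2])); // seusite
console.log(pickName(["blogspot", "com"], [2, 1]));            // blogspot
```

In the second call the suffix of length 2 covers the whole host, so the suffix of length 1 decides the result.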

    
24.08.2017 / 18:01