How to get the name of compound domains?

2

I've seen How to get the site name? and, after testing, I found that it works very well on simple names. But on compound names, for example:

https://www.stackoverflow.com
https://www.oficinacarlos.com
https://www.lucasverduras.com

It returns everything run together, like this:

Stackoverflow

Oficinacarlos

Lucasverduras

Is there a way to split compound names like the ones above and return them like this:

Stack Overflow

Oficina Carlos

Lucas Verduras

I'm using the following code:

function nome_dominio($url)
{
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        $nome = explode('.', $regs['domain']);
        return ucfirst($nome[0]); // convert the first letter to uppercase
    }
    return false;
}

The function needs to return both compound names and simple names.

    
asked by anonymous 23.08.2017 / 22:10

2 answers

2

It's hard to create something that works in all cases. I've tried to keep it as simple as possible, but in many cases it produces somewhat grotesque errors.

Tests:

Alexa TOP 30 result:

+----------------+-----------+
|     Domain     |   Name    |
+----------------+-----------+
| youtube.com    | YouTube   |
| facebook.com   | Facebook  |
| baidu.com      | Baidu     |
| wikipedia.org  | Wikipedia |
| yahoo.com      | Yahoo     |
| reddit.com     | reddit    |
| google.co.in   | Google    |
| qq.com         | Qq**      |
| amazon.com     | Amazon    |
| taobao.com     | Taobao    |
| google.co.jp   | Google    |
| twitter.com    | Twitter   |
| tmall.com      | Tmall**   |
| vk.com         | VK        |
| live.com       | Live      |
| instagram.com  | Instagram |
| sohu.com       | Sohu      |
| sina.com.cn    | Sina      |
| weibo.com      | Weibo**   |
| jd.com         | JD        |
| 360.cn         | 360       |
| google.de      | Google    |
| google.co.uk   | Google    |
| google.ru      | Google    |
| google.fr      | Google    |
| linkedin.com   | LinkedIn  |
| google.com.br  | Google    |
| list.tmall.com | Tmall**   |
| google.com.hk  | Google    |
| yandex.ru      | Yandex    |
+----------------+-----------+

Results for Alexa ranks 199992 through 200026:

+----------------------------+--------------------------------------------+
|           Domain           |                  Name                      |
+----------------------------+--------------------------------------------+
| gsm-specs.com              | GSM-specs.com - GSM-specs***               |
| cikm2017.org               | CIKM 2017                                  |
| sitkagear.com              | SITKA Gear | Turning Clothing Into Gear*** |
| laprocure.com              | La Procure                                 |
| pori.fi                    | Pori                                       |
| 1213wz.com                 | 1213wz                                     |
| unistar.by                 | Unistar                                    |
| upskirtjerk.com            | Upskirt Jerk                               |
| astarehsaghf.com           | Astarehsaghf*                              |
| dornc.com                  | Department of***                           |
| serviceacademyforums.com   | Service Academy Forums                     |
| yaledailynews.com          | Yale Daily News                            |
| rewardingexcellence.com    | rformance Ce***                            |
| lokosom.com.br             | Lokosom                                    |
| i-escape.com               | i-escape                                   |
| 90rss.com                  | 90rss                                      |
| bhdstar.vn                 | BHD STAR                                   |
| le-onze-parisien.fr        | Le Onze Parisien                           |
| criarweb.com               | CriarWeb                                   |
| fundayshop.com             | Fundayshop                                 |
| campsitephotos.com         | CampsitePhotos                             |
| spankwirefreehd.com        | Spankwirefreehd                            |
| kabudragon.com             | Kabudragon**                               |
| rebug.me                   | REBUG                                      |
| yuchaoyang.com             | Yuchaoyang*                                |
| naval.com.br               | NAVAL                                      |
| chesterfield.gov           | Chesterfield*                              |
| nururi.com                 | Nururi                                     |
| vcegdaprazdnik.ru          | Vcegdaprazdnik**                           |
| noridianmedicareportal.com | Noridianmedicareportal*                    |
| solobari.it                | Solobari                                   |
| kaddr.com                  | Kaddr                                      |
| mayoclinichealthsystem.org | Mayo Clinic Health System                  |
| sanayi.gov.tr              | Sanayi                                     |
+----------------------------+--------------------------------------------+

Results for Alexa ranks 390000 through 390029:

+---------------------------+---------------------------------------------------------------------+
|          Domain           |                                Name                                 |
+---------------------------+---------------------------------------------------------------------+
| catholicplanet.com        | Catholic Planet                                                     |
| 4jovem.com                | 4jovem                                                              |
| uploadmb.com              | UploadMB                                                            |
| 2bet.ag                   | 2Bet                                                                |
| polnakorzina.ru           | Polnakorzina**                                                      |
| kktown.com.tw             | KKTOWN                                                              |
| pension.de                | Pensionen, Ferienunterkünfte & Ferienwohnungen finden - Pension*** |
| realresultslist.com       | realresultslist*                                                    |
| hoya.co.jp                | HOYA                                                                |
| fbw.jp                    | Fbw**                                                               |
| mongol-media.com          | Mongol-Media                                                        |
| indianpediatrics.net      | Indian Pediatrics                                                   |
| dmmfree.net               | DmmFree                                                             |
| mp3gui.info               | Mp3Gui                                                              |
| xhtmlforum.de             | XHTMLforum                                                          |
| whole9life.com            | Whole9 - Let us change your life***                                 |
| swidnica.pl               | Swidnica                                                            |
| revbrew.com               | rewery | Revolution Brew***                                         |
| nasleshahvar.ir           | Nasleshahvar                                                        |
| com-private.club          | Com-private                                                         |
| crack4patch.com           | Crack 4 Patch                                                       |
| incomingsoft.de           | Incomingsoft*                                                       |
| thefrustratedengineer.com | The Frustrated Engineer                                             |
| forumdesimages.fr         | Forum des images                                                    |
| tripvillas.com            | Tripvillas                                                          |
| araxis.com                | Araxis                                                              |
| rembetiko.gr              | Rembetiko                                                           |
| krasview.ru               | Krasview                                                            |
| duckokong.com             | Duckokong*                                                          |
| hotesextubes.com          | Hot Sex Tubes                                                       |
+---------------------------+---------------------------------------------------------------------+

Result for the link mentioned in the question:

+----------------------------+-----------------------------------------+
|           Domain           |                  Name                   |
+----------------------------+-----------------------------------------+
| stackoverflow.com          | Stack Overflow                          |
+----------------------------+-----------------------------------------+

Main problems:

  • The website must be online and reachable by cURL, without redirects done in JavaScript, for example; see the cases marked with * .

  • "Asian" / "Russian" websites have bigger problems; see those marked with ** .

  • Because the start and the end are matched independently, the extracted section can be much longer or much shorter than the actual name; see those marked with *** . This could be fixed by looking for the closest match, but I did nothing to fix it.

    How does it work?

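    // Cuts out of $title the span that starts where the beginning of $name
    // matches and ends where the end of $name matches.
    function colidirTituloComNome($title, $name){
    
        $inicio = encontrarInicio($title, $name);
        $fim = encontrarFim($title, $name);
    
        if ($inicio !== false && $fim !== false){
            return mb_substr($title, $inicio, $fim - $inicio, 'UTF-8');
        }
    
        // Nothing matched: fall back to the capitalized name.
        return ucfirst($name);
    }
    
    // Position of the first case-insensitive match of $name in $title.
    // If there is no match, retries with the first half of $name.
    function encontrarInicio($title, $name){
    
        $achado = mb_stripos($title, $name, 0, 'UTF-8');
        if ($achado !== false){
            return $achado;
        }
    
        if (mb_strlen($name, 'UTF-8') <= 1) {
            return false;
        }
    
        return encontrarInicio($title, mb_substr($name, 0, (int) ceil(mb_strlen($name, 'UTF-8')/2), 'UTF-8'));
    }
    
    // Position just past the last case-insensitive match of $name in $title.
    // If there is no match, retries with the second half of $name.
    function encontrarFim($title, $name){
    
        $achado = mb_strripos($title, $name, 0, 'UTF-8');
        if ($achado !== false){
            return $achado + mb_strlen($name, 'UTF-8');
        }
    
        if (mb_strlen($name, 'UTF-8') <= 1) {
            return false;
        }
    
        return encontrarFim($title, mb_substr($name, (int) ceil(mb_strlen($name, 'UTF-8')/2), null, 'UTF-8'));
    }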
    

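    The two search functions are "half" duplicated, but that's it. The idea: given the name stackoverflow and the title Stack Overflow em Português, the code tries progressively shorter pieces of the name until it finds where the name starts in the title (at "Stack") and where it ends (at "flow"). It then cuts the title between those two points, so it gets "Stack Overflow".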
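To make the search order concrete, here is a small trace, offered only as a sketch. The helpers `triedPrefixes` and `triedSuffixes` are hypothetical names that are not part of the answer; they simply list the candidates that `encontrarInicio` and `encontrarFim` would try, in order, using the same halving rule.

```php
<?php
// Hypothetical helpers, written only to trace the search order.

/** Prefixes tried for the start: the full name, then its first half, and so on. */
function triedPrefixes(string $name): array
{
    $tried = [$name];
    while (mb_strlen($name, 'UTF-8') > 1) {
        $half = (int) ceil(mb_strlen($name, 'UTF-8') / 2);
        $name = mb_substr($name, 0, $half, 'UTF-8');
        $tried[] = $name;
    }
    return $tried;
}

/** Suffixes tried for the end: the full name, then its second half, and so on. */
function triedSuffixes(string $name): array
{
    $tried = [$name];
    while (mb_strlen($name, 'UTF-8') > 1) {
        $half = (int) ceil(mb_strlen($name, 'UTF-8') / 2);
        $name = mb_substr($name, $half, null, 'UTF-8');
        $tried[] = $name;
    }
    return $tried;
}

echo implode(', ', triedPrefixes('stackoverflow')), PHP_EOL; // stackoverflow, stackov, stac, st, s
echo implode(', ', triedSuffixes('stackoverflow')), PHP_EOL; // stackoverflow, erflow, low, w
```

Against the title "Stack Overflow em Português", the first prefix that matches is "stac" (position 0) and the first suffix that matches is "erflow" (ending at position 14), so the cut is "Stack Overflow".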

    There are several other ways to do this, perhaps more accurate and more efficient, for example with similar_text or levenshtein.
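As one possible reading of the levenshtein idea, the sketch below compares every contiguous run of title words against the name and keeps the closest one. The function name `closestTitleSegment`, the letters-and-digits normalization, and the acceptance threshold (half the name length) are all assumptions made for illustration; none of them comes from the answer, and the results in the tables above were not produced this way.

```php
<?php
/**
 * Hypothetical alternative: pick the run of consecutive title words whose
 * compact form (lowercase ASCII letters and digits only) is closest to $name.
 * Falls back to ucfirst($name) when nothing is close enough.
 */
function closestTitleSegment(string $title, string $name): string
{
    $words = preg_split('/\s+/u', trim($title), -1, PREG_SPLIT_NO_EMPTY);
    $target = strtolower($name);
    $best = null;
    $bestDistance = PHP_INT_MAX;
    $count = count($words);

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i; $j < $count; $j++) {
            $segment = implode(' ', array_slice($words, $i, $j - $i + 1));
            $compact = strtolower(preg_replace('/[^a-z0-9]/i', '', $segment));
            // Skip empty candidates; before PHP 8, levenshtein() rejects
            // strings longer than 255 bytes, so skip those as well.
            if ($compact === '' || strlen($compact) > 255 || strlen($target) > 255) {
                continue;
            }
            $distance = levenshtein($compact, $target);
            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $best = $segment;
            }
        }
    }

    // Assumed threshold: accept only if at most half the name had to change.
    if ($best !== null && $bestDistance <= intdiv(strlen($target), 2)) {
        return $best;
    }
    return ucfirst($name);
}

echo closestTitleSegment('Stack Overflow em Português', 'stackoverflow'), PHP_EOL;
```

This is quadratic in the number of title words, which is probably acceptable for page titles. Punctuation attached to a word (for example a trailing comma) is kept in the returned segment.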

    If nothing is found, it falls back to "Stackoverflow".

    To get the value of <title> you can use:

    function pegaTitulo($url)
    {
        $ch = curl_init($url);
    
        curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => 1,
                CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.0.0 Safari/537.36',
                CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
                CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_2,
                CURLOPT_FOLLOWLOCATION => 1,
                CURLOPT_MAXREDIRS => 2,
                CURLOPT_SSL_VERIFYPEER => 1,
                CURLOPT_SSL_VERIFYHOST => 2,
                CURLOPT_TIMEOUT => 10,                                                             // Timeout
                CURLOPT_CONNECTTIMEOUT => 2,                                                       // Timeout
                CURLOPT_FAILONERROR => 1,
                CURLOPT_CAINFO => __DIR__ . DIRECTORY_SEPARATOR . 'cacert-2017-06-07.pem',         // Download: https://curl.haxx.se/ca/cacert-2017-06-07.pem
            ]
        );
    
        if ($html = curl_exec($ch)) {
    
            libxml_use_internal_errors(true);
            $dom = new DOMDocument();
    
            if ($dom->loadHTML($html)) {
                $list = $dom->getElementsByTagName("title");
                if ($list->length > 0) {
                    return $list->item(0)->textContent;
                }
            }
        }
    
        return false;
    }
    

    cURL fetches the page. Redirects are restricted to HTTP / HTTPS, and at most 2 of them are followed. In addition, it verifies the SSL certificate and uses timeouts so that a request fails if it takes too long. This is minimally safe for public use, where the user can set $url .
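Since CURLOPT_REDIR_PROTOCOLS only constrains redirects, one way to also constrain the initial request is to check the scheme before calling cURL (cURL's CURLOPT_PROTOCOLS option is another). The sketch below assumes a hypothetical helper named `isFetchableUrl`, which is not part of the answer. It only checks the scheme and the presence of a host, and does nothing about requests aimed at internal hosts.

```php
<?php
/**
 * Hypothetical pre-check for a user-supplied URL: accept it only when it
 * parses, has a host, and uses the http or https scheme.
 */
function isFetchableUrl(string $url): bool
{
    $pieces = parse_url($url);
    if ($pieces === false || empty($pieces['scheme']) || empty($pieces['host'])) {
        return false;
    }
    return in_array(strtolower($pieces['scheme']), ['http', 'https'], true);
}

var_dump(isFetchableUrl('https://pt.stackoverflow.com')); // bool(true)
var_dump(isFetchableUrl('file:///etc/passwd'));           // bool(false)
```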

    If all goes well, it will get the contents of the <title> tag using DOMDocument .
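The DOMDocument step can be tried without any network access by feeding it an HTML string. The sketch below isolates that step in a hypothetical helper named `extractTitle`, which is not in the answer; unlike the original, it also trims surrounding whitespace. Non-ASCII titles may come out garbled if the page does not declare its charset, so the example sticks to ASCII.

```php
<?php
/**
 * Hypothetical helper: returns the text of the first <title> element in
 * $html, or false when there is none.
 */
function extractTitle(string $html)
{
    if ($html === '') {
        return false; // loadHTML() rejects an empty string
    }

    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();

    if ($dom->loadHTML($html)) {
        $list = $dom->getElementsByTagName('title');
        if ($list->length > 0) {
            return trim($list->item(0)->textContent);
        }
    }
    return false;
}

var_dump(extractTitle('<html><head><title> Stack Overflow </title></head><body></body></html>'));
// string(14) "Stack Overflow"
```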

    To get the name (for example, stackoverflow from https://pt.stackoverflow.com ) you can use this other function .

    Then you can use:

    $nome = pegaNome($url);
    $titulo = pegaTitulo($url);
    
    if ($nome && $titulo) {
        echo htmlentities(colidirTituloComNome($titulo, $nome));
    }
    
        
  • 24.08.2017 / 22:15
    1

    There is no easy, native, or automated way to do this, because an algorithm like this needs predefined patterns for the code to follow. And when it comes to names, the number of possible patterns is impractical to predict and analyze.

        
    23.08.2017 / 23:09