Problem capturing the title of a URL using a regular expression


I'm learning the concurrency side of Go. As an exercise, I was challenged to use the generator pattern to return a channel that delivers the title of each URL, read inside a goroutine.

Inside the goroutine I wrote, the page is fetched with an HTTP GET, converted to a string, and then matched against a regex. Initially the code panicked with an out-of-bounds error (panic: runtime error: index out of range). I found that the cause was the <title> tag: my regular expression did not match across the line break, because in (.*?) the dot (.) does not match newline characters.

I discovered this by viewing the source of several sites and realizing that titles are not always written on a single line between the <title> tags; they may also span several lines, for example:

<title>
meusite
</title>

instead of <title>meusite</title>

So far so good.

With that in mind, I tried to improve my regex so that it matches titles on a single line as well as titles spread over several lines, but I was not successful: the code still did not return the titles I expected.

Below is my source code:

// Concurrency patterns - Generator
// For more information on concurrency patterns, see the talk
// Google I/O 2012 - Go Concurrency Patterns

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func tituloURL(urls ...string) <-chan string {
    ch := make(chan string)

    for _, url := range urls {
        go func(url string) {
            resp, _ := http.Get(url)
            html, _ := ioutil.ReadAll(resp.Body)

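            // Raw string literals (backticks) are used so that \/, \d and \s
            // reach the regexp package instead of being rejected by the compiler.
            // r, _ := regexp.Compile(`<title>(.*?)<\/title>`)
            // r, _ := regexp.Compile(`<title>(.|\n)*?<\/title>`)
            r, _ := regexp.Compile(`<title>(.*?)|([^\d])*?<\/title>`)
            // r, _ := regexp.Compile(`<title>([\s\S]*?)<\/title>`)
            // r, _ := regexp.Compile(`<title>(.|[\s\S])*?<\/title>`)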

            ch <- r.FindStringSubmatch(string(html))[1]

        }(url)
    }
    return ch
}

func main() {
    t1 := tituloURL("https://www.github.com", "https://www.linkedin.com")
    t2 := tituloURL("https://www.instagram.com", "https://www.youtube.com")
    fmt.Println("First titles:", <-t1, "|", <-t2)
    fmt.Println("Second titles:", <-t1, "|", <-t2)
}

As you may have noticed, I tried several regex patterns. They did match in RegexPal, but the code did not return the expected result.

Does anyone have a suggestion for another regex that solves this problem?

Thanks in advance for your help!

    
asked by anonymous 07.10.2018 / 19:10

1 answer


If you want to use regexp:

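// The i flag makes the match case-insensitive; the s flag lets . match newlines.
var r = regexp.MustCompile(`(?is)<title>(.*?)</title>`)

// site holds the page's HTML as a string.
matches := r.FindStringSubmatch(site)
if len(matches) == 2 {
    fmt.Printf("title: %q\n", matches[1])
} else {
    fmt.Println("no title!")
}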
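
To see what the (?is) flags change, here is a small self-contained sketch. The helper name extractTitle and the sample HTML strings are made up for illustration and are not part of the original code. The helper checks for a nil match before indexing, which avoids the index out of range panic from the question, and it uses strings.TrimSpace to drop the line breaks around a title that spans several lines.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// titleRe uses the flags from the answer: i = case-insensitive,
// s = let . match newline characters as well.
var titleRe = regexp.MustCompile(`(?is)<title>(.*?)</title>`)

// extractTitle is a hypothetical helper. It returns the trimmed title,
// or false when the page has no <title>...</title> pair, instead of panicking.
func extractTitle(page string) (string, bool) {
	m := titleRe.FindStringSubmatch(page)
	if m == nil {
		return "", false
	}
	return strings.TrimSpace(m[1]), true
}

func main() {
	// Sample inputs (assumed for the demo): single line, multi-line,
	// upper-case tags, and a page without a title.
	pages := []string{
		"<title>meusite</title>",
		"<title>\nmeusite\n</title>",
		"<TITLE>meusite</TITLE>",
		"<p>no title here</p>",
	}
	for _, p := range pages {
		title, ok := extractTitle(p)
		fmt.Printf("%q %v\n", title, ok)
	}
}
```

Note that this pattern only matches a bare <title> tag; a tag that carries attributes would need a looser pattern or the HTML parser.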

But the better approach is to parse the HTML:

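// Requires the imports "errors", "net/http", "golang.org/x/net/html"
// and "golang.org/x/net/html/atom".
func getTitle(site string) (title string, err error) {
    resp, err := http.Get(site)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    node, err := html.Parse(resp.Body)
    if err != nil {
        return "", err
    }

    title, ok := findTitle(node)
    if !ok {
        return "", errors.New("no title")
    }

    return title, nil
}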

func findTitle(node *html.Node) (title string, ok bool) {
    if node.DataAtom == atom.Title && node.FirstChild != nil {
        return node.FirstChild.Data, true
    }

    for c := node.FirstChild; c != nil; c = c.NextSibling {
        title, ok = findTitle(c)
        if ok {
            return title, ok
        }
    }

    return "", false
}
    
27.10.2018 / 12:10