How to extract web content (web scraping) with C#?

I recently learned how to do web scraping and got it working on some sites, but not on others. I noticed that on some of those sites I cannot get a "#". What does that mean?

Here is an example of a site where this happens to me: link

Is there any way to do web scraping on this site?

I usually do this:

var wc = new WebClient();
wc.Encoding = Encoding.UTF8;
var pagina = wc.DownloadString(url);

var htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(pagina);

And then I find the node I want.

    
asked by anonymous 21.06.2018 / 20:55

1 answer

Here is a web scraper that starts from one URI and recursively collects the links to other URIs that it finds:

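using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public class WebScraper
{
    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.Error.WriteLine("Usage: WebScraper <url>");
            return;
        }

        // Remember visited pages so that pages linking to each other
        // do not send the crawler into endless recursion.
        var visited = new HashSet<string>();

        foreach (string anotherUrl in GetScrapedUrls(args[0], visited))
        {
            Console.WriteLine(anotherUrl);
        }
    }

    // Rejects in-page anchors ("#..."), absolute https links, and links
    // that look like click trackers, captchas, or counters.
    private static bool IsValidChunk(string chunk)
    {
        return !chunk.StartsWith("#")
            && !chunk.StartsWith("https")
            && !chunk.Contains("clicklogger")
            && !chunk.Contains("captcha")
            && !chunk.Contains("counter");
    }

    private static IEnumerable<string> GetScrapedUrls(string url, HashSet<string> visited)
    {
        Uri myUri;
        bool isWebUrl = Uri.TryCreate(url, UriKind.Absolute, out myUri)
            && (myUri.Scheme == Uri.UriSchemeHttp || myUri.Scheme == Uri.UriSchemeHttps);

        // Stop on anything that is not an http(s) URL or was already visited.
        if (!isWebUrl || !visited.Add(myUri.AbsoluteUri))
        {
            yield break;
        }

        yield return myUri.AbsoluteUri;

        string content = null;
        try
        {
            using (var client = new WebClient())
            {
                content = client.DownloadString(myUri);
            }
        }
        catch (WebException)
        {
            // Unreachable pages (404, timeout, ...) are listed but not followed.
        }

        // Only parse responses that look like HTML; "<html" also matches
        // tags with attributes such as <html lang="en">.
        if (string.IsNullOrEmpty(content) ||
            content.IndexOf("<html", StringComparison.OrdinalIgnoreCase) < 0)
        {
            yield break;
        }

        MatchCollection matches =
            Regex.Matches(content, @"<a[^>]+?href\s*?=\s*?['""]([^'""]+)['""]");

        foreach (Match match in matches)
        {
            string chunk = match.Groups[1].Value;
            Uri oneMoreUri;

            // Resolve relative links against the current page's address.
            if (IsValidChunk(chunk) && Uri.TryCreate(myUri, chunk, out oneMoreUri))
            {
                foreach (string evenOneMoreUrl in GetScrapedUrls(oneMoreUri.AbsoluteUri, visited))
                {
                    yield return evenOneMoreUrl;
                }
            }
        }
    }
}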
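
Links in an href are often relative to the page they appear on, so they have to be resolved against that page's address before they can be downloaded. As a rough sketch of that one step in isolation, System.Uri can do the resolving. The LinkResolver class and its Resolve method are hypothetical names used only for illustration, and example.com is a placeholder address:

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: resolves an href against the address of the page
    // it was found on. Returns null when the result is not an http(s) URL,
    // for example "mailto:" or "javascript:" links.
    public static string Resolve(string pageUrl, string href)
    {
        Uri page;
        Uri resolved;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out page) ||
            !Uri.TryCreate(page, href, out resolved))
        {
            return null;
        }

        if (resolved.Scheme != Uri.UriSchemeHttp &&
            resolved.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }

        // Drop the "#fragment" part: it is never sent to the server, so the
        // URL with and without it refers to the same downloaded document.
        return resolved.GetLeftPart(UriPartial.Query);
    }
}
```

With this sketch, Resolve("http://example.com/docs/index.html", "page2.html") gives http://example.com/docs/page2.html, and an href of "#top" resolves back to the page itself, which is one reason a scraper can skip links that start with "#".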
    
22.06.2018 / 00:02