How to extract web content (web scraping) with C#?


I recently learned how to do web scraping and got it working on some sites, but not on others. I noticed that some of the sites where it fails have a "#". What does that mean?

Here is an example of a site where this happens: link

Is there any way to do web scraping on this site?

I usually do this:

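using System.Net;
using System.Text;

var wc = new WebClient();
wc.Encoding = Encoding.UTF8;
var pagina = wc.DownloadString(url);

var htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(pagina); // load the downloaded HTML so nodes can be queried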

And then I find the node I want.

asked by anonymous 21.06.2018 / 20:55

1 answer


Here is a web scraper that, starting from one URI, recursively collects all the references to other URIs:

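using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public class WebScraper
{
    // Pages already visited, so that circular links do not recurse forever.
    private static readonly HashSet<string> Visited = new HashSet<string>();

    public static void Main(string[] args)
    {
        string url = args[0];

        // Print every URL reachable from the starting page.
        foreach (string anotherUrl in GetScrapedUrls(url))
        {
            Console.WriteLine(anotherUrl);
        }
    }

    // Rejects links that should not be followed.
    private static bool IsValidChunk(string chunk)
    {
        bool result = true;

        result = result && chunk.First() != '#'; // "#..." only points inside the same page
        result = result && !chunk.Contains("clicklogger");
        result = result && !chunk.StartsWith("https");
        result = result && !chunk.Contains("captcha");
        result = result && !chunk.Contains("counter");

        return result;
    }

    private static IEnumerable<string> GetScrapedUrls(string url)
    {
        Uri myUri;
        if (Uri.TryCreate(url, UriKind.Absolute, out myUri) && Visited.Add(myUri.AbsoluteUri))
        {
            yield return myUri.AbsoluteUri;

            string content;
            using (WebClient client = new WebClient())
            {
                content = client.DownloadString(myUri);
            }

            // Only parse HTML pages; "<html" also matches tags that carry attributes.
            if (!string.IsNullOrEmpty(content) &&
                content.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                // Capture the href value of every <a> tag.
                MatchCollection matches =
                    Regex.Matches(content, @"<a[^>]+?href\s*?=\s*?['""]([^'""]+)['""]");

                foreach (Match match in matches)
                {
                    string chunk = match.Groups[1].Value;

                    if (IsValidChunk(chunk))
                    {
                        // Absolute links are used as-is; relative ones are appended to the current URL.
                        string oneMoreUrl = chunk.StartsWith("http")
                            ? chunk
                            : url.TrimEnd('/') + "/" + chunk.TrimStart('/');

                        foreach (string evenOneMoreUrl in GetScrapedUrls(oneMoreUrl))
                        {
                            yield return evenOneMoreUrl;
                        }
                    }
                }
            }
        }
    }
}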
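
The scraper above builds the next URL by string concatenation. As a possible alternative, and not part of the original answer, a relative href can be resolved against the address of the page it was found on with `Uri.TryCreate(Uri, string, out Uri)`. This is only a sketch: `LinkResolver` and `ResolveLink` are hypothetical names, and the `example.com` addresses are placeholders.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: resolves an href value against the page it came from.
    // Returns null when either input is unusable or the result is not http(s).
    public static string ResolveLink(string pageUrl, string href)
    {
        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
            return null;

        // Handles "c.html", "/x", "../y" and already-absolute links alike.
        Uri resolved;
        if (!Uri.TryCreate(baseUri, href, out resolved))
            return null;

        // Skip mailto:, javascript: and other non-web schemes.
        if (resolved.Scheme != Uri.UriSchemeHttp && resolved.Scheme != Uri.UriSchemeHttps)
            return null;

        return resolved.AbsoluteUri;
    }
}
```

For instance, `LinkResolver.ResolveLink("http://example.com/a/b.html", "c.html")` gives `http://example.com/a/c.html`, so the result could be passed to `GetScrapedUrls` in place of the concatenated string.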
22.06.2018 / 00:02