Error in "utf-8" in python 3

3

I'm having a problem in Python 3 with the following code:

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import urllib.request

page = urllib.request.urlopen("http://beans-r-us.biz/prices.html")

text = page.read().decode('utf8')

print(text)

It gives this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 1265: invalid continuation byte

I don't know how to fix it.

Note: I'm still a beginner in programming. This code is from the book "Head First Programming", and its purpose is to display the contents of the site.

    
asked by anonymous 06.11.2016 / 21:59

2 answers

1

The error happens on the following line:

text = page.read().decode('utf8')

It tries to decode the page at that address as UTF-8, but fails when it reaches a byte sequence that is not valid UTF-8. The content of the page is as follows:

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=shift_jis"><meta http-equiv="Content-Language" content="ja,en"><script type="text/javascript">\r\n\r\n  var _gaq = _gaq || [];\r\n  _gaq.push([\'_setAccount\', \'UA-20569835-2\']);\r\n  _gaq.push([\'_trackPageview\']);\r\n\r\n  (function() {\r\n    var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\n    ga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-analytics.com/ga.js\';\r\n    var s = document.getElementsByTagName(\'script\')[0]; s.parentNode.insertBefore(ga, s);\r\n  })();\r\n\r\n</script><title>404 Not Found</title></head><body oncontextmenu="return false;" style="width: 100% !important; height: 2600px !important;">\r\n<center><a href="http://cgi.i-mobile.co.jp/ad_link.aspx?guid=on&asid=32341&pnm=0&asn=1"><img border="0" src="http://cgi.i-mobile.co.jp/ad_img.aspx?guid=on&asid=32341&pnm=0&asn=1&asz=0&atp=2&lnk=6666ff&bg=&txt=000000&pbb=1"></a></center>\r\n<center><ahref="http://cgi.i-mobile.co.jp/ad_link.aspx?guid=on&asid=32341&pnm=0&asn=2"><img border="0" src="http://cgi.i-mobile.co.jp/ad_img.aspx?guid=on&asid=32341&pnm=0&asn=2&asz=0&atp=2&lnk=6666ff&bg=&txt=000000"></a></center>\r\n\r\n\r\n<center><FONTSIZE="2">ミンナ�ホが選んだ�ゥ11/07のランキング�ソ</FONT></center>\r\n<center><FONT SIZE="2">�ソ ��位 �ソ</FONT></center>\r\n\r\n<br>\r\n<center><FONT SIZE="2">�ソ ��位 �ソ</FONT></center>\r\n\r\n<a name="madop"></a>\r\n<br>\r\n<center><font size="2">他のキーワードで探してみる</FONT></center><center>\r\n<form method="get" action="/genre23.php">\r\n<font size="2"><input type="text" name="query2" value="" size="8"><font size="4">\r\n<SELECT name="genre">\r\n<OPTION value="3">��</OPTION>\r\n\r\n</SELECT>\r\n</FONT><input type="submit" value=" 探す�マ "></FONT>\r\n<input type="hidden" name="cache" value=""><input type="hidden" name="fname" value="">\r\n</form>\r\n</center><br>\r\n<center><font size="2" color="red"><b><a 
href="/inq/disclaimer.php?ngdom=beans-r-us.biz&ngk=retire%20your%20vehicle">利用規約・削除依頼</a></b></FONT></center>\r\n<br></body></html>'

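As you can see, the page declares charset=shift_jis in its meta tag and contains several Japanese characters, so its bytes are not valid UTF-8. The decoder most likely failed on one of those characters. Note also that the title is "404 Not Found", which suggests the book's original page is no longer served at that address.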
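One possible way around this, as a minimal sketch rather than the book's method: read the charset the page declares and decode with that, replacing any byte that still cannot be decoded instead of raising. The helper names sniff_charset, decode_page and fetch_text are hypothetical, and the regex is a simplification that only looks for the first "charset=" near the top of the page.

```python
import re
import urllib.request

# Hypothetical helpers for illustration; not from the book or the original code.
# Simplified pattern: first "charset=" value found in the raw bytes.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared near the top of the page, else default."""
    match = CHARSET_RE.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default


def decode_page(raw, default="utf-8"):
    """Decode with the declared charset; undecodable bytes become U+FFFD."""
    charset = sniff_charset(raw, default)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The page declared a codec name Python does not know.
        return raw.decode(default, errors="replace")


def fetch_text(url):
    """Download url and decode it (requires network access)."""
    with urllib.request.urlopen(url) as page:
        return decode_page(page.read())
```

With this, print(fetch_text("http://beans-r-us.biz/prices.html")) should print the page instead of raising UnicodeDecodeError, although it will still be whatever the server currently returns, not necessarily the price page from the book.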

    
07.11.2016 / 00:00
-3

I ran the same example without removing .decode("utf-8"), and the output was the same as without it. The book's example puts .html at the end of the site address; remove it and you will get the same result.

P.S.: I use Python 3.

    
18.12.2016 / 04:22