Python web scraping with the urllib module: the POST method
This program uses 'http://httpbin.org/post' as the example target.
Format:
Import urllib.request
Import urllib.parse
URL-encode the data, then convert it to UTF-8 bytes: data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
Open the target page: response = urllib.request.urlopen('URL', data=data)
Read the page content: html = response.read()
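Print it:
1. Without decode
print(html) # prints the bytes repr (b'...') on a single line, with line breaks shown as \n escapes; hard to read
2. With decode
print(html.decode()) # decodes the bytes to str (UTF-8 by default), so the content prints with its line breaks and indentation, like well-formatted code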
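Check the request result:
a. response.status # returns the HTTP status code; 200 means the request succeeded
b. response.getcode() # returns the same status code as response.status
Note: for an error status such as 404 (page not found), urllib.request.urlopen raises urllib.error.HTTPError instead of returning a response.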
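The steps above can be collected into a small helper. This is only a minimal sketch: the function names `encode_form` and `post_form` and the `timeout` value are hypothetical choices for illustration, not part of the original program. It also catches `urllib.error.HTTPError`, which is how `urlopen` reports error statuses such as 404.

```python
import urllib.error
import urllib.parse
import urllib.request


def encode_form(fields):
    """URL-encode a dict, then convert the resulting str to UTF-8 bytes."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def post_form(url, fields, timeout=10):
    """POST the fields to url and return (status_code, decoded_body).

    Hypothetical helper for illustration; timeout=10 is an assumed value.
    """
    data = encode_form(fields)
    try:
        # Passing data makes urlopen send a POST request.
        with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Error statuses (for example 404) are raised, not returned.
        return err.code, err.read().decode("utf-8", errors="replace")


print(encode_form({"word": "hello"}))  # b'word=hello'
```

Calling `post_form('http://httpbin.org/post', {'word': 'hello'})` should then return the status code together with the decoded response text, assuming the site is reachable.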
1. The program without decode:
import urllib.request
import urllib.parse

# URL-encode the form data, then convert the str to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
print(html)  # raw bytes, not decoded
print("------------------------------------------------------------------")
print("------------------------------------------------------------------")
print(response.status)
print(response.getcode())
Output:
2. The program with decode:
import urllib.request
import urllib.parse

# URL-encode the form data, then convert the str to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
print(html.decode())  # decode the bytes to str before printing
print("------------------------------------------------------------------")
print("------------------------------------------------------------------")
print(response.status)
print(response.getcode())
Output:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.4"
  },
  "json": null,
  "origin": "106.14.17.222",
  "url": "http://httpbin.org/post"
}
------------------------------------------------------------------
------------------------------------------------------------------
200
200
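Because the response body is JSON, it can also be parsed rather than just printed. The sketch below is an assumption-laden illustration: it works on a shortened, hard-coded copy of the body shown above (so it runs without network access), and the variable names are chosen only for this example.

```python
import json

# Shortened copy of the response body from the run above, as bytes,
# the way response.read() returns it.
html = (
    b'{\n'
    b'  "form": {\n'
    b'    "word": "hello"\n'
    b'  },\n'
    b'  "url": "http://httpbin.org/post"\n'
    b'}\n'
)

text = html.decode("utf-8")  # bytes -> str
result = json.loads(text)    # str -> dict

print(result["form"])  # {'word': 'hello'}
print(result["url"])   # http://httpbin.org/post
```

The "form" entry echoes the posted field, which is a convenient way to confirm that the server received the data.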
Why convert with bytes()?
Because without the conversion, the following code fails:
import urllib.request
import urllib.parse

data = urllib.parse.urlencode({'word': 'hello'})  # a str; bytes() is not used
response = urllib.request.urlopen('http://httpbin.org/post', data=data)  # raises TypeError
html = response.read()
Error message:
Traceback (most recent call last):
  File "/usercode/file.py", line 15, in <module>
    response = urllib.request.urlopen('http://httpbin.org/post', data = data )
  File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 453, in open
    req = meth(req)
  File "/usr/lib/python3.4/urllib/request.py", line 1104, in do_request_
    raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
As this shows, the POST method requires the request body to be passed as bytes, not as a str.
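The difference between the two types can be checked without sending anything. The following is a rough sketch, with variable names picked only for this illustration; it builds `urllib.request.Request` objects but never opens a connection.

```python
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({"word": "hello"})
print(type(query).__name__, query)  # str word=hello

# Two equivalent ways to turn the str into UTF-8 bytes.
data_a = bytes(query, encoding="utf-8")
data_b = query.encode("utf-8")
print(data_a == data_b)  # True
print(len(data_a))       # 10

# A Request with data defaults to POST; without data it defaults to GET.
req_get = urllib.request.Request("http://httpbin.org/post")
req_post = urllib.request.Request("http://httpbin.org/post", data=data_a)
print(req_get.get_method(), req_post.get_method())  # GET POST
```

The length of 10 bytes matches the "Content-Length": "10" header in the response shown earlier.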
The Python documentation describes bytes as follows:
class bytes([source[, encoding[, errors]]])
Return a new "bytes" object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray - it has the same non-mutating methods and the same indexing and slicing behavior.
Accordingly, constructor arguments are interpreted as for bytearray().
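A few of these properties can be tried directly in the interpreter. This is just an illustrative sketch; the sample values are arbitrary and not taken from the original article.

```python
# A str source requires an encoding.
text = bytes("hello", encoding="utf-8")
print(text)       # b'hello'
print(text[0])    # 104, indexing yields an integer in range(256)
print(text[1:3])  # b'el', slicing yields a new bytes object

# Other source types accepted by the constructor.
print(bytes(3))           # b'\x00\x00\x00', three zero bytes
print(bytes([104, 105]))  # b'hi', built from an iterable of integers

# bytes is immutable: item assignment raises TypeError.
try:
    text[0] = 72
except TypeError as exc:
    print("immutable:", exc)

# bytearray is the mutable counterpart.
buf = bytearray(text)
buf[0] = 72
print(buf)  # bytearray(b'Hello')
```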