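urllib provides a set of functions for working with URLs.
The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.
For example, the following code fetches the Douban URL https://api.douban.com/v2/book/2129650 and prints the response. The output after the code shows the HTTP status, the response headers, and the JSON data: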
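from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    # Read the whole body as bytes while the connection is still open.
    data = f.read()
    print('Status:', f.status, f.reason)
    # getheaders() returns a list of (name, value) tuples.
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
print('Data:', data.decode('utf-8'))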
Status: 200 OK
Date: Sat, 26 Jan 2019 11:59:44 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2138
Connection: close
Vary: Accept-Encoding
X-Ratelimit-Remaining2: 99
X-Ratelimit-Limit2: 100
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=BdhRHcX_CIE; Expires=Sun, 26-Jan-20 11:59:44 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: BdhRHcX_CIE
X-DAE-Node: anson70
X-DAE-App: book
Server: dae
X-Frame-Options: SAMEORIGIN
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰"],"pubdate":"2007","tags":[{"count":23,"name":"spring","title":"spring"},{"count":14,"name":"Java","title":"Java"},{"count":6,"name":"javaee","title":"javaee"},{"count":5,"name":"j2ee","title":"j2ee"},{"count":4,"name":"计算机","title":"计算机"},{"count":3,"name":"藏书","title":"藏书"},{"count":3,"name":"编程","title":"编程"},{"count":3,"name":"POJO","title":"POJO"}],"origin_title":"","image":"https://img3.doubanio.com\/view\/subject\/m\/public\/s2552283.jpg","binding":"平装","translator":[],"catalog":"","pages":"509","images":{"small":"https://img3.doubanio.com\/view\/subject\/s\/public\/s2552283.jpg","large":"https://img3.doubanio.com\/view\/subject\/l\/public\/s2552283.jpg","medium":"https://img3.doubanio.com\/view\/subject\/m\/public\/s2552283.jpg"},"alt":"https:\/\/book.douban.com\/subject\/2129650\/","id":"2129650","publisher":"电子工业出版社","isbn10":"7121042622","isbn13":"9787121042621","title":"Spring 2.0核心技术与最佳实践","url":"https:\/\/api.douban.com\/v2\/book\/2129650","alt_title":"","author_intro":"","summary":"本书注重实践而又深入理论,由浅入深且详细介绍了Spring 2.0框架的几乎全部的内容,并重点突出2.0版本的新特性。本书将为读者展示如何应用Spring 2.0框架创建灵活高效的JavaEE应用,并提供了一个真正可直接部署的完整的Web应用程序——Live在线书店(http:\/\/www.livebookstore.net)。\n在介绍Spring框架的同时,本书还介绍了与Spring相关的大量第三方框架,涉及领域全面,实用性强。本书另一大特色是实用性强,易于上手,以实际项目为出发点,介绍项目开发中应遵循的最佳开发模式。\n本书还介绍了大量实践性极强的例子,并给出了完整的配置步骤,几乎覆盖了Spring 2.0版本的新特性。\n本书适合有一定Java基础的读者,对JavaEE开发人员特别有帮助。本书既可以作为Spring 2.0的学习指南,也可以作为实际项目开发的参考手册。","price":"59.8"}
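To make a GET request look as if it came from a browser, we need a Request object: by adding HTTP headers to the Request, we can disguise the request as one sent by a browser. For example, the following code imitates an iPhone 6 requesting the Douban home page, and Douban returns the mobile version of the page suited to the iPhone: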
from urllib import request

req = request.Request('http://www.douban.com/')
# Present the request as coming from mobile Safari on an iPhone.
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
Status: 200 OK
Date: Sat, 26 Jan 2019 11:59:46 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
X-Xss-Protection: 1; mode=block
X-Douban-Mobileapp: 0
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: talionnav_show_app="0"
Set-Cookie: bid=fHdEEseIE8o; Expires=Sun, 26-Jan-20 11:59:46 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: fHdEEseIE8o
X-DAE-Node: anson60
X-DAE-App: talion
Server: dae
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=15552000;
X-Content-Type-Options: nosniff
Data:
<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/WebPage" class="ua-safari ua-mobile ">
<head>
<meta charset="UTF-8">
<title>豆瓣(手机版)</title>
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
<meta name="format-detection" content="telephone=no">
<link rel="canonical" href="
http://m.douban.com/">
<link href="https://img3.doubanio.com/f/talion/95a2c9ca9b251b8c2e810bc80afd4e9c7c37b392/css/card/base.css" rel="stylesheet">
<meta name="description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。">
<meta name="keywords" content="豆瓣,手机豆瓣,豆瓣手机版,豆瓣电影,豆瓣读书,豆瓣同城">
<!-- Schema.org markup for Google+ -->
<meta itemprop="name" content="豆瓣">
<meta itemprop="description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。">
<meta itemprop="image" content="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png">
<!-- Twitter meta -->
<meta name="twitter:card" content="summary" />
<!-- Open Graph meta -->
<meta property="og:title" content="豆瓣" />
<meta property="og:description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。" />
<meta property="og:site_name" content="豆瓣(手机版)" />
<meta property="og:url" content="https://m.douban.com/" />
<meta property="og:image" content="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png" />
<meta property="og:image:type" content="image/png" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:type" content="article" />
<!-- Wechat meta -->
<meta property="weixin:timeline_title" content="豆瓣" />
<meta property="weixin:chat_title" content="豆瓣" />
<meta property="weixin:description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。" />
<meta property="weixin:image" content="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png" />
<script>
;(function () {
window.setMeta = function (name, val) {
var meta = document.querySelectorAll('meta[property="' + name + '"], meta[name="' + name + '"]')
if (!meta.length) {
meta = document.createElement('meta')
meta.name = name
document.head.appendChild(meta)
meta = [meta]
}
meta[0].content = val || ''
}
window.getMeta = function (name) {
var meta = document.querySelectorAll('meta[property="' + name + '"], meta[name="' + name + '"]')
if (!meta.length) {
return ''
} else {
return meta[0].content
}
}
!getMeta('weixin:chat_title') && setMeta('weixin:chat_title', document.title)
!getMeta('weixin:timeline_title') && setMeta('weixin:timeline_title', document.title)
!getMeta('weixin:description') && setMeta('weixin:description', getMeta('og:description'))
})();
</script>
<link rel="stylesheet" href="https://img3.doubanio.com/misc/mixed_static/468e9211d89502ad.css">
<link rel="icon" type="image/png" sizes="16x16" href="https://img3.doubanio.com/f/talion/c970bb0d720963a7392f7dd6c77068bb9925caaf/pics/icon/dou16.png">
<link rel="icon" type="image/png" sizes="32x32" href="https://img3.doubanio.com/f/talion/2f3c0bc0f35b031d4535fd993ae3936f4e40e6c8/pics/icon/dou32.png">
<link rel="icon" type="image/png" sizes="48x48" href="https://img3.doubanio.com/f/talion/10a4a913a5715f628e4b598f7f9f2c18621bdcb3/pics/icon/dou48.png">
<!-- iOS touch icon -->
<link rel="apple-touch-icon-precomposed" href="https://img3.doubanio.com/f/talion/997f2018d82979da970030a5eb84c77f0123ae5f/pics/icon/m_logo_76.png">
<link rel="apple-touch-icon-precomposed" sizes="76x76" href="https://img3.doubanio.com/f/talion/997f2018d82979da970030a5eb84c77f0123ae5f/pics/icon/m_logo_76.png">
<link rel="apple-touch-icon-precomposed" sizes="120x120" href="https://img3.doubanio.com/f/talion/18932a3e71a60ed7150ca2ca7ebf21ddadd7092e/pics/icon/m_logo_120.png">
<link rel="apple-touch-icon-precomposed" sizes="152x152" href="https://img3.doubanio.com/f/talion/b99497ff8538c54b9ba6f40867da932396ab2562/pics/icon/m_logo_152.png">
<link rel="apple-touch-icon-precomposed" sizes="167x167" href="https://img3.doubanio.com/f/talion/0c233ada957a95e632f81607e30230d16e8293e8/pics/icon/m_logo_167.png">
<link rel="apple-touch-icon-precomposed" sizes="180x180" href="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png">
<link rel="apple-touch-icon-precomposed" sizes="200x200" href="https://img3.doubanio.com/f/talion/7c6364aadf368dc0210173c940cfd0f64ceddc66/pics/icon/m_logo_200.png">
<!-- For Android -->
<link rel="icon" sizes="128x128" href="https://img3.doubanio.com/f/talion/b99497ff8538c54b9ba6f40867da932396ab2562/pics/icon/m_logo_152.png">
<link rel="icon" sizes="192x192" href="https://img3.doubanio.com/f/talion/7c6364aadf368dc0210173c940cfd0f64ceddc66/pics/icon/m_logo_200.png">
<!-- For Web App Manifest -->
<link rel="manifest" href="/pwa/manifest?path=/&short_name=%E8%B1%86%E7%93%A3%28%E6%89%8B%E6%9C%BA%E7%89%88%29&name=%E8%B1%86%E7%93%A3%28%E6%89%8B%E6%9C%BA%E7%89%88%29">
<meta name="theme-color" content="#42bd56">
<link type="application/opensearchdescription+xml" rel="search" href="/opensearch"/>
<!-- hm baidu -->
<script type="text/javascript">
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "https://hm.baidu.com/hm.js?6d4a8cfea88fa457c3127e14fb5fabc2";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
_hmt.logTruncate = function(type) {
_hmt.push(['_trackEvent', 'article', 'truncate', type, 1]);
}
</script>
</head>
<body ontouchstart="">
<div id="TalionNav"><header class="TalionNav"><div class="TalionNav-primary"><a href="/"><h1>豆瓣</h1></a><nav><ul><li><a href="/movie" style="color: #2384E8;">电影</a></li><li><a href="/book" style="color: #9F7860;">图书</a></li><li><a href="/status" style="color: #E4A813;">广播</a></li><li><a href="/group" style="color: #2AB8CC;">小组</a></li></ul><span class=""></span></nav></div><div class="TalionNav-secondary"><a class="close-nav" href="javascript:;">关闭</a><form action="/search" method="GET"><div><input name="query" type="search"></div></form><ul><li><div><a href="/movie" target="_blank"><strong style="color: #2384E8;">电影</strong><span>影院热映</span></a><a href="https://douban.com/location" target=""><strong style="color: #E6467E;">同城</strong><span>周末活动</span></a><a href="https://read.douban.com" target=""><strong style="color: #9F7860;">阅读</strong><span>电子书</span></a><a href="/status" target="_blank"><strong style="color: #E1644D;">广播</strong><span>友邻动态</span></a></div></li><li><div><a href="/tv" target="_blank"><strong style="color: #7A6ADB;">电视</strong><span>正在热播</span></a><a href="/group" target="_blank"><strong style="color: #2AB8CC;">小组</strong><span>志趣相投</span></a><a href="/game" target="_blank"><strong style="color: #5774C5;">游戏</strong><span>虚拟世界</span></a><a href="https://douban.fm" target=""><strong style="color: #40CFA9;">FM</strong><span>红心歌单</span></a></div></li><li><div><a href="/book" target="_blank"><strong style="color: #9F7860;">图书</strong><span>畅销排行</span></a><a href="/music" target="_blank"><strong style="color: #F48F2E;">音乐</strong><span>新碟榜</span></a><a href="/mobileapp" target="_blank"><strong style="color: #596CDD;">应用</strong><span>玩手机</span></a><a href="https://market.douban.com/?utm_campaign=mobile_web_douban_nav&utm_source=douban&utm_medium=mobile_web" target=""><strong style="color: #42BD56;">豆品</strong><span>生活美学</span></a></div></li></ul><div class="navBottom"><div class="nav-item"><a class="toUser" href="/mine/">我的豆瓣</a><a class="toExit" 
href="https://accounts.douban.com/logout?ck=undefined&redir=http://accounts.douban.com/passport/login">退出豆瓣</a></div><div class="nav-item"><a class="toPC" href="/to_pc/?url=about%3Ablank">使用桌面版</a><a class="toApp">使用豆瓣App</a></div></div></div></header></div>
<div class="page">
<div class="card">
<ul class="quick-nav">
<li>
<a href="/movie/nowintheater?loc_id=108288">影院热映</a>
</li>
<li>
<a href="/music/newwestern/">欧美新碟榜</a>
</li>
<li>
<a id="hot-topics" href="https://m.douban.com/time/?dt_time_source=douban-msite_shortcut">豆瓣时间</a>
</li>
<li>
<a href="https://www.douban.com/doubanapp/app?channel=card_home&direct_dl=1">使用豆瓣App</a>
</li>
</ul>
<section id="recommend-feed"></section>
</div>
</div>
<script src="https://img3.doubanio.com/f/talion/ee8e0c54293aefb5709ececbdf082f8091ad5e49/js/card/zepto.min.js"></script>
<script src="https://img3.doubanio.com/f/talion/c453219f84a3c1d4f986fdcaf6b34c03bc913c6f/js/card/main.js"></script>
<script src="https://img3.doubanio.com/f/talion/f53cc45d4a16969b8592d776f476d9784a283e4a/js/lib/douban-ad-helper.js"></script>
<script src="https://img3.doubanio.com/f/talion/88fc2b21c8dda5c93aa4c011eb15b74f8850978f/js/lib/react/15.3.0/react-all.min.js"></script>
<script type="text/javascript">
var userCfg = {}
</script>
<script type="text/javascript" src="https://img3.doubanio.com/misc/mixed_static/27e49996655f37b9.js"></script>
<script type="text/javascript" data-mobile="true">
(function (global) {
var newNode = global.document.createElement('script'),
existingNode = global.document.getElementsByTagName('script')[0],
adSource = '//erebor.douban.com/',
userId = '',
browserId = 'fHdEEseIE8o',
criteria = '3:/',
preview = '',
debug = false;
global.DoubanAdRequest = {src: adSource, uid: userId, bid: browserId, crtr: criteria, prv: preview, debug: debug};
newNode.setAttribute('type', 'text/javascript');
newNode.setAttribute('src', 'https://img3.doubanio.com/f/adjs/dd37385211bc8deb01376096bfa14d2c0436a98c/ad.release.js');
newNode.setAttribute('async', true);
existingNode.parentNode.insertBefore(newNode, existingNode);
})(this);
</script>
<script type='text/javascript'>
;(function(global) {
if (window.DoubanAdRequest) {
window.DoubanAdRequest.filter = []
}
global.DoubanAdSlots = global.DoubanAdSlots || []
})(window);
</script>
<!-- Google Tag Manager -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-NZHN7H" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-NZHN7H');</script>
<!-- End Google Tag Manager -->
<!-- Google Analytics -->
<script>
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-53594431-3', {'sampleRate': 4});
ga('send', 'pageview');
</script>
<script async src='//www.google-analytics.com/analytics.js'></script>
<!-- End Google Analytics -->
<script type='text/javascript'>
(function(){
var site_list = window.white_site_list || new RegExp ([
'^https?://(.+\\.douban\\.com',
'|web[0-9]?\\.qq\\.com',
'|hao\\.qq\\.com',
'|(hao\\.)*360\\.cn',
'|so\\.com',
'|www\\.soso\\.com',
'|(www\\.)?growingio\\.com',
'|m-douban-com\\.mipcdn\\.com',
'|.+\\.baidu\\.com',
')(\\:[\\d]+)?/'
].join(''));
if (self !== top && document.referrer.search(site_list) === -1) {
top.location = self.location;
}
})();
</script>
</body>
</html>
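The effect of add_header can be checked without sending anything over the network. The following is a minimal sketch, assuming a shortened, made-up User-Agent string (MOBILE_UA) and a placeholder URL. It shows the default User-Agent that urllib would send on its own, and how the Request object stores the header we added:

```python
from urllib import request

# Hypothetical, shortened User-Agent value used only for illustration.
MOBILE_UA = 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)'

# A freshly built opener carries urllib's default headers.
default_headers = request.build_opener().addheaders
print(default_headers)  # a single ('User-agent', 'Python-urllib/<version>') pair

# Placeholder URL; nothing is sent because urlopen() is never called.
req = request.Request('http://www.example.com/')
req.add_header('User-Agent', MOBILE_UA)

# add_header() stores the header name capitalized, i.e. as 'User-agent'.
print(req.has_header('User-agent'))   # True
print(req.has_header('User-Agent'))   # False: lookups use the stored spelling
print(req.get_header('User-agent'))
print(req.header_items())
```

Note that has_header and get_header expect the capitalized spelling 'User-agent', which is easy to trip over when checking a Request by hand.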
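To send a request as a POST, just pass the data parameter as bytes.
The example below simulates a Weibo login: it first reads the login email and password, then passes them encoded as username=xxx&password=xxx, following the format used by the weibo.cn login page: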
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
# urlencode() turns the pairs into a 'key=value&key=value' string.
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

# Passing data (as bytes) makes urlopen() send a POST instead of a GET.
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
Login to weibo.cn...
Email: c@c.com
Password: 123
Status: 200 OK
Server: nginx/1.6.1
Date: Sat, 26 Jan 2019 12:00:44 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
DPOOL_HEADER: dryad65
SINA-LB: aGEuMjM1LmcxLnF4Zy5sYi5zaW5hbm9kZS5jb20=
SINA-TS: YjBjYTk0Y2UgMCAwIDAgNyA1OTIK
Data: {"retcode":50011002,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"c@c.com","errline":655}}
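The encoding step can be inspected on its own, without contacting any server. The sketch below assumes made-up form fields and a placeholder URL. It attaches the data to the Request object instead of passing it to urlopen, only so that the resulting method can be printed without sending the request:

```python
from urllib import parse, request

# Hypothetical form fields, for illustration only.
form = [('username', 'user@example.com'), ('password', 'p@ss word')]

# urlencode() percent-encodes reserved characters and turns spaces into '+';
# the result must be converted to bytes before it can be sent.
body = parse.urlencode(form).encode('utf-8')
print(body)  # b'username=user%40example.com&password=p%40ss+word'

# Placeholder URL; nothing is sent because urlopen() is never called.
req = request.Request('http://www.example.com/login', data=body)
print(req.get_method())  # POST, because data is not None

# Decoding the body gives back the original pairs.
print(parse.parse_qsl(body.decode('utf-8')))
```

A Request without data reports GET from get_method(), which is why supplying data is all it takes to switch to POST.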
For more complex control, such as accessing a website through a proxy, we need to use ProxyHandler. Sample code:
import urllib.request

# Route HTTP traffic through the proxy at www.example.com:3128.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
# Supply credentials for a proxy that requires basic authentication.
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
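Building an opener does not open any connection, so its configuration can be checked offline. The following is a small sketch that reuses the placeholder proxy address from the sample and lists the handlers the opener ends up with:

```python
import urllib.request

# Same placeholder proxy address as in the sample; no connection is made here.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# The mapping passed in is kept on the handler.
print(proxy_handler.proxies)

# build_opener() adds its default handlers, but uses our ProxyHandler
# in place of the default one.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Only opener.open(...) uses this configuration. To make plain urlopen calls use it as well, the opener can be registered with urllib.request.install_opener(opener).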
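What urllib provides is the ability to perform all kinds of HTTP requests from a program. To imitate a browser carrying out a specific task, the request has to be disguised as a browser request. The way to do this is to first observe the requests the browser sends, then build the disguise from the browser's request headers. The User-Agent header is the one that identifies the browser.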
Use urllib to read JSON, then parse the JSON into a Python object:
# -*- coding: utf-8 -*-
from urllib import request

def fetch_data(url):
    return ''  # placeholder: fetch url and parse the JSON here

# Test
URL = 'https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json'
data = fetch_data(URL)
print(data)
assert data['query']['results']['channel']['location']['city'] == 'Beijing'
print('ok')
With the placeholder in place, fetch_data returns an empty string, so the test fails:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-22bde09eb578> in <module>
9 data = fetch_data(URL)
10 print(data)
---> 11 assert data['query']['results']['channel']['location']['city'] == 'Beijing'
12 print('ok')
TypeError: string indices must be integers
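One possible way to fill in fetch_data is to read the response body and hand it to json.loads. The sketch below is not tied to the remote service: it assumes a throwaway local HTTP server (the hypothetical JSONHandler class) that returns a small made-up JSON document, so that the fetch-and-parse step can be run on its own:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


def fetch_data(url):
    """Fetch url and parse its UTF-8 encoded JSON body into a Python object."""
    with request.urlopen(url) as f:
        return json.loads(f.read().decode('utf-8'))


class JSONHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server that always returns a fixed JSON document."""

    def do_GET(self):
        # Made-up sample data, much smaller than a real API response.
        body = json.dumps({'location': {'city': 'Beijing'}}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the per-request log line


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), JSONHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

data = fetch_data('http://127.0.0.1:%d/' % server.server_port)
server.shutdown()
server.server_close()

print(data['location']['city'])  # Beijing
```

Because json.loads returns ordinary dicts and lists, the nested lookup in the test (data['query']['results'] and so on) works the same way once fetch_data returns parsed data instead of a string.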
End.