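59 Common Built-in Modules: urllib

urllib provides a set of functions for working with URLs.

Get

The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.

For example, fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response:

The output shows the HTTP response headers followed by the JSON data: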

from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))
Status: 200 OK
Date: Sat, 26 Jan 2019 11:59:44 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2138
Connection: close
Vary: Accept-Encoding
X-Ratelimit-Remaining2: 99
X-Ratelimit-Limit2: 100
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=BdhRHcX_CIE; Expires=Sun, 26-Jan-20 11:59:44 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: BdhRHcX_CIE
X-DAE-Node: anson70
X-DAE-App: book
Server: dae
X-Frame-Options: SAMEORIGIN
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰"],"pubdate":"2007","tags":[{"count":23,"name":"spring","title":"spring"},{"count":14,"name":"Java","title":"Java"},{"count":6,"name":"javaee","title":"javaee"},{"count":5,"name":"j2ee","title":"j2ee"},{"count":4,"name":"计算机","title":"计算机"},{"count":3,"name":"藏书","title":"藏书"},{"count":3,"name":"编程","title":"编程"},{"count":3,"name":"POJO","title":"POJO"}],"origin_title":"","image":"https://img3.doubanio.com\/view\/subject\/m\/public\/s2552283.jpg","binding":"平装","translator":[],"catalog":"","pages":"509","images":{"small":"https://img3.doubanio.com\/view\/subject\/s\/public\/s2552283.jpg","large":"https://img3.doubanio.com\/view\/subject\/l\/public\/s2552283.jpg","medium":"https://img3.doubanio.com\/view\/subject\/m\/public\/s2552283.jpg"},"alt":"https:\/\/book.douban.com\/subject\/2129650\/","id":"2129650","publisher":"电子工业出版社","isbn10":"7121042622","isbn13":"9787121042621","title":"Spring 2.0核心技术与最佳实践","url":"https:\/\/api.douban.com\/v2\/book\/2129650","alt_title":"","author_intro":"","summary":"本书注重实践而又深入理论,由浅入深且详细介绍了Spring 2.0框架的几乎全部的内容,并重点突出2.0版本的新特性。本书将为读者展示如何应用Spring 2.0框架创建灵活高效的JavaEE应用,并提供了一个真正可直接部署的完整的Web应用程序——Live在线书店(http:\/\/www.livebookstore.net)。\n在介绍Spring框架的同时,本书还介绍了与Spring相关的大量第三方框架,涉及领域全面,实用性强。本书另一大特色是实用性强,易于上手,以实际项目为出发点,介绍项目开发中应遵循的最佳开发模式。\n本书还介绍了大量实践性极强的例子,并给出了完整的配置步骤,几乎覆盖了Spring 2.0版本的新特性。\n本书适合有一定Java基础的读者,对JavaEE开发人员特别有帮助。本书既可以作为Spring 2.0的学习指南,也可以作为实际项目开发的参考手册。","price":"59.8"}
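Because the body is JSON, it can be turned into Python objects with the standard json module. The sketch below is an illustration rather than part of the original example: the helper name summarize_book is hypothetical, and the sample bytes are a shortened copy of a few fields from the output above, so it runs without network access.

```python
import json

def summarize_book(raw):
    """Decode a UTF-8 JSON body (bytes) and pick out a few fields."""
    book = json.loads(raw.decode('utf-8'))
    return {
        'title': book['title'],
        'authors': book['author'],       # the response holds a list of names
        'price': float(book['price']),   # the price arrives as a string
    }

# Shortened sample mirroring the response printed above (assumption: the
# real body uses the same field names).
sample = '{"title": "Spring 2.0核心技术与最佳实践", "author": ["廖雪峰"], "price": "59.8"}'.encode('utf-8')
print(summarize_book(sample))
```

In a real script, the bytes returned by f.read() would be passed to the helper in place of the sample.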

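To make a GET request look as if it came from a browser, use a Request object. By adding HTTP headers to the Request, we can disguise the request as one sent by a browser. For example, to request the Douban home page as an iPhone 6:

Douban then returns the mobile version of the page, suited to the iPhone: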

from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
Status: 200 OK
Date: Sat, 26 Jan 2019 11:59:46 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
X-Xss-Protection: 1; mode=block
X-Douban-Mobileapp: 0
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: talionnav_show_app="0"
Set-Cookie: bid=fHdEEseIE8o; Expires=Sun, 26-Jan-20 11:59:46 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: fHdEEseIE8o
X-DAE-Node: anson60
X-DAE-App: talion
Server: dae
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=15552000;
X-Content-Type-Options: nosniff
Data: 


<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/WebPage" class="ua-safari ua-mobile ">
    <head>
        <meta charset="UTF-8">
        <title>豆瓣(手机版)</title>
        <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
        <meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
        <meta name="format-detection" content="telephone=no">
        <link rel="canonical" href="
http://m.douban.com/">
        <link href="https://img3.doubanio.com/f/talion/95a2c9ca9b251b8c2e810bc80afd4e9c7c37b392/css/card/base.css" rel="stylesheet">
        
    <meta name="description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。">
    <meta name="keywords" content="豆瓣,手机豆瓣,豆瓣手机版,豆瓣电影,豆瓣读书,豆瓣同城">
    
    

    <!-- Schema.org markup for Google+ -->
    <meta itemprop="name" content="豆瓣">
    <meta itemprop="description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。">
    <meta itemprop="image" content="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png">
    <!-- Twitter meta -->
    <meta name="twitter:card" content="summary" />
    <!-- Open Graph meta -->
    <meta property="og:title" content="豆瓣" />
    <meta property="og:description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。" />
    <meta property="og:site_name" content="豆瓣(手机版)" />
    <meta property="og:url" content="https://m.douban.com/" />
    <meta property="og:image" content="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png" />
    <meta property="og:image:type" content="image/png" />
    <meta property="og:image:width" content="300" />
    <meta property="og:image:height" content="300" />
    <meta property="og:type" content="article" />
    <!-- Wechat meta -->
    <meta property="weixin:timeline_title" content="豆瓣" />
    <meta property="weixin:chat_title" content="豆瓣" />
    <meta property="weixin:description" content="读书、看电影、涨知识、学穿搭...,加入兴趣小组,获得达人们的高质量生活经验,找到有相同爱好的小伙伴。" />
    <meta property="weixin:image" content="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png" />
    <script>
    ;(function () {
        window.setMeta = function (name, val) {
          var meta = document.querySelectorAll('meta[property="' + name + '"], meta[name="' + name + '"]')
          if (!meta.length) {
            meta = document.createElement('meta')
            meta.name = name
            document.head.appendChild(meta)
            meta = [meta]
          }
         meta[0].content = val || ''
        }
        window.getMeta = function (name) {
          var meta = document.querySelectorAll('meta[property="' + name + '"], meta[name="' + name + '"]')
          if (!meta.length) {
            return ''
          } else {
            return meta[0].content
          }
        }
        !getMeta('weixin:chat_title') && setMeta('weixin:chat_title', document.title)
        !getMeta('weixin:timeline_title') && setMeta('weixin:timeline_title', document.title)
        !getMeta('weixin:description') && setMeta('weixin:description', getMeta('og:description'))
    })();
    </script>


        <link rel="stylesheet" href="https://img3.doubanio.com/misc/mixed_static/468e9211d89502ad.css">
        <link rel="icon" type="image/png" sizes="16x16" href="https://img3.doubanio.com/f/talion/c970bb0d720963a7392f7dd6c77068bb9925caaf/pics/icon/dou16.png">
        <link rel="icon" type="image/png" sizes="32x32" href="https://img3.doubanio.com/f/talion/2f3c0bc0f35b031d4535fd993ae3936f4e40e6c8/pics/icon/dou32.png">
        <link rel="icon" type="image/png" sizes="48x48" href="https://img3.doubanio.com/f/talion/10a4a913a5715f628e4b598f7f9f2c18621bdcb3/pics/icon/dou48.png">
        <!-- iOS touch icon -->
        <link rel="apple-touch-icon-precomposed" href="https://img3.doubanio.com/f/talion/997f2018d82979da970030a5eb84c77f0123ae5f/pics/icon/m_logo_76.png">
        <link rel="apple-touch-icon-precomposed" sizes="76x76" href="https://img3.doubanio.com/f/talion/997f2018d82979da970030a5eb84c77f0123ae5f/pics/icon/m_logo_76.png">
        <link rel="apple-touch-icon-precomposed" sizes="120x120" href="https://img3.doubanio.com/f/talion/18932a3e71a60ed7150ca2ca7ebf21ddadd7092e/pics/icon/m_logo_120.png">
        <link rel="apple-touch-icon-precomposed" sizes="152x152" href="https://img3.doubanio.com/f/talion/b99497ff8538c54b9ba6f40867da932396ab2562/pics/icon/m_logo_152.png">
        <link rel="apple-touch-icon-precomposed" sizes="167x167" href="https://img3.doubanio.com/f/talion/0c233ada957a95e632f81607e30230d16e8293e8/pics/icon/m_logo_167.png">
        <link rel="apple-touch-icon-precomposed" sizes="180x180" href="https://img3.doubanio.com/f/talion/8e7b9cbd097c02972c4191aa03fdb084524505c4/pics/icon/m_logo_180.png">
        <link rel="apple-touch-icon-precomposed" sizes="200x200" href="https://img3.doubanio.com/f/talion/7c6364aadf368dc0210173c940cfd0f64ceddc66/pics/icon/m_logo_200.png">
        <!-- For Android -->
        <link rel="icon" sizes="128x128" href="https://img3.doubanio.com/f/talion/b99497ff8538c54b9ba6f40867da932396ab2562/pics/icon/m_logo_152.png">
        <link rel="icon" sizes="192x192" href="https://img3.doubanio.com/f/talion/7c6364aadf368dc0210173c940cfd0f64ceddc66/pics/icon/m_logo_200.png">
        <!-- For Web App Manifest -->
        
  
  
  <link rel="manifest" href="/pwa/manifest?path=/&short_name=%E8%B1%86%E7%93%A3%28%E6%89%8B%E6%9C%BA%E7%89%88%29&name=%E8%B1%86%E7%93%A3%28%E6%89%8B%E6%9C%BA%E7%89%88%29">
  <meta name="theme-color" content="#42bd56">


        <link type="application/opensearchdescription+xml" rel="search" href="/opensearch"/>
            <!-- hm baidu -->
            <script type="text/javascript">
            var _hmt = _hmt || [];
            (function() {
              var hm = document.createElement("script");
              hm.src = "https://hm.baidu.com/hm.js?6d4a8cfea88fa457c3127e14fb5fabc2";
              var s = document.getElementsByTagName("script")[0];
              s.parentNode.insertBefore(hm, s);
            })();
            _hmt.logTruncate = function(type) {
                _hmt.push(['_trackEvent', 'article', 'truncate', type, 1]);
            }
            </script>
    </head>
    <body ontouchstart="">
        
        

        
    
    
        <div id="TalionNav"><header class="TalionNav"><div class="TalionNav-primary"><a href="/"><h1>豆瓣</h1></a><nav><ul><li><a href="/movie" style="color: #2384E8;">电影</a></li><li><a href="/book" style="color: #9F7860;">图书</a></li><li><a href="/status" style="color: #E4A813;">广播</a></li><li><a href="/group" style="color: #2AB8CC;">小组</a></li></ul><span class=""></span></nav></div><div class="TalionNav-secondary"><a class="close-nav" href="javascript:;">关闭</a><form action="/search" method="GET"><div><input name="query" type="search"></div></form><ul><li><div><a href="/movie" target="_blank"><strong style="color: #2384E8;">电影</strong><span>影院热映</span></a><a href="https://douban.com/location" target=""><strong style="color: #E6467E;">同城</strong><span>周末活动</span></a><a href="https://read.douban.com" target=""><strong style="color: #9F7860;">阅读</strong><span>电子书</span></a><a href="/status" target="_blank"><strong style="color: #E1644D;">广播</strong><span>友邻动态</span></a></div></li><li><div><a href="/tv" target="_blank"><strong style="color: #7A6ADB;">电视</strong><span>正在热播</span></a><a href="/group" target="_blank"><strong style="color: #2AB8CC;">小组</strong><span>志趣相投</span></a><a href="/game" target="_blank"><strong style="color: #5774C5;">游戏</strong><span>虚拟世界</span></a><a href="https://douban.fm" target=""><strong style="color: #40CFA9;">FM</strong><span>红心歌单</span></a></div></li><li><div><a href="/book" target="_blank"><strong style="color: #9F7860;">图书</strong><span>畅销排行</span></a><a href="/music" target="_blank"><strong style="color: #F48F2E;">音乐</strong><span>新碟榜</span></a><a href="/mobileapp" target="_blank"><strong style="color: #596CDD;">应用</strong><span>玩手机</span></a><a href="https://market.douban.com/?utm_campaign=mobile_web_douban_nav&amp;utm_source=douban&amp;utm_medium=mobile_web" target=""><strong style="color: #42BD56;">豆品</strong><span>生活美学</span></a></div></li></ul><div class="navBottom"><div class="nav-item"><a class="toUser" href="/mine/">我的豆瓣</a><a 
class="toExit" href="https://accounts.douban.com/logout?ck=undefined&amp;redir=http://accounts.douban.com/passport/login">退出豆瓣</a></div><div class="nav-item"><a class="toPC" href="/to_pc/?url=about%3Ablank">使用桌面版</a><a class="toApp">使用豆瓣App</a></div></div></div></header></div>


        <div class="page">
            
    <div class="card">
        <ul class="quick-nav">
            <li>
                <a href="/movie/nowintheater?loc_id=108288">影院热映</a>
            </li>
              <li>
                  <a href="/music/newwestern/">欧美新碟榜</a>
              </li>
            <li>
                <a id="hot-topics" href="https://m.douban.com/time/?dt_time_source=douban-msite_shortcut">豆瓣时间</a>
            </li>
            <li>
                <a href="https://www.douban.com/doubanapp/app?channel=card_home&direct_dl=1">使用豆瓣App</a>
            </li>
        </ul>
        <section id="recommend-feed"></section>
    </div>

        </div>

        <script src="https://img3.doubanio.com/f/talion/ee8e0c54293aefb5709ececbdf082f8091ad5e49/js/card/zepto.min.js"></script>
        <script src="https://img3.doubanio.com/f/talion/c453219f84a3c1d4f986fdcaf6b34c03bc913c6f/js/card/main.js"></script>
        <script src="https://img3.doubanio.com/f/talion/f53cc45d4a16969b8592d776f476d9784a283e4a/js/lib/douban-ad-helper.js"></script>



        
    
  


        
  

  
        <script src="https://img3.doubanio.com/f/talion/88fc2b21c8dda5c93aa4c011eb15b74f8850978f/js/lib/react/15.3.0/react-all.min.js"></script>


  <script type="text/javascript">
    var userCfg = {}
  </script>

  

  
  



        <script type="text/javascript" src="https://img3.doubanio.com/misc/mixed_static/27e49996655f37b9.js"></script>
        


<script type="text/javascript" data-mobile="true">
    (function (global) {
        var newNode = global.document.createElement('script'),
            existingNode = global.document.getElementsByTagName('script')[0],
            adSource = '//erebor.douban.com/',
            userId = '',
            browserId = 'fHdEEseIE8o',
            criteria = '3:/',
            preview = '',
            debug = false;

        global.DoubanAdRequest = {src: adSource, uid: userId, bid: browserId, crtr: criteria, prv: preview, debug: debug};

        newNode.setAttribute('type', 'text/javascript');
        newNode.setAttribute('src', 'https://img3.doubanio.com/f/adjs/dd37385211bc8deb01376096bfa14d2c0436a98c/ad.release.js');
        newNode.setAttribute('async', true);
        existingNode.parentNode.insertBefore(newNode, existingNode);
    })(this);
</script>


        
  


        <script type='text/javascript'>
            
            ;(function(global) {
                if (window.DoubanAdRequest) {
                    window.DoubanAdRequest.filter = []
                }
                global.DoubanAdSlots = global.DoubanAdSlots || []
            })(window);
        </script>
            <!-- Google Tag Manager -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-NZHN7H" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-NZHN7H');</script>
<!-- End Google Tag Manager -->
<!-- Google Analytics -->
<script>
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-53594431-3', {'sampleRate': 4});
ga('send', 'pageview');
</script>
<script async src='//www.google-analytics.com/analytics.js'></script>
<!-- End Google Analytics -->

        






        <script type='text/javascript'>
        (function(){
            var site_list = window.white_site_list || new RegExp ([
                '^https?://(.+\\.douban\\.com',
                '|web[0-9]?\\.qq\\.com',
                '|hao\\.qq\\.com',
                '|(hao\\.)*360\\.cn',
                '|so\\.com',
                '|www\\.soso\\.com',
                '|(www\\.)?growingio\\.com',
                '|m-douban-com\\.mipcdn\\.com',
                '|.+\\.baidu\\.com',
                ')(\\:[\\d]+)?/'
            ].join(''));
            if (self !== top && document.referrer.search(site_list) === -1) {
                top.location = self.location;
            }
        })();
        </script>
    </body>
</html>
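A Request object can be inspected before anything is sent, which is a convenient way to check which headers will go out. A minimal sketch using only the standard library; the shortened User-Agent string and the variable names are illustrative and not taken from the example above.

```python
from urllib import request

# Headers can also be passed as a dict when the Request is created.
ua = 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)'  # illustrative value
req = request.Request('http://www.douban.com/', headers={'User-Agent': ua})

# urllib stores header names capitalized, so the key becomes 'User-agent'.
print(req.get_method())              # GET, because no data is attached
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
```

Nothing is transmitted here; the request is only sent once it is passed to request.urlopen().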

Post

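To send a request with POST, simply pass the data parameter as bytes.

Here we simulate a Weibo login: first read the login email and password, then pass them in the username=xxx&password=xxx encoding expected by the weibo.cn login page: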

from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
Login to weibo.cn...
Email: c@c.com
Password: 123
Status: 200 OK
Server: nginx/1.6.1
Date: Sat, 26 Jan 2019 12:00:44 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
DPOOL_HEADER: dryad65
SINA-LB: aGEuMjM1LmcxLnF4Zy5sYi5zaW5hbm9kZS5jb20=
SINA-TS: YjBjYTk0Y2UgMCAwIDAgNyA1OTIK
Data: {"retcode":50011002,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"c@c.com","errline":655}}
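To see exactly what would be sent, the body can be built and examined without contacting any server. This sketch uses made-up credentials and a placeholder URL (https://example.com/login is an assumption, not the weibo endpoint). It shows that urlencode escapes special characters and that attaching data switches the request method to POST.

```python
from urllib import parse, request

# Made-up form fields; urlencode percent-escapes '@' and turns spaces into '+'.
form = parse.urlencode([('username', 'user@example.com'), ('password', 'p w')])
print(form)  # username=user%40example.com&password=p+w

body = form.encode('utf-8')  # the data argument must be bytes, not str

# Placeholder URL: the Request is only constructed, nothing is sent.
req = request.Request('https://example.com/login', data=body)
print(req.get_method())  # POST
```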

Handler

For finer control, such as accessing a site through a proxy, use ProxyHandler, as in the following sample code:

import urllib.request

# Placeholder proxy address and credentials; replace them with real values.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
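An opener is essentially a chain of handlers, and it can be examined before any request is made. The following sketch assumes a placeholder proxy address (127.0.0.1:3128 is hypothetical) and only inspects the opener; it does not open a connection.

```python
from urllib import request

# Placeholder proxy address; replace with a real proxy before opening URLs.
proxy_handler = request.ProxyHandler({'http': 'http://127.0.0.1:3128/'})
opener = request.build_opener(proxy_handler)

# build_opener adds the default handlers but uses our ProxyHandler in place
# of the default one.
proxy_handlers = [h for h in opener.handlers if isinstance(h, request.ProxyHandler)]
print(len(proxy_handlers))
print(proxy_handlers[0].proxies)

# Optional: make this opener the one used by request.urlopen().
# request.install_opener(opener)
```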

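Summary

urllib lets a program perform all kinds of HTTP requests. To imitate a browser for a particular task, the request has to be disguised as a browser request. The approach is to first observe the requests the browser sends and then copy its request headers; the User-Agent header is the one that identifies the browser.

Exercise

Use urllib to read JSON, then parse the JSON into a Python object: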

# -*- coding: utf-8 -*-
from urllib import request

def fetch_data(url):
    return ''

# Test
URL = 'https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json'
data = fetch_data(URL)
print(data)
assert data['query']['results']['channel']['location']['city'] == 'Beijing'
print('ok')
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-9-22bde09eb578> in <module>
      9 data = fetch_data(URL)
     10 print(data)
---> 11 assert data['query']['results']['channel']['location']['city'] == 'Beijing'
     12 print('ok')


TypeError: string indices must be integers
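The TypeError above occurs because the stub returns an empty string, so indexing it with 'query' fails. One possible way to complete the exercise is sketched below; it is a sketch, not the reference answer. It assumes the response body is UTF-8 encoded JSON, and it is demonstrated on a data: URL (which urlopen handles locally) rather than the Yahoo endpoint, whose availability is not guaranteed.

```python
import json
from urllib import parse, request

def fetch_data(url):
    """Fetch url and return the JSON body parsed into Python objects."""
    with request.urlopen(url) as f:
        return json.loads(f.read().decode('utf-8'))

# Offline demonstration: a data: URL carrying a small JSON document shaped
# like the structure the exercise's assert expects (made-up sample).
sample = {'query': {'results': {'channel': {'location': {'city': 'Beijing'}}}}}
url = 'data:application/json,' + parse.quote(json.dumps(sample))
data = fetch_data(url)
print(data['query']['results']['channel']['location']['city'])
```

The same fetch_data can be pointed at any HTTP endpoint that returns JSON; network errors and non-JSON bodies are not handled in this sketch.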