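These are my notes on the book 《Python3 网络爬虫开发实战》. Everything here comes from my own tests plus articles I found online, and I have filled in a number of cases that the book does not test thoroughly. Fair warning: I am a beginner.

One of Python's strengths is the range of full-featured libraries available for making these requests. The most basic HTTP libraries include urllib, httplib2, requests, and treq.

urllib has 4 modules: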

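| urllib module | Purpose |
| --- | --- |
| `request` | Simulates sending requests |
| `error` | Defines the exceptions raised when a request fails, so that we can catch them |
| `parse` | A utility module with many URL-handling functions |
| `robotparser` | Reads a site's robots.txt file and tells us what may be crawled |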

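Let's see where the library's files are.

On my machine the path below (typeshed stubs bundled with jedi) lists the modules; the implementation itself sits in the standard library directory, e.g. /usr/lib/python3.9/urllib/ as seen in the traceback in the timeout section:

/usr/lib/python3/dist-packages/jedi/third_party/typeshed/stdlib/3/urllib/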

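The `request` module: sending requests

| request module member | Purpose |
| --- | --- |
| `urlopen()` function | Sends a simple request |
| `Request` class | Builds a complete request (headers, data, method) |
| `BaseHandler` class | Base class of all handlers; its subclasses take care of things such as proxies and cookies |
| `OpenerDirector` class | Combines handlers for more advanced, lower-level control |

The request module is the file request.py inside the urllib package directory.

Sending a request with the urlopen() function

urlopen() can only build a simple request.

If you open request.py you can see the urlopen() function defined there.

GET request

Called with only a URL, this function sends a GET request: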

import urllib.request

# GET request
response = urllib.request.urlopen('https://www.python.org')

# read() returns the response body as bytes; decode() turns the bytes into a str
print(response.read().decode('utf-8'))

Let's analyse the code above.

Use type() to check the type of the response variable:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

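The output is <class 'http.client.HTTPResponse'>; this type is covered in the HTTPResponse section below.

read() returns the content of the page as bytes.

decode() converts those bytes to a str using the given encoding (UTF-8 here).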
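
As a small follow-up on read() and decode(), the sketch below wraps the two calls in a helper that takes the encoding from the response headers when one is declared and falls back to UTF-8 otherwise. The helper name `read_text` and the `data:` URL are my own assumptions, not something from the book; the `data:` URL is used only so that the example runs without network access.

```python
from urllib.request import urlopen


def read_text(url, timeout=10):
    """Fetch url and return the body as str (hypothetical helper)."""
    # The with-statement closes the response when the block ends.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Use the charset declared in Content-Type, defaulting to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
    return raw.decode(charset)


# Offline demo: a data: URL whose payload is the UTF-8 bytes of '你好'.
text = read_text('data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD')
print(text)
```

With a real site you would pass an http:// or https:// URL instead; the decoding logic stays the same.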

For example, the status attribute holds the HTTP status code of the response.

Code:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)

POST request

The data parameter of urlopen()

The data parameter is optional. When it is supplied, the request is sent as a POST.

Test:

import urllib.parse   # utility module with many URL-handling functions
import urllib.request

# urlencode() returns a str, so bytes() converts it to the bytes object that data requires
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
# read() returns the response body as bytes; decode() turns the bytes into a str
print(response.read().decode('utf-8'))

Result:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.9", 
    "X-Amzn-Trace-Id": "Root=1-60192088-7cac43ff16b0ff6376a5772d"
  }, 
  "json": null, 
  "origin": "39.149.143.45", 
  "url": "http://httpbin.org/post"
}

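Walking through the code above

Line 1 is import urllib.parse.

It imports the parse module of urllib, a utility module with many URL-handling functions.
Line 5 is data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

bytes() produces a bytes object (a byte string), whereas an ordinary string is a str object.

encoding names the encoding used for the conversion.

A quick test:

>>> a = '你好'
>>> bytes(a, encoding='utf-8')
b'\xe4\xbd\xa0\xe5\xa5\xbd'

The first argument passed to bytes(), urllib.parse.urlencode({'word': 'hello'}), turns the key-value pair into the string word=hello.

Take a look:

>>> urllib.parse.urlencode({'word': 'hello'})
'word=hello'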
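
To tie the pieces together, here is a minimal sketch showing that str.encode() gives the same bytes as the bytes() call, and that supplying data is what turns a request into a POST. It only builds Request objects and inspects them, so nothing is sent; the variable names are illustrative.

```python
import urllib.parse
import urllib.request

form = {'word': 'hello'}
encoded = urllib.parse.urlencode(form)   # 'word=hello' (str)
data = encoded.encode('utf-8')           # same bytes as bytes(encoded, encoding='utf8')

get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

# get_method() reports the HTTP method urllib would use for each request.
print(get_req.get_method(), post_req.get_method())
```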

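Setting a request timeout

The timeout parameter of urlopen()

If no response arrives within the given number of seconds, an exception is raised.

The code below sets it to 0.1 seconds:

import urllib.request

# GET request with a timeout of 0.1 seconds
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)

# read() returns the response body as bytes
print(response.read())

Result: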

Traceback (most recent call last):
  File "/home/zss/杂东西/a.py", line 3, in <module>
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
  File "/usr/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/usr/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.9/urllib/request.py", line 1375, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error timed out>

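We can catch the exception instead.

Code:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # URLError covers other failures too, so check that the cause really is a timeout
    if isinstance(e.reason, socket.timeout):
        print('Timed out!')

Result:

Timed out!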
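
One detail worth knowing, as far as I can tell from CPython's behaviour: a timeout while connecting is wrapped in URLError, but a timeout while waiting for the response can be raised as a bare socket.timeout. The sketch below handles both. The helper name `classify_fetch` and its `opener` parameter are hypothetical, added only to make the logic easy to test.

```python
import socket
import urllib.error
import urllib.request


def classify_fetch(url, timeout, opener=None):
    """Return 'ok', 'timeout' or 'error' for one GET request (hypothetical helper)."""
    if opener is None:
        opener = urllib.request.build_opener()
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return 'ok'
    except urllib.error.URLError as e:
        # Failures while connecting or sending are wrapped in URLError;
        # the underlying cause is available as e.reason.
        return 'timeout' if isinstance(e.reason, socket.timeout) else 'error'
    except socket.timeout:
        # A timeout while waiting for or reading the response can surface unwrapped.
        return 'timeout'

# Example call (needs network access):
#   print(classify_fetch('http://httpbin.org/get', timeout=0.1))
```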

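Other parameters

context: must be an ssl.SSLContext object; it sets the SSL/TLS options used for HTTPS requests.

cafile and capath: specify a CA certificate file and a directory of CA certificates respectively; they are used when requesting HTTPS links. Both have been deprecated since Python 3.6 in favour of context.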
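
A minimal sketch of the context parameter, assuming the default certificate settings are what you want. The helper name `open_https` is hypothetical. ssl.create_default_context() also accepts cafile and capath arguments if a custom CA bundle is needed.

```python
import ssl
import urllib.request

# Default context: verifies certificates and host names against the system CA store.
context = ssl.create_default_context()


def open_https(url, timeout=10):
    """Open an HTTPS URL with the explicit SSL context (hypothetical helper)."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)

# Example call (needs network access):
#   with open_https('https://www.python.org') as response:
#       print(response.status)
```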

`HTTPResponse` objects

Use dir() to list all of its attributes and methods:

from urllib.request import urlopen

response = urlopen('https://www.python.org')
print(dir(response))

Result:

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines']
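
Most of the names in that list are internal. The sketch below exercises the handful that are used most often. To stay offline it parses a canned response through http.client.HTTPResponse using a stand-in socket; `FakeSocket` and the response text are made up for the demonstration and are not part of urllib.

```python
import http.client
import io


class FakeSocket:
    """Stand-in for a socket: HTTPResponse only needs makefile() (demo assumption)."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain; charset=utf-8\r\n'
       b'Content-Length: 5\r\n'
       b'\r\n'
       b'hello')

response = http.client.HTTPResponse(FakeSocket(raw))
response.begin()  # parse the status line and headers

status = response.status                           # numeric status code
reason = response.reason                           # reason phrase
content_type = response.getheader('Content-Type')  # a single header
all_headers = response.getheaders()                # list of (name, value) pairs
body = response.read().decode('utf-8')             # body as str

print(status, reason, content_type, body)
```

The object returned by urlopen() for an HTTP URL offers the same attributes and methods.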

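Building a request with the Request class

If you open request.py you can see the Request class defined there.

urlopen() on its own can only build a simple request; the Request class can build a complete one.

Take the earlier POST example again:

import urllib.parse   # utility module with many URL-handling functions
import urllib.request

# urlencode() returns a str, so bytes() converts it to the bytes object that data requires
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
# read() returns the response body as bytes; decode() turns the bytes into a str
print(response.read().decode('utf-8'))

Result:

In the output below, the User-Agent field is Python-urllib/3.9 rather than a browser's. With the Request class we can set it ourselves.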

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.9", 
    "X-Amzn-Trace-Id": "Root=1-601a1f43-352ed18d0b631c951537de2b"
  }, 
  "json": null, 
  "origin": "39.149.143.45", 
  "url": "http://httpbin.org/post"
}

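Parameters of the Request class

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The parameters are:

| Parameter | Purpose |
| --- | --- |
| url | The URL to request. This is the only required parameter; all others are optional |
| data | Must be of type `bytes`; this is the body of a POST request |
| headers | A dict holding the request headers |
| origin_req_host | The host name or IP address of the requester |
| unverifiable | Whether the request is unverifiable, meaning the user had no chance to approve it (for example an image fetched automatically while loading a page). Defaults to False |
| method | The HTTP method to use, such as GET, POST, or PUT |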
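
The parameters can be checked without sending anything, because a Request object simply stores them. A small sketch follows; the URL and header values are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

req = Request(
    url='http://httpbin.org/put',
    data=urlencode({'name': 'Germey'}).encode('utf-8'),
    headers={'User-Agent': 'Mozilla/5.0 (demo)'},
    method='PUT',
)
# Headers can also be added after construction.
req.add_header('Accept', 'application/json')

print(req.full_url)        # the target URL
print(req.get_method())    # the explicit method wins over the GET/POST default
print(req.data)            # the bytes that would be sent as the body
print(req.header_items())  # header names are stored capitalised, e.g. 'User-agent'
```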

Test: building a customised request

import urllib.parse   # utility module with many URL-handling functions
import urllib.request

# Target URL
url = 'http://httpbin.org/post'

# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Host': 'httpbin.org'
}

# POST data as a dict (named form_data so it does not shadow the built-in dict)
form_data = {
    'name': 'Germey'
}

# urlencode() returns a str, so bytes() converts it to the bytes object that data requires
data = bytes(urllib.parse.urlencode(form_data), encoding='utf8')

# Build the request
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')

# Send the POST request
response = urllib.request.urlopen(req)

# read() returns the response body as bytes; decode() turns the bytes into a str
print(response.read().decode('utf-8'))

Result:

As you can see, the User-Agent and Host request headers now carry the values I specified.

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-601a2743-7648e6e64bf3d1423959aba7"
  }, 
  "json": null, 
  "origin": "39.149.143.45", 
  "url": "http://httpbin.org/post"
}

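Advanced usage

For more advanced operations (such as cookie handling and proxy settings), use the classes of the request module described below.

`BaseHandler` class

If you open request.py you can see the BaseHandler class defined there.

BaseHandler in urllib.request is the parent class of all other handlers. It defines the basic handler interface, for example default_open() and protocol_request().

Examples of handlers and helper classes:

| Class | Purpose |
| --- | --- |
| `HTTPDefaultErrorHandler` | Handles HTTP error responses; errors are raised as `HTTPError` exceptions |
| `HTTPRedirectHandler` | Handles redirects |
| `HTTPCookieProcessor` | Handles cookies |
| `ProxyHandler` | Sets proxies; by default no proxy is set |
| `HTTPPasswordMgr` | Manages passwords; it maintains a table of user names and passwords |
| `HTTPBasicAuthHandler` | Manages authentication; use it when opening a link requires authentication |

The rest are listed at https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler

`OpenerDirector` class

If you open request.py you can see the OpenerDirector class defined there.

OpenerDirector is usually just called an Opener.

The Request class and the urlopen() function used above wrap the common request steps so that basic operations work out of the box.

For more advanced, lower-level functionality, use an Opener.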
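
To make the relationship between handlers and an Opener concrete, here is a minimal sketch. `DemoHeaderHandler` is a hypothetical handler written for this note; it relies on the documented convention that a method named http_request is called on every outgoing HTTP request before it is sent.

```python
from urllib.request import BaseHandler, Request, build_opener


class DemoHeaderHandler(BaseHandler):
    """Hypothetical handler: adds an X-Demo header to outgoing HTTP(S) requests."""

    def http_request(self, req):
        req.add_header('X-Demo', '1')
        return req

    https_request = http_request


handler = DemoHeaderHandler()
opener = build_opener(handler)

# build_opener() keeps its default handlers and adds ours alongside them.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# Calling the hook directly shows what the opener would do to a request.
req = handler.http_request(Request('http://httpbin.org/get'))
print(req.header_items())
```

Requests sent with opener.open() would pass through the handler; urlopen() is unaffected unless the opener is installed with urllib.request.install_opener().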

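Login authentication

I do not have a test environment for this, so I did not run it; the code is copied from the book.

The authentication code: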

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

What the classes above do is described in detail in the official documentation: https://docs.python.org/3/library/urllib.request.html#basehandler-objects
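
The password manager part can be tried without a server. The sketch below only registers credentials and looks them up again; the realm name and the second URL are made up for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm under this URL.
manager.add_password(None, 'http://localhost:5000/', 'username', 'password')

# URLs below the registered one match, whatever realm the server announces.
inside = manager.find_user_password('Some Realm', 'http://localhost:5000/admin')
# A different host does not match, so no credentials are returned.
outside = manager.find_user_password('Some Realm', 'http://example.com/')

print(inside, outside)
```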

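Adding a proxy

from urllib.error import URLError

# Import ProxyHandler and build_opener from the request module
from urllib.request import ProxyHandler, build_opener

# Proxy settings: a dict mapping URL scheme to proxy URL
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8889',
    # The request below uses HTTPS, so it only goes through the proxy if an
    # 'https' entry exists; reusing the same local proxy here is an assumption
    'https': 'http://127.0.0.1:8889',
})
# Build an opener that uses the proxy handler
opener = build_opener(proxy_handler)
try:
    # Send the request
    response = opener.open('https://www.baidu.com/')
    # Print the response
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

What the code above does

ProxyHandler routes requests through the given proxy for each URL scheme listed in its dict.

build_opener() returns an opener that includes many default handlers in addition to the ones passed in.

As for the result: I ran this through my own proxy service.

What the classes above do is described in detail in the official documentation: https://docs.python.org/3/library/urllib.request.html#basehandler-objects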
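
A proxy entry only applies to the URL scheme it is keyed on, which matters because the request in this section goes to an https:// address. The sketch below checks this offline. It relies on a CPython implementation detail (ProxyHandler creates one `<scheme>_open` method per dict key), so treat it as an illustration rather than a documented API; the proxy address is the same placeholder as in this section.

```python
from urllib.request import ProxyHandler

# Only the 'http' scheme is mapped here.
http_only = ProxyHandler({'http': 'http://127.0.0.1:8889'})
# Both schemes mapped to the same (assumed) local proxy.
both = ProxyHandler({
    'http': 'http://127.0.0.1:8889',
    'https': 'http://127.0.0.1:8889',
})

# The presence of https_open tells us whether HTTPS requests would be proxied.
print(hasattr(http_only, 'https_open'))
print(hasattr(both, 'https_open'))
```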

Cookies

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
# Create the opener object
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Result:

BAIDUID=2E65A683F8A8BA3DF521469DF8EFF1E1:FG=1
BIDUPSID=2E65A683F8A8BA3DF521469DF8EFF1E1
H_PS_PSSID=20987_1421_18282_17949_21122_17001_21227_21189_21161_20927
PSTM=1474900615
BDSVRTM=0
BD_HOME=0

Saving to a file in MozillaCookieJar format

import http.cookiejar, urllib.request

# File name
filename = 'cookies.txt'

# Create a MozillaCookieJar that will save to the file named by filename
cookie = http.cookiejar.MozillaCookieJar(filename)


# Create the handler object
handler = urllib.request.HTTPCookieProcessor(cookie)

# Create the opener object
opener = urllib.request.build_opener(handler)

# Send the request
response = opener.open('http://www.baidu.com')

# Save the cookies
cookie.save(ignore_discard=True, ignore_expires=True)

Result: contents of cookies.txt

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com      TRUE    /       FALSE   1644416949      BAIDUID 89E9C382C1BBCC7142BBF4B4521535F3:FG=1
.baidu.com      TRUE    /       FALSE   3760364596      BIDUPSID        89E9C382C1BBCC71573B3743E3185D16
.baidu.com      TRUE    /       FALSE           H_PS_PSSID      33425_33514_33580_33259_33272_31254_33463_33584_26350_33567
.baidu.com      TRUE    /       FALSE   3760364596      PSTM    1612880949
www.baidu.com   FALSE   /       FALSE           BDSVRTM 0
www.baidu.com   FALSE   /       FALSE           BD_HOME 1

Saving to a file in LWP format

import http.cookiejar, urllib.request

# File name
filename = 'cookies.txt'

# Create an LWPCookieJar that will save to the file named by filename
cookie = http.cookiejar.LWPCookieJar(filename)


# Create the handler object
handler = urllib.request.HTTPCookieProcessor(cookie)

# Create the opener object
opener = urllib.request.build_opener(handler)

# Send the request
response = opener.open('http://www.baidu.com')

# Save the cookies
cookie.save(ignore_discard=True, ignore_expires=True)

Result: contents of cookies.txt

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="A11D2FD29D16CB293057EF54CA7A84B6:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2022-02-09 14:32:23Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=A11D2FD29D16CB29B7B7A37F12289763; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-02-27 17:46:30Z"; version=0
Set-Cookie3: H_PS_PSSID=33423_33402_33256_33344_31254_33601_26350; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1612881143; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-02-27 17:46:30Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=1; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Loading and using saved cookies
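
A minimal sketch of reading a saved file back, assuming the LWP format from the previous section. So that it runs offline, it writes a one-cookie file by hand into a temporary directory instead of reusing a file from an earlier run; the cookie line copies the shape of the output shown above.

```python
import http.cookiejar
import os
import tempfile
import urllib.request

# One cookie in LWP format, written by hand so the demo needs no network access.
LWP_TEXT = (
    '#LWP-Cookies-2.0\n'
    'Set-Cookie3: BD_HOME=1; path="/"; domain="www.baidu.com"; '
    'path_spec; discard; version=0\n'
)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookies.txt')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(LWP_TEXT)

    cookie = http.cookiejar.LWPCookieJar()
    # ignore_discard/ignore_expires mirror the flags used when saving.
    cookie.load(filename, ignore_discard=True, ignore_expires=True)

# An opener built on this jar sends the loaded cookies with matching requests.
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

loaded = {item.name: item.value for item in cookie}
print(loaded)
```

For a file saved in MozillaCookieJar format, the same load() call works on an http.cookiejar.MozillaCookieJar instance.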

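Handling exceptions with the error module

When the network is unreliable or a request otherwise fails, the error module lets us handle the resulting exceptions.

| error module class | Purpose |
| --- | --- |
| `URLError` | Exceptions raised by the request module can all be handled by catching this class |
| `HTTPError` | A subclass of `URLError` that carries more information |

The error module is the file error.py inside the urllib package directory.

URLError class

Defined in error.py.

Exceptions raised by the request module can all be handled by catching URLError.

Test:

from urllib import request, error

# This address is not expected to exist
url = 'http://cuiqingc.com'
try:
    response = request.urlopen(url)
# Exceptions raised by the request module can be caught as URLError
except error.URLError as e:
    print(e.reason)

Result: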

Not Found

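HTTPError class

Defined in error.py.

It is a subclass of URLError, raised for HTTP request errors such as a failed authentication. It has the following 3 attributes.

code: the HTTP status code, e.g. 404 means the page does not exist and 500 means an internal server error.
reason: as in the parent class, the reason for the error.
headers: the response headers.

Here is an example:

# Import the request and error modules from urllib
from urllib import request, error

try:
    # The site exists, but this page does not
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:

    # code is the HTTP status code, e.g. 404 means the page does not exist
    print("Status code: " + str(e.code))

    print("-"*10)  # separator

    # reason, as in the parent class, is the reason for the error
    print("Reason: " + str(e.reason))

    print("-"*10)  # separator

    # headers holds the response headers
    print("Response headers: " + str(e.headers))

Result:

Status code: 404
----------
Reason: Not Found
----------
Response headers: Server: GitHub.com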
Date: Wed, 10 Feb 2021 04:33:04 GMT
Content-Type: text/html; charset=utf-8
X-NWS-UUID-VERIFY: 57751c67ef63d71111b6d2ccb0374d5d
Access-Control-Allow-Origin: *
ETag: "5ff19d26-c534"
x-proxy-cache: MISS
X-GitHub-Request-Id: B9A0:0A47:22D4A6:25005A:602356CA
Accept-Ranges: bytes
Age: 2870
Via: 1.1 varnish
X-Served-By: cache-tyo11956-TYO
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1612931585.848275,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: 64ac5df797f171a4e102395049f9dec1d9c48b91
X-Daa-Tunnel: hop_count=2
X-Cache-Lookup: Hit From Upstream
X-Cache-Lookup: Hit From Inner Cluster
Content-Length: 50484
X-NWS-LOG-UUID: 1421142427750633019
Connection: close
X-Cache-Lookup: Cache Miss
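
Because HTTPError is a subclass of URLError, a handler for both has to list HTTPError first. A minimal sketch follows; the helper name `fetch_status` and the made-up `foo://` URL are assumptions used only for an offline demonstration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url, timeout=5):
    """Return a short description of how the request went (hypothetical helper)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 'ok: %s' % response.status
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return 'HTTP error: %s %s' % (e.code, e.reason)
    except URLError as e:
        # Everything else urllib reports, e.g. DNS failures or refused connections.
        return 'URL error: %s' % e.reason


# Offline demo: no handler exists for the made-up scheme 'foo', so URLError is raised.
message = fetch_status('foo://bar')
print(message)
```

If the two except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.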

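Parsing URLs with the parse module

| parse module function | Purpose |
| --- | --- |
| `urlencode()` | URL-encodes a dict into a query string |
| `unquote()` | Decodes a URL-encoded string |
| `urlparse()` | Splits a URL into 6 parts |
| `urlunparse()` | Combines 6 parts into a complete URL |
| `urlsplit()` | Splits a URL into 5 parts |
| `urlunsplit()` | Combines 5 parts into a complete URL |
| `parse_qs()` | Converts query parameters into a dict |
| `parse_qsl()` | Converts query parameters into a list of tuples |
| `urljoin()` | Joins a base URL and another (possibly relative) URL into a complete URL |

The parse module is the file parse.py inside the urllib package directory.

Encoding

import urllib.parse  # import the parse module

# Create a dict with key a and value 无敌
cl={'a':'无敌'}

# URL-encode it
url=urllib.parse.urlencode(cl)
# Print it
print(url)

Result:

a=%E6%97%A0%E6%95%8C

Decoding

import urllib.parse  # import the parse module

# The encoded string from above
cl='a=%E6%97%A0%E6%95%8C'

# URL-decode it
url=urllib.parse.unquote(cl)
# Print it
print(url)

Result:

a=无敌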
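
urlencode() works on whole dicts; for a single string the matching functions are quote() and unquote(). A short sketch with illustrative inputs:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

encoded = quote('无敌')      # percent-encode the UTF-8 bytes
decoded = unquote(encoded)   # and back again

# quote() leaves '/' alone by default; safe='' encodes it as well.
path = quote('/a b/无敌')
strict = quote('/a b', safe='')

# quote_plus() is meant for query strings: spaces become '+'.
query_value = quote_plus('hello world')
restored = unquote_plus(query_value)

print(encoded, decoded, path, strict, query_value, restored)
```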

Splitting a URL into 6 parts with urlparse()

Basic demonstration

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

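Result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The result is a ParseResult object with 6 parts: scheme, netloc, path, params, query, and fragment.

They correspond to the pieces of the URL http://www.baidu.com/index.html;user?id=5#comment, which has the general form scheme://netloc/path;params?query#fragment.

It is actually a tuple, so the parts can be read by index.

Code: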

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment')
print(result[0])
print(result[1])

Result:

http
www.baidu.com

The parts can also be read by attribute name, with the same result:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment')
print(result.scheme)
print(result.netloc)

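Result:

http
www.baidu.com

Details

urlparse(urlstring, scheme='', allow_fragments=True) takes 3 parameters.

urlstring: required, the URL to parse.
scheme: the default scheme (such as http or https), used only when the URL itself does not specify one; if the URL starts with http://, that scheme is used.

allow_fragments: whether to recognise the fragment. When set to False, the fragment stays attached to the preceding component and the fragment field in the result is empty.

Compare the results: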

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
result = urlparse('http://www.baidu.com/index.html#comment')
print(result)

Result:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='', query='', fragment='comment')
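
The scheme parameter is easiest to understand by trying it. The sketch below also shows the hostname and port attributes of the result; the port 8080 is an arbitrary value picked for the demonstration.

```python
from urllib.parse import urlparse

# No scheme in the URL, so the scheme argument is used as the default.
# Without a leading '//' the host is not recognised and ends up in path.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')

# The URL's own scheme takes priority over the default.
own_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')

# Extra attributes split netloc further.
with_port = urlparse('http://www.baidu.com:8080/index.html')

print(no_scheme)
print(own_scheme.scheme)
print(with_port.hostname, with_port.port)
```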

Combining 6 parts with urlunparse()

urlunparse() is the inverse of urlparse() above.

Its argument must be an iterable of exactly 6 items, otherwise an error is raised.

Demonstration:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Result: the URL has been constructed

http://www.baidu.com/index.html;user?a=6#comment
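
A short sketch of the round trip between urlparse() and urlunparse(), including what happens when the length is wrong. The replacement query id=6 is just an example value.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Splitting and recombining gives back the original URL.
rebuilt = urlunparse(parts)

# ParseResult is a named tuple, so _replace() makes a modified copy;
# geturl() recombines it.
changed = parts._replace(query='id=6').geturl()

# Any length other than 6 is rejected.
try:
    urlunparse(['http', 'www.baidu.com'])
    error = None
except ValueError as e:
    error = e

print(rebuilt)
print(changed)
print(type(error).__name__)
```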

Splitting a URL into 5 parts with urlsplit()

This function is very similar to urlparse(), except that it does not parse params separately (params stays inside path), so it returns only 5 parts.

Test:

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

Result:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

Comparing the output of urlsplit() and urlparse()

Code:

from urllib.parse import urlsplit,urlparse

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
# Print the result of urlsplit()
print(result)

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
# Print the result of urlparse()
print(result)

Result:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The result of urlsplit() is also a tuple, readable by index or by attribute:

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme)
print(result[0])

Result:

http
http

Combining 5 parts with urlunsplit()

urlunsplit() is similar to urlunparse(); the only difference is that the iterable must have exactly 5 items, otherwise an error is raised.

Example:

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

Result:

http://www.baidu.com/index.html?a=6#comment

Turning a dict into query parameters with urlencode()

urlencode() is used to build the parameters of a GET request.

Test:

from urllib.parse import urlencode

# Create a dict
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'

# Call urlencode() to serialise the dict into GET request parameters
print(base_url + urlencode(params))

Result:

http://www.baidu.com?name=germey&age=22
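
Two details of urlencode() that the example does not show are how special characters are escaped and how list values are handled. A small sketch with made-up parameters:

```python
from urllib.parse import urlencode

# Values are percent-encoded; spaces become '+'.
simple = urlencode({'q': 'hello world', 'name': '无敌'})

# doseq=True expands a list value into one key=value pair per item.
multi = urlencode({'tag': ['python', 'urllib']}, doseq=True)

full_url = 'http://www.baidu.com/s?' + simple

print(simple)
print(multi)
print(full_url)
```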

Turning query parameters into a dict with parse_qs()

parse_qs() is the inverse of urlencode().

With parse_qs() a query string can be converted back into a dict.

Test:

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

Result:

{'name': ['germey'], 'age': ['22']}

Turning query parameters into a list of tuples with parse_qsl()

It converts the parameters into a list of tuples.

Test:

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

Result:

[('name', 'germey'), ('age', '22')]
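
The values returned by parse_qs() are lists because a key may appear more than once. The sketch below shows that, and checks that both results can be fed back into urlencode(); the repeated tag key is an example of my own.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'name=germey&age=22&tag=a&tag=b'

as_dict = parse_qs(query)     # values are lists because keys may repeat
as_pairs = parse_qsl(query)   # keeps order and duplicates as (key, value) tuples

# Round trip: either form can be fed back into urlencode().
from_pairs = urlencode(as_pairs)
from_dict = urlencode(as_dict, doseq=True)

print(as_dict)
print(as_pairs)
print(from_pairs == query, from_dict == query)
```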

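Building a complete URL with urljoin()

urljoin() is used for joining: it turns an incomplete URL into a complete one by resolving it against a base URL.

Test:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Result:

http://www.baidu.com/FAQ.html
www.baidu.com?category=2

If the second argument is a complete URL, the first argument is discarded.

How it decides: the base URL (first argument) supplies the scheme, netloc, and path. Any of these 3 that are missing from the new link (second argument) are filled in from the base; any that are present in the new link are used as they are.

Test:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))

Result: in line 4 of the code, the first argument http://www.baidu.com was discarded

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html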
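
A few more cases make the rule easier to see. All of them resolve a link against the same base page; the paths and the cdn.example.com host are invented for the demonstration.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

sibling = urljoin(base, 'c.html')                    # relative to the base's directory
parent = urljoin(base, '../c.html')                  # one directory up
rooted = urljoin(base, '/c.html')                    # replaces the whole path
other_host = urljoin(base, '//cdn.example.com/x.js') # keeps only the scheme
query_only = urljoin(base, '?page=2')                # keeps the path, replaces the query

for result in (sibling, parent, rooted, other_host, query_only):
    print(result)
```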

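The robotparser module

With urllib's robotparser module we can analyse a site's Robots protocol (robots.txt).

The robotparser module is the file robotparser.py inside the urllib package directory.

Robots protocol

See my separate article on the Robots protocol.

RobotFileParser class

This is the only public class in robotparser.py.

Methods of the RobotFileParser class

It has only a handful of methods:

| RobotFileParser method | Purpose |
| --- | --- |
| `set_url()` | Sets the URL of robots.txt; not needed if the URL was already passed to the RobotFileParser constructor |
| `read()` | Fetches robots.txt and parses it. If neither read() nor parse() has been called, every later check returns `False` |
| `parse()` | Parses robots.txt content; the argument is a list of lines from robots.txt, which are analysed according to the robots.txt syntax |
| `can_fetch()` | Takes two arguments, a `User-agent` and the URL to fetch, and returns `True` or `False` depending on whether that agent may fetch the URL |
| `mtime()` | Returns the time robots.txt was last fetched and parsed; useful when crawling one site over a long period |
| `modified()` | Sets the time robots.txt was last fetched and parsed to the current time; useful when crawling one site over a long period |

I'll test with my favourite site, Bilibili.

Bilibili's robots.txt lists several paths that must not be crawled.

I found a URL that may be crawled: https://www.bilibili.com/v/popular/all

Let's use Python to check which ones may be crawled.

Code: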

from urllib.robotparser import RobotFileParser


rp = RobotFileParser()

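# The URL can also be passed directly to RobotFileParser()
rp.set_url('https://www.bilibili.com/robots.txt')

# Fetch robots.txt and parse it
rp.read()


# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /v/ directory may be crawled
a=rp.can_fetch('*', 'https://www.bilibili.com/v/popular/all')

# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /images/ directory may not be crawled
b=rp.can_fetch('*', "https://www.bilibili.com/images/")

# Print the results
print(a)
print(b)

Result:

True
False

The set_url() call can also be dropped by passing the robots.txt URL to the constructor.

Code: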

from urllib.robotparser import RobotFileParser


rp = RobotFileParser('https://www.bilibili.com/robots.txt')


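# Fetch robots.txt and parse it
rp.read()


# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /v/ directory may be crawled
a=rp.can_fetch('*', 'https://www.bilibili.com/v/popular/all')

# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /images/ directory may not be crawled
b=rp.can_fetch('*', "https://www.bilibili.com/images/")


# Print the results
print(a)
print(b)

Same result:

True
False

parse() can also be used to do the reading and analysis: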

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()

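# Fetch robots.txt and pass its lines to parse()
rp.parse(urlopen('https://www.bilibili.com/robots.txt').read().decode('utf-8').split('\n'))

# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /v/ directory may be crawled
a=rp.can_fetch('*', 'https://www.bilibili.com/v/popular/all')

# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /images/ directory may not be crawled
b=rp.can_fetch('*', "https://www.bilibili.com/images/")


# Print the results
print(a)
print(b)

Same result:

True
False

What the code above does is equivalent to this.

Code: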

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()

# The rules written out by hand
robots=[
         'User-agent: *',
         'Disallow: /include/', 
         'Disallow: /mylist/', 
         'Disallow: /member/', 
         'Disallow: /images/', 
         'Disallow: /ass/', 
         'Disallow: /getapi', 
         'Disallow: /search', 
         'Disallow: /account', 
         'Disallow: /badlist.html', 
         'Disallow: /m/', 
         '']

# Parse the lines directly; no request is sent
rp.parse(robots)

# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /v/ directory may be crawled
a=rp.can_fetch('*', 'https://www.bilibili.com/v/popular/all')

# can_fetch() takes two arguments: the User-agent and the URL to fetch. The /images/ directory may not be crawled
b=rp.can_fetch('*', "https://www.bilibili.com/images/")


# Print the results
print(a)
print(b)

Same result:

True
False
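
Because parse() accepts plain lines, rules for different crawlers can be tried without touching the network. The rules below are made up for illustration (they are not Bilibili's), and `BadBot` and `MyCrawler` are hypothetical user agents.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: one crawler is banned outright, everyone else must stay
# out of /private/ and wait 2 seconds between requests.
rules = [
    'User-agent: BadBot',
    'Disallow: /',
    '',
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 2',
]

rp = RobotFileParser()
rp.parse(rules)

banned = rp.can_fetch('BadBot', 'https://example.com/index.html')
public = rp.can_fetch('MyCrawler', 'https://example.com/index.html')
private = rp.can_fetch('MyCrawler', 'https://example.com/private/data')
delay = rp.crawl_delay('MyCrawler')  # available since Python 3.6

print(banned, public, private, delay)
```

A crawler would typically call can_fetch() before each request and sleep for the crawl delay, when one is given, between requests.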
