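lxml is written in C (it is built on the libxml2 and libxslt libraries) and supports the XPath query language, which is why it parses quickly.
Why use the lxml parsing library? The data that comes back in a response is only an HTML string; lxml parses an HTML or XML string into a document tree that can be queried.
In short, lxml is an HTML/XML parser.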
Parsing HTML/XML pages
Parsing an HTML string
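The lxml library has an etree module that provides an HTML() function. Passing it an HTML string initializes a document from that string and returns an object that can be queried with XPath.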
Code
from lxml import etree

# Note: the last <li> is deliberately left unclosed.
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a>
    </ul>
</div>
'''
html = etree.HTML(text)  # parse the string into an element tree
print(html)
Result: an Element object for the html root node
<Element html at 0x7f1287333f40>
As you can see, the string was parsed into an html Element object. To print the corrected HTML, use the tostring() function of lxml's etree module.
Code
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a>
    </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)  # serialize the repaired tree
print(result)
print(type(result))
Result: the output is of type bytes
b'<html><body><div>\n <ul>\n <li class="item-0"><a href="link1.html">first item</a></li>\n <li class="item-1"><a href="link2.html">second item</a></li>\n <li class="item-inactive"><a href="link3.html">third item</a></li>\n <li class="item-1"><a href="link4.html">fourth item</a></li>\n <li class="item-0"><a href="link5.html">&#29228;&#34411;</a> \n </li></ul>\n </div>\n</body></html>'
<class 'bytes'>
The result above is of type bytes; use the decode() method to convert it to str.
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a>
    </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
result = result.decode('utf-8')  # bytes -> str
print(result)
print(type(result))
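Result: the value is now a str, and the HTML has been repaired. An extra </li> was added to close the preceding <li class="item-0">, and the <html><body></body></html> wrapper was added as well. The Chinese text, however, still shows up as hard-to-read numeric character references (&#29228;&#34411;).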
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">&#29228;&#34411;</a>
    </li></ul>
</div>
</body></html>
<class 'str'>
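To get readable Chinese characters instead of numeric character references, set the output encoding of tostring() through its encoding parameter: encoding="the output encoding you want".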
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a>
    </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html, encoding="utf8")  # serialize as UTF-8 bytes
result = result.decode('utf-8')
print(result)
print(type(result))
Result: the encoding problem is solved and the Chinese characters are printed normally
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a>
    </li></ul>
</div>
</body></html>
<class 'str'>
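As a side note to the encoding step, here is a minimal sketch of an alternative: tostring() also accepts the special value encoding="unicode", in which case it returns a str directly and no decode() call is needed. The shortened HTML fragment is made up for illustration.

```python
from lxml import etree

# Shortened fragment for illustration; the <li> is left unclosed on purpose.
text = '<div><ul><li class="item-0"><a href="link5.html">爬虫</a></ul></div>'
html = etree.HTML(text)

# Default serialization: ASCII bytes, so non-ASCII characters are written
# as numeric character references.
as_ascii = etree.tostring(html)

# encoding="unicode" makes tostring() return a str, so decode() is not needed.
as_str = etree.tostring(html, encoding="unicode")

print(type(as_ascii), type(as_str))
print(as_ascii)
print(as_str)
```

Which form to use mostly depends on what happens next: bytes are convenient for writing to a file, a str for printing or further string processing.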
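Parsing the contents of an HTML file
Suppose I have an HTML file and want to scrape its contents. The etree module of lxml provides a parse() function that parses a file directly.
I created a file named a.html with the following content: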
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a></li>
    </ul>
</div>
File parsing code
from lxml import etree

html = etree.parse(r'./a.html')  # parse the file from disk
result = etree.tostring(html, encoding="utf8")
result = result.decode('utf-8')
print(result)
print(type(result))
Result: the output is the same as the file content above
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a></li>
    </ul>
</div>
<class 'str'>
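If a.html is missing a </li> tag, running the code above raises an error, because etree.parse() uses a strict XML parser by default.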
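The failure can be reproduced without touching the disk. The sketch below is only an illustration: it uses io.BytesIO as an in-memory stand-in for a.html (an assumption made so the example is self-contained) and catches the exception that the default XML parser raises.

```python
import io
from lxml import etree

# In-memory stand-in for an a.html whose <li> is never closed.
broken = b'<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

try:
    # Without an explicit parser, parse() uses the strict XML parser.
    etree.parse(io.BytesIO(broken))
    error_message = None
except etree.XMLSyntaxError as exc:
    error_message = str(exc)

print(error_message)
```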
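To handle this, pass an instance of the HTMLParser() class from lxml's etree module to parse() as the parser. HTMLParser repairs problems in the HTML file, such as missing tags.
Test
I deliberately removed a few tags from a.html.
Code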
from lxml import etree

# Use the lenient HTML parser instead of the default XML parser.
html = etree.parse(r'./a.html', etree.HTMLParser(encoding="utf8"))
result = etree.tostring(html, encoding="utf8")
result = result.decode('utf-8')
print(result)
print(type(result))
Result: no error is raised, and the missing tags have been filled in
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a>
</li></ul></div></body></html>
<class 'str'>
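One detail worth knowing here: etree.parse() returns a whole-document tree object rather than a single Element, and getroot() gives the root Element. The sketch below again uses io.BytesIO as a stand-in for the file (an assumption for self-containment). As far as I understand lxml's behaviour, serializing the tree is what brings in the DOCTYPE line seen above, while serializing only the root element leaves it out.

```python
import io
from lxml import etree

# In-memory stand-in for a broken a.html (unclosed <li>).
broken = b'<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

tree = etree.parse(io.BytesIO(broken), etree.HTMLParser(encoding="utf8"))
root = tree.getroot()          # the <html> Element at the top of the tree

print(type(tree).__name__)     # the document-level tree object
print(root.tag)                # tag name of the root element

# Serialize only the root element; the missing </li> has been repaired.
root_html = etree.tostring(root, encoding="unicode")
print(root_html)
```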
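Extracting data with XPath in lxml
Objects created with lxml's etree module have an xpath() method. Pass it an XPath expression to extract data. For expressions that select nodes, as in the examples here, it returns a list.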
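To make the return value more concrete, here is a small sketch of what xpath() gives back for a few kinds of expressions. The two-item list is a shortened stand-in for the sample HTML; the variable names are only illustrative.

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

elements = html.xpath('//li')        # element nodes   -> list of Element objects
hrefs = html.xpath('//li/a/@href')   # attribute nodes -> list of strings
count = html.xpath('count(//li)')    # XPath number function -> float
first = html.xpath('string(//a)')    # XPath string function -> str (first match)
missing = html.xpath('//table')      # nothing matches -> empty list, not None

print(len(elements), hrefs, count, first, missing)
```

Because a query that matches nothing returns an empty list, indexing the result with [0] should be guarded when the page structure is not guaranteed.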
Extracting all matching nodes
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)
items = html.xpath('//li')  # every <li> anywhere in the document
print(items)
print(items[0])
Result: a list whose items are all Element objects; I also printed the first one on its own
[<Element li at 0x7f39af5e0f00>, <Element li at 0x7f39af5e0f40>, <Element li at 0x7f39af5e0f80>, <Element li at 0x7f39af5e0fc0>, <Element li at 0x7f39af5e8040>]
<Element li at 0x7f39af5e0f00>
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)
items = html.xpath('//li')
# Serialize only the first matched <li> back to markup.
result = etree.tostring(items[0], encoding="utf8")
result = result.decode('utf-8')
print(result)
Result: only the first element is printed, and the output is correct
<li class="item-0"><a href="link1.html">first item</a></li>
Extracting child nodes
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">爬虫</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)
items = html.xpath('//ul/li')  # <li> elements that are direct children of a <ul>
print(items)
Result: a list of Element objects
[<Element li at 0x7fb8549b70c0>, <Element li at 0x7fb8549b7100>, <Element li at 0x7fb8549b7140>, <Element li at 0x7fb8549b7180>, <Element li at 0x7fb8549b71c0>]
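The single slash in //ul/li selects direct children only, while a double slash selects descendants at any depth. A short sketch of the difference, using a shortened version of the sample HTML:

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

direct_li = html.xpath('//ul/li')    # <li> is a direct child of <ul>
direct_a = html.xpath('//ul/a')      # <a> is NOT a direct child of <ul>
any_depth_a = html.xpath('//ul//a')  # <a> anywhere below <ul>

print(len(direct_li), len(direct_a), len(any_depth_a))
```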
Extracting attribute values
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0">
            <a href="link1.html">first item</a>
        </li>
        <li class="item-1">
            <a href="link2.html">second item</a>
        </li>
        <li class="item-inactive">
            <a href="link3.html">third item</a>
        </li>
        <li class="item-1">
            <a href="link4.html">fourth item</a>
        </li>
        <li class="item-0">
            <a href="link5.html">爬虫</a>
        </li>
    </ul>
</div>
'''
html = etree.HTML(text)
hrefs = html.xpath('//li[@class="item-inactive"]/a/@href')
print(hrefs)
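The expression //li[@class="item-inactive"]/a/@href means: take each li element whose class attribute is "item-inactive", go to its child a element, and return the value of that element's href attribute.
Result
['link3.html']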
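A few related ways to get at attribute values, as a hedged sketch on a shortened version of the sample HTML: selecting every href, matching part of a class value with the XPath contains() function, and reading the attribute from an Element with get().

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# Every href under an <li>, in document order.
all_hrefs = html.xpath('//li/a/@href')

# Exact match on the class attribute.
inactive = html.xpath('//li[@class="item-inactive"]/a/@href')

# Substring match on the class attribute.
by_contains = html.xpath('//li[contains(@class, "inactive")]/a/@href')

# Select the elements first, then read the attribute in Python.
via_get = [a.get('href') for a in html.xpath('//li/a')]

print(all_hrefs, inactive, by_contains, via_get)
```

Selecting elements and calling get() is handy when several attributes or the text of the same element are needed together.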
Extracting from the parent node
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0">
            <a href="link1.html">first item</a>
        </li>
    </ul>
</div>
'''
html = etree.HTML(text)
# ".." steps from <a> up to its parent <li>, then @class reads its attribute.
classes = html.xpath('//a/../@class')
print(classes)
Result
['item-0']
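The same parent lookup can be written in a few equivalent ways. This is a sketch for comparison only: ".." is shorthand for the parent axis, and lxml elements also offer a getparent() method on the Python side.

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0">
        <a href="link1.html">first item</a>
    </li>
</ul>
'''
html = etree.HTML(text)

via_dotdot = html.xpath('//a/../@class')         # abbreviated parent step
via_axis = html.xpath('//a/parent::li/@class')   # explicit parent axis
a = html.xpath('//a')[0]
via_api = a.getparent().get('class')             # lxml Element API

print(via_dotdot, via_axis, via_api)
```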
Getting text
Use text() in an XPath expression to get the text inside a node.
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0">
            <a href="link1.html">first item</a>
        </li>
        <li class="item-1">
            <a href="link2.html">second item</a>
        </li>
        <li class="item-inactive">
            <a href="link3.html">third item</a>
        </li>
        <li class="item-1">
            <a href="link4.html">fourth item</a>
        </li>
        <li class="item-0">
            <a href="link5.html">爬虫</a>
        </li>
    </ul>
</div>
'''
html = etree.HTML(text)
texts = html.xpath('//a/text()')  # text directly inside each <a>
print(texts)
Result
['first item', 'second item', 'third item', 'fourth item', '爬虫']
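Note that text() only returns text nodes that are direct children of the selected element. The sketch below adds a nested <b> tag to the sample (an assumption, not part of the original HTML) to show the difference, and pairs each link with its full text.

```python
from lxml import etree

# The <b> inside the second link is added here purely for illustration.
text = '''
<ul>
    <li class="item-0">
        <a href="link1.html">first item</a>
    </li>
    <li class="item-1">
        <a href="link2.html">second <b>item</b></a>
    </li>
</ul>
'''
html = etree.HTML(text)

# Only text directly inside <a>; the text inside <b> is not included.
direct_text = html.xpath('//a/text()')

# itertext() walks the element and all of its descendants.
full_text = [''.join(a.itertext()) for a in html.xpath('//a')]

# Pair each href with the complete string value of its <a> element.
pairs = [(a.get('href'), a.xpath('string(.)')) for a in html.xpath('//a')]

print(direct_text)
print(full_text)
print(pairs)
```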
Still learning.