

Python Data Analysis, Crawler Basics: requests in Detail

1. Basic usage of requests

1.1 Introduction to requests

requests is a popular third-party Python library for sending HTTP requests, and it greatly simplifies interacting with web services. In its own words, it is "the only Non-GMO HTTP library for Python, safe for human consumption."

1.2 Installing requests

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests

1.3 Basic requests syntax

    import requests
    url = 'http://www.baidu.com'
    response = requests.get(url)
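Two small additions worth knowing, beyond the original example: requests.get() accepts a timeout parameter, and Response.raise_for_status() turns HTTP error codes into exceptions. A minimal sketch:

    import requests

    # Without a timeout, a stalled server can block the script indefinitely.
    response = requests.get('http://www.baidu.com', timeout=5)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    # and does nothing on success.
    response.raise_for_status()
    print(response.status_code)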

1.4 The type and attributes of response

(1) One type:

    print(type(response)) # <class 'requests.models.Response'>

(2) Six attributes:

    # the encoding used to decode the response body
    response.encoding = 'utf-8'
    # the page source as a string
    print(response.text)
    # the URL of the request
    print(response.url)
    # the response body as raw bytes
    print(response.content)
    # the HTTP status code
    print(response.status_code)
    # the response headers
    print(response.headers)
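The difference between text and content matters when the declared encoding is wrong. A short illustrative sketch (the garbled-text scenario is an assumption for illustration, not from the original):

    import requests

    response = requests.get('http://www.baidu.com')
    # .text decodes .content using response.encoding, which requests guesses
    # from the response headers; a wrong guess makes .text come out garbled.
    response.encoding = 'utf-8'   # override the guess before reading .text
    page = response.text
    # Manual equivalent: decode the raw bytes yourself.
    same_page = response.content.decode('utf-8')
    print(page == same_page)      # True when the body is valid UTF-8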

2. GET requests with requests

Here we scrape the Baidu search results page for "郑州" (Zhengzhou). This works much like urllib: once you understand urllib, GET requests with requests should pose no difficulty.

    import requests
    url = 'https://www.baidu.com/s?'
    headers = {
        "user-agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    }
    data = {
        "wd": "郑州"
    }
    # url: the resource path; params: the query parameters; the rest are keyword options
    response = requests.get(url=url, params=data, headers=headers)
    content = response.text
    print(content)

Differences from a GET request in urllib (a urllib version of the same request is sketched after this list):

1. Parameters are passed via params.
2. Parameters do not need to be urlencoded.
3. No request object needs to be constructed.
4. The ? in the resource path can be omitted.
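For comparison, a sketch of the equivalent urllib code, following the usual urllib pattern (urlencode the parameters, build a Request object, then urlopen):

    import urllib.request
    import urllib.parse

    base_url = 'https://www.baidu.com/s?'
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    }
    # urllib makes you encode the query string yourself...
    url = base_url + urllib.parse.urlencode({"wd": "郑州"})
    # ...and build a Request object to attach headers.
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)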

3. POST requests with requests

As in the urllib chapter, we use the Baidu Translate sug endpoint as the POST example:

    import requests
    import json

    url = "https://fanyi.baidu.com/sug"
    headers = {
        "user-agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
        "cookie": 'BIDUPSID=91AC5A2A82E26F50448A070917943E70; PSTM=1732629509; BAIDUID=91AC5A2A82E26F50448A070917943E70:FG=1; BDUSS_BFESS=E1IcjZ0NVRodGlNNjJaNFdXNUZQVjVsZE04eW5iaVdOSXkzQ3BDRkcxVndMbkpuRUFBQUFBJCQAAAAAAQAAAAEAAABYaMgfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHChSmdwoUpne; BAIDUID_BFESS=91AC5A2A82E26F50448A070917943E70:FG=1; ZFY=0L:BrFXMz3oPPSIl2WrbINbmdK4f2nDwQtL:Bfl6za7PM:C; BDRCVFR[l9-IMhu-BDf]=mk3SLVN4HKm; delPer=0; H_PS_PSSID=61027_61099_61217_61280_61298_61246_60853; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; H_WISE_SIDS=61027_61099_61217_61280_61298_61246_60853; PSINO=1; BA_HECTOR=a58l2h24a121a1808ka48g213kh3u01jlb88s1u; BCLID=10763796247062205483; BCLID_BFESS=10763796247062205483; BDSFRCVID=rvFOJexroG3B_xQJosAdbCbKXuweG7bTDYrEOwXPsp3LGJLVdLE8EG0Pts1-dEu-S2OOogKKBeOTHn0F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BDSFRCVID_BFESS=rvFOJexroG3B_xQJosAdbCbKXuweG7bTDYrEOwXPsp3LGJLVdLE8EG0Pts1-dEu-S2OOogKKBeOTHn0F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tbkD_C-MfIvhDRTvhCcjh-FSMgTBKI62aKDsoJ71BhcqJ-ovQpJmjU4ByRnkBJoa0Krihn6cWKJJ8UbeWfvp3t_D-tuH3lLHQJnph66dah5nhMJmBp_VhfL3qtCOaJby523i5J5vQpn_hhQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0DTbLjH8jqTntaD5yWj6JanTjjTrFbKTjhPrML4tJWMT-MTryKM3xJh7-Ox7Xy4nDLPDUWMciB5OMBanRhlRNQRjVHqI4Lq_K360ZWec72MQxtNRJMMKEal5MKqF9MRJobUPULxo9LUvXtgcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLfC-aMCt6eno_Mt4HqfbQa4JWHDQbsJOOaCvDSqQOy4oTj6D05-TRbMRZXa5ZaRonKqviEP8RW4r_3MvB-fnyKMIJye3CBItbtbr5ol6KQft20-DAeMtjBbLLfNTtVn7jWhvIeq72y-I2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8EjHCDJ5kDtJuHVbobHJoHjJbGq4bohjPX54j9BtQO-DOxoho7MUjkDPOqb-5T-xPR5qJ-05baQgnkQq5vbMnmqPtRXMJkXhKOX-_O0x-jLTneo66e34KVVIoOXPnJyUPYbtnnBPCj3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CcJ-J8XMD89jTbP; H_BDCLCKID_SF_BFESS=tbkD_C-MfIvhDRTvhCcjh-FSMgTBKI62aKDsoJ71BhcqJ-ovQpJmjU4ByRnkBJoa0Krihn6cWKJJ8UbeWfvp3t_D-tuH3lLHQJnph66dah5nhMJmBp_VhfL3qtCOaJby523i5J5vQpn_hhQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0DTbLjH8jqTntaD5yWj6JanTjjTrFbKTjhPrML4tJWMT-MTryKM3xJh7-Ox7Xy4nDLPDUWMciB5OMBanRhlRNQRjVHqI4Lq_K360ZWec72MQxtNRJMMKEal5MKqF9MRJobUPULxo9LUvXtgcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLfC-aMCt6eno_Mt4HqfbQa4JWHDQbsJOOaCvDSqQOy4oTj6D05-TRbMRZXa5ZaRonKqviEP8RW4r_3MvB-fnyKMIJye3CBItbtbr5ol6KQft20-DAeMtjBbLLfNTtVn7jWhvIeq72y-I2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8EjHCDJ5kDtJuHVbobHJoHjJbGq4bohjPX54j9BtQO-DOxoho7MUjkDPOqb-5T-xPR5qJ-05baQgnkQq5vbMnmqPtRXMJkXhKOX-_O0x-jLTneo66e34KVVIoOXPnJyUPYbtnnBPCj3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CcJ-J8XMD89jTbP; ab_sr=1.0.1_ZmQ5MTQ5YzBmNGJkNTY1NzMwMDMyZDljNDI4ZDNmNDk2YjBiOTJiOTkyNTYwZDEwYWM1MTAyNDliM2IwZjQxNmFmYmQxZGJmZDI0MDI5YmViZDIwYzIwMDVkZmMxNjljNGEzNzQ5MTYyOWY5MzVmMTgxZTQxOGY4YzFhMTk3YWRiNGQ0NGI3Y2M1NjhjOGEyMTE1MDU1N2M1MDI2OWVjMg==; RT="z=1&dm=baidu.com&si=683d19d9-ec4a-4ee1-ba25-d45da6aaef7f&ss=m4fnfeoj&sl=3&tt=b6o&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=ruw"'
    }
    data = {
        "kw": "eye"
    }
    response = requests.post(url=url, headers=headers, data=data)
    content = response.text
    content = json.loads(content)
    print(content)
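Incidentally, requests can parse the JSON body itself: Response.json() is equivalent to the json.loads(response.text) pair above, so the last three lines could shrink to:

    content = response.json()  # parses the JSON response body directly
    print(content)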

Differences from a POST request in urllib:

1. The request body does not need to be encoded or decoded.

2. The parameters are passed via data.

3. No request object needs to be constructed.

4. Proxies

    import requests
    url = "http://www.baidu.com/s?"
    headers = {
        # "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "user-agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
        # "cookie":'BIDUPSID=91AC5A2A82E26F50448A070917943E70; PSTM=1732629509; BAIDUID=91AC5A2A82E26F50448A070917943E70:FG=1; BD_UPN=12314753; BDUSS_BFESS=E1IcjZ0NVRodGlNNjJaNFdXNUZQVjVsZE04eW5iaVdOSXkzQ3BDRkcxVndMbkpuRUFBQUFBJCQAAAAAAQAAAAEAAABYaMgfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHChSmdwoUpne; BAIDUID_BFESS=91AC5A2A82E26F50448A070917943E70:FG=1; ZFY=0L:BrFXMz3oPPSIl2WrbINbmdK4f2nDwQtL:Bfl6za7PM:C; B64_BOT=1; BDRCVFR[l9-IMhu-BDf]=mk3SLVN4HKm; delPer=0; BD_CK_SAM=1; H_PS_PSSID=61027_61099_61217_61280_61298_61246_60853; shifen[8451320_53724]=1733557849; shifen[304792146112_6039]=1733557876; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; H_WISE_SIDS=61027_61099_61217_61280_61298_61246_60853; BA_HECTOR=a58l2h24a121a1808ka48g213kh3u01jlb88s1u; shifen[8332037_91638]=1733665082; BCLID=10763796247062205483; BCLID_BFESS=10763796247062205483; BDSFRCVID=rvFOJexroG3B_xQJosAdbCbKXuweG7bTDYrEOwXPsp3LGJLVdLE8EG0Pts1-dEu-S2OOogKKBeOTHn0F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BDSFRCVID_BFESS=rvFOJexroG3B_xQJosAdbCbKXuweG7bTDYrEOwXPsp3LGJLVdLE8EG0Pts1-dEu-S2OOogKKBeOTHn0F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tbkD_C-MfIvhDRTvhCcjh-FSMgTBKI62aKDsoJ71BhcqJ-ovQpJmjU4ByRnkBJoa0Krihn6cWKJJ8UbeWfvp3t_D-tuH3lLHQJnph66dah5nhMJmBp_VhfL3qtCOaJby523i5J5vQpn_hhQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0DTbLjH8jqTntaD5yWj6JanTjjTrFbKTjhPrML4tJWMT-MTryKM3xJh7-Ox7Xy4nDLPDUWMciB5OMBanRhlRNQRjVHqI4Lq_K360ZWec72MQxtNRJMMKEal5MKqF9MRJobUPULxo9LUvXtgcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLfC-aMCt6eno_Mt4HqfbQa4JWHDQbsJOOaCvDSqQOy4oTj6D05-TRbMRZXa5ZaRonKqviEP8RW4r_3MvB-fnyKMIJye3CBItbtbr5ol6KQft20-DAeMtjBbLLfNTtVn7jWhvIeq72y-I2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8EjHCDJ5kDtJuHVbobHJoHjJbGq4bohjPX54j9BtQO-DOxoho7MUjkDPOqb-5T-xPR5qJ-05baQgnkQq5vbMnmqPtRXMJkXhKOX-_O0x-jLTneo66e34KVVIoOXPnJyUPYbtnnBPCj3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CcJ-J8XMD89jTbP; H_BDCLCKID_SF_BFESS=tbkD_C-MfIvhDRTvhCcjh-FSMgTBKI62aKDsoJ71BhcqJ-ovQpJmjU4ByRnkBJoa0Krihn6cWKJJ8UbeWfvp3t_D-tuH3lLHQJnph66dah5nhMJmBp_VhfL3qtCOaJby523i5J5vQpn_hhQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0DTbLjH8jqTntaD5yWj6JanTjjTrFbKTjhPrML4tJWMT-MTryKM3xJh7-Ox7Xy4nDLPDUWMciB5OMBanRhlRNQRjVHqI4Lq_K360ZWec72MQxtNRJMMKEal5MKqF9MRJobUPULxo9LUvXtgcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLfC-aMCt6eno_Mt4HqfbQa4JWHDQbsJOOaCvDSqQOy4oTj6D05-TRbMRZXa5ZaRonKqviEP8RW4r_3MvB-fnyKMIJye3CBItbtbr5ol6KQft20-DAeMtjBbLLfNTtVn7jWhvIeq72y-I2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8EjHCDJ5kDtJuHVbobHJoHjJbGq4bohjPX54j9BtQO-DOxoho7MUjkDPOqb-5T-xPR5qJ-05baQgnkQq5vbMnmqPtRXMJkXhKOX-_O0x-jLTneo66e34KVVIoOXPnJyUPYbtnnBPCj3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CcJ-J8XMD89jTbP; ab_sr=1.0.1_ZmQ5MTQ5YzBmNGJkNTY1NzMwMDMyZDljNDI4ZDNmNDk2YjBiOTJiOTkyNTYwZDEwYWM1MTAyNDliM2IwZjQxNmFmYmQxZGJmZDI0MDI5YmViZDIwYzIwMDVkZmMxNjljNGEzNzQ5MTYyOWY5MzVmMTgxZTQxOGY4YzFhMTk3YWRiNGQ0NGI3Y2M1NjhjOGEyMTE1MDU1N2M1MDI2OWVjMg==; RT="z=1&dm=baidu.com&si=683d19d9-ec4a-4ee1-ba25-d45da6aaef7f&ss=m4fnfeoj&sl=4&tt=cn1&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=wmj&ul=o4bd&hd=o4c0"; PSINO=7; sugstore=1; H_PS_645EC=e2c20yk9RoanWFIVyDJbr18JC5dzOzNojiUaPy0JXsXtSzcOKsks5N3IUyetiaDn7Vsq5ZY; baikeVisitId=1d823dea-39eb-4e63-978d-65fd09a0d697; COOKIE_SESSION=81376_0_6_6_7_3_1_0_6_3_205_1_111167_0_0_0_1733584849_0_1733666222%7C9%2379969_3_1733137574%7C2'
    }
    data = {
        "wd": "ip"
    }
    # a single proxy; the original calls this a "proxy pool",
    # but a real pool rotates several addresses (see the sketch below)
    proxy = {
        "http": "23.247.137.142:80"
    }
    response = requests.get(url=url, params=data, headers=headers, proxies=proxy)
    content = response.text
    file = open("ip.html", "w", encoding="utf-8")
    file.write(content)
    file.close()
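Two notes on this. Free proxy addresses like the one above go stale quickly, so treat it as a placeholder; also, the form documented by requests includes the scheme in the proxy URL (for example "http://23.247.137.142:80"). And a genuine proxy pool keeps several candidates and picks one per request. A minimal sketch, with made-up addresses from the documentation range:

    import random
    import requests

    # Hypothetical proxy addresses, for illustration only.
    proxy_pool = [
        {"http": "http://203.0.113.10:8080"},
        {"http": "http://203.0.113.11:8080"},
        {"http": "http://203.0.113.12:8080"},
    ]

    # Pick a different proxy for each request.
    proxy = random.choice(proxy_pool)
    response = requests.get("http://www.baidu.com/s?",
                            params={"wd": "ip"},
                            headers={"user-agent": "Mozilla/5.0"},
                            proxies=proxy)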

5. Logging in with cookies

As an example we use the personal homepage on gushiwen.cn, whose login form includes a captcha.


First, open the login page, enter the account and password, then open the browser's developer tools. A request to the login endpoint appears, and its payload carries a number of fields, including:

__VIEWSTATE: MnTNH2SbI9isHX8zdfu1NvmByZXoSVf8Vxj5QIeJ5C8EmgWhaBFQRNjQYMe47E+qOO+ss1LSDNdjYeNRy/bdvD7wktgbMm73Cku21k7NhLMYo79CC54kuz//cZ9kSLKKFvkpppzOssnyET3GX789uH1DMUM=
__VIEWSTATEGENERATOR: C93BE1AE

Neither value is fixed; both change between page loads, and the code field (the captcha) changes as well. Obtaining these three values is the hard part of this example.

Challenge (1): __VIEWSTATE and __VIEWSTATEGENERATOR


Back on the login page, view the page source: both values are present as <input type="hidden"> fields, commonly called hidden fields.

Fetch the login page source:

    import requests
    url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    content = response.text

Extract the value of __VIEWSTATE and __VIEWSTATEGENERATOR. BeautifulSoup syntax would work; here we use XPath:

    from lxml import etree
    tree = etree.HTML(content)
    # xpath() returns a list; index [0] gives the bare value (done in the complete code below)
    __VIEWSTATE = tree.xpath('//input[@name="__VIEWSTATE"]/@value')
    __VIEWSTATEGENERATOR = tree.xpath('//input[@name="__VIEWSTATEGENERATOR"]/@value')
    print(__VIEWSTATE)
    print(__VIEWSTATEGENERATOR)

Challenge (2): the code captcha (fetching the captcha image)

    code = tree.xpath('//img[@id="imgCode"]/@src')[0]
    code_url = "https://so.gushiwen.cn" + code

Once we have the captcha URL we can download the image, look at it, and type the characters in at the console. (pytesseract could also recognize it automatically; a sketch follows the next code block.)

    import urllib.request
    urllib.request.urlretrieve(url=code_url, filename="code.jpg")
    code_name = input("Enter the captcha: ")
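For reference, a minimal OCR sketch with pytesseract, assuming the Tesseract binary and the Pillow library are installed (pip install pytesseract pillow). Distorted captchas often defeat plain OCR, so treat this as optional:

    # Hedged sketch: requires a local Tesseract installation.
    from PIL import Image
    import pytesseract

    image = Image.open("code.jpg")
    # image_to_string runs Tesseract OCR over the image and returns the text found.
    code_name = pytesseract.image_to_string(image).strip()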

Downloading the captcha with urlretrieve has an obvious flaw, though: urlretrieve fires a separate request with its own cookies, so the server issues a fresh captcha for it. The image we download, and therefore the code we type in, no longer matches the captcha bound to the request we are about to log in with. The fix is the session object in requests: a session keeps the same cookies across requests, so the captcha download and the login POST belong to one conversation.

    session = requests.session()
    response_code = session.get(code_url)
    content_code = response_code.content  # use the binary body, since we are downloading an image
    f = open("code.jpg", "wb")  # "wb" writes binary data to the file
    f.write(content_code)
    f.close()
    code_name = input("Enter the captcha: ")

Capture the endpoint behind the login button:

    url_post = "https://www.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fwww.gushiwen.cn%2fuser%2fcollect.aspx"
    data_post = {
        "__VIEWSTATE": viewstate,
        "__VIEWSTATEGENERATOR": viewstategenerator,
        "from": "http://www.gushiwen.cn/user/collect.aspx",
        "email": "17719114890",   # account name (an unquoted int in the original)
        "pwd": "dwq0219423",
        "code": code_name,
        "denglu": "登录"          # the login button's form field
    }
    response_post = session.post(url=url_post, headers=headers, data=data_post)
    content_post = response_post.text
    f = open("古诗文.html", "w", encoding="utf-8")
    f.write(content_post)
    f.close()

The complete code:

    import requests
    from lxml import etree

    # fetch the login page and parse out the hidden fields and the captcha URL
    url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    content = response.text
    tree = etree.HTML(content)
    viewstate = tree.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
    viewstategenerator = tree.xpath('//input[@name="__VIEWSTATEGENERATOR"]/@value')[0]
    code = tree.xpath('//img[@id="imgCode"]/@src')[0]
    code_url = "https://so.gushiwen.cn" + code
    # download the captcha inside a session so it stays valid for the login POST
    session = requests.session()
    response_code = session.get(code_url)
    content_code = response_code.content  # use the binary body, since we are downloading an image
    f = open("code.jpg", "wb")  # "wb" writes binary data to the file
    f.write(content_code)
    f.close()
    code_name = input("Enter the captcha: ")
    # submit the login form through the same session
    url_post = "https://www.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fwww.gushiwen.cn%2fuser%2fcollect.aspx"
    data_post = {
        "__VIEWSTATE": viewstate,
        "__VIEWSTATEGENERATOR": viewstategenerator,
        "from": "http://www.gushiwen.cn/user/collect.aspx",
        "email": "17719114890",
        "pwd": "dwq0219423",
        "code": code_name,
        "denglu": "登录"
    }
    response_post = session.post(url=url_post, headers=headers, data=data_post)
    content_post = response_post.text
    f = open("古诗文.html", "w", encoding="utf-8")
    f.write(content_post)
    f.close()
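One way to check that the login actually worked (not in the original) is to reuse the same session on the collection page the form redirects to; if the cookies took effect, the response should be the logged-in page rather than the login form. A rough, hedged sketch:

    # Reuse the logged-in session on a members-only page.
    check = session.get("https://www.gushiwen.cn/user/collect.aspx", headers=headers)
    # Heuristic only: the login form posts a "denglu" field, so seeing that name
    # again in the returned HTML suggests we were bounced back to the login page.
    print("login ok" if "denglu" not in check.text else "login may have failed")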

Reposted from: https://blog.csdn.net/2401_83283514/article/details/144318694
Copyright belongs to the original author, 人间无解. If there is any infringement, please contact us for removal.
