1. If you get requests.exceptions.SSLError: HTTPSConnectionPool(host='XXX', port=443), add verify=False to the request to skip certificate verification:
response = requests.get(url=url, headers=headers, verify=False)
2. Error 405 Not Allowed (nginx): check whether the endpoint expects POST or GET, and whether the URL is correct.
3. Single quotes in scraped data will break hand-built SQL statements; escape them or, better, use parameterized queries.
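A minimal sketch of the safer fix: let the database driver quote values instead of formatting SQL strings by hand. It uses stdlib sqlite3 here; the table and column names are made up for illustration.

```python
# Parameterized query: the driver escapes the single quote for us.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (title TEXT)")

title = "It's a test"  # contains a single quote that would break naive SQL
conn.execute("INSERT INTO news (title) VALUES (?)", (title,))  # placeholder, not f-string

row = conn.execute("SELECT title FROM news").fetchone()
print(row[0])  # It's a test
```

The same placeholder style works with pymysql (`%s` instead of `?`).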
4. AttributeError: the attribute you are accessing does not exist on that object (e.g. a lookup returned None).
5. IndexError: the index is out of range. Solution: inspect the expression with the debugger's Evaluate feature to find where the result is shorter than expected.
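A minimal sketch of guarding against both errors when pulling values out of parsed results; the `items` data and field names are hypothetical.

```python
items = [{"title": "foo"}]

# IndexError guard: check the length before indexing.
first = items[0] if len(items) > 0 else None

# AttributeError-style guard: dict.get returns a default instead of raising.
title = first.get("title", "") if first is not None else ""
date = first.get("date", "") if first is not None else ""  # missing key -> ""

print(title, repr(date))
```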
6. Converting dates for comparison: the date column fetched from the database already comes back as datetime.datetime:
result = cursor.fetchall()
result = result[0][0]
A date string scraped from the page must be parsed into datetime.datetime as well (strip it first in case of stray whitespace):
ddate = datetime.datetime.strptime(reportTime.strip(), '%Y-%m-%d')
return ddate > result, result
Once both sides are datetime.datetime they can be compared directly.
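A runnable sketch of the comparison above; the database value is simulated here instead of coming from cursor.fetchall().

```python
import datetime

db_date = datetime.datetime(2024, 1, 1)   # what fetchall()[0][0] would hold
reportTime = "2024-03-15 "                # scraped string, note the stray space

ddate = datetime.datetime.strptime(reportTime.strip(), "%Y-%m-%d")
print(ddate > db_date)  # True: both sides are datetime.datetime, so > works
```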
7. If the page content is a file download and the downloaded file is corrupt and will not open, the URL is probably wrong.
8. The image path is correct but the page does not display it: prefix the src="" value with the weserv image proxy, e.g.
src="//images.weserv.nl/?url=https://mmbiz.qpic.cn/m
If the path is stored in data-src (lazy loading), rename data-src to src.
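A minimal sketch of both fixes in one pass: rename lazy-load data-src attributes to src, then route qpic.cn images through the weserv proxy. The sample HTML is made up.

```python
import re

html = '<img data-src="https://mmbiz.qpic.cn/m/pic.jpg">'

html = html.replace("data-src=", "src=")                 # lazy-load fix
html = re.sub(r'src="(https://mmbiz\.qpic\.cn/[^"]+)"',  # proxy fix
              r'src="//images.weserv.nl/?url=\1"', html)

print(html)  # <img src="//images.weserv.nl/?url=https://mmbiz.qpic.cn/m/pic.jpg">
```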
9. If an API request fails, first check that the status code is 200.
10. Error 304 (Not Modified): remove If-None-Match from the headers; the server is answering the conditional request with "use your cache" instead of a full body.
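A minimal sketch: drop the conditional-request headers (copied over from the browser's dev tools) so the server returns a full 200 instead of 304. The header values are made up.

```python
headers = {
    "User-Agent": "Mozilla/5.0",
    "If-None-Match": '"abc123"',  # triggers 304 when the ETag still matches
    "If-Modified-Since": "Mon, 01 Jan 2024 00:00:00 GMT",
}
headers.pop("If-None-Match", None)      # None default: no KeyError if absent
headers.pop("If-Modified-Since", None)
print(sorted(headers))  # ['User-Agent']
```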
11. When scraping an API you get
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This means the body handed to .json() is not valid JSON.
Solution: remove callback from params — a callback parameter makes the server wrap the JSON in a JSONP function call.
Alternatively, print the response before calling .json(); the status may be 504:
uurl = 'https://flk.npc.gov.cn/api/detail'
new_data = requests.post(url=uurl, data=data, headers=headers, cookies=cookies)
if new_data.status_code == 504:
    print("Error 504: Gateway Timeout")
if new_data.status_code == 200:
    new_data = new_data.json()
    download_url = 'https://wb.flk.npc.gov.cn' + new_data['result']['body'][0]['path']
    text_p, text, attachment_url = "", "", ""
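If the callback parameter cannot simply be dropped, the JSONP wrapper can be stripped by hand before parsing; the sample body below is made up.

```python
import json
import re

body = 'jQuery123_456({"result": {"status": "ok"}});'  # JSONP: callback(<json>);

match = re.match(r'^[\w.$]+\((.*)\);?\s*$', body, re.S)  # capture inside callback(...)
payload = json.loads(match.group(1)) if match else json.loads(body)
print(payload["result"]["status"])  # ok
```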
12. If you get no data back, the request headers may be missing the cookie.
13. If the fetched data is garbled:
leave Accept-Encoding out of headers, and set the encoding explicitly:
html = requests.get(url=url, headers=headers)
html.encoding = 'utf-8'  # fixes garbled text
Or add:
# suppress the InsecureRequestWarning emitted when verify=False
requests.packages.urllib3.disable_warnings()
# map characters outside the Basic Multilingual Plane to U+FFFD so printing cannot fail
import sys
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
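A runnable sketch of using that translation table: characters beyond the BMP (e.g. emoji) are replaced with U+FFFD so consoles that cannot render them still print the string.

```python
import sys

# ord -> replacement codepoint for everything above U+FFFF
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

text = "report \U0001F600 done"      # contains a non-BMP emoji
safe = text.translate(non_bmp_map)   # emoji becomes the replacement char
print(safe)
```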
14. When scraping with the selenium library, matching a driver to recent Chrome versions is covered here:
https://www.cnblogs.com/zsg88/p/18225736
15. Fixing the error
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host'))
Add a timeout (and verify=False if the site also has certificate issues):
response = requests.get(url=target, headers=headers, timeout=30, verify=False)
16. Fixing "Max retries exceeded with url":
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.hunan.gov.cn', port=443): Max retries exceeded with url:
Add 'Connection': 'close' to the headers so each request uses a fresh connection:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    'Connection': 'close'
}
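An alternative sketch: instead of closing every connection, mount an adapter that retries transient failures with backoff. This uses the public requests/urllib3 retry API; the retry counts are arbitrary.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)
# session.get(url, timeout=30) would now retry up to 3 times before raising
print(adapter.max_retries.total)  # 3
```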
17. If the response contains <![CDATA[ sections, xpath cannot read the values inside them; strip the CDATA wrapper before parsing. (Reference post: "When a Python crawler gets <![CDATA[ in the returned data and xpath cannot extract the value: analysis and solution".)
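A minimal sketch of stripping the CDATA wrapper so the text inside becomes plain node content; the sample XML is made up. (If the CDATA content itself contains < or &, those would need escaping after the strip.)

```python
import re

body = "<item><title><![CDATA[Hello world]]></title></item>"
clean = re.sub(r"<!\[CDATA\[(.*?)\]\]>", r"\1", body, flags=re.S)
print(clean)  # <item><title>Hello world</title></item>
```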
18. Deleting unwanted tags from a soup:
del_tags = dsoup.find_all('script')  # remove every matching tag
for del_tag in del_tags:
    del_tag.extract()
style_tag = dsoup.find('style')      # remove a single tag (find may return None)
if style_tag is not None:
    style_tag.extract()
19. If the response is 412 (Precondition Failed), fetch a fresh cookie dynamically:
from selenium import webdriver

def getCookie():
    # use the Firefox browser driver
    driver = webdriver.Firefox()
    # open the page
    driver.get("https://www.hubei.gov.cn/xxgk/gz/index.shtml")
    # collect the cookies
    cookies = driver.get_cookies()
    str_cookie = ""
    for index, cookie in enumerate(cookies, start=1):
        str_cookie += cookie["name"] + "=" + cookie["value"] + ";"
        if index == 1:
            # site-specific extra cookie observed in the browser
            str_cookie += "7d0f4f97e8317b129e=3601cf1eb5196db6546ba2733010c134; "
    driver.delete_all_cookies()
    # close the browser
    driver.quit()
    return str_cookie

Then use the result in the headers, e.g. headers['Cookie'] = getCookie().
20. If you get requests.exceptions.SSLError: [SSL: SSL_NEGATIVE_LENGTH] dh key too small (_ssl.c:600), or
requests.exceptions.SSLError: HTTPSConnectionPool(host='abc.def.edu.cn', port=443): Max retries exceeded with url: ... (Caused by SSLError(SSLError(1, '[SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:997)')))
requests.exceptions.SSLError: HTTPSConnectionPool(host='myhost.com', port=443): Max retries exceeded with url: myurl (Caused by SSLError(SSLError(1, '[SSL: WRONG_SIGNATURE_TYPE] wrong signature type (_ssl.c:1108)')))
Solution: lower the OpenSSL security level before making requests (note: DEFAULT_CIPHERS was removed in urllib3 2.x, so this only works with urllib3 1.x):
# work around the SSL error
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'DEFAULT:@SECLEVEL=1'
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'
21. With selenium, avoid blocking on slow pages by not waiting for the full page load:
# do not wait for the page to finish loading
desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "none"
(In Selenium 4, where DesiredCapabilities is deprecated, the equivalent is setting page_load_strategy = "none" on a ChromeOptions object.)
22. Using selenium when the page requires an image captcha, recognize it with ddddocr:
Imports:
import ddddocr
from PIL import Image  # for opening and processing the image
from selenium import webdriver

image = self.get_pictures(href)  # own helper that captures the captcha image
ocr = ddddocr.DdddOcr()
text = ocr.classification(image)
print("Recognized captcha:", text)
Copyright belongs to the original author, pumpkin0_0. If there is any infringement, please contact us for removal.