1. Finding the API Endpoint
This scrape obtains the API URL through the WeChat Official Account Platform backend. The process is as follows:
WeChat Official Account Platform: https://mp.weixin.qq.com/
- Register an account and log in
- Open the article editing page, click the "Hyperlink" button, then click "Select another official account"
- The official-account search endpoint then appears in the browser's network requests; after selecting an account, paging through its articles reveals the history-article endpoint:
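As a rough sketch, the account-search request captured in that panel typically looks like the following. The `searchbiz` endpoint name and its parameter set are assumptions based on inspecting the page, not something this article confirms; verify them in your own browser's developer tools after logging in.

```python
# Hypothetical sketch of the account-search request seen in the network panel.
# Endpoint name and parameters are assumptions -- confirm in your own devtools.
search_url = "https://mp.weixin.qq.com/cgi-bin/searchbiz"
search_params = {
    "action": "search_biz",
    "begin": "0",                # paging offset
    "count": "5",                # results per page
    "query": "account name to search for",
    "token": "<your token>",     # taken from the logged-in page URL
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
}
```

The response to this request contains the `fakeid` of each matching account, which is what the history-article endpoint below needs.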
2. API Analysis
- History-article endpoint: request analysis
url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
params = {
    "sub": "list",
    "search_field": "null",
    "begin": "0",      # index of the first item; for page N this is (N-1)*count
    "count": "5",      # items per page; the maximum the endpoint returns is 20
    "query": "",
    "fakeid": "MjM5NTE1OTQyMQ==",  # the account's biz id; leave empty to scrape your own history
    "type": "101_1",
    "free_publish_type": "1",
    "sub_action": "list_ex",
    "token": "1338436259",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}
- History-article endpoint: response field parsing. Note the repeated switching between dicts and JSON strings:
response = requests.get(url, params=params, headers=header)
publish_page = response.json()['publish_page']
publish_list = json.loads(publish_page)['publish_list']
for publish in publish_list:
    print(publish)
    publish_info = json.loads(publish['publish_info'])
    # posts published together on the same day are stored as a list under 'appmsgex'
    appmsgex = publish_info['appmsgex']
    for appmsg in appmsgex:
        print(appmsg)
        title = appmsg['title']              # article title
        cover = appmsg['cover']              # cover image
        link = appmsg['link']                # link to the article detail page
        create_time = appmsg['create_time']  # creation time, Unix timestamp in seconds
        print(title, cover, link, create_time)
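Since `create_time` is a Unix timestamp in seconds, it can be converted to a readable date; `format_create_time` below is a small helper added for illustration, not part of the original code:

```python
from datetime import datetime, timezone

def format_create_time(ts: int) -> str:
    """Convert a create_time Unix timestamp (seconds) to a readable UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(format_create_time(1700000000))  # → 2023-11-14 22:13:20
```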
3. Code Implementation
Notes:
- The cookie and token expire, so they need to be refreshed periodically
- Scraping too frequently over a short period gets you blocked; a short ban usually lifts within an hour or two, while a longer one takes about two days
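To reduce the risk of a ban, requests can be paced with a delay, and pages requested using the begin = (page - 1) * count rule from the parameter comments above. A minimal sketch under those assumptions (the `page_params` and `crawl_pages` helpers are additions, not part of the original code):

```python
import time

def page_params(page: int, count: int = 5) -> dict:
    """Build the paging fields for the appmsgpublish request: begin = (page - 1) * count."""
    return {"begin": str((page - 1) * count), "count": str(count)}

def crawl_pages(pages: int, delay_seconds: float = 10.0):
    """Yield per-page parameter sets, sleeping between pages to stay under the rate limit."""
    for page in range(1, pages + 1):
        yield page_params(page)
        if page < pages:
            time.sleep(delay_seconds)

# Example: begin values for the first three pages
print([p["begin"] for p in crawl_pages(3, delay_seconds=0)])  # → ['0', '5', '10']
```

The right delay is a guess; start conservative (tens of seconds) and back off further if requests start failing.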
import json

import requests
from lxml import etree

# token and cookie come from a logged-in session on mp.weixin.qq.com; both expire
token = '1338436259'
cookie = ''  # paste your session cookie here


def get_history_list(fakeid):
    url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
    params = {
        "sub": "list",
        "search_field": "null",
        "begin": "0",      # index of the first item; for page N this is (N-1)*count
        "count": "5",      # items per page; the maximum the endpoint returns is 20
        "query": "",
        "fakeid": fakeid,  # the account's biz id; leave empty to scrape your own history
        "type": "101_1",
        "free_publish_type": "1",
        "sub_action": "list_ex",
        "token": token,
        "lang": "zh_CN",
        "f": "json",
        "ajax": "1"
    }
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36 Edg/129.0.0.0',
        'cookie': cookie
    }
    response = requests.get(url, params=params, headers=header)
    publish_page = response.json()['publish_page']
    publish_list = json.loads(publish_page)['publish_list']
    for publish in publish_list:
        publish_info = json.loads(publish['publish_info'])
        # posts published together on the same day are stored as a list under 'appmsgex'
        appmsgex = publish_info['appmsgex']
        for appmsg in appmsgex:
            title = appmsg['title']              # article title
            cover = appmsg['cover']              # cover image
            link = appmsg['link']                # link to the article detail page
            create_time = appmsg['create_time']  # creation time, Unix timestamp in seconds
            yield title, cover, link, create_time
def get_detail(link):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36 Edg/129.0.0.0'
    }
    detail_response = requests.get(url=link, headers=header)
    detail_response.encoding = 'utf-8'
    detail_html = detail_response.text
    detail_xp = etree.HTML(detail_html)
    # the article body lives under the element with id js_content
    detail_content = "".join(detail_xp.xpath("//*[@id='js_content']//text()"))
    return detail_content
if __name__ == '__main__':
    fakeid = 'MjM5NTE1OTQyMQ=='
    results = get_history_list(fakeid)
    for result in results:
        title, cover, link, create_time = result
        content = get_detail(link)
        print(title, cover, link, create_time, content)
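Instead of only printing, the scraped rows could also be persisted, for example to CSV. This save step is an addition for illustration, not part of the original script:

```python
import csv

def save_articles(rows, path="articles.csv"):
    """Write (title, cover, link, create_time, content) tuples to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "cover", "link", "create_time", "content"])
        writer.writerows(rows)

save_articles([("Example title", "cover.jpg", "https://example.com", 1700000000, "body text")])
```

The `utf-8-sig` encoding keeps Chinese titles readable when the file is opened in Excel.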
Results:
Reposted from: https://blog.csdn.net/qq_44780372/article/details/143222205
Copyright belongs to the original author, TU不秃头. If there is any infringement, please contact us for removal.