ChatGPT in Practice: 100 Examples - (04) Automated Web Scraping

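I. Requirements and Approach

Requirement: parsing web page elements by hand is too tedious, so let ChatGPT write the parsing code.

Steps

  • Have ChatGPT write the script
  • Run it with Python

Prerequisite: you have at least heard of the Python scraping libraries requests and bs4.
Never heard of them? Here is the short version:

  • requests is a Python HTTP library, used to fetch web page data.
  • bs4 is the package name of Beautiful Soup 4, an HTML/XML parsing library used to extract information from web page data.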
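To make the parsing half of that pair concrete, here is a minimal sketch of bs4 on its own. The inline HTML string and the example.com links are made up for illustration; in a real scraper the string would come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the text of a fetched page.
html = (
    '<ul>'
    '<li><a href="https://example.com/a">First</a></li>'
    '<li><a href="https://example.com/b">Second</a></li>'
    '</ul>'
)

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; attributes are read like dictionary keys.
links = [
    {"text": a.get_text(strip=True), "href": a["href"]}
    for a in soup.find_all("a")
]
print(links)
```

The same two calls, `find_all` and attribute lookup, are all the generated script in the next section relies on.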

II. The Script

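Question:
Write Python code using the bs4 library to parse the following structured information from this HTML:
icon, category, title, URL, description
Represent the result as JSON, using English field names.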

<div class="url-card io-px-2 col-6 col-2a col-sm-2a col-md-2a col-lg-3a col-xl-5a col-xxl-6a ">
  <div class="url-body default ">
    <a href="/sites/4661.html?menu-id=2331&amp;mininav-id=4543" target="_blank" data-id="4661" data-url="https://murf.ai" class="card no-c mb-4 site-4661" title="Murf AI">
      <div class="card-body url-content d-flex align-items-center">
        <div class="url-img rounded-circle mr-2 d-flex align-items-center justify-content-center">
          <img class="lazy unfancybox loaded" src="https://api.iowen.cn/favicon/murf.ai.png" data-src="https://api.iowen.cn/favicon/murf.ai.png" height="auto" alt="Murf AI" data-was-processed="true">
        </div>
        <div class="url-info flex-fill">
          <div class="text-sm overflowClip_1"><strong>Murf AI</strong></div>
          <p class="overflowClip_1 m-0 text-muted text-xs">
            AI语音生成器在20种语言。120+逼真的文本到语音的声音,创造完美的AI画外音。轻松地从文字转换为语音。</p>
        </div>
      </div>
    </a>
    <a href="/go/?url=aHR0cHM6Ly9tdXJmLmFpLw%3D%3D" target="_blank" rel="external nofollow noopener" title="" class="togo text-center text-muted is-views" data-id="4661" data-toggle="tooltip" data-placement="right" data-original-title="直达"><i class="iconfont icon-goto"></i></a>
  </div>
</div>

Answer:

(in English)
Here’s a Python code to extract the structured information from the HTML and represent it in JSON format:

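import json

import requests
from bs4 import BeautifulSoup

url = "URL of the page"  # placeholder: replace with the real page URL
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

results = []
# Each site on the page is rendered as one <div class="url-card"> block.
for card in soup.find_all("div", class_="url-card"):
    link = card.find("a", class_="card")
    icon = card.find("img")["data-src"]
    # The snippet has no explicit category, so the generated code uses data-id as a stand-in.
    category = link["data-id"]
    title = card.find("strong").text
    website = link["data-url"]
    # strip() removes the whitespace surrounding the description text.
    description = card.find("p", class_="overflowClip_1").text.strip()

    result = {
        "icon": icon,
        "category": category,
        "title": title,
        "website": website,
        "description": description,
    }
    results.append(result)

# ensure_ascii=False keeps non-ASCII text (such as Chinese descriptions) readable.
json_result = json.dumps(results, ensure_ascii=False)
print(json_result)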

This code extracts the icon, category, title, website, and description of each tool listed on the page and saves them in a list of dictionaries. Finally, it converts the list to a JSON string and prints it.
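Before pointing the script at a live page, it can help to try the same parsing logic offline. The following is only a sketch under a few assumptions: `SAMPLE_HTML` is a shortened copy of the card markup from the prompt with a placeholder description, `parse_cards` is a hypothetical helper name, and returning `None` for missing elements is a design choice of this sketch, not something the generated answer does.

```python
import json

from bs4 import BeautifulSoup

# Shortened copy of the card markup from the prompt; the description is a placeholder.
SAMPLE_HTML = """
<div class="url-card io-px-2">
  <a href="/sites/4661.html" data-id="4661" data-url="https://murf.ai"
     class="card no-c mb-4 site-4661" title="Murf AI">
    <img src="https://api.iowen.cn/favicon/murf.ai.png"
         data-src="https://api.iowen.cn/favicon/murf.ai.png" alt="Murf AI">
    <div class="text-sm overflowClip_1"><strong>Murf AI</strong></div>
    <p class="overflowClip_1 m-0 text-muted text-xs">
      AI voice generator (placeholder description)
    </p>
  </a>
</div>
"""


def parse_cards(html):
    """Return one dict per url-card, using None for any field that is missing."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.find_all("div", class_="url-card"):
        link = card.find("a", class_="card")
        img = card.find("img")
        title = card.find("strong")
        desc = card.find("p", class_="overflowClip_1")
        results.append({
            "icon": img.get("data-src") if img else None,
            "category": link.get("data-id") if link else None,
            "title": title.get_text(strip=True) if title else None,
            "website": link.get("data-url") if link else None,
            "description": desc.get_text(strip=True) if desc else None,
        })
    return results


cards = parse_cards(SAMPLE_HTML)
print(json.dumps(cards, ensure_ascii=False, indent=2))
```

Checking each lookup before reading from it means a card that lacks, say, an image does not abort the whole run with a TypeError, which is what indexing into a failed `find` would do.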

III. Deployment

pip install requests
pip install beautifulsoup4

Then just run the script.
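If the output should be kept and not just printed, one option is to write it to a file. This is a minimal sketch, not part of the generated answer: the `save_results` and `load_results` helpers and the `sites.json` filename are hypothetical, and the single record simply mirrors the sample card from the prompt.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record in the shape the scraper produces (values from the sample card).
results = [
    {
        "icon": "https://api.iowen.cn/favicon/murf.ai.png",
        "category": "4661",
        "title": "Murf AI",
        "website": "https://murf.ai",
        "description": "AI语音生成器在20种语言。",
    }
]


def save_results(records, path):
    """Write records to path as UTF-8 JSON, keeping non-ASCII text readable."""
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def load_results(path):
    """Read the JSON file back into Python objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Written to the system temp directory here so the sketch leaves no files in the project.
output_path = Path(tempfile.gettempdir()) / "sites.json"
save_results(results, output_path)
print(load_results(output_path) == results)
```

Passing `encoding="utf-8"` explicitly matters on systems whose default encoding is not UTF-8, since the descriptions on this page are Chinese.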

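IV. Summary

The key is to give ChatGPT the HTML and state the goal. After that, all that is left is to sip some goji berry black tea and wait.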


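This article is reposted from: https://blog.csdn.net/u010764910/article/details/130156177
Copyright belongs to the original author, AI原吾. If there is any infringement, please contact us and we will remove it.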
