[Python] A Simple Web Scraper

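Goal: scrape the faculty directory of a university website and extract each professor's research interests.
The pages are served as plain text, so scraping them takes little effort, and the site normally gets very little traffic, which makes it a good target for practice.
The code is below; the comments explain each step.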

import time

import requests
from bs4 import BeautifulSoup

# Dictionary of browser-like request headers. Imitating a browser gets past
# some simple anti-scraping checks.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    # Add more headers if needed, for example:
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # 'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    # 'DNT': '1',  # Do Not Track: the user does not want to be tracked
    # 'Connection': 'keep-alive',
    # 'Upgrade-Insecure-Requests': '1',
    # If you need to send cookies, set the 'Cookie' field here:
    # 'Cookie': 'your_cookie_here',
}


# Fetch one teacher's profile
def request_teacher_info(url):
    # Send a GET request with the headers above
    response = requests.get(url, headers=headers)
    # Make sure the request succeeded before decoding the body
    if response.status_code == 200:
        content = response.content.decode('utf-8')
        soup = BeautifulSoup(content, 'html.parser')
        # Find the div with class "infobox"
        infobox_div = soup.find('div', class_='infobox')
        # Find the span with class "h_info" whose text is the label "姓名：" (name)
        name_span = infobox_div.find('span', class_='h_info', string='姓名：')
        # The next sibling span holds the actual name
        name = name_span.find_next_sibling('span').get_text(strip=True)
        # Research interests ("研究方向：") work the same way, and so do other
        # fields such as email or publications
        research_direction_span = infobox_div.find('span', class_='h_info', string='研究方向：')
        research_direction = research_direction_span.find_next_sibling('span').get_text(strip=True)
        print(f"{name}  Research interests: {research_direction}")
    else:
        print(f"Request failed with status code: {response.status_code}")


# Fetch the teacher list
def request_teacher_list(url):
    # Send a GET request with the headers above
    response = requests.get(url, headers=headers)
    link_list = []
    # Make sure the request succeeded
    if response.status_code == 200:
        content = response.content.decode('utf-8')
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(content, 'html.parser')

        right_list_r = soup.find('div', class_='right-list r')
        # One "teacher-list" div per job title
        teacher_lists = right_list_r.find_all('div', class_='teacher-list')
        for teacher_list in teacher_lists:
            job_type = teacher_list.find("h2", class_="title l")
            # These prints can be ignored; the important output comes from
            # request_teacher_info()
            print(job_type.get_text(strip=True))
            professor_ul = teacher_list.find_all('ul')[0]
            a_list = professor_ul.find_all('a', href=True)
            for a in a_list:
                link = a['href']
                link_list.append(link)
                print(link)
            print("=" * 50)
    return link_list


link_list1 = request_teacher_list("https://example.com")
for link in link_list1:
    request_teacher_info(link)
    # time.sleep(0.5)  # uncomment to pause between requests
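
One thing the script takes for granted is that every href in the list page is already a full URL, because each link is passed straight to requests.get. If the site uses relative links, that call will fail. The sketch below shows one way to normalize the links first, using only the standard library. The function name resolve_links and the sample URLs are made up for illustration and are not part of the original script.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, dropping duplicates.

    Hypothetical helper, not part of the original script.
    """
    resolved = []
    seen = set()
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones
        # against the page they were found on
        absolute = urljoin(base_url, href.strip())
        # Skip links that are not web pages, such as mailto: or javascript:
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Keep only the first occurrence so each profile is fetched once
        if absolute in seen:
            continue
        seen.add(absolute)
        resolved.append(absolute)
    return resolved


if __name__ == "__main__":
    # Made-up sample data standing in for the hrefs collected from the list page
    sample_hrefs = [
        "../info/1001.htm",
        "https://example.com/info/1002.htm",
        "mailto:someone@example.com",
        "../info/1001.htm",
    ]
    for url in resolve_links("https://example.com/teachers/index.htm", sample_hrefs):
        print(url)
```

In the script above, this would mean wrapping the result of request_teacher_list(list_url) in resolve_links(list_url, ...) before the final loop, where list_url is the address of the list page.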

Tags: python, web scraping, programming languages

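Reposted from: https://blog.csdn.net/bfz_50/article/details/141754654
Copyright belongs to the original author, 百分之50. If this post infringes your rights, please contact us and we will remove it.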
