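2023 Python source code for scraping book information from Dangdang (当当网), using the selenium library for dynamic scraping! Everyone is welcome to learn and discuss!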

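This post is a small upgrade over my earlier scraper built on the requests library. The code was actually written back in July; a reader recently asked whether I had the source, and since I still had it, I am posting it here. It is not especially well written, just good enough to scrape the basics (please be kind). It collects more than the earlier version (for example "评论数量" (review count), "好评数" (positive reviews), "中评数" (neutral reviews) and "差评数" (negative reviews)), but it is slower and more likely to get your IP blocked or trigger a CAPTCHA. I recommend scraping at most 5 pages per run, or adding more measures against blocking!!!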

Note!!! Before running it, make sure you are using selenium version 3.141.0!!!

Installation: **pip install selenium==3.141.0**. Other versions may work, but versions 4.0 and above are not supported.
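
A likely reason for the version pin, stated here as an assumption rather than something the post confirms: the script calls `find_element_by_css_selector`, one of the `find_element_by_*` helpers that Selenium 4 deprecated and later removed. The generic `find_element(by, value)` call exists in both 3.141.0 and 4.x, so a lookup written that way depends less on the installed version. In the minimal sketch below, `find_by_css` and `FakeDriver` are hypothetical names used only for illustration, and the locator string is spelled out so the example runs without a browser; in a real script you would pass `By.CSS_SELECTOR` from `selenium.webdriver.common.by`.

```python
# In selenium, By.CSS_SELECTOR is the plain string "css selector". It is written
# out here only so this sketch runs without selenium installed.
CSS_SELECTOR = "css selector"


def find_by_css(driver, selector):
    """Locate one element with the generic find_element(by, value) call."""
    return driver.find_element(CSS_SELECTOR, selector)


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, used only to exercise the helper."""

    def find_element(self, by, value):
        return f"<element located by {by!r}: {value}>"


# Usage with the stand-in driver; a real WebDriver would return a WebElement.
print(find_by_css(FakeDriver(), "li[id='comment_tab']"))
```

With a real driver the equivalent call is `driver.find_element(By.CSS_SELECTOR, "li[id='comment_tab']")`.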

If you see this error: ValueError: Timeout value connect was <object object at 0x0000019A00694540>, but it must be an int, float or None.

**Fix:** pip install urllib3==1.26.2

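If you run into other errors, the cause may be the network. Adding a pause such as time.sleep(3) at the failing step can help (this needs `import time`, which the script does not import by default).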
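
The pause-and-retry idea can be sketched as a small helper. This is a minimal sketch, not part of the original script: `retry_with_pause` and its parameters are hypothetical names, the 3-second base delay simply mirrors the `time.sleep(3)` suggestion, and the random jitter is an added assumption meant to make requests look less regular.

```python
import random
import time


def retry_with_pause(action, attempts=3, base_delay=3.0, jitter=1.0, sleep=time.sleep):
    """Call action(); after a failure, pause for base_delay plus jitter and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose for a sketch; narrow it in real code
            last_error = error
            if attempt < attempts:
                # Wait between base_delay and base_delay + jitter seconds
                sleep(base_delay + random.uniform(0, jitter))
    # Every attempt failed: surface the last error to the caller
    raise last_error
```

In the scraper this could wrap a flaky step, for example `retry_with_pause(lambda: driver.get(url))`. The `sleep` parameter exists only so the pause can be swapped out when testing.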

Without further ado, here is the source code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import csv
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

# Parse one product entry from the search results and its detail page
def get_product_info(driver, div):
    # Read the title and price from the search-result item
    name = div.find("p", class_="name").get_text().strip()
    price = div.find("span", class_="search_now_price").get_text().strip()
    print("Title: " + name)
    print("Price: " + price)

    isbn_info = div.find("p", class_="search_book_author")
    if isbn_info is not None:
        spans = isbn_info.find_all("span")
        author = spans[0].find("a").get_text().strip().replace("/", "")
        publisher = spans[2].find("a").get_text().strip().replace("/", "")
        publish_date = spans[1].get_text().strip().replace("/", "")
    else:
        author = div.find("a", class_="search_book_author").get_text().strip()
        publisher = div.find_all("a", class_="search_book_author")[1].get_text().strip()
        publish_date = div.find_all("span", class_="search_book_author")[1].get_text().strip()

    print("Author: " + author)
    print("Publisher: " + publisher)
    print("Publication year: " + publish_date)

    link = div.find("p", class_="name").find("a").get("href")
    if "http" not in link:
        link = "https:" + link
   
    # Open the link in a new tab; record the handles first so the check below is meaningful
    handles_before = driver.window_handles
    driver.execute_script('window.open(arguments[0], "_blank");', link)
    windows = driver.window_handles

    # Check that the new tab actually opened
    if len(windows) == len(handles_before):
        print("Could not open a new tab")
        return None
    driver.switch_to.window(windows[-1])
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "li[id='comment_tab']")))
    # Click the "店铺评价" (store reviews) tab, then wait for the active filter to appear.
    # find_element(By.CSS_SELECTOR, ...) is available in both selenium 3.141.0 and 4.x.
    remark_link = driver.find_element(By.CSS_SELECTOR, "li[id='comment_tab']")
    driver.execute_script("arguments[0].click();", remark_link)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.on")))
   
    remark_page_soup = BeautifulSoup(driver.page_source, "html.parser")

    # Each filter label carries its count inside full-width parentheses （…）
    on_span = remark_page_soup.find("span", class_="on")
    comment_num = on_span.get_text().split("（")[1].split("）")[0] \
        if on_span is not None else "not found"
    print("Review count: " + comment_num)
    # good_count always gets a value, so building the result dict cannot raise NameError
    good_span = remark_page_soup.find("span", {"data-type": "2"})
    good_count = good_span.get_text().split("（")[1].split("）")[0] \
        if good_span is not None else "not found"
    print("Positive reviews: " + good_count)
    common_span = remark_page_soup.find("span", {"data-type": "3"})
    common_comment = common_span.get_text().split("（")[1].split("）")[0] \
        if common_span is not None else "not found"
    print("Neutral reviews: " + common_comment)
    bad_span = remark_page_soup.find("span", {"data-type": "4"})
    bad_comment = bad_span.get_text().split("（")[1].split("）")[0] \
        if bad_span is not None else "not found"
    print("Negative reviews: " + bad_comment)

    driver.close()
    driver.switch_to.window(windows[0])

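    # Keys are the Chinese CSV column names, in order: title, price, author, publisher,
    # publication year, review count, positive reviews, neutral reviews, negative reviews
    info = {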
        "书名": name,
        "价格": price,
        "作者": author,
        "出版社": publisher,
        "出版年份": publish_date,
        "评论数量": comment_num,
        "好评数": good_count,
        "中评数": common_comment,
        "差评数": bad_comment,
    }
    return info

# Use Chrome in headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--window-size=1920,1080")
# No inner quotes around the value: they would become part of the user-agent string
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

driver = webdriver.Chrome(options=chrome_options)
data = []
page_num = 1
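total_pages = 3  # total number of result pages to scrape
while page_num <= total_pages:
    # URL you land on after searching for a book title, plus &page_index={page_num};
    # the key here is "人工智能" (artificial intelligence), percent-encoded as GBK
    url = f"http://search.dangdang.com/?key=%C8%CB%B9%A4%D6%C7%C4%DC&act=input&page_index={page_num}"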
    driver.get(url)

    html_content = driver.page_source
    soup = BeautifulSoup(html_content, "html.parser")

    # Raw string so \d is not treated as an invalid string escape
    div_list = soup.find_all("li", class_=re.compile(r"line\d+"))
    for div in div_list:
        info = get_product_info(driver, div)
        if info is not None:
            data.append(info)

    page_num += 1

# Close the browser
driver.quit()

# Save the data to a CSV file
filename = "book_info3.csv"
fields = ["书名", "价格", "作者", "出版社", "出版年份", "评论数量", "好评数", "中评数", "差评数"]

with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)

print("Data saved to", filename)
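
Two parts of the script lend themselves to small helpers, sketched here under stated assumptions. First, the search key in the URL is the GBK percent-encoding of "人工智能", so a URL for another keyword can be built with `urllib.parse.quote` and `encoding="gbk"`; that other keywords behave the same way on the site is an assumption. Second, the repeated `split("（")[1].split("）")[0]` expression can be replaced by one regular-expression helper that falls back to a default instead of raising when the parentheses are missing. `build_search_url` and `extract_count` are hypothetical names, not part of the original code.

```python
import re
from urllib.parse import quote


def build_search_url(keyword, page_num):
    """Build a search URL shaped like the one in the script, GBK-encoding the keyword."""
    key = quote(keyword, encoding="gbk")
    return f"http://search.dangdang.com/?key={key}&act=input&page_index={page_num}"


def extract_count(text, default="not found"):
    """Return the text between full-width parentheses, or default if there is none."""
    match = re.search(r"（(.*?)）", text or "")
    return match.group(1) if match else default


# Reproduces the URL hard-coded in the script for page 1
print(build_search_url("人工智能", 1))
# Pulls the count out of a label shaped like the ones on the review tab
print(extract_count("全部（123）"))
```

Inside `get_product_info`, a call such as `extract_count(span.get_text())` could then stand in for each chained `split`, with a check for a missing span kept as it is.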

The result looks like this (screenshot in the original post):

Thanks for reading, everyone. See you next time!!!

Tags: python selenium learning

This article is reposted from: https://blog.csdn.net/qq_53862860/article/details/134889828
Copyright belongs to the original author, 一只因摆大学生. In case of infringement, please contact us for removal.
