0


数据采集:selenium 获取 CDN 厂家各省市节点 IP

写在前面


  • 工作需要遇到,简单整理
  • 理解不足小伙伴帮忙指正

** 对每个人而言,真正的职责只有一个:找到自我。然后在心中坚守其一生,全心全意,永不停息。所有其它的路都是不完整的,是人的逃避方式,是对大众理想的懦弱回归,是随波逐流,是对内心的恐惧 ——赫尔曼·黑塞《德米安》**


逻辑相对简单,主要通过 站长之家

https://cdn.chinaz.com/

,获取全国省市的 CDN节点 IP 信息

采集流程:

  1. 获取CDN 厂家信息

在这里插入图片描述

  1. 跳转页面到指定的厂家,择需要获取的省份

在这里插入图片描述

  1. 获取当前页IP,循环处理分页数据

在这里插入图片描述

  1. 处理完当前省份,循环跳转其他省份处理
  2. 处理完当前厂家,循环处理其他厂家

代码:

#!/usr/bin/env python# -*- encoding: utf-8 -*-"""
@File    :   cdn_data_dns.py
@Time    :   2023/08/21 21:46:47
@Author  :   Li Ruilong
@Version :   1.0
@Contact :   [email protected]
@Desc    :   省市CDN 节点IP数据获取
"""# here put the import libfrom seleniumwire import webdriver
import json
import time
from selenium.webdriver.common.by import By
import pandas as pd
import re

ip_pattern =r"\b(?:\d{1,3}\.){3}\d{1,3}\b"# 自动登陆
driver = webdriver.Chrome()withopen('C:\\Users\山河已无恙\\Documents\GitHub\\reptile_demo\\demo\\cookie.txt','r', encoding='u8')as f:
    cookies = json.load(f)

driver.get('https://cdn.chinaz.com/')for cookie in cookies:
    driver.add_cookie(cookie)

driver.get('https://cdn.chinaz.com/')

time.sleep(6)#CND 商家排行获取 https://cdn.chinaz.com/
CDN_Manufacturer =[]
new_div_element = driver.find_element(By.CSS_SELECTOR,".toplist-main")
div_elements = new_div_element.find_element(By.CSS_SELECTOR,".ullist")
div_cdn = div_elements.find_elements(By.XPATH,"//a[contains(@href,'server')]")#CDN_Manufacturer.extend(div_elements)

current_window_1 = driver.current_window_handle
for i,mdn_ms inenumerate(div_cdn):try:#driver.execute_script("arguments[0].click();", mdn_ms)
        ip_addresse =[]print(mdn_ms.text)
    
        cloud_cdn_name = mdn_ms.text
        mdn_ms.click()
        time.sleep(2)
        driver.switch_to.window(driver.window_handles[-1])# 滚动到页面底部
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2)")
        time.sleep(5)
        areas_list =["安徽","河北","河南","湖北","湖南","江西","陕西","山西","四川","重庆"]for a in areas_list:              
            areas =  driver.find_element(By.CSS_SELECTOR,"#areas")
            nmg =  areas.find_element(By.XPATH,"//a/font[contains(text(),'"+a+"')]")
            nmg.click()
            time.sleep(2)
            new_div_element = driver.find_element(By.CSS_SELECTOR,".box")
            new_table_element =str(new_div_element.text).split("\n")
            ip_addresses = re.findall(ip_pattern,str(new_table_element))
            ip_addresse.extend(ip_addresses)iflen(driver.find_elements(By.XPATH,"//a[contains(@title, '尾页')]"))<2:#driver.close() #driver.switch_to.window(current_window_1)
                ips ={}
                ips[cloud_cdn_name]= ip_addresse
                df = pd.DataFrame(ips)
                df.to_csv('CDN_M_省份_'+a +'_'+cloud_cdn_name+'.csv', index=False)print("单页数据,数据已保存为CSV文件",'CDN_M_'+a +'_'+cloud_cdn_name+'.csv')continue
            sum_page = driver.find_element(By.XPATH,"//a[contains(@title, '尾页')]")
            attribute_value = sum_page.get_attribute('val')print(attribute_value)
            current_window_2 = driver.current_window_handle
            for page inrange(1,int(attribute_value)):try:
                    next_page = driver.find_element(By.XPATH,"//a[contains(@title, '下一页')]")
                    next_page.click()
                    time.sleep(5)
                    new_div_element = driver.find_element(By.CSS_SELECTOR,".box")
                    new_table_element =str(new_div_element.text).split("\n")
                    ip_addresses = re.findall(ip_pattern,str(new_table_element))
                    ip_addresse.extend(ip_addresses)except:print(a,cloud_cdn_name,"没有IP")
                    time.sleep(5)passcontinue    
            ips ={}
            ips[cloud_cdn_name]= ip_addresse
            df = pd.DataFrame(ips)
            df.to_csv('CDN_M_省份_'+a+'_'+cloud_cdn_name+'.csv', index=False)print("数据已保存为CSV文件",'  CDN_M_省份_'+a+'_'+cloud_cdn_name+'.csv')except:print(cloud_cdn_name,"没有IP")passcontinuefinally:pass
        driver.close() 
        driver.switch_to.window(current_window_1)continue

博文部分内容参考

© 文中涉及参考链接内容版权归原作者所有,如有侵权请告知



© 2018-2023 liruilonger@gmail.com, All rights reserved. 保持署名-非商用-相同方式共享(CC BY-NC-SA 4.0)


本文转载自: https://blog.csdn.net/sanhewuyang/article/details/132473916
版权归原作者 山河已无恙 所有, 如有侵权,请联系我们删除。

“数据采集:selenium 获取 CDN 厂家各省市节点 IP”的评论:

还没有评论