

Scraping Job Listings from BOSS Zhipin (BOSS直聘) with Python and Selenium


Introduction

The job market is tough right now, and many applicants know little about the industry behind the positions they want to apply for. A visual analysis of the listings on a recruitment site would give an intuitive picture of a role, and the prerequisite for any data analysis is acquiring the data. This article shows how to use Python and the Selenium library to scrape job listings from a recruitment website and store them in a CSV file, using the position "前端工程师" (front-end engineer) as the running example.
Sample output:
[screenshot of the scraped results]

Environment Setup

Before you start, make sure the following components are installed in your development environment:

  • Python 3.x
  • selenium~=4.24.0
  • Chrome WebDriver (chromedriver)
  • pandas~=2.2.3
  • the csv module (part of the Python standard library)

I use Google Chrome here.
[screenshot of the author's Chrome version page]
Install the driver that matches your own browser and its version.
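To confirm the Python-side dependencies are in place, a quick check like this minimal sketch works (the version pins match the list above; install with pip if anything is missing):

```python
# verify the two third-party packages; if missing, install them with
#   pip install "selenium~=4.24.0" "pandas~=2.2.3"
import selenium
import pandas

print(selenium.__version__)  # expect 4.24.x
print(pandas.__version__)    # expect 2.2.x
```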

Page Analysis

The first page to scrape.
Site URL:

https://www.zhipin.com/web/geek/job?

Parameters:

  • query=前端工程师: the search keyword
  • city=101280600: the city code for Shenzhen
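Filling the template with these two parameters produces the URLs the crawler will visit; a quick sketch using the same template string as the code below:

```python
# expand the URL template for page 1 of the "前端工程师" search in Shenzhen
spiderUrl = "https://www.zhipin.com/web/geek/job?query=%s&city=101280600&page=%s"
print(spiderUrl % ("前端工程师", 1))
# https://www.zhipin.com/web/geek/job?query=前端工程师&city=101280600&page=1
```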
The job list below the search bar is the information we want to scrape. Right-click the page and choose "检查" (Inspect) to open the developer tools, find the corresponding HTML, and from there you can work out the element positions of all the other fields.
[screenshot of the developer tools showing the job-list markup]

Code Walkthrough

1. Import the required libraries

```python
import csv
import json
import os
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
```

This imports everything the crawler needs: csv, json, os, and time from the standard library, plus the third-party pandas and Selenium packages.

2. Define the spider class

```python
class spider(object):
    def __init__(self, type, page):
        self.type = type
        self.page = page
        self.spiderUrl = "https://www.zhipin.com/web/geek/job?query=%s&city=101280600&page=%s"
```
The spider class is the core of the crawler. It takes the job type and the starting page as parameters and initializes the URL template to scrape. Note that the page parameter is the page number.
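For reference, the entry point at the end of the article instantiates the class like this:

```python
# scrape "前端工程师" listings starting from page 1
spiderOpj = spider("前端工程师", 1)
```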

3. Launch the browser

```python
def startBrowser(self):
    service = Service('./chromedriver.exe')
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    browser = webdriver.Chrome(service=service, options=options)
    return browser
```
The startBrowser method launches Chrome, the browser instance through which Selenium performs every page operation. The excludeSwitches option suppresses the "Chrome is being controlled by automated test software" infobar.
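As an aside, Selenium 4.6 and later ship with Selenium Manager, which resolves a matching chromedriver automatically. If you would rather not manage ./chromedriver.exe by hand, the method could be simplified to something like this sketch (a variant, not the article's original code):

```python
# variant relying on Selenium Manager (Selenium >= 4.6): no explicit driver path
def startBrowser(self):
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    return webdriver.Chrome(options=options)
```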

4. Main scraping logic

```python
def main(self, page):
    ...  # some code omitted (browser launch, page load, scrolling) ...
    job_List = browser.find_elements(
        by=By.XPATH,
        value='//div[@class="search-job-result"]/ul[@class="job-list-box"]/li')
    for index, job in enumerate(job_List):
        try:
            ...  # extract the fields of one job listing ...
        except Exception as e:
            print(e)
    self.page += 1
    browser.quit()
    self.main(self.page)
```
The main method is the heart of the crawler: it visits each page of results, extracts the details of every listing, and recursively calls itself to scrape the next page. (The complete code is at the end of the article.)
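One detail from the omitted part is worth calling out: before collecting the list, main scrolls to the bottom of the page in small steps, presumably to trigger lazy loading of the job cards. The snippet below is taken from the full code at the end of the article:

```python
# scroll down 100px every 100ms until the bottom is reached, so that
# lazily rendered job cards are loaded into the DOM before we query them
browser.execute_script("""var totalHeight = 0;
    var distance = 100;
    var interval = setInterval(function(){
        var scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;
        if(totalHeight >= scrollHeight){
            clearInterval(interval);
        }
    }, 100);
""")
```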

5. Extract the job details

Inside the main method, we use XPath to locate page elements and extract each job's title, location, salary, and other fields.
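For example, the title and the raw salary string of one job card are read like this (both XPath expressions come from the full code below):

```python
# job title text inside the card
title = job.find_element(
    by=By.XPATH,
    value=".//div[@class='job-title clearfix']/span[@class='job-name']").text

# raw salary text, e.g. "15-25K·13薪" or "150-200元/天"
salaries = job.find_element(by=By.XPATH, value='.//div[1]/a/div[2]/span').text
```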

6. Save the data to CSV

```python
def save_to_csv(self, rowData):
    with open('./temp.csv', 'a', newline='', encoding='utf-8') as wf:
        writer = csv.writer(wf)
        writer.writerow(rowData)
```
The save_to_csv method appends one extracted job listing as a row to the CSV file.
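Because the file is opened in append mode, each listing is persisted as soon as it has been scraped, so a crash mid-run loses at most the current job. In main the call appears at the end of the extraction block:

```python
# persist the assembled row immediately after scraping one job card
self.save_to_csv(jobData)
```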

7. Initialize the CSV file

```python
def init(self):
    if not os.path.exists('./temp.csv'):
        # newline='' prevents blank rows on Windows
        with open('./temp.csv', 'w', newline='', encoding='utf-8') as wf:
            writer = csv.writer(wf)
            writer.writerow(["title", "address", "type", ...])  # full column list in the complete code
```
The init method creates the CSV file when the program starts and writes the header row with the column names (the full list of columns appears in the complete code below).

8. Clean and tidy the CSV data

```python
def clear_csv(self):
    df = pd.read_csv('./temp.csv')
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    df['salaryMonth'] = df['salaryMonth'].map(lambda x: x.replace('薪', ''))
    return df.values
```
The clear_csv method cleans the collected data: it drops empty and duplicate rows and normalizes the salaryMonth field by stripping the trailing "薪" (e.g. "13薪", thirteen months of salary per year, becomes "13").
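If you want the cleaned data back on disk rather than as an array, a small follow-up along these lines would work (jobs_clean.csv is a hypothetical output name, not part of the original code):

```python
# sketch: write the cleaned data to a new CSV instead of returning df.values
df = pd.read_csv('./temp.csv')
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df['salaryMonth'] = df['salaryMonth'].map(lambda x: x.replace('薪', ''))
df.to_csv('./jobs_clean.csv', index=False)  # hypothetical path
```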

9. Full code

```python
import csv
import json
import os
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class spider(object):
    def __init__(self, type, page):
        self.type = type
        self.page = page
        self.spiderUrl = "https://www.zhipin.com/web/geek/job?query=%s&city=101280600&page=%s"

    def startBrowser(self):
        service = Service('./chromedriver.exe')
        options = webdriver.ChromeOptions()
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        # options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
        browser = webdriver.Chrome(service=service, options=options)
        return browser

    def main(self, page):
        if self.page > 2:  # stop condition: scrape the first two pages only
            return True
        browser = self.startBrowser()
        print("Scraping page: " + self.spiderUrl % (self.type, self.page))
        browser.get(self.spiderUrl % (self.type, self.page))
        time.sleep(30)
        # scroll down in 100px steps so lazily loaded job cards render
        browser.execute_script("""var totalHeight = 0;
            var distance = 100;
            var interval = setInterval(function(){
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if(totalHeight >= scrollHeight){
                    clearInterval(interval);
                }
            }, 100);
        """)
        time.sleep(10)
        # print(browser.page_source)
        job_List = browser.find_elements(by=By.XPATH,
                                         value='//div[@class="search-job-result"]/ul[@class="job-list-box"]/li')
        print(len(job_List))
        for index, job in enumerate(job_List):
            try:
                jobData = []
                print('Scraping job listing no. %s' % (index + 1))
                # title
                title = job.find_element(by=By.XPATH,
                                         value=".//div[@class='job-title clearfix']/span[@class='job-name']").text
                print(title)
                # address and district, e.g. "深圳·南山区"
                addresses = job.find_element(by=By.XPATH,
                                             value='.//div[@class="job-card-body clearfix"]/a/div/span[2]').text.split('·')
                address = addresses[0]
                # dist
                if len(addresses) != 1:
                    dist = addresses[1]
                else:
                    dist = ""
                print(address, dist)
                # type
                type = self.type
                print(type)
                # work experience and education requirements
                tag_list = job.find_elements(by=By.XPATH,
                                             value='.//div[1]/a/div[2]/ul/li')
                if len(tag_list) == 2:
                    workExperience = tag_list[0].text
                    educational = tag_list[1].text
                else:
                    workExperience = tag_list[1].text
                    educational = tag_list[2].text
                print(educational, workExperience)
                hr_full = job.find_element(by=By.XPATH, value='.//div[@class="info-public"]').text.split()[0]
                # hrWork
                hrWork = job.find_element(by=By.XPATH, value='.//div[@class="info-public"]/em').text
                # hrName
                hrName = hr_full.replace(hrWork, "")
                print(hrName, hrWork)
                # workTag
                workTag = job.find_elements(by=By.XPATH, value='.//div[2]/ul/li')
                workTag = json.dumps(list(map(lambda x: x.text, workTag)))
                print(workTag)
                # pratice: 1 marks an internship paid by the day
                pratice = 0
                # salary, e.g. "15-25K·13薪" or "150-200元/天"
                salaries = job.find_element(by=By.XPATH, value='.//div[1]/a/div[2]/span').text
                print(salaries)
                if "K" in salaries:
                    salaries = salaries.split('·')  # split off the "13薪" part, if any
                    if len(salaries) == 1:
                        salary = list(map(lambda x: int(x) * 1000, salaries[0].replace('K', '').split('-')))
                        salaryMonth = '0薪'
                    else:
                        salary = list(map(lambda x: int(x) * 1000, salaries[0].replace('K', '').split('-')))
                        salaryMonth = salaries[1]
                else:
                    salary = list(map(lambda x: int(x), salaries.replace('元/天', '').split('-')))
                    salaryMonth = '0薪'
                    pratice = 1
                print(salary, salaryMonth, pratice)
                # companyTitle
                companyTitle = job.find_element(by=By.XPATH, value='.//div[1]/div/div[2]/h3/a').text
                print(companyTitle)
                # companyAvatar
                companyAvatar = job.find_element(by=By.XPATH,
                                                 value='.//div[1]/div/div[1]/a/img').get_attribute('src')
                print(companyAvatar)
                companyInfo = job.find_elements(by=By.XPATH,
                                                value='.//div[@class="job-card-right"]/div[@class="company-info"]/ul[@class="company-tag-list"]/li')
                # number of company-info tags
                print(len(companyInfo))
                if len(companyInfo) == 3:
                    # companyNature
                    companyNature = companyInfo[0].text
                    # companyStatue
                    companyStatue = companyInfo[1].text
                    # companyPeople
                    companyPeople = companyInfo[2].text
                    if companyPeople != '10000人以上':
                        companyPeople = list(map(lambda x: int(x), companyInfo[2].text.replace('人', '').split('-')))
                    else:
                        companyPeople = [0, 10000]
                else:
                    # companyNature
                    companyNature = companyInfo[0].text
                    # companyStatue
                    companyStatue = "未融资"
                    # companyPeople
                    companyPeople = companyInfo[1].text
                    if companyPeople != '10000人以上':
                        companyPeople = list(map(lambda x: int(x), companyInfo[1].text.replace('人', '').split('-')))
                    else:
                        companyPeople = [0, 10000]
                print(companyNature, companyStatue, companyPeople)
                # companyTag
                companyTag = job.find_element(by=By.XPATH,
                                              value='.//div[@class="job-card-footer clearfix"]/div[@class="info-desc"]').text
                if not companyTag:
                    companyTag = "无"
                else:
                    companyTag = ', '.join(companyTag.split(','))
                print(companyTag)
                # detailUrl
                detailUrl = job.find_element(by=By.XPATH,
                                             value='.//div[@class="job-card-body clearfix"]/a').get_attribute('href')
                print(detailUrl)
                # companyUrl
                companyUrl = job.find_element(by=By.XPATH,
                                              value='.//div[1]/div/div[2]/h3/a').get_attribute('href')
                print(companyUrl)
                print(title, address, dist, type, educational, workExperience, workTag, salary, salaryMonth,
                      companyTag, hrWork, hrName, pratice, companyTitle, companyAvatar, companyNature,
                      companyStatue, companyPeople, detailUrl, companyUrl)
                jobData.append(title)
                jobData.append(address)
                jobData.append(type)
                jobData.append(educational)
                jobData.append(workExperience)
                jobData.append(workTag)
                jobData.append(salary)
                jobData.append(salaryMonth)
                jobData.append(companyTag)
                jobData.append(hrWork)
                jobData.append(hrName)
                jobData.append(pratice)
                jobData.append(companyTitle)
                jobData.append(companyAvatar)
                jobData.append(companyNature)
                jobData.append(companyStatue)
                jobData.append(companyPeople)
                jobData.append(detailUrl)
                jobData.append(companyUrl)
                jobData.append(dist)
                self.save_to_csv(jobData)
            except Exception as e:
                print(e)
        self.page += 1
        browser.quit()
        self.main(self.page)

    def clear_csv(self):
        df = pd.read_csv('./temp.csv')
        df.dropna(inplace=True)
        df.drop_duplicates(inplace=True)
        df['salaryMonth'] = df['salaryMonth'].map(lambda x: x.replace('薪', ''))
        print(f'Total rows: {df.shape[0]}')
        return df.values

    def save_to_csv(self, rowData):
        with open('./temp.csv', 'a', newline='', encoding='utf-8') as wf:
            writer = csv.writer(wf)
            writer.writerow(rowData)

    def init(self):
        if not os.path.exists('./temp.csv'):
            # newline='' prevents blank rows on Windows
            with open('./temp.csv', 'w', newline='', encoding='utf-8') as wf:
                writer = csv.writer(wf)
                writer.writerow(["title", "address", "type", "educational", "workExperience", "workTag",
                                 "salary", "salaryMonth", "companyTag", "hrWork", "hrName", "pratice",
                                 "companyTitle", "companyAvatar", "companyNature", "companyStatue",
                                 "companyPeople", "detailUrl", "companyUrl", "dist"])


if __name__ == '__main__':
    spiderOpj = spider("前端工程师", 1)
    spiderOpj.init()
    spiderOpj.main(1)
    spiderOpj.clear_csv()
```

Conclusion

With the steps above you can automatically scrape job listings from a recruitment site and organize them into structured data. That saves a great deal of manual searching and collation, and it provides a foundation for subsequent data analysis and decision-making.
That said, this code only implements basic crawling; there is still room for improvement, particularly in exception handling, code duplication, performance optimization, and code safety.
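As one concrete example of the performance point, the fixed time.sleep(30) in main could be replaced with an explicit wait that returns as soon as the job cards appear; a sketch using Selenium's WebDriverWait, with the same XPath as in main and 30 seconds as an upper bound rather than a fixed delay:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until at least one job card is present in the DOM (max 30s),
# instead of always sleeping for the full 30 seconds
job_List = WebDriverWait(browser, 30).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, '//div[@class="search-job-result"]/ul[@class="job-list-box"]/li')))
```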

Tags: python, selenium, web scraping

Reposted from: https://blog.csdn.net/qq_52313022/article/details/143373771
Copyright belongs to the original author, 多练项目. If there is any infringement, please contact us for removal.
