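Selenium
Preface
This article is for learning purposes only!
Do not use it for anything illegal or against regulations; you alone are responsible for how you use it.
Introduction
Selenium is one of the most widely used open-source suites for automated Web UI (user interface) testing.
When integrated with Java, it essentially uses Java code to call the browser driver and simulate manual user actions.
Selenium supports many browsers; this article uses Google Chrome as the example.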
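1. Install the driver
The Selenium browser driver can be downloaded from two sources; either one works.
① First, confirm your browser version: enter chrome://settings/ in the address bar and open the "About Chrome" page.
② Open either of the URLs below and download the matching version. (If there is no exact match, pick the highest build with the same major version. In this article the browser is 104.0.5112.102, so the candidates are the versions starting with 104, and the best choice is the highest build among them.)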
http://chromedriver.storage.googleapis.com/index.html
http://npm.taobao.org/mirrors/chromedriver/
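The version-matching rule from step ② can be sketched in plain Java. This is only an illustration of the rule, not part of Selenium: the class name DriverVersionPicker and the sample list of available versions are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class DriverVersionPicker {

    /** Splits "104.0.5112.102" into {104, 0, 5112, 102}. */
    public static int[] parse(String version) {
        return Arrays.stream(version.split("\\.")).mapToInt(Integer::parseInt).toArray();
    }

    /** Compares two dotted versions component by component. */
    public static int compare(String a, String b) {
        return Arrays.compare(parse(a), parse(b));
    }

    /**
     * Prefers an exact match; otherwise returns the highest available
     * version that shares the browser's major version, if any.
     */
    public static Optional<String> pick(String browserVersion, List<String> available) {
        if (available.contains(browserVersion)) {
            return Optional.of(browserVersion);
        }
        int major = parse(browserVersion)[0];
        return available.stream()
                .filter(v -> parse(v)[0] == major)
                .max(DriverVersionPicker::compare);
    }

    public static void main(String[] args) {
        // Hypothetical sample of driver versions offered by the download page
        List<String> available = List.of(
                "103.0.5060.134", "104.0.5112.29", "104.0.5112.79", "105.0.5195.19");
        // prints 104.0.5112.79
        System.out.println(pick("104.0.5112.102", available).orElse("no match"));
    }
}
```

In practice you would simply read the list on the download page by eye; the sketch just makes the "same major version, highest build" rule explicit.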
2. A simple example to get started with crawling
package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.concurrent.TimeUnit;

public class BaiduDemo {

    public static void main(String[] args) throws Exception {
        // D://chromedriver.exe: use the path where the driver is actually stored
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        try {
            // Maximize the window
            driver.manage().window().maximize();
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            Thread.sleep(1000);
            // Open the Baidu home page
            driver.get("https://www.baidu.com/");
            // Find the search box
            WebElement text = driver.findElement(By.id("kw"));
            // Find the "Baidu search" button
            WebElement button = driver.findElement(By.id("su"));
            text.sendKeys("123");
            button.click();
        } finally {
            sleep(10000);
            driver.quit();
        }
    }

    public static void sleep(int millis) {
        try {
            // Sleep for the requested number of milliseconds
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
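With just a few lines of code we opened a web page and searched for '123'. Next, let's look at the commonly used APIs. You only need to understand them; look up the details whenever you need them.
3. Selenium API
3-1 Create a controllable browser object
// Remember to change this to where your driver is actually stored
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
WebDriver driver = new ChromeDriver();
3-2 Open a given page
driver.get("https://www.baidu.com/");
3-3 Locate elements
Note: when several elements on the page share the same attribute, use XPath to pinpoint the one you want.
Locate by id
driver.findElement(By.id("pnum"));
Locate by name
driver.findElement(By.name("name"));
Locate by class
driver.findElement(By.className("pgo"));
Locate by link text
driver.findElement(By.linkText("link"));
Locate by XPath
driver.findElement(By.xpath("//div[@id='1']/div/div/h3/a[1]"));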
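To see why XPath helps when several elements share the same attribute, here is a small sketch that evaluates two expressions against a tiny hand-written document. It uses the XPath engine bundled with the JDK (javax.xml.xpath) rather than Selenium, and the document and the class name XPathDisambiguation are assumptions made up for this example; the same expressions could be passed to By.xpath.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathDisambiguation {

    /** Returns the text of the first node matched by the expression. */
    public static String firstMatchText(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or document: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Two links with the same class; only the parent id tells them apart
        String page = "<body>"
                + "<div id='1'><h3><a class='title'>first result</a></h3></div>"
                + "<div id='2'><h3><a class='title'>second result</a></h3></div>"
                + "</body>";
        // An attribute-only lookup lands on the first match in document order
        System.out.println(firstMatchText(page, "//a[@class='title']"));
        // Anchoring on the parent id selects exactly the link we want
        System.out.println(firstMatchText(page, "//div[@id='2']/h3/a[1]"));
    }
}
```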
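3-4 Common element methods
Method            Description
sendKeys()        Simulates typing the given content
clear()           Clears the input content
getText()         Gets the element's text
getAttribute()    Gets the value of the given attribute
OK, once you have this part down you can write a simple crawler. If you are interested, try the following exercise:
Case 1: Log in to QQ Mail
Requirement:
Log in to QQ Mail and open the inbox page.
The implementation is as follows: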
package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.util.Objects;

public class QQEmaIlLoginDemo {

    public static void main(String[] args) throws InterruptedException {
        // Point to the driver to use; remember to replace the path with yours
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        try {
            Thread.sleep(1000);
            driver.get("https://mail.qq.com/");
            // The login form lives inside a frame, so switch into it first
            driver.switchTo().frame("login_frame");
            WebElement username = driver.findElement(By.id("u"));
            WebElement password = driver.findElement(By.id("p"));
            username.sendKeys("[email protected]");
            password.sendKeys("xxxxxx");
            WebElement submit = driver.findElement(By.id("login_button"));
            submit.click();
            Thread.sleep(1000);
            // Switch back from the frame to the main document
            driver.switchTo().defaultContent();
            WebElement element = validElement("//a[@id='folder_1']", driver);
            if (Objects.nonNull(element)) {
                element.click();
            } else {
                System.out.println("Failed to open the inbox");
            }
        } finally {
            Thread.sleep(10000);
            driver.close();
            driver.quit();
        }
    }

    /**
     * Returns the element matched by the XPath, or null if it does not exist.
     */
    public static WebElement validElement(String str, WebDriver driver) {
        try {
            return driver.findElement(By.xpath(str));
        } catch (Exception e) {
            System.out.println("This element does not exist: " + str);
        }
        return null;
    }
}
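Those were only simple cases. What about mouse actions and jumps across multiple pages? Don't worry, here they come.
3-5 Advanced Selenium
Mouse
Note: a chain of mouse actions must end with perform(); without it, the actions do not take effect.
In Java these are methods of the Actions class (org.openqa.selenium.interactions.Actions):
Method            Description
click()           Left click
contextClick()    Right click
doubleClick()     Double click
dragAndDrop()     Drag and drop
moveToElement()   Hover the mouse over an element
perform()         Executes all the actions stored in the chain
Switching windows
When clicking a page element makes the browser open a new window, you need to switch to the newest page.
driver.switchTo().window(frontHandle); // frontHandle is a window handle (a String); get it with driver.getWindowHandle() and keep it for later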
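One way to find the newly opened window is to compare the set of window handles before and after the click: the new window is the handle that was not there before. The sketch below shows only that set arithmetic in plain Java. In Selenium the two sets would come from driver.getWindowHandles(); the handle strings and the class name NewWindowFinder are made up for this example.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class NewWindowFinder {

    /** Returns the first handle present in 'after' but not in 'before'. */
    public static Optional<String> findNewHandle(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        return added.stream().findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical handles: one window before the click, two after it
        Set<String> before = new LinkedHashSet<>(List.of("window-A"));
        Set<String> after = new LinkedHashSet<>(List.of("window-A", "window-B"));
        // prints window-B
        System.out.println(findNewHandle(before, after).orElse("no new window"));
    }
}
```

The returned handle is what you would pass to driver.switchTo().window(...), while the handle saved before the click is used to switch back afterwards.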
Calling JavaScript
Simulate scrolling the page
driver.executeScript("window.scrollTo(0,300)");
When a page element cannot be clicked (blocked by anti-crawling measures)
driver.executeScript("arguments[0].click();", element); // element is the button or element to click
ChromeOptions: parameters for creating the browser
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER); // Eager load strategy: do not wait for images and other resources
chromeOptions.addArguments("--incognito"); // Incognito window mode
chromeOptions.addArguments("--blink-settings=imagesEnabled=false"); // Do not load images
chromeOptions.addArguments("--headless"); // Headless mode
chromeOptions.addArguments("--no-sandbox"); // Disable the sandbox
chromeOptions.addArguments("--disable-gpu"); // Disable GPU acceleration
chromeOptions.addArguments("--proxy-server=" + proxy); // Use a proxy (proxy holds the proxy address)
ChromeDriver driver = new ChromeDriver(chromeOptions);
Browser-related settings
// Set the global (implicit) wait time
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// Maximize the window
driver.manage().window().maximize();
// Remove the Selenium marker (navigator.webdriver)
String js1 = "Object.defineProperties(navigator, {webdriver:{get:()=>undefined}});";
((ChromeDriver) driver).executeScript(js1);
// Set a random User-Agent; this must be added to chromeOptions before the ChromeDriver is created
String[] arr = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
};
Random random = new Random();
chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
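The User-Agent rotation can be isolated into a small helper. The sketch below is plain Java and only builds the argument string. It assumes Chrome's --user-agent command-line switch, and the class name UserAgentArgument and the two-entry pool are choices made for this example.

```java
import java.util.List;
import java.util.Random;

public class UserAgentArgument {

    /** A short pool for illustration; extend it with whatever agents you need. */
    public static final List<String> POOL = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15");

    /** Builds a Chrome argument that carries one randomly chosen User-Agent. */
    public static String randomArgument(List<String> pool, Random random) {
        // Using pool.size() instead of a hard-coded count keeps the index in range
        return "--user-agent=" + pool.get(random.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        System.out.println(randomArgument(POOL, new Random()));
    }
}
```

The resulting string would be handed to chromeOptions.addArguments(...) before the driver is created, since options are read only at browser start-up.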
Multithreading example
When parsing list pages, each task creates its own browser object to do the parsing.
private void parsePagePre(SetOperations ops) {
    ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(
            2, 8, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
    List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
    for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs) {
        // Each task opens its own browser, so drivers are never shared between threads
        threadPoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
    }
}

private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
    ChromeDriver driver = getChromeDriver();
    try {
        driver.get(buildAreaUrlLj.getAreaUrl());
        // Business code
    } finally {
        // Always quit, otherwise the browser processes pile up
        driver.quit();
    }
}

private ChromeDriver getChromeDriver() {
    String[] arr = {
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
    };
    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
    chromeOptions.addArguments("--incognito");
    chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
    chromeOptions.addArguments("--headless");
    chromeOptions.addArguments("--no-sandbox");
    chromeOptions.addArguments("--disable-gpu");
    // Only add the proxy argument when a proxy should be used
    String nextProxy = proxyService.getNextProxy();
    if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
        chromeOptions.addArguments("--proxy-server=" + nextProxy);
    }
    HashMap<String, Object> map = new HashMap<>();
    map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
    map.put("webrtc.multiple_routes_enabled", false);
    map.put("webrtc.nonproxied_udp_enabled", false);
    chromeOptions.setExperimentalOption("prefs", map);
    Random random = new Random();
    chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
    ChromeDriver driver = new ChromeDriver(chromeOptions);
    driver.manage().window().maximize();
    return driver;
}
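The "one browser per task" pattern can be sketched without Selenium. Below, FakeBrowser is a hypothetical stand-in for a ChromeDriver, and its close() plays the role of driver.quit(); the class names and URLs are made up for this example. One thing worth knowing about the JDK: a ThreadPoolExecutor only starts threads beyond its core size once its queue is full, so with an unbounded LinkedBlockingQueue a pool declared as (2, 8) keeps running on 2 threads. The sketch therefore uses a fixed-size pool, which states the intended concurrency directly.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedCrawlPool {

    /** Hypothetical stand-in for a browser session (a ChromeDriver in a real crawler). */
    public static class FakeBrowser implements AutoCloseable {
        public static final AtomicInteger OPEN = new AtomicInteger();
        public static final AtomicInteger MAX_OPEN = new AtomicInteger();

        public FakeBrowser() {
            // Track how many browsers are open at once, and the peak value
            MAX_OPEN.accumulateAndGet(OPEN.incrementAndGet(), Math::max);
        }

        public void visit(String url) throws InterruptedException {
            Thread.sleep(20); // pretend to load and parse the page
        }

        @Override
        public void close() {
            OPEN.decrementAndGet(); // the equivalent of driver.quit()
        }
    }

    /** Visits every URL with at most 'threads' browsers open at once; returns the number completed. */
    public static int crawlAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (String url : urls) {
            pool.execute(() -> {
                // try-with-resources guarantees the browser is released even on failure
                try (FakeBrowser browser = new FakeBrowser()) {
                    browser.visit(url);
                    done.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        int finished = crawlAll(List.of("page-1", "page-2", "page-3", "page-4", "page-5"), 2);
        System.out.println(finished + " pages crawled, peak open browsers: " + FakeBrowser.MAX_OPEN.get());
    }
}
```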
Hands-on case: crawling the price trend chart from Fangtianxia
package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.config.RestTemplateConfig;
import com.mengkeng.selenium_demo.entity.TkBuildingsPriceAjk;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.http.*;
import org.springframework.util.CollectionUtils;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import java.math.BigDecimal;
import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * Date: 2022-07-10 13:50
 * Description:
 */
@RestController
@RequestMapping("fang")
@Slf4j
public class FangtianxiaDemo {

    @Autowired
    private RedisTemplate redisTemplate;

    private static LinkedList<String> pages = new LinkedList<>();

    /**
     * Base URL of the price data endpoint
     */
    public static final String PRICE_URL = "https://pinggun.fang.com/RunChartNew/MakeChartData/";

    /**
     * Redis key that records the pages already visited
     */
    public static final String SKIP_URLS = "SKIP_URLS";

    /**
     * Success flag
     */
    public static String TEMP_FLAG = "fail";

    @RequestMapping("sync")
    public String sync() {
        // Keep restarting with a fresh browser until one full pass succeeds
        while (!TEMP_FLAG.equals("success")) {
            System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
            ChromeOptions chromeOptions = new ChromeOptions();
            chromeOptions.addArguments("--headless");
            chromeOptions.addArguments("--no-sandbox");
            chromeOptions.addArguments("--disable-gpu");
            chromeOptions.addArguments("--disable-dev-shm-usage");
            WebDriver driver = new ChromeDriver(chromeOptions);
            driver.manage().window().maximize();
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            driver.get("https://esf.fang.com/housing/");
            sleep(2000);
            try {
                parseFTX(driver);
            } catch (Exception e) {
                log.error("Crawl failed, retrying with a new browser", e);
                sleep(10000);
            } finally {
                sleep(10000);
                driver.quit();
            }
        }
        return "ok";
    }

    /**
     * Parse Fangtianxia: walk through the districts, then the business areas
     */
    private void parseFTX(WebDriver driver) {
        SetOperations ops = redisTemplate.opsForSet();
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='qxName']/a"));
        // Districts
        for (int i = 2; i <= elements.size() - 3; i++) {
            WebElement element = driver.findElement(By.xpath("//div[@class='qxName']/a[" + i + "]"));
            element.click();
            sleep(800);
            // Business areas
            List<WebElement> elementsShangquan = driver.findElements(By.xpath("//p[@id='shangQuancontain']/a"));
            for (int sq = 2; sq <= elementsShangquan.size(); sq++) {
                WebElement elementsq = driver.findElement(By.xpath("//p[@id='shangQuancontain']/a[" + sq + "]"));
                String tempHref = elementsq.getAttribute("href");
                // if (ops.isMember(SKIP_URLS, tempHref)) {
                //     System.out.println("Skipped the current link " + tempHref);
                //     continue;
                // }
                elementsq.click();
                parsePage(driver);
                ops.add(SKIP_URLS, tempHref);
                sleep(800);
            }
        }
        // One full pass completed normally: finish
        TEMP_FLAG = "success";
    }

    /**
     * Parse the paginated list
     *
     * @param driver browser currently showing a list page
     */
    private void parsePage(WebDriver driver) {
        // Pagination
        try {
            driver.findElement(By.className("txt")).getText();
        } catch (Exception e) {
            log.info("No data under this category, url is " + driver.getCurrentUrl());
            return;
        }
        String pageTotal = driver.findElement(By.className("txt")).getText()
                .replaceAll("共", "").replaceAll("页", "");
        for (int page = 0; page < Integer.parseInt(pageTotal); page++) {
            List<WebElement> houseList = driver.findElements(By.xpath("//div[@class='houseList']/div"));
            for (int i = 1; i < houseList.size(); i++) {
                String communityName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).getText();
                String communityCode = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[2]")).getAttribute("projcode");
                String areaName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[2]/a[1]")).getText();
                // Jump to the detail page
                pages.addAll(driver.getWindowHandles());
                driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).click();
                sleepAndCutoverNewPage(800, driver);
                parseDetail(communityCode, communityName, areaName);
                // Close the detail window and go back to the list window
                driver.close();
                driver.switchTo().window(pages.getLast());
                sleep(1000);
            }
            if (page + 1 == Integer.parseInt(pageTotal)) {
                break;
            }
            String pageNow = driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).getAttribute("href");
            System.out.println("Next page is ------------" + pageNow + "----" + pageTotal);
            driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).click();
            sleep(600);
        }
    }

    /**
     * Parse the detail data
     *
     * @param communityCode community code
     * @param communityName community name
     * @param areaName      district name
     */
    public void parseDetail(String communityCode, String communityName, String areaName) {
        HashMap<String, Object> map = new HashMap<>();
        map.put("newcode", communityCode);
        map.put("city", cnToUnicode("北京"));
        map.put("district", cnToUnicode(areaName));
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON_UTF8);
        HttpEntity<String> entity = new HttpEntity<>(JSON.toJSONString(map), headers);
        RestTemplate restTemplate;
        try {
            restTemplate = new RestTemplate(RestTemplateConfig.generateHttpRequestFactory());
        } catch (Exception e) {
            // Without a client there is nothing to request
            log.error("Failed to create the RestTemplate", e);
            return;
        }
        ResponseEntity<String> stringResponseEntity = restTemplate.exchange(PRICE_URL, HttpMethod.POST, entity, String.class);
        // Prices
        Pattern compile = Pattern.compile(",(\\w+)]");
        Matcher matcher = compile.matcher(stringResponseEntity.getBody());
        // Months
        Pattern compileMonth = Pattern.compile("年(\\w+)月");
        Matcher matcherMonth = compileMonth.matcher(stringResponseEntity.getBody());
        ArrayList<String> list = new ArrayList<>();
        while (matcherMonth.find()) {
            list.add(matcherMonth.group(1));
        }
        // Year
        Pattern compileYear = Pattern.compile("&(\\w+)年");
        Matcher matcherYear = compileYear.matcher(stringResponseEntity.getBody());
        int year = 2020;
        while (matcherYear.find()) {
            year = Integer.parseInt(matcherYear.group(1));
        }
        ArrayList months = null;
        if (!CollectionUtils.isEmpty(list) && list.size() >= 2) {
            months = getMonths(year, Integer.parseInt(list.get(0)), Integer.parseInt(list.get(1)));
        }
        while (matcher.find()) {
            TkBuildingsPriceAjk ajk = new TkBuildingsPriceAjk();
            ajk.setDataOrigin("fangtianxia");
            ajk.setCommunityCode(communityCode);
            ajk.setCommunity(communityName);
            ajk.setAvgPrice(new BigDecimal(matcher.group(1)));
            System.out.println("Persist =======================================" + ajk);
        }
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Wait, then switch to the newly opened page
     *
     * @param millis time to wait before switching, in milliseconds
     * @param driver browser to switch
     * @return always null
     */
    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                // A handle that was not recorded before belongs to the new window
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Convert a string to its Unicode escape form
     *
     * @param cn text to convert
     * @return one escape sequence per character
     */
    private static String cnToUnicode(String cn) {
        char[] chars = cn.toCharArray();
        StringBuilder returnStr = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            returnStr.append("\\u").append(Integer.toString(chars[i], 16));
        }
        return returnStr.toString();
    }

    /**
     * Build the list of year-month values (yyyyMM); only supports a range from the given year into the next year
     *
     * @param year  start year
     * @param start start month
     * @param end   end month
     * @return list of yyyyMM strings
     */
    private static ArrayList getMonths(int year, int start, int end) {
        ArrayList res = new ArrayList();
        for (int i = start; i <= (end == 12 ? 12 : end + 12); i++) {
            if (i > 12) {
                res.add((year + 1) + String.format("%02d", i - 12));
            } else {
                res.add(year + String.format("%02d", i));
            }
        }
        return res;
    }
}
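To make the two helper methods at the end of the class easier to follow, here is a standalone, typed sketch of the same logic with a worked example. The class name PriceTrendHelpers and the method names are choices made for this example. Unlike the original, the escape helper pads every code unit to four hex digits, which only matters for characters below U+1000.

```java
import java.util.ArrayList;
import java.util.List;

public class PriceTrendHelpers {

    /**
     * Lists yyyyMM values from 'start' in the given year up to 'end'.
     * As in the original, an end month other than 12 is taken to be in the next year.
     */
    public static List<String> months(int year, int start, int end) {
        List<String> result = new ArrayList<>();
        int last = (end == 12) ? 12 : end + 12;
        for (int i = start; i <= last; i++) {
            int y = (i > 12) ? year + 1 : year;
            int m = (i > 12) ? i - 12 : i;
            result.add(String.format("%d%02d", y, m));
        }
        return result;
    }

    /** Converts each UTF-16 code unit to a backslash-u escape with four hex digits. */
    public static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints [202111, 202112, 202201, 202202]
        System.out.println(months(2021, 11, 2));
        System.out.println(toUnicodeEscapes("北京"));
    }
}
```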
Hands-on case: crawling community prices from Lianjia
package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.entity.BuildAreaUrlLj;
import com.mengkeng.selenium_demo.entity.IdAndNamePO;
import com.mengkeng.selenium_demo.entity.TkBuildingsAreaInfolj;
import com.mengkeng.selenium_demo.entity.TkBuildingsMonthPriceLj;
import com.mengkeng.selenium_demo.mapper.BuildAreaUrlLjMapper;
import com.mengkeng.selenium_demo.service.ProxyService;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.DateFormatUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.HashOperations;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * Date: 2022-09-05 13:58
 * Description: residential communities
 */
@RestController
@RequestMapping("areaInfo")
@Slf4j
public class LianjiaAreaInfoDemo {

    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private BuildAreaUrlLjMapper buildAreaUrlLjMapper;

    @Autowired
    private ProxyService proxyService;

    public static final String SKIP_URLS = "SKIP_URLS_AREAINFO_LIANJIA";
    public static final String URLS = "URLS_AREAINFO_LIANJIA";
    public static final String AREA_INFO_COMMUNITY_CODE_LJ = "AREA_INFO_COMMUNITY_CODE_LJ";

    private static LinkedList<String> pages = new LinkedList<>();

    ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2, 10, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    @RequestMapping("sync")
    public void sync() throws InterruptedException {
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        boolean flag = false;
        // Retry with a new browser until one pass finishes without an exception
        while (!flag) {
            try {
                ChromeDriver driver = getChromeDriver();
                SetOperations ops = redisTemplate.opsForSet();
                try {
                    getUrls(driver, ops);
                    parsePagePre(ops);
                } finally {
                    sleep(1000);
                    driver.quit();
                }
            } catch (Exception e) {
                Thread.sleep(10000);
                continue;
            }
            flag = true;
        }
        System.out.println("Done");
    }

    /**
     * Create a browser object
     * @return a configured ChromeDriver
     */
    private ChromeDriver getChromeDriver() {
        String nextProxy = proxyService.getNextProxy();
        System.out.println("Current IP is " + nextProxy);
        String[] arr = {
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
        };
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
        chromeOptions.addArguments("--incognito");
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
        chromeOptions.addArguments("--headless");
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-gpu");
        if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
            chromeOptions.addArguments("--proxy-server=" + nextProxy);
        }
        HashMap<String, Object> map = new HashMap<>();
        map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
        map.put("webrtc.multiple_routes_enabled", false);
        map.put("webrtc.nonproxied_udp_enabled", false);
        chromeOptions.setExperimentalOption("prefs", map);
        Random random = new Random();
        chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        driver.manage().window().maximize();
        return driver;
    }

    private void parsePagePre(SetOperations ops) {
        HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
        List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
        // Process only a slice of the list (indexes 1 to 3499)
        List<BuildAreaUrlLj> buildAreaUrlLjs1 = buildAreaUrlLjs.subList(1, 3500);
        for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs1) {
            if (ops.isMember(SKIP_URLS, buildAreaUrlLj.getAreaUrl())) {
                System.out.println("Skipping area " + buildAreaUrlLj.getCityName() + "-" + buildAreaUrlLj.getCountyName());
                continue;
            }
            pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
        }
    }

    /**
     * Parse the list pages
     * @param ops            Redis set operations
     * @param opsForHash     Redis hash operations
     * @param buildAreaUrlLj area whose list pages are parsed
     */
    private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
        ChromeDriver driver = getChromeDriver();
        try {
            driver.get(buildAreaUrlLj.getAreaUrl());
            String windowHandlePage = driver.getWindowHandle();
            WebElement totalNumStr = validElement("//h2[@class='total fl']/span", driver);
            if (null != totalNumStr) {
                Integer total = Integer.valueOf(totalNumStr.getText());
                // There is data
                if (total > 1) {
                    String pageData = driver.findElement(By.xpath("//div[@class='page-box house-lst-page-box']")).getAttribute("page-data");
                    Integer pageNumStr = Integer.valueOf(JSON.parseObject(pageData).getString("totalPage"));
                    System.out.println("Pages in current area " + pageNumStr + "---" + buildAreaUrlLj.getAreaUrl());
                    for (int x = 1; x <= pageNumStr; x++) {
                        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        for (int i = 0; i < elements.size(); i++) {
                            WebElement item = elements.get(i);
                            String code = "";
                            Pattern compile1 = Pattern.compile("xiaoqu/(\\w+)/");
                            Matcher matcher1 = compile1.matcher(item.getAttribute("href"));
                            while (matcher1.find()) {
                                code = matcher1.group(1);
                            }
                            driver.executeScript("arguments[0].click();", item);
                            sleepAndCutoverNewPage(300, driver);
                            // If the code is already in Redis, do not parse the detail page
                            if (!opsForHash.hasKey(AREA_INFO_COMMUNITY_CODE_LJ, code)) {
                                parseDetail(driver, code, buildAreaUrlLj, opsForHash);
                            } else {
                                System.out.println("Code already exists in Redis " + code);
                                // Update
                                // new TkBuildingsMonthPriceLj();
                            }
                            // Close the detail window and return to the list window
                            driver.close();
                            driver.switchTo().window(windowHandlePage);
                            sleep(200);
                            // Re-query the list, since the old element references may be stale
                            elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        }
                        if (x != pageNumStr) {
                            String nextPage = buildAreaUrlLj.getAreaUrl() + "pg" + (x + 1) + "/";
                            driver.get(nextPage);
                            System.out.println("Next page is " + nextPage);
                            sleep(200);
                        }
                    }
                }
            }
            ops.add(SKIP_URLS, buildAreaUrlLj.getAreaUrl());
        } catch (NumberFormatException e) {
            throw new RuntimeException("Exception in worker thread " + e.getMessage(), e);
        } finally {
            driver.quit();
        }
    }

    /**
     * Parse the detail page
     * @param driver         browser currently showing the detail page
     * @param communityCode  community code
     * @param buildAreaUrlLj area the community belongs to
     * @param opsForHash     Redis hash operations
     */
    private void parseDetail(ChromeDriver driver, String communityCode, BuildAreaUrlLj buildAreaUrlLj, HashOperations<String, Object, Object> opsForHash) {
        LocalDateTime now1 = LocalDateTime.now();
        if (null != validElement("//span[@class='xiaoquUnitPrice']", driver)) {
            TkBuildingsMonthPriceLj lj = new TkBuildingsMonthPriceLj();
            lj.setCommunityCode(communityCode);
            String year = String.valueOf(LocalDate.now().getYear());
            if (driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().equals("挂牌均价")) {
                // Listing average price: use the current month
                lj.setYearmonth(DateFormatUtils.format(new Date(), "yyyyMM"));
            } else {
                // Reference average price of a given month: take the month from the label
                String monthStr = driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().replace("月参考均价", "");
                String month = String.format("%02d", Integer.parseInt(monthStr));
                lj.setYearmonth(year + month);
            }
            lj.setAvgPrice(Integer.valueOf(driver.findElement(By.className("xiaoquUnitPrice")).getText()));
            lj.setGenerateType("0");
            lj.setCreateBy("1");
            lj.setCreateDate(new Date());
            lj.setUpdateBy("1");
            lj.setUpdateDate(new Date());
            lj.setDelFlag("0");
            System.out.println("Persist price " + lj);
        }
        LocalDateTime now2 = LocalDateTime.now();
        TkBuildingsAreaInfolj infolj = new TkBuildingsAreaInfolj();
        infolj.setDataOrigin("lianjia");
        infolj.setGenerateType("0");
        infolj.setProvince(buildAreaUrlLj.getProvinceId());
        infolj.setCity(buildAreaUrlLj.getCityId());
        infolj.setArea(buildAreaUrlLj.getCountyId());
        infolj.setCommunity(validElement("//h1[@class='detailTitle']", driver) == null ? "" : driver.findElement(By.xpath("//h1[@class='detailTitle']")).getText());
        infolj.setCommunityCode(communityCode);
        infolj.setBuildingYear(validElement("//span[text()='建筑年代']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='建筑年代']/parent::div/span[2]")).getText());
        infolj.setBuildingType(validElement("//span[text()='建筑类型']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='建筑类型']/parent::div/span[2]")).getText());
        infolj.setManageCost(validElement("//span[text()='物业费用']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='物业费用']/parent::div/span[2]")).getText());
        infolj.setManageCompany(validElement("//span[text()='物业公司']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='物业公司']/parent::div/span[2]")).getText());
        infolj.setManageDevlop(validElement("//span[text()='开发商']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='开发商']/parent::div/span[2]")).getText());
        infolj.setBuildingCount(validElement("//span[text()='楼栋总数']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='楼栋总数']/parent::div/span[2]")).getText());
        infolj.setRoomCount(validElement("//span[text()='房屋总数']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='房屋总数']/parent::div/span[2]")).getText());
        infolj.setCreateBy("1");
        infolj.setCreateDate(new Date());
        infolj.setUpdateBy("1");
        infolj.setUpdateDate(new Date());
        infolj.setDelFlag("0");
        System.out.println("Persist community " + infolj);
    }

    /**
     * Crawl the area links
     * @param driver browser used for crawling
     * @param ops    Redis set operations
     */
    private void getUrls(ChromeDriver driver, SetOperations ops) {
        driver.get("https://www.lianjia.com/city/");
        int count = 0;
        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        for (int i = 0; i < elements.size(); i++) {
            WebElement element = elements.get(i);
            String provinceName = element.findElement(By.xpath("./parent::li/parent::ul/parent::div/div")).getText();
            String areaName = element.getText();
            Boolean memberFlag = ops.isMember(URLS, areaName);
            if (Boolean.TRUE.equals(memberFlag)) {
                System.out.println("Area already crawled, skipping " + areaName);
                continue;
            }
            driver.executeScript("arguments[0].click();", element);
            String frontPage = driver.getWindowHandle();
            WebElement ershoufang = null;
            try {
                ershoufang = driver.findElement(By.linkText("小区"));
            } catch (Exception e) {
                ops.add(URLS, areaName);
                sleep(200);
                System.out.println(areaName + " has no community link ====");
                driver.get("https://www.lianjia.com/city/");
                elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
                continue;
            }
            driver.executeScript("arguments[0].click();", ershoufang);
            sleepAndCutoverNewPage(500, driver);
            List<WebElement> citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            citys.forEach(e -> System.out.println("City level============" + e.getText() + "==" + e.getAttribute("href")));
            for (int j = 0; j < citys.size(); j++) {
                String countyName = citys.get(j).getText();
                driver.executeScript("arguments[0].click();", citys.get(j));
                sleep(200);
                if (validElement("//h2[@class='total fl']/span", driver) != null) {
                    String text = driver.findElement(By.xpath("//h2[@class='total fl']/span")).getText();
                    count += Integer.parseInt(text);
                    System.out.println(countyName + " " + text + " communities");
                    System.out.println("Running total is " + count);
                }
                List<WebElement> areas = null;
                try {
                    areas = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[2]/a"));
                } catch (Exception e) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                if (areas.size() == 0) {
                    // No second-level areas: save the first-level links instead
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                saveDataCounty(countyName, areaName, provinceName, areas);
                sleep(100);
                citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            }
            ops.add(URLS, areaName);
            driver.close();
            driver.switchTo().window(frontPage);
            driver.get("https://www.lianjia.com/city/");
            sleep(200);
            elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        }
        System.out.println("Total is " + count);
    }

    private void saveDataCounty(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provincepo.getBusinessName());
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areapo.getBusinessName());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());
            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persist link " + buildAreaUrlLj);
        }
    }

    private void saveDataCity(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provinceName);
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areaName);
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());
            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persist link " + buildAreaUrlLj);
        }
    }

    /**
     * Look up province, city, or county information by name
     * @param type         1 = province, 2 = city, 3 = district
     * @param businessName name
     * @param parentId     parent id
     * @return the matching record, or a placeholder with id -1 if none is found
     */
    private IdAndNamePO queryProvinceCityArea(Integer type, String businessName, String parentId) {
        if (StringUtils.isNotBlank(parentId)) {
            // Parent ids that are looked up under the name "市辖区"
            ArrayList<String> citys = new ArrayList<>(8);
            citys.add("50");
            citys.add("11");
            citys.add("31");
            citys.add("12");
            if (citys.contains(parentId)) {
                businessName = "市辖区";
            }
        }
        IdAndNamePO po = null;
        try {
            if (type == 1) {
                // po = buildingsAvgMapper.queryProvinceIdByName(businessName);
            } else if (type == 2) {
                // po = buildingsAvgMapper.queryCityIdByName(businessName, parentId);
            } else if (type == 3) {
                // po = buildingsAvgMapper.querycountyIdByName(businessName, parentId);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (null == po) {
            po = new IdAndNamePO();
            po.setBusinessId("-1");
            po.setBusinessName(businessName);
        }
        return po;
    }

    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
        }
        return null;
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
        }
    }

    public static WebElement validElement(String str, WebDriver driver) {
        try {
            return driver.findElement(By.xpath(str));
        } catch (Exception e) {
            System.out.println("This element does not exist: " + str);
        }
        return null;
    }
}
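Two small pieces of string handling carry much of the list-page logic above: pulling the community code out of a link with the pattern xiaoqu/(\w+)/ and building the URL of page N by appending pgN/ to the area URL. The sketch below isolates them in plain Java. The class name LianjiaUrlHelpers and the example.com URLs are assumptions for illustration, and unlike the original loop it returns the first regex match rather than the last.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LianjiaUrlHelpers {

    private static final Pattern COMMUNITY_CODE = Pattern.compile("xiaoqu/(\\w+)/");

    /** Extracts the path segment that follows "xiaoqu/", if the link has one. */
    public static Optional<String> communityCode(String href) {
        Matcher matcher = COMMUNITY_CODE.matcher(href);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** Page 1 is the area URL itself; later pages append "pg" + page + "/". */
    public static String pageUrl(String areaUrl, int page) {
        return page <= 1 ? areaUrl : areaUrl + "pg" + page + "/";
    }

    public static void main(String[] args) {
        // Hypothetical link in the same shape as the ones in the list page
        System.out.println(communityCode("https://example.com/xiaoqu/1111027382589/").orElse(""));
        System.out.println(pageUrl("https://example.com/xiaoqu/dongcheng/", 3));
    }
}
```

Returning an Optional makes the "no code found" case explicit, where the original falls back to an empty string.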
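Notes
1. driver.close() closes the current window, while driver.quit() exits the browser process. If a loop over a list never quits the process, the browsers will eat up all the memory.
2. After jumping to a new page, use an explicit wait whenever you can, so that lookups do not fail because an element has not loaded yet.
3. Do not send requests too frequently; if you have special needs, use a proxy.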
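The idea behind the explicit wait in note 2 is to poll for a condition until it holds or a timeout expires, instead of sleeping for a fixed time. Selenium ships this as WebDriverWait together with ExpectedConditions. The sketch below is a library-free illustration of the same polling loop, with the class name PollingWait and the probe in main made up for this example.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

public class PollingWait {

    /**
     * Calls the probe repeatedly until it returns a non-null value
     * or the timeout expires; an empty result means it timed out.
     */
    public static <T> Optional<T> waitFor(Supplier<T> probe, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical probe: the "element" only shows up on the third look
        Optional<String> found = waitFor(
                () -> ++calls[0] < 3 ? null : "element",
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(found.orElse("timed out") + " after " + calls[0] + " attempts");
    }
}
```

With Selenium, the probe would be a lookup such as the validElement helper used in the cases above, which returns null while the element is missing.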
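Afterword
Source code for the cases above
Copyright belongs to the original author, 萌坑. In case of infringement, please contact us for removal.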