java+selenium

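Preface

This article is for learning purposes only!
Do not use it for anything illegal or against regulations; you alone are responsible for what you do with it.

Introduction

Selenium is one of the most widely used open-source automation testing suites for Web UIs (user interfaces).
Integrating it with Java essentially means calling a browser driver from Java code to simulate what a person would do by hand.
Selenium supports several browsers; this article uses Google Chrome as the example.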

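1. Installing the driver

The Selenium driver can be downloaded from either of two sites; pick one.
① First confirm the browser version: enter chrome://settings/ in the address bar and check the version shown under About Chrome.
② Open either of the URLs below and download the matching version. If there is no exact match, choose the highest available version with the same major number. In this article the browser is 104.0.5112.102, so the candidates are the versions starting with 104, and the best choice is the highest build among them.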

http://chromedriver.storage.googleapis.com/index.html
http://npm.taobao.org/mirrors/chromedriver/

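The version-matching rule from step ② can be written down as a small helper. This is only a sketch of that rule using the JDK alone; the class name DriverVersionPicker and the list of available versions are made up for illustration, and the real list has to be read from the download page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class DriverVersionPicker {

    /** Splits "104.0.5112.102" into {104, 0, 5112, 102}. */
    static int[] parse(String version) {
        return Arrays.stream(version.trim().split("\\."))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    /** Compares two versions component by component; missing components count as 0. */
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Returns the exact match if present, otherwise the highest version with the same major number. */
    static Optional<String> pick(String browserVersion, List<String> available) {
        if (available.contains(browserVersion)) {
            return Optional.of(browserVersion);
        }
        int major = parse(browserVersion)[0];
        return available.stream()
                .filter(v -> parse(v)[0] == major)
                .max(DriverVersionPicker::compare);
    }

    public static void main(String[] args) {
        // Hypothetical listing; the real one comes from the download page.
        List<String> available = Arrays.asList(
                "103.0.5060.134", "104.0.5112.29", "104.0.5112.79", "105.0.5195.19");
        System.out.println(pick("104.0.5112.102", available).orElse("no match"));
    }
}
```

With the sample list above, browser version 104.0.5112.102 has no exact match, so the helper falls back to 104.0.5112.79, the highest 104.x build in the list.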

2. A simple example to get started with crawling

package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.concurrent.TimeUnit;

public class BaiduDemo {

    public static void main(String[] args) throws Exception {
        // D://chromedriver.exe is only an example; use the path where your driver is actually stored
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        try {
            // Maximize the window
            driver.manage().window().maximize();
            // Wait up to 10 seconds for elements to appear before findElement gives up
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            Thread.sleep(1000);
            // Open the Baidu home page
            driver.get("https://www.baidu.com/");
            // Find the search box
            WebElement text = driver.findElement(By.id("kw"));
            // Find the "百度一下" (search) button
            WebElement button = driver.findElement(By.id("su"));
            text.sendKeys("123");
            button.click();
        } finally {
            sleep(10000);
            driver.quit();
        }
    }

    public static void sleep(int time) {
        try {
            // Sleep for the requested time (the parameter was previously ignored)
            Thread.sleep(time);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

With a few lines of code we opened a web page and searched for '123'. Next come the commonly used APIs; it is enough to understand them and look them up when you need them.

3. Selenium API

3-1 Creating a controllable browser object

// Remember to change this to the actual location of your driver
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
WebDriver driver = new ChromeDriver();

3-2 Opening a page

driver.get("https://www.baidu.com/");

3-3 Locating elements

Note: if several elements on the page share the same attribute value, use an XPath locator to pick the specific one.

Locate by id
driver.findElement(By.id("pnum"));
Locate by name
driver.findElement(By.name("name"));
Locate by class
driver.findElement(By.className("pgo"));
Locate by link text
driver.findElement(By.linkText("link"));
Locate by XPath
driver.findElement(By.xpath("//div[@id='1']/div/div/h3/a[1]"));
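To see why XPath helps when attributes repeat, the following sketch evaluates positional XPath expressions against a tiny document using only the JDK's javax.xml.xpath, without a browser. The markup and the class name XPathIndexDemo are invented for the illustration; the same expressions could be passed to By.xpath. Note that XPath positions start at 1, not 0.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class XPathIndexDemo {

    // Hypothetical markup: three links that all share the same class.
    static final String PAGE =
            "<html><body>"
          + "<div id='1'><h3><a class='t' href='/a'>first</a><a class='t' href='/b'>second</a></h3></div>"
          + "<div id='2'><h3><a class='t' href='/c'>third</a></h3></div>"
          + "</body></html>";

    /** Parses the sample markup into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample markup", e);
        }
    }

    /** Returns every node matched by the XPath expression. */
    static NodeList select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(PAGE);
        // The class alone is ambiguous: it matches all three links.
        System.out.println(select(doc, "//a[@class='t']").getLength());
        // Anchoring on the parent id and adding a position picks exactly one link.
        System.out.println(select(doc, "//div[@id='1']/h3/a[1]").item(0).getTextContent());
        System.out.println(select(doc, "//div[@id='1']/h3/a[2]").item(0).getTextContent());
    }
}
```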

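3-4 Commonly used methods

Method            Description
sendKeys()        Type the given text into the element
clear()           Clear the text that was entered
getText()         Get the element's text
getAttribute()    Get the value of the given attribute

OK, once you have mastered this part you can write a simple crawler. If you are interested, try the following case:

Case 1: Logging in to QQ Mail

Requirement:

Log in to QQ Mail and open the inbox page.

Here is the implementation: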

package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.util.Objects;

public class QQEmaIlLoginDemo {

    public static void main(String[] args) throws InterruptedException {
        // Tell Selenium which driver binary to use; replace this with your own path
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        try {
            Thread.sleep(1000);
            driver.get("https://mail.qq.com/");
            // The login form lives inside an iframe, so switch into it first
            driver.switchTo().frame("login_frame");
            WebElement username = driver.findElement(By.id("u"));
            WebElement password = driver.findElement(By.id("p"));
            // Replace with your own account and password
            username.sendKeys("[email protected]");
            password.sendKeys("xxxxxx");
            WebElement submit = driver.findElement(By.id("login_button"));
            submit.click();
            Thread.sleep(1000);
            // Switch back to the top-level document
            driver.switchTo().defaultContent();
            WebElement element = validElement("//a[@id='folder_1']", driver);
            if (Objects.nonNull(element)) {
                element.click();
            } else {
                System.out.println("Failed to open the inbox");
            }
        } finally {
            Thread.sleep(10000);
            driver.close();
            driver.quit();
        }
    }

    /** Returns the element matched by the XPath, or null if it does not exist. */
    public static WebElement validElement(String str, WebDriver driver) {
        try {
            return driver.findElement(By.xpath(str));
        } catch (Exception e) {
            System.out.println("Element not found: " + str);
        }
        return null;
    }
}
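The case above pauses for a fixed second after clicking the login button and then looks the inbox link up once. A polling wait is usually more forgiving: retry the lookup until it succeeds or a timeout expires. The helper below is a plain-JDK sketch of that idea; the name PollingWait and the stand-in lookup are assumptions for illustration, and Selenium ships its own ready-made version of the pattern as WebDriverWait with ExpectedConditions.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

final class PollingWait {

    /**
     * Calls lookup repeatedly until it returns a non-null value or the timeout elapses.
     * A RuntimeException thrown by lookup is treated as "not there yet".
     * Returns an empty Optional on timeout or if the thread is interrupted.
     */
    static <T> Optional<T> until(Supplier<T> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                T value = lookup.get();
                if (value != null) {
                    return Optional.of(value);
                }
            } catch (RuntimeException notReadyYet) {
                // e.g. an element lookup that throws while the page is still loading
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        long readyAt = System.currentTimeMillis() + 300;
        // Stand-in for () -> driver.findElement(By.xpath("//a[@id='folder_1']"))
        Supplier<String> lookup = () -> {
            if (System.currentTimeMillis() < readyAt) {
                throw new IllegalStateException("not rendered yet");
            }
            return "folder_1";
        };
        System.out.println(until(lookup, Duration.ofSeconds(2), Duration.ofMillis(50)).isPresent());
    }
}
```

The caller gets an Optional back instead of null, so a missing element has to be handled explicitly.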

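The cases above are simple ones. What about mouse actions and jumping between multiple pages? Don't worry, that comes next.

3-5 Advanced Selenium

Mouse

Note: a chain of mouse actions must end with perform(); without it the actions are not executed.

Method            Description
click()           Left click
contextClick()    Right click
doubleClick()     Double click
dragAndDrop()     Drag and drop
moveToElement()   Hover the mouse over an element
perform()         Execute all actions stored in the Actions chain

Switching windows

When clicking a page element makes the browser open a new window, you have to switch to the newest page.

driver.switchTo().window(frontHandle); // frontHandle is a window handle (a String); obtain it with driver.getWindowHandle() and keep it for later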
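Finding the window that a click just opened comes down to a set difference between the handles before and after the click. The sketch below shows only that step with plain JDK collections; the class name WindowHandles and the handle values are invented, and in real code both sets would come from driver.getWindowHandles().

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

final class WindowHandles {

    /** Returns a handle that is present in 'after' but was not in 'before', if there is one. */
    static Optional<String> newlyOpened(Set<String> before, Set<String> after) {
        Set<String> diff = new LinkedHashSet<>(after);
        diff.removeAll(before);
        return diff.stream().findFirst();
    }

    public static void main(String[] args) {
        // Stand-ins for driver.getWindowHandles() before and after the click.
        Set<String> before = new LinkedHashSet<>(Arrays.asList("window-A"));
        Set<String> after = new LinkedHashSet<>(Arrays.asList("window-A", "window-B"));
        System.out.println(newlyOpened(before, after).orElse("no new window"));
    }
}
```

The returned handle is what would be passed to driver.switchTo().window(...); the handle saved before the click is the one to switch back to after closing the new window.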

Calling JavaScript

Simulate scrolling the page
driver.executeScript("window.scrollTo(0,300)");

When a page element cannot be clicked normally (blocked by anti-crawler measures)
driver.executeScript("arguments[0].click();", element); // element is the button or other element to click

ChromeOptions: arguments for creating the browser
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);            // eager loading: do not wait for images and other sub-resources
chromeOptions.addArguments("--incognito");                            // incognito (private) window
chromeOptions.addArguments("--blink-settings=imagesEnabled=false");   // do not load images
chromeOptions.addArguments("--headless");                             // headless mode
chromeOptions.addArguments("--no-sandbox");                           // disable the sandbox
chromeOptions.addArguments("--disable-gpu");                          // disable GPU acceleration
chromeOptions.addArguments("--proxy-server=" + proxy);                // use a proxy
ChromeDriver driver = new ChromeDriver(chromeOptions);

Browser-related settings
// Set the global (implicit) wait time
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// Maximize the window
driver.manage().window().maximize();
// Remove the Selenium marker (navigator.webdriver)
String js1 = "Object.defineProperties(navigator, {webdriver:{get:()=>undefined}});";
((ChromeDriver) driver).executeScript(js1);
// Set a random User-Agent. This is a Chrome startup argument, so add it to chromeOptions before the driver is created
String[] arr = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
};
Random random = new Random();
chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
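Because every startup option is just a string, the argument list can be assembled and checked without starting a browser. The sketch below builds such a list with an optional proxy and a randomly chosen User-Agent; the class name ChromeArguments, the shortened User-Agent list, and the proxy address are placeholders chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ChromeArguments {

    // Shortened list for the example; any realistic User-Agent strings can go here.
    static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    };

    /** Builds the startup arguments; pass a null or blank proxy to connect directly. */
    static List<String> build(Random random, String proxy) {
        List<String> arguments = new ArrayList<>();
        arguments.add("--incognito");
        arguments.add("--headless");
        arguments.add("--disable-gpu");
        if (proxy != null && !proxy.trim().isEmpty()) {
            arguments.add("--proxy-server=" + proxy);
        }
        // Index with the array length so the list can grow or shrink safely.
        arguments.add("--user-agent=" + USER_AGENTS[random.nextInt(USER_AGENTS.length)]);
        return arguments;
    }

    public static void main(String[] args) {
        // "127.0.0.1:8888" is a placeholder proxy address.
        build(new Random(), "127.0.0.1:8888").forEach(System.out::println);
    }
}
```

The resulting list can be handed to chromeOptions.addArguments(...) in one call before the ChromeDriver is created.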
Multithreading example

When parsing the list pages, each task creates its own browser object to do the parsing.

private void parsePagePre(SetOperations ops) {
    // Note: with an unbounded LinkedBlockingQueue the pool never grows beyond its core size (2 threads)
    ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2, 8, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
    List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
    for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs) {
        pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
    }
}

private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
    ChromeDriver driver = getChromeDriver();
    try {
        driver.get(buildAreaUrlLj.getAreaUrl());
        // business code
    } finally {
        // Always quit, otherwise every task leaves a browser process behind
        driver.quit();
    }
}

private ChromeDriver getChromeDriver() {
    String nextProxy = proxyService.getNextProxy();
    String[] arr = {
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
    };
    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
    chromeOptions.addArguments("--incognito");
    chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
    chromeOptions.addArguments("--headless");
    chromeOptions.addArguments("--no-sandbox");
    chromeOptions.addArguments("--disable-gpu");
    // Only use a proxy when one is configured
    if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
        chromeOptions.addArguments("--proxy-server=" + nextProxy);
    }
    // Stop WebRTC from leaking the real IP around the proxy
    HashMap<String, Object> map = new HashMap<>();
    map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
    map.put("webrtc.multiple_routes_enabled", false);
    map.put("webrtc.nonproxied_udp_enabled", false);
    chromeOptions.setExperimentalOption("prefs", map);
    Random random = new Random();
    chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
    ChromeDriver driver = new ChromeDriver(chromeOptions);
    driver.manage().window().maximize();
    return driver;
}
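The pattern above (one browser per task, released when the task ends) can be tried out without Selenium. In the sketch below a FakeBrowser class stands in for ChromeDriver and only counts how many instances were opened and closed; PooledCrawl, FakeBrowser and the URLs are all invented for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PooledCrawl {

    /** Stand-in for a ChromeDriver; it only counts opened and closed instances. */
    static final class FakeBrowser implements AutoCloseable {
        static final AtomicInteger OPENED = new AtomicInteger();
        static final AtomicInteger CLOSED = new AtomicInteger();

        FakeBrowser() {
            OPENED.incrementAndGet();
        }

        void get(String url) {
            if (url.contains("broken")) {
                throw new IllegalStateException("cannot load " + url);
            }
        }

        @Override
        public void close() {
            CLOSED.incrementAndGet(); // plays the role of driver.quit()
        }
    }

    /** Crawls every URL on a fixed-size pool and returns how many pages loaded successfully. */
    static int crawlAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger succeeded = new AtomicInteger();
        for (String url : urls) {
            pool.execute(() -> {
                // try-with-resources closes the browser even when the page fails to load
                try (FakeBrowser browser = new FakeBrowser()) {
                    browser.get(url);
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    System.out.println("failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return succeeded.get();
    }

    public static void main(String[] args) {
        int ok = crawlAll(Arrays.asList("page-1", "broken-page", "page-3"), 2);
        System.out.println(ok + " pages loaded, opened=" + FakeBrowser.OPENED.get()
                + ", closed=" + FakeBrowser.CLOSED.get());
    }
}
```

A fixed-size pool is used here so the number of simultaneously open browsers is bounded by the thread count; waiting for termination ensures the caller knows when all tasks are done.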

Practical case - scraping the price trend chart from Fang.com (房天下)

package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.config.RestTemplateConfig;
import com.mengkeng.selenium_demo.entity.TkBuildingsPriceAjk;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.http.*;
import org.springframework.util.CollectionUtils;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import java.math.BigDecimal;
import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Date: 2022-07-10 13:50
 * Description:
 */
@RestController
@RequestMapping("fang")
@Slf4j
public class FangtianxiaDemo {

    @Autowired
    private RedisTemplate redisTemplate;

    private static LinkedList<String> pages = new LinkedList<>();

    /**
     * Base URL of the price data endpoint
     */
    public static final String PRICE_URL = "https://pinggun.fang.com/RunChartNew/MakeChartData/";
    /**
     * Redis key that records the pages already visited
     */
    public static final String SKIP_URLS = "SKIP_URLS";
    /**
     * Success flag
     */
    public static String TEMP_FLAG = "fail";

    @RequestMapping("sync")
    public String sync() {
        // Keep restarting the browser until one full pass succeeds
        while (!TEMP_FLAG.equals("success")) {
            System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
            ChromeOptions chromeOptions = new ChromeOptions();
            chromeOptions.addArguments("--headless");
            chromeOptions.addArguments("--no-sandbox");
            chromeOptions.addArguments("--disable-gpu");
            chromeOptions.addArguments("--disable-dev-shm-usage");
            WebDriver driver = new ChromeDriver(chromeOptions);
            driver.manage().window().maximize();
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            driver.get("https://esf.fang.com/housing/");
            sleep(2000);
            try {
                parseFTX(driver);
            } catch (Exception e) {
                log.error("Crawl failed, retrying", e);
                sleep(10000);
            } finally {
                sleep(10000);
                driver.quit();
            }
        }
        return "ok";
    }

    /**
     * Parse Fang.com (fangtianxia)
     */
    private void parseFTX(WebDriver driver) {
        SetOperations ops = redisTemplate.opsForSet();
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='qxName']/a"));
        // Districts
        for (int i = 2; i <= elements.size() - 3; i++) {
            WebElement element = driver.findElement(By.xpath("//div[@class='qxName']/a[" + i + "]"));
            element.click();
            sleep(800);
            // Business areas
            List<WebElement> elementsShangquan = driver.findElements(By.xpath("//p[@id='shangQuancontain']/a"));
            for (int sq = 2; sq <= elementsShangquan.size(); sq++) {
                WebElement elementsq = driver.findElement(By.xpath("//p[@id='shangQuancontain']/a[" + sq + "]"));
                String tempHref = elementsq.getAttribute("href");
                // if (ops.isMember(SKIP_URLS, tempHref)) {
                //     System.out.println("Skipping link " + tempHref);
                //     continue;
                // }

                elementsq.click();
                parsePage(driver);
                ops.add(SKIP_URLS, tempHref);
                sleep(800);
            }
        }
        TEMP_FLAG = "success"; // one full pass completed normally
    }

    /**
     * Parse the paginated list
     *
     * @param driver
     */
    private void parsePage(WebDriver driver) {
        // Pagination
        try {
            driver.findElement(By.className("txt")).getText();
        } catch (Exception e) {
            log.info("No data under this category, url: " + driver.getCurrentUrl());
            return;
        }
        // The pager text looks like "共N页"; strip the two characters to get N
        String pageTotal = driver.findElement(By.className("txt")).getText().replaceAll("共", "").replaceAll("页", "");
        for (int page = 0; page < Integer.parseInt(pageTotal); page++) {
            List<WebElement> houseList = driver.findElements(By.xpath("//div[@class='houseList']/div"));
            for (int i = 1; i < houseList.size(); i++) {
                String communityName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).getText();
                String communityCode = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[2]")).getAttribute("projcode");
                String areaName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[2]/a[1]")).getText();
                // Jump to the detail page
                pages.addAll(driver.getWindowHandles());
                driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).click();
                sleepAndCutoverNewPage(800, driver);
                parseDetail(communityCode, communityName, areaName);

                // Close the detail window and go back to the list window
                driver.close();
                driver.switchTo().window(pages.getLast());
                sleep(1000);
            }
            if (page + 1 == Integer.parseInt(pageTotal)) {
                break;
            }
            String pageNow = driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).getAttribute("href");
            System.out.println("Next page: ------------" + pageNow + "----" + pageTotal);
            driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).click();
            sleep(600);
        }
    }

    /**
     * Parse the detail data
     *
     * @param communityCode
     * @param communityName
     * @param areaName
     */
    public void parseDetail(String communityCode, String communityName, String areaName) {
        HashMap<String, Object> map = new HashMap<>();
        map.put("newcode", communityCode);
        map.put("city", cnToUnicode("北京"));
        map.put("district", cnToUnicode(areaName));
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON_UTF8);
        HttpEntity<String> entity = new HttpEntity<>(JSON.toJSONString(map), headers);
        RestTemplate restTemplate;
        try {
            restTemplate = new RestTemplate(RestTemplateConfig.generateHttpRequestFactory());
        } catch (Exception e) {
            // Without a RestTemplate the request cannot be sent, so skip this community
            e.printStackTrace();
            return;
        }
        ResponseEntity<String> stringResponseEntity = restTemplate.exchange(PRICE_URL, HttpMethod.POST, entity, String.class);
        // Prices
        Pattern compile = Pattern.compile(",(\\w+)]");
        Matcher matcher = compile.matcher(stringResponseEntity.getBody());
        // Months
        Pattern compileMonth = Pattern.compile("年(\\w+)月");
        Matcher matcherMonth = compileMonth.matcher(stringResponseEntity.getBody());
        ArrayList<String> list = new ArrayList<>();
        while (matcherMonth.find()) {
            list.add(matcherMonth.group(1));
        }
        // Year
        Pattern compileYear = Pattern.compile("&(\\w+)年");
        Matcher matcherYear = compileYear.matcher(stringResponseEntity.getBody());
        int year = 2020;
        while (matcherYear.find()) {
            year = Integer.parseInt(matcherYear.group(1));
        }
        ArrayList<String> months = null;
        if (!CollectionUtils.isEmpty(list)) {
            months = getMonths(year, Integer.parseInt(list.get(0)), Integer.parseInt(list.get(1)));
        }
        while (matcher.find()) {
            TkBuildingsPriceAjk ajk = new TkBuildingsPriceAjk();
            ajk.setDataOrigin("fangtianxia");
            ajk.setCommunityCode(communityCode);
            ajk.setCommunity(communityName);
            ajk.setAvgPrice(new BigDecimal(matcher.group(1)));
            System.out.println("Persist=======================================" + ajk);
        }
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Sleep, then switch to the newly opened page
     *
     * @param millis
     * @param driver
     * @return
     */
    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                // A handle that was not recorded before the click belongs to the new window
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Convert Chinese text into backslash-u escaped Unicode code units
     *
     * @param cn
     * @return
     */
    private static String cnToUnicode(String cn) {
        char[] chars = cn.toCharArray();
        StringBuilder returnStr = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            returnStr.append("\\u").append(Integer.toString(chars[i], 16));
        }
        return returnStr.toString();
    }

    /**
     * Build the list of yyyyMM values; only supports a range from the given year into the next year
     *
     * @param year  start year
     * @param start start month
     * @param end   end month
     * @return
     */
    private static ArrayList<String> getMonths(int year, int start, int end) {
        ArrayList<String> res = new ArrayList<>();
        for (int i = start; i <= (end == 12 ? 12 : end + 12); i++) {
            if (i > 12) {
                res.add((year + 1) + String.format("%02d", i - 12));
            } else {
                res.add(year + String.format("%02d", i));
            }
        }
        return res;
    }
}
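The crawler above derives the page count by stripping "共" and "页" from the pager text and parsing what is left, which throws if the text has any other shape. A regular expression with an explicit "not found" result is a slightly more defensive way to do the same step. This is only a sketch; the class name PageTotal is invented and the pager text format "共N页" is taken from the code above.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageTotal {

    // Matches pager text such as "共12页" ("12 pages in total").
    private static final Pattern TOTAL = Pattern.compile("共\\s*(\\d+)\\s*页");

    /** Extracts N from text shaped like "共N页"; empty when the text looks different. */
    static OptionalInt parse(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        Matcher m = TOTAL.matcher(text);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("共12页").orElse(0));
        System.out.println(parse("no pager here").isPresent());
    }
}
```

Parsing once and keeping the int also avoids calling Integer.parseInt on the same string in every loop iteration.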

Practical case - scraping community prices from Lianjia (链家)

package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.entity.BuildAreaUrlLj;
import com.mengkeng.selenium_demo.entity.IdAndNamePO;
import com.mengkeng.selenium_demo.entity.TkBuildingsAreaInfolj;
import com.mengkeng.selenium_demo.entity.TkBuildingsMonthPriceLj;
import com.mengkeng.selenium_demo.mapper.BuildAreaUrlLjMapper;
import com.mengkeng.selenium_demo.service.ProxyService;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.DateFormatUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.HashOperations;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Date: 2022-09-05 13:58
 * Description: residential communities
 */
@RestController
@RequestMapping("areaInfo")
@Slf4j
public class LianjiaAreaInfoDemo {

    @Autowired
    private StringRedisTemplate redisTemplate;
    @Autowired
    private BuildAreaUrlLjMapper buildAreaUrlLjMapper;
    @Autowired
    private ProxyService proxyService;

    public static final String SKIP_URLS = "SKIP_URLS_AREAINFO_LIANJIA";
    public static final String URLS = "URLS_AREAINFO_LIANJIA";
    public static final String AREA_INFO_COMMUNITY_CODE_LJ = "AREA_INFO_COMMUNITY_CODE_LJ";

    private static LinkedList<String> pages = new LinkedList<>();

    ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2, 10, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    @RequestMapping("sync")
    public void sync() throws InterruptedException {
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        boolean flag = false;
        // Retry with a fresh browser until one pass finishes without an exception
        while (!flag) {
            try {
                ChromeDriver driver = getChromeDriver();
                SetOperations ops = redisTemplate.opsForSet();
                try {
                    getUrls(driver, ops);
                    parsePagePre(ops);
                } finally {
                    sleep(1000);
                    driver.quit();
                }
            } catch (Exception e) {
                Thread.sleep(10000);
                continue;
            }
            flag = true;
        }
        System.out.println("Done");
    }

    /**
     * Create the browser object
     * @return
     */
    private ChromeDriver getChromeDriver() {
        String nextProxy = proxyService.getNextProxy();
        System.out.println("Current proxy IP: " + nextProxy);
        String[] arr = {
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
        };
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
        chromeOptions.addArguments("--incognito");
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
        chromeOptions.addArguments("--headless");
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-gpu");
        if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
            chromeOptions.addArguments("--proxy-server=" + nextProxy);
        }
        HashMap<String, Object> map = new HashMap<>();
        map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
        map.put("webrtc.multiple_routes_enabled", false);
        map.put("webrtc.nonproxied_udp_enabled", false);
        chromeOptions.setExperimentalOption("prefs", map);
        Random random = new Random();
        chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        driver.manage().window().maximize();
        return driver;
    }

    private void parsePagePre(SetOperations ops) {
        HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
        List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
        // Take entries 1..3499, clamped so a shorter list does not throw
        int size = buildAreaUrlLjs.size();
        List<BuildAreaUrlLj> buildAreaUrlLjs1 = buildAreaUrlLjs.subList(Math.min(1, size), Math.min(3500, size));
        for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs1) {
            if (ops.isMember(SKIP_URLS, buildAreaUrlLj.getAreaUrl())) {
                System.out.println("Skipping area " + buildAreaUrlLj.getCityName() + "-" + buildAreaUrlLj.getCountyName());
                continue;
            }
            pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
        }
    }

    /**
     * Parse the list pages
     * @param ops
     * @param opsForHash
     * @param buildAreaUrlLj
     */
    private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
        ChromeDriver driver = getChromeDriver();
        try {
            driver.get(buildAreaUrlLj.getAreaUrl());
            String windowHandlePage = driver.getWindowHandle();
            WebElement totalNumStr = validElement("//h2[@class='total fl']/span", driver);
            if (null != totalNumStr) {
                Integer total = Integer.valueOf(totalNumStr.getText());
                // There is data
                if (total > 1) {
                    String pageData = driver.findElement(By.xpath("//div[@class='page-box house-lst-page-box']")).getAttribute("page-data");
                    Integer pageNumStr = Integer.valueOf(JSON.parseObject(pageData).getString("totalPage"));
                    System.out.println("Pages in current area: " + pageNumStr + "---" + buildAreaUrlLj.getAreaUrl());
                    for (int x = 1; x <= pageNumStr; x++) {
                        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        for (int i = 0; i < elements.size(); i++) {
                            WebElement item = elements.get(i);
                            // Extract the community code from the link
                            String code = "";
                            Pattern compile1 = Pattern.compile("xiaoqu/(\\w+)/");
                            Matcher matcher1 = compile1.matcher(item.getAttribute("href"));
                            while (matcher1.find()) {
                                code = matcher1.group(1);
                            }
                            driver.executeScript("arguments[0].click();", item);
                            sleepAndCutoverNewPage(300, driver);
                            // If the code is already recorded, do not parse the detail page
                            if (!opsForHash.hasKey(AREA_INFO_COMMUNITY_CODE_LJ, code)) {
                                parseDetail(driver, code, buildAreaUrlLj, opsForHash);
                            } else {
                                System.out.println("Code already in Redis: " + code);
                                // update
                                // new TkBuildingsMonthPriceLj();
                            }

                            driver.close();
                            driver.switchTo().window(windowHandlePage);
                            sleep(200);
                            // Re-query the list so the element references are not stale
                            elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        }
                        if (x != pageNumStr) {
                            String nextPage = buildAreaUrlLj.getAreaUrl() + "pg" + (x + 1) + "/";
                            driver.get(nextPage);
                            System.out.println("Next page: " + nextPage);
                            sleep(200);
                        }
                    }
                }
            }
            ops.add(SKIP_URLS, buildAreaUrlLj.getAreaUrl());
        } catch (NumberFormatException e) {
            throw new RuntimeException("Exception in worker thread: " + e.getMessage());
        } finally {
            driver.quit();
        }
    }

    /**
     * Parse the detail page
     * @param driver
     * @param communityCode
     * @param buildAreaUrlLj
     * @param opsForHash
     */
    private void parseDetail(ChromeDriver driver, String communityCode, BuildAreaUrlLj buildAreaUrlLj, HashOperations<String, Object, Object> opsForHash) {
        LocalDateTime now1 = LocalDateTime.now();
        if (null != validElement("//span[@class='xiaoquUnitPrice']", driver)) {
            TkBuildingsMonthPriceLj lj = new TkBuildingsMonthPriceLj();
            lj.setCommunityCode(communityCode);
            String year = String.valueOf(LocalDate.now().getYear());
            // "挂牌均价" is the average listing price; otherwise the label reads "N月参考均价" (reference average price for month N)
            if (driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().equals("挂牌均价")) {
                lj.setYearmonth(DateFormatUtils.format(new Date(), "yyyyMM"));
            } else {
                String monthStr = driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().replace("月参考均价", "");
                String month = String.format("%02d", Integer.parseInt(monthStr));
                lj.setYearmonth(year + month);
            }
            lj.setAvgPrice(Integer.valueOf(driver.findElement(By.className("xiaoquUnitPrice")).getText()));
            lj.setGenerateType("0");
            lj.setCreateBy("1");
            lj.setCreateDate(new Date());
            lj.setUpdateBy("1");
            lj.setUpdateDate(new Date());
            lj.setDelFlag("0");
            System.out.println("Persist price: " + lj);
        }
        LocalDateTime now2 = LocalDateTime.now();
        TkBuildingsAreaInfolj infolj = new TkBuildingsAreaInfolj();
        infolj.setDataOrigin("lianjia");
        infolj.setGenerateType("0");
        infolj.setProvince(buildAreaUrlLj.getProvinceId());
        infolj.setCity(buildAreaUrlLj.getCityId());
        infolj.setArea(buildAreaUrlLj.getCountyId());
        infolj.setCommunity(validElement("//h1[@class='detailTitle']", driver) == null ? "" : driver.findElement(By.xpath("//h1[@class='detailTitle']")).getText());
        infolj.setCommunityCode(communityCode);
        // Each field is located by its Chinese label on the page: year built, building type, property fee,
        // property management company, developer, number of buildings, number of homes
        infolj.setBuildingYear(validElement("//span[text()='建筑年代']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='建筑年代']/parent::div/span[2]")).getText());
        infolj.setBuildingType(validElement("//span[text()='建筑类型']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='建筑类型']/parent::div/span[2]")).getText());
        infolj.setManageCost(validElement("//span[text()='物业费用']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='物业费用']/parent::div/span[2]")).getText());
        infolj.setManageCompany(validElement("//span[text()='物业公司']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='物业公司']/parent::div/span[2]")).getText());
        infolj.setManageDevlop(validElement("//span[text()='开发商']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='开发商']/parent::div/span[2]")).getText());
        infolj.setBuildingCount(validElement("//span[text()='楼栋总数']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='楼栋总数']/parent::div/span[2]")).getText());
        infolj.setRoomCount(validElement("//span[text()='房屋总数']", driver) == null ? "" : driver.findElement(By.xpath("//span[text()='房屋总数']/parent::div/span[2]")).getText());
        infolj.setCreateBy("1");
        infolj.setCreateDate(new Date());
        infolj.setUpdateBy("1");
        infolj.setUpdateDate(new Date());
        infolj.setDelFlag("0");
        System.out.println("Persist community: " + infolj);
    }

    /**
     * Collect the area URLs
     * @param driver
     * @param ops
     */
    private void getUrls(ChromeDriver driver, SetOperations ops) {
        driver.get("https://www.lianjia.com/city/");
        int count = 0;
        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        for (int i = 0; i < elements.size(); i++) {
            WebElement element = elements.get(i);
            String provinceName = element.findElement(By.xpath("./parent::li/parent::ul/parent::div/div")).getText();
            String areaName = element.getText();
            Boolean memberFlag = ops.isMember(URLS, areaName);
            if (memberFlag) {
                System.out.println("Area already processed, skipping: " + areaName);
                continue;
            }

            driver.executeScript("arguments[0].click();", element);
            String frontPage = driver.getWindowHandle();
            WebElement ershoufang = null;
            try {
                // "小区" is the "Communities" link in the city page navigation
                ershoufang = driver.findElement(By.linkText("小区"));
            } catch (Exception e) {
                ops.add(URLS, areaName);
                sleep(200);
                System.out.println(areaName + " has no community listing ====");
                driver.get("https://www.lianjia.com/city/");
                elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
                continue;
            }
            driver.executeScript("arguments[0].click();", ershoufang);
            sleepAndCutoverNewPage(500, driver);
            List<WebElement> citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            citys.forEach(e -> System.out.println("City level============" + e.getText() + "==" + e.getAttribute("href")));
            for (int j = 0; j < citys.size(); j++) {
                String countyName = citys.get(j).getText();
                driver.executeScript("arguments[0].click();", citys.get(j));
                sleep(200);
                if (validElement("//h2[@class='total fl']/span", driver) != null) {
                    String text = driver.findElement(By.xpath("//h2[@class='total fl']/span")).getText();
                    count += Integer.parseInt(text);
                    System.out.println(countyName + " " + text + " communities");
                    System.out.println("Running total: " + count);
                }
                List<WebElement> areas = null;
                try {
                    areas = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[2]/a"));
                } catch (Exception e) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                if (areas.size() == 0) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                saveDataCounty(countyName, areaName, provinceName, areas);
                sleep(100);
                citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            }

            ops.add(URLS, areaName);
            driver.close();
            driver.switchTo().window(frontPage);
            driver.get("https://www.lianjia.com/city/");
            sleep(200);
            elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        }
        System.out.println("Total: " + count);
    }

    private void saveDataCounty(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provincepo.getBusinessName());
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areapo.getBusinessName());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());
            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persist URL: " + buildAreaUrlLj);
        }
    }

    private void saveDataCity(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provinceName);
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areaName);
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());

            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persist URL: " + buildAreaUrlLj);
        }
    }

    /**
     * Look up province / city / district information by name
     * @param type 1 = province, 2 = city, 3 = district
     * @param businessName name
     * @param parentId parent id
     * @return
     */
    private IdAndNamePO queryProvinceCityArea(Integer type, String businessName, String parentId) {
        if (StringUtils.isNotBlank(parentId)) {
            // Parent ids handled as municipalities: their city level is named "市辖区"
            ArrayList<String> citys = new ArrayList<>(8);
            citys.add("50");
            citys.add("11");
            citys.add("31");
            citys.add("12");
            if (citys.contains(parentId)) {
                businessName = "市辖区";
            }
        }
        IdAndNamePO po = null;
        try {
            if (type == 1) {
                // po = buildingsAvgMapper.queryProvinceIdByName(businessName);
            } else if (type == 2) {
                // po = buildingsAvgMapper.queryCityIdByName(businessName, parentId);
            } else if (type == 3) {
                // po = buildingsAvgMapper.querycountyIdByName(businessName, parentId);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (null == po) {
            po = new IdAndNamePO();
            po.setBusinessId("-1");
            po.setBusinessName(businessName);
        }
        return po;
    }

    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
        }
        return null;
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
        }
    }

    /** Returns the element matched by the XPath, or null if it does not exist. */
    public static WebElement validElement(String str, WebDriver driver) {
        try {
            return driver.findElement(By.xpath(str));
        } catch (Exception e) {
            System.out.println("Element not found: " + str);
        }
        return null;
    }
}
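Two small pieces of string handling in the crawler above can be checked in isolation: pulling the community code out of a link with the pattern xiaoqu/(\w+)/, and building the URL of page N by appending "pgN/" to the area URL. The sketch below does both with the JDK only. The class name LianjiaUrls and the sample URLs are made up for the example, and unlike the loop in the crawler it returns the first match rather than the last.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LianjiaUrls {

    private static final Pattern COMMUNITY_CODE = Pattern.compile("xiaoqu/(\\w+)/");

    /** Returns the path segment that follows "xiaoqu/", if the link has one. */
    static Optional<String> communityCode(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher m = COMMUNITY_CODE.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Page 1 is the area URL itself; later pages append "pgN/". */
    static String pageUrl(String areaUrl, int page) {
        String base = areaUrl.endsWith("/") ? areaUrl : areaUrl + "/";
        return page <= 1 ? base : base + "pg" + page + "/";
    }

    public static void main(String[] args) {
        // Hypothetical links in the shape the crawler expects.
        System.out.println(communityCode("https://bj.lianjia.com/xiaoqu/1111027377595/").orElse("none"));
        System.out.println(pageUrl("https://bj.lianjia.com/xiaoqu/dongcheng/", 2));
    }
}
```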

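Notes

1. driver.close() closes the current page, while driver.quit() exits the browser process. When looping over a list of pages, the browser will eat up all available memory if the process is never quit.
2. After navigating to a page, use an explicit wait where possible, so that lookups do not fail because the element has not loaded yet.
3. Do not send requests too frequently; if you have special requirements, add a proxy.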
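For note 3, one simple way to keep the request rate down is to enforce a minimum interval between page loads and share that limiter across all worker threads. The class below is a minimal sketch of that idea using only the JDK; the name Throttle and the 200 ms interval are arbitrary choices for the example, not something taken from the cases above.

```java
final class Throttle {

    private final long minIntervalMillis;
    private long lastRequestMillis;
    private boolean first = true;

    Throttle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** How long a request arriving at nowMillis still has to wait. */
    static long delayNeeded(long lastRequestMillis, long nowMillis, long minIntervalMillis) {
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    /** Blocks until at least minIntervalMillis has passed since the previous call. */
    synchronized void acquire() {
        if (!first) {
            long delay = delayNeeded(lastRequestMillis, System.currentTimeMillis(), minIntervalMillis);
            if (delay > 0) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        first = false;
        lastRequestMillis = System.currentTimeMillis();
    }

    public static void main(String[] args) {
        Throttle throttle = new Throttle(200);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            throttle.acquire(); // call this right before each driver.get(...)
            System.out.println("request " + i + " at +" + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```

Because acquire() is synchronized, a single Throttle instance shared by all threads spaces out requests globally rather than per thread.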

Afterword

Source code for the cases above

https://download.csdn.net/download/DoAsOnePleases/86772623

Tags: java selenium chrome

Reposted from: https://blog.csdn.net/DoAsOnePleases/article/details/127357495
Copyright belongs to the original author 萌坑. If there is any infringement, please contact us for removal.
