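Articles in this series
Querying an article's quality score with Java + Selenium
Querying the quality scores of a blogger's Top 40 articles with Java + Selenium
Querying the quality scores of a blogger's Top 100 articles with Java + Selenium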
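Preface
Hello everyone, I'm 青花. This post covers "Querying the quality scores of a blogger's Top 100 articles with Java + Selenium". It makes a simple optimization to the previous Top 40 article: when querying quality scores, Chrome is no longer opened and closed for every article, which avoids repeatedly loading the Chrome driver and launching the browser.
Note: with the previous chapter's approach, loading 100 articles triggered an access restriction at around the 50th to 60th article.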
1. Environment setup
Browser: Chrome (used throughout this article)
Chrome version: 113
ChromeDriver version: 113 (see the first article in this Java crawler series)
Java version: JDK 1.8
Selenium version: 4.9.1
2. Querying a blogger's Top 100 articles
2.1 Modify the pom.xml configuration
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.9.1</version>
</dependency>
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.10.1</version>
</dependency>
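2.2 Configure the Chrome driver (the SeleniumUtil class, which holds the driver location and the image save path)
package com.kelvin.spiderx.util;

import lombok.extern.slf4j.Slf4j;

/***
 * @title SeleniumUtil
 * @description Selenium helper class
 * @author Kelvin
 * @create 2023/6/21 22:47
 **/
@Slf4j
public class SeleniumUtil {
    // Local path to the chromedriver binary
    public final static String CHROMEDRIVERPATH = "/Users/apple/Downloads/chromedriver_mac64/chromedriver";
    // Base directory for saved images (note: Java does not expand "~" by itself)
    public final static String LOCATION_IMG_BASE_PATH = "~/java/code/spiderX/img/";

    // Pause the current thread for m milliseconds
    public static void sleep(int m) {
        try {
            Thread.sleep(m);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see the interruption
            Thread.currentThread().interrupt();
            log.warn("sleep interrupted", e);
        }
    }
}
The calling code registers the driver path before creating the browser:
System.setProperty("webdriver.chrome.driver", SeleniumUtil.CHROMEDRIVERPATH); // chromedriver local path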
2.3 Set up the browser options
WebDriver driver;
ChromeOptions chromeOptions = new ChromeOptions();
2.4 Enable headless mode
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--remote-allow-origins=*");
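2.5 Start the browser instance with the configured options
// Create the driver only after every option has been added to chromeOptions;
// options added after this call do not affect the running browser.
driver = new ChromeDriver(chromeOptions);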
2.6 Window settings
chromeOptions.addArguments("--no-sandbox"); // --start-maximized
2.7 Disable image loading
// Add settings that stop images from loading
HashMap<String, Object> prefs = new HashMap<>();
prefs.put("profile.default_content_settings", 2);
chromeOptions.setExperimentalOption("prefs", prefs);
chromeOptions.addArguments("blink-settings=imagesEnabled=false"); // disable images
2.8 Load the blogger's page
String baseUrl = "https://blog.csdn.net/s445320?type=blog";
2.9 Load the article list
// Locate the article list
WebElement mainSelectE = driver.findElement(By.cssSelector("div.mainContent"));
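2.10 Load the next page
Simulate dragging the browser scrollbar down so that the next page of data is loaded.
// Load the next page
JavascriptExecutor jsDriver = (JavascriptExecutor) driver; // cast the driver to JavascriptExecutor so it can run scripts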
jsDriver.executeScript("window.scrollTo(0, 50)");
jsDriver.executeScript("window.scrollTo(0, document.body.scrollHeight-20)");
SeleniumUtil.sleep(500);
jsDriver.executeScript("window.scrollTo(0, document.body.scrollHeight +1)");
SeleniumUtil.sleep(2000);
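The scrolling above is repeated until the page stops producing new articles or enough have been loaded. That stopping rule can be sketched on its own, without Selenium. The names below (ScrollUntilStable, scrollOnce, itemCount, maxRounds) are illustrative and not part of the original code, and the maxRounds guard is an added assumption to keep the loop from running forever.

```java
import java.util.function.IntSupplier;

/** Sketch of the "scroll until nothing new loads" rule, independent of Selenium. */
final class ScrollUntilStable {

    /**
     * Calls scrollOnce repeatedly until the item count stops growing,
     * reaches topNum, or maxRounds scrolls have been made.
     * Returns the final item count.
     */
    static int load(Runnable scrollOnce, IntSupplier itemCount, int topNum, int maxRounds) {
        int previous = -1;
        int current = itemCount.getAsInt();
        for (int round = 0; round < maxRounds && current < topNum && current != previous; round++) {
            previous = current;
            scrollOnce.run();               // in the crawler: the executeScript scroll calls plus sleeps
            current = itemCount.getAsInt(); // in the crawler: the size of the located link list
        }
        return current;
    }

    public static void main(String[] args) {
        // Fake page: every scroll reveals 20 more articles, up to 65 in total.
        int[] loaded = {20};
        int count = load(() -> loaded[0] = Math.min(loaded[0] + 20, 65), () -> loaded[0], 100, 50);
        System.out.println(count); // prints 65
    }
}
```

In the crawler, scrollOnce would wrap the jsDriver.executeScript calls and itemCount would return the number of article links currently found on the page.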
2.11 Limit loading to 100 entries
// Number of articles to fetch (Top 100)
int topNum = 100;
// Inside the scroll loop: once at least topNum links are loaded, collect them and stop
if (webElements.size() >= topNum) {
    for (WebElement element : webElements) {
        System.out.println(element.getAttribute("href"));
        blogUrlList.add(element.getAttribute("href"));
    }
    log.info("Articles read: {}, maximum allowed: {}", webElements.size(), topNum);
    break;
}
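Because links are loaded a page at a time, the list may hold more than topNum entries when the loop stops. A minimal sketch of trimming it to the first topNum entries follows. TopN and firstN are hypothetical helper names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: keep at most the first n entries of a list. */
final class TopN {

    /** Returns a new list with the first n items, or all items if there are fewer than n. */
    static <T> List<T> firstN(List<T> items, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must not be negative");
        }
        return new ArrayList<>(items.subList(0, Math.min(n, items.size())));
    }

    public static void main(String[] args) {
        // Placeholder values standing in for article URLs
        List<String> urls = Arrays.asList("url-1", "url-2", "url-3");
        System.out.println(firstN(urls, 2));
    }
}
```

Applied to the snippet above, this would be called on the collected href values before the quality-score queries start.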
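2.12 Avoiding repeated opening and closing of Chrome
Optimization 1
Compared with the previous Top 40 article, all quality-score queries now share a single browser: Chrome is no longer opened and closed for each article, so the Chrome driver is loaded and the browser launched only once.
Optimization 2
The quality-score query page is loaded only on the first query; later queries just enter a different article URL into the same page and read the result.
/**
 * Fetch the quality data for one article
 */
CsdnBlogInfo csdnQcBySelenium(String blogUrl, WebDriver driver, boolean isFirst) {
    log.info("csdnQcBySelenium start!");
    CsdnBlogInfo csdnBlogInfo = new CsdnBlogInfo();
    // Load the quality-score query page on the first call only
    if (isFirst) {
        driver.get("https://www.csdn.net/qc");
        SeleniumUtil.sleep(500);
    }
    // ... the rest of the method appears in the full listing in section 4
}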
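The "load the page once, then reuse it" idea can be shown without a browser. In this sketch the two callbacks stand in for driver.get(...) and for the enter-URL-and-read-result steps. ReusedPageSession and its members are hypothetical names chosen for illustration, not the author's code.

```java
import java.util.function.Function;

/** Sketch: open the query page on the first request only, then reuse it. */
final class ReusedPageSession {
    private final Runnable openQueryPage;          // stands in for driver.get("https://www.csdn.net/qc")
    private final Function<String, String> query;  // stands in for typing a URL and reading the result
    private boolean pageOpened = false;

    ReusedPageSession(Runnable openQueryPage, Function<String, String> query) {
        this.openQueryPage = openQueryPage;
        this.query = query;
    }

    /** Opens the page if it has not been opened yet, then runs the query for one article URL. */
    String scoreOf(String blogUrl) {
        if (!pageOpened) {
            openQueryPage.run();
            pageOpened = true;
        }
        return query.apply(blogUrl);
    }

    public static void main(String[] args) {
        int[] opens = {0};
        ReusedPageSession session = new ReusedPageSession(() -> opens[0]++, url -> "result for " + url);
        session.scoreOf("article-1");
        session.scoreOf("article-2");
        System.out.println("page opened " + opens[0] + " time(s)"); // prints "page opened 1 time(s)"
    }
}
```

Keeping the flag inside the session object is one alternative to passing isFirst from the caller. The article's own code passes the flag as a parameter instead.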
2.13 Results
00:44:50.693 [main] INFO com.kelvin.spiderx.service.CsdnQcService - csdnQcBySelenium start!
00:45:03.151 [main] INFO com.kelvin.spiderx.service.CsdnQcService - Articles read: 100, maximum allowed: 100
00:45:03.151 [main] INFO com.kelvin.spiderx.service.CsdnQcService - blogUrlList size:100
00:45:03.294 [main] INFO com.kelvin.spiderx.service.CsdnQcService - blogUrlList:
[{"title":"[刷題] 刪除有序數組中的重復項","posttime":"- 青花鎖 · 2023-06-25 12:49:42 -","score":"82","remark":"文章質量良好"},{"title":"[Selenium] 通過Java+Selenium查詢某個博主的Top40文章質量分","posttime":"- 青花鎖 · 2023-06-25 01:22:55 -","score":"87","remark":"文章質量良好"},{"title":"[Selenium] 通過Java+Selenium查詢文章質量分","posttime":"- 青花鎖 · 2023-06-23 08:42:36 -","score":"86","remark":"文章質量良好"},{"title":"【并發知識點】CAS的實現原理及應用","posttime":"- 青花鎖 · 2023-06-22 11:55:48 -","score":"90","remark":"文章質量良好"},{"title":"【并發知識點】AQS的實現原理及應用","posttime":"- 青花鎖 · 2023-06-22 11:35:53 -","score":"90","remark":"文章質量良好"},{"title":"簡單介紹html/javascript、ajax應用","posttime":"- 青花鎖 · 2023-06-22 10:18:01 -","score":"92","remark":"文章質量良好"},{"title":"[設計模式] OOP六大原則","posttime":"- 青花鎖 · 2023-06-22 01:24:19 -","score":"89","remark":"文章質量良好"},{"title":"[Web前端] Servlet及應用","posttime":"- 青花鎖 · 2023-06-22 01:07:56 -","score":"91","remark":"文章質量良好"},{"title":"【在線商城系統】數據來源-爬蟲篇","posttime":"- 青花鎖 · 2023-06-22 00:48:46 -","score":"87","remark":"文章質量良好"},{"title":"《項目實戰》構建SpringCloud alibaba項目(三、構建服務方子工程store-user-service)","posttime":"- 青花鎖 · 2023-06-21 18:20:46 -","score":"86","remark":"文章質量良好"},{"title":"《項目實戰》構建SpringCloud alibaba項目(二、構建微服務鑒權子工程store-authority-service)","posttime":"- 青花鎖 · 2023-06-19 17:24:53 -","score":"86","remark":"文章質量良好"},{"title":"《項目實戰》使用JDBC手寫分庫","posttime":"- 青花鎖 · 2023-06-16 17:56:03 -","score":"88","remark":"文章質量良好"},{"title":"《項目實戰》構建SpringCloud alibaba項目(一、構建父工程、公共庫、網關))","posttime":"- 青花鎖 · 2023-06-15 20:41:46 -","score":"92","remark":"文章質量良好"},{"title":"《項目實戰》 Jenkins 與 CICD、發布腳本","posttime":"- 青花鎖 · 2023-06-13 15:53:46 -","score":"90","remark":"文章質量良好"},{"title":"《微服務實戰》 第三十二章 微服務鏈路跟蹤-sleuth zipkin","posttime":"- 青花鎖 · 2023-06-11 18:41:09 -","score":"90","remark":"文章質量良好"},{"title":"《微服務實戰》 第三十一章 ShardingSphere - ShardingSphere-JDBC","posttime":"- 青花鎖 · 2023-06-11 18:25:34 -","score":"80","remark":"文章質量良好"},{"title":"《微服務實戰》 第三十章 分布式事務框架seata TCC模式","posttime":"- 青花鎖 · 2023-06-11 10:38:21 -","score":"89","remark":"文章質量良好"},{"title":"《微服務實戰》 第二十九章 分布式事務框架seata AT模式","posttime":"- 青花鎖 · 2023-06-11 10:23:44 -","score":"92","remark":"文章質量良好"},{"title":"《微服務實戰》 第二十八章 分布式鎖框架-Redisson","posttime":"- 青花鎖 · 2023-06-09 17:55:59 -","score":"91","remark":"文章質量良好"},{"title":"【項目實戰】一、Spring boot整合JWT、Vue案例展示用戶鑒權","posttime":"- 青花鎖 · 2023-06-09 09:06:43 -","score":"92","remark":"文章質量良好"},{"title":"《微服務實戰》 第二十七章 CAS","posttime":"- 青花鎖 · 2023-05-30 16:26:30 -","score":"87","remark":"文章質量良好"},{"title":"《微服務實戰》 第二十六章 Java鎖的分類","posttime":"- 青花鎖 · 2023-05-29 17:28:47 -","score":"91","remark":"文章質量良好"},{"title":"《微服務實戰》 第二十五章 Java多線程安全與鎖","posttime":"- 青花鎖 · 2023-05-29 16:07:53 -","score":"91","remark":"文章質量良好"}.....]
00:45:03.296 [main] INFO com.kelvin.spiderx.service.CsdnQcService - This blogger has articles; start parsing quality scores!
00:45:03.296 [main] INFO com.kelvin.spiderx.service.CsdnQcService - csdnQcBySelenium start!
---------- parsing output omitted
Quality scores for this blogger:
00:50:40.699 [main] INFO com.kelvin.spiderx.service.CsdnQcService - {"title":"[刷題] 刪除有序數組中的重復項","posttime":"- 青花鎖 · 2023-06-25 12:49:42 -","score":"82","remark":"文章質量良好"},{"title":"[Selenium] 通過Java+Selenium查詢某個博主的Top40文章質量分","posttime":"- 青花鎖 · 2023-06-25 01:22:55 -","score":"87","remark":"文章質量良好"},{"title":"[Selenium] 通過Java+Selenium查詢文章質量分","posttime":"- 青花鎖 · 2023-06-23 08:42:36 -","score":"86","remark":"文章質量良好"},{"title":"【并發知識點】CAS的實現原理及應用","posttime":"- 青花鎖 · 2023-06-22 11:55:48 -","score":"90","remark":"文章質量良好"},{"title":"【并發知識點】AQS的實現原理及應用","posttime":"- 青花鎖 · 2023-06-22 11:35:53 -","score":"90","remark":"文章質量良好"},{"title":"簡單介紹html/javascript、ajax應用","posttime":"- 青花鎖 · 2023-06-22 10:18:01 -","score":"92","remark":"文章質量良好"},{"title":"[設計模式] OOP六大原則","posttime":"- 青花鎖 · 2023-06-22 01:24:19 -","score":"89","remark":"文章質量良好"},{"title":"[Web前端] Servlet及應用","posttime":"- 青花鎖 · 2023-06-22 01:07:56 -","score":"91","remark":"文章質量良好"}....]
00:50:40.693 [main] INFO com.kelvin.spiderx.service.CsdnQcService - csdnQcBySelenium end!
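In the output above, the posttime field combines the author name and the publish time in one string shaped like "- author · 2023-06-25 12:49:42 -". A hedged sketch of splitting it apart is shown below. It assumes the separator is the middle dot (U+00B7) exactly as it appears in the sample, and PostTimeParser is a hypothetical helper that the original code does not contain.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Sketch: split a value like "- author · 2023-06-25 12:49:42 -" into author and timestamp. */
final class PostTimeParser {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Returns {author, timestamp text} with the surrounding dashes and spaces removed. */
    private static String[] parts(String raw) {
        String trimmed = raw.trim();
        if (trimmed.startsWith("-")) {
            trimmed = trimmed.substring(1);
        }
        if (trimmed.endsWith("-")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        // Assumption: the separator is the middle dot U+00B7 seen in the sample output
        String[] pieces = trimmed.split("\u00B7", 2);
        if (pieces.length != 2) {
            throw new IllegalArgumentException("unexpected posttime format: " + raw);
        }
        return new String[] { pieces[0].trim(), pieces[1].trim() };
    }

    static String author(String raw) {
        return parts(raw)[0];
    }

    static LocalDateTime postedAt(String raw) {
        return LocalDateTime.parse(parts(raw)[1], FORMAT);
    }

    public static void main(String[] args) {
        String raw = "- author \u00B7 2023-06-25 12:49:42 -";
        System.out.println(author(raw));   // prints "author"
        System.out.println(postedAt(raw)); // prints "2023-06-25T12:49:42"
    }
}
```

Parsing the timestamp into a LocalDateTime makes it possible to sort or filter results by publish date. The article itself keeps the field as a plain string.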
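3. Querying article quality scores in a loop
Querying an article's quality score with Java + Selenium
See the article above for how a single article's quality score is queried. The previous chapter optimized the return value and disabled image loading. This chapter optimizes the driver: all query pages share one driver, and during the quality-score stage the query page is loaded only on the first call.
Note: for concurrent querying later on, a pool of drivers could be considered: split the 100 articles into N segments and process each segment separately to improve performance.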
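The segmenting idea in the note can be sketched with the JDK alone. The code below splits a URL list into N contiguous segments and hands each to a worker thread. UrlSegments and split are illustrative names, the URLs are placeholders, and the worker only counts its URLs where a real crawler would create its own WebDriver, query its segment, and quit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: split work into N segments and process the segments in parallel. */
final class UrlSegments {

    /** Splits items into at most n contiguous segments whose sizes differ by at most one. */
    static <T> List<List<T>> split(List<T> items, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<T>> segments = new ArrayList<>();
        int size = items.size();
        int base = size / n;   // minimum segment length
        int extra = size % n;  // the first 'extra' segments get one more item
        int start = 0;
        for (int i = 0; i < n && start < size; i++) {
            int len = base + (i < extra ? 1 : 0);
            segments.add(new ArrayList<>(items.subList(start, start + len)));
            start += len;
        }
        return segments;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            urls.add("https://example.com/article/" + i); // placeholder URLs
        }
        List<List<String>> segments = split(urls, 3);
        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> segment : segments) {
                // Placeholder task: a real worker would drive its own browser here
                Callable<Integer> task = segment::size;
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            System.out.println(segments.size() + " segments, " + total + " URLs processed");
        } finally {
            pool.shutdown();
        }
    }
}
```

Giving each worker its own driver matters because a single browser session cannot serve several threads at once. Given the access restriction mentioned in the preface, the number of segments would likely need to stay small.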
4. Code
package com.kelvin.spiderx.service;
import com.google.gson.Gson;
import com.kelvin.spiderx.util.SeleniumUtil;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
/***
 * @title CsdnQcService
 * @description Queries CSDN quality scores
 * @author LTF
 * @create 2023/6/21 23:02
 **/
@Slf4j
public class CsdnQcService {

    // Quality-score result for one article
    @Data
    static class CsdnBlogInfo {
        private String title;
        private String posttime;
        private String score;
        private String remark;
    }
    /**
     * Fetch the quality data for one article.
     * The query page is loaded only when isFirst is true; later calls reuse the open page.
     */
    CsdnBlogInfo csdnQcBySelenium(String blogUrl, WebDriver driver, boolean isFirst) {
        log.info("csdnQcBySelenium start!");
        CsdnBlogInfo csdnBlogInfo = new CsdnBlogInfo();
        if (isFirst) {
            driver.get("https://www.csdn.net/qc");
            SeleniumUtil.sleep(500);
        }
        // Locate the input box
        WebElement inputSelectE = driver.findElement(By.cssSelector("input.el-input__inner"));
        // Clear any URL left over from the previous query, since the page is reused
        inputSelectE.clear();
        // Enter the article URL
        inputSelectE.sendKeys(blogUrl);
        SeleniumUtil.sleep(100);
        // Locate the query button
        WebElement qcSelectE = driver.findElement(By.cssSelector("div.trends-input-box-btn"));
        // Click the query button
        qcSelectE.click();
        SeleniumUtil.sleep(1000);
        // Get the right-hand area, which holds the quality-score result
        WebElement mainSelectE = driver.findElement(By.cssSelector("div.csdn-body-right"));
        // Convert it to a Jsoup document for parsing
        Document doc = Jsoup.parse(mainSelectE.getAttribute("outerHTML"));
        // Article title
        String title = doc.select("span.title").text();
        if (!StringUtils.isEmpty(title)) {
            csdnBlogInfo.setTitle(title);
        }
        // Author and publish time
        String posttime = doc.select("span.name").text();
        if (!StringUtils.isEmpty(posttime)) {
            csdnBlogInfo.setPosttime(posttime);
        }
        // Quality score
        String score = doc.select("p.img").text();
        if (!StringUtils.isEmpty(score)) {
            csdnBlogInfo.setScore(score);
        }
        // Suggestion attached to the quality score
        String remark = doc.select("p.desc").text();
        if (!StringUtils.isEmpty(remark)) {
            csdnBlogInfo.setRemark(remark);
        }
        // Print the result
        log.info("Title: {} , author and publish time: {} , quality score: {} , suggestion: {}", title, posttime, score, remark);
        // The shared driver is not closed here; the caller quits it after the last article
        log.info("csdnQcBySelenium end!");
        return csdnBlogInfo;
    }
    /**
     * Query the quality scores of the specified blogger's articles
     */
    void allBlogQcDataBySelenium() {
        String baseUrl = "https://blog.csdn.net/s445320?type=blog";
        System.setProperty("webdriver.chrome.driver", SeleniumUtil.CHROMEDRIVERPATH); // chromedriver local path
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.addArguments("--remote-allow-origins=*");
        chromeOptions.addArguments("--no-sandbox"); // --start-maximized
        // Add settings that stop images from loading
        HashMap<String, Object> prefs = new HashMap<>();
        prefs.put("profile.default_content_settings", 2);
        chromeOptions.setExperimentalOption("prefs", prefs);
        chromeOptions.addArguments("blink-settings=imagesEnabled=false"); // disable images
        WebDriver driver = new ChromeDriver(chromeOptions);
        try {
            driver.get(baseUrl);
            SeleniumUtil.sleep(200);
            // Locate the article list
            WebElement mainSelectE = driver.findElement(By.cssSelector("div.mainContent"));
            // Number of articles to fetch (Top 100)
            int topNum = 100;
            // Article count seen on the previous pass
            int prePoint = 0;
            List<String> blogUrlList = new ArrayList<>();
            List<WebElement> webElements = null;
            // Cast the driver to JavascriptExecutor so it can run scripts
            JavascriptExecutor jsDriver = (JavascriptExecutor) driver;
            while (true) {
                // Scroll to the bottom to trigger loading of the next page
                jsDriver.executeScript("window.scrollTo(0, 50)");
                jsDriver.executeScript("window.scrollTo(0, document.body.scrollHeight-20)");
                SeleniumUtil.sleep(500);
                jsDriver.executeScript("window.scrollTo(0, document.body.scrollHeight +1)");
                SeleniumUtil.sleep(2000);
                webElements = mainSelectE.findElements(By.cssSelector("article.blog-list-box>a"));
                // Same count as on the previous pass: every article has been loaded
                if (webElements.size() == prePoint) {
                    log.info("All articles have been read");
                    break;
                }
                prePoint = webElements.size();
                // Enough articles are loaded: stop scrolling
                if (webElements.size() >= topNum) {
                    log.info("Articles read: {}, maximum allowed: {}", webElements.size(), topNum);
                    break;
                }
            }
            // Collect the article links, keeping at most topNum of them
            for (WebElement element : webElements) {
                if (blogUrlList.size() >= topNum) {
                    break;
                }
                System.out.println(element.getAttribute("href"));
                blogUrlList.add(element.getAttribute("href"));
            }
            log.info("blogUrlList size:{}", blogUrlList.size());
            log.info("blogUrlList:{}", new Gson().toJson(blogUrlList));
            if (CollectionUtils.isEmpty(blogUrlList)) {
                log.info("This blogger has not published any articles!");
            } else {
                log.info("This blogger has articles; start parsing quality scores!");
                List<CsdnBlogInfo> csdnBlogInfoList = new ArrayList<>();
                int num = 0;
                for (String blogUrl : blogUrlList) {
                    try {
                        // Only the first successful call loads the query page
                        CsdnBlogInfo csdnBlogInfo = this.csdnQcBySelenium(blogUrl, driver, num == 0);
                        if (null != csdnBlogInfo) {
                            csdnBlogInfoList.add(csdnBlogInfo);
                        }
                        num++;
                    } catch (Exception e) {
                        log.warn("Failed to parse the quality score, article: {}", blogUrl, e);
                    }
                }
                if (CollectionUtils.isEmpty(csdnBlogInfoList)) {
                    log.info("Failed to parse quality scores!");
                } else {
                    log.info("Quality scores for this blogger:");
                    log.info(new Gson().toJson(csdnBlogInfoList));
                }
            }
        } finally {
            // Always close the shared browser, even if a step above throws
            driver.quit();
        }
        log.info("Finished reading data! the end!");
    }
    public static void main(String[] args) {
        CsdnQcService csdnQcService = new CsdnQcService();
        csdnQcService.allBlogQcDataBySelenium();
    }
}
Summary
That concludes querying the quality scores of a blogger's Top 100 articles with Java + Selenium. There is still plenty of room for optimization; the focus here was on getting a working result.