SpringBoot + ES + Jsoup: Implementing JD Search
Project demo
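1. Feature overview
A Jsoup crawler scrapes product information from the JD (Jingdong) store and stores it in ElasticSearch. HTTP requests then run full-text searches against that index, and matching keywords are highlighted in the results.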
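2. Tools
Jsoup: jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
httpclient: HttpClient began as a subproject of Apache Jakarta Commons and is now maintained as part of Apache HttpComponents. It is an efficient, feature-rich client-side toolkit for the HTTP protocol that keeps up with current versions of the standard. Note that the scraper in step 3.5 sends its request with the JDK's HttpURLConnection, so the httpclient dependency is declared but not exercised there.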
3. Steps
3.1 Create a Spring Boot project
3.2 Select the required integrations
3.3 Add the jar dependencies the project needs (watch for version conflicts between Spring Boot and ES)
Version compatibility:
Spring Data Release Train | Spring Data Elasticsearch | Elasticsearch | Spring Framework | Spring Boot |
---|---|---|---|---|
2021.2 (Raj) | 4.4.x | 7.17.4 | 5.3.x | 2.7.x |
2021.1 (Q) | 4.3.x | 7.15.2 | 5.3.x | 2.6.x |
2021.0 (Pascal) | 4.2.x[1] | 7.12.0 | 5.3.x | 2.5.x |
2020.0 (Ockham)[1] | 4.1.x[1] | 7.9.3 | 5.3.2 | 2.4.x |
Neumann[1] | 4.0.x[1] | 7.6.2 | 5.2.12 | 2.3.x |
Moore[1] | 3.2.x[1] | 6.8.12 | 5.2.12 | 2.2.x |
Lovelace[1] | 3.1.x[1] | 6.2.2 | 5.1.19 | 2.1.x |
Kay[1] | 3.0.x[1] | 5.5.0 | 5.0.13 | 2.0.x |
Ingalls[1] | 2.1.x[1] | 2.4.0 | 4.3.25 | 1.5.x |
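If the project has to check its versions programmatically (for example in a startup sanity check), the table can be turned into a small lookup. The following is only a sketch: `EsCompatibility` and `esVersionFor` are hypothetical names, not part of Spring, and the data is copied from the rows above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Lookup of the compatibility table above (hypothetical helper, not a Spring API). */
final class EsCompatibility {
    // Spring Boot release line -> Elasticsearch version, copied from the table.
    private static final Map<String, String> BOOT_TO_ES = new LinkedHashMap<>();
    static {
        BOOT_TO_ES.put("2.7", "7.17.4");
        BOOT_TO_ES.put("2.6", "7.15.2");
        BOOT_TO_ES.put("2.5", "7.12.0");
        BOOT_TO_ES.put("2.4", "7.9.3");
        BOOT_TO_ES.put("2.3", "7.6.2");
        BOOT_TO_ES.put("2.2", "6.8.12");
        BOOT_TO_ES.put("2.1", "6.2.2");
        BOOT_TO_ES.put("2.0", "5.5.0");
        BOOT_TO_ES.put("1.5", "2.4.0");
    }

    /** Returns the ES version for a Boot version such as "2.7.1", or null if the line is not listed. */
    static String esVersionFor(String bootVersion) {
        String[] parts = bootVersion.split("\\.");
        if (parts.length < 2) {
            return null;
        }
        return BOOT_TO_ES.get(parts[0] + "." + parts[1]);
    }

    public static void main(String[] args) {
        System.out.println(esVersionFor("2.7.1"));
    }
}
```

Only the major and minor parts of the Boot version are used, because the table lists release lines rather than patch versions.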
Maven dependencies to add:
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>
<!-- Web page parsing: jsoup (use tika for video) -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.4.6</version>
</dependency>
<!-- HttpClient -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
</dependency>
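3.4 Write the ES client configuration class ElasticSearchClientConfig (registers the client as a Spring-managed bean)
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ElasticSearchClientConfig {

    // Single-node client for a local ES instance; HttpHost defaults to the http scheme
    @Bean
    public RestHighLevelClient restHighLevelClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("127.0.0.1", 9200)));
    }
}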
3.5 Write the scraper utility class HtmlParseUtil
import cn.hutool.core.net.url.UrlBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// HTML parsing utility
public class HtmlParseUtil {

    public static void main(String[] args) throws IOException {
        // "洗衣機" means "washing machine"
        List<Content> list = HtmlParseUtil.parseJDSearchKeyByPage("洗衣機", 2);
        System.out.println(list.size());
    }

    // Scrape result pages 1..page for the keyword and merge the items
    public static List<Content> parseJDSearchKeyByPage(String key, int page) throws IOException {
        List<Content> list = new ArrayList<>();
        for (int i = 1; i <= page; i++) {
            list.addAll(HtmlParseUtil.parseJDSearchKey(key, i));
        }
        return list;
    }

    public static List<Content> parseJDSearchKey(String key, int page) throws IOException {
        // Build the URL path and query parameters
        String url = UrlBuilder.create()
                .setScheme("https")
                .setHost("search.jd.com")
                .addPath("Search")
                .addQuery("keyword", key)
                .addQuery("enc", "utf-8")
                .addQuery("page", String.valueOf(2 * page - 1)) // result page n is requested as page=2n-1
                .build();
        HttpURLConnection httpConn = (HttpURLConnection) new URL(url).openConnection();
        httpConn.setRequestMethod("GET");
        // Send browser-like headers so JD's anti-crawler checks are less likely to block the request
        httpConn.setRequestProperty("authority", "search.jd.com");
        httpConn.setRequestProperty("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
        httpConn.setRequestProperty("accept-language", "zh-CN,zh;q=0.9");
        httpConn.setRequestProperty("cache-control", "max-age=0");
        // Paste the cookie of your own logged-in browser session here; never publish a real session cookie
        httpConn.setRequestProperty("cookie", "<your JD cookie>");
        httpConn.setRequestProperty("sec-ch-ua", "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"");
        httpConn.setRequestProperty("sec-ch-ua-mobile", "?0");
        httpConn.setRequestProperty("sec-ch-ua-platform", "\"macOS\"");
        httpConn.setRequestProperty("sec-fetch-dest", "document");
        httpConn.setRequestProperty("sec-fetch-mode", "navigate");
        httpConn.setRequestProperty("sec-fetch-site", "none");
        httpConn.setRequestProperty("sec-fetch-user", "?1");
        httpConn.setRequestProperty("upgrade-insecure-requests", "1");
        httpConn.setRequestProperty("user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");
        InputStream responseStream = httpConn.getResponseCode() / 100 == 2
                ? httpConn.getInputStream()
                : httpConn.getErrorStream();
        String response = "";
        if (responseStream != null) {
            // Decode explicitly as UTF-8 and close the stream when done
            try (Scanner s = new Scanner(responseStream, "UTF-8").useDelimiter("\\A")) {
                response = s.hasNext() ? s.next() : "";
            }
        }
        Document document = Jsoup.parse(response);
        // Document document = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36").cookie("wlfstk_smdl","4jxg7p5cy2jz7afp41rull7hc3y9mkjr").timeout(30000).get();
        Element goodsList = document.getElementById("J_goodsList");
        if (goodsList == null) {
            // The request was blocked or the page layout changed
            return new ArrayList<>();
        }
        Elements wraps = goodsList.getElementsByClass("gl-warp");
        if (wraps.isEmpty()) {
            return new ArrayList<>();
        }
        List<Content> contents = new ArrayList<>();
        for (Element child : wraps.get(0).children()) {
            // The image URL is stored in the lazy-loading attribute, not in src
            String img = child.getElementsByTag("img").eq(0).attr("data-lazy-img");
            String price = child.getElementsByClass("p-price").eq(0).text();
            String name = child.getElementsByClass("p-name").eq(0).text();
            Content content = new Content();
            content.setImg(img);
            content.setTitle(name);
            content.setPrice(price);
            contents.add(content);
        }
        return contents;
    }
}
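To make the URL construction and the `2 * page - 1` mapping easier to check in isolation, here is a JDK-only sketch of the same step. `JdSearchUrl` and `build` are hypothetical names introduced for illustration, and details such as how spaces are encoded may differ from what Hutool's `UrlBuilder` produces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** JDK-only sketch of the search URL built in parseJDSearchKey (hypothetical helper). */
final class JdSearchUrl {

    /** Builds the search URL for a keyword and a 1-based result page. */
    static String build(String keyword, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        try {
            // Percent-encode the keyword so non-ASCII text is safe in the query string
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            // Same mapping as the article's code: result page n is requested as page=2n-1
            int pageParam = 2 * page - 1;
            return "https://search.jd.com/Search?keyword=" + encoded
                    + "&enc=utf-8&page=" + pageParam;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("phone", 2));
    }
}
```

Keeping this step in a pure function means the page arithmetic can be unit-tested without sending any request to JD.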
3.6 Write the front-end page index.html
<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">
<head>
<meta charset="utf-8"/>
<title>ES JD Clone in Practice</title>
<link rel="stylesheet" th:href="@{/css/style.css}"/>
</head>
<body class="pg">
<div class="page" id="app">
<div id="mallPage" class=" mallist tmall- page-not-market ">
<!-- Header search -->
<div id="header" class=" header-list-app">
<div class="headerLayout">
<div class="headerCon ">
<!-- Logo-->
<h1 id="mallLogo">
<img th:src="@{/images/jdlogo.png}" alt="">
</h1>
<div class="header-extra">
<!-- Search -->
<div id="mallSearch" class="mall-search">
<form name="searchTop" class="mallSearch-form clearfix">
<fieldset>
<legend>Tmall Search</legend>
<div class="mallSearch-input clearfix">
<div class="s-combobox" id="s-combobox-685">
<div class="s-combobox-input-wrap">
<input v-model="keyword" type="text" autocomplete="off" value="dd" id="mq"
class="s-combobox-input" aria-haspopup="true"
>
</div>
</div>
<button type="submit" id="searchbtn" @click.prevent="searchKey">Search</button>
</div>
</fieldset>
</form>
<ul class="relKeyTop">
<li><a>Java</a></li>
<li><a>前端</a></li>
<li><a>Linux</a></li>
<li><a>Big Data</a></li>
<li><a>Finance</a></li>
</ul>
</div>
</div>
</div>
</div>
</div>
<!-- Product detail page -->
<div id="content">
<div class="main">
<!-- Brand categories -->
<form class="navAttrsForm">
<div class="attrs j_NavAttrs" style="display:block">
<div class="brandAttr j_nav_brand">
<div class="j_Brand attr">
<div class="attrKey">
Brand
</div>
<div class="attrValues">
<ul class="av-collapse row-2">
<li><a href="#"> </a></li>
<li><a href="#"> Java </a></li>
</ul>
</div>
</div>
</div>
</div>
</form>
<!-- Sort order -->
<div class="filter clearfix">
<a class="fSort fSort-cur">Overall<i class="f-ico-arrow-d"></i></a>
<a class="fSort">Popularity<i class="f-ico-arrow-d"></i></a>
<a class="fSort">New<i class="f-ico-arrow-d"></i></a>
<a class="fSort">Sales<i class="f-ico-arrow-d"></i></a>
<a class="fSort">Price<i class="f-ico-triangle-mt"></i><i class="f-ico-triangle-mb"></i></a>
</div>
<!-- Product details -->
<div class="view grid-nosku">
<!-- <div class="product">-->
<!-- <div class="product-iWrap">-->
<!-- <!–Product cover–>-->
<!-- <div class="productImg-wrap">-->
<!-- <a class="productImg">-->
<!-- <img src="https://img.alicdn.com/bao/uploaded/i1/3899981502/O1CN01q1uVx21MxxSZs8TVn_!!0-item_pic.jpg">-->
<!-- </a>-->
<!-- </div>-->
<!-- <!–Price–>-->
<!-- <p class="productPrice">-->
<!-- <em><b>¥</b>2590.00</em>-->
<!-- </p>-->
<!-- <!–Title–>-->
<!-- <p class="productTitle">-->
<!-- <a> dkny autumn solid-color A-line lace dress, same style as in department stores </a>-->
<!-- </p>-->
<!-- <!– Shop name –>-->
<!-- <div class="productShop">-->
<!-- <span>Shop: Java </span>-->
<!-- </div>-->
<!-- <!– Sales information –>-->
<!-- <p class="productStatus">-->
<!-- <span>Monthly sales <em>999 orders</em></span>-->
<!-- <span>Reviews <a>3</a></span>-->
<!-- </p>-->
<!-- </div>-->
<!-- </div>-->
<div class="product" v-for="(item,index) in result" :key="index">
<div class="product-iWrap">
<!-- Product cover -->
<div class="productImg-wrap">
<a class="productImg">
<img :src="'http:'+item.img">
</a>
</div>
<!-- Price -->
<p class="productPrice">
<!-- <em><b>¥</b>2590.00</em>-->
<em>{{item.price}}</em>
</p>
<!-- Title: v-html renders the highlight <span> returned by the search -->
<p class="productTitle">
<a v-html="item.title"> </a>
<!-- <a> {{item.title}} </a>-->
</p>
<!-- Shop name -->
<div class="productShop">
<span>Shop: Java </span>
</div>
<!-- Sales information -->
<p class="productStatus">
<span>Monthly sales <em>999 orders</em></span>
<span>Reviews <a>3</a></span>
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<script th:src="@{/js/jquery.min.js}"></script>
<script th:src="@{/js/axios.min.js}"></script>
<script th:src="@{/js/vue.min.js}"></script>
<script>
new Vue({
el:"#app",
data:{
keyword:"",
result:[]
},
methods:{
async searchKey(){
let keyword = this.keyword;
console.log(keyword);
let res = await axios.post("ES/Search",{
keyword,
pageSize:20,
pageNo:1
})
console.log(res);
if(res!=null&& res!=undefined){
// alert("Query succeeded")
this.result = res.data;
}
}
}
})
</script>
</body>
</html>
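3.7 Create the product POJO class Content
import lombok.Data;

// Lombok's @Data generates the getters, setters, equals/hashCode and toString
@Data
public class Content {
    private String img;
    private String title;
    private String price;
}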
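3.8 Write the crawl-and-sync logic
/** Controller layer **/
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;

@Slf4j
@RestController
@RequestMapping("/ES")
public class ESController {

    @Resource
    EsDataSearchService esDataSearchService;

    /**
     * Crawl the given keyword and import the results into ES.
     *
     * @param keyword search keyword to crawl
     * @return true if every document was indexed without failure
     * @throws Exception if crawling or indexing fails
     */
    @GetMapping("/data/{keyword}")
    public boolean SynchronizeData(@PathVariable("keyword") String keyword) throws Exception {
        return esDataSearchService.SynchronizeData(keyword);
    }
}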
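/** Service layer **/
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
// XContentType is in org.elasticsearch.common.xcontent on older 7.x clients
// and in org.elasticsearch.xcontent on newer ones; adjust to your client version
import org.elasticsearch.xcontent.XContentType;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.util.List;

@Service
public class EsDataSearchServiceImpl implements EsDataSearchService {

    @Resource
    RestHighLevelClient restHighLevelClient;

    @Override
    public boolean SynchronizeData(String keyword) throws Exception {
        List<Content> contents = HtmlParseUtil.parseJDSearchKeyByPage(keyword, 2);
        // Create the bulk request
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("2m");
        // Add one index request per scraped product
        for (Content content : contents) {
            bulkRequest.add(
                    new IndexRequest("jd_goods")
                            .source(JSON.toJSONString(content), XContentType.JSON));
        }
        // Send everything to ES in a single round trip
        BulkResponse response = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        return !response.hasFailures();
    }
}
Note: the scraped products are collected into a list and then written to ES in one bulk request.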
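One consequence of the code above is worth spelling out: the `IndexRequest` is created without an ID, so ES generates one, and syncing the same keyword twice stores every product twice. One possible remedy, sketched here under the assumption that image URL plus title identifies a product well enough, is to derive a stable ID and pass it to `IndexRequest.id(...)`. `DocIds` and `stableId` are hypothetical names, not part of the article's project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Derives a repeatable document ID from scraped fields (hypothetical helper). */
final class DocIds {

    /** Returns a 64-character hex SHA-256 digest of the image URL and the title. */
    static String stableId(String img, String title) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(String.valueOf(img).getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator so ("ab", "c") and ("a", "bc") hash differently
            sha.update(String.valueOf(title).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(64);
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stableId("//img.example/1.jpg", "Sample title"));
    }
}
```

With an explicit ID, indexing the same product again overwrites the existing document instead of adding a duplicate.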
3.9 Write the query endpoint
/** Controller layer **/
@PostMapping("/Search")
public List<Content> SearchData(@RequestBody SearchObject searchObject) {
    // true: crawl the keyword once if the first search returns nothing
    return esDataSearchService.SearchData(searchObject, true);
}
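The article does not show the `SearchObject` class. A minimal sketch of what it might look like follows; the three fields are inferred from the `getKeyword`, `getPageNo` and `getPageSize` calls in the service code and from the axios payload on the page, while the default values and the `offset()` helper are assumptions added for illustration. In the real project the class would be public in its own file, possibly with Lombok's `@Data` instead of hand-written accessors.

```java
/** Request body of the search endpoint (hypothetical reconstruction). */
final class SearchObject {
    private String keyword;
    private int pageNo = 1;    // 1-based page number (assumed default)
    private int pageSize = 20; // matches the pageSize the page sends

    public String getKeyword() { return keyword; }
    public void setKeyword(String keyword) { this.keyword = keyword; }

    public int getPageNo() { return pageNo; }
    public void setPageNo(int pageNo) { this.pageNo = pageNo; }

    public int getPageSize() { return pageSize; }
    public void setPageSize(int pageSize) { this.pageSize = pageSize; }

    /** Index of the first hit of this page; page numbers below 1 are treated as page 1. */
    int offset() {
        return (Math.max(pageNo, 1) - 1) * Math.max(pageSize, 0);
    }

    public static void main(String[] args) {
        SearchObject s = new SearchObject();
        s.setPageNo(3);
        System.out.println(s.offset());
    }
}
```

Clamping the page number in `offset()` keeps a request with `pageNo` 0 from producing a negative offset.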
/** Service layer **/
@SneakyThrows
@Override
public List<Content> SearchData(SearchObject searchObject, boolean flag) {
    SearchRequest request = new SearchRequest();
    request.indices("jd_goods");
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Pagination
    builder.from((searchObject.getPageNo() - 1) * searchObject.getPageSize());
    builder.size(searchObject.getPageSize());
    // Highlighting: wrap matches in the title in a red <span>
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.requireFieldMatch(false); // highlight even in fields the query did not target
    highlightBuilder.preTags("<span style='color:red;'>");
    highlightBuilder.postTags("</span>");
    highlightBuilder.field("title");
    builder.highlighter(highlightBuilder);
    // Phrase match on the analyzed title, which also works for Chinese keywords.
    // A term query would only hit when the keyword equals a single indexed term.
    // The must clause is required: a bool query with no clauses matches every document.
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.must(QueryBuilders.matchPhraseQuery("title", searchObject.getKeyword()));
    builder.query(boolQueryBuilder);
    builder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    request.source(builder);
    // Run the search
    SearchResponse response = restHighLevelClient.search(request, RequestOptions.DEFAULT);
    // Collect the results
    List<Content> res = new ArrayList<>();
    for (SearchHit hit : response.getHits().getHits()) {
        Content content = JSON.parseObject(hit.getSourceAsString(), Content.class);
        HighlightField title = hit.getHighlightFields().get("title");
        if (title != null) {
            // Join the highlighted fragments; StringBuilder is enough in single-threaded code
            StringBuilder str = new StringBuilder();
            for (Text fragment : title.fragments()) {
                str.append(fragment.string());
            }
            content.setTitle(str.toString());
        }
        res.add(content);
    }
    // Nothing found on the first attempt: crawl the keyword once, then search again
    if (res.isEmpty() && flag) {
        this.SynchronizeData(searchObject.getKeyword());
        // The bulk call itself returns synchronously, but new documents only become
        // searchable after the next index refresh (about once per second by default)
        Thread.sleep(1000);
        res = this.SearchData(searchObject, false);
    }
    return res;
}
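The fixed one-second sleep before the second search is a guess about when the freshly indexed documents become visible. A small polling loop is one alternative; the sketch below uses only the JDK, and `Retry` and `untilNonEmpty` are hypothetical names. Another option, not shown here, is to set a refresh policy on the `BulkRequest` so the sync call returns only once the documents are searchable.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Re-runs a query until it returns results or the attempts run out (hypothetical helper). */
final class Retry {

    static <T> List<T> untilNonEmpty(Supplier<List<T>> query, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<T> result = query.get();
            if (result != null && !result.isEmpty()) {
                return result;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag and stop
                    break;
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated search that only finds something on its third call
        List<String> hits = untilNonEmpty(
                () -> ++calls[0] < 3 ? Collections.<String>emptyList() : Arrays.asList("hit"),
                5, 10L);
        System.out.println(hits + " after " + calls[0] + " calls");
    }
}
```

In the service, the supplier would wrap the second `SearchData(searchObject, false)` call, so the method returns as soon as the documents are visible and gives up after a bounded number of attempts.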
3.10 Start the project and open it in a browser on the application port (remember to start the ES service first)
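A quick way to confirm that ES is up before starting the application is to try a TCP connection to its port. This is a minimal JDK-only sketch: `EsProbe` and `isReachable` are hypothetical names, and the host and port match the values in ElasticSearchClientConfig. It only checks that something accepts connections on the port, not that the node is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether a TCP port accepts connections (hypothetical helper). */
final class EsProbe {

    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unknown
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ES reachable: " + isReachable("127.0.0.1", 9200, 1000));
    }
}
```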
4. Summary
Elasticsearch is a distributed, highly scalable, near-real-time search and data analytics engine. It makes it easy to search, analyze, and explore large volumes of data.
It can serve as a real-time data store, and it scales well: a cluster can grow to hundreds of servers and handle petabytes of data.