Table of Contents
Chapter 1: Project Overview
1.1 Project Requirements and Goals
1.2 Prerequisite Knowledge
1.3 Project Architecture Design and Technology Selection
1.4 Development Environment and Tools
1.5 Project Development Workflow
Chapter 2: Building the Big Data Cluster Environment
2.1 Installation Preparation
2.2 Hadoop Cluster Setup
2.3 Hive Installation
2.4 Sqoop Installation
Chapter 3: Data Collection
3.1 Knowledge Outline
3.2 Analysis and Preparation
3.3 Collecting Web Page Data
Chapter 4: Data Preprocessing
4.1 Analyzing the Data to Preprocess
4.2 Designing the Preprocessing Plan
4.3 Implementing the Preprocessing
Chapter 5: Data Analysis
5.1 Data Analysis Overview
5.2 The Hive Data Warehouse
5.3 Analyzing the Data
Chapter 6: Data Visualization
6.1 Platform Overview
6.2 Data Migration
6.3 Platform Environment Setup
6.4 Implementing the Graphical Display Features
Chapter 1: Project Overview
1.1 Project Requirements
Project requirements:
This project takes the nationwide big-data-related job postings of a major Chinese online recruitment website as its base data. These postings largely reflect the market demand for big data positions and the skills they require. Using a big data analysis platform, the project focuses on the following analyses:
- The regional distribution of big data positions
- The salary range distribution of big data positions
- The benefits offered by companies hiring for big data positions
- The skill requirements of companies hiring for big data positions
1.2 Prerequisite Knowledge
Required background:
- Java object-oriented programming
- Basic operation of Hadoop, Hive, and Sqoop under Linux
- Java API development for HDFS and MapReduce
- Fundamentals of big data technologies such as Hadoop, Hive, and Sqoop
- Linux shell commands
- Relational database (MySQL) fundamentals and writing SQL statements
- Front-end web technologies such as HTML, JSP, jQuery, and CSS
- Integrated use of the Spring + Spring MVC + MyBatis back-end framework
- The Eclipse IDE
- The Maven project management tool
1.3 Project Architecture Design and Technology Selection
1.4 Development Environment and Tools
The system environment is split into a development environment (Windows) and a cluster environment (Linux).
Development tools: Eclipse, JDK, Maven, VMware Workstation
Cluster environment: Hadoop, Hive, Sqoop, MySQL
Web stack: Tomcat, Spring, Spring MVC, MyBatis, ECharts
1.5 Project Development Workflow
1. Set up the big data experiment environment
(1) Install and clone the Linux virtual machines
(2) Configure VM networking and the SSH service
(3) Build the Hadoop cluster
(4) Install the MySQL database
(5) Install Hive
(6) Install Sqoop
2. Write a web crawler for data collection
(1) Prepare the crawler environment
(2) Write the crawler program
(3) Store the crawled data in HDFS
3. Preprocess the data
(1) Analyze the data to be preprocessed
(2) Prepare the preprocessing environment
(3) Implement a MapReduce preprocessing program for data integration and transformation
(4) Run the MapReduce preprocessing program in its two execution modes
4. Analyze the data
(1) Build the data warehouse
(2) Analyze position regions with HiveQL
(3) Analyze position salaries with HiveQL
(4) Analyze company benefit tags with HiveQL
(5) Analyze skill tags with HiveQL
5. Visualize the data
(1) Build the relational database
(2) Migrate the data with Sqoop
(3) Create a Maven project and configure its dependencies
(4) Edit the configuration files to integrate the SSM framework
(5) Flesh out the project structure
(6) Implement the position region distribution display
(7) Implement the salary distribution display
(8) Implement the benefit tag word cloud
(9) Preview the platform display
(10) Implement the skill tag word cloud
Chapter 2: Building the Big Data Cluster Environment
2.1 Installation Preparation
Install and clone the virtual machines (choose "create a full clone" as the clone method).
Configure the VM network:
#Edit the network configuration
vi /etc/sysconfig/network-scripts/ifcfg-ens33
#Restart the network service
service network restart
#Map IP addresses to hostnames
vi /etc/hosts
Configure the SSH service:
#Check whether SSH is installed
rpm -qa | grep ssh
#Install SSH
yum -y install openssh openssh-server
#Check the SSH process
ps -ef | grep ssh
#Generate a key pair
ssh-keygen -t rsa
#Copy the public key to each host
ssh-copy-id <hostname>
2.2 Hadoop Cluster Setup
Steps:
- Download and install
- Configure environment variables (edit the profile file, set the system variables, then re-source it)
- Verify the environment
- JDK installation
1. Install lrzsz so packages can be uploaded with the rz command: yum install lrzsz
2. Extract: tar -zxvf jdk-8u181-linux-x64.tar.gz -C /usr/local
3. Rename: mv jdk1.8.0_181/ jdk
4. Configure environment variables: vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:$JAVA_HOME/bin
5. Reload the environment: source /etc/profile
6. Verify: java -version
- Hadoop installation
1. Upload the package with the rz command
2. Extract
tar -zxvf hadoop-2.7.1.tar.gz -C /usr/local
3. Rename
mv hadoop-2.7.1/ hadoop
4. Configure environment variables
vi /etc/profile
#HADOOP_HOME
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
5. Reload the environment
source /etc/profile
6. Verify
hadoop version
- Hadoop cluster configuration
Steps:
- Configuration files
- Edit hadoop-env.sh, yarn-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml
- Edit the slaves file and distribute the master node's configuration to the other nodes
1. cd hadoop/etc/hadoop
2. vi hadoop-env.sh
#Set JAVA_HOME
export JAVA_HOME=/usr/local/jdk
3. vi yarn-env.sh
#Set JAVA_HOME (remember to remove the leading # comment, and make sure you edit the right line)
4. vi core-site.xml
#Configure the address of the main NameNode process and the temporary directory for data Hadoop generates at runtime
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
5. vi hdfs-site.xml
#Configure the Secondary NameNode address and the HDFS block replication factor
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop2:50090</value>
</property>
</configuration>
6. cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
#Configure MapReduce programs to run on YARN
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
7. vi yarn-site.xml
#Configure the ResourceManager host and the mapreduce_shuffle auxiliary service
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
8. vi slaves
hadoop1
hadoop2
hadoop3
9. Distribute the configuration:
scp /etc/profile root@hadoop2:/etc/profile
scp /etc/profile root@hadoop3:/etc/profile
scp -r /usr/local/* root@hadoop2:/usr/local/
scp -r /usr/local/* root@hadoop3:/usr/local/
10. Remember to reload the environment on hadoop2 and hadoop3: source /etc/profile
- Hadoop cluster testing
- Format the file system
- Start the Hadoop cluster
- Verify the processes on each node
#1. Format the file system (only when starting the HDFS cluster for the first time, on the master node)
hdfs namenode -format
(or: hadoop namenode -format)
#2. Enter hadoop/sbin/
cd /usr/local/hadoop/sbin/
#3. On the master node, start the HDFS NameNode process
hadoop-daemon.sh start namenode
#4. On every node, start the HDFS DataNode process
hadoop-daemon.sh start datanode
#5. On the master node, start the YARN ResourceManager process
yarn-daemon.sh start resourcemanager
#6. On every node, start the YARN NodeManager process
yarn-daemon.sh start nodemanager
#7. On the designated node, start the SecondaryNameNode process
hadoop-daemon.sh start secondarynamenode
#8. jps should list the expected processes (five on the master: DataNode, ResourceManager, NameNode, NodeManager, Jps)
jps
- Check Hadoop's running state through the web UI
On Windows, configure the IP mapping in the hosts file under C:\Windows\System32\drivers\etc, adding entries for the cluster nodes.
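A minimal hosts mapping would look like the following; the IP addresses below are placeholders (use whatever addresses your virtual machines actually have), only the hostnames hadoop1/hadoop2/hadoop3 come from this project:

```
192.168.121.134 hadoop1
192.168.121.135 hadoop2
192.168.121.136 hadoop3
```

With these entries in place, URLs such as http://hadoop1:50070 resolve from the Windows browser.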
2.3 Hive Installation
- Install the MySQL service
#Install MariaDB
yum install mariadb-server mariadb
#Start the service
systemctl start mariadb
systemctl enable mariadb
#Switch to the mysql database
use mysql;
#Change the root user's password
update user set password=PASSWORD('123456') where user = 'root';
#Allow remote logins
grant all privileges on *.* to 'root'@'%'
identified by '123456' with grant option;
#Reload the privilege tables
flush privileges;
- Install Hive
#1. Extract
tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /usr/local
#2. Rename
mv apache-hive-1.2.2-bin/ hive
#3. Configuration files
cd /usr/local/hive/conf
cp hive-env.sh.template hive-env.sh
vi hive-env.sh (set: export HADOOP_HOME=/usr/local/hadoop)
#4.
vi hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
#5. Upload the MySQL driver jar
cd ../lib
rz (mysql-connector-java-5.1.40.jar)
#6. Configure environment variables
vi /etc/profile
#Add HIVE_HOME
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
source /etc/profile
#7. Start Hive
cd ../bin/
./hive
2.4 Sqoop Installation
#1. Extract
tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/local
#2. Rename
mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop
#3. Configure
cd sqoop/conf/
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Set:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
#4. Configure environment variables
vi /etc/profile
#Add SQOOP_HOME
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
#5. Test the installation
cd ../lib
rz (mysql-connector-java-5.1.40.jar) #upload the jar into the lib directory
cd ../bin/
sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root --password 123456
#(sqoop list-databases lists all databases in the connected local MySQL instance; if it correctly returns the database information for the given address, Sqoop is fully configured)
Chapter 3: Data Collection
3.1 Knowledge Outline
1. Data source categories (system log collection, web data collection, database collection)
2. The HTTP request process
3. HttpClient
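An HTTP POST like the ones HttpClient sends carries its form parameters as a URL-encoded request body. As a minimal, stdlib-only sketch of that encoding step (the class and method names here are illustrative, not part of the project code; HttpClient's UrlEncodedFormEntity does the equivalent internally):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormEncodeDemo {
    // Join key/value pairs into a URL-encoded form body such as "kd=...&pn=1".
    public static String encodeForm(Map<String, String> params, String charset)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), charset))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), charset));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("kd", "big data");
        params.put("pn", "1");
        System.out.println(encodeForm(params, "UTF-8")); // kd=big+data&pn=1
    }
}
```

Understanding this shape makes it easier to read what the crawler's packageParam() method is doing later in this chapter.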
3.2 Analysis and Preparation
1. Analyze the structure of the web page data
Open the site in Google Chrome's developer tools, switch to the Network tab, set a filter, and inspect the JSON file returned by the Ajax request; the big-data position information sits under "content → positionResult → result" in that JSON.
2. Prepare the data collection environment
Add the HttpClient and JDK 1.8 dependencies required by the crawler to the pom file:
<dependencies>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.4</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.8</version>
<scope>system</scope>
<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
</dependencies>
3.3 Collecting Web Page Data
1. Create a response JavaBean class
The HttpClient response object created here acts as the storage carrier for the collected data, encapsulating each response's status code and body.
//HttpClientResp.java
package com.position.reptile;
import java.io.Serializable;
public class HttpClientResp implements Serializable {
private static final long serialVersionUID = 2963835334380947712L;
// Response status code
private int code;
// Response body
private String content;
// No-argument constructor
public HttpClientResp() {
}
public HttpClientResp(int code) {
super();
this.code = code;
}
public HttpClientResp(String content) {
super();
this.content = content;
}
public HttpClientResp(int code, String content) {
super();
this.code = code;
this.content = content;
}
// Getters and setters
public int getCode() {
return code;
}
public void setCode(int code) {
this.code = code;
}
public String getContent() {
return content;
}
public void setContent(String content) {
this.content = content;
}
// Override toString()
@Override
public String toString() {
return "HttpClientResp [code=" + code + ", content=" + content + "]";
}
}
2. A utility class wrapping HTTP requests
In the com.position.reptile package, create a utility class named HttpClientUtils.java that implements the HTTP request methods.
(1) Define three global constants:
// Character encoding
private static final String ENCODING = "UTF-8";
// Connection timeout in milliseconds
private static final int CONNECT_TIMEOUT = 6000;
// Socket (response) timeout in milliseconds
private static final int SOCKET_TIMEOUT = 6000;
(2) Write packageHeader() to attach the HTTP request headers
// Attach the request headers
public static void packageHeader(Map<String, String> params, HttpRequestBase httpMethod){
if (params != null) {
// entrySet holds every header entry packed into params
Set<Entry<String, String>> entrySet = params.entrySet();
// Iterate over the entries
for (Entry<String, String> entry : entrySet) {
// Set each header on the HttpRequestBase object
httpMethod.setHeader(entry.getKey(),entry.getValue());
}
}
}
(3) Write packageParam() to attach the HTTP request parameters
// Attach the request parameters
public static void packageParam(Map<String,String> params,HttpEntityEnclosingRequestBase httpMethod) throws UnsupportedEncodingException {
if (params != null) {
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
Set<Entry<String, String>> entrySet = params.entrySet();
for (Entry<String, String> entry : entrySet) {
// Put each entry's key and value into the nvps list
nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
}
httpMethod.setEntity(new UrlEncodedFormEntity(nvps, ENCODING));
}
}
(4) Write getHttpClientResult() to obtain the HTTP response content
public static HttpClientResp getHttpClientResult(CloseableHttpResponse httpResponse,CloseableHttpClient httpClient,HttpRequestBase httpMethod) throws Exception{
httpResponse=httpClient.execute(httpMethod);
//Extract the HTTP response result
if(httpResponse != null && httpResponse.getStatusLine() != null) {
String content = "";
if(httpResponse.getEntity() != null) {
content = EntityUtils.toString(httpResponse.getEntity(),ENCODING);
}
return new HttpClientResp(httpResponse.getStatusLine().getStatusCode(),content);
}
return new HttpClientResp(HttpStatus.SC_INTERNAL_SERVER_ERROR);
}
(5) Write doPost() to submit the request headers and parameters
public static HttpClientResp doPost(String url,Map<String,String>headers,Map<String,String>params) throws Exception{
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpPost httppost = new HttpPost(url);
//Build the request configuration
RequestConfig requestConfig = RequestConfig.custom()
.setConnectTimeout(CONNECT_TIMEOUT)
.setSocketTimeout(SOCKET_TIMEOUT)
.build();
//Apply the configuration to the POST request
httppost.setConfig(requestConfig);
//Attach the request headers
packageHeader(headers,httppost);
//Attach the request parameters
packageParam(params,httppost);
//Create an httpResponse object to receive the response content
CloseableHttpResponse httpResponse = null;
try {
return getHttpClientResult(httpResponse,httpclient,httppost);
}finally {
//Release resources
release(httpResponse,httpclient);
}
}
(6) Write release() to free the HTTP request and response resources
private static void release(CloseableHttpResponse httpResponse,CloseableHttpClient httpClient) throws IOException{
if(httpResponse != null) {
httpResponse.close();
}
if(httpClient != null) {
httpClient.close();
}
}
3. A utility class for storing data in HDFS
(1) Add the Hadoop dependencies to pom.xml so the HDFS API can be called:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
(2) In the com.position.reptile package, create a utility class named HttpClientHdfsUtils.java implementing createFileBySysTime(), the method that writes the data to HDFS.
public class HttpClientHdfsUtils {
public static void createFileBySysTime(String url,String fileName,String data) {
System.setProperty("HADOOP_USER_NAME", "root");
Path path = null;
//Read the system time
Calendar calendar = Calendar.getInstance();
Date time = calendar.getTime();
//Format the system time
SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
//Convert the current system time to a String
String filepath = format.format(time);
//Build a Configuration object carrying the Hadoop parameters
Configuration conf = new Configuration();
URI uri= URI.create(url);
FileSystem fileSystem;
try {
//Get the file system object
fileSystem = FileSystem.get(uri,conf);
//Define the directory path
path = new Path("/JobData/"+filepath);
if(!fileSystem.exists(path)) {
fileSystem.mkdirs(path);
}
//Create the file under the specified directory
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(path.toString()+"/"+fileName));
//Write the data into the file
IOUtils.copyBytes(new ByteArrayInputStream(data.getBytes()),fsDataOutputStream,conf,true);
fileSystem.close();
}catch(IOException e) {
e.printStackTrace();
}
}
}
4. Implement the web page data collection
(1) Inspect the request headers in the Chrome browser
(2) In the com.position.reptile package, create the main class HttpClientData.java, which implements the data collection.
public class HttpClientData {
public static void main(String[] args) throws Exception {
//Set the request headers
Map<String,String>headers = new HashMap<String,String>();
headers.put("Cookie","privacyPolicyPopup=false; user_trace_token=20221103113731-d2950fcd-eb36-486c-9032-feab09943d4d; LGUID=20221103113731-ef107f32-06e0-4453-a89c-683f5a558e86; _ga=GA1.2.11435994.1667446652; RECOMMEND_TIP=true; index_location_city=%E5%85%A8%E5%9B%BD; __lg_stoken__=a5abb0b1f9cda5e7a6da82dd7a4397075c675acce324397a86b9cbbd4fc31a58d921346f317ba5c8c92b5c4a9ebb0650576575b67ebae44f422aeb4b1a950643cd2854eece70; JSESSIONID=ABAAAECABIEACCAC2031D7A104C1E74CDC3FABFA00BCC7F; WEBTJ-ID=20221105161123-18446d82e00bcd-0f0b3aafbd8e8e-26021a51-921600-18446d82e018bf; _gid=GA1.2.1865104541.1667635884; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1667446652,1667456559,1667635885; PRE_UTM=; PRE_HOST=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist%5F%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D%3FlabelWords%3Dhot; LGSID=20221105161124-df5ffe02-aefa-434b-b378-2d64367fddde; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fcommon-sec%2Fsecurity-check.html%3Fseed%3D5E87A87B3DA4AFE2BC190FBB560FB9266A5615D5937A536A0FA5205B13CAC74F0D0C1CC5AF1D2DD0C0060C9AF3B36CA5%26ts%3D16676358793441%26name%3Da5abb0b1f9cd%26callbackUrl%3Dhttps%253A%252F%252Fwww.lagou.com%252Fjobs%252Flist%5F%2525E5%2525A4%2525A7%2525E6%252595%2525B0%2525E6%25258D%2525AE%253FlabelWords%253D%2526fromSearch%253Dtrue%2526suginput%253D%253FlabelWords%253Dhot%26srcReferer%3D; _gat=1; X_MIDDLE_TOKEN=668d4b4d5ba925cb7156e2d72086c745; privacyPolicyPopup=false; sensorsdata2015session=%7B%7D; TG-TRACK-CODE=index_search; 
sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221843b917f5d1b4-025994c92cf438-26021a51-921600-1843b917f5e3e5%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%22103.0.0.0%22%2C%22%24latest_referrer_host%22%3A%22%22%7D%2C%22%24device_id%22%3A%221843b917f5d1b4-025994c92cf438-26021a51-921600-1843b917f5e3e5%22%7D; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1667636243; LGRID=20221105161724-fad126be-48da-4684-aa52-1ff6cfb2dffd; SEARCH_ID=535076fc2a094fa2913263e0079a9038; X_HTTP_TOKEN=a18b9f65c1cbf1490626367661a3afc88e7340da5d");
headers.put("Connection","keep-alive");
headers.put("Accept","application/json, text/javascript, */*; q=0.01");
headers.put("Accept-Language","zh-CN,zh;q=0.9");
headers.put("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64)"+"AppleWebKit/537.36 (KHTML, like Gecko)"+"Chrome/103.0.0.0 Safari/537.36");
headers.put("content-type","application/x-www-form-urlencoded; charset=UTF-8");
headers.put("Referer", "https://www.lagou.com/jobs/list_%E5%A4%A7%E6%95%B0%E6%8D%AE?labelWords=&fromSearch=true&suginput=?labelWords=hot");
headers.put("Origin", "https://www.lagou.com");
headers.put("x-requested-with","XMLHttpRequest");
headers.put("x-anit-forge-token","None");
headers.put("x-anit-forge-code","0");
headers.put("Host","www.lagou.com");
headers.put("Cache-Control","no-cache");
//Set the request parameters
Map<String,String>params = new HashMap<String,String>();
params.put("kd","大數(shù)據(jù)");
params.put("city","全國");
//Crawl pages 1 through 30, writing each page of results to HDFS
for (int i=1;i<31;i++){
params.put("pn",String.valueOf(i));
HttpClientResp result = HttpClientUtils.doPost("https://www.lagou.com/jobs/positionAjax.json?"+"needAddtionalResult=false",headers,params);
HttpClientHdfsUtils.createFileBySysTime("hdfs://hadoop1:9000","page"+i,result.toString());
Thread.sleep(1 * 500);
}
}
}
The final collected data is stored under the /JobData directory in HDFS.
Chapter 4: Data Preprocessing
4.1 Analyzing the Data to Preprocess
Inspect the structure and content of the data, then format it.
The project focuses on four aspects: salary, benefits, skill requirements, and position distribution.
- salary (the salary field is a string)
- city (the city field is a string)
- skillLabels (the skill requirement field is an array)
- companyLabelList (benefit tag field, an array); positionAdvantage (a string)
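Since the salary field arrives as a string such as "15k-30k", it must be stripped of its "k" suffixes and split into numeric bounds before any numeric analysis. A minimal sketch of that transformation (the class and method names here are illustrative, not the project's; the actual preprocessing is done by the CleanJob methods below):

```java
public class SalaryParseDemo {
    // Turn a raw salary string like "15k-30k" into {low, high, average}.
    public static double[] parseSalary(String raw) {
        // Drop the "k"/"K" suffixes, then split the range on "-".
        String[] bounds = raw.replaceAll("[kK]", "").split("-");
        double low = Double.parseDouble(bounds[0]);
        double high = bounds.length > 1 ? Double.parseDouble(bounds[1]) : low;
        return new double[] {low, high, (low + high) / 2};
    }

    public static void main(String[] args) {
        double[] r = parseSalary("15k-30k");
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 15.0 30.0 22.5
    }
}
```

The same low/high/average decomposition reappears in Chapter 5 when the detail table ods_jobdata_detail is populated.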
4.2 Designing the Preprocessing Plan
4.3 Implementing the Preprocessing
(1) Prepare the preprocessing environment
Add the Hadoop dependencies to pom.xml:
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
</dependencies>
(2) Create the data transformation class
Create a package named com.position.clean, then a CleanJob class in it implementing the transformations on the position data.
- deleteString(): removes the "k" characters from the salary string
//Delete a given character from a string
public static String deleteString(String str,char delChar) {
StringBuffer stringBuffer = new StringBuffer("");
for(int i=0;i<str.length();i++) {
//str is the string to process; delChar is the character to remove
if(str.charAt(i) != delChar) {
stringBuffer.append(str.charAt(i));
}
}
return stringBuffer.toString();
}
- mergeString(): merges the contents of the companyLabelList field and the positionAdvantage field into one new "-"-separated string
//Merge the benefit tags
public static String mergeString(String position,JSONArray company) throws JSONException {
String result = "";
if(company.length()!=0) {
for(int i=0;i<company.length();i++) {
result = result + company.get(i)+"-";
}
}
if(!position.isEmpty()) {
//Split positionAdvantage on common delimiters (semicolons, commas, enumeration commas, slashes)
String[] positionList = position.split(";|,|、|,|/");
for(int i=0;i<positionList.length;i++) {
result = result + positionList[i].replaceAll("[\\pP\\p{Punct}]", "")+"-";
}
return result.substring(0,result.length()-1);
}
- killResult(): joins the skill tags with "-" as the separator into a new string
//Process the skill tags
public static String killResult(JSONArray killData) throws JSONException {
String result = "";
if(killData.length() != 0) {
for(int i=0;i<killData.length();i++) {
result = result + killData.get(i)+"-";
}
return result.substring(0,result.length()-1);
}else {
return "null";
}
}
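Both mergeString() and killResult() reduce a list of labels to one "-"-joined string. The same pattern, written against a plain Java list instead of a JSONArray (the class name here is illustrative, not part of the project):

```java
import java.util.Arrays;
import java.util.List;

public class LabelJoinDemo {
    // Join labels with "-"; return "null" for an empty list,
    // mirroring killResult()'s behaviour.
    public static String joinLabels(List<String> labels) {
        if (labels.isEmpty()) {
            return "null";
        }
        StringBuilder sb = new StringBuilder();
        for (String label : labels) {
            sb.append(label).append('-');
        }
        // Trim the trailing separator, as the project code does with
        // result.substring(0, result.length() - 1).
        return sb.substring(0, sb.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(joinLabels(Arrays.asList("Hadoop", "Hive", "Spark")));
        // Hadoop-Hive-Spark
    }
}
```

The "-" separator matters downstream: Hive later splits these strings back into array columns with "collection items terminated by '-'".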
- resultToString(): processes every position record in the data file and reassembles it into a new string
//Assemble the cleaned result
public static String resultToString(JSONArray jobdata) throws JSONException {
String jobResultData="";
for(int i=0;i<jobdata.length();i++) {
String everyData = jobdata.get(i).toString();
JSONObject everyDataJson=new JSONObject(everyData);
String city = everyDataJson.getString("city");
String salary = everyDataJson.getString("salary");
String positionAdvantage = everyDataJson.getString("positionAdvantage");
JSONArray companyLabelList = everyDataJson.getJSONArray("companyLabelList");
JSONArray skillLables = everyDataJson.getJSONArray("skillLables");
//Process the salary field
String salaryNew = deleteString(salary,'k');
String welfare = mergeString(positionAdvantage,companyLabelList);
String kill = killResult(skillLables);
if(i == jobdata.length() -1) {
jobResultData = jobResultData+city+","+salaryNew+","+welfare+","+kill;
}else {
jobResultData = jobResultData+city+","+salaryNew+","+welfare+","+kill+"\n";
}
}
return jobResultData;
}
}
(3) Create the Mapper class implementing the Map task
In the com.position.clean package, create a class named CleanMapper implementing the MapReduce program's map method.
//CleanMapper extends the Mapper base class and defines the input and output key/value types of the Map phase
public class CleanMapper extends Mapper<LongWritable,Text,Text,NullWritable>{
//map() processes each input key/value pair
protected void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException {
String jobResultData="";
String reptileData = value.toString();
//Extract the content payload from the record by substring
String jobData = reptileData.substring(reptileData.indexOf("=",reptileData.indexOf("=")+1)+1,
reptileData.length()-1
);
try {
//Get the data under "content"
JSONObject contentJson = new JSONObject(jobData);
String contentData = contentJson.getString("content");
//Get the data under content.positionResult
JSONObject positionResultJson = new JSONObject(contentData);
String positionResultData = positionResultJson.getString("positionResult");
//Get the final data under "result"
JSONObject resultJson = new JSONObject(positionResultData);
JSONArray resultData = resultJson.getJSONArray("result");
jobResultData = CleanJob.resultToString(resultData);
context.write(new Text(jobResultData), NullWritable.get());
} catch (JSONException e) {
e.printStackTrace();
}
}
}
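The substring logic in map() relies on each record being the crawler's HttpClientResp.toString() output, i.e. a line shaped like "HttpClientResp [code=200, content={...}]": the JSON payload starts after the second "=" and ends before the closing "]". Assuming that exact layout, the extraction can be exercised on its own (the class name here is illustrative):

```java
public class ContentExtractDemo {
    // Pull the content value out of a line shaped like
    // "HttpClientResp [code=200, content=<json>]".
    public static String extractContent(String record) {
        int firstEq = record.indexOf('=');
        int secondEq = record.indexOf('=', firstEq + 1);
        // Drop the trailing ']' that closes the toString() output.
        return record.substring(secondEq + 1, record.length() - 1);
    }

    public static void main(String[] args) {
        String line = "HttpClientResp [code=200, content={\"content\":{}}]";
        System.out.println(extractContent(line)); // {"content":{}}
    }
}
```

This is brittle by design: it only works because the crawler controls the record format end to end.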
(4) Create and run the MapReduce program
In the com.position.clean package, create a class named CleanMain that configures the MapReduce job.
public class CleanMain {
public static void main(String[] args) throws IOException,ClassNotFoundException,InterruptedException {
//Log to the console
BasicConfigurator.configure();
//Initialize the Hadoop configuration
Configuration conf = new Configuration();
//Define a new Job: the first argument is the Hadoop configuration, the second is the job name
Job job = new Job(conf,"job");
//Set the main class
job.setJarByClass(CleanMain.class);
//Set the Mapper class
job.setMapperClass(CleanMapper.class);
//Set the job's output key class
job.setOutputKeyClass(Text.class);
//Set the job's output value class
job.setOutputValueClass(NullWritable.class);
//Input path
FileInputFormat.addInputPath(job, new Path("hdfs://hadoop1:9000/JobData/20221105"));
//Output path
FileOutputFormat.setOutputPath(job,new Path("D:\\BigData\\out"));
System.exit(job.waitForCompletion(true)?0:1);
}
}
(5) Package the program and submit it to the cluster
Modify the MapReduce main class:
package com.position.clean;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.log4j.BasicConfigurator;
public class CleanMain {
public static void main(String[] args) throws IOException,ClassNotFoundException,InterruptedException {
//Log to the console
BasicConfigurator.configure();
//Initialize the Hadoop configuration
Configuration conf = new Configuration();
//Read the remaining arguments from the hadoop command line
String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
//Expect exactly two arguments: the input and output directories
if(otherArgs.length != 2) {
System.err.println("Usage:wordcount<in><out>");
System.exit(2);
}
//Define a new Job: the first argument is the Hadoop configuration, the second is the job name
Job job = new Job(conf,"job");
//Set the main class
job.setJarByClass(CleanMain.class);
//Set the Mapper class
job.setMapperClass(CleanMapper.class);
//Combine small input files
job.setInputFormatClass(CombineTextInputFormat.class);
//Minimum combined split size: 2 MB
CombineTextInputFormat.setMinInputSplitSize(job, 2097152);
//Maximum combined split size: 4 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
//Set the job's output key class
job.setOutputKeyClass(Text.class);
//Set the job's output value class
job.setOutputValueClass(NullWritable.class);
//Input path (first argument)
FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
//Output path (second argument)
FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
Build the jar package.
Submit the jar to the cluster and run it.
Chapter 5: Data Analysis
5.1 Data Analysis Overview
This project analyzes the recruitment website data with Hive, which is built on the distributed file system.
5.2 The Hive Data Warehouse
Hive is a data warehouse built on top of the Hadoop distributed file system. It provides a set of tools for extracting, transforming, and loading (ETL) data stored in HDFS, and can store, query, and analyze large-scale data kept in Hadoop. Hive translates HQL statements into MapReduce jobs for execution.
The warehouse for this project is designed as a star schema, consisting of one fact table and several dimension tables.
- Fact table (ods_jobdata_origin): stores the data cleaned by the MapReduce job
Field | Data type | Description |
city | String | city |
salary | array<String> | salary |
company | array<String> | benefit tags |
kill | array<String> | skill tags |
- Dimension table (t_salary_detail): stores the salary distribution analysis data
Field | Data type | Description |
salary | String | salary band |
count | int | frequency of salaries within the band |
- Dimension table (t_company_detail): stores the benefit tag analysis data
Field | Data type | Description |
company | String | benefit tag |
count | int | frequency of each benefit tag |
- Dimension table (t_city_detail): stores the city distribution analysis data
Field | Data type | Description |
city | String | city |
count | int | city frequency |
- Dimension table (t_kill_detail): stores the skill tag analysis data
Field | Data type | Description |
kill | String | skill tag |
count | int | frequency of each skill tag |
Building the warehouse
- After starting the Hadoop cluster, start Hive on the master node hadoop1
- Load the preprocessed data from HDFS into the fact table ods_jobdata_origin
--Create the jobdata warehouse
create database jobdata;
use jobdata;
--Create the fact table ods_jobdata_origin
create table ods_jobdata_origin(
  city string comment 'city',
  salary array<string> comment 'salary',
  company array<string> comment 'benefits',
  kill array<string> comment 'skills')
comment 'raw position data table'
row format delimited fields terminated by ','
collection items terminated by '-'
stored as textfile;
--Load the data
load data inpath '/JobData/output/part-r-00000' overwrite into table ods_jobdata_origin;
--Query the data
select * from ods_jobdata_origin;
- Create the detail table ods_jobdata_detail, which stores the fact table data with the salary field broken out
create table ods_jobdata_detail(
  city string comment 'city',
  salary array<string> comment 'salary',
  company array<string> comment 'benefits',
  kill array<string> comment 'skills',
  low_salary int comment 'lower salary bound',
  high_salary int comment 'upper salary bound',
  avg_salary double comment 'average salary')
comment 'position data detail table'
row format delimited fields terminated by ','
collection items terminated by '-'
stored as textfile;
insert overwrite table ods_jobdata_detail
select city,salary,company,kill,salary[0],salary[1],(salary[0]+salary[1])/2
from ods_jobdata_origin;
- Flatten the salary field and store the result in the temporary intermediate table t_ods_tmp_salary
create table t_ods_tmp_salary as select explode(ojo.salary) from ods_jobdata_origin ojo;
- Generalize every row of t_ods_tmp_salary into a salary band and store the result in the intermediate table t_ods_tmp_salary_dist
create table t_ods_tmp_salary_dist as
select case
  when col>=0 and col<=5 then "0-5"
  when col>=6 and col<=10 then "6-10"
  when col>=11 and col<=15 then "11-15"
  when col>=16 and col<=20 then "16-20"
  when col>=21 and col<=25 then "21-25"
  when col>=26 and col<=30 then "26-30"
  when col>=31 and col<=35 then "31-35"
  when col>=36 and col<=40 then "36-40"
  when col>=41 and col<=45 then "41-45"
  when col>=46 and col<=50 then "46-50"
  when col>=51 and col<=55 then "51-55"
  when col>=56 and col<=60 then "56-60"
  when col>=61 and col<=65 then "61-65"
  when col>=66 and col<=70 then "66-70"
  when col>=71 and col<=75 then "71-75"
  when col>=76 and col<=80 then "76-80"
  when col>=81 and col<=85 then "81-85"
  when col>=86 and col<=90 then "86-90"
  when col>=91 and col<=95 then "91-95"
  when col>=96 and col<=100 then "96-100"
  when col>=101 then ">101"
end
from t_ods_tmp_salary;
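The CASE expression buckets each salary value (in k) into 5-wide bands. The same generalization rule can be expressed compactly in Java for a single value (illustrative only; the warehouse itself uses the HiveQL above):

```java
public class SalaryBinDemo {
    // Map a salary value (in k) to the band used by t_ods_tmp_salary_dist,
    // e.g. 7 -> "6-10", 103 -> ">101".
    public static String band(int salary) {
        if (salary < 0) return null;          // no band for negative values
        if (salary <= 5) return "0-5";
        if (salary >= 101) return ">101";
        int low = ((salary - 1) / 5) * 5 + 1; // 6, 11, 16, ...
        return low + "-" + (low + 4);
    }

    public static void main(String[] args) {
        System.out.println(band(7));   // 6-10
        System.out.println(band(15));  // 11-15
        System.out.println(band(103)); // >101
    }
}
```

Writing the rule arithmetically like this also makes it easy to check that the twenty-one CASE branches cover every non-negative value exactly once.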
- Flatten the benefit tag field and store the result in the temporary intermediate table t_ods_tmp_company
create table t_ods_tmp_company as select explode(ojo.company) from ods_jobdata_origin ojo;
- Flatten the skill tag field and store the result in the temporary intermediate table t_ods_tmp_kill
create table t_ods_tmp_kill as select explode(ojo.kill) from ods_jobdata_origin ojo;
- Create the dimension table t_ods_kill for the skill tag statistics
create table t_ods_kill(
  every_kill string comment 'skill tag',
  count int comment 'frequency')
comment 'skill tag frequency statistics'
row format delimited fields terminated by ','
stored as textfile;
- Create the dimension table t_ods_company for the benefit tag statistics
create table t_ods_company(
  every_company string comment 'benefit tag',
  count int comment 'frequency')
comment 'benefit tag frequency statistics'
row format delimited fields terminated by ','
stored as textfile;
- Create the dimension table t_ods_salary for the salary distribution statistics
create table t_ods_salary(
  every_partition string comment 'salary band',
  count int comment 'aggregated count')
comment 'salary distribution statistics'
row format delimited fields terminated by ','
stored as textfile;
- Create the dimension table t_ods_city for the city statistics
create table t_ods_city(
  every_city string comment 'city',
  count int comment 'frequency')
comment 'city statistics'
row format delimited fields terminated by ','
stored as textfile;
5.3 Analyzing the Data
- Position region analysis
--Position region analysis
insert overwrite table t_ods_city
select city,count(1) from ods_jobdata_origin group by city;
--Query the region information in descending order of count
select * from t_ods_city sort by count desc;
- Position salary analysis
--Position salary analysis
insert overwrite table t_ods_salary
select `_c0`,count(1) from t_ods_tmp_salary_dist group by `_c0`;
--View the analysis results in t_ods_salary, using sort by to order the count column descending
select * from t_ods_salary sort by count desc;
--Mean
select avg(avg_salary) from ods_jobdata_detail;
--Mode
select avg_salary,count(1) as cnt from ods_jobdata_detail group by avg_salary order by cnt desc limit 1;
--Median
select percentile(cast(avg_salary as bigint),0.5) from ods_jobdata_detail;
- Company benefit tag analysis
--Company benefit analysis
insert overwrite table t_ods_company
select col,count(1) from t_ods_tmp_company group by col;
--Query the top 10 benefit tags in descending order
select every_company,count from t_ods_company sort by count desc limit 10;
- Position skill requirement analysis
--Position skill requirement analysis
insert overwrite table t_ods_kill
select col,count(1) from t_ods_tmp_kill group by col;
--Query the top 3 skills in descending order
select every_kill,count from t_ods_kill sort by count desc limit 3;
Chapter 6: Data Visualization
6.1 Platform Overview
The recruitment position analysis visualization system presents the analysis results graphically on a web platform, aiming to communicate the findings about current big data positions clearly and effectively. The system uses ECharts for charting.
The system is built on Java web technology: the back end uses the SSM (Spring + Spring MVC + MyBatis) framework, the front end renders the visualizations with ECharts inside JSP pages, and the front and back ends exchange data through Spring MVC and AJAX.
6.2 Data Migration
- Create the relational database (connected through the Navicat client)
--Create the database JobData
CREATE DATABASE JobData CHARACTER SET utf8 COLLATE utf8_general_ci;
--Create the city distribution table
create table t_city_count(
city VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
--Create the salary distribution table
create table t_salary_count(
salary VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
--Create the welfare tag statistics table
create table t_company_count(
company VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
--Create the skill tag statistics table
create table t_kill_count(
kills VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
- Migrate the data with Sqoop
Sqoop transfers data between Hadoop (Hive) and traditional relational databases (MySQL): it can import data from a relational database into HDFS, and export data from HDFS back into a relational database.
(If Sqoop prints warnings at startup, edit the bin/configure-sqoop file and comment out the corresponding lines.)
--Migrate the per-city job distribution statistics into the t_city_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_city_count \
--columns "city,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_city
--Migrate the salary distribution results into the t_salary_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_salary_count \
--columns "salary,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_salary
--Migrate the welfare statistics results into the t_company_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_company_count \
--columns "company,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_company
--Migrate the skill tag statistics into the t_kill_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_kill_count \
--columns "kills,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_kill
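Conceptually, each `sqoop export` above reads the comma-delimited text files under the Hive table's warehouse directory and turns every line into one row inserted into the corresponding MySQL table. A simplified Python sketch of just the parsing step, with hypothetical file contents and no real MySQL connection:

```python
import csv
import io

# Hypothetical contents of a file under /user/hive/warehouse/jobdata.db/t_ods_city,
# written with fields terminated by ','
export_file = io.StringIO("Beijing,120\nShanghai,95\n")

# Each parsed line becomes one (city, count) row, as if for:
#   INSERT INTO t_city_count (city, count) VALUES (%s, %s)
rows = [(city, int(count)) for city, count in csv.reader(export_file)]
print(rows)  # → [('Beijing', 120), ('Shanghai', 95)]
```

This is also why the `--fields-terminated-by ','` flag must match the `row format delimited fields terminated by ','` clause used when the Hive tables were created: otherwise the export cannot split the lines into columns.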
6.3 Setting Up the Platform Environment
If the newly created project reports "web.xml is missing and <failOnMissingWebXml> is set to true", the error is caused by the missing web.xml file; add a web.xml under src/main/webapp/WEB-INF.
- Configure pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.itcast.jobanalysis</groupId>
<artifactId>job-web</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>war</packaging>
<dependencies>
<dependency>
<groupId>org.codehaus.jettison</groupId>
<artifactId>jettison</artifactId>
<version>1.1</version>
</dependency>
<!-- Spring -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-beans</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-webmvc</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jdbc</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aspects</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jms</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<!-- Mybatis -->
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis</artifactId>
<version>3.2.8</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis-spring</artifactId>
<version>1.2.2</version>
</dependency>
<dependency>
<groupId>com.github.miemiedev</groupId>
<artifactId>mybatis-paginator</artifactId>
<version>1.2.15</version>
</dependency>
<!-- MySql -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.32</version>
</dependency>
<!-- Connection pool -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.0.9</version>
<exclusions>
<exclusion>
<groupId>com.alibaba</groupId>
<artifactId>jconsole</artifactId>
</exclusion>
<exclusion>
<groupId>com.alibaba</groupId>
<artifactId>tools</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- JSP related -->
<dependency>
<groupId>jstl</groupId>
<artifactId>jstl</artifactId>
<version>1.2</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
<version>2.5</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>jsp-api</artifactId>
<version>2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
<version>1.8.4</version>
</dependency>
</dependencies>
<build>
<finalName>${project.artifactId}</finalName>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.properties</include>
<include>**/*.xml</include>
</includes>
<filtering>false</filtering>
</resource>
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/*.properties</include>
<include>**/*.xml</include>
</includes>
<filtering>false</filtering>
</resource>
</resources>
<plugins>
<!-- Specify the JDK version for Maven compilation; without it, Maven 3 defaults to JDK 1.5 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<!-- JDK version of the source code -->
<source>1.8</source>
<!-- Bytecode version of the generated class files -->
<target>1.8</target>
<!-- Character encoding -->
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Configure the Tomcat plugin -->
<plugin>
<groupId>org.apache.tomcat.maven</groupId>
<artifactId>tomcat7-maven-plugin</artifactId>
<version>2.2</version>
<configuration>
<path>/</path>
<port>8080</port>
</configuration>
</plugin>
</plugins>
</build>
</project>
- In applicationContext.xml under the src/main/resources/spring folder, write the Spring configuration
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:p="http://www.springframework.org/schema/p"
xmlns:aop="http://www.springframework.org/schema/aop"
xmlns:tx="http://www.springframework.org/schema/tx"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.2.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.2.xsd
http://www.springframework.org/schema/aop
http://www.springframework.org/schema/aop/spring-aop-4.2.xsd
http://www.springframework.org/schema/tx
http://www.springframework.org/schema/tx/spring-tx-4.2.xsd
http://www.springframework.org/schema/util
http://www.springframework.org/schema/util/spring-util-4.2.xsd">
<!-- Load the property file -->
<context:property-placeholder
location="classpath:properties/db.properties" />
<!-- Database connection pool -->
<bean id="dataSource"
class="com.alibaba.druid.pool.DruidDataSource"
destroy-method="close">
<property name="url" value="${jdbc.url}" />
<property name="username" value="${jdbc.username}" />
<property name="password" value="${jdbc.password}" />
<property name="driverClassName" value="${jdbc.driver}" />
<property name="maxActive" value="10" />
<property name="minIdle" value="5" />
</bean>
<!-- Let Spring manage the SqlSessionFactory, using the MyBatis-Spring integration package -->
<bean id="sqlSessionFactory"
class="org.mybatis.spring.SqlSessionFactoryBean">
<!-- Database connection pool -->
<property name="dataSource" ref="dataSource" />
<!-- Load the MyBatis global configuration file -->
<property name="configLocation"
value="classpath:mybatis/mybatis-config.xml" />
</bean>
<!-- Create mapper proxy objects by scanning the package -->
<bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
<property name="basePackage" value="cn.itcast.mapper" />
</bean>
<!-- Transaction manager -->
<bean id="transactionManager"
class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
<!-- Data source -->
<property name="dataSource" ref="dataSource" />
</bean>
<!-- Advice -->
<tx:advice id="txAdvice" transaction-manager="transactionManager">
<tx:attributes>
<!-- Propagation behavior -->
<tx:method name="save*" propagation="REQUIRED" />
<tx:method name="insert*" propagation="REQUIRED" />
<tx:method name="add*" propagation="REQUIRED" />
<tx:method name="create*" propagation="REQUIRED" />
<tx:method name="delete*" propagation="REQUIRED" />
<tx:method name="update*" propagation="REQUIRED" />
<tx:method name="find*"
propagation="SUPPORTS"
read-only="true" />
<tx:method name="select*"
propagation="SUPPORTS"
read-only="true" />
<tx:method name="get*"
propagation="SUPPORTS"
read-only="true" />
</tx:attributes>
</tx:advice>
<!-- Aspect -->
<aop:config>
<aop:advisor advice-ref="txAdvice"
pointcut="execution(* cn.itcast.service..*.*(..))" />
</aop:config>
<!-- Package scanner: scan all classes annotated with @Service -->
<context:component-scan base-package="cn.itcast.service" />
</beans>
- In springmvc.xml under the src/main/resources/spring folder, write the Spring MVC configuration
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:mvc="http://www.springframework.org/schema/mvc"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.2.xsd
http://www.springframework.org/schema/mvc
http://www.springframework.org/schema/mvc/spring-mvc-4.2.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.2.xsd">
<!-- Scan the given package so the @Controller annotations in it take effect -->
<context:component-scan base-package="cn.itcast.controller" />
<!-- Enable MVC annotation-driven configuration -->
<mvc:annotation-driven />
<!-- View resolver -->
<bean
class=
"org.springframework.web.servlet.view.InternalResourceViewResolver">
<property name="prefix" value="/WEB-INF/jsp/" />
<property name="suffix" value=".jsp" />
</bean>
<!-- Configure static resource mappings -->
<mvc:resources location="/css/" mapping="/css/**"/>
<mvc:resources location="/js/" mapping="/js/**"/>
<mvc:resources location="/assets/" mapping="/assets/**"/>
<mvc:resources location="/img/" mapping="/img/**"/>
</beans>
- Write the web.xml file, configuring the Spring listener, the encoding filter, the Spring MVC front controller, and related settings
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5">
<display-name>job-web</display-name>
<welcome-file-list>
<welcome-file>index.html</welcome-file>
</welcome-file-list>
<!-- Load the Spring container -->
<context-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:spring/applicationContext.xml</param-value>
</context-param>
<listener>
<listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>
<!-- Fix garbled characters in POST requests -->
<filter>
<filter-name>CharacterEncodingFilter</filter-name>
<filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>utf-8</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>CharacterEncodingFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
<!-- Configure the Spring MVC front controller -->
<servlet>
<servlet-name>data-report</servlet-name>
<servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
<init-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:spring/springmvc.xml</param-value>
</init-param>
<load-on-startup>1</load-on-startup>
</servlet>
<!-- Intercept all requests except JSPs -->
<servlet-mapping>
<servlet-name>data-report</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>
<!-- Global error page -->
<error-page>
<error-code>404</error-code>
<location>/WEB-INF/jsp/404.jsp</location>
</error-page>
</web-app>
- Write the database configuration file db.properties to decouple the connection settings from the project
jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://hadoop1:3306/JobData?characterEncoding=utf-8
jdbc.username=root
jdbc.password=123456
- Write the mybatis-config.xml file for the MyBatis-specific configuration (left empty here)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
</configuration>
6.4 Implementing the Graphical Display
Implement the job region distribution chart
Implement the salary distribution chart
Implement the welfare tag word cloud
Implement the skill tag word cloud
Platform visualization display
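For each chart above, the back-end controller ultimately returns JSON that the JSP page feeds into ECharts' `setOption`. A hedged Python sketch of assembling such a payload for the city-distribution bar chart; the sample rows are illustrative assumptions, not the project's actual API:

```python
import json

# Hypothetical rows fetched from t_city_count via MyBatis
rows = [("Beijing", 120), ("Shanghai", 95), ("Shenzhen", 80)]

# Shape the data the way an ECharts bar chart expects:
# one axis of category labels and one series of values
option = {
    "xAxis": {"type": "category", "data": [city for city, _ in rows]},
    "yAxis": {"type": "value"},
    "series": [{"type": "bar", "data": [count for _, count in rows]}],
}

payload = json.dumps(option, ensure_ascii=False)
print(payload)
```

On the page, the AJAX success callback would pass the parsed object straight to `chart.setOption(...)`; the word-cloud charts differ only in the series type and data shape.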