Table of Contents
Chapter 1: Project Overview
1.1 Project Requirements and Goals
1.2 Prerequisite Knowledge
1.3 Project Architecture Design and Technology Selection
1.4 Development Environment and Tools
1.5 Project Development Workflow
Chapter 2: Building the Big Data Cluster Environment
2.1 Installation Preparation
2.2 Hadoop Cluster Setup
2.3 Hive Installation
2.4 Sqoop Installation
Chapter 3: Data Collection
3.1 Knowledge Outline
3.2 Analysis and Preparation
3.3 Collecting Web Page Data
Chapter 4: Data Preprocessing
4.1 Analyzing the Data to Preprocess
4.2 Designing the Preprocessing Plan
4.3 Implementing the Preprocessing
Chapter 5: Data Analysis
5.1 Data Analysis Overview
5.2 The Hive Data Warehouse
5.3 Analyzing the Data
Chapter 6: Data Visualization
6.1 Platform Overview
6.2 Data Migration
6.3 Platform Environment Setup
6.4 Implementing the Graphical Display Features
Chapter 1: Project Overview
1.1 Project Requirements
Project requirements:
This project takes the nationwide big-data-related job postings of a major Chinese online recruitment website as its base data. These postings largely reflect the market demand for big data positions and the skills they require. Using a big data analysis platform, the project focuses on the following analyses:
- The regional distribution of big data positions
- The salary range distribution of big data positions
- The benefits offered by companies hiring for big data positions
- The skill requirements of companies hiring for big data positions
1.2 Prerequisite Knowledge
Required background:
- Java object-oriented programming
- Basic operation of Hadoop, Hive, and Sqoop under Linux
- Java API development for HDFS and MapReduce
- Fundamentals of big data technologies such as Hadoop, Hive, and Sqoop
- Linux shell commands
- Relational database (MySQL) fundamentals and writing SQL statements
- Front-end web technologies such as HTML, JSP, jQuery, and CSS
- Integrated use of the Spring + Spring MVC + MyBatis back-end framework
- The Eclipse IDE
- The Maven project management tool
1.3 Project Architecture Design and Technology Selection
1.4 Development Environment and Tools
The system environment is split into a development environment (Windows) and a cluster environment (Linux).
Development tools: Eclipse, JDK, Maven, VMware Workstation
Cluster environment: Hadoop, Hive, Sqoop, MySQL
Web stack: Tomcat, Spring, Spring MVC, MyBatis, ECharts
1.5 Project Development Workflow
1. Set up the big data experiment environment
(1) Install and clone the Linux virtual machines
(2) Configure VM networking and the SSH service
(3) Build the Hadoop cluster
(4) Install the MySQL database
(5) Install Hive
(6) Install Sqoop
2. Write a web crawler for data collection
(1) Prepare the crawler environment
(2) Write the crawler program
(3) Store the crawled data in HDFS
3. Preprocess the data
(1) Analyze the data to be preprocessed
(2) Prepare the preprocessing environment
(3) Implement a MapReduce preprocessing program for data integration and transformation
(4) Run the MapReduce preprocessing program in its two execution modes
4. Analyze the data
(1) Build the data warehouse
(2) Analyze position regions with HiveQL
(3) Analyze position salaries with HiveQL
(4) Analyze company benefit tags with HiveQL
(5) Analyze skill tags with HiveQL
5. Visualize the data
(1) Build the relational database
(2) Migrate the data with Sqoop
(3) Create a Maven project and configure its dependencies
(4) Edit the configuration files to integrate the SSM framework
(5) Flesh out the project structure
(6) Implement the position region distribution display
(7) Implement the salary distribution display
(8) Implement the benefit tag word cloud
(9) Preview the platform display
(10) Implement the skill tag word cloud
Chapter 2: Building the Big Data Cluster Environment
2.1 Installation Preparation
Install and clone the virtual machines (choose "create a full clone" as the clone method).
Configure the VM network:
#Edit the network configuration
vi /etc/sysconfig/network-scripts/ifcfg-ens33
#Restart the network service
service network restart
#Map IP addresses to hostnames
vi /etc/hosts
Configure the SSH service:
#Check whether SSH is installed
rpm -qa | grep ssh
#Install SSH
yum -y install openssh openssh-server
#Check the SSH process
ps -ef | grep ssh
#Generate a key pair
ssh-keygen -t rsa
#Copy the public key to each host
ssh-copy-id <hostname>
2.2 Hadoop Cluster Setup
Steps:
- Download and install
- Configure environment variables (edit the profile file, set the system variables, then re-source it)
- Verify the environment
- JDK installation
1. Install lrzsz so packages can be uploaded with the rz command: yum install lrzsz
2. Extract: tar -zxvf jdk-8u181-linux-x64.tar.gz -C /usr/local
3. Rename: mv jdk1.8.0_181/ jdk
4. Configure environment variables: vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:$JAVA_HOME/bin
5. Reload the environment: source /etc/profile
6. Verify: java -version
- Hadoop installation
1. Upload the package with the rz command
2. Extract
tar -zxvf hadoop-2.7.1.tar.gz -C /usr/local
3. Rename
mv hadoop-2.7.1/ hadoop
4. Configure environment variables
vi /etc/profile
#HADOOP_HOME
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
5. Reload the environment
source /etc/profile
6. Verify
hadoop version
- Hadoop cluster configuration
Steps:
- Configuration files
- Edit hadoop-env.sh, yarn-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml
- Edit the slaves file and distribute the master node's configuration to the other nodes
1. cd hadoop/etc/hadoop
2. vi hadoop-env.sh
#Set JAVA_HOME
export JAVA_HOME=/usr/local/jdk
3. vi yarn-env.sh
#Set JAVA_HOME (remember to remove the leading # comment, and make sure you edit the right line)
4. vi core-site.xml
#Configure the address of the main NameNode process and the temporary directory for data Hadoop generates at runtime
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
5. vi hdfs-site.xml
#Configure the Secondary NameNode address and the HDFS block replication factor
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop2:50090</value>
</property>
</configuration>
6. cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
#Configure MapReduce programs to run on YARN
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
7. vi yarn-site.xml
#Configure the ResourceManager host and the mapreduce_shuffle auxiliary service
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
8. vi slaves
hadoop1
hadoop2
hadoop3
9. Distribute the configuration:
scp /etc/profile root@hadoop2:/etc/profile
scp /etc/profile root@hadoop3:/etc/profile
scp -r /usr/local/* root@hadoop2:/usr/local/
scp -r /usr/local/* root@hadoop3:/usr/local/
10. Remember to reload the environment on hadoop2 and hadoop3: source /etc/profile
- Hadoop cluster testing
- Format the file system
- Start the Hadoop cluster
- Verify the processes on each node
#1. Format the file system (only when starting the HDFS cluster for the first time, on the master node)
hdfs namenode -format
(or: hadoop namenode -format)
#2. Enter hadoop/sbin/
cd /usr/local/hadoop/sbin/
#3. On the master node, start the HDFS NameNode process
hadoop-daemon.sh start namenode
#4. On every node, start the HDFS DataNode process
hadoop-daemon.sh start datanode
#5. On the master node, start the YARN ResourceManager process
yarn-daemon.sh start resourcemanager
#6. On every node, start the YARN NodeManager process
yarn-daemon.sh start nodemanager
#7. On the designated node, start the SecondaryNameNode process
hadoop-daemon.sh start secondarynamenode
#8. jps should list the expected processes (five on the master: DataNode, ResourceManager, NameNode, NodeManager, Jps)
jps
- Check Hadoop's running state through the web UI
On Windows, configure the IP mapping in the hosts file under C:\Windows\System32\drivers\etc, adding entries for the cluster nodes.
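A minimal hosts mapping would look like the following; the IP addresses below are placeholders (use whatever addresses your virtual machines actually have), only the hostnames hadoop1/hadoop2/hadoop3 come from this project:

```
192.168.121.134 hadoop1
192.168.121.135 hadoop2
192.168.121.136 hadoop3
```

With these entries in place, URLs such as http://hadoop1:50070 resolve from the Windows browser.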
2.3 Hive Installation
- Install the MySQL service
#Install MariaDB
yum install mariadb-server mariadb
#Start the service
systemctl start mariadb
systemctl enable mariadb
#Switch to the mysql database
use mysql;
#Change the root user's password
update user set password=PASSWORD('123456') where user = 'root';
#Allow remote logins
grant all privileges on *.* to 'root'@'%'
identified by '123456' with grant option;
#Reload the privilege tables
flush privileges;
- Install Hive
#1. Extract
tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /usr/local
#2. Rename
mv apache-hive-1.2.2-bin/ hive
#3. Configuration files
cd /usr/local/hive/conf
cp hive-env.sh.template hive-env.sh
vi hive-env.sh (set: export HADOOP_HOME=/usr/local/hadoop)
#4.
vi hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
#5. Upload the MySQL driver jar
cd ../lib
rz (mysql-connector-java-5.1.40.jar)
#6. Configure environment variables
vi /etc/profile
#Add HIVE_HOME
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
source /etc/profile
#7. Start Hive
cd ../bin/
./hive
2.4 Sqoop Installation
#1. Extract
tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/local
#2. Rename
mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop
#3. Configure
cd sqoop/conf/
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Set:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
#4. Configure environment variables
vi /etc/profile
#Add SQOOP_HOME
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
#5. Test the installation
cd ../lib
rz (mysql-connector-java-5.1.40.jar) #upload the jar into the lib directory
cd ../bin/
sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root --password 123456
#(sqoop list-databases lists all databases in the connected local MySQL instance; if it correctly returns the database information for the given address, Sqoop is fully configured)
Chapter 3: Data Collection
3.1 Knowledge Outline
1. Data source categories (system log collection, web data collection, database collection)
2. The HTTP request process
3. HttpClient
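An HTTP POST like the ones HttpClient sends carries its form parameters as a URL-encoded request body. As a minimal, stdlib-only sketch of that encoding step (the class and method names here are illustrative, not part of the project code; HttpClient's UrlEncodedFormEntity does the equivalent internally):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormEncodeDemo {
    // Join key/value pairs into a URL-encoded form body such as "kd=...&pn=1".
    public static String encodeForm(Map<String, String> params, String charset)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), charset))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), charset));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("kd", "big data");
        params.put("pn", "1");
        System.out.println(encodeForm(params, "UTF-8")); // kd=big+data&pn=1
    }
}
```

Understanding this shape makes it easier to read what the crawler's packageParam() method is doing later in this chapter.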
3.2 Analysis and Preparation
1. Analyze the structure of the web page data
Open the site in Google Chrome's developer tools, switch to the Network tab, set a filter, and inspect the JSON file returned by the Ajax request; the big-data position information sits under "content → positionResult → result" in that JSON.
2. Prepare the data collection environment
Add the HttpClient and JDK 1.8 dependencies required by the crawler to the pom file:
<dependencies>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.4</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.8</version>
<scope>system</scope>
<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
</dependencies>
3.3 Collecting Web Page Data
1. Create a response JavaBean class
The HttpClient response object created here acts as the storage carrier for the collected data, encapsulating each response's status code and body.
//HttpClientResp.java
package com.position.reptile;
import java.io.Serializable;
public class HttpClientResp implements Serializable {
private static final long serialVersionUID = 2963835334380947712L;
// Response status code
private int code;
// Response body
private String content;
// No-argument constructor
public HttpClientResp() {
}
public HttpClientResp(int code) {
super();
this.code = code;
}
public HttpClientResp(String content) {
super();
this.content = content;
}
public HttpClientResp(int code, String content) {
super();
this.code = code;
this.content = content;
}
// Getters and setters
public int getCode() {
return code;
}
public void setCode(int code) {
this.code = code;
}
public String getContent() {
return content;
}
public void setContent(String content) {
this.content = content;
}
// Override toString()
@Override
public String toString() {
return "HttpClientResp [code=" + code + ", content=" + content + "]";
}
}
2. A utility class wrapping HTTP requests
In the com.position.reptile package, create a utility class named HttpClientUtils.java that implements the HTTP request methods.
(1) Define three global constants:
// Character encoding
private static final String ENCODING = "UTF-8";
// Connection timeout in milliseconds
private static final int CONNECT_TIMEOUT = 6000;
// Socket (response) timeout in milliseconds
private static final int SOCKET_TIMEOUT = 6000;
(2) Write packageHeader() to attach the HTTP request headers
// Attach the request headers
public static void packageHeader(Map<String, String> params, HttpRequestBase httpMethod){
if (params != null) {
// entrySet holds every header entry packed into params
Set<Entry<String, String>> entrySet = params.entrySet();
// Iterate over the entries
for (Entry<String, String> entry : entrySet) {
// Set each header on the HttpRequestBase object
httpMethod.setHeader(entry.getKey(),entry.getValue());
}
}
}
(3) Write packageParam() to attach the HTTP request parameters
// Attach the request parameters
public static void packageParam(Map<String,String> params,HttpEntityEnclosingRequestBase httpMethod) throws UnsupportedEncodingException {
if (params != null) {
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
Set<Entry<String, String>> entrySet = params.entrySet();
for (Entry<String, String> entry : entrySet) {
// Put each entry's key and value into the nvps list
nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
}
httpMethod.setEntity(new UrlEncodedFormEntity(nvps, ENCODING));
}
}
(4) Write getHttpClientResult() to obtain the HTTP response content
public static HttpClientResp getHttpClientResult(CloseableHttpResponse httpResponse,CloseableHttpClient httpClient,HttpRequestBase httpMethod) throws Exception{
httpResponse=httpClient.execute(httpMethod);
//Extract the HTTP response result
if(httpResponse != null && httpResponse.getStatusLine() != null) {
String content = "";
if(httpResponse.getEntity() != null) {
content = EntityUtils.toString(httpResponse.getEntity(),ENCODING);
}
return new HttpClientResp(httpResponse.getStatusLine().getStatusCode(),content);
}
return new HttpClientResp(HttpStatus.SC_INTERNAL_SERVER_ERROR);
}
(5) Write doPost() to submit the request headers and parameters
public static HttpClientResp doPost(String url,Map<String,String>headers,Map<String,String>params) throws Exception{
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpPost httppost = new HttpPost(url);
//Build the request configuration
RequestConfig requestConfig = RequestConfig.custom()
.setConnectTimeout(CONNECT_TIMEOUT)
.setSocketTimeout(SOCKET_TIMEOUT)
.build();
//Apply the configuration to the POST request
httppost.setConfig(requestConfig);
//Attach the request headers
packageHeader(headers,httppost);
//Attach the request parameters
packageParam(params,httppost);
//Create an httpResponse object to receive the response content
CloseableHttpResponse httpResponse = null;
try {
return getHttpClientResult(httpResponse,httpclient,httppost);
}finally {
//Release resources
release(httpResponse,httpclient);
}
}
(6) Write release() to free the HTTP request and response resources
private static void release(CloseableHttpResponse httpResponse,CloseableHttpClient httpClient) throws IOException{
if(httpResponse != null) {
httpResponse.close();
}
if(httpClient != null) {
httpClient.close();
}
}
3. A utility class for storing data in HDFS
(1) Add the Hadoop dependencies to pom.xml so the HDFS API can be called:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
(2) In the com.position.reptile package, create a utility class named HttpClientHdfsUtils.java implementing createFileBySysTime(), the method that writes the data to HDFS.
public class HttpClientHdfsUtils {
public static void createFileBySysTime(String url,String fileName,String data) {
System.setProperty("HADOOP_USER_NAME", "root");
Path path = null;
//Read the system time
Calendar calendar = Calendar.getInstance();
Date time = calendar.getTime();
//Format the system time
SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
//Convert the current system time to a String
String filepath = format.format(time);
//Build a Configuration object carrying the Hadoop parameters
Configuration conf = new Configuration();
URI uri= URI.create(url);
FileSystem fileSystem;
try {
//Get the file system object
fileSystem = FileSystem.get(uri,conf);
//Define the directory path
path = new Path("/JobData/"+filepath);
if(!fileSystem.exists(path)) {
fileSystem.mkdirs(path);
}
//Create the file under the specified directory
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(path.toString()+"/"+fileName));
//Write the data into the file
IOUtils.copyBytes(new ByteArrayInputStream(data.getBytes()),fsDataOutputStream,conf,true);
fileSystem.close();
}catch(IOException e) {
e.printStackTrace();
}
}
}
4. Implement the web page data collection
(1) Inspect the request headers in the Chrome browser
(2) In the com.position.reptile package, create the main class HttpClientData.java, which implements the data collection.
public class HttpClientData {
public static void main(String[] args) throws Exception {
//Set the request headers
Map<String,String>headers = new HashMap<String,String>();
headers.put("Cookie","privacyPolicyPopup=false; user_trace_token=20221103113731-d2950fcd-eb36-486c-9032-feab09943d4d; LGUID=20221103113731-ef107f32-06e0-4453-a89c-683f5a558e86; _ga=GA1.2.11435994.1667446652; RECOMMEND_TIP=true; index_location_city=%E5%85%A8%E5%9B%BD; __lg_stoken__=a5abb0b1f9cda5e7a6da82dd7a4397075c675acce324397a86b9cbbd4fc31a58d921346f317ba5c8c92b5c4a9ebb0650576575b67ebae44f422aeb4b1a950643cd2854eece70; JSESSIONID=ABAAAECABIEACCAC2031D7A104C1E74CDC3FABFA00BCC7F; WEBTJ-ID=20221105161123-18446d82e00bcd-0f0b3aafbd8e8e-26021a51-921600-18446d82e018bf; _gid=GA1.2.1865104541.1667635884; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1667446652,1667456559,1667635885; PRE_UTM=; PRE_HOST=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist%5F%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D%3FlabelWords%3Dhot; LGSID=20221105161124-df5ffe02-aefa-434b-b378-2d64367fddde; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fcommon-sec%2Fsecurity-check.html%3Fseed%3D5E87A87B3DA4AFE2BC190FBB560FB9266A5615D5937A536A0FA5205B13CAC74F0D0C1CC5AF1D2DD0C0060C9AF3B36CA5%26ts%3D16676358793441%26name%3Da5abb0b1f9cd%26callbackUrl%3Dhttps%253A%252F%252Fwww.lagou.com%252Fjobs%252Flist%5F%2525E5%2525A4%2525A7%2525E6%252595%2525B0%2525E6%25258D%2525AE%253FlabelWords%253D%2526fromSearch%253Dtrue%2526suginput%253D%253FlabelWords%253Dhot%26srcReferer%3D; _gat=1; X_MIDDLE_TOKEN=668d4b4d5ba925cb7156e2d72086c745; privacyPolicyPopup=false; sensorsdata2015session=%7B%7D; TG-TRACK-CODE=index_search; 
sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221843b917f5d1b4-025994c92cf438-26021a51-921600-1843b917f5e3e5%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%22103.0.0.0%22%2C%22%24latest_referrer_host%22%3A%22%22%7D%2C%22%24device_id%22%3A%221843b917f5d1b4-025994c92cf438-26021a51-921600-1843b917f5e3e5%22%7D; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1667636243; LGRID=20221105161724-fad126be-48da-4684-aa52-1ff6cfb2dffd; SEARCH_ID=535076fc2a094fa2913263e0079a9038; X_HTTP_TOKEN=a18b9f65c1cbf1490626367661a3afc88e7340da5d");
headers.put("Connection","keep-alive");
headers.put("Accept","application/json, text/javascript, */*; q=0.01");
headers.put("Accept-Language","zh-CN,zh;q=0.9");
headers.put("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64)"+"AppleWebKit/537.36 (KHTML, like Gecko)"+"Chrome/103.0.0.0 Safari/537.36");
headers.put("content-type","application/x-www-form-urlencoded; charset=UTF-8");
headers.put("Referer", "https://www.lagou.com/jobs/list_%E5%A4%A7%E6%95%B0%E6%8D%AE?labelWords=&fromSearch=true&suginput=?labelWords=hot");
headers.put("Origin", "https://www.lagou.com");
headers.put("x-requested-with","XMLHttpRequest");
headers.put("x-anit-forge-token","None");
headers.put("x-anit-forge-code","0");
headers.put("Host","www.lagou.com");
headers.put("Cache-Control","no-cache");
//Set the request parameters
Map<String,String>params = new HashMap<String,String>();
params.put("kd","大數(shù)據(jù)");
params.put("city","全國");
//Crawl pages 1 through 30, writing each page of results to HDFS
for (int i=1;i<31;i++){
params.put("pn",String.valueOf(i));
HttpClientResp result = HttpClientUtils.doPost("https://www.lagou.com/jobs/positionAjax.json?"+"needAddtionalResult=false",headers,params);
HttpClientHdfsUtils.createFileBySysTime("hdfs://hadoop1:9000","page"+i,result.toString());
Thread.sleep(1 * 500);
}
}
}
The final collected data is stored under the /JobData directory in HDFS.
Chapter 4: Data Preprocessing
4.1 Analyzing the Data to Preprocess
Inspect the structure and content of the data, then format it.
The project focuses on four aspects: salary, benefits, skill requirements, and position distribution.
- salary (the salary field is a string)
- city (the city field is a string)
- skillLabels (the skill requirement field is an array)
- companyLabelList (benefit tag field, an array); positionAdvantage (a string)
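Since the salary field arrives as a string such as "15k-30k", it must be stripped of its "k" suffixes and split into numeric bounds before any numeric analysis. A minimal sketch of that transformation (the class and method names here are illustrative, not the project's; the actual preprocessing is done by the CleanJob methods below):

```java
public class SalaryParseDemo {
    // Turn a raw salary string like "15k-30k" into {low, high, average}.
    public static double[] parseSalary(String raw) {
        // Drop the "k"/"K" suffixes, then split the range on "-".
        String[] bounds = raw.replaceAll("[kK]", "").split("-");
        double low = Double.parseDouble(bounds[0]);
        double high = bounds.length > 1 ? Double.parseDouble(bounds[1]) : low;
        return new double[] {low, high, (low + high) / 2};
    }

    public static void main(String[] args) {
        double[] r = parseSalary("15k-30k");
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 15.0 30.0 22.5
    }
}
```

The same low/high/average decomposition reappears in Chapter 5 when the detail table ods_jobdata_detail is populated.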
4.2 Designing the Preprocessing Plan
4.3 Implementing the Preprocessing
(1) Prepare the preprocessing environment
Add the Hadoop dependencies to pom.xml:
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
</dependencies>
(2) Create the data transformation class
Create a package named com.position.clean, then a CleanJob class in it implementing the transformations on the position data.
- deleteString(): removes the "k" characters from the salary string
//Delete a given character from a string
public static String deleteString(String str,char delChar) {
StringBuffer stringBuffer = new StringBuffer("");
for(int i=0;i<str.length();i++) {
//str is the string to process; delChar is the character to remove
if(str.charAt(i) != delChar) {
stringBuffer.append(str.charAt(i));
}
}
return stringBuffer.toString();
}
- mergeString(): merges the contents of the companyLabelList field and the positionAdvantage field into one new "-"-separated string
//Merge the benefit tags
public static String mergeString(String position,JSONArray company) throws JSONException {
String result = "";
if(company.length()!=0) {
for(int i=0;i<company.length();i++) {
result = result + company.get(i)+"-";
}
}
if(!position.isEmpty()) {
//Split positionAdvantage on common delimiters (semicolons, commas, enumeration commas, slashes)
String[] positionList = position.split(";|,|、|,|/");
for(int i=0;i<positionList.length;i++) {
result = result + positionList[i].replaceAll("[\\pP\\p{Punct}]", "")+"-";
}
return result.substring(0,result.length()-1);
}
- killResult(): joins the skill tags with "-" as the separator into a new string
//Process the skill tags
public static String killResult(JSONArray killData) throws JSONException {
String result = "";
if(killData.length() != 0) {
for(int i=0;i<killData.length();i++) {
result = result + killData.get(i)+"-";
}
return result.substring(0,result.length()-1);
}else {
return "null";
}
}
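Both mergeString() and killResult() reduce a list of labels to one "-"-joined string. The same pattern, written against a plain Java list instead of a JSONArray (the class name here is illustrative, not part of the project):

```java
import java.util.Arrays;
import java.util.List;

public class LabelJoinDemo {
    // Join labels with "-"; return "null" for an empty list,
    // mirroring killResult()'s behaviour.
    public static String joinLabels(List<String> labels) {
        if (labels.isEmpty()) {
            return "null";
        }
        StringBuilder sb = new StringBuilder();
        for (String label : labels) {
            sb.append(label).append('-');
        }
        // Trim the trailing separator, as the project code does with
        // result.substring(0, result.length() - 1).
        return sb.substring(0, sb.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(joinLabels(Arrays.asList("Hadoop", "Hive", "Spark")));
        // Hadoop-Hive-Spark
    }
}
```

The "-" separator matters downstream: Hive later splits these strings back into array columns with "collection items terminated by '-'".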
- resultToString(): processes every position record in the data file and reassembles it into a new string
//Assemble the cleaned result
public static String resultToString(JSONArray jobdata) throws JSONException {
String jobResultData="";
for(int i=0;i<jobdata.length();i++) {
String everyData = jobdata.get(i).toString();
JSONObject everyDataJson=new JSONObject(everyData);
String city = everyDataJson.getString("city");
String salary = everyDataJson.getString("salary");
String positionAdvantage = everyDataJson.getString("positionAdvantage");
JSONArray companyLabelList = everyDataJson.getJSONArray("companyLabelList");
JSONArray skillLables = everyDataJson.getJSONArray("skillLables");
//Process the salary field
String salaryNew = deleteString(salary,'k');
String welfare = mergeString(positionAdvantage,companyLabelList);
String kill = killResult(skillLables);
if(i == jobdata.length() -1) {
jobResultData = jobResultData+city+","+salaryNew+","+welfare+","+kill;
}else {
jobResultData = jobResultData+city+","+salaryNew+","+welfare+","+kill+"\n";
}
}
return jobResultData;
}
}
(3) Create the Mapper class implementing the Map task
In the com.position.clean package, create a class named CleanMapper implementing the MapReduce program's map method.
//CleanMapper extends the Mapper base class and defines the input and output key/value types of the Map phase
public class CleanMapper extends Mapper<LongWritable,Text,Text,NullWritable>{
//map() processes each input key/value pair
protected void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException {
String jobResultData="";
String reptileData = value.toString();
//Extract the content payload from the record by substring
String jobData = reptileData.substring(reptileData.indexOf("=",reptileData.indexOf("=")+1)+1,
reptileData.length()-1
);
try {
//Get the data under "content"
JSONObject contentJson = new JSONObject(jobData);
String contentData = contentJson.getString("content");
//Get the data under content.positionResult
JSONObject positionResultJson = new JSONObject(contentData);
String positionResultData = positionResultJson.getString("positionResult");
//Get the final data under "result"
JSONObject resultJson = new JSONObject(positionResultData);
JSONArray resultData = resultJson.getJSONArray("result");
jobResultData = CleanJob.resultToString(resultData);
context.write(new Text(jobResultData), NullWritable.get());
} catch (JSONException e) {
e.printStackTrace();
}
}
}
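The substring logic in map() relies on each record being the crawler's HttpClientResp.toString() output, i.e. a line shaped like "HttpClientResp [code=200, content={...}]": the JSON payload starts after the second "=" and ends before the closing "]". Assuming that exact layout, the extraction can be exercised on its own (the class name here is illustrative):

```java
public class ContentExtractDemo {
    // Pull the content value out of a line shaped like
    // "HttpClientResp [code=200, content=<json>]".
    public static String extractContent(String record) {
        int firstEq = record.indexOf('=');
        int secondEq = record.indexOf('=', firstEq + 1);
        // Drop the trailing ']' that closes the toString() output.
        return record.substring(secondEq + 1, record.length() - 1);
    }

    public static void main(String[] args) {
        String line = "HttpClientResp [code=200, content={\"content\":{}}]";
        System.out.println(extractContent(line)); // {"content":{}}
    }
}
```

This is brittle by design: it only works because the crawler controls the record format end to end.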
(4) Create and run the MapReduce program
In the com.position.clean package, create a class named CleanMain that configures the MapReduce job.
public class CleanMain {
public static void main(String[] args) throws IOException,ClassNotFoundException,InterruptedException {
//Log to the console
BasicConfigurator.configure();
//Initialize the Hadoop configuration
Configuration conf = new Configuration();
//Define a new Job: the first argument is the Hadoop configuration, the second is the job name
Job job = new Job(conf,"job");
//Set the main class
job.setJarByClass(CleanMain.class);
//Set the Mapper class
job.setMapperClass(CleanMapper.class);
//Set the job's output key class
job.setOutputKeyClass(Text.class);
//Set the job's output value class
job.setOutputValueClass(NullWritable.class);
//Input path
FileInputFormat.addInputPath(job, new Path("hdfs://hadoop1:9000/JobData/20221105"));
//Output path
FileOutputFormat.setOutputPath(job,new Path("D:\\BigData\\out"));
System.exit(job.waitForCompletion(true)?0:1);
}
}
(5) Package the program and submit it to the cluster
Modify the MapReduce main class:
package com.position.clean;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.log4j.BasicConfigurator;
public class CleanMain {
public static void main(String[] args) throws IOException,ClassNotFoundException,InterruptedException {
//Log to the console
BasicConfigurator.configure();
//Initialize the Hadoop configuration
Configuration conf = new Configuration();
//Read the remaining arguments from the hadoop command line
String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
//Expect exactly two arguments: the input and output directories
if(otherArgs.length != 2) {
System.err.println("Usage:wordcount<in><out>");
System.exit(2);
}
//Define a new Job: the first argument is the Hadoop configuration, the second is the job name
Job job = new Job(conf,"job");
//Set the main class
job.setJarByClass(CleanMain.class);
//Set the Mapper class
job.setMapperClass(CleanMapper.class);
//Combine small input files
job.setInputFormatClass(CombineTextInputFormat.class);
//Minimum combined split size: 2 MB
CombineTextInputFormat.setMinInputSplitSize(job, 2097152);
//Maximum combined split size: 4 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
//Set the job's output key class
job.setOutputKeyClass(Text.class);
//Set the job's output value class
job.setOutputValueClass(NullWritable.class);
//Input path (first argument)
FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
//Output path (second argument)
FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
Build the jar package.
Submit the jar to the cluster and run it.
Chapter 5: Data Analysis
5.1 Data Analysis Overview
This project analyzes the recruitment website data with Hive, which is built on the distributed file system.
5.2 The Hive Data Warehouse
Hive is a data warehouse built on top of the Hadoop distributed file system. It provides a set of tools for extracting, transforming, and loading (ETL) data stored in HDFS, and can store, query, and analyze large-scale data kept in Hadoop. Hive translates HQL statements into MapReduce jobs for execution.
The warehouse for this project is designed as a star schema, consisting of one fact table and several dimension tables.
- Fact table (ods_jobdata_origin): stores the data cleaned by the MapReduce job
Field | Data type | Description |
city | String | city |
salary | array<String> | salary |
company | array<String> | benefit tags |
kill | array<String> | skill tags |
- Dimension table (t_salary_detail): stores the salary distribution analysis data
Field | Data type | Description |
salary | String | salary band |
count | int | frequency of salaries within the band |
- Dimension table (t_company_detail): stores the benefit tag analysis data
Field | Data type | Description |
company | String | benefit tag |
count | int | frequency of each benefit tag |
- Dimension table (t_city_detail): stores the city distribution analysis data
Field | Data type | Description |
city | String | city |
count | int | city frequency |
- Dimension table (t_kill_detail): stores the skill tag analysis data
Field | Data type | Description |
kill | String | skill tag |
count | int | frequency of each skill tag |
Building the warehouse
- After starting the Hadoop cluster, start Hive on the master node hadoop1
- Load the preprocessed data from HDFS into the fact table ods_jobdata_origin
--Create the jobdata warehouse
create database jobdata;
use jobdata;
--Create the fact table ods_jobdata_origin
create table ods_jobdata_origin(
  city string comment 'city',
  salary array<string> comment 'salary',
  company array<string> comment 'benefits',
  kill array<string> comment 'skills')
comment 'raw position data table'
row format delimited fields terminated by ','
collection items terminated by '-'
stored as textfile;
--Load the data
load data inpath '/JobData/output/part-r-00000' overwrite into table ods_jobdata_origin;
--Query the data
select * from ods_jobdata_origin;
- Create the detail table ods_jobdata_detail, which stores the fact table data with the salary field broken out
create table ods_jobdata_detail(
  city string comment 'city',
  salary array<string> comment 'salary',
  company array<string> comment 'benefits',
  kill array<string> comment 'skills',
  low_salary int comment 'lower salary bound',
  high_salary int comment 'upper salary bound',
  avg_salary double comment 'average salary')
comment 'position data detail table'
row format delimited fields terminated by ','
collection items terminated by '-'
stored as textfile;
insert overwrite table ods_jobdata_detail
select city,salary,company,kill,salary[0],salary[1],(salary[0]+salary[1])/2
from ods_jobdata_origin;
- Flatten the salary field and store the result in the temporary intermediate table t_ods_tmp_salary
create table t_ods_tmp_salary as select explode(ojo.salary) from ods_jobdata_origin ojo;
- Generalize every row of t_ods_tmp_salary into a salary band and store the result in the intermediate table t_ods_tmp_salary_dist
create table t_ods_tmp_salary_dist as
select case
  when col>=0 and col<=5 then "0-5"
  when col>=6 and col<=10 then "6-10"
  when col>=11 and col<=15 then "11-15"
  when col>=16 and col<=20 then "16-20"
  when col>=21 and col<=25 then "21-25"
  when col>=26 and col<=30 then "26-30"
  when col>=31 and col<=35 then "31-35"
  when col>=36 and col<=40 then "36-40"
  when col>=41 and col<=45 then "41-45"
  when col>=46 and col<=50 then "46-50"
  when col>=51 and col<=55 then "51-55"
  when col>=56 and col<=60 then "56-60"
  when col>=61 and col<=65 then "61-65"
  when col>=66 and col<=70 then "66-70"
  when col>=71 and col<=75 then "71-75"
  when col>=76 and col<=80 then "76-80"
  when col>=81 and col<=85 then "81-85"
  when col>=86 and col<=90 then "86-90"
  when col>=91 and col<=95 then "91-95"
  when col>=96 and col<=100 then "96-100"
  when col>=101 then ">101"
end
from t_ods_tmp_salary;
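The CASE expression buckets each salary value (in k) into 5-wide bands. The same generalization rule can be expressed compactly in Java for a single value (illustrative only; the warehouse itself uses the HiveQL above):

```java
public class SalaryBinDemo {
    // Map a salary value (in k) to the band used by t_ods_tmp_salary_dist,
    // e.g. 7 -> "6-10", 103 -> ">101".
    public static String band(int salary) {
        if (salary < 0) return null;          // no band for negative values
        if (salary <= 5) return "0-5";
        if (salary >= 101) return ">101";
        int low = ((salary - 1) / 5) * 5 + 1; // 6, 11, 16, ...
        return low + "-" + (low + 4);
    }

    public static void main(String[] args) {
        System.out.println(band(7));   // 6-10
        System.out.println(band(15));  // 11-15
        System.out.println(band(103)); // >101
    }
}
```

Writing the rule arithmetically like this also makes it easy to check that the twenty-one CASE branches cover every non-negative value exactly once.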
- Flatten the benefit tag field and store the result in the temporary intermediate table t_ods_tmp_company
create table t_ods_tmp_company as select explode(ojo.company) from ods_jobdata_origin ojo;
- Flatten the skill tag field and store the result in the temporary intermediate table t_ods_tmp_kill
create table t_ods_tmp_kill as select explode(ojo.kill) from ods_jobdata_origin ojo;
- Create the dimension table t_ods_kill for the skill tag statistics
create table t_ods_kill(
  every_kill string comment 'skill tag',
  count int comment 'frequency')
comment 'skill tag frequency statistics'
row format delimited fields terminated by ','
stored as textfile;
- Create the dimension table t_ods_company for the benefit tag statistics
create table t_ods_company(
  every_company string comment 'benefit tag',
  count int comment 'frequency')
comment 'benefit tag frequency statistics'
row format delimited fields terminated by ','
stored as textfile;
- Create the dimension table t_ods_salary for the salary distribution statistics
create table t_ods_salary(
  every_partition string comment 'salary band',
  count int comment 'aggregated count')
comment 'salary distribution statistics'
row format delimited fields terminated by ','
stored as textfile;
- Create the dimension table t_ods_city for the city statistics
create table t_ods_city(
  every_city string comment 'city',
  count int comment 'frequency')
comment 'city statistics'
row format delimited fields terminated by ','
stored as textfile;
5.3 Analyzing the Data
- Position region analysis
--Position region analysis
insert overwrite table t_ods_city
select city,count(1) from ods_jobdata_origin group by city;
--Query the region information in descending order of count
select * from t_ods_city sort by count desc;
- Position salary analysis
--Position salary analysis
insert overwrite table t_ods_salary
select `_c0`,count(1) from t_ods_tmp_salary_dist group by `_c0`;
--View the analysis results in t_ods_salary, using sort by to order the count column descending
select * from t_ods_salary sort by count desc;
--Mean
select avg(avg_salary) from ods_jobdata_detail;
--Mode
select avg_salary,count(1) as cnt from ods_jobdata_detail group by avg_salary order by cnt desc limit 1;
--Median
select percentile(cast(avg_salary as bigint),0.5) from ods_jobdata_detail;
- Company benefit tag analysis
--Company benefit analysis
insert overwrite table t_ods_company
select col,count(1) from t_ods_tmp_company group by col;
--Query the top 10 benefit tags in descending order
select every_company,count from t_ods_company sort by count desc limit 10;
- Position skill requirement analysis
--Position skill requirement analysis
insert overwrite table t_ods_kill
select col,count(1) from t_ods_tmp_kill group by col;
--Query the top 3 skills in descending order
select every_kill,count from t_ods_kill sort by count desc limit 3;
Chapter 6: Data Visualization
6.1 Platform Overview
The recruitment position analysis visualization system presents the analysis results graphically on a web platform, aiming to communicate the findings about current big data positions clearly and effectively. The system uses ECharts for charting.
The system is built on Java web technology: the back end uses the SSM (Spring + Spring MVC + MyBatis) framework, the front end renders the visualizations with ECharts inside JSP pages, and the front and back ends exchange data through Spring MVC and AJAX.
6.2 Data Migration
- Create the relational database (connected through the Navicat client)
--Create the database JobData
CREATE DATABASE JobData CHARACTER SET utf8 COLLATE utf8_general_ci;
--Create the city distribution table
create table t_city_count(
city VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
--Create the salary distribution table
create table t_salary_count(
salary VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
--Create the welfare tag statistics table
create table t_company_count(
company VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
--Create the skill tag statistics table
create table t_kill_count(
kills VARCHAR(30) DEFAULT null,
count int(5) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
- Migrate the data with Sqoop
Sqoop transfers data between Hadoop (Hive) and traditional relational databases (MySQL): it can import data from a relational database into HDFS, and export data from HDFS back into a relational database.
(If Sqoop prints warnings at startup, edit the bin/configure-sqoop file and comment out the corresponding lines.)
--Migrate the per-city job distribution statistics into the t_city_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_city_count \
--columns "city,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_city
--Migrate the salary distribution results into the t_salary_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_salary_count \
--columns "salary,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_salary
--Migrate the welfare statistics results into the t_company_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_company_count \
--columns "company,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_company
--Migrate the skill tag statistics into the t_kill_count table
bin/sqoop export \
--connect jdbc:mysql://hadoop1:3306/JobData?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--table t_kill_count \
--columns "kills,count" \
--fields-terminated-by ',' \
--export-dir /user/hive/warehouse/jobdata.db/t_ods_kill
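Conceptually, each `sqoop export` above reads the comma-delimited text files under the Hive table's warehouse directory and turns every line into one row inserted into the corresponding MySQL table. A simplified Python sketch of just the parsing step, with hypothetical file contents and no real MySQL connection:

```python
import csv
import io

# Hypothetical contents of a file under /user/hive/warehouse/jobdata.db/t_ods_city,
# written with fields terminated by ','
export_file = io.StringIO("Beijing,120\nShanghai,95\n")

# Each parsed line becomes one (city, count) row, as if for:
#   INSERT INTO t_city_count (city, count) VALUES (%s, %s)
rows = [(city, int(count)) for city, count in csv.reader(export_file)]
print(rows)  # → [('Beijing', 120), ('Shanghai', 95)]
```

This is also why the `--fields-terminated-by ','` flag must match the `row format delimited fields terminated by ','` clause used when the Hive tables were created: otherwise the export cannot split the lines into columns.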
6.3 Setting Up the Platform Environment
If the newly created project reports "web.xml is missing and <failOnMissingWebXml> is set to true", the error is caused by the missing web.xml file; add a web.xml under src/main/webapp/WEB-INF.
- Configure pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.itcast.jobanalysis</groupId>
<artifactId>job-web</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>war</packaging>
<dependencies>
<dependency>
<groupId>org.codehaus.jettison</groupId>
<artifactId>jettison</artifactId>
<version>1.1</version>
</dependency>
<!-- Spring -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-beans</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-webmvc</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jdbc</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aspects</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jms</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
<version>4.2.4.RELEASE</version>
</dependency>
<!-- Mybatis -->
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis</artifactId>
<version>3.2.8</version>
</dependency>
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis-spring</artifactId>
<version>1.2.2</version>
</dependency>
<dependency>
<groupId>com.github.miemiedev</groupId>
<artifactId>mybatis-paginator</artifactId>
<version>1.2.15</version>
</dependency>
<!-- MySql -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.32</version>
</dependency>
<!-- Connection pool -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.0.9</version>
<exclusions>
<exclusion>
<groupId>com.alibaba</groupId>
<artifactId>jconsole</artifactId>
</exclusion>
<exclusion>
<groupId>com.alibaba</groupId>
<artifactId>tools</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- JSP related -->
<dependency>
<groupId>jstl</groupId>
<artifactId>jstl</artifactId>
<version>1.2</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
<version>2.5</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>jsp-api</artifactId>
<version>2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
<version>1.8.4</version>
</dependency>
</dependencies>
<build>
<finalName>${project.artifactId}</finalName>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.properties</include>
<include>**/*.xml</include>
</includes>
<filtering>false</filtering>
</resource>
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/*.properties</include>
<include>**/*.xml</include>
</includes>
<filtering>false</filtering>
</resource>
</resources>
<plugins>
<!-- Specify the JDK version for Maven compilation; without it, Maven 3 defaults to JDK 1.5 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<!-- JDK version of the source code -->
<source>1.8</source>
<!-- Bytecode version of the generated class files -->
<target>1.8</target>
<!-- Character encoding -->
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Configure the Tomcat plugin -->
<plugin>
<groupId>org.apache.tomcat.maven</groupId>
<artifactId>tomcat7-maven-plugin</artifactId>
<version>2.2</version>
<configuration>
<path>/</path>
<port>8080</port>
</configuration>
</plugin>
</plugins>
</build>
</project>
- In applicationContext.xml under the src/main/resources/spring folder, write the Spring configuration
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:p="http://www.springframework.org/schema/p"
xmlns:aop="http://www.springframework.org/schema/aop"
xmlns:tx="http://www.springframework.org/schema/tx"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.2.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.2.xsd
http://www.springframework.org/schema/aop
http://www.springframework.org/schema/aop/spring-aop-4.2.xsd
http://www.springframework.org/schema/tx
http://www.springframework.org/schema/tx/spring-tx-4.2.xsd
http://www.springframework.org/schema/util
http://www.springframework.org/schema/util/spring-util-4.2.xsd">
<!-- Load the property file -->
<context:property-placeholder
location="classpath:properties/db.properties" />
<!-- Database connection pool -->
<bean id="dataSource"
class="com.alibaba.druid.pool.DruidDataSource"
destroy-method="close">
<property name="url" value="${jdbc.url}" />
<property name="username" value="${jdbc.username}" />
<property name="password" value="${jdbc.password}" />
<property name="driverClassName" value="${jdbc.driver}" />
<property name="maxActive" value="10" />
<property name="minIdle" value="5" />
</bean>
<!-- Let Spring manage the SqlSessionFactory, using the MyBatis-Spring integration package -->
<bean id="sqlSessionFactory"
class="org.mybatis.spring.SqlSessionFactoryBean">
<!-- Database connection pool -->
<property name="dataSource" ref="dataSource" />
<!-- Load the MyBatis global configuration file -->
<property name="configLocation"
value="classpath:mybatis/mybatis-config.xml" />
</bean>
<!-- Create mapper proxy objects by scanning the package -->
<bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
<property name="basePackage" value="cn.itcast.mapper" />
</bean>
<!-- Transaction manager -->
<bean id="transactionManager"
class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
<!-- Data source -->
<property name="dataSource" ref="dataSource" />
</bean>
<!-- Advice -->
<tx:advice id="txAdvice" transaction-manager="transactionManager">
<tx:attributes>
<!-- Propagation behavior -->
<tx:method name="save*" propagation="REQUIRED" />
<tx:method name="insert*" propagation="REQUIRED" />
<tx:method name="add*" propagation="REQUIRED" />
<tx:method name="create*" propagation="REQUIRED" />
<tx:method name="delete*" propagation="REQUIRED" />
<tx:method name="update*" propagation="REQUIRED" />
<tx:method name="find*"
propagation="SUPPORTS"
read-only="true" />
<tx:method name="select*"
propagation="SUPPORTS"
read-only="true" />
<tx:method name="get*"
propagation="SUPPORTS"
read-only="true" />
</tx:attributes>
</tx:advice>
<!-- Aspect -->
<aop:config>
<aop:advisor advice-ref="txAdvice"
pointcut="execution(* cn.itcast.service..*.*(..))" />
</aop:config>
<!-- Package scanner: scan all classes annotated with @Service -->
<context:component-scan base-package="cn.itcast.service" />
</beans>
- In springmvc.xml under the src/main/resources/spring folder, write the Spring MVC configuration
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:mvc="http://www.springframework.org/schema/mvc"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.2.xsd
http://www.springframework.org/schema/mvc
http://www.springframework.org/schema/mvc/spring-mvc-4.2.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-4.2.xsd">
<!-- Scan the given package so the @Controller annotations in it take effect -->
<context:component-scan base-package="cn.itcast.controller" />
<!-- Enable MVC annotation-driven configuration -->
<mvc:annotation-driven />
<!-- View resolver -->
<bean
class=
"org.springframework.web.servlet.view.InternalResourceViewResolver">
<property name="prefix" value="/WEB-INF/jsp/" />
<property name="suffix" value=".jsp" />
</bean>
<!-- Configure static resource mappings -->
<mvc:resources location="/css/" mapping="/css/**"/>
<mvc:resources location="/js/" mapping="/js/**"/>
<mvc:resources location="/assets/" mapping="/assets/**"/>
<mvc:resources location="/img/" mapping="/img/**"/>
</beans>
- Write the web.xml file, configuring the Spring listener, the encoding filter, the Spring MVC front controller, and related settings
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5">
<display-name>job-web</display-name>
<welcome-file-list>
<welcome-file>index.html</welcome-file>
</welcome-file-list>
<!-- Load the Spring container -->
<context-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:spring/applicationContext.xml</param-value>
</context-param>
<listener>
<listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>
<!-- Fix garbled characters in POST requests -->
<filter>
<filter-name>CharacterEncodingFilter</filter-name>
<filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>utf-8</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>CharacterEncodingFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
<!-- Configure the Spring MVC front controller -->
<servlet>
<servlet-name>data-report</servlet-name>
<servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
<init-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:spring/springmvc.xml</param-value>
</init-param>
<load-on-startup>1</load-on-startup>
</servlet>
<!-- Intercept all requests except JSPs -->
<servlet-mapping>
<servlet-name>data-report</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>
<!-- Global error page -->
<error-page>
<error-code>404</error-code>
<location>/WEB-INF/jsp/404.jsp</location>
</error-page>
</web-app>
- Write the database configuration file db.properties to decouple the connection settings from the project
jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://hadoop1:3306/JobData?characterEncoding=utf-8
jdbc.username=root
jdbc.password=123456
- Write the mybatis-config.xml file for the MyBatis-specific configuration (left empty here)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
</configuration>
6.4 Implementing the Graphical Display
Implement the job region distribution chart
Implement the salary distribution chart
Implement the welfare tag word cloud
Implement the skill tag word cloud
Platform visualization display
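For each chart above, the back-end controller ultimately returns JSON that the JSP page feeds into ECharts' `setOption`. A hedged Python sketch of assembling such a payload for the city-distribution bar chart; the sample rows are illustrative assumptions, not the project's actual API:

```python
import json

# Hypothetical rows fetched from t_city_count via MyBatis
rows = [("Beijing", 120), ("Shanghai", 95), ("Shenzhen", 80)]

# Shape the data the way an ECharts bar chart expects:
# one axis of category labels and one series of values
option = {
    "xAxis": {"type": "category", "data": [city for city, _ in rows]},
    "yAxis": {"type": "value"},
    "series": [{"type": "bar", "data": [count for _, count in rows]}],
}

payload = json.dumps(option, ensure_ascii=False)
print(payload)
```

On the page, the AJAX success callback would pass the parsed object straight to `chart.setOption(...)`; the word-cloud charts differ only in the series type and data shape.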