I. Sync Environment
1. MongoDB version: 3.6.3. (A bit dated; I later found that against this version Flink CDC could only monitor a single table — multiple tables could not be monitored.)
2. DataX version: self-compiled DataX-datax_v202210
3. HDFS version: 3.1.3
4. Hive version: 3.1.2
II. Sync Approach
1. Incremental data: every hour, data from 17 MongoDB collections has to be synced to Hive. Since every record carries a generation timestamp, I use DataX's query mode: loop over the collections and pull the previous hour's data into HDFS with DataX, then use a shell script plus a scheduler to load it into Hive on a timer to form the ODS layer, which is joined with other tables to build the DWD layer handed to the downstream consumers.
2. Full data: historical data is synced to Hive with a DataX script that reads in a loop, plus scheduling and Hive dynamic partitioning. Hive's dynamic partitioning allows only 100 partitions per node by default (hive.exec.max.dynamic.partitions.pernode = 100), and I partition by hour, so each run pulls only 4 days of data (96 hourly partitions); pulling more throws an error. The script simply pulls however many days are needed, 4 at a time. (A fairly clumsy method — better approaches are welcome in the comments.) A minimal HQL sketch of both loading paths follows this list.
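To make both loading paths concrete, here is a minimal HQL sketch; the table names (ods_xxx, tmp_xxx), the HDFS path, and the partition columns dt/hr are hypothetical placeholders, not my actual names:

-- Incremental: load one hour of DataX output into a fixed hourly partition
LOAD DATA INPATH '/warehouse/xxx/2022111416'
INTO TABLE ods_xxx PARTITION (dt = '20221114', hr = '16');

-- Full backfill: dynamic partitioning needs these settings first.
-- hive.exec.max.dynamic.partitions.pernode defaults to 100, which is why
-- 4 days x 24 hourly partitions (96) fits while a larger pull errors out;
-- raising the limits is the alternative to pulling 4 days at a time.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions = 3000;
SET hive.exec.max.dynamic.partitions.pernode = 1000;
INSERT INTO TABLE ods_xxx PARTITION (dt, hr)
SELECT ask_id, data, gid, text, `time`, uid, dt, hr FROM tmp_xxx;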
III. DataX Configuration
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["xxxxxxxx:27017"],
                        "authDb": "admin",
                        "userName": "xxxxx",
                        "userPassword": "xxxx",
                        "dbName": "xxxx",
                        "collectionName": "xxxx",
                        "column": [
                            {
                                "name": "_id",
                                "type": "string"
                            },
                            {
                                "name": "data",
                                "type": "string"
                            },
                            {
                                "name": "gid",
                                "type": "string"
                            },
                            {
                                "name": "text",
                                "type": "string"
                            },
                            {
                                "name": "time",
                                "type": "bigint"
                            },
                            {
                                "name": "uid",
                                "type": "string"
                            }
                        ],
                        "query": "{\"time\":{ \"$gte\": ${start_time}, \"$lt\": ${end_time}}}"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "ask_id",
                                "type": "string"
                            },
                            {
                                "name": "data",
                                "type": "string"
                            },
                            {
                                "name": "gid",
                                "type": "string"
                            },
                            {
                                "name": "text",
                                "type": "string"
                            },
                            {
                                "name": "time",
                                "type": "string"
                            },
                            {
                                "name": "uid",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "xxxx:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "xxxx",
                        "fileType": "text",
                        "path": "${targetdir}",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
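The ${start_time} and ${end_time} placeholders in the reader's query (and ${targetdir} in the writer) are filled at launch time with DataX's -p"-Dkey=value" option — the same mechanism the scheduling script in section IV uses for -Dcollection. A quick example, where the epoch-second values and the job file name are hypothetical placeholders:

python $DATAX_HOME/bin/datax.py -p"-Dstart_time=1668412800 -Dend_time=1668416400 -Dtargetdir=/warehouse/xxx/2022111416" $DATAX_HOME/job/mongo2hdfs.json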
There are two pitfalls here.
First: when DataX connects to MongoDB, pay close attention to the "authDb": "admin" setting. It must name the database where the sync account was created; since MongoDB authenticates per database, write whichever database the account authenticates against. Until I set it, the job kept failing with:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-02], Description:[DataX引擎運行過程出錯,具體原因請參看DataX運行結束時的錯誤診斷信息 .]. - com.mongodb.MongoCommandException: Command failed with error 13: 'command count requires authentication' on server xxx:27117. The full response is { "ok" : 0.0, "errmsg" : "command count requires authentication", "code" : 13, "codeName" : "Unauthorized" }
I went through a lot of material; there are two ways to solve the authentication problem. The first is the one just mentioned: point authDb at the account's authentication database. The second is to grant the account read access on each database to be synced, like this:
db.createUser({user: "xxxxx", pwd: "xxxxxx", roles: [{"role": "read", "db": "xxxx"}]})
Query-only syncing does not need elevated privileges; read is enough.
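To double-check which database an account authenticates against, you can connect with the mongo shell's --authenticationDatabase flag (host and user below are placeholders):

mongo --host xxxxxxxx:27017 -u xxxxx -p --authenticationDatabase admin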
Second pitfall: mongodbreader's query condition is written as JSON. A source-code analysis someone shared online says the comma-separated conditions are combined with AND, i.e. to express OR you would have to run multiple queries. (In my own tests it was not purely AND either: when the same field appears several times, only the last occurrence seems to win — which matches ordinary JSON duplicate-key handling. Something to dig into later.) Well, since I was too lazy to write my own reader code, I made do. Here is the multi-condition query usage I was sharing:
"query":"{\"time\":{ \"$gte\": 1646064000, \"$lte\": 1648742399},\"time\":{ \"$gte\": 1654012800, \"$lte\": 1656604799},\"time\":{ \"$gte\": 1661961600, \"$lte\": 1664553599}}"
IV. DataX Sync Scheduling Script
#!/bin/bash
# Define variables in one place for easy editing
APP=xxx
TABLE=xxx
DATAX_HOME=xxxx
# If a date is passed in as $1, use it; otherwise default to one hour before now
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=$(date -d '-1 hour' +%Y%m%d%H)
fi
hr1=${do_date:8:2}
date1=${do_date:0:8}
hdfs_path=xxx
# Prepare the target path: create it if it does not exist, empty it if it does,
# so the sync job can be re-run safely
hadoop fs -test -e $hdfs_path
if [[ $? -ne 0 ]]; then
    echo "Path $hdfs_path does not exist, creating..."
    hadoop fs -mkdir -p $hdfs_path
else
    echo "Path $hdfs_path already exists"
    fs_count=$(hadoop fs -count $hdfs_path)
    content_size=$(echo $fs_count | awk '{print $3}')
    if [[ $content_size -eq 0 ]]; then
        echo "Path $hdfs_path is empty"
    else
        echo "Path $hdfs_path is not empty, clearing..."
        hadoop fs -rm -r -f $hdfs_path/*
    fi
fi
# Data sync: loop over the collections
for i in xxx xxx xxx
do
    echo ================== $i loading date: $do_date ==================
    python $DATAX_HOME/bin/datax.py -p"-Dcollection=$i -Dtargetdir=$hdfs_path" $DATAX_HOME/xxx
done
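The script fills -Dcollection and -Dtargetdir, but the reader's query also expects ${start_time}/${end_time}. Assuming the time field stores Unix epoch seconds (the example query values suggest it does), a hedged sketch of deriving them from do_date with GNU date and passing them through:

# Hour boundaries in epoch seconds for the reader's query
start_time=$(date -d "${date1} ${hr1}:00:00" +%s)
end_time=$((start_time + 3600))
python $DATAX_HOME/bin/datax.py -p"-Dcollection=$i -Dtargetdir=$hdfs_path -Dstart_time=$start_time -Dend_time=$end_time" $DATAX_HOME/xxx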
V. DataX Sync to ES Configuration
There is a dedicated tool for MongoDB-to-ES sync, Monstache; I know of it but have not used it yet, so I leave that for later. Being pressed for time, I used DataX. Three points to note here:
1. object-typed fields can be read by DataX as string and written back as object on the ES side (in my config, _id was originally ObjectId and data was a list of objects; reading both as string worked, and the writer maps data back to object).
2. Field names that collide with ES keywords are no problem (a field literally named text with type text causes no trouble).
3. To use ES Chinese tokenization for word-frequency stats, besides installing the IK analyzer you also need fielddata: true on the text field. I have not confirmed the full setup, but I do know it does not work without fielddata; a mapping sketch follows.
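A hedged sketch of that mapping (index name is a placeholder; this is the typeless ES 7.x form — on a cluster that still uses mapping types, as the writer's "type" parameter below suggests, the properties block sits one level down under the type name):

PUT xxxx
{
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "ik_smart", "fielddata": true }
    }
  }
}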
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["xxxx:27017"],
                        "userName": "xxx",
                        "authDb": "xxx",
                        "userPassword": "xxxx",
                        "dbName": "xxxx",
                        "collectionName": "${collection}",
                        "column": [
                            {
                                "name": "_id",
                                "type": "string"
                            },
                            {
                                "name": "data",
                                "type": "string"
                            },
                            {
                                "name": "gid",
                                "type": "string"
                            },
                            {
                                "name": "text",
                                "type": "string"
                            },
                            {
                                "name": "time",
                                "type": "bigint"
                            },
                            {
                                "name": "uid",
                                "type": "string"
                            },
                            {
                                "name": "deleted",
                                "type": "bigint"
                            }
                        ],
                        "query": "{\"time\":{ \"$gte\": 1661961600, \"$lte\": 1664553599}}"
                    }
                },
                "writer": {
                    "name": "elasticsearchwriter",
                    "parameter": {
                        "endpoint": "xxxxxx:9200",
                        "index": "xxxx",
                        "type": "xxxx",
                        "cleanup": false,
                        "settings": {"index": {"number_of_shards": 1, "number_of_replicas": 0}},
                        "discovery": false,
                        "batchSize": 2048,
                        "splitter": ",",
                        "column": [
                            {
                                "name": "_id",
                                "type": "id"
                            },
                            {
                                "name": "data",
                                "type": "object"
                            },
                            {
                                "name": "gid",
                                "type": "keyword"
                            },
                            {
                                "name": "text",
                                "type": "text",
                                "analyzer": "ik_smart"
                            },
                            {
                                "name": "time",
                                "type": "long"
                            },
                            {
                                "name": "uid",
                                "type": "keyword"
                            },
                            {
                                "name": "deleted",
                                "type": "long"
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 4
            }
        }
    }
}
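Invocation mirrors the Hive job: the ${collection} placeholder is filled per collection via -p (the job file name here is a hypothetical placeholder):

python $DATAX_HOME/bin/datax.py -p"-Dcollection=xxxx" $DATAX_HOME/job/mongo2es.json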
VI. Other Issues
The rest was fairly straightforward and I have not bothered to write it down; I will add to this if more problems come up later.