
ORC vs. Parquet Compression Analysis


@date: June 14, 2023

環(huán)境

  • OS:CentOS 6.5
  • JDK:1.8
  • 內(nèi)存:256G
  • 磁盤:HDD
  • CPU:Dual 8-core Intel? Xeon? CPU (32 Hyper-Threads) E5-2630 v3 @ 2.40GHz

Data is written through the native ORC and Parquet writer APIs, and the following compression codecs are tested:

  • lzo
  • lz4 (lz4_raw for Parquet)
  • Zstandard
  • snappy

Data schema

The Parquet and ORC schemas are kept as consistent as possible.

parquet

        MessageType schema = MessageTypeParser.parseMessageType("message schema {\n" +
                " required INT64 long_value;\n" +
                " required double double_value;\n" +
                " required boolean boolean_value;\n" +
                " required binary string_value (UTF8);\n" +
                " required binary decimal_value (DECIMAL(32,18));\n" +
                " required INT64 time_value;\n" +
                " required INT64 time_instant_value;\n" +
                " required INT64 date_value;\n" +
                "}");

orc

        TypeDescription readSchema = TypeDescription.createStruct()
                .addField("long_value", TypeDescription.createLong())
                .addField("double_value", TypeDescription.createDouble())
                .addField("boolean_value", TypeDescription.createBoolean())
                .addField("string_value", TypeDescription.createString())
                .addField("decimal_value", TypeDescription.createDecimal().withScale(18))
                .addField("time_value", TypeDescription.createTimestamp())
                .addField("time_instant_value", TypeDescription.createTimestampInstant())
                .addField("date_value", TypeDescription.createDate());

數(shù)據(jù)實(shí)驗(yàn)

將工程打包成uber JAR,通過java命令執(zhí)行

??對parquet使用lzo時(shí)需要額外的配置

  1. 在使用lzo的時(shí)候需要在系統(tǒng)上安裝Lzo 2.x

    # check whether an lzo package is already installed
    [root@demo ~]# rpm -q lzo

    # install via yum
    yum install lzo

    # install via rpm (download the lzo rpm package first)
    rpm -ivh lzo-2.06-8.el7.x86_64.rpm

    # build and install from source
    # 1. build dependencies
    yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
    # extract the source
    tar -zxvf lzo-2.10.tar.gz -C ../source
    # configure and install
    cd ~/source/lzo-2.10
    ./configure --enable-shared --prefix /usr/local/lzo-2.10
    make && sudo make install
    
  2. GPLNativeCodeLoader looks for its native libraries under /native/Linux-amd64-64/lib by default, so the required libraries have to be copied into that directory.

    -rw-r--r-- 1 root root  112816 Jun 13 17:57 hadoop-lzo-0.4.20.jar
    -rw-r--r-- 1 root root  117686 Jun 13 17:17 libgplcompression.a
    -rw-r--r-- 1 root root    1157 Jun 13 17:17 libgplcompression.la
    -rwxr-xr-x 1 root root   75368 Jun 13 17:17 libgplcompression.so
    -rwxr-xr-x 1 root root   75368 Jun 13 17:17 libgplcompression.so.0
    -rwxr-xr-x 1 root root   75368 Jun 13 17:17 libgplcompression.so.0.0.0
    -rw-r--r-- 1 root root 1297096 Jun 13 17:17 libhadoop.a
    -rw-r--r-- 1 root root 1920190 Jun 13 17:17 libhadooppipes.a
    -rwxr-xr-x 1 root root  765897 Jun 13 17:17 libhadoop.so
    -rwxr-xr-x 1 root root  765897 Jun 13 17:17 libhadoop.so.1.0.0
    -rw-r--r-- 1 root root  645484 Jun 13 17:17 libhadooputils.a
    -rw-r--r-- 1 root root  438964 Jun 13 17:17 libhdfs.a
    -rwxr-xr-x 1 root root  272883 Jun 13 17:17 libhdfs.so
    -rwxr-xr-x 1 root root  272883 Jun 13 17:17 libhdfs.so.0.0.0
    -rw-r--r-- 1 root root  290550 Jun 13 17:17 liblzo2.a
    -rw-r--r-- 1 root root     929 Jun 13 17:17 liblzo2.la
    -rwxr-xr-x 1 root root  202477 Jun 13 17:17 liblzo2.so
    -rwxr-xr-x 1 root root  202477 Jun 13 17:17 liblzo2.so.2
    -rwxr-xr-x 1 root root  202477 Jun 13 17:17 liblzo2.so.2.0.0
    -rw-r--r-- 1 root root  246605 Jun 13 17:17 libsigar-amd64-linux.so
    
  3. When running java you have to set java.library.path manually and put hadoop-lzo-0.4.20.jar on the classpath (I did not find a way to bundle it into the project's uber JAR). See the appendix for building hadoop-lzo.
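
A quick way to confirm that the JVM can actually see the native library before running the benchmark is to load it explicitly. The snippet below is a minimal, hypothetical check (CheckLzoNative is not part of file-compress.jar), assuming the JVM is started with -Djava.library.path=/native/Linux-amd64-64/lib:

    public class CheckLzoNative {
        public static void main(String[] args) {
            System.out.println("java.library.path = " + System.getProperty("java.library.path"));
            // throws UnsatisfiedLinkError if libgplcompression.so cannot be found on java.library.path
            System.loadLibrary("gplcompression");
            System.out.println("libgplcompression loaded");
        }
    }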

 # command format
 java -cp file-compress.jar com.donny.orc.ReadWriterOrc {record count} {codec}
 # ORC, uncompressed
 java -cp file-compress.jar com.donny.orc.ReadWriterOrc 10000 none
 # ORC with lzo
 java -cp file-compress.jar com.donny.orc.ReadWriterOrc 10000 lzo
 # ORC with lz4
 java -cp file-compress.jar com.donny.orc.ReadWriterOrc 10000 lz4
 # ORC with zstd
 java -cp file-compress.jar com.donny.orc.ReadWriterOrc 10000 zstd
 # ORC with snappy
 java -cp file-compress.jar com.donny.orc.ReadWriterOrc 10000 snappy
 
 # Parquet, uncompressed
 java -cp file-compress.jar com.donny.parquet.NativeParquet 10000 none
 # Parquet with lzo
 java -Djava.library.path=/native/Linux-amd64-64/lib -cp file-compress.jar:hadoop-lzo-0.4.20.jar com.donny.parquet.NativeParquet 300000000 lzo
 # Parquet with lz4 (lz4_raw)
 java -cp file-compress.jar com.donny.parquet.NativeParquet 10000 lz4_raw
 # Parquet with zstd
 java -cp file-compress.jar com.donny.parquet.NativeParquet 10000 zstd
 # Parquet with snappy
 java -cp file-compress.jar com.donny.parquet.NativeParquet 10000 snappy

Compression results

[The compression result tables were published as images in the original post.]

Query analysis

Read/query performance of the files
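
The read side can be timed with the same reader classes used in the source code below. The following is a minimal, hypothetical harness (ScanTiming is not part of file-compress.jar), assuming demo.orc and demo.parquet already exist in the working directory; the ORC side reuses readFromOrcFile, which also applies its sample predicate on long_value:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    import java.io.IOException;

    public class ScanTiming {
        public static void main(String[] args) throws IOException {
            // ORC: scan demo.orc through readFromOrcFile (includes the long_value predicate)
            long t1 = System.currentTimeMillis();
            com.donny.orc.ReadWriterOrc.readFromOrcFile();
            System.out.println("ORC scan took " + (System.currentTimeMillis() - t1) + " ms");

            // Parquet: full scan of demo.parquet with the example Group reader
            long t2 = System.currentTimeMillis();
            long rows = 0;
            try (ParquetReader<Group> reader = ParquetReader
                    .builder(new GroupReadSupport(), new Path(com.donny.parquet.NativeParquet.path))
                    .withConf(new Configuration())
                    .build()) {
                while (reader.read() != null) {
                    rows++;
                }
            }
            System.out.println("Parquet scan took " + (System.currentTimeMillis() - t2) + " ms, rows=" + rows);
        }
    }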

File format recommendations

In data warehouse and data lake scenarios, data is usually stored in the following layered structure:
[Layered storage architecture diagram: image in the original post]

  • Staging (source-aligned) layer: data is extracted from the source systems as-is, mostly as text, and has to be preserved unchanged. It does not change afterwards and is rarely read again after the initial cleansing, so ORC files with Zstandard compression keep the storage footprint as small as possible.

  • Processing/aggregation layer: this is where the warehouse cleans and normalizes data, for example removing nulls, dirty records, and outliers. ORC supports the ACID requirements of this stage well, and LZ4 compression gives the best cost/performance trade-off here.

  • Application layer: data here serves analytics and data mining, for example the usual reporting tables, and is ready for direct external consumption. The data tends to have a certain degree of structure, and Parquet handles complex nested structures better than ORC, so this layer generally uses Parquet. Since the data changes little, Zstandard compression is a good fit.

    Main factors to consider:

    • how often the data changes
    • structural complexity of the data
    • read/write efficiency
    • compression ratio

Appendix

Building hadoop-lzo

Prerequisites
  • JDK 1.8+ installed
  • Maven installed
  • the LZO library installed on the OS
  • source tarball downloaded from https://github.com/twitter/hadoop-lzo/releases/tag/release-0.4.20
# extract the source tarball
tar -zxvf hadoop-lzo-0.4.20.tar.gz -C /opt/software/hadoop-lzo/;
# rename the extracted directory
mv hadoop-lzo-release-0.4.20 hadoop-lzo-0.4.20;
# enter the project directory
cd /opt/software/hadoop-lzo/hadoop-lzo-0.4.20;
# build
mvn clean package

The Hadoop version can be adapted by editing the root module's pom.xml; for stock open-source Hadoop this is usually not necessary.

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <!-- <hadoop.current.version>2.6.4</hadoop.current.version>-->
    <hadoop.current.version>2.9.2</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
Errors encountered during the build:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (build-native-non-win) on project hadoop-lzo: An Ant BuildException has occured: exec returned: 1
[ERROR] around Ant part ...<exec failonerror="true" dir="${build.native}" executable="sh">... @ 16:66 in /opt/software/hadoop-lzo/hadoop-lzo-0.4.20/target/antrun/build-build-native-non-win.xml
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

This was resolved by setting the JAVA_HOME environment variable.

結(jié)果文件
  • target/hadoop-lzo-0.4.20.jar
  • target/native/Linux-amd64-64/lib下的文件

file-compress.jar source code

ReadWriterOrc class
package com.donny.orc;


import com.donny.base.utils.FileUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.ql.exec.vector.*;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
import org.apache.orc.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

/**
 * <dependency>
 * <groupId>org.apache.orc</groupId>
 * <artifactId>orc-core</artifactId>
 * <version>1.8.3</version>
 * </dependency>
 *
 * <dependency>
 * <groupId>org.apache.hadoop</groupId>
 * <artifactId>hadoop-client</artifactId>
 * <version>2.9.2</version>
 * </dependency>
 *
 * <dependency>
 * <groupId>org.lz4</groupId>
 * <artifactId>lz4-java</artifactId>
 * <version>1.8.0</version>
 * </dependency>
 *
 * @author 1792998761@qq.com
 * @description
 * @date 2023/6/8
 */
public class ReadWriterOrc {

    private static final Logger LOG = LoggerFactory.getLogger(ReadWriterOrc.class);
    public static String path = System.getProperty("user.dir") + File.separator + "demo.orc";
    public static CompressionKind codecName;
    static int records;

    public static void main(String[] args) throws IOException {
        // number of records to write
        String recordNum = args[0];
        records = Integer.parseInt(recordNum);
        if (records < 10000 || records > 300000000) {
            LOG.error("The record count must be between 10000 and 300000000");
            return;
        }
        // compression codec
        String compressionCodecName = args[1];
        switch (compressionCodecName.toLowerCase()) {
            case "none":
                codecName = CompressionKind.NONE;
                break;
            case "lzo":
                codecName = CompressionKind.LZO;
                break;
            case "lz4":
                codecName = CompressionKind.LZ4;
                break;
            case "zstd":
                codecName = CompressionKind.ZSTD;
                break;
            case "snappy":
                codecName = CompressionKind.SNAPPY;
                break;
            default:
                LOG.error("Supported codecs: none, lzo, lz4, zstd, snappy");
                return;
        }

        long t1 = System.currentTimeMillis();
        writerToOrcFile();
        long duration = System.currentTimeMillis() - t1;

        String fileSize = "";
        File afterFile = new File(path);
        if (afterFile.exists() && afterFile.isFile()) {
            fileSize = FileUtil.fileSizeByteConversion(afterFile.length(), 2);
        }
        LOG.info("Using the {} compression algorithm to write {} pieces of data takes time: {}s, file size is {}.",
                compressionCodecName, recordNum, (duration / 1000), fileSize);
    }

    public static void readFromOrcFile() throws IOException {
        Configuration conf = new Configuration();

        TypeDescription readSchema = TypeDescription.createStruct()
                .addField("long_value", TypeDescription.createLong())
                .addField("double_value", TypeDescription.createDouble())
                .addField("boolean_value", TypeDescription.createBoolean())
                .addField("string_value", TypeDescription.createString())
                .addField("decimal_value", TypeDescription.createDecimal().withScale(18))
                .addField("time_value", TypeDescription.createTimestamp())
                .addField("time_instant_value", TypeDescription.createTimestampInstant())
                .addField("date_value", TypeDescription.createDate());


        Reader reader = OrcFile.createReader(new Path(path),
                OrcFile.readerOptions(conf));
        OrcFile.WriterVersion writerVersion = reader.getWriterVersion();
        System.out.println("writerVersion=" + writerVersion);
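        // predicate pushdown: only stripes/row groups whose long_value range overlaps [0, 1024] need to be read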
        Reader.Options readerOptions = new Reader.Options()
                .searchArgument(
                        SearchArgumentFactory
                                .newBuilder()
                                .between("long_value", PredicateLeaf.Type.LONG, 0L, 1024L)
                                .build(),
                        new String[]{"long_value"}
                );

        RecordReader rows = reader.rows(readerOptions.schema(readSchema));

        VectorizedRowBatch batch = readSchema.createRowBatch();
        int count = 0;
        while (rows.nextBatch(batch)) {
            LongColumnVector longVector = (LongColumnVector) batch.cols[0];
            DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[1];
            LongColumnVector booleanVector = (LongColumnVector) batch.cols[2];
            BytesColumnVector stringVector = (BytesColumnVector) batch.cols[3];
            DecimalColumnVector decimalVector = (DecimalColumnVector) batch.cols[4];
            TimestampColumnVector dateVector = (TimestampColumnVector) batch.cols[5];
            TimestampColumnVector timestampVector = (TimestampColumnVector) batch.cols[6];
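            // cols[5] is time_value, cols[6] is time_instant_value; date_value (cols[7]) is not read here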
            count++;
            if (count == 1) {
                for (int r = 0; r < batch.size; r++) {
                    long longValue = longVector.vector[r];
                    double doubleValue = doubleVector.vector[r];
                    boolean boolValue = booleanVector.vector[r] != 0;
                    String stringValue = stringVector.toString(r);
                    HiveDecimalWritable hiveDecimalWritable = decimalVector.vector[r];
                    long time1 = dateVector.getTime(r);
                    Date date = new Date(time1);
                    String format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date);
                    long time = timestampVector.time[r];
                    int nano = timestampVector.nanos[r];
                    Timestamp timestamp = new Timestamp(time);
                    timestamp.setNanos(nano);
                    System.out.println(longValue + ", " + doubleValue + ", " + boolValue + ", " + stringValue + ", " + hiveDecimalWritable.getHiveDecimal().toFormatString(18) + ", " + format + ", " + timestamp);

                }
            }

        }
        System.out.println("count=" + count);
        rows.close();
    }


    public static void writerToOrcFile() throws IOException {

        Configuration configuration = new Configuration();
        configuration.set("orc.overwrite.output.file", "true");
        TypeDescription schema = TypeDescription.createStruct()
                .addField("long_value", TypeDescription.createLong())
                .addField("double_value", TypeDescription.createDouble())
                .addField("boolean_value", TypeDescription.createBoolean())
                .addField("string_value", TypeDescription.createString())
                .addField("decimal_value", TypeDescription.createDecimal().withScale(18))
                .addField("time_value", TypeDescription.createTimestamp())
                .addField("time_instant_value", TypeDescription.createTimestampInstant())
                .addField("date_value", TypeDescription.createDate());

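        // 64 MB stripes, 64 KB compression buffer, 128 MB HDFS block size, row index every 10,000 rows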
        Writer writer = OrcFile.createWriter(new Path(path),
                OrcFile.writerOptions(configuration)
                        .setSchema(schema)
                        .stripeSize(67108864)
                        .bufferSize(64 * 1024)
                        .blockSize(128 * 1024 * 1024)
                        .rowIndexStride(10000)
                        .blockPadding(true)
                        .compress(codecName));

        // create a row batch from the schema with the default capacity of 1024 rows
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector longVector = (LongColumnVector) batch.cols[0];
        DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[1];
        LongColumnVector booleanVector = (LongColumnVector) batch.cols[2];
        BytesColumnVector stringVector = (BytesColumnVector) batch.cols[3];
        DecimalColumnVector decimalVector = (DecimalColumnVector) batch.cols[4];
        TimestampColumnVector dateVector = (TimestampColumnVector) batch.cols[5];
        TimestampColumnVector timestampVector = (TimestampColumnVector) batch.cols[6];
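        // note: date_value (batch.cols[7]) is never populated below, so every row is written with that column's default value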
        for (int r = 0; r < records; ++r) {
            int row = batch.size++;
            longVector.vector[row] = r;
            doubleVector.vector[row] = r;
            booleanVector.vector[row] = r % 2;
            stringVector.setVal(row, UUID.randomUUID().toString().getBytes());
            BigDecimal bigDecimal = BigDecimal.valueOf((double) r / 3).setScale(18, RoundingMode.DOWN);
            HiveDecimal hiveDecimal = HiveDecimal.create(bigDecimal).setScale(18);
            decimalVector.set(row, hiveDecimal);
            long time = new Date().getTime();
            Timestamp timestamp = new Timestamp(time);
            dateVector.set(row, timestamp);
            timestampVector.set(row, timestamp);

            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch);
            batch.reset();
        }
        writer.close();
    }
}
NativeParquet class
package com.donny.parquet;

import com.donny.base.utils.FileUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Date;
import java.util.Random;
import java.util.UUID;

/**
 * <dependency>
 * <groupId>org.lz4</groupId>
 * <artifactId>lz4-java</artifactId>
 * <version>1.8.0</version>
 * </dependency>
 *
 * <dependency>
 * <groupId>org.apache.hadoop</groupId>
 * <artifactId>hadoop-client</artifactId>
 * <version>2.9.2</version>
 * </dependency>
 *
 * <dependency>
 * <groupId>org.apache.parquet</groupId>
 * <artifactId>parquet-avro</artifactId>
 * <version>1.13.1</version>
 * </dependency>
 *
 * <dependency>
 * <groupId>org.apache.avro</groupId>
 * <artifactId>avro</artifactId>
 * <version>1.11.1</version>
 * </dependency>
 *
 * @author 1792998761@qq.com
 * @description
 * @date 2023/6/12
 */
public class NativeParquet {
    private static final Logger LOG = LoggerFactory.getLogger(NativeParquet.class);

    public static String path = System.getProperty("user.dir") + File.separator + "demo.parquet";

    public static void main(String[] args) throws IOException {
        // number of records to write
        String recordNum = args[0];
        int records = Integer.parseInt(recordNum);
        if (records < 10000 || records > 300000000) {
            LOG.error("The record count must be between 10000 and 300000000");
            return;
        }
        // compression codec
        String compressionCodecName = args[1];
        CompressionCodecName codecName;
        switch (compressionCodecName.toLowerCase()) {
            case "none":
                codecName = CompressionCodecName.UNCOMPRESSED;
                break;
            case "lzo":
                codecName = CompressionCodecName.LZO;
                break;
            case "lz4":
                codecName = CompressionCodecName.LZ4;
                break;
            case "lz4_raw":
                codecName = CompressionCodecName.LZ4_RAW;
                break;
            case "zstd":
                codecName = CompressionCodecName.ZSTD;
                break;
            case "snappy":
                codecName = CompressionCodecName.SNAPPY;
                break;
            default:
                LOG.error("Supported codecs: none, lzo, lz4, lz4_raw, zstd, snappy");
                return;
        }
        long t1 = System.currentTimeMillis();

        MessageType schema = MessageTypeParser.parseMessageType("message schema {\n" +
                " required INT64 long_value;\n" +
                " required double double_value;\n" +
                " required boolean boolean_value;\n" +
                " required binary string_value (UTF8);\n" +
                " required binary decimal_value (DECIMAL(32,18));\n" +
                " required INT64 time_value;\n" +
                " required INT64 time_instant_value;\n" +
                " required INT64 date_value;\n" +
                "}");

        GroupFactory factory = new SimpleGroupFactory(schema);


        Path dataFile = new Path(path);

        Configuration configuration = new Configuration();
        GroupWriteSupport.setSchema(schema, configuration);
        GroupWriteSupport writeSupport = new GroupWriteSupport();

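        // parquet-mr defaults: 128 MB row groups, 1 MB pages, dictionary encoding on, write-side validation off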
        ParquetWriter<Group> writer = new ParquetWriter<>(
                dataFile,
                ParquetFileWriter.Mode.OVERWRITE,
                writeSupport,
                codecName,
                ParquetWriter.DEFAULT_BLOCK_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE, /* dictionary page size */
                ParquetWriter.DEFAULT_IS_DICTIONARY_ENABLED,
                ParquetWriter.DEFAULT_IS_VALIDATING_ENABLED,
                ParquetProperties.WriterVersion.PARQUET_1_0,
                configuration
        );
        Group group;
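        // note: decimal_value is appended as its decimal string, not as the unscaled two's-complement bytes that the DECIMAL(32,18) annotation specifies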
        for (int i = 0; i < records; i++) {
            group = factory.newGroup();
            group.append("long_value", new Random().nextLong())
                    .append("double_value", new Random().nextDouble())
                    .append("boolean_value", new Random().nextBoolean())
                    .append("string_value", UUID.randomUUID().toString())
                    .append("decimal_value", BigDecimal.valueOf((double) i / 3).setScale(18, RoundingMode.DOWN).toString())
                    .append("time_value", new Date().getTime())
                    .append("time_instant_value", new Date().getTime())
                    .append("date_value", new Date().getTime());
            writer.write(group);
        }

        writer.close();

//        GroupReadSupport readSupport = new GroupReadSupport();
//        ParquetReader<Group> reader = new ParquetReader<>(dataFile, readSupport);
//        Group result = null;
//        while ((result = reader.read()) != null) {
//            System.out.println(result);
//        }
        long duration = System.currentTimeMillis() - t1;

        String fileSize = "";
        File afterFile = new File(path);
        if (afterFile.exists() && afterFile.isFile()) {
            fileSize = FileUtil.fileSizeByteConversion(afterFile.length(), 2);
        }
        LOG.info("Using the {} compression algorithm to write {} pieces of data takes time: {}s, file size is {}.",
                compressionCodecName, recordNum, (duration / 1000), fileSize);
    }
}
FileUtil class
package com.donny.base.utils;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;

/**
 * Utility class with file-size helpers
 *
 * @author 1792998761@qq.com
 * @date 2019/11/21 14:44
 * @since 1.0
 */
public class FileUtil {

    /**
     * Storage unit type: B
     */
    public static final int STORAGE_UNIT_TYPE_B = 0;
    /**
     * Storage unit type: KB
     */
    public static final int STORAGE_UNIT_TYPE_KB = 1;
    /**
     * Storage unit type: MB
     */
    public static final int STORAGE_UNIT_TYPE_MB = 2;
    /**
     * Storage unit type: GB
     */
    public static final int STORAGE_UNIT_TYPE_GB = 3;
    /**
     * Storage unit type: TB
     */
    public static final int STORAGE_UNIT_TYPE_TB = 4;
    /**
     * Storage unit type: PB
     */
    public static final int STORAGE_UNIT_TYPE_PB = 5;
    /**
     * Storage unit type: EB
     */
    public static final int STORAGE_UNIT_TYPE_EB = 6;
    /**
     * Storage unit type: ZB
     */
    public static final int STORAGE_UNIT_TYPE_ZB = 7;
    /**
     * Storage unit type: YB
     */
    public static final int STORAGE_UNIT_TYPE_YB = 8;
    /**
     * Storage unit type: BB
     */
    public static final int STORAGE_UNIT_TYPE_BB = 9;
    /**
     * Storage unit type: NB
     */
    public static final int STORAGE_UNIT_TYPE_NB = 10;
    /**
     * Storage unit type: DB
     */
    public static final int STORAGE_UNIT_TYPE_DB = 11;

    private FileUtil() {
        throw new IllegalStateException("Utility class");
    }

    /**
     * Convert a file size into a human-readable string.
     *
     * @param size               size in bytes
     * @param decimalPlacesScale number of decimal places to keep
     */
    public static String fileSizeByteConversion(Long size, Integer decimalPlacesScale) {
        int scale = 0;
        long fileSize = 0L;
        if (decimalPlacesScale != null && decimalPlacesScale >= 0) {
            scale = decimalPlacesScale;
        }
        if (size != null && size >= 0) {
            fileSize = size;
        }
        return sizeByteConversion(fileSize, scale, STORAGE_UNIT_TYPE_B);
    }

    /**
     * Convert a file size into a human-readable string.
     *
     * @param size               size, expressed in the given starting unit
     * @param decimalPlacesScale number of decimal places to keep
     * @param storageUnitType    starting storage unit of the input size
     */
    public static String fileSizeByteConversion(Long size, Integer decimalPlacesScale, int storageUnitType) {
        int scale = 0;
        long fileSize = 0L;
        if (decimalPlacesScale != null && decimalPlacesScale >= 0) {
            scale = decimalPlacesScale;
        }
        if (size != null && size >= 0) {
            fileSize = size;
        }
        return sizeByteConversion(fileSize, scale, storageUnitType);
    }

    private static String sizeByteConversion(long size, int decimalPlacesScale, int storageUnitType) {
        BigDecimal fileSize = new BigDecimal(size);
        BigDecimal param = new BigDecimal(1024);
        int count = storageUnitType;
        while (fileSize.compareTo(param) > 0 && count < STORAGE_UNIT_TYPE_NB) {
            fileSize = fileSize.divide(param, decimalPlacesScale, RoundingMode.HALF_UP);
            count++;
        }
        StringBuilder dd = new StringBuilder();
        int s = decimalPlacesScale;
        dd.append("0");
        if (s > 0) {
            dd.append(".");
        }
        while (s > 0) {
            dd.append("0");
            s = s - 1;
        }
        DecimalFormat df = new DecimalFormat(dd.toString());
        String result = df.format(fileSize) + "";
        switch (count) {
            case STORAGE_UNIT_TYPE_B:
                result += "B";
                break;
            case STORAGE_UNIT_TYPE_KB:
                result += "KB";
                break;
            case STORAGE_UNIT_TYPE_MB:
                result += "MB";
                break;
            case STORAGE_UNIT_TYPE_GB:
                result += "GB";
                break;
            case STORAGE_UNIT_TYPE_TB:
                result += "TB";
                break;
            case STORAGE_UNIT_TYPE_PB:
                result += "PB";
                break;
            case STORAGE_UNIT_TYPE_EB:
                result += "EB";
                break;
            case STORAGE_UNIT_TYPE_ZB:
                result += "ZB";
                break;
            case STORAGE_UNIT_TYPE_YB:
                result += "YB";
                break;
            case STORAGE_UNIT_TYPE_DB:
                result += "DB";
                break;
            case STORAGE_UNIT_TYPE_NB:
                result += "NB";
                break;
            case STORAGE_UNIT_TYPE_BB:
                result += "BB";
                break;
            default:
                break;
        }
        return result;
    }
}
