The goal of this experiment is to optimize matrix multiplication using MPI parallelism, implemented in Python.
0. Hardware
Test hardware:
- CPU: AMD Ryzen 7 5800H (3.20 GHz)
- RAM: 32 GB (3200 MHz)
1. Requirements and data
Requirement: multiply a matrix by a vector, implemented both as a single process and through the multi-process MPI interface.
The global size parameter is Scale.
Data example:
When Scale = 5, the data looks as follows:
Matrix:

$$\begin{bmatrix}2&-1&0&0&0\\-1&2&-1&0&0\\0&-1&2&-1&0\\0&0&-1&2&-1\\0&0&0&-1&2\end{bmatrix}$$

Vector:

$$\begin{bmatrix}1&2&3&1&2\end{bmatrix}$$

Note: the matrix is the sum of three shifted diagonal (eye) matrices, and the vector repeats 1, 2, 3 in order.
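As a quick arithmetic check, the first and third entries of the product $Av$ work out to:

```latex
(Av)_1 = 2\cdot 1 + (-1)\cdot 2 = 0, \qquad
(Av)_3 = (-1)\cdot 2 + 2\cdot 3 + (-1)\cdot 1 = 3
```

which matches the output `[ 0 0 3 -3 3]` computed in section 3.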
2. Data generation
generate_example_matrix adds three shifted eye-matrix masks together to build the tridiagonal matrix.
generate_example_vector repeats the pattern 1, 2, 3.
```python
import numpy as np


def generate_example_matrix(h, w):
    _vs = (-1, 2, -1)
    _i = -1  # diagonal offset of the first band
    example_data = np.zeros([h, w], dtype=np.int32)
    for i in range(3):
        # boolean mask with ones on the (_i + i)-th diagonal
        example_data_eye_mask = np.eye(h, w, _i + i, dtype=np.bool_)
        example_data[example_data_eye_mask] = _vs[i]
    return example_data


def generate_example_vector(w):
    _rest_dict = {
        1: [1],
        2: [1, 2],
        3: [1, 2, 3],
    }
    rest_bits = int(w % 3)
    repeat_times = int(w // 3)
    example_vector = np.repeat([[1, 2, 3]], repeat_times, axis=0).ravel()
    if rest_bits > 0:
        # append the partial tail when w is not a multiple of 3
        tail_vec = np.array(_rest_dict[rest_bits], dtype=np.int32)
        example_vector = np.concatenate([example_vector, tail_vec], axis=0)
    return example_vector
```
Both generator functions live in utils.py.
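The two tricks behind the generators can be checked in isolation (a standalone sketch that does not import utils.py): `np.eye(h, w, k)` puts ones on the k-th diagonal, so a weighted sum of three shifted diagonals yields the tridiagonal matrix, and the vector tiles 1, 2, 3 with a partial tail.

```python
import numpy as np

# Weighted sum of three shifted diagonals: k = -1 (sub-diagonal),
# k = 0 (main diagonal), k = +1 (super-diagonal).
m = sum(val * np.eye(5, 5, k, dtype=np.int32)
        for k, val in zip((-1, 0, 1), (-1, 2, -1)))
print(m)

# Vector for w = 7: two full repeats of 1, 2, 3 plus the tail [1].
v = np.concatenate([np.repeat([[1, 2, 3]], 7 // 3, axis=0).ravel(),
                    np.array([1], dtype=np.int32)])
print(v)  # → [1 2 3 1 2 3 1]
```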
3. Matrix computation
NumPy provides the `@` operator and `dot`, which dispatch to compiled (and internally parallelized) routines.
This experiment does not use them: NumPy's compiled kernels are so strong that in my tests, a single process using `@` was faster than the MPI version.
So, to make the effect of MPI parallelization visible, we use a relatively naive double for loop instead.
The NumPy implementation (kept for the accuracy check below):
```python
def numpy_method(example_matrix, example_vector):
    return example_matrix @ example_vector
```
The double-for-loop implementation:

```python
def naive_method(example_matrix, example_vector):
    result = []
    h, w = example_matrix.shape
    for hi in range(h):
        rowv = example_matrix[hi, :]  # one row of the matrix
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)
```
Checking that both implementations agree:

```python
h = 5
w = 5
print('--- Current matrix scale is: {} x {} ---'.format(h, w))
example_matrix = generate_example_matrix(h, w)
example_vector = generate_example_vector(w)
result = numpy_method(example_matrix, example_vector)
print(result)
result = naive_method(example_matrix, example_vector)
print(result)
```
Generated example data:
example_matrix:
[[ 2 -1 0 0 0]
[-1 2 -1 0 0]
[ 0 -1 2 -1 0]
[ 0 0 -1 2 -1]
[ 0 0 0 -1 2]]
example_vector:
[1 2 3 1 2]
Output:
--- Current matrix scale is: 5 x 5 ---
[ 0 0 3 -3 3]
[ 0 0 3 -3 3]
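To back the claim in section 3 that NumPy's compiled `@` easily beats the pure-Python loop, here is a hypothetical micro-benchmark sketch on one 500 x 500 problem (the absolute times will vary by machine, but `@` should win by a large margin):

```python
import time

import numpy as np

# Same naive double loop as above, self-contained for the benchmark.
def naive_method(example_matrix, example_vector):
    h, w = example_matrix.shape
    result = []
    for hi in range(h):
        rowv = example_matrix[hi, :]
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)

rng = np.random.default_rng(0)
mat = rng.integers(-2, 3, size=(500, 500))
vec = rng.integers(1, 4, size=500)

t0 = time.perf_counter()
r_naive = naive_method(mat, vec)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
r_numpy = mat @ vec
t_numpy = time.perf_counter() - t0

assert np.array_equal(r_naive, r_numpy)  # identical results
print(f'naive: {t_naive:.6f}s  numpy @: {t_numpy:.6f}s')
```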
4. Single-process implementation
```python
import time

import numpy as np

from utils import generate_example_matrix, generate_example_vector


def naive_method(example_matrix, example_vector):
    result = []
    h, w = example_matrix.shape
    for hi in range(h):
        rowv = example_matrix[hi, :]
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)


def single_main():
    start_time = time.perf_counter()
    h = 5  # set both to the tested Scale
    w = 5
    print('--- Current matrix scale is: {} x {} ---'.format(h, w))
    example_matrix = generate_example_matrix(h, w)
    example_vector = generate_example_vector(w)
    result = naive_method(example_matrix, example_vector)
    end_time = time.perf_counter()
    print('single method used time is: {:.6f}s\n'.format(end_time - start_time))


if __name__ == '__main__':
    single_main()
```
5. Multi-process implementation
This uses mpi4py, installed with:

```shell
pip install mpi4py-mpich
pip install mpi4py
```
mpi4py follows the MPI-2 standard, whose pickle-based messaging fails to pack serialized objects beyond roughly 3 GB.
When the matrix gets large enough (e.g. a 50000 x 50000 computation), scattering the NumPy arrays under MPI-2 overflows this buffer.
So, following the mpi4py issue https://github.com/mpi4py/mpi4py/issues/119, I modified the distribution step,
replacing this line:

input_data = comm.scatter(input_data, root=0)

with:
```python
def bug_fix(comm, attributes):
    # point-to-point replacement for comm.scatter: rank 0 keeps the first
    # chunk and sends one chunk to every other rank
    if comm.size == 1:
        (item,) = attributes
    elif comm.rank == 0:
        for i, attr in enumerate(attributes):
            if i == 0:
                item = attr
            else:
                comm.send(attr, dest=i)
    else:
        item = comm.recv(source=0)
    return item

# -----------------------
# input_data = comm.scatter(input_data, root=0)  # fails when the buffer exceeds ~3 GB
input_data = bug_fix(comm, input_data)
```
and serializing with the pkl5 module, which streams large pickles in chunks:

```python
from mpi4py.util import pkl5

# comm = MPI.COMM_WORLD                # before
comm = pkl5.Intracomm(MPI.COMM_WORLD)  # after (COMM_WORLD is an intracommunicator)
```
The complete code:

```python
import sys
import time

import numpy as np
from mpi4py import MPI
from mpi4py.util import pkl5

from utils import generate_example_matrix, generate_example_vector

# WARNING: https://github.com/mpi4py/mpi4py/issues/119
# plain mpi4py communication fails for pickles of a few GB or more,
# so wrap COMM_WORLD with the chunking pkl5 communicator
# comm = MPI.COMM_WORLD
comm = pkl5.Intracomm(MPI.COMM_WORLD)
rank = comm.Get_rank()
nprocs = comm.Get_size()


def naive_method(example_matrix, example_vector):
    result = []
    h, w = example_matrix.shape
    for hi in range(h):
        rowv = example_matrix[hi, :]
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)


def bug_fix(comm, attributes):
    # point-to-point replacement for comm.scatter (see the issue above)
    if comm.size == 1:
        (item,) = attributes
    elif comm.rank == 0:
        for i, attr in enumerate(attributes):
            if i == 0:
                item = attr
            else:
                comm.send(attr, dest=i)
    else:
        item = comm.recv(source=0)
    return item


def main():
    h, w = int(sys.argv[1]), int(sys.argv[2])
    # =========== scatter ===========
    if rank == 0:
        start_time = time.perf_counter()
        print('--- Current matrix scale is: {} x {} ---'.format(h, w))
        example_matrix = generate_example_matrix(h, w)
        example_vector = generate_example_vector(w)
        # split the rows into exactly nprocs contiguous chunks; the first
        # `res` chunks take one extra row, so every row is sent exactly once
        slice_length, res = divmod(h, nprocs)
        input_data = []
        start_index = 0
        for i in range(nprocs):
            stop_index = start_index + slice_length + (1 if i < res else 0)
            sub_example_matrix = example_matrix[start_index:stop_index, :]
            input_data.append((sub_example_matrix, example_vector))
            start_index = stop_index
    else:
        input_data = None
    # input_data = comm.scatter(input_data, root=0)  # overflows on large buffers
    input_data = bug_fix(comm, input_data)
    sub_example_matrix, sub_example_vector = input_data
    result = naive_method(sub_example_matrix, sub_example_vector)
    # =========== gather ===========
    output_data = comm.gather(result, root=0)
    if rank == 0:
        output_data = np.concatenate(output_data, axis=0)
        print(output_data)
        end_time = time.perf_counter()
        print('mpi method by process-size:{} used time is: {:.6f}s\n'.format(
            nprocs, end_time - start_time))


if __name__ == '__main__':
    main()
```
To test this script:
Save it as multiprocess_main.py and launch it through the MPI interface:

```shell
mpiexec -np [number of processes] python multiprocess_main.py [matrix H] [matrix W]
```

Example:

```shell
mpiexec -np 8 python multiprocess_main.py 5000 5000
```

This multiplies a 5000 x 5000 matrix by a 5000 x 1 vector across 8 parallel processes.
The overall flow:
- step 1: the root = 0 process partitions and distributes the work (chunk sizes follow the matrix scale)
- step 2: each process computes its rows with the double loop
- step 3: all processes' results are gathered back to the root = 0 process
- step 4: only root 0's concatenated result is kept and printed
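The row partitioning in step 1 can be sketched on its own (`partition_rows` is a name introduced here for illustration; it mirrors the chunking done on rank 0):

```python
# Split h rows into nprocs contiguous (start, stop) chunks; the first
# `res` chunks take one extra row, so every row is assigned exactly once.
def partition_rows(h, nprocs):
    slice_length, res = divmod(h, nprocs)
    bounds = []
    start = 0
    for i in range(nprocs):
        stop = start + slice_length + (1 if i < res else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

print(partition_rows(10, 3))  # → [(0, 4), (4, 7), (7, 10)]
```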
6. Results
Five test scales were used, each 10x the previous:
scale = 5, 50, 500, 5000, 50000
Process counts for the parallel runs: 5, 10, 16, 20, 25
On strong and weak scalability:
- Strong scaling: adding processes or threads maintains a fixed efficiency without increasing the problem size
- Weak scaling: adding processes or threads maintains efficiency only if the problem size grows at the same rate
Strong vs. weak scaling probes the trade-off between parallel overhead and parallel payoff.
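These definitions are usually quantified through speedup and parallel efficiency: for serial time $T_1$ and time $T_p$ on $p$ processes,

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
```

Strong scaling keeps $E(p)$ roughly constant at a fixed problem size; weak scaling keeps it constant only when the problem grows with $p$.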
Single-process results:
--- Current matrix scale is: 5 x 5 ---
single method used time is: 0.000268s
--- Current matrix scale is: 50 x 50 ---
single method used time is: 0.000843s
--- Current matrix scale is: 500 x 500 ---
single method used time is: 0.045645s
--- Current matrix scale is: 5000 x 5000 ---
single method used time is: 5.010791s
--- Current matrix scale is: 50000 x 50000 ---
single method used time is: 498.076443s
Analysis: at scale 5000 the single-process runtime is still acceptable, but one more order of magnitude makes it explode, taking nearly 500 seconds to finish.
Multi-process results:
5 processes, scale 5
--- Current matrix scale is: 5 x 5 ---
mpi method by process-size:5 used time is: 0.001074s
5 processes, scale 50
--- Current matrix scale is: 50 x 50 ---
mpi method by process-size:5 used time is: 0.001124s
5 processes, scale 500
--- Current matrix scale is: 500 x 500 ---
mpi method by process-size:5 used time is: 0.011929s
5 processes, scale 5000
--- Current matrix scale is: 5000 x 5000 ---
mpi method by process-size:5 used time is: 1.182357s
5 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:5 used time is: 126.274598s
10 processes, scale 50
--- Current matrix scale is: 50 x 50 ---
mpi method by process-size:10 used time is: 0.007400s
10 processes, scale 500
--- Current matrix scale is: 500 x 500 ---
mpi method by process-size:10 used time is: 0.016241s
10 processes, scale 5000
--- Current matrix scale is: 5000 x 5000 ---
mpi method by process-size:10 used time is: 0.802658s
10 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:10 used time is: 84.558449s
16 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:16 used time is: 70.913076s
20 processes, scale 5000
--- Current matrix scale is: 5000 x 5000 ---
mpi method by process-size:20 used time is: 0.805241s
20 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:20 used time is: 73.206234s
25 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:25 used time is: 73.724819s
Summary tables:
Single process:

Scale | Time (s) |
---|---|
5 | 0.00026 |
50 | 0.00084 |
500 | 0.04564 |
5000 | 5.0107 |
50000 | 498.07644 |

Parallel processes: 5

Scale | Time (s) |
---|---|
5 | 0.00107 |
50 | 0.00112 |
500 | 0.01192 |
5000 | 1.18235 |
50000 | 126.27459 |

Parallel processes: 10

Scale | Time (s) |
---|---|
50 | 0.00740 |
500 | 0.01624 |
5000 | 0.80265 |
50000 | 84.55844 |

Parallel processes: 16

Scale | Time (s) |
---|---|
50000 | 70.91307 |

Parallel processes: 20

Scale | Time (s) |
---|---|
5000 | 0.80524 |
50000 | 73.20623 |

Parallel processes: 25

Scale | Time (s) |
---|---|
50000 | 73.72481 |
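The scale-50000 runs in these tables can be condensed into speedup S(p) = T1/Tp and efficiency E(p) = S(p)/p (a small calculation over the measured times above):

```python
# Measured times for scale 50000, copied from the summary tables.
t_serial = 498.076443
t_parallel = {5: 126.274598, 10: 84.558449, 16: 70.913076,
              20: 73.206234, 25: 73.724819}

speedup = {p: t_serial / t for p, t in t_parallel.items()}
efficiency = {p: s / p for p, s in speedup.items()}
for p in t_parallel:
    print(f'p={p:2d}  speedup={speedup[p]:5.2f}  efficiency={efficiency[p]:.2f}')
```

Efficiency drops from about 0.79 at 5 processes to about 0.27 at 25, which matches the stalling improvement discussed below.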
Analysis: multi-process parallelism clearly helps, but once the process count reaches a certain point (16) the improvement stops; beyond 16 processes the time even increases, meaning parallel overhead has turned into negative feedback, so a larger problem size would be needed to raise the efficiency again. Until the hardware's resource limit is reached the program exhibits strong scaling; at the limit, only weak scaling remains.