The goal of this experiment is to optimize matrix multiplication using MPI parallelism, implemented in Python.
0. Hardware
Test hardware:
- CPU: AMD Ryzen 7 5800H (3.20 GHz)
- RAM: 32 GB (3200 MHz)
1. Requirements and data
Requirement: multiply a matrix by a vector, implemented both as a single process and through the multi-process MPI interface.
The global size parameter is Scale.
Data example:
When Scale = 5, the data looks as follows:
Matrix:

$$\begin{bmatrix}2&-1&0&0&0\\-1&2&-1&0&0\\0&-1&2&-1&0\\0&0&-1&2&-1\\0&0&0&-1&2\end{bmatrix}$$

Vector:

$$\begin{bmatrix}1&2&3&1&2\end{bmatrix}$$

Note: the matrix is the sum of three shifted diagonal (eye) matrices, and the vector repeats 1, 2, 3 in order.
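As a quick arithmetic check, the first and third entries of the product $Av$ work out to:

```latex
(Av)_1 = 2\cdot 1 + (-1)\cdot 2 = 0, \qquad
(Av)_3 = (-1)\cdot 2 + 2\cdot 3 + (-1)\cdot 1 = 3
```

which matches the output `[ 0 0 3 -3 3]` computed in section 3.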
2. Data generation
generate_example_matrix adds three shifted eye-matrix masks together to build the tridiagonal matrix.
generate_example_vector repeats the pattern 1, 2, 3.
```python
import numpy as np


def generate_example_matrix(h, w):
    _vs = (-1, 2, -1)
    _i = -1  # diagonal offset of the first band
    example_data = np.zeros([h, w], dtype=np.int32)
    for i in range(3):
        # boolean mask with ones on the (_i + i)-th diagonal
        example_data_eye_mask = np.eye(h, w, _i + i, dtype=np.bool_)
        example_data[example_data_eye_mask] = _vs[i]
    return example_data


def generate_example_vector(w):
    _rest_dict = {
        1: [1],
        2: [1, 2],
        3: [1, 2, 3],
    }
    rest_bits = int(w % 3)
    repeat_times = int(w // 3)
    example_vector = np.repeat([[1, 2, 3]], repeat_times, axis=0).ravel()
    if rest_bits > 0:
        # append the partial tail when w is not a multiple of 3
        tail_vec = np.array(_rest_dict[rest_bits], dtype=np.int32)
        example_vector = np.concatenate([example_vector, tail_vec], axis=0)
    return example_vector
```
Both generator functions live in utils.py.
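The two tricks behind the generators can be checked in isolation (a standalone sketch that does not import utils.py): `np.eye(h, w, k)` puts ones on the k-th diagonal, so a weighted sum of three shifted diagonals yields the tridiagonal matrix, and the vector tiles 1, 2, 3 with a partial tail.

```python
import numpy as np

# Weighted sum of three shifted diagonals: k = -1 (sub-diagonal),
# k = 0 (main diagonal), k = +1 (super-diagonal).
m = sum(val * np.eye(5, 5, k, dtype=np.int32)
        for k, val in zip((-1, 0, 1), (-1, 2, -1)))
print(m)

# Vector for w = 7: two full repeats of 1, 2, 3 plus the tail [1].
v = np.concatenate([np.repeat([[1, 2, 3]], 7 // 3, axis=0).ravel(),
                    np.array([1], dtype=np.int32)])
print(v)  # → [1 2 3 1 2 3 1]
```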
3. Matrix computation
NumPy provides the `@` operator and `dot`, which dispatch to compiled (and internally parallelized) routines.
This experiment does not use them: NumPy's compiled kernels are so strong that in my tests, a single process using `@` was faster than the MPI version.
So, to make the effect of MPI parallelization visible, we use a relatively naive double for loop instead.
The NumPy implementation (kept for the accuracy check below):
```python
def numpy_method(example_matrix, example_vector):
    return example_matrix @ example_vector
```
The double-for-loop implementation:

```python
def naive_method(example_matrix, example_vector):
    result = []
    h, w = example_matrix.shape
    for hi in range(h):
        rowv = example_matrix[hi, :]  # one row of the matrix
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)
```
Checking that both implementations agree:

```python
h = 5
w = 5
print('--- Current matrix scale is: {} x {} ---'.format(h, w))
example_matrix = generate_example_matrix(h, w)
example_vector = generate_example_vector(w)
result = numpy_method(example_matrix, example_vector)
print(result)
result = naive_method(example_matrix, example_vector)
print(result)
```
Generated example data:
example_matrix:
[[ 2 -1 0 0 0]
[-1 2 -1 0 0]
[ 0 -1 2 -1 0]
[ 0 0 -1 2 -1]
[ 0 0 0 -1 2]]
example_vector:
[1 2 3 1 2]
Output:
--- Current matrix scale is: 5 x 5 ---
[ 0 0 3 -3 3]
[ 0 0 3 -3 3]
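To back the claim in section 3 that NumPy's compiled `@` easily beats the pure-Python loop, here is a hypothetical micro-benchmark sketch on one 500 x 500 problem (the absolute times will vary by machine, but `@` should win by a large margin):

```python
import time

import numpy as np

# Same naive double loop as above, self-contained for the benchmark.
def naive_method(example_matrix, example_vector):
    h, w = example_matrix.shape
    result = []
    for hi in range(h):
        rowv = example_matrix[hi, :]
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)

rng = np.random.default_rng(0)
mat = rng.integers(-2, 3, size=(500, 500))
vec = rng.integers(1, 4, size=500)

t0 = time.perf_counter()
r_naive = naive_method(mat, vec)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
r_numpy = mat @ vec
t_numpy = time.perf_counter() - t0

assert np.array_equal(r_naive, r_numpy)  # identical results
print(f'naive: {t_naive:.6f}s  numpy @: {t_numpy:.6f}s')
```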
4. Single-process implementation
```python
import time

import numpy as np

from utils import generate_example_matrix, generate_example_vector


def naive_method(example_matrix, example_vector):
    result = []
    h, w = example_matrix.shape
    for hi in range(h):
        rowv = example_matrix[hi, :]
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)


def single_main():
    start_time = time.perf_counter()
    h = 5  # set both to the tested Scale
    w = 5
    print('--- Current matrix scale is: {} x {} ---'.format(h, w))
    example_matrix = generate_example_matrix(h, w)
    example_vector = generate_example_vector(w)
    result = naive_method(example_matrix, example_vector)
    end_time = time.perf_counter()
    print('single method used time is: {:.6f}s\n'.format(end_time - start_time))


if __name__ == '__main__':
    single_main()
```
5. Multi-process implementation
This uses mpi4py, installed with:

```shell
pip install mpi4py-mpich
pip install mpi4py
```
mpi4py follows the MPI-2 standard, whose pickle-based messaging fails to pack serialized objects beyond roughly 3 GB.
When the matrix gets large enough (e.g. a 50000 x 50000 computation), scattering the NumPy arrays under MPI-2 overflows this buffer.
So, following the mpi4py issue https://github.com/mpi4py/mpi4py/issues/119, I modified the distribution step,
replacing this line:

input_data = comm.scatter(input_data, root=0)

with:
```python
def bug_fix(comm, attributes):
    # point-to-point replacement for comm.scatter: rank 0 keeps the first
    # chunk and sends one chunk to every other rank
    if comm.size == 1:
        (item,) = attributes
    elif comm.rank == 0:
        for i, attr in enumerate(attributes):
            if i == 0:
                item = attr
            else:
                comm.send(attr, dest=i)
    else:
        item = comm.recv(source=0)
    return item

# -----------------------
# input_data = comm.scatter(input_data, root=0)  # fails when the buffer exceeds ~3 GB
input_data = bug_fix(comm, input_data)
```
and serializing with the pkl5 module, which streams large pickles in chunks:

```python
from mpi4py.util import pkl5

# comm = MPI.COMM_WORLD                # before
comm = pkl5.Intracomm(MPI.COMM_WORLD)  # after (COMM_WORLD is an intracommunicator)
```
The complete code:

```python
import sys
import time

import numpy as np
from mpi4py import MPI
from mpi4py.util import pkl5

from utils import generate_example_matrix, generate_example_vector

# WARNING: https://github.com/mpi4py/mpi4py/issues/119
# plain mpi4py communication fails for pickles of a few GB or more,
# so wrap COMM_WORLD with the chunking pkl5 communicator
# comm = MPI.COMM_WORLD
comm = pkl5.Intracomm(MPI.COMM_WORLD)
rank = comm.Get_rank()
nprocs = comm.Get_size()


def naive_method(example_matrix, example_vector):
    result = []
    h, w = example_matrix.shape
    for hi in range(h):
        rowv = example_matrix[hi, :]
        temp = 0
        for wi in range(w):
            temp += rowv[wi] * example_vector[wi]
        result.append(temp)
    return np.array(result)


def bug_fix(comm, attributes):
    # point-to-point replacement for comm.scatter (see the issue above)
    if comm.size == 1:
        (item,) = attributes
    elif comm.rank == 0:
        for i, attr in enumerate(attributes):
            if i == 0:
                item = attr
            else:
                comm.send(attr, dest=i)
    else:
        item = comm.recv(source=0)
    return item


def main():
    h, w = int(sys.argv[1]), int(sys.argv[2])
    # =========== scatter ===========
    if rank == 0:
        start_time = time.perf_counter()
        print('--- Current matrix scale is: {} x {} ---'.format(h, w))
        example_matrix = generate_example_matrix(h, w)
        example_vector = generate_example_vector(w)
        # split the rows into exactly nprocs contiguous chunks; the first
        # `res` chunks take one extra row, so every row is sent exactly once
        slice_length, res = divmod(h, nprocs)
        input_data = []
        start_index = 0
        for i in range(nprocs):
            stop_index = start_index + slice_length + (1 if i < res else 0)
            sub_example_matrix = example_matrix[start_index:stop_index, :]
            input_data.append((sub_example_matrix, example_vector))
            start_index = stop_index
    else:
        input_data = None
    # input_data = comm.scatter(input_data, root=0)  # overflows on large buffers
    input_data = bug_fix(comm, input_data)
    sub_example_matrix, sub_example_vector = input_data
    result = naive_method(sub_example_matrix, sub_example_vector)
    # =========== gather ===========
    output_data = comm.gather(result, root=0)
    if rank == 0:
        output_data = np.concatenate(output_data, axis=0)
        print(output_data)
        end_time = time.perf_counter()
        print('mpi method by process-size:{} used time is: {:.6f}s\n'.format(
            nprocs, end_time - start_time))


if __name__ == '__main__':
    main()
```
To test this script:
Save it as multiprocess_main.py and launch it through the MPI interface:

```shell
mpiexec -np [number of processes] python multiprocess_main.py [matrix H] [matrix W]
```

Example:

```shell
mpiexec -np 8 python multiprocess_main.py 5000 5000
```

This multiplies a 5000 x 5000 matrix by a 5000 x 1 vector across 8 parallel processes.
The overall flow:
- step 1: the root = 0 process partitions and distributes the work (chunk sizes follow the matrix scale)
- step 2: each process computes its rows with the double loop
- step 3: all processes' results are gathered back to the root = 0 process
- step 4: only root 0's concatenated result is kept and printed
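The row partitioning in step 1 can be sketched on its own (`partition_rows` is a name introduced here for illustration; it mirrors the chunking done on rank 0):

```python
# Split h rows into nprocs contiguous (start, stop) chunks; the first
# `res` chunks take one extra row, so every row is assigned exactly once.
def partition_rows(h, nprocs):
    slice_length, res = divmod(h, nprocs)
    bounds = []
    start = 0
    for i in range(nprocs):
        stop = start + slice_length + (1 if i < res else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

print(partition_rows(10, 3))  # → [(0, 4), (4, 7), (7, 10)]
```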
6. Results
Five test scales were used, each 10x the previous:
scale = 5, 50, 500, 5000, 50000
Process counts for the parallel runs: 5, 10, 16, 20, 25
On strong and weak scalability:
- Strong scaling: adding processes or threads maintains a fixed efficiency without increasing the problem size
- Weak scaling: adding processes or threads maintains efficiency only if the problem size grows at the same rate
Strong vs. weak scaling probes the trade-off between parallel overhead and parallel payoff.
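These definitions are usually quantified through speedup and parallel efficiency: for serial time $T_1$ and time $T_p$ on $p$ processes,

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
```

Strong scaling keeps $E(p)$ roughly constant at a fixed problem size; weak scaling keeps it constant only when the problem grows with $p$.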
Single-process results:
--- Current matrix scale is: 5 x 5 ---
single method used time is: 0.000268s
--- Current matrix scale is: 50 x 50 ---
single method used time is: 0.000843s
--- Current matrix scale is: 500 x 500 ---
single method used time is: 0.045645s
--- Current matrix scale is: 5000 x 5000 ---
single method used time is: 5.010791s
--- Current matrix scale is: 50000 x 50000 ---
single method used time is: 498.076443s
Analysis: at scale 5000 the single-process runtime is still acceptable, but one more order of magnitude makes it explode, taking nearly 500 seconds to finish.
Multi-process results:
5 processes, scale 5
--- Current matrix scale is: 5 x 5 ---
mpi method by process-size:5 used time is: 0.001074s
5 processes, scale 50
--- Current matrix scale is: 50 x 50 ---
mpi method by process-size:5 used time is: 0.001124s
5 processes, scale 500
--- Current matrix scale is: 500 x 500 ---
mpi method by process-size:5 used time is: 0.011929s
5 processes, scale 5000
--- Current matrix scale is: 5000 x 5000 ---
mpi method by process-size:5 used time is: 1.182357s
5 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:5 used time is: 126.274598s
10 processes, scale 50
--- Current matrix scale is: 50 x 50 ---
mpi method by process-size:10 used time is: 0.007400s
10 processes, scale 500
--- Current matrix scale is: 500 x 500 ---
mpi method by process-size:10 used time is: 0.016241s
10 processes, scale 5000
--- Current matrix scale is: 5000 x 5000 ---
mpi method by process-size:10 used time is: 0.802658s
10 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:10 used time is: 84.558449s
16 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:16 used time is: 70.913076s
20 processes, scale 5000
--- Current matrix scale is: 5000 x 5000 ---
mpi method by process-size:20 used time is: 0.805241s
20 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:20 used time is: 73.206234s
25 processes, scale 50000
--- Current matrix scale is: 50000 x 50000 ---
mpi method by process-size:25 used time is: 73.724819s
Summary tables:
Single process:

Scale | Time (s) |
---|---|
5 | 0.00026 |
50 | 0.00084 |
500 | 0.04564 |
5000 | 5.0107 |
50000 | 498.07644 |

Parallel processes: 5

Scale | Time (s) |
---|---|
5 | 0.00107 |
50 | 0.00112 |
500 | 0.01192 |
5000 | 1.18235 |
50000 | 126.27459 |

Parallel processes: 10

Scale | Time (s) |
---|---|
50 | 0.00740 |
500 | 0.01624 |
5000 | 0.80265 |
50000 | 84.55844 |

Parallel processes: 16

Scale | Time (s) |
---|---|
50000 | 70.91307 |

Parallel processes: 20

Scale | Time (s) |
---|---|
5000 | 0.80524 |
50000 | 73.20623 |

Parallel processes: 25

Scale | Time (s) |
---|---|
50000 | 73.72481 |
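The scale-50000 runs in these tables can be condensed into speedup S(p) = T1/Tp and efficiency E(p) = S(p)/p (a small calculation over the measured times above):

```python
# Measured times for scale 50000, copied from the summary tables.
t_serial = 498.076443
t_parallel = {5: 126.274598, 10: 84.558449, 16: 70.913076,
              20: 73.206234, 25: 73.724819}

speedup = {p: t_serial / t for p, t in t_parallel.items()}
efficiency = {p: s / p for p, s in speedup.items()}
for p in t_parallel:
    print(f'p={p:2d}  speedup={speedup[p]:5.2f}  efficiency={efficiency[p]:.2f}')
```

Efficiency drops from about 0.79 at 5 processes to about 0.27 at 25, which matches the stalling improvement discussed below.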
Analysis: multi-process parallelism clearly helps, but once the process count reaches a certain point (16) the improvement stops; beyond 16 processes the time even increases, meaning parallel overhead has turned into negative feedback, so a larger problem size would be needed to raise the efficiency again. Until the hardware's resource limit is reached the program exhibits strong scaling; at the limit, only weak scaling remains.