


Random Forest Classifier On Malware

(Copyright 2020 by YI SHA. If you want to re-post this, please send me an email: shayi1983end@gmail.com)


Overview



In this tutorial, I'll show you how to use the random forest classifier machine learning algorithm to detect malware with the Python programming language.

The traditional yet obsolete signature-based and heuristic approaches used by the majority of anti-virus software are no longer adequate for detecting the huge number of malware variants emerging nowadays. For these billions of variants, we need a fast, automatic, and accurate way to judge whether an unknown software binary is malicious or benign.


The Python sklearn library provides a RandomForestClassifier class that does this job excellently. Note that the simplest way of using the random forest algorithm is in a dichotomy scenario: classifying an unknown object into one of two possible categories. This means any task that involves a dichotomy, not merely malware-versus-benignware identification, can take advantage of a random forest classifier.


So let's get into our topic. From a high-level perspective, I'll extract every printable string of at least five characters from the two training datasets, malware and benignware respectively; then compress these data using the hashing trick to save memory and boost analysis speed; then use the compressed data, along with a label vector, to train our random forest classifier so that it forms a general concept of what malware and benignware look like. Finally, I'll pass a completely previously-unseen Windows PE binary file to this classifier and let it make a prediction. The resulting value is the probability that the binary is malicious, which can be fed to other component logic inside an anti-virus.

(Don't worry too much about the aforementioned terminology; I will explain it as I walk you through the code line by line.)


Implementation and Execution

We import the first three prerequisite Python libraries and classes:

- re (regular expressions);

- numpy;

- the FeatureHasher class (performs string hashing):
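The import figure does not survive in this copy; the following is a minimal sketch of the three prerequisites. The 20,000-feature hash width is an assumption taken from the array sizes quoted later in the article.

```python
import re  # regular expressions, used later to extract printable strings

import numpy as np
from sklearn.feature_extraction import FeatureHasher

# A FeatureHasher maps arbitrary string features into a fixed-width vector;
# 20,000 dimensions matches the 256-out-of-20000 figure quoted later.
hasher = FeatureHasher(n_features=20000)
```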



The definition of the function get_string_features() is shown in the following figures. It takes an absolute file path as its first argument and an instance of the FeatureHasher class as its second argument.

The "front-end" of this function opens the PE binary file specified by the caller, uses a regular expression to perform text matching on that file, and returns all matched strings in a list (the variable strings):
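Since the code figure is missing here, this is a hedged sketch of the front-end. The helper name extract_strings and the five-character minimum are assumptions based on the description above.

```python
import re
import tempfile

def extract_strings(fullpath):
    """'Front-end': read a binary file and return every printable ASCII
    string of at least five characters."""
    with open(fullpath, "rb") as f:
        data = f.read()
    # \x20-\x7e covers the printable ASCII range; {5,} enforces the minimum length.
    return [s.decode("ascii") for s in re.findall(rb"[\x20-\x7e]{5,}", data)]

# Demo: write a tiny fake binary and extract its candidate strings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01hello world\x02hi\x03KERNEL32.dll\x00")
    demo_path = tmp.name

demo_strings = extract_strings(demo_path)  # ["hello world", "KERNEL32.dll"]
```

Note that "hi" is dropped because it is shorter than five characters, exactly the filtering behavior the text describes.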


For example, if we extract strings from a malware binary using the above code snippet, the findall() method returns a list containing all candidate strings:


The "back-end" of this function iterates over the strings list, using every string as a key and 1 as its corresponding value to build a feature dictionary, indicating that the string exists within the binary. It then uses the transform() method of the FeatureHasher class to compress this dictionary; after that, it densifies the resulting sparse matrix, converts it to a standard numpy array, and returns the first element to the caller:
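A minimal sketch of the back-end, following the steps just described; the helper name hash_string_features and the example strings are mine, and the 20,000-feature width is an assumption carried over from earlier.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

def hash_string_features(strings, hasher):
    """'Back-end': turn a list of extracted strings into one fixed-width
    hashed feature vector."""
    # Presence dictionary: every string maps to 1 ("this string occurs").
    string_features = {s: 1 for s in strings}
    # transform() expects an iterable of dicts and yields a sparse matrix.
    hashed_features = hasher.transform([string_features])
    # Densify, convert to a plain ndarray, and return the single row.
    return np.asarray(hashed_features.todense())[0]

hasher = FeatureHasher(n_features=20000)
vector = hash_string_features(["KERNEL32.dll", "IsDebuggerPresent"], hasher)
```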


To make this point clearer, I did a small experiment to show the internal workings of that code chunk:



As you can see from the above figure, compared to the original list we used to store the raw strings, this function returns a very large numpy array, but most of its entries are zero; only 256 out of 20,000 (about 1%) are occupied by 1.


Next, I acquire the full absolute path of every file in the two given training dataset directories using the following piece of code:

Basically, this constructs two lists of complete file paths, for the malware and benignware stored on the hard disk drive respectively; the execution output is shown in the following figure:
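The directory-walking code itself is in a missing figure; a plausible sketch follows. The directory names "malware" and "benignware" are hypothetical placeholders for wherever the training samples actually live.

```python
import os

def list_binaries(directory):
    """Return the absolute path of every regular file directly under `directory`."""
    return [
        os.path.abspath(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    ]

# Hypothetical layout: two top-level folders holding the training samples.
# full_file_path_malware = list_binaries("malware")
# full_file_path_benign  = list_binaries("benignware")
```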


Now we can actually invoke get_string_features() on the full_file_path_malware and full_file_path_benign lists to extract hashed string-based features for every binary; I achieve this with a compact list comprehension. We also need the label vector mentioned earlier to tell the machine learning detector which binaries to treat as malicious and which as benign:


Following machine learning community and mathematical convention, we use a capital "X" to represent a matrix and a lower-case "y" to represent a single vector. Because get_string_features() returns a list, calling it repeatedly produces a list of lists, so "X" is a two-dimensional matrix. "y" has the same length as "X" and is labeled 1 for all the malware hashed-string lists inside "X" and 0 for all the benignware ones:
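The construction of "X" and "y" can be sketched as below. To keep the example self-contained, a stand-in helper hashes lists of strings directly instead of reading PE files from disk, and the sample strings are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=20000)

def hashed_features(strings):
    # Stand-in for get_string_features(): hashes a string list directly
    # rather than extracting it from a PE file on disk.
    return np.asarray(hasher.transform([{s: 1 for s in strings}]).todense())[0]

# Toy "binaries", each represented by its extracted strings:
malware_strings = [["CreateRemoteThread", "VirtualAllocEx"], ["IsDebuggerPresent"]]
benign_strings  = [["GetOpenFileNameW"], ["DrawTextW", "CreateWindowExW"]]

# X: one hashed feature vector per binary (a list of lists, i.e. a 2-D matrix);
# y: 1 for every malware row, 0 for every benignware row, aligned with X.
X = [hashed_features(s) for s in malware_strings] + \
    [hashed_features(s) for s in benign_strings]
y = [1] * len(malware_strings) + [0] * len(benign_strings)
```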



After data preparation and pre-processing, we use the RandomForestClassifier provided by the sklearn library to fit (or "train") this machine learning malware detector on the training data "X" and "y":
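The training call can be sketched as follows. The toy data stands in for the hashed string features, and 64 trees is an illustrative choice, not necessarily the article's exact argument (that figure does not survive in this copy).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Toy stand-ins for the hashed string features: two separable clusters.
X = np.vstack([rng.normal(0.0, 0.1, (20, 50)),   # "malware" rows
               rng.normal(1.0, 0.1, (20, 50))])  # "benignware" rows
y = np.array([1] * 20 + [0] * 20)

# Fit a forest of 64 decision trees on the labeled training data.
classifier = RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)
```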




In the final step, I extract hashed string-based features from an unknown, real-world Windows PE binary file (the launcher of a popular MMORPG client ^^), then use our classifier to probe it:


The predict_proba() method gives the probability that the binary is malicious as the second element of a list; the first element is the probability that it is benign. The two members are complementary: they add up to 100%:
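A sketch of the prediction step on toy data; the "unknown" sample here is synthetic, standing in for the game-launcher binary. Note that predict_proba() orders its columns by classifier.classes_, so looking the malicious class up by label is safer than hard-coding the second position.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 50)),   # class 1: "malware"
               rng.normal(1.0, 0.1, (20, 50))])  # class 0: "benignware"
y = np.array([1] * 20 + [0] * 20)
classifier = RandomForestClassifier(n_estimators=64, random_state=0).fit(X, y)

unknown = rng.normal(0.0, 0.1, (1, 50))          # a previously unseen sample
probabilities = classifier.predict_proba(unknown)[0]
# One probability per class, ordered by classifier.classes_; the entries
# are complementary and always sum to 1.
malicious_probability = probabilities[list(classifier.classes_).index(1)]
```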



As you can see, the sklearn library handles the heavy lifting, including randomly creating the different decision trees (so that they form a forest), running the mathematical decision process behind each tree, and taking the majority vote that determines whether the unknown binary is malicious.

So leveraging its merits to solve an artificial-intelligence-related problem requires only a few lines of code.


By carefully examining the output above, you may wonder why this customized machine learning detector treats a legitimate online game client as malware.

Several reasons can explain this seemingly "false positive" phenomenon; for example, strings related to anti-debugging and anti-reverse-engineering techniques, which are also frequently used by malware authors, may appear within the launcher. More importantly, we can change the threshold value defined in our if clause as a simple way to reduce false positives and increase detection accuracy:
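The thresholding idea can be sketched as a tiny helper; 0.7 is an illustrative value, not the article's exact figure.

```python
def verdict(malicious_probability, threshold=0.7):
    """Map the classifier's probability output to a final verdict.

    Raising `threshold` reduces false positives at the cost of letting
    more borderline malware through; 0.7 here is purely illustrative.
    """
    if malicious_probability >= threshold:
        return "malicious"
    return "benign"
```

For example, a launcher scored at 0.6 would be flagged under a 0.5 threshold but cleared under 0.7.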



Evaluate Performance

To further evaluate the accuracy of this machine learning detector, we can set up an optional experimental procedure called "cross-validation", which involves these steps:

① Randomly divide our training data into subsets: several training sets, used to train the classifier, and a test set, which plays the role of a previously unseen set of binaries used to test it;

② Let the classifier make probability predictions, i.e. maliciousness scores; use those scores together with the test label vector (generated by also randomly dividing the original label vector into training and test portions; it represents the "official" categorization standard we know in advance) to compute the "Receiver Operating Characteristic (ROC) curve" of this detector;

The ROC curve measures the relationship between a classifier's true positive rate (TPR) and false positive rate (FPR) as the decision threshold changes; we can use the roc_curve() function from the metrics module of the sklearn library for this task.
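A minimal sketch of roc_curve() on hand-made data; the labels and scores below are invented for illustration.

```python
from sklearn.metrics import roc_curve

# Ground-truth labels for six hypothetical test binaries (1 = malware)
# and the maliciousness scores a classifier assigned to them:
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.6, 0.1]

# roc_curve sweeps the decision threshold and reports the FPR/TPR trade-off
# at each distinct threshold value.
fpr, tpr, thresholds = roc_curve(y_true, scores)
```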


③ Then we record the TPR and FPR values and plot them with the semilogx() function of the pyplot module from matplotlib, the de facto data visualization library; we then exchange the roles of the training and test subsets and repeat the above process until all subsets have been covered, which is why it is called "cross-validation";

④ Finally, we draw all the ROC curves computed during this process using a series of pyplot plotting functions and display them.


To keep you from getting confused by all the steps involved in cross-validation, the following figure shows the overall logic:


Now that you have a general concept of cross-validation, let's walk through the code:

Here, I wrap all the logic into a cv_evaluate() function that takes the "X" training dataset matrix and the "y" label vector as its first two arguments and a FeatureHasher instance as its last argument. The function imports three essential libraries and modules, converts "X" and "y" to numpy arrays, and initializes a counter variable used for the final chart plotting.


The KFold instance is actually an iterator that yields a different training/test split on each iteration. Here I specify two iteration passes and randomly separate the training and test sets by setting the third argument shuffle=True. Thus, on each iteration we get different training and test sets with which to train and test a different random forest classifier (notice that the instantiation happens inside the for loop, guaranteeing that each new classifier cannot see or remember the previous experiments and produces its outcome independently).
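The loop described above can be sketched as follows. The article targets an older sklearn whose KFold lived in the cross_validation module and took the dataset length as its first argument; current releases expose the equivalent as model_selection.KFold, used here. The toy data stands in for the hashed string features, and the matplotlib plotting is omitted to keep the sketch headless.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
# Toy separable data standing in for the hashed string features:
X = np.vstack([rng.normal(0.0, 0.2, (30, 20)), rng.normal(1.0, 0.2, (30, 20))])
y = np.array([1] * 30 + [0] * 30)

fold_curves = []
# shuffle=True randomizes which rows land in each fold, as in the article.
for train_idx, test_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Instantiated INSIDE the loop: every fold trains a fresh, independent forest.
    classifier = RandomForestClassifier(n_estimators=64, random_state=0)
    classifier.fit(X[train_idx], y[train_idx])
    malicious_column = list(classifier.classes_).index(1)
    scores = classifier.predict_proba(X[test_idx])[:, malicious_column]
    fpr, tpr, _ = roc_curve(y[test_idx], scores)
    fold_curves.append((fpr, tpr))
    # The article draws each (fpr, tpr) pair with pyplot's semilogx().
```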


The following figure demonstrates the process when I tell KFold() to perform three passes of cross-validation; as you can see clearly, a random forest classifier and a matplotlib Line2D object are generated three times:

The final figure shows the three ROC curves being drawn. We can read it as follows: at about a 1% (10^-2) false positive rate, we get an average true positive rate of at most roughly 80%; and as the detector's true positive rate climbs from 80% toward 100%, its false positive rate also increases from 1% to 100%!



Summary

In this tutorial I showed you how to extract and prepare training and test datasets, then train and test a specific malware machine learning model; you also learned how to evaluate its detection accuracy in general. What this tutorial hasn't covered, however, is how to improve its accuracy and reduce its false positive rate. To achieve that, you will need to train and test on at least tens of thousands of samples (you can get them from virustotal.com), or redesign the feature-extraction logic to include import address table (IAT) analysis of a PE file, or assembly-instruction N-gram analysis. Alternatively, you can explore other machine learning algorithms provided by sklearn, such as logistic regression and decision trees, which I will leave to you as exercises ^^



Appendix A

This section will help you understand the internal behavior of the iterator that KFold() returns.

Suppose we have a list of dictionaries storing the correspondence between movie names and their box office takings (measured in USD), in ascending order:


Now suppose we need to extract a sub-collection containing some specific films. Using traditional multiple indices fails, because a pure Python list doesn't support specifying multiple indices simultaneously:


One workaround is to use numpy's array() function to convert the whole movies-and-box-office list into an array (call it A), then also convert the indices into another array (call it B); you can then safely use B as indices into A to retrieve several members at once:
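The fancy-indexing workaround can be sketched as below; the movie names and box office figures are placeholders, since the original list is only shown in a missing figure.

```python
import numpy as np

movies_box_office = [
    {"Movie A": 100_000_000},
    {"Movie B": 250_000_000},
    {"Movie C": 500_000_000},
    {"Movie D": 750_000_000},
]
# A plain Python list rejects a list of indices:
#   movies_box_office[[0, 2]]   ->  TypeError

A = np.array(movies_box_office)   # the data, as a numpy (object) array
B = np.array([0, 2])              # the indices we want, as another array
subset = A[B]                     # "fancy indexing": several members at once
```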




This seems pretty cool, but what if we now have another requirement: to randomly divide this movies-box-office array into two parts with different elements in each?

This is where KFold() from sklearn's cross_validation module comes into play. The following code shows how easily this is accomplished with only a handful of lines:


Execution output:




The second argument of KFold() specifies the number of iteration passes; it must be less than or equal to the number of elements in the target array we want to split.

As you can see, within each iteration we divide the array into two separate parts, each with randomly chosen members. KFold() returns randomly arranged indices as two sub-arrays of indices into the parent array: in the above case, the array "np_MoviesBoxOffice" has the complete index range [0-8], while indices_A and indices_B each contain only a partial, random selection of those indices. This is why we can use them to index into the original parent array and split our training and test sets!
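The splitting behavior described above can be sketched with the current sklearn API. The article's older sklearn spelled this cross_validation.KFold(9, 2, shuffle=True); recent releases moved it to model_selection, and the nine movie entries here are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine hypothetical elements, so the complete index range is 0-8:
np_MoviesBoxOffice = np.array([f"movie_{i}" for i in range(9)])

splits = []
for indices_A, indices_B in KFold(n_splits=2, shuffle=True,
                                  random_state=0).split(np_MoviesBoxOffice):
    # Each pass yields two disjoint index sub-arrays that together cover 0-8,
    # which we use to fancy-index the parent array into two parts.
    splits.append((np_MoviesBoxOffice[indices_A], np_MoviesBoxOffice[indices_B]))
```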
