Deep Learning Basics: Perceptrons and How Neural Networks Learn


Reference books (send me a private message if you cannot find them):
《深度学习入门：基于Python的理论与实现》 (Deep Learning from Scratch), Koki Saito
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition", Aurélien Géron

The difference between machine learning and deep learning:
[Figure: machine learning vs. deep learning]

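Structure of a neural network

Perceptron

A perceptron is a component that receives several input signals and outputs a single signal. Each input is multiplied by its own fixed weight on the way to the neuron, and the neuron sums the weighted signals. It outputs 1 only when that sum exceeds a certain limit, in which case the neuron is said to be "activated". The limit is called the threshold. (The threshold can be moved to the left-hand side of the inequality, so that one only needs to compare the sum minus the threshold against 0.)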

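A perceptron is built from the TLU (Threshold Logic Unit), shown in the figure below. X and W are both vectors, and z is simply the weighted sum z = x1*w1 + x2*w2 + ... + xn*wn. Passing z through a step function gives the output.
[Figure: threshold logic unit]
Two step functions are commonly used:
heaviside(z) = 0 if z < 0, 1 if z >= 0
sgn(z) = -1 if z < 0, 0 if z = 0, +1 if z > 0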

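By choosing different weights and thresholds, a perceptron can represent an AND gate, a NAND gate, or an OR gate (the proof is omitted here). In the figure below, the straight line can act as an OR gate: it cleanly separates (0, 0) from (1, 0), (0, 1), and (1, 1).
[Figure: a straight line separating (0, 0) from the other three points]
However, a single perceptron cannot represent an XOR gate, because separating the XOR outputs requires a curve. In other words, a single-layer perceptron can only represent regions divided by a straight line, i.e. linearly separable problems.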
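The gates described above can be sketched as a weighted sum plus a bias (the negated threshold) followed by a step function. The helper name `perceptron` and the specific weights and biases are assumptions chosen for illustration; many other values work equally well.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    return int(np.sum(np.asarray(x) * np.asarray(w)) + b > 0)

def AND(x1, x2):
    return perceptron([x1, x2], [0.5, 0.5], -0.7)

def NAND(x1, x2):
    return perceptron([x1, x2], [-0.5, -0.5], 0.7)

def OR(x1, x2):
    return perceptron([x1, x2], [0.5, 0.5], -0.2)

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
for name, gate in (("AND", AND), ("NAND", NAND), ("OR", OR)):
    print(name, [gate(a, b) for a, b in inputs])
```

All three gates share the same structure; only the weights and the bias differ.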

A multilayer perceptron can represent XOR:
[Figure: XOR built from a multilayer perceptron]
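One possible way to stack perceptrons into XOR is to feed the outputs of a NAND and an OR perceptron into an AND perceptron. The sketch below assumes this particular construction and uses illustrative weights; the helper `gate` is a made-up name.

```python
def gate(x1, x2, w1, w2, b):
    """Single perceptron: step function applied to w1*x1 + w2*x2 + b."""
    return int(w1 * x1 + w2 * x2 + b > 0)

def XOR(x1, x2):
    s1 = gate(x1, x2, -0.5, -0.5, 0.7)   # first layer: NAND
    s2 = gate(x1, x2, 0.5, 0.5, -0.2)    # first layer: OR
    return gate(s1, s2, 0.5, 0.5, -0.7)  # second layer: AND

for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print((a, b), XOR(a, b))
```

No single choice of weights reproduces this truth table in one layer; the extra layer is what makes it possible.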
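If, as in the figure, every neuron receives every output of the previous layer, the layer is a fully connected layer (also called a dense layer).

The output is computed as:
h(X) = Φ(XW + b)
Here b is the bias vector, with one bias per neuron, and Φ is the activation function. If the neurons are TLUs, Φ is a step function. Other activation functions include sigmoid, ReLU, and softmax. Replacing the step function with a smooth activation such as sigmoid is what turns a perceptron into a neural network.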
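As a rough sketch of a fully connected layer, the snippet below computes an activation applied to XW + b for a small batch. The function name `dense_forward` and the numbers are hypothetical, and the step function follows the article's own `step_func` convention (1 only for strictly positive input).

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(X, W, b, phi):
    """Apply phi to XW + b for a batch X of shape (n_samples, n_inputs)."""
    return phi(X @ W + b)

X = np.array([[1.0, 2.0],
              [0.0, -1.0]])            # 2 samples, 2 features
W = np.array([[0.5, -1.0, 0.0],
              [0.25, 0.5, 1.0]])       # 2 inputs -> 3 neurons
b = np.array([0.0, 0.5, -1.0])         # one bias per neuron

tlu_out = dense_forward(X, W, b, step)         # layer of TLUs
smooth_out = dense_forward(X, W, b, sigmoid)   # same layer with sigmoid
print(tlu_out)
print(smooth_out)
```

Swapping `phi` is the only change needed to go from a layer of TLUs to a sigmoid layer.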

Multilayer Perceptron

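[Figure: multilayer perceptron]
Compared with the single-layer perceptron above, the difference is the added hidden layers. Layers close to the input layer are also called lower layers, and layers close to the output layer are called upper layers. Every layer except the output layer has a bias neuron, and every layer is fully connected. The network in the figure is also a feedforward neural network (FNN).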

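Activation functions

The activation functions of a neural network must be nonlinear. A composition of linear functions is itself linear, so with linear activations adding more layers gains nothing: the deep network is equivalent to a single layer with different weights/parameters.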
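A quick numerical check of this claim, as a sketch with randomly generated (hypothetical) weights: two stacked layers with an identity activation give exactly the same outputs as one layer whose parameters are the merged products.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                          # 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked layers with a linear (identity) activation
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single layer with merged parameters
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```

Inserting any nonlinear function between the two layers breaks this equivalence, which is why depth only helps with nonlinear activations.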

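The activation function of the output layer depends on the kind of problem being solved. In general, regression can use the identity function (the value is output as is, without any processing), binary classification can use sigmoid, and multi-class classification can use softmax. Softmax is mainly needed during training; at inference time it is usually omitted, because it does not change which output is the largest.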

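sigmoid function: h(x) = 1 / (1 + exp(-x))
hyperbolic tangent function: tanh(x) = 2h(2x) - 1, where h is the sigmoid function
ReLU (Rectified Linear Unit) function: relu(x) = max(0, x)
softplus function (a smoother version of ReLU): softplus(x) = ln(1 + exp(x))
softmax function: y_k = exp(x_k) / sum_i exp(x_i)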

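Exponentials grow extremely fast, so to prevent overflow the softmax is rewritten as y_k = exp(x_k + C') / sum_i exp(x_i + C'). The result is unchanged for any constant C', and C' is set to -max(x).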

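Python implementation:

import numpy as np

def step_func(X):
    # 1 where the input is positive, 0 elsewhere
    return np.array(X > 0, dtype=int)

def sigmoid_func(X):
    return 1 / (1 + np.exp(-X))

def relu(X):
    return np.maximum(0, X)

def softmax(X):
    # subtract the max along the last axis before exponentiating to avoid
    # overflow; working per row also makes this correct for batches
    c = np.max(X, axis=-1, keepdims=True)
    exp_X = np.exp(X - c)
    sum_exp_X = np.sum(exp_X, axis=-1, keepdims=True)
    y = exp_X / sum_exp_X
    return y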
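To see why the max is subtracted, here is a small comparison between a naive softmax and the shifted version. The function names and the input scores are made up for illustration.

```python
import numpy as np

def softmax_naive(x):
    exp_x = np.exp(x)                 # overflows to inf for large inputs
    return exp_x / np.sum(exp_x)

def softmax_stable(x):
    exp_x = np.exp(x - np.max(x))     # largest exponent is now exp(0) = 1
    return exp_x / np.sum(exp_x)

scores = np.array([1010.0, 1000.0, 990.0])

# Silence the overflow warnings that the naive version triggers
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(scores)
stable = softmax_stable(scores)

print(naive)    # every entry is nan, because inf / inf is undefined
print(stable)   # a valid probability distribution
```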

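The number of neurons in the output layer depends on the problem. For classification, it is usually set to the number of classes. For example, a handwritten digit classifier trained on MNIST can use an output layer with 10 neurons, one per digit.

A set of input data is called a batch. Numerical libraries are generally optimized for batch processing, so running inference one batch at a time is faster than processing samples individually.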

How neural networks learn

Loss function

The loss function measures how "bad" a neural network's performance is, that is, how poorly the current network fits the supervised (labeled) data. During learning/training, the search for the optimal parameters (weights W and biases b) is a search for the parameters that make the loss as small as possible. To do this, we compute the derivative of the loss with respect to each parameter and use it as a guide to update the parameter values step by step.

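Functions that can be used as the loss include:

Reference: https://zhuanlan.zhihu.com/p/532850353
Mean squared error (MSE) / L2 Loss: MSE = (1/n) * sum_i (y_i - t_i)^2

torch.nn.MSELoss(reduction='mean')

Or implement it yourself:

def mean_squared_error(y, t):
    # half the sum of squared errors for a single sample; unlike
    # MSELoss(reduction='mean'), it does not divide by the number of elements
    return 0.5 * np.sum((y - t) ** 2)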

RMSE (root mean squared error): RMSE = sqrt(MSE)
MSE converges faster than MAE. Using it corresponds to assuming that the model's errors follow a standard Gaussian distribution (mean 0, standard deviation 1).

Mean absolute error (MAE) / L1 Loss: MAE = (1/n) * sum_i |y_i - t_i|

torch.nn.L1Loss(reduction='mean')

MAE is less sensitive to outliers than MSE. Using it corresponds to assuming that the errors follow a Laplace distribution (μ = 0, b = 1).

Huber Loss: L(y, t) = 0.5 * (y - t)^2 if |y - t| <= δ, otherwise δ * (|y - t| - 0.5 * δ)

It combines the advantages of MSE and MAE and descends almost as fast as MSE. The drawback is that δ has to be chosen.

torch.nn.HuberLoss(reduction='mean')
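To make the outlier behaviour of the three regression losses concrete, here is a minimal NumPy sketch. The function names and the toy data are assumptions for illustration; the Huber loss is written with δ as a plain `delta` argument and averages over elements, like the 'mean' reduction.

```python
import numpy as np

def mse(y, t):
    return np.mean((y - t) ** 2)

def mae(y, t):
    return np.mean(np.abs(y - t))

def huber(y, t, delta=1.0):
    err = np.abs(y - t)
    quadratic = 0.5 * err ** 2              # used where |error| <= delta
    linear = delta * (err - 0.5 * delta)    # used where |error| > delta
    return np.mean(np.where(err <= delta, quadratic, linear))

t = np.array([1.0, 2.0, 3.0, 4.0])      # targets
y = np.array([1.0, 2.0, 3.0, 100.0])    # predictions, the last one is an outlier

print(mse(y, t), mae(y, t), huber(y, t))
```

The single outlier dominates the MSE because its error is squared, while MAE and Huber grow only linearly with it.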

Cross-entropy error: E = -sum_k t_k * ln(y_k)
Here t is the one-hot label and y is the network output, so only the term for the correct class contributes. For example, if the index of the correct label is 2 and the network's output for that class is 0.6, then E = -ln 0.6 ≈ 0.51.

def cross_entropy_error(y, t):
    delta = 1e-7
    # adding delta avoids taking log(0)
    return -np.sum(t * np.log(y + delta))

Extended to the error over a batch of N samples, this becomes
E = -(1/N) * sum_n sum_k t_nk * ln(y_nk)
Mini-batch learning: pick a batch of data (a mini-batch) from the training data, then learn on each mini-batch. Computing the loss on this randomly chosen batch gives an approximation of the loss over all the training data.

The cross-entropy error of a mini-batch is then computed like this:

# t is one-hot encoded, with the same shape as y
def cross_entropy_error_1hot_batch(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    delta = 1e-7
    return -np.sum(t * np.log(y + delta)) / batch_size

# t holds integer class labels, one per sample
def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    delta = 1e-7
    # pick the predicted probability of the correct class in each row
    return -np.sum(np.log(y[np.arange(batch_size), t] + delta)) / batch_size
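The one-hot form and the label-index form compute the same quantity. The sketch below checks this on a tiny made-up batch; the function names here are illustrative and deliberately differ from the ones above.

```python
import numpy as np

def cross_entropy_one_hot(y, t, delta=1e-7):
    """t is one-hot with the same shape as y."""
    return -np.sum(t * np.log(y + delta)) / y.shape[0]

def cross_entropy_labels(y, t, delta=1e-7):
    """t holds one integer class index per row of y."""
    n = y.shape[0]
    return -np.sum(np.log(y[np.arange(n), t] + delta)) / n

y = np.array([[0.1, 0.3, 0.6],
              [0.8, 0.1, 0.1]])     # predicted probabilities for 2 samples
labels = np.array([2, 0])           # correct class index per sample
one_hot = np.eye(3)[labels]         # the same labels in one-hot form

loss_a = cross_entropy_one_hot(y, one_hot)
loss_b = cross_entropy_labels(y, labels)
print(loss_a, loss_b)
```

Both values are close to -(ln 0.6 + ln 0.8) / 2, up to the small `delta` offset.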

Numerical differentiation

As mentioned above, finding the parameters of a neural network requires derivatives with respect to those parameters. In practice, what gets computed is an approximation. The derivative is defined as df(x)/dx = lim_{h->0} (f(x + h) - f(x)) / h, and when h is small enough the approximation can be considered close enough to the true value.
Numerical differentiation is the process of approximating a derivative by numerical methods. Taking the difference on both sides of x gives the central difference, (f(x + h) - f(x - h)) / (2h). (Taking it between x + h and x gives the forward difference.)
Obtaining the derivative by manipulating the mathematical expression is called analytic differentiation. For example, the derivative of y = x^2 is y' = 2x. Analytic differentiation yields the true derivative, with no error.
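A small sketch comparing the forward and central differences against an analytic derivative. The test function f(x) = x^3 and the step size are arbitrary choices for illustration.

```python
def forward_diff(f, x, h=1e-4):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 3

x0 = 2.0
true_value = 3 * x0 ** 2            # analytic derivative of x^3 at x0

forward_error = abs(forward_diff(f, x0) - true_value)
central_error = abs(central_diff(f, x0) - true_value)
print(forward_error, central_error)
```

For the same h, the central difference is considerably more accurate: its error shrinks with h^2, while the forward difference error shrinks only with h.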

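A partial derivative is the derivative with respect to one variable of a function of several variables. For example, for
f(x0, x1) = x0^2 + x1^2
the partial derivatives are:
∂f/∂x0 = 2*x0, ∂f/∂x1 = 2*x1
To compute one, treat the other variables as constants and differentiate only with respect to the current variable.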

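Collecting the partial derivatives with respect to all variables into a vector gives the gradient. At each point the gradient points in the direction in which the function value increases fastest, so the negative gradient is the direction in which it decreases fastest. There is no guarantee, however, that this direction leads to the function's minimum.

Gradient method: starting from some position, move a short distance along the negative gradient direction, recompute the gradient, move along the new direction, and so on, gradually reducing the function value. The gradient method that looks for a minimum is called gradient descent; the one that looks for a maximum is called gradient ascent.

The gradient method looks for a point where the gradient is 0, but such a point is not necessarily the minimum. It may be a local minimum, or a saddle point (a point that is a maximum when viewed along one direction and a minimum along another), which is not a minimum either. Moreover, if the function is complicated and fairly flat, learning may enter a flat region (a "plateau") and stall there.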

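def numerical_gradient(f, x):
    # x must be a float array; it is modified temporarily and then restored
    h = 1e-4 # 0.0001
    grad = np.zeros_like(x) # array with the same shape as x
    # flat indexing lets this work for arrays of any dimension
    for idx in range(x.size):
        tmp_val = x.flat[idx]
        # compute f(x+h)
        x.flat[idx] = tmp_val + h
        fxh1 = f(x)
        # compute f(x-h)
        x.flat[idx] = tmp_val - h
        fxh2 = f(x)
        # central difference
        grad.flat[idx] = (fxh1 - fxh2) / (2*h)
        x.flat[idx] = tmp_val # restore the original value
    return grad

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    # work on a float copy so the caller's array is not modified
    x = np.array(init_x, dtype=float)
    for i in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad
    return x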
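As a usage sketch, gradient descent with a numerical gradient can minimize f(x0, x1) = x0^2 + x1^2. The starting point, learning rate, and step count are arbitrary choices, and the two helpers are restated in compact form so the snippet runs on its own.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central-difference gradient of f at x (x is a float array)."""
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp_val = x.flat[idx]
        x.flat[idx] = tmp_val + h
        fxh1 = f(x)
        x.flat[idx] = tmp_val - h
        fxh2 = f(x)
        grad.flat[idx] = (fxh1 - fxh2) / (2 * h)
        x.flat[idx] = tmp_val          # restore the original value
    return grad

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = np.array(init_x, dtype=float)  # copy, so init_x stays untouched
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

def bowl(x):
    return x[0] ** 2 + x[1] ** 2       # minimum at (0, 0)

start = np.array([-3.0, 4.0])
result = gradient_descent(bowl, start, lr=0.1, step_num=100)
print(result)
```

With a learning rate that is far too large the iterates diverge, and with one that is far too small they barely move, which is why the learning rate has to be tuned.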

The learning rate is a hyperparameter. Weights and biases are obtained through training, but the learning rate has to be set by hand.

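Summary: how a neural network learns

A neural network has suitable weights and biases; adjusting them so that the network fits the training data is called "learning". Learning consists of 4 steps:

1. Mini-batch
Randomly select part of the training data; this is the mini-batch. The goal from here on is to reduce the value of the loss function on the mini-batch.
2. Compute the gradient
Compute the gradient of the loss with respect to each weight parameter.
3. Update the parameters
Update the weight parameters by a small amount in the direction opposite to the gradient.
4. Repeat steps 1 to 3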

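One epoch is the number of updates it takes for all the training data to have been used once. For example, with 10,000 training samples and a mini-batch size of 100, it takes 100 iterations of stochastic gradient descent to go through all the data, so one epoch corresponds to 100 iterations. In practice, the training data is shuffled first and mini-batches are then generated in order with the given batch size.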
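The shuffle-then-slice procedure can be sketched as a small generator. The name `iterate_minibatches` and the toy arrays are hypothetical; real data would replace them.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover the data exactly once."""
    order = rng.permutation(X.shape[0])          # shuffled sample indices
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10000, dtype=float).reshape(10000, 1)   # toy inputs
y = np.arange(10000)                                   # toy labels

batches = list(iterate_minibatches(X, y, batch_size=100, rng=rng))
print(len(batches))   # number of iterations in one epoch
```

Calling the generator again yields a freshly shuffled epoch, so every sample is seen exactly once per epoch.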

The code below is illustrative. It has not been run, because some of the functions still need to be adapted.

import numpy as np

class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size,
                 weight_init_std=0.01):
        self.params = {}
        # draw the weights from a normal distribution, shape (rows, cols)
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def sigmoid_func(self, X):
        return 1 / (1 + np.exp(-X))

    def softmax(self, X):
        # normalize each row separately so batches are handled correctly
        c = np.max(X, axis=-1, keepdims=True)
        exp_X = np.exp(X - c)
        sum_exp_X = np.sum(exp_X, axis=-1, keepdims=True)
        y = exp_X / sum_exp_X
        return y

    def cross_entropy_error(self, y, t):
        if y.ndim == 1:
            t = t.reshape(1, t.size)
            y = y.reshape(1, y.size)
        # accept one-hot labels by converting them to class indices
        if t.size == y.size:
            t = t.argmax(axis=1)
        batch_size = y.shape[0]
        delta = 1e-7
        return -np.sum(np.log(y[np.arange(batch_size), t] + delta)) / batch_size

    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        a1 = np.dot(x, W1) + b1
        z1 = self.sigmoid_func(a1)
        a2 = np.dot(z1, W2) + b2
        y = self.softmax(a2)
        return y

    def loss(self, x, t):
        y = self.predict(x)
        return self.cross_entropy_error(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        # t may be one-hot (2-D) or integer class labels (1-D)
        if t.ndim != 1:
            t = np.argmax(t, axis=1)
        acc = np.sum(y == t) / float(x.shape[0])
        return acc

    def numerical_gradient(self, f, x):
        h = 1e-4  # 0.0001
        grad = np.zeros_like(x)  # array with the same shape as x
        # flat indexing so 2-D weight matrices are handled element by element
        for idx in range(x.size):
            tmp_val = x.flat[idx]
            # compute f(x+h)
            x.flat[idx] = tmp_val + h
            fxh1 = f(x)
            # compute f(x-h)
            x.flat[idx] = tmp_val - h
            fxh2 = f(x)
            # central difference
            grad.flat[idx] = (fxh1 - fxh2) / (2 * h)
            x.flat[idx] = tmp_val  # restore the original value
        return grad

    def gradient(self, x, t):
        # W is unused: the parameter arrays are perturbed in place,
        # so self.loss sees the changed values directly
        loss_W = lambda W: self.loss(x, t)
        grads = {}
        grads['W1'] = self.numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = self.numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = self.numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = self.numerical_gradient(loss_W, self.params['b2'])
        return grads

# get_data is not defined in this article; supply your own data loader
X_train, y_train, X_test, y_test = get_data()
train_size = X_train.shape[0]
batch_size = 100
train_loss_list = []
train_acc_list = []
test_acc_list = []
# integer number of iterations per epoch, so the modulo check below works
iter_per_epoch = max(train_size // batch_size, 1)
# hyperparameters
iters_num = 10000
learning_rate = 0.1
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
for i in range(iters_num):
    # get mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = X_train[batch_mask]
    y_batch = y_train[batch_mask]
    # calc gradient (numerical, so this is very slow)
    grad = network.gradient(x_batch, y_batch)
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
    loss = network.loss(x_batch, y_batch)
    train_loss_list.append(loss)
    # record the accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(X_train, y_train)
        test_acc = network.accuracy(X_test, y_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
