Hot: Quickly Building a Machine Learning Dataset-Splitting App with ChatGPT + Streamlit!
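Is ChatGPT useful for Python programmers? Especially for those of us who use Python for data processing, data cleaning, and building machine learning models? Let's try it and find out.
What is ChatGPT from OpenAI? I'll assume you already know. With all the hype online, I don't think it needs an introduction. If you happen not to know what ChatGPT is, go look it up quickly (or people will laugh at you), then come back and continue.
Conversations with ChatGPT can be entertaining, even enlightening. But let's see how helpful it actually is.
Let's see what ChatGPT can do when writing code from specifications we provide. We'll start simple and build from there.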
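I. A simple dataset processing script
I think it's best to start with some kind of data processing script. Since I intend to have ChatGPT write some machine learning scripts using various Python libraries later on, this seems like a reasonable starting point.
1. Goal
First, I want to try loading a dataset; making sure all dataset values are numeric, or converting them to numeric; checking for missing values; and splitting the dataset into training and test sets.
2. User prompt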
Write a Python script that does the following:
- loads a dataset from csv on disk to Pandas DataFrame
- ensures that dataset is entirely numeric; converts string values to numeric and tracks this conversion in a dictionary to reference later
- checks for missing values in the DataFrame and imputes missing values for any record that has one missing value; entirely drops a record that has more than one missing value
- splits the dataset into test and training sets
3. Code returned by ChatGPT
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the dataset from csv to a Pandas DataFrame
df = pd.read_csv("dataset.csv")

# Ensure the dataset is entirely numeric
# Convert string values to numeric and track this conversion in a dictionary to reference later
string_columns = df.columns[df.dtypes == 'object']
string_to_numeric = {}
for column in string_columns:
    unique_values = df[column].unique()
    mapping = {value: i for i, value in enumerate(unique_values)}
    string_to_numeric[column] = mapping
    df[column] = df[column].map(mapping)

# Check for missing values in the DataFrame
missing_values = df.isna().sum()

# Impute missing values for any record that has one missing value; entirely drop a record that has more than one missing value
for column, missing_value_count in missing_values.items():
    if missing_value_count == 1:
        df[column].fillna(df[column].mean(), inplace=True)
    elif missing_value_count > 1:
        df.dropna(subset=[column], inplace=True)

# Split the dataset into test and training sets
train_data, test_data, train_target, test_target = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=0)
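Note: This script assumes that the target variable is stored in a column named target in the DataFrame. You'll need to modify this code based on the structure of your dataset.
4. Result analysis
ChatGPT produced a useful script. It settled on a dataset file name (dataset.csv) and a train/test split ratio (test_size=0.2), and documented both, along with how to change them if needed. Overall, mission accomplished.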
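One detail worth checking against the prompt: it asked for missing values to be handled per record (row), while the returned script counts them per column. The following is a minimal sketch of a row-wise version, not ChatGPT's output. The function name `handle_missing_by_record` and its `max_missing` parameter are hypothetical, the sample values are made up, and the sketch assumes the DataFrame is already numeric. It also returns a new DataFrame rather than calling `fillna(..., inplace=True)` on a column selection, a pattern that recent pandas versions warn about.

```python
import numpy as np
import pandas as pd


def handle_missing_by_record(df, max_missing=1):
    """Drop rows with more than `max_missing` missing values, then
    fill the remaining gaps with each column's mean."""
    # Count missing values per row (axis=1), not per column
    missing_per_row = df.isna().sum(axis=1)
    # Keep only the rows at or below the threshold
    kept = df.loc[missing_per_row <= max_missing].copy()
    # Impute what is left with the column means of the kept rows
    return kept.fillna(kept.mean(numeric_only=True))


# Tiny made-up frame: row 1 has one gap, row 2 has two
sample = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "target": [0, 1, 0, 1],
})
cleaned = handle_missing_by_record(sample)
print(cleaned)
```

Here row 2 is dropped because it has two gaps, and the single gap in row 1 is filled with the mean of column `a` over the rows that remain.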
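II. Refactoring our code
1. Goal
Now that we have a script that does something, let's refactor it a bit. A foundation of good coding is using functions to organize code logically, promote reuse, and make sure you don't repeat yourself.
2. User prompt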
Rewrite the above script to use a series of functions that make sense
3. Code returned by ChatGPT
Here’s the script refactored into a series of functions that make sense:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def load_dataset(filepath):
    """Loads a dataset from a csv file to a Pandas DataFrame"""
    return pd.read_csv(filepath)

def convert_to_numeric(df):
    """Ensures the dataset is entirely numeric and converts string values to numeric"""
    string_columns = df.columns[df.dtypes == 'object']
    string_to_numeric = {}
    for column in string_columns:
        unique_values = df[column].unique()
        mapping = {value: i for i, value in enumerate(unique_values)}
        string_to_numeric[column] = mapping
        df[column] = df[column].map(mapping)
    return df, string_to_numeric

def handle_missing_values(df):
    """Handles missing values by imputing a mean for records with one missing value, and dropping records with more than one missing value"""
    missing_values = df.isna().sum()
    for column, missing_value_count in missing_values.items():
        if missing_value_count == 1:
            df[column].fillna(df[column].mean(), inplace=True)
        elif missing_value_count > 1:
            df.dropna(subset=[column], inplace=True)
    return df

def split_dataset(df, target_column, test_size=0.2, random_state=0):
    """Splits the dataset into test and training sets"""
    train_data, test_data, train_target, test_target = train_test_split(df.drop(target_column, axis=1), df[target_column], test_size=test_size, random_state=random_state)
    return train_data, test_data, train_target, test_target

# Load the dataset from csv to a Pandas DataFrame
df = load_dataset("dataset.csv")

# Convert string values to numeric and handle missing values
df, string_to_numeric = convert_to_numeric(df)
df = handle_missing_values(df)

# Split the dataset into test and training sets
train_data, test_data, train_target, test_target = split_dataset(df, "target")
Note: This script assumes that the target variable is stored in a column named target in the DataFrame. You’ll need to modify this code based on the structure of your dataset.
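Here, ChatGPT used four functions:
1. load_dataset(filepath): loads the csv dataset.
2. convert_to_numeric(df): converts string values to numeric values.
3. handle_missing_values(df): handles missing values in the dataset.
4. split_dataset(df, target_column, test_size=0.2, random_state=0): splits the dataset into test and training sets.
The example code at the bottom of the script uses these functions to load, convert, and clean the dataset, and then split it into test and training sets.
The dataset path and the test ratio are inputs to these functions.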
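The dictionary returned alongside the DataFrame is what lets you get from the integer codes back to the original strings later. Below is a short sketch of that round trip on a made-up `colors` column, encoded the same way `convert_to_numeric` does it. The helper name `invert_mapping` is hypothetical and not part of ChatGPT's response.

```python
import pandas as pd


def invert_mapping(mapping):
    """Turn {original_value: code} into {code: original_value}."""
    return {code: value for value, code in mapping.items()}


# Made-up column, encoded with the same enumerate-over-unique approach
colors = pd.Series(["red", "green", "red", "blue"])
mapping = {value: i for i, value in enumerate(colors.unique())}
encoded = colors.map(mapping)

# Later: recover the original labels from the integer codes
decoded = encoded.map(invert_mapping(mapping))
print(mapping)
print(encoded.tolist())
print(decoded.tolist())
```

Because `unique()` returns values in order of first appearance, the codes depend on row order, so the dictionary has to be kept with the encoded data if you want to decode it reliably.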
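4. Result analysis
In ChatGPT's first script we could already see some very obvious logical blocks of code that could be wrapped into functions by purpose. ChatGPT did what I expected it to do. Again, it produced a competent script.
III. Extending our code
1. Goal
Our script looks a bit better now that it uses functions. Next, let's extend it slightly so that the user can specify some parameters when running it.
2. User prompt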
Rewrite the above program to solicit the following from the command line in the form of arguments, and use them in the program:
- the path/name of the dataset csv file
- the train/test split size
- the threshold for the number of missing values per record to drop after which to drop entire record, instead of imputing the missing values
3. Code returned by ChatGPT
Here’s the script refactored to solicit the arguments from the command line:
import pandas as pd
import numpy as np
import argparse
from sklearn.model_selection import train_test_split

def load_dataset(filepath):
    """Loads a dataset from a csv file to a Pandas DataFrame"""
    return pd.read_csv(filepath)

def convert_to_numeric(df):
    """Ensures the dataset is entirely numeric and converts string values to numeric"""
    string_columns = df.columns[df.dtypes == 'object']
    string_to_numeric = {}
    for column in string_columns:
        unique_values = df[column].unique()
        mapping = {value: i for i, value in enumerate(unique_values)}
        string_to_numeric[column] = mapping
        df[column] = df[column].map(mapping)
    return df, string_to_numeric

def handle_missing_values(df, missing_value_threshold):
    """Handles missing values by imputing a mean for records with one missing value, and dropping records with more than one missing value"""
    missing_values = df.isna().sum()
    for column, missing_value_count in missing_values.items():
        if missing_value_count <= missing_value_threshold:
            df[column].fillna(df[column].mean(), inplace=True)
        elif missing_value_count > missing_value_threshold:
            df.dropna(subset=[column], inplace=True)
    return df

def split_dataset(df, target_column, test_size=0.2, random_state=0):
    """Splits the dataset into test and training sets"""
    train_data, test_data, train_target, test_target = train_test_split(df.drop(target_column, axis=1), df[target_column], test_size=test_size, random_state=random_state)
    return train_data, test_data, train_target, test_target

if __name__ == '__main__':
    # Define the command line arguments
    parser = argparse.ArgumentParser(description='Handles missing values and splits a dataset into test and training sets')
    parser.add_argument('filepath', type=str, help='The path/name of the dataset csv file')
    parser.add_argument('test_size', type=float, help='The test/train split size')
    parser.add_argument('missing_value_threshold', type=int, help='The number of missing values per record after which to drop the entire record, instead of imputing the missing values')
    args = parser.parse_args()

    # Load the dataset from csv to a Pandas DataFrame
    df = load_dataset(args.filepath)

    # Convert string values to numeric and handle missing values
    df, string_to_numeric = convert_to_numeric(df)
    df = handle_missing_values(df, args.missing_value_threshold)

    # Split the dataset into test and training sets
    train_data, test_data, train_target, test_target = split_dataset(df, df.columns[-1], test_size=args.test_size)
This script can be run from the command line with the following command format:
python script_name.py filepath test_size missing_value_threshold
4. Result analysis
Honestly, not much analysis is needed here, since ChatGPT's comments already explain what the code does quite well. It even tells us how to run it!
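One thing the script leaves to the user is passing a sensible split size: when `test_size` is a float, scikit-learn's `train_test_split` expects it to lie strictly between 0 and 1. A hedged sketch of how the parser could reject bad values up front follows. The `fraction` and `build_parser` names are hypothetical helpers, not part of ChatGPT's response.

```python
import argparse


def fraction(text):
    """argparse type: a float strictly between 0 and 1."""
    value = float(text)
    if not 0.0 < value < 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not strictly between 0 and 1")
    return value


def build_parser():
    """Same three positional arguments as the script above, with a stricter test_size."""
    parser = argparse.ArgumentParser(
        description='Handles missing values and splits a dataset into test and training sets')
    parser.add_argument('filepath', type=str, help='The path/name of the dataset csv file')
    parser.add_argument('test_size', type=fraction, help='The test/train split size')
    parser.add_argument('missing_value_threshold', type=int,
                        help='The missing-value count above which to drop instead of impute')
    return parser


# Passing a list lets you try the parser without a real command line
args = build_parser().parse_args(["dataset.csv", "0.25", "2"])
print(args.filepath, args.test_size, args.missing_value_threshold)
```

With this in place, a call such as `python script_name.py dataset.csv 1.5 2` would stop with a usage error before any data is loaded.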
IV. Converting our code into a Streamlit app
1. Goal
Now, let's suppose we want to use this code in an application. Let's have ChatGPT wrap these functions in a Streamlit app.
2. User prompt
Rewrite the above program as a Streamlit app, allowing for the user to provide the same arguments as in the command line args above
3. Code returned by ChatGPT
Here’s the script refactored as a Streamlit app:
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def load_dataset(filepath):
    """Loads a dataset from a csv file to a Pandas DataFrame"""
    return pd.read_csv(filepath)

def convert_to_numeric(df):
    """Ensures the dataset is entirely numeric and converts string values to numeric"""
    string_columns = df.columns[df.dtypes == 'object']
    string_to_numeric = {}
    for column in string_columns:
        unique_values = df[column].unique()
        mapping = {value: i for i, value in enumerate(unique_values)}
        string_to_numeric[column] = mapping
        df[column] = df[column].map(mapping)
    return df, string_to_numeric

def handle_missing_values(df, missing_value_threshold):
    """Handles missing values by imputing a mean for records with one missing value, and dropping records with more than one missing value"""
    missing_values = df.isna().sum()
    for column, missing_value_count in missing_values.items():
        if missing_value_count <= missing_value_threshold:
            df[column].fillna(df[column].mean(), inplace=True)
        elif missing_value_count > missing_value_threshold:
            df.dropna(subset=[column], inplace=True)
    return df

def split_dataset(df, target_column, test_size=0.2, random_state=0):
    """Splits the dataset into test and training sets"""
    train_data, test_data, train_target, test_target = train_test_split(df.drop(target_column, axis=1), df[target_column], test_size=test_size, random_state=random_state)
    return train_data, test_data, train_target, test_target

def main():
    """Main function for the Streamlit app"""
    st.title("Dataset Preprocessing App")

    # Get the file path, test size, and missing value threshold from the user
    file_path = st.text_input("Enter the path/name of the dataset csv file:")
    test_size = st.slider("Enter the test/train split size:", 0.0, 1.0, 0.2)
    missing_value_threshold = st.slider("Enter the number of missing values per record after which to drop the entire record, instead of imputing the missing values:", 0, 100, 1)

    # Load the dataset and convert string values to numeric
    df = load_dataset(file_path)
    df, string_to_numeric = convert_to_numeric(df)

    # Handle missing values
    df = handle_missing_values(df, missing_value_threshold)

    # Split the dataset into test and training sets
    train_data, test_data, train_target, test_target = split_dataset(df, df.columns[-1], test_size=test_size)

    # Show the final processed data
    st.write("Processed Dataset:")
    st.write(df)
    st.write("Train Data:")
    st.write(train_data)
    st.write("Train Target:")
    st.write(train_target)
    st.write("Test Data:")
    st.write(test_data)
    st.write("Test Target:")
    st.write(test_target)

if __name__ == '__main__':
    main()
To run the Streamlit app, you can save the above code in a .py file and run it in the command line with the following command:
streamlit run file_name.py
4. Result analysis
It did what it was supposed to do. ChatGPT took our functions and built a simple Streamlit web app around them.
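As returned, the app calls `load_dataset` as soon as the page loads, while the text box is still empty. Here is a minimal sketch of one way to guard against that, swapping the text box for Streamlit's file uploader. This is an assumed variation, not part of ChatGPT's response, and `read_csv_or_none` is a hypothetical helper name.

```python
import io

import pandas as pd


def read_csv_or_none(source):
    """Return a DataFrame for a path or file-like object,
    or None if the user has not provided anything yet."""
    if source is None or (isinstance(source, str) and not source.strip()):
        return None
    return pd.read_csv(source)


def main():
    # Imported here so the helper above can be tried without Streamlit installed
    import streamlit as st

    st.title("Dataset Preprocessing App")
    uploaded = st.file_uploader("Upload the dataset csv file:", type="csv")
    df = read_csv_or_none(uploaded)
    if df is None:
        st.info("Upload a csv file to continue.")
        st.stop()  # end this run; Streamlit reruns the script on the next interaction
    st.write("Loaded Dataset:")
    st.write(df)


# Quick check of the helper outside Streamlit, using an in-memory csv
print(read_csv_or_none(""))
print(read_csv_or_none(io.StringIO("a,b\n1,2\n")))
```

To use this as the app itself, call `main()` at the bottom of the file, as the original script does, and start it with `streamlit run`. The rest of the pipeline (conversion, missing values, split) would follow the `st.stop()` guard unchanged.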
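V. Summary
With the four steps above complete, let's check whether the code actually runs. This takes the following steps:
1. Install the streamlit library
pip install streamlit
2. Install the scikit-learn library
pip install scikit-learn
3. Happily run it
streamlit run main.py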
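If you don't have a csv at hand to try the app with, a small synthetic one is enough. The sketch below writes a made-up file. The column names, the values, and the `sample_dataset.csv` file name are all invented for illustration. The target is placed in the last column because the app uses `df.columns[-1]` as the target.

```python
import numpy as np
import pandas as pd

# Fixed seed so the made-up data is the same on every run
rng = np.random.default_rng(0)
n_rows = 20

sample = pd.DataFrame({
    "height": rng.normal(170, 10, n_rows).round(1),          # numeric feature
    "city": rng.choice(["Paris", "Tokyo", "Lima"], n_rows),  # string feature to be encoded
    "target": rng.integers(0, 2, n_rows),                    # binary target, last column
})
# Blank out one value so the missing-value step has something to do
sample.loc[3, "height"] = np.nan

sample.to_csv("sample_dataset.csv", index=False)

# Read the file back to confirm it looks as expected
check = pd.read_csv("sample_dataset.csv")
print(check.head())
```

Entering `sample_dataset.csv` in the app's path field (relative to the directory where `streamlit run` was started) should then exercise the string conversion, the missing-value handling, and the split.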
[Screenshot: the code in VS Code]
[Screenshot: the running web app]
Done! Top-class work, ChatGPT!