來(lái)源于這里。
很多情況下用的是pandas而不是numpy,因?yàn)榍罢呤窃诤笳叩幕A(chǔ)上又封裝了一些操作,相當(dāng)于做了函數(shù)簡(jiǎn)化。pandas主要是數(shù)據(jù)預(yù)處理用的比較多。
十八 Pandas
1 文件讀取
1.1 讀取csv
任意一種格式,只要是以,
為分隔符,就可以用read_csv
讀取:
import pandas as pd
excel = pd.read_csv('excel.csv')
先把文件打印出來(lái)看看結(jié)果:
print(excel)
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
5 5.0 NaN e NaN
6 9.0 NaN j NaN
7 8.0 NaN i NaN
結(jié)果是和表格一致的,而且打印出來(lái)的在表格左側(cè)加上了行號(hào)。而沒(méi)有內(nèi)容的地方用NaN
表示。忽略了前面的空行,但是內(nèi)部的空行不忽略。
接著我們看看這個(gè)文件的格式:
print(type(excel))
>>> <class 'pandas.core.frame.DataFrame'>
pandas的格式是DataFrame
,數(shù)據(jù)流。通過(guò)read_csv
之后,就是數(shù)據(jù)流格式,這是pandas里面最為核心也是最常用的格式。這個(gè)格式跟矩陣是差不多的,也是由行和列來(lái)組成。
我們?cè)倏纯次募?nèi)部的格式:
print(excel.dtypes)
>>> number1 float64
Unnamed: 1 float64
letter object
number2 float64
dtype: object
可以看到,基本是由int
、float
、object
組成的。
值得一提的是,文件number1
這一列存在浮點(diǎn)數(shù)(NaN
被認(rèn)為是浮點(diǎn)數(shù)),則類型就是float64
。
1.2 讀取txt
1.1
說(shuō)過(guò)了:任意一種格式,只要是以,
為分隔符,就可以用read_csv
讀取。因此用逗號(hào)分隔的txt格式也是一樣的方法:
import pandas as pd
excel = pd.read_csv('word_new.txt')
print(excel)
>>>
number word
0 1 my
1 2 name NaN
2 3 is
3 4\tnxy NaN
4 5 \t?
print(type(excel))
>>> <class 'pandas.core.frame.DataFrame'>
print(excel.dtypes)
>>> number object
word object
dtype: object
- 最前面的空行同樣被忽略了→
csv
和txt
都是用pd.read_csv()
讀取,用逗號(hào)來(lái)分隔,所以這兩個(gè)文件都忽略了前面的空行,但是內(nèi)容中間的空行不能忽略。- 第一行有內(nèi)容的行被認(rèn)為是表頭
number,word
和1,my
行都是被,
分隔的,所以很順利的占據(jù)兩列2 name
沒(méi)有逗號(hào),認(rèn)為是一列的,所以在后面補(bǔ)NaN
3, is
里面is
左右各有兩個(gè)空格,所以可以看到結(jié)果里面is
沒(méi)有和同一列其它元素一樣向右對(duì)齊- 最后兩行是用了
tab
1.3 讀取excel(xlsx)
pandas里面同樣包含讀取最普遍的 excel 的xlsx
格式的方式:
import pandas as pd
excel = pd.read_excel('excel.xlsx')
print(excel)
>>>
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 number1 NaN letter number2
3 1 NaN a 0.1
4 3 NaN c 0.5
5 4 NaN b 0.2
6 2 NaN d NaN
7 NaN NaN NaN NaN
8 5 NaN j NaN
9 9 NaN h NaN
10 8 NaN NaN NaN
print(type(excel))
>>> <class 'pandas.core.frame.DataFrame'>
print(excel.dtypes)
>>> Unnamed: 0 object
Unnamed: 1 float64
Unnamed: 2 object
Unnamed: 3 object
dtype: object
這里就沒(méi)有忽略前面的空行。
\quad
2 內(nèi)容讀取
這里的excel
代表的是'excel.csv'
:
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
5 5.0 NaN e NaN
6 9.0 NaN j NaN
7 8.0 NaN i NaN
2.1 讀取行
.head()
讀取前面幾行:
print(excel.head(3))
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
print(excel.head()) # 默認(rèn)5行
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
.tail()
讀取后面幾行:
print(excel.tail())
>>>
number1 Unnamed: 1 letter number2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
5 5.0 NaN j NaN
6 9.0 NaN h NaN
7 8.0 NaN NaN NaN
.loc[]
讀取特定行:
print(excel.loc[3])
>>> number1 2.0
Unnamed: 1 NaN
letter d
number2 NaN
Name: 3, dtype: object
print(excel.loc[3:6])
>>>
number1 Unnamed: 1 letter number2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
5 5.0 NaN j NaN
6 9.0 NaN h NaN
print(excel.loc[[2,4,7]])
>>>
number1 Unnamed: 1 letter number2
2 4.0 NaN b 0.2
4 NaN NaN NaN NaN
7 8.0 NaN NaN NaN
2.2 讀取列
直接文件名+列名:
print(excel['number1'])
>>> 0 1.0
1 3.0
2 4.0
3 2.0
4 NaN
5 5.0
6 9.0
7 8.0
Name: number1, dtype: float64
'''讀取多列要再加一個(gè)[]'''
print(excel[['number1', 'letter']].head())
>>>
number1 letter
0 1.0 a
1 3.0 c
2 4.0 b
3 2.0 d
4 NaN NaN
如果要讀取列的表頭:
print(excel.columns)
>>> Index(['number1', 'Unnamed: 1', 'letter', 'number2'], dtype='object')
這樣出來(lái)的是名稱和類型,但是有時(shí)候只想要列表模式的名稱就行,所以在上面的代碼還要再加上.tolist()
:
columns_names = excel.columns.tolist()
print(columns_names)
>>> ['number1', 'Unnamed: 1', 'letter', 'number2']
這時(shí)候其實(shí)可以針對(duì)列名來(lái)篩選列,比如篩選數(shù)字列:
new = []
for i in columns_names:
if 'number' in i:
new.append(i)
print(new) # 這就是篩選出的列名
>>> ['number1', 'number2']
print(excel[new].head())
>>>
number1 number2
0 1.0 0.1
1 3.0 0.5
2 4.0 0.2
3 2.0 NaN
4 NaN NaN
或者篩選末尾有特定字符的列名(比如末尾都有某種計(jì)量單位),在這里篩選末尾都是1的列:
new = []
for i in columns_names:
if i.endswith('1'):
new.append(i)
print(new) # 這就是篩選出的列名
>>> ['number1', 'Unnamed: 1']
print(excel[new].head())
>>>
number1 Unnamed: 1
0 1.0 NaN
1 3.0 NaN
2 4.0 NaN
3 2.0 NaN
4 NaN NaN
3 數(shù)據(jù)處理
3.1 加減乘除
要保證運(yùn)算的是相同元素?。?!
3.1.1 列 與 元素
- 對(duì)于全為
NaN
的空列,不管怎么運(yùn)算,結(jié)果都還是NaN
,且類型為float
:
print(excel['Unnamed: 1'].head() + 10)
print(excel['Unnamed: 1'].head() - 10)
print(excel['Unnamed: 1'].head() * 10)
print(excel['Unnamed: 1'].head() / 10)
print(excel['Unnamed: 1'].head() + '10ab')
print(excel['Unnamed: 1'].head() - '10ab')
print(excel['Unnamed: 1'].head() * '10ab')
print(excel['Unnamed: 1'].head() / '10ab')
'''結(jié)果全都一樣'''
>>> 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: Unnamed: 1, dtype: float64
-
float、int、str
+ 元素:
print(excel['number1'].head() + 10) # 數(shù)字+數(shù)字
>>> 0 11.0
1 13.0
2 14.0
3 12.0
4 NaN
Name: number1, dtype: float64
print(excel['letter'].head() + '012ab') # 字符串+字符串
>>> 0 a012ab
1 b012ab
2 c012ab
3 d012ab
4 NaN
Name: letter, dtype: object
-
float、int、str
- 元素:
print(excel[['number1', 'number2']].head() - 10) # 數(shù)字-數(shù)字
>>>
number1 number2
0 -9.0 -9.9
1 -7.0 -9.5
2 -6.0 -9.8
3 -8.0 NaN
4 NaN NaN
-
float、int、str
×、÷ 元素:
×、÷里面的元素只能是數(shù)字!
print(excel[['number1', 'letter']] * 10)
>>>
number1 letter
0 10.0 aaaaaaaaaa
1 30.0 cccccccccc
2 40.0 bbbbbbbbbb
3 20.0 dddddddddd
4 NaN NaN
5 50.0 jjjjjjjjjj
6 90.0 hhhhhhhhhh
7 80.0 NaN
print(excel[['number1']] / 10)
>>>
number1
0 0.1
1 0.3
2 0.4
3 0.2
4 NaN
5 0.5
6 0.9
7 0.8
3.1.2 列 與 列
列 與 列 的運(yùn)算中,哪一列的行更多,結(jié)果的行就取決于哪一列。
- 對(duì)于全為
NaN
的空列,不管怎么運(yùn)算,結(jié)果都還是NaN
,但是類型會(huì)變:
print(excel['number1'].head() + excel['Unnamed: 1'].head())
print(excel['number1'].head() - excel['Unnamed: 1'].head())
print(excel['number1'].head() * excel['Unnamed: 1'].head())
print(excel['number1'].head() / excel['Unnamed: 1'].head())
>>> 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
print(excel['letter'].head() + excel['Unnamed: 1'].head())
print(excel['letter'].head() - excel['Unnamed: 1'].head())
print(excel['letter'].head() * excel['Unnamed: 1'].head())
print(excel['letter'].head() / excel['Unnamed: 1'].head())
>>> 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: object
類型取決于與Unnamed
列運(yùn)算的那一列。
\quad
-
float、int
加減乘除float、int
:
'''與自己運(yùn)算,就會(huì)給出列名和類型'''
print(excel['number1'].head() + excel['number1'])
>>> 0 2.0
1 6.0
2 8.0
3 4.0
4 NaN
Name: number1, dtype: float64
'''與別人運(yùn)算,只會(huì)給類型'''
print(excel['number1'].head() - excel['number2'])
>>> 0 0.9
1 2.5
2 3.8
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
'''總行數(shù)取決于行數(shù)最多的那個(gè)'''
print(excel['number1'].head() * excel['number2'])
>>> 0 0.1
1 1.5
2 0.8
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
print(excel['number1'].head() / excel['number2'].head())
>>> 0 10.0
1 6.0
2 20.0
3 NaN
4 NaN
dtype: float64
\quad
-
float、int
與str
的列不能運(yùn)算 -
str
+str
(只有這一種)
print(excel['letter'].head() + excel['letter'].head())
>>> 0 aa
1 cc
2 bb
3 dd
4 NaN
Name: letter, dtype: object
3.2 最值、總和、歸一化
print(excel['number1'].max(), excel['number1'].min(), excel['number1'].sum())
>>> 9.0 1.0 32.0
歸一化:除以最大值
print(excel['number1'] / excel['number1'].max())
>>> 0 0.111111
1 0.333333
2 0.444444
3 0.222222
4 NaN
5 0.555556
6 1.000000
7 0.888889
Name: number1, dtype: float64
3.3 .sort_values()
排序
用法:DataFrame.sort_values(by='##',axis=0,ascending=True, inplace=False, na_position='last')
by=''
:針對(duì)列:指定列名by='number1'
(一般為了偷懶都直接寫(xiě)列名)
針對(duì)行:指定索引by=1
axis=0
:若axis=0或'index'
,則按列排序;
若axis=1或'columns'
,則按指定索引行排序ascending=True
:排列順序,默認(rèn)升序排列inplace=False
:是否用排序后的數(shù)據(jù)集替換原來(lái)的數(shù)據(jù),默認(rèn)不替換na_position='last'
:{'first','last'}
,設(shè)定缺失值的顯示位置(開(kāi)始還是末尾)
-
'number1'
,按列排序,升序排列,不替換,缺失值末尾顯示
print(excel.sort_values('number1'))
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
3 2.0 NaN d NaN
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
5 5.0 NaN j NaN
7 8.0 NaN NaN NaN
6 9.0 NaN h NaN
4 NaN NaN NaN NaN
可以看到,行與行之間按照排序重新?lián)Q了位置。
此時(shí)索引發(fā)生了變化,我們可以重新安排索引:
sorted_excel = excel.sort_values('number1') # 排序后的
sorted_excel_reindexed = sorted_excel.reset_index(drop=True) # 把之前的index給drop掉
print(sorted_excel_reindexed)
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 2.0 NaN d NaN
2 3.0 NaN c 0.5
3 4.0 NaN b 0.2
4 5.0 NaN j NaN
5 8.0 NaN NaN NaN
6 9.0 NaN h NaN
7 NaN NaN NaN NaN
-
'letter'
,按列排序,降序排列,不替換,缺失值顯示在開(kāi)始
print(excel.sort_values('letter', ascending=False, na_position='first'))
>>>
number1 Unnamed: 1 letter number2
4 NaN NaN NaN NaN
7 8.0 NaN NaN NaN
5 5.0 NaN j NaN
6 9.0 NaN h NaN
3 2.0 NaN d NaN
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
0 1.0 NaN a 0.1
- 按第7行排序,升序排列,不替換,缺失值顯示在開(kāi)頭
print(excel.sort_values(by=7, axis='columns', na_position='first'))
>>>
Unnamed: 1 letter number2 number1
0 NaN a 0.1 1.0
1 NaN c 0.5 3.0
2 NaN b 0.2 4.0
3 NaN d NaN 2.0
4 NaN NaN NaN NaN
5 NaN j NaN 5.0
6 NaN h NaN 9.0
7 NaN NaN NaN 8.0
這里是按照最后一行排序,可以看到,所有列都隨著第七行變化了順序。
3.4 缺失值處理
首先要找到缺失值:
number1 = excel['number1']
number1_isnull = pd.isnull(number1) # 判斷是否NaN
print(number1_isnull) # 布爾類型
>>> 0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
Name: number1, dtype: bool
print(number1[number1_isnull]) # 輸出True的
>>> 4 NaN
Name: number1, dtype: float64
然后找出好的值:
good_number1 = number1[number1_isnull==False]
print(good_number1) # 輸出不為缺失值的,也就是好的
>>> 0 1.0
1 3.0
2 4.0
3 2.0
5 5.0
6 9.0
7 8.0
Name: number1, dtype: float64
print(sum(good_number1) / len(good_number1)) # 均值
>>> 4.571428571428571
上面是求均值,其實(shí)已經(jīng)有函數(shù)直接解決了:
print(excel['number1'].mean())
>>> 4.571428571428571
4 常用函數(shù)
4.1 pivot_table 找出特定索引對(duì)應(yīng)的特定值
換一個(gè)表格:
import pandas as pd
excel = pd.read_csv('new.csv')
print(excel)
>>>
Survived Pclass Age Fare Embarked
0 0 3 22 7.2500 S
1 1 1 38 71.2833 C
2 1 2 26 7.9250 S
3 0 1 35 53.1000 S
4 1 3 35 8.0500 S
5 0 2 28 7.2000 C
找出特定索引對(duì)應(yīng)的特定值:
# 統(tǒng)計(jì)各個(gè)等級(jí)的船艙(Pclass)中,獲救的幾率,按平均值計(jì)算
passenger_survival = excel.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival)
>>>
Survived
Pclass
1 0.5
2 0.5
3 0.5
# 統(tǒng)計(jì)各個(gè)船艙中乘客的年齡
passenger_age = excel.pivot_table(index='Pclass', values='Age') # aggfunc沒(méi)寫(xiě),就是默認(rèn)mean
print(passenger_age)
>>>
Pclass
1 36.5
2 27.0
3 28.5
# 統(tǒng)計(jì)不同登船地點(diǎn)(C、S)的價(jià)格和存活人數(shù)
port_stats = excel.pivot_table(index='Embarked', values=['Fare', 'Survived'], aggfunc=np.sum)
print(port_stats)
>>>
Fare Survived
Embarked
C 78.4833 1
S 76.3250 2
4.2 刪去缺失值多的那列或行
原始表格:
import pandas as pd
import numpy as np
excel = pd.read_csv('excel.csv')
print(excel)
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
5 5.0 NaN j NaN
6 9.0 NaN h NaN
7 8.0 NaN NaN NaN
刪除列中含有 NaN 值較多的列:
# thresh 參數(shù)表示一個(gè)列中至少要有多少個(gè)非 NaN 值,否則該列將被刪除。
# 上面代碼中的 thresh=len(df)*0.8 表示一個(gè)列中非 NaN 值數(shù)量至少要占該列總數(shù)量的 80% 才會(huì)保留,這可以根據(jù)實(shí)際情況進(jìn)行調(diào)整。
drop1_excel = excel.dropna(thresh=len(excel)*0.7, axis='columns')
print(drop1_excel)
>>>
number1 letter
0 1.0 a
1 3.0 c
2 4.0 b
3 2.0 d
4 NaN NaN
5 5.0 j
6 9.0 h
7 8.0 NaN
刪掉指定列中,含有NaN的行:
drop1_excel = excel.dropna(axis=0, subset = ['letter', 'number2'])
print(drop1_excel)
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
4.3 定位
# 定位第7號(hào)樣本的number1是什么內(nèi)容
location = excel.loc[7, 'number1']
print(location) # 8.0
4.4 自定義函數(shù)apply
# 自定義一個(gè)想實(shí)現(xiàn)的功能的函數(shù)
def find(x): # 尋找第3行的樣本
a = x.loc[3]
return a
my = excel.apply(find)
print(my)
>>>
number1 2.0
Unnamed: 1 NaN
letter d
number2 NaN
dtype: object
def not_null_count(x): # 統(tǒng)計(jì)每一列的缺失值
y = pd.isnull(x) # 函數(shù)用于判斷一個(gè) DataFrame 或 Series 中每個(gè)元素是否為空值,返回一個(gè)由 True/False 構(gòu)成的布爾型 DataFrame/Series
null = x[y] # 取出為 True 的元素
return len(null)
print(excel.apply(not_null_count)) # 打印出每個(gè)列有多少個(gè)缺失值
>>>
number1 1
Unnamed: 1 8
letter 2
number2 5
dtype: int64
def is_adult(x): # 判斷是否是成年人,表格是new
if x['Age'] > 18:
return True
else:
return False
print(excel.apply(is_adult, axis=1))
>>>
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
5 Series 結(jié)構(gòu)
參考:Pandas庫(kù)中的Series結(jié)構(gòu)
Series對(duì)象可以理解為由一列索引和一列值,共兩列數(shù)據(jù)組成的結(jié)構(gòu)。而DataFrame就是由一列索引和多列值組成的結(jié)構(gòu),其中,在DataFrame中的每一列都是一個(gè)Series對(duì)象。
Series(collection of values)
DataFrame(collection of Series objects)文章來(lái)源:http://www.zghlxwxcb.cn/news/detail-424434.html
5.1 從表格中提取 Series 結(jié)構(gòu)
import pandas as pd
excel = pd.read_csv('excel.csv')
print(excel)
>>>
number1 Unnamed: 1 letter number2
0 1.0 NaN a 0.1
1 3.0 NaN c 0.5
2 4.0 NaN b 0.2
3 2.0 NaN d NaN
4 NaN NaN NaN NaN
5 5.0 NaN j NaN
6 9.0 NaN h NaN
7 8.0 NaN NaN NaN
series_letter = excel['letter']
print(type(series_letter))
>>> <class 'pandas.core.series.Series'>
# 作為 Series 結(jié)構(gòu)直接提取
print(series_letter[0:5])
>>>
0 a
1 c
2 b
3 d
4 NaN
Name: letter, dtype: object
# 也可以作為 DataFrame 結(jié)構(gòu)
print(series_letter.loc[2])
>>> b
5.2 自己造一個(gè) Series 結(jié)構(gòu)
from pandas import Series
series_letter = excel['letter'].values
series_number1 = excel['number1'].values
series_new = Series(series_number1, index=series_letter) # 索引是number1,值是letter
print(series_new)
>>>
a 1.0
c 3.0
b 4.0
d 2.0
NaN NaN
j 5.0
h 9.0
NaN 8.0
dtype: float64
print(series_new[['b', 'j']]) # 相當(dāng)于額外造了一個(gè)索引,之前的索引一樣能用
>>>
b 4.0
j 5.0
dtype: float64
print(series_new[0:5])
>>>
a 1.0
c 3.0
b 4.0
d 2.0
NaN NaN
dtype: float64
照樣可以進(jìn)行排序等操作,參考Pandas庫(kù)中的Series結(jié)構(gòu)就行。文章來(lái)源地址http://www.zghlxwxcb.cn/news/detail-424434.html
到了這里,關(guān)于Python初學(xué)小知識(shí)(十四):數(shù)據(jù)分析處理庫(kù)Pandas的文章就介紹完了。如果您還想了解更多內(nèi)容,請(qǐng)?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!