Python Data Analysis Technology Stack, Chapter 03: 03 Visualizing various levels of data

03 Visualizing various levels of data

Whenever you need to analyze data, first determine whether it is structured or unstructured. If it is unstructured, convert it to a structured form with rows and columns, which makes further analysis with libraries like Pandas easier. Once the data is in this format, assign each feature (column) to one of the four levels of data and choose your analysis accordingly.
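As a rough illustration of the conversion step, the sketch below turns a few free-text records into a table with rows and columns. The record format, the `raw_lines` sample, and the `parse_line` helper are all hypothetical; real unstructured data will usually need more careful parsing.

```python
import pandas as pd

# Hypothetical free-text records; this "key=value; key=value" format is an
# assumption made for illustration only.
raw_lines = [
    "name=Allen; age=29; class=1",
    "name=Braund; age=22; class=3",
    "name=Heikkinen; age=26; class=3",
]


def parse_line(line):
    """Split one 'key=value; key=value' line into a dictionary."""
    pairs = (item.split("=", 1) for item in line.split(";"))
    return {key.strip(): value.strip() for key, value in pairs}


# One dictionary per record becomes one row; the keys become the columns.
df = pd.DataFrame([parse_line(line) for line in raw_lines])

# Parsed values are strings, so convert the numeric columns explicitly.
df["age"] = df["age"].astype(int)
df["class"] = df["class"].astype(int)

print(df)
```

Once the data is in a DataFrame, each column can be inspected and assigned to a level of data.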

Note that in this chapter, we only aim to understand how to categorize the variables in a dataset and identify the operations and plots that apply to each category. The actual code needed to visualize the data is explained in Chapter 7.

We look at how to classify the features and perform various operations using the famous Titanic dataset. The dataset can be imported from here: https://github.com/DataRepo2019/Data-files/blob/master/titanic.csv

Background information about the dataset: The RMS Titanic, a British passenger ship, sank on its maiden voyage from Southampton to New York on 15th April 1912, after colliding with an iceberg. Of the estimated 2,224 passengers and crew on board, more than 1,500 died. This dataset describes the survival status of the passengers and other details about them, including their class, name, age, and the number of relatives.

The features in this dataset, classified according to the data level, are captured in Table 4-1.

Let us now understand the rationale behind the classification of the features in this dataset.

Nominal variables: Variables like “PassengerId”, “Survived”, “Name”, “Sex”, “Cabin”, and “Embarked” have no intrinsic ordering of their values. Some of these variables, such as “PassengerId” and “Survived”, hold numeric values, but the numbers only serve as labels, so arithmetic operations like addition, subtraction, multiplication, or division on them are not meaningful. The most common operation on nominal variables is counting. The Pandas method value_counts (discussed in the next chapter) returns the number of values in each unique category of a nominal variable. We can also find the mode (the most frequently occurring value). Bar graphs are frequently used to visualize nominal data (pie charts can also be used), as shown in Figure 4-5.
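A minimal sketch of counting and finding the mode for a nominal variable follows. The values are a small hypothetical sample standing in for the “Embarked” column, not rows taken from the actual file.

```python
import pandas as pd

# Hypothetical sample of the nominal "Embarked" column (port codes).
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

# Count how many rows fall into each unique category.
counts = embarked.value_counts()

# mode() returns a Series because several values can tie for most frequent;
# take the first entry.
most_common = embarked.mode()[0]

print(counts)
print("Mode:", most_common)
```

With matplotlib installed, `counts.plot(kind="bar")` would draw these counts as a bar graph.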

Ordinal variables: “Pclass” (or Passenger Class) is an ordinal variable since its values follow an order. A value of 1 stands for first class, 2 for second class, and so on. These class values are indicative of socioeconomic status.

We can find out the median value and percentiles. We can also count the number of values in each category, calculate the mode, and use plots like bar graphs and pie charts, just as we did for nominal variables.

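The operations listed for ordinal variables can be sketched as below. The “Pclass” values are a hypothetical sample. The `interpolation="nearest"` argument is a choice made for this sketch so that each percentile is an actual class value rather than an interpolated number.

```python
import pandas as pd

# Hypothetical sample of the ordinal "Pclass" column (1 = first class, etc.).
pclass = pd.Series([3, 1, 3, 1, 3, 2, 3, 2, 3], name="Pclass")

# The median is meaningful because the values have an order.
median_class = pclass.median()

# Percentiles (here the quartiles), snapped to values that occur in the data.
quartiles = pclass.quantile([0.25, 0.5, 0.75], interpolation="nearest")

# Counting and the mode work exactly as they do for nominal variables.
counts = pclass.value_counts().sort_index()
most_common = pclass.mode()[0]

print("Median:", median_class)
print(quartiles)
print(counts)
print("Mode:", most_common)
```

Taking the mean of “Pclass” would run without error, but the result is hard to interpret because the gaps between classes are not known to be equal.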

In Figure 4-6, we have used a pie chart for the ordinal variable “Pclass”.

Ratio data: The “Age” and “Fare” variables are examples of ratio data, with the value zero as a reference point. With this type of data, we can perform a wide range of mathematical operations. For example, we can add all the fares and divide the sum by the total number of passengers to find the mean. We can also calculate the standard deviation. A histogram, as shown in Figure 4-7, can be used to visualize this kind of continuous data and understand its distribution.
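The mean and standard deviation described above can be computed as in this sketch. The fares are hypothetical sample values, not rows from the actual file.

```python
import pandas as pd

# Hypothetical sample of the ratio-level "Fare" column.
fare = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05], name="Fare")

# Mean by hand: add all the fares and divide by the number of values.
# count() excludes missing values, which sum() also skips.
mean_by_hand = fare.sum() / fare.count()

# The built-in method gives the same result.
mean_builtin = fare.mean()

# std() computes the sample standard deviation (ddof=1) by default.
std_dev = fare.std()

print("Mean:", mean_by_hand)
print("Standard deviation:", std_dev)
```

With matplotlib installed, `fare.plot(kind="hist")` would draw a histogram of the same values.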

In the preceding plots, we looked at graphs for individual categorical or continuous variables. In the following section, we look at which graphs to use when we have more than one variable, or a combination of variables belonging to different scales or levels.

Plotting mixed data

In this section, we’ll consider three scenarios, each involving two variables that may or may not belong to the same level, and discuss which plot to use for each scenario (using the same Titanic dataset).

One categorical and one continuous variable: A box plot shows the distribution, symmetry, and outliers of a continuous variable. It can also show a continuous variable against a categorical variable. Figure 4-8 shows the distribution of ‘Age’ (a ratio variable) for each value of the nominal variable ‘Survived’ (0 for passengers who did not survive and 1 for those who did).
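One possible way to draw a grouped box plot like this is sketched below with matplotlib. The data is a small hypothetical sample, and this is not necessarily the approach the book uses in Chapter 7.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: one categorical and one continuous column.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 27.0, 2.0, 14.0],
})

# The numbers behind each box: quartiles, minimum, and maximum per group.
summary = df.groupby("Survived")["Age"].describe()

# Collect the ages of each group, then draw one box per group.
labels = []
groups = []
for key, ages in df.groupby("Survived")["Age"]:
    labels.append(str(key))
    groups.append(ages.to_numpy())

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(labels)
ax.set_xlabel("Survived")
ax.set_ylabel("Age")

print(summary)
```

Calling `fig.savefig(...)` with a file name of your choice would write the plot to disk.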

Both continuous variables: Scatter plots are used to depict the relationship between two continuous variables. In Figure 4-9, we plot two ratio variables, ‘Age’ and ‘Fare’, on the x and y axes to produce a scatter plot.

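A scatter plot of two continuous variables might be sketched as follows. The values are a hypothetical sample. Dropping rows with missing values before plotting is an assumption made here to keep both columns aligned.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of two ratio variables; one age is missing on purpose.
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 8.46, 53.10, 51.86, 21.08],
})

# Keep only the rows where both values are present.
plot_df = df.dropna(subset=["Age", "Fare"])

fig, ax = plt.subplots()
ax.scatter(plot_df["Age"], plot_df["Fare"])
ax.set_xlabel("Age")
ax.set_ylabel("Fare")

# Pearson correlation summarizes the linear relationship in a single number.
correlation = plot_df["Age"].corr(plot_df["Fare"])
print("Correlation:", correlation)
```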

Both categorical variables: Using a clustered bar chart (Figure 4-10), you can combine two categorical variables with the bars depicted side by side to represent every combination of values for the two variables.

We can also use a stacked bar chart to plot two categorical variables. The stacked bar chart in Figure 4-11 plots two categorical variables, “Pclass” and “Survived”.
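Both bar chart variants can be built from the same cross-tabulation of the two variables, as in this sketch. The data is a hypothetical sample, and `pd.crosstab` is one convenient way to get the counts rather than the method the book necessarily uses.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of two categorical columns.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Rows are Pclass values, columns are Survived values, and cells are counts.
table = pd.crosstab(df["Pclass"], df["Survived"])

fig, (ax_clustered, ax_stacked) = plt.subplots(1, 2, figsize=(10, 4))

# Clustered: one bar per combination, placed side by side.
table.plot(kind="bar", ax=ax_clustered, title="Clustered")

# Stacked: the same bars piled on top of each other within each class.
table.plot(kind="bar", stacked=True, ax=ax_stacked, title="Stacked")

print(table)
```

The clustered version makes it easier to compare individual combinations, while the stacked version also shows the total for each class.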

In summary, you can use a scatter plot for two continuous variables, a stacked or clustered bar chart for two categorical variables, and a box plot when you want to display a continuous variable across different values of a categorical variable.
