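What Is Pyjanitor

Pyjanitor is a Python library that simplifies the data cleaning process. It is an extension of the well-known Pandas library and adds functionality for cleaning and preparing data. Pyjanitor is easy to use, efficient, and highly customizable, which makes it popular with data scientists and analysts.

Pyjanitor is a versatile library that offers a wide range of data cleaning features. Some of its main capabilities include: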

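Adding and removing columns
Renaming columns
Handling missing values
Filtering data
Grouping data
Reshaping data
Working with string and text data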

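Advantages of Pyjanitor

Some of the main advantages of using Pyjanitor for data cleaning include:

Simplifies the data cleaning process
Saves time and effort
Offers a wide range of data cleaning and preparation functions
Highly customizable and flexible
Compatible with Pandas and other popular Python libraries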

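Using Pyjanitor

Suppose we have a dataset of employees and their salaries. The dataset has some missing values, and some of the column names are inconsistent. First, install the library.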

pip install pyjanitor

Now let's see how to clean the dataset with Pyjanitor:

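import pandas as pd
import janitor  # the pyjanitor package is imported as "janitor"

# Read the dataset
df = pd.read_csv('employees.csv')

# Standardize the column names (lowercase, underscores instead of spaces)
df = df.clean_names()

# Fill missing values in the salary column with the median salary
df = df.impute('salary', statistic_column_name='median')

# Drop the unnecessary columns
df = df.remove_columns(['ssn', 'dob'])

# Convert the salary to a float
df['salary'] = df['salary'].astype(float)

# Sort the dataframe by the salary column in descending order
df = df.sort_values(by='salary', ascending=False)

# Save the cleaned dataframe to a new CSV file
df.to_csv('cleaned_employees.csv', index=False)

In this example, we first import the necessary libraries. Importing janitor is what registers Pyjanitor's methods on Pandas DataFrames. We then read the dataset with Pandas' read_csv function and standardize the column names with Pyjanitor's clean_names method. Next, the impute method fills any missing values in the salary column with the median salary, and remove_columns drops the columns we do not need.

After that, the astype method converts the salary column to floats. Finally, sort_values sorts the DataFrame by salary in descending order, and to_csv saves the cleaned DataFrame to a new CSV file.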

Here is another simple example that demonstrates how to use Pyjanitor:

import pandas as pd
import janitor

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 30],
    'Salary': [50000, 60000, 75000]
})

# Cleaning operations with Pyjanitor
cleaned_df = (
    df.clean_names()  # Standardize column names (e.g. 'Name' -> 'name')
      .remove_empty()  # Drop rows and columns that are entirely empty
      .set_index('name')  # Set the 'name' column as the index
      .rename_column('age', 'years_old')  # Rename the 'age' column
)

print(cleaned_df)

# output -
#         years_old  salary
# name
# Alice         25.0   50000
# Bob            NaN   60000
# Charlie       30.0   75000

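Let's explore more of Pyjanitor's features.

Reshaping Data

Beyond cleaning data, Pyjanitor can also be used to reshape and transform it. It provides functions for reshaping data in several ways, such as pivoting, melting, and splitting columns.

Here is an example of reshaping data with Pyjanitor: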

import pandas as pd
import janitor

# Reading the dataset
df = pd.read_csv('my_dataset.csv')

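Suppose the dataset has the following columns: id, date, type_1, type_2, value_1, and value_2. We want to reshape the data so that the numbered type and value columns are stacked into a single type column and a single value column, with one row per numbered pair. One way to do this is with Pyjanitor's pivot_longer() method, where the special '.value' entry in names_to keeps the part of each column name before the separator as an output column:

df = df.pivot_longer(
    index=['id', 'date'],
    names_to=('.value', 'slot'),  # 'slot' is an illustrative name for the numeric suffix
    names_sep='_',
)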

print(df.head())

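Working with String and Text Data

Let's use another dataset to demonstrate how to work with string and text data. Suppose we have a dataset of movie information, including title, year, and genre.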

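However, the genre column mixes uppercase and lowercase letters and contains stray whitespace. We want to standardize it so that every genre is in title case with no leading or trailing whitespace. One way to do this is to pass Pandas' string methods to Pyjanitor's transform_column() method:

# Read the dataset
df = pd.read_csv('movies.csv')

# Clean the genre column: trim whitespace, then convert to title case
df = df.clean_names().transform_column(
    'genre',
    lambda s: s.str.strip().str.title(),
    elementwise=False,  # apply the function to the whole column at once
)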

print(df.head())

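The genre column is now standardized: every genre is in title case, with no leading or trailing whitespace. Pyjanitor addresses this kind of problem by providing a set of functions that simplify and automate the data cleaning process.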