Pandas Chinese Official Documentation: Basic Usage, Part 3

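Function application

To apply your own or another library's functions to pandas objects, you should be aware of the four methods below.

The appropriate method depends on whether your function operates on an entire DataFrame or Series, row-wise or column-wise, or element-wise.

1. Table-wise function application: `pipe()`

2. Row- or column-wise function application: `apply()`

3. Aggregation API: `agg()` and `transform()`

4. Element-wise function application: `applymap()`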

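Table-wise function application

DataFrame and Series objects can, of course, simply be passed to functions.

However, when the functions need to be called in a chain, consider using the pipe() method.

Compare the following two approaches: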

# f, g, and h are functions taking and returning ``DataFrames``

>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

The following is equivalent to the code above:

>>> (df.pipe(h)
...    .pipe(g, arg1=1)
...    .pipe(f, arg2=2, arg3=3))

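pandas encourages the second style, which is known as method chaining.

pipe makes it easy to use your own or another library's functions in a method chain, alongside pandas' own methods.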
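As a hedged illustration of the two styles, the sketch below chains three small helpers. The helpers (drop_missing, add_ratio, top_rows) and the sample data are hypothetical and not part of pandas; they only stand in for h, g, and f.

```python
import pandas as pd


def drop_missing(df):
    """Hypothetical helper: remove rows that contain missing values."""
    return df.dropna()


def add_ratio(df, numerator, denominator):
    """Hypothetical helper: add a 'ratio' column computed from two columns."""
    return df.assign(ratio=df[numerator] / df[denominator])


def top_rows(df, column, n=2):
    """Hypothetical helper: keep the n rows with the largest values in column."""
    return df.nlargest(n, column)


raw = pd.DataFrame({"hits": [10.0, 20.0, None, 40.0], "games": [5, 8, 7, 10]})

# Nested calls have to be read from the inside out.
nested = top_rows(add_ratio(drop_missing(raw), "hits", "games"), "ratio", n=2)

# The same pipeline written with pipe() reads from top to bottom.
piped = (raw.pipe(drop_missing)
            .pipe(add_ratio, "hits", "games")
            .pipe(top_rows, "ratio", n=2))

print(piped)
```

Both expressions build the same DataFrame; only the reading order differs.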

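In the example above, the functions f, g, and h each expect the DataFrame as the first positional argument.

What if the function you want to apply takes its data as, say, the second argument?

In that case, pass pipe a tuple of (callable, data_keyword).

.pipe routes the DataFrame to the argument named in the tuple.

For example, we can fit a regression using statsmodels.

Its API expects a formula first and a DataFrame as the second argument, data.

We pass the function and keyword pair (sm.ols, 'data') to pipe: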

In [138]: import statsmodels.formula.api as sm

In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [140]: (bb.query('h > 0')
   .....:    .assign(ln_h=lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....:  )
   .....:

Out[140]:
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Thu, 22 Aug 2019   Prob (F-statistic):           3.48e-15
Time:                        15:48:59   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

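The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the popular %>% operator (read as "pipe") for R.

The implementation of pipe here is quite clean and feels right at home in Python.

We strongly encourage you to read the source code of pipe().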
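The dispatch logic for the tuple form can be sketched in a few lines. The function simple_pipe below is a simplified illustration written for this section, not the pandas source code, and scale is a hypothetical function that takes its data as the second argument.

```python
import pandas as pd


def simple_pipe(obj, func, *args, **kwargs):
    """Simplified sketch of pipe-style dispatch (not the pandas implementation)."""
    if isinstance(func, tuple):
        # (callable, data_keyword): pass obj under the named keyword.
        func, target = func
        if target in kwargs:
            raise ValueError(f"{target} is both the pipe target and a keyword argument")
        kwargs[target] = obj
        return func(*args, **kwargs)
    # Plain callable: obj becomes the first positional argument.
    return func(obj, *args, **kwargs)


def scale(factor, data):
    """Hypothetical function whose data argument comes second."""
    return data * factor


df = pd.DataFrame({"x": [1, 2, 3]})

via_sketch = simple_pipe(df, (scale, "data"), 10)
via_pandas = df.pipe((scale, "data"), 10)

print(via_pandas)
```

Under these assumptions both calls end up evaluating scale(10, data=df).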

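Row- or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument.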

In [141]: df.apply(np.mean)
Out[141]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [142]: df.apply(np.mean, axis=1)
Out[142]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

In [143]: df.apply(lambda x: x.max() - x.min())
Out[143]:
one      1.051928
two      1.632779
three    1.840607
dtype: float64

In [144]: df.apply(np.cumsum)
Out[144]:
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873

In [145]: df.apply(np.exp)
Out[145]:
        one       two     three
a  4.034899  5.885648       NaN
b  1.409244  6.767440  0.950858
c  2.004201  4.385785  3.412466
d       NaN  1.322262  0.541630

The apply() method also dispatches on a string method name.

In [146]: df.apply('mean')
Out[146]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [147]: df.apply('mean', axis=1)
Out[147]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64
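Because the df used above is defined earlier in the guide, here is a small self-contained sketch of the same ideas. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [4.0, 5.0, 6.0, 7.0]},
    index=["a", "b", "c", "d"],
)

# Default axis=0: the function receives each column and yields one value per column.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function (here named by a string) is applied to each row.
row_mean = df.apply("mean", axis=1)

print(col_range)
print(row_mean)
```

Row d contains a missing value in column one; the mean skips it, just as df.mean(axis=1) does.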

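By default, the return type of the function passed to apply() determines the type of the final output of DataFrame.apply:

∙ If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.

∙ If the applied function returns any other type, the final output is a Series.

This default behaviour can be overridden with result_type, which accepts three options: reduce, broadcast, and expand.

These options determine whether list-like return values are expanded into a DataFrame.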
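The rules above can be illustrated with a short sketch. The sample DataFrame and the lambdas are made up for illustration, and the functions are applied row-wise (axis=1).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Returning a Series per row gives a DataFrame; its columns are the Series index.
as_frame = df.apply(
    lambda row: pd.Series({"total": row["a"] + row["b"], "diff": row["b"] - row["a"]}),
    axis=1,
)

# Returning a list per row gives a Series of lists by default.
as_lists = df.apply(lambda row: [row["a"], row["b"]], axis=1)

# result_type="expand" turns each list-like result into columns.
expanded = df.apply(lambda row: [row["a"], row["b"]], axis=1, result_type="expand")

# result_type="broadcast" keeps the original index and column labels.
broadcast = df.apply(lambda row: [0, 1], axis=1, result_type="broadcast")

print(as_frame)
print(expanded)
print(broadcast)
```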

With clever use of apply() you can answer many questions about a data set.

For example, suppose we want to extract the date on which the maximum value of each column occurred:

In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=1000))
   .....:

In [149]: tsdf.apply(lambda x: x.idxmax())
Out[149]:
A   2000-08-06
B   2001-01-18
C   2001-07-18
dtype: datetime64[ns]

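You may also pass additional positional and keyword arguments to the apply() method.

For instance, consider the following function that you would like to apply:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You may then apply this function as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)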
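A runnable version of this call, using a small made-up DataFrame, might look as follows. The column values are assumptions chosen so that the arithmetic is easy to check.

```python
import pandas as pd


def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide


df = pd.DataFrame({"one": [5.0, 8.0, 11.0], "two": [2.0, 5.0, 14.0]})

# args supplies extra positional arguments (sub=5);
# extra keyword arguments (divide=3) are forwarded to the function as well.
result = df.apply(subtract_and_divide, args=(5,), divide=3)

print(result)
```

Each column is passed as x, so every value v becomes (v - 5) / 3.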

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [150]: tsdf
Out[150]:
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]:
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659  0.092225
2000-01-05 -0.987349 -0.622526  0.321243
2000-01-06 -0.876100 -0.355392  0.550262
2000-01-07 -0.764851 -0.088259  0.779280
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

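Finally, apply() takes an argument raw, which is False by default; in that case each row or column is converted to a Series before the function is applied.

When raw is set to True, the passed function receives an ndarray object instead, which can noticeably improve performance if you do not need the indexing functionality.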
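The difference in what the function receives can be observed directly. In the sketch below, spread and the two recording lists are hypothetical helpers introduced only to capture the argument types; no timing claim is made.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 4.0, 2.0], "two": [10.0, 30.0, 20.0]})

seen_default, seen_raw = [], []


def spread(values, seen):
    """Record the type of the incoming column, then return max - min."""
    seen.append(type(values).__name__)
    return values.max() - values.min()


default_result = df.apply(spread, args=(seen_default,))       # receives Series
raw_result = df.apply(spread, args=(seen_raw,), raw=True)     # receives ndarray

print(set(seen_default), set(seen_raw))
print(raw_result)
```

The function body works in both modes because it only uses max() and min(), which exist on Series and ndarray alike.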

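Aggregation API

New in version 0.20.0.

The aggregation API lets you express one or more aggregation operations in a single, concise way.

The API is similar across pandas objects; see the groupby API, the window functions API, and the resample API.

The entry point for aggregation is DataFrame.aggregate(), or its alias DataFrame.agg().

We will use a DataFrame similar to the one in the previous examples: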

In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [153]: tsdf.iloc[3:7] = np.nan

In [154]: tsdf
Out[154]:
                   A         B         C
2000-01-01  1.257606  1.004194  0.167574
2000-01-02 -0.749892  0.288112 -0.757304
2000-01-03 -0.207550 -0.298599  0.116018
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.814347 -0.257623  0.869226
2000-01-09 -0.250663 -1.206601  0.896839
2000-01-10  2.169758 -1.333363  0.283157

Using a single function is equivalent to apply(). You can also pass the name of an aggregation method as a string.

These calls return a Series of the aggregated output:

In [155]: tsdf.agg(np.sum)
Out[155]:
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

In [156]: tsdf.agg('sum')
Out[156]:
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

# Because a single function is used, this is equivalent to calling `.sum()`
In [157]: tsdf.sum()
Out[157]:
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

A single aggregation on a Series returns a scalar value:

In [158]: tsdf.A.agg('sum')
Out[158]: 3.033606102414146
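Since the values above come from random data, the following sketch repeats the comparison with a small made-up DataFrame so that the results can be checked by hand.

```python
import numpy as np
import pandas as pd

tsdf = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

by_name = tsdf.agg("sum")      # Series: one aggregated value per column
direct = tsdf.sum()            # same result through the dedicated method
scalar = tsdf["A"].agg("sum")  # a single aggregation on a Series gives a scalar

print(by_name)
print(scalar)
```

Missing values are skipped by sum, so column A aggregates to 4.0 and column B to 9.0.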

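Aggregating with multiple functions

You can pass multiple aggregation functions as a list.

The result of each function becomes a row in the resulting DataFrame, and each row is named after its aggregation function.

In [159]: tsdf.agg(['sum'])
Out[159]:
            A         B        C
sum  3.033606 -1.803879  1.57551

Multiple functions yield multiple rows:

In [160]: tsdf.agg(['sum', 'mean'])
Out[160]:
             A         B         C
sum   3.033606 -1.803879  1.575510
mean  0.505601 -0.300647  0.262585

On a Series, multiple functions return a Series indexed by the function names:

In [161]: tsdf.A.agg(['sum', 'mean'])
Out[161]:
sum     3.033606
mean    0.505601
Name: A, dtype: float64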

Passing a lambda function yields a row named <lambda>:

In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])
Out[162]:
sum         3.033606
<lambda>    0.505601
Name: A, dtype: float64

Passing a named function yields a row named after that function:

In [163]: def mymean(x):
   .....:     return x.mean()
   .....:

In [164]: tsdf.A.agg(['sum', mymean])
Out[164]:
sum       3.033606
mymean    0.505601
Name: A, dtype: float64
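The naming rules for list aggregation can be checked with fixed data. The Series values below are made up for illustration.

```python
import pandas as pd


def mymean(x):
    return x.mean()


s = pd.Series([1.0, 2.0, 3.0, 6.0], name="A")

# A string, an anonymous function, and a named function in one call.
result = s.agg(["sum", lambda x: x.mean(), mymean])

print(result)
```

The index of the result carries the label used for each function: the string itself, <lambda> for the anonymous function, and the function name for mymean.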

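Aggregating with a dict

To control which functions are applied to which columns, pass DataFrame.agg a dictionary that maps column names to a scalar or a list of scalars.

Note: the order of the results is not guaranteed. To make the output order match the input order, use an OrderedDict.

In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})
Out[165]:
A    0.505601
B   -1.803879
dtype: float64

Passing a list-like generates a DataFrame output: a matrix-like result of all the aggregators, whose rows consist of all unique functions.

Functions that were not requested for a particular column produce NaN:

In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
Out[166]:
             A         B
mean  0.505601       NaN
min  -0.749892       NaN
sum        NaN -1.803879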
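A self-contained sketch of both dictionary forms follows. The data is made up, and the row order of the second result is deliberately not relied upon.

```python
import numpy as np
import pandas as pd

tsdf = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                     "B": [10.0, 20.0, 30.0],
                     "C": [0.5, 0.5, 0.5]})

# Scalars only: one value per listed column; column C is not aggregated at all.
per_column = tsdf.agg({"A": "mean", "B": "sum"})

# A list for at least one column: DataFrame output with NaN where a
# function was not requested for that column.
matrix = tsdf.agg({"A": ["mean", "min"], "B": "sum"})

print(per_column)
print(matrix)
```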

Mixed dtypes

When a DataFrame contains mixed dtypes, some of which cannot be aggregated, .agg only takes the valid aggregations.

This is similar to how .agg works on groupby:

In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....:

In [168]: mdf.dtypes
Out[168]:
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [169]: mdf.agg(['min', 'sum'])
Out[169]:
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT
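The handling of columns that cannot be aggregated may differ between pandas versions, so one defensive option is to select the columns explicitly before aggregating. The sketch below assumes that only the numeric columns are of interest; that restriction is an assumption made for illustration, not something the text above prescribes.

```python
import pandas as pd

mdf = pd.DataFrame({"A": [1, 2, 3],
                    "B": [1.0, 2.0, 3.0],
                    "C": ["foo", "bar", "baz"],
                    "D": pd.date_range("20130101", periods=3)})

# Keep only numeric columns (A and B) so every requested aggregation is valid.
numeric = mdf.select_dtypes(include="number")
result = numeric.agg(["min", "sum"])

print(result)
```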

Custom Desc
