Pandas | 17 Handling Missing Data

HongLingLiu 2019-11-04 07:50:00

I. Checking for Missing Values

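```python
import pandas as pd
import numpy as np

# Build a 5x3 frame of random values, then reindex it so that
# the new rows b, d and g contain only NaN.
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print('\n')

# isnull() returns True wherever a value is missing.
print(df['one'].isnull())
```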

```
        one       two     three
a  0.036297 -0.615260 -1.341327
b       NaN       NaN       NaN
c -1.908168 -0.779304  0.212467
d       NaN       NaN       NaN
e  0.527409 -2.432343  0.190436
f  1.428975 -0.364970  1.084148
g       NaN       NaN       NaN
h  0.763328 -0.818729  0.240498


a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
```

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# notnull() is the complement of isnull(): True where a value is present.
print(df['one'].notnull())
```
Output:

```
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
```
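The boolean masks above are usually combined with aggregations. The following is a minimal sketch, using a small hand-made frame (hypothetical data, chosen so the counts are reproducible), of counting missing values per column and selecting the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical data: fixed values instead of random ones.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [np.nan, np.nan, 6.0, 7.0]},
    index=["a", "b", "c", "d"],
)

# Summing the boolean mask counts the True entries in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags rows that hold at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())

# notnull() is the element-wise complement of isnull().
masks_are_complementary = bool((df.notnull() == ~df.isnull()).all().all())
print(masks_are_complementary)
```

Here only row `c` is complete, so rows `a`, `b` and `d` are selected.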

II. Calculations with Missing Data

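• When summing, `NA` values are treated as `0`
• If every value is `NA`, the sum is `0` in pandas 0.22 and later (older versions returned `NaN`); pass `min_count=1` to get `NaN` instead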
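The two rules above can be checked on small Series. This is a sketch with made-up values, assuming pandas 0.22 or later, where `sum()` accepts `min_count`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.5])      # hypothetical values
all_na = pd.Series([np.nan, np.nan])

total = s.sum()                          # NaN is skipped: 1.0 + 2.5
strict_total = s.sum(skipna=False)       # NaN propagates into the result
empty_total = all_na.sum()               # nothing left to add, so 0.0
guarded_total = all_na.sum(min_count=1)  # requires at least one valid value

print(total, strict_total, empty_total, guarded_total)
```

`skipna=False` and `min_count=1` are the two switches to reach for when a silent `0` would hide the fact that data was missing.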

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print('\n')

# The NaN rows b, d and g are skipped when summing.
print(df['one'].sum())
```

```
        one       two     three
a -1.191036  0.945107 -0.806292
b       NaN       NaN       NaN
c  0.127794 -1.812588 -0.466076
d       NaN       NaN       NaN
e  2.358568  0.559081  1.486490
f -0.242589  0.574916 -0.831853
g       NaN       NaN       NaN
h -0.328030  1.815404 -1.706736


0.7247067964060545
```

```python
import pandas as pd

# No data is supplied, so every cell starts out as NaN.
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])

print(df)
print('\n')

print(df['one'].sum())
```

```
   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN


0
```

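III. Filling Missing Data

Pandas provides several methods for cleaning up missing values. The `fillna()` function can "fill in" `NA` values with non-null data in a few different ways.

Replacing NaN with a scalar value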

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

# Reindexing introduces row 'b', which is all NaN.
df = df.reindex(['a', 'b', 'c'])
print(df)
print('\n')

print("NaN replaced with '0':")
print(df.fillna(0))
```

```
        one       two     three
a -0.479425 -1.711840 -1.453384
b       NaN       NaN       NaN
c -0.733606 -0.813315  0.476788


NaN replaced with '0':
        one       two     three
a -0.479425 -1.711840 -1.453384
b  0.000000  0.000000  0.000000
c -0.733606 -0.813315  0.476788
```
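A single scalar is not the only option: `fillna()` also accepts a dict or a Series keyed by column name, so each column can get its own fill value. A minimal sketch with made-up data (the fill values `0.0` and `-1.0` are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                   "two": [np.nan, 4.0, 8.0]})

# A dict maps each column name to its own fill value.
by_column = df.fillna({"one": 0.0, "two": -1.0})

# df.mean() is a Series indexed by column name, so every NaN
# is replaced by the mean of its own column (2.0 and 6.0 here).
by_mean = df.fillna(df.mean())

print(by_column)
print(by_mean)
```

`fillna()` returns a new object in both cases, so `df` itself still contains its NaN values.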

Replacing missing (or generic) values

```python
import pandas as pd

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})

print(df)
print('\n')

# The dict maps each old value to its replacement: 1000 -> 10, 2000 -> 60.
print(df.replace({1000: 10, 2000: 60}))
```

```
    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60


   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
```
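`replace()` also works in the other direction: placeholder codes can be turned into `NaN` so that the missing-data tools in this article apply to them. A sketch, assuming a hypothetical dataset in which `-999` marks a missing reading:

```python
import numpy as np
import pandas as pd

# -999 is an assumed sentinel value, not something pandas treats specially.
df = pd.DataFrame({"one": [10, -999, 30],
                   "two": [-999, 0, 30]})

# Replacing the sentinel with NaN upcasts the integer columns to float.
cleaned = df.replace(-999, np.nan)

print(cleaned)
print(cleaned.isnull().sum())
```

Note that the genuine `0` in column `two` is left alone; only exact matches of the sentinel are replaced.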

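Filling NA forward and backward

• `pad` / `ffill`: fill forward, carrying the last valid value onto the missing entries that follow it
• `bfill` / `backfill`: fill backward, using the next valid value for the missing entries before it

These names are values of the `method` argument of `fillna()`. Since pandas 2.1 that argument is deprecated in favor of the equivalent `DataFrame.ffill()` and `DataFrame.bfill()` methods.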

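```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print('\n')

# Forward fill: same result as df.fillna(method='pad'),
# which is deprecated since pandas 2.1.
print(df.ffill())
```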

```
        one       two     three
a -0.023243  1.671621 -1.687063
b       NaN       NaN       NaN
c -0.933355  0.609602 -0.620189
d       NaN       NaN       NaN
e  0.151455 -1.324563 -0.598897
f  0.605670 -0.924828 -1.050643
g       NaN       NaN       NaN
h  0.892414 -0.137194 -1.101791


        one       two     three
a -0.023243  1.671621 -1.687063
b -0.023243  1.671621 -1.687063
c -0.933355  0.609602 -0.620189
d -0.933355  0.609602 -0.620189
e  0.151455 -1.324563 -0.598897
f  0.605670 -0.924828 -1.050643
g  0.605670 -0.924828 -1.050643
h  0.892414 -0.137194 -1.101791
```

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Backward fill: same result as df.fillna(method='backfill'),
# which is deprecated since pandas 2.1.
print(df.bfill())
```

```
        one       two     three
a  2.278454  1.550483 -2.103731
b -0.779530  0.408493  1.247796
c -0.779530  0.408493  1.247796
d  0.262713 -1.073215  0.129808
e  0.262713 -1.073215  0.129808
f -0.600729  1.310515 -0.877586
g  0.395212  0.219146 -0.175024
h  0.395212  0.219146 -0.175024
```
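Because the frames above hold random numbers, the direction of each fill is easier to follow on fixed values. The sketch below uses a hypothetical Series and also shows the `limit` argument, which caps how many consecutive missing entries get filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list("abcde"))

forward = s.ffill()         # b and c take a's value, e takes d's value
backward = s.bfill()        # b and c take d's value, e stays NaN
limited = s.ffill(limit=1)  # only the first NaN of each gap is filled

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

The trailing `e` stays missing after a backward fill because no valid value follows it; a forward fill leaves leading gaps untouched in the same way.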

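IV. Dropping Missing Values

By default `dropna()` works along `axis=0` and removes every row that contains at least one `NA`; with `axis=1` it removes columns instead.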

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Rows b, d and g contain NaN, so they are dropped.
print(df.dropna())
```
Output:

```
        one       two     three
a -0.719623  0.028103 -1.093178
c  0.040312  1.729596  0.451805
e -1.029418  1.920933  1.289485
f  1.217967  1.368064  0.527406
h  0.667855  0.147989 -1.035978
```

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# axis=1 drops columns; every column has a NaN here, so none survive.
print(df.dropna(axis=1))
```
Output:

```
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
```
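Dropping everything that has a single `NaN` can be too aggressive, as the empty result above shows. `dropna()` also takes `how`, `thresh` and `subset` to narrow the rule. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one":   [1.0, np.nan, 3.0, np.nan],
     "two":   [2.0, np.nan, np.nan, np.nan],
     "three": [7.0, np.nan, 9.0, 10.0]},
    index=["a", "b", "c", "d"],
)

any_na = df.dropna()                    # default how='any': keeps only a
all_na = df.dropna(how="all")           # drops only the fully empty row b
at_least_two = df.dropna(thresh=2)      # keeps rows with >= 2 valid values
by_subset = df.dropna(subset=["one"])   # only column 'one' is inspected

print(any_na.index.tolist())
print(all_na.index.tolist())
print(at_least_two.index.tolist())
print(by_subset.index.tolist())
```

Row `d` has a single valid value, so it survives `how="all"` but not `thresh=2`.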

https://www.cnblogs.com/Summer-skr--blog/p/11705887.html