Using the in operator with a Pandas Series
Question
Why can't I match a string in a Pandas Series using in? In the following example, the first expression unexpectedly evaluates to False, but the second one works.
import pandas as pd
df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})
'Adam' in df['name']        # False (unexpected)
'Adam' in list(df['name'])  # True
Recommended answer
In the first case:
The in operator is interpreted as a call to df['name'].__contains__('Adam'). If you look at the implementation of __contains__ in pandas.Series, you will find the following (inherited from pandas.core.generic.NDFrame):
def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis
So your first use of in is interpreted as:
'Adam' in df['name']._info_axis
This gives False, as expected, because df['name']._info_axis holds the index (here a RangeIndex), not the data itself:
In [37]: df['name']._info_axis
Out[37]: RangeIndex(start=0, stop=3, step=1)
In [38]: list(df['name']._info_axis)
Out[38]: [0, 1, 2]
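The behaviour described above can be checked without touching the private _info_axis attribute. The following is a minimal sketch; the variable names s and t and the second Series with string labels are illustrative assumptions, not part of the original question. It shows that in tests index labels, while membership against the converted values tests the data:

```python
import pandas as pd

s = pd.Series(['Adam', 'Ben', 'Chris'], name='name')

label_hit = 0 in s                  # 0 is an index label of s
value_miss = 'Adam' in s            # 'Adam' is a value, not an index label
value_hit = 'Adam' in s.to_numpy()  # membership test against the data itself

# Hypothetical second Series whose index labels are the names:
t = pd.Series([1, 2, 3], index=['Adam', 'Ben', 'Chris'])
label_hit_str = 'Adam' in t         # 'Adam' is now an index label
value_miss_int = 1 in t             # 1 is a value of t, not a label

print(label_hit, value_miss, value_hit, label_hit_str, value_miss_int)
```

In other words, a Series behaves like a dict here: in checks the keys (index labels), not the values.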
In the second case:
'Adam' in list(df['name'])
Calling list converts the pandas.Series to a list of its values. So the actual operation is this:
In [42]: list(df['name'])
Out[42]: ['Adam', 'Ben', 'Chris']
In [43]: 'Adam' in ['Adam', 'Ben', 'Chris']
Out[43]: True
Here are a few more idiomatic ways to do what you want (with the associated timings):
In [56]: df.name.str.contains('Adam').any()
Out[56]: True
In [57]: timeit df.name.str.contains('Adam').any()
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop
In [58]: df.name.isin(['Adam']).any()
Out[58]: True
In [59]: timeit df.name.isin(['Adam']).any()
The slowest run took 5.13 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop
In [60]: df.name.eq('Adam').any()
Out[60]: True
In [61]: timeit df.name.eq('Adam').any()
10000 loops, best of 3: 178 µs per loop
Note: the last approach was also suggested by @Wen in a comment above.
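One caveat worth keeping in mind when choosing among these three: they are not strictly equivalent. As far as I know, str.contains performs substring matching (and treats the pattern as a regular expression by default), whereas isin and eq test for exact equality. The sketch below illustrates the difference; the search strings 'Ada' and 'A.a' are made-up examples, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})

# Exact matches: all three approaches agree for the full string 'Adam'.
eq_full = df['name'].eq('Adam').any()
isin_full = df['name'].isin(['Adam']).any()
contains_full = df['name'].str.contains('Adam').any()

# Partial string: only str.contains reports a hit, because 'Ada' is a
# substring of 'Adam' but not equal to any value.
eq_partial = df['name'].eq('Ada').any()
contains_partial = df['name'].str.contains('Ada').any()

# The pattern is a regex by default, so '.' matches any character;
# regex=False makes the pattern a literal string.
contains_regex = df['name'].str.contains('A.a').any()
contains_literal = df['name'].str.contains('A.a', regex=False).any()

print(eq_full, isin_full, contains_full)
print(eq_partial, contains_partial)
print(contains_regex, contains_literal)
```

If an exact match is what you need, eq or isin is probably the safer choice; str.contains fits better when a partial or pattern match is intended.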