These are sample chapters from the book Pandas Brain Teasers: 25 brain teasers to tickle your mind and make you a better Pandas developer, by Miki Tebeka.

Buy the book at Gumroad (ePub & PDF)

The Brain Teasers

We shape our tools, and thereafter our tools shape us.
— Marshall McLuhan

1. Rectified

relu.py
import pandas as pd


def relu(n):
    if n < 0:
        return 0
    return n


arr = pd.Series([-1, 0, 1])
print(relu(arr))
Try to guess what the output is before moving to the next page.

This code will raise a ValueError.

The problematic line is if n < 0:. Here n is arr, so n < 0 evaluates to arr < 0, which is a pandas.Series of booleans.

In [1]: import pandas as pd
In [2]: arr = pd.Series([-1, 0, 1])
In [3]: arr < 0
Out[3]:
0     True
1    False
2    False
dtype: bool

Once arr < 0 is computed, we use it in an if statement, which brings us to how boolean values work in Python.

Every Python object, not only True and False, has a boolean value. The documentation states the rules:

Everything is True except:

  • Zero-valued numbers: 0, 0.0, 0+0j …

  • Empty collections: [], {}, '', …​

  • None

  • False

You can test the truth value of a Python object using the built-in bool function.
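As a quick illustration of the rules above, here is a minimal sketch that runs bool over a handful of built-in values. The list of sample values is chosen for illustration only.

```python
# Sample values chosen for illustration: the first eight are falsy,
# the last four are truthy.
values = [0, 0.0, 0j, [], {}, '', None, False, 1, -1.5, [0], 'False']

for value in values:
    print(f'{value!r:>8} -> {bool(value)}')

# A non-empty list or string is truthy regardless of what it contains.
truthy = [value for value in values if value]
print(truthy)
```

Note that [0] and 'False' are truthy: only emptiness matters for collections, not their content.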

On top of the above, any object can define its own boolean value using the __bool__ special method. The boolean logic for pandas.Series differs from that of a list or a tuple: it raises an exception.

In [4]: bool(arr < 0)
...
ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().

The exception tells you the reasoning - it follows The Zen of Python which states:

In the face of ambiguity, refuse the temptation to guess.
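To make the __bool__ mechanism concrete, here is a small sketch with two hypothetical classes (Basket and Ambiguous are invented for this example and are not part of Pandas). The first decides its own truth value; the second refuses to guess, much like pandas.Series does.

```python
class Basket:
    """Hypothetical container that is falsy when empty."""

    def __init__(self, items):
        self.items = list(items)

    def __bool__(self):
        return len(self.items) > 0


class Ambiguous:
    """Hypothetical class that refuses to have a truth value."""

    def __bool__(self):
        raise ValueError('The truth value of Ambiguous is ambiguous.')


print(bool(Basket([])))   # False
print(bool(Basket([1])))  # True

try:
    if Ambiguous():  # the if statement calls __bool__ behind the scenes
        pass
except ValueError as err:
    message = str(err)
    print(message)
```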

So, what are your options? You can use all or any, but then you’ll need to check the type of n to see if it’s a plain number or a pandas.Series.

A function that works on both a scalar and a pandas.Series (or a numpy array) is called a "ufunc", short for "universal function". Most of the functions from numpy or Pandas, such as min, to_datetime…​, are ufuncs.

numpy has a vectorize decorator for these cases.

relu_vec.py
import numpy as np
import pandas as pd


@np.vectorize
def relu(n):
    if n < 0:
        return 0
    return n


arr = pd.Series([-1, 0, 1])
print(relu(arr))

Now relu works on both scalars (e.g. 7, 2.18 …​) and vectors (e.g. numpy array, pandas.Series …​).

The output of relu now is numpy.ndarray, not pandas.Series. You might want to have a look at numba.vectorize as well.
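If you would rather get a pandas.Series back, one alternative (not the approach used in the listing above) is to build relu on np.maximum, which is already a numpy ufunc. The sketch below assumes a reasonably recent Pandas, where ufuncs applied to a Series return a Series.

```python
import numpy as np
import pandas as pd


def relu(n):
    # np.maximum compares element-wise and also accepts plain scalars.
    return np.maximum(n, 0)


arr = pd.Series([-1, 0, 1])
out = relu(arr)
print(type(out).__name__)  # Series
print(out.tolist())        # [0, 0, 1]

# Scalars work too.
print(relu(-7), relu(2.18))
```

This also avoids the Python-level loop that np.vectorize runs under the hood.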

1.1. Further Reading

2. Free Range

loc.py
import pandas as pd

df = pd.DataFrame([
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
    [4, 4, 4],
    [5, 5, 5],

])

print(len(df.loc[1:3]))
Try to guess what the output is before moving to the next page.

This code will print: 3

Slices in Python are half-open [1] ranges. You get values from the first index, up to but not including the last index.

In [1]: chars = ['a', 'b', 'c', 'd', 'e']
In [2]: chars[1:3]
Out[2]: ['b', 'c']

And most of the time, Pandas works the same way:

In [3]: s = pd.Series(chars)
In [4]: s[1:3]
Out[4]:
1    b
2    c
dtype: object

There are three ways to slice a pandas.Series or a pandas.DataFrame:

  • Using loc which works by label

  • Using iloc which works by offset

  • Using slice notation (e.g. s[1:3]), which works like iloc

loc works by label, and its slices are closed ranges: they include the last index.

In [5]: df[1:3]
Out[5]:
   0  1  2
1  2  2  2
2  3  3  3
In [6]: df.iloc[1:3]
Out[6]:
   0  1  2
1  2  2  2
2  3  3  3
In [7]: df.loc[1:3]
Out[7]:
   0  1  2
1  2  2  2
2  3  3  3
3  4  4  4

Watch out for these off-by-one errors [2] when using .loc.
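The difference is easier to see when the labels are not integers. Here is a small sketch using a made-up frame with string labels (the column name value and the labels a to e are chosen for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [1, 2, 3, 4, 5]},
    index=['a', 'b', 'c', 'd', 'e'],
)

by_label = df.loc['b':'d']  # closed range: includes the row labeled 'd'
by_offset = df.iloc[1:3]    # half-open range: offsets 1 and 2 only

print(len(by_label))   # 3
print(len(by_offset))  # 2
```

With labels there is no natural "one past the end" value, which is one reason a closed range is convenient for loc.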

2.1. Further Reading

3. Off With Their NaNs

not_nan.py
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])
print(s[~(s == np.nan)])
Try to guess what the output is before moving to the next page.

This code will print:

0    1.0
1    NaN
2    3.0
dtype: float64

We’ve covered some of the floating point oddities in [Multiplying]. NaN (or np.nan) is another oddity. The name NaN stands for "not a number". It serves two purposes: marking the result of an illegal computation and marking missing values.

Here’s an example of a bad computation:

In [1]: np.float64(0)/np.float64(0)
<ipython-input-50-796728115601>:1: RuntimeWarning: invalid value encountered in double_scalars
  np.float64(0)/np.float64(0)
Out[1]: nan

You see a warning but not an exception and the return value is nan.

nan does not equal any number, including itself.

In [2]: np.nan == np.nan
Out[2]: False

To check that a value is nan, you need to use a special function such as pandas.isnull.

In [3]: pd.isnull(np.nan)
Out[3]: True
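Putting the two facts together explains the teaser's output. A short sketch of the masks involved, using the same series as the teaser:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])

# Nothing equals nan, so the comparison is False everywhere ...
mask = s == np.nan
print(mask.tolist())     # [False, False, False]

# ... and negating it selects every row, the NaN included.
print((~mask).tolist())  # [True, True, True]

# pd.isnull flags the missing value correctly.
print(pd.isnull(s).tolist())  # [False, True, False]
```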

You can use pandas.isnull to fix this teaser.

not_nan_fixed.py
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])
print(s[~pd.isnull(s)])

pandas.isnull works with all of Pandas' "missing" values: None, pandas.NaT (not a time), and the new pandas.NA.
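The following sketch checks each of these missing values with pandas.isnull. It also shows two equivalent spellings of the fix, Series.notna and Series.dropna, which are standard Pandas methods but not the ones used in the listing above. It assumes Pandas 1.0 or later, where pandas.NA exists.

```python
import numpy as np
import pandas as pd

# Every flavor of "missing" is recognized by pd.isnull.
missing = [None, np.nan, pd.NaT, pd.NA]
flags = [pd.isnull(value) for value in missing]
print(flags)

s = pd.Series([1, np.nan, 3])

# notna is the negation of isnull, so no ~ is needed.
kept = s[s.notna()]
print(kept.tolist())  # [1.0, 3.0]

# dropna gives the same result in one call.
print(s.dropna().equals(kept))
```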

Floating point numbers have several other special values such as inf (infinity), -inf, -0, +0 and others. You can learn more about them in the links below.
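As a small taste of these special values, here is a sketch using only the Python standard library. The comparisons against 1e308 are just an illustrative "very large" float.

```python
import math

inf = float('inf')

print(inf > 1e308)    # True: infinity is larger than any finite float
print(-inf < -1e308)  # True

# Subtracting infinity from itself is an invalid computation.
print(inf - inf)      # nan

# The two zeros compare equal, but the sign is still stored.
print(0.0 == -0.0)             # True
print(math.copysign(1, -0.0))  # -1.0
```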


1. [) in math.
2. "There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors." - Leon Bambrick.