Exercise 4 Groupby Apply
The goal of this exercise is to learn to group data and apply a function to each group. The use case we will work on is computing
- Create a function that uses pandas.DataFrame.clip to replace extreme values with a given percentile. Values greater than the upper percentile (80%) are replaced by the 80th percentile. Values smaller than the lower percentile (20%) are replaced by the 20th percentile. This process of correcting outliers is called winsorizing. I recommend using NumPy to compute the percentiles so that we all use the same default parameters.

    def winsorize(df, quantiles):
        """
        df: pd.DataFrame
        quantiles: list
            ex: [0.05, 0.95]
        """
        # TODO
        return
Here is what the function should output:
    import pandas as pd

    df = pd.DataFrame(range(1, 11), columns=['sequence'])
    print(winsorize(df, [0.20, 0.80]).to_markdown())
|    |   sequence |
|---:|-----------:|
|  0 |        2.8 |
|  1 |        2.8 |
|  2 |        3   |
|  3 |        4   |
|  4 |        5   |
|  5 |        6   |
|  6 |        7   |
|  7 |        8   |
|  8 |        8.2 |
|  9 |        8.2 |
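To make the clipping step concrete, here is one possible sketch of the function, not the reference solution. It assumes every column of the DataFrame is numeric and that each column gets its own bounds. The names bounds, lower and upper are only illustrative.

```python
import numpy as np
import pandas as pd


def winsorize(df, quantiles):
    """Clip each column of df to its own [lower, upper] quantile range.

    df: pd.DataFrame with numeric columns (assumption of this sketch)
    quantiles: list of two floats, e.g. [0.05, 0.95]
    """
    # np.quantile interpolates linearly by default; axis=0 gives one
    # lower and one upper bound per column, shape (2, n_columns).
    bounds = np.quantile(df.to_numpy(), quantiles, axis=0)
    lower = pd.Series(bounds[0], index=df.columns)
    upper = pd.Series(bounds[1], index=df.columns)
    # axis=1 aligns the Series of bounds with the column labels.
    return df.clip(lower=lower, upper=upper, axis=1)


df = pd.DataFrame(range(1, 11), columns=["sequence"])
winsorized = winsorize(df, [0.20, 0.80])
print(winsorized)
```

With the values 1 to 10, linear interpolation puts the 20th percentile at 1 + 0.2 * 9 = 2.8 and the 80th percentile at 1 + 0.8 * 9 = 8.2, which is where the two clipped values in the expected output come from.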
- Now we consider that each value belongs to a group. The goal is to apply the winsorizing to each group. In this question we use commonly used winsorizing percentiles: [0.05, 0.95]. Here is the new data set:

    import numpy as np
    import pandas as pd

    groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
    df = pd.DataFrame(data=zip(groups, range(1, 51)), columns=["group", "sequence"])
The expected output (first rows) is:
|    |   sequence |
|---:|-----------:|
|  0 |       1.45 |
|  1 |       2    |
|  2 |       3    |
|  3 |       4    |
|  4 |       5    |
|  5 |       6    |
|  6 |       7    |
|  7 |       8    |
|  8 |       9    |
|  9 |       9.55 |
| 10 |      11.45 |
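The group-wise step can be sketched as follows. This is only one way to do it, under a few assumptions: the winsorize body below is a hypothetical solution rather than the reference one, the group column is excluded from clipping by selecting only the sequence column, and group_keys=False is used so that the result keeps the original row index.

```python
import numpy as np
import pandas as pd


def winsorize(df, quantiles):
    # Hypothetical solution: per-column bounds from np.quantile
    # (linear interpolation by default), then DataFrame.clip.
    bounds = np.quantile(df.to_numpy(), quantiles, axis=0)
    lower = pd.Series(bounds[0], index=df.columns)
    upper = pd.Series(bounds[1], index=df.columns)
    return df.clip(lower=lower, upper=upper, axis=1)


groups = np.concatenate([np.ones(10), np.ones(10) + 1, np.ones(10) + 2,
                         np.ones(10) + 3, np.ones(10) + 4])
df = pd.DataFrame(data=zip(groups, range(1, 51)), columns=["group", "sequence"])

# Selecting [["sequence"]] passes one-column DataFrames to winsorize,
# so the group labels themselves are never clipped.
# Extra positional arguments to apply are forwarded to the function.
result = df.groupby("group", group_keys=False)[["sequence"]].apply(
    winsorize, [0.05, 0.95]
)
print(result.head(11))
```

Each group is winsorized against its own percentiles, which is why row 10 (the first value of the second group) becomes 11.45 instead of being compared with the bounds of the first group.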