Exercise 4 Groupby Apply

The goal of this exercise is to learn to group the data and apply a function to each group. The use case we will work on is winsorizing.

Create a function that uses pandas.DataFrame.clip to replace extreme values with a given percentile. Values greater than the upper percentile (80%) are replaced by the 80th percentile, and values smaller than the lower percentile (20%) are replaced by the 20th percentile. This process of correcting outliers is called winsorizing. I recommend using NumPy to compute the percentiles so that we all use the same default parameters.
```
    def winsorize(df, quantiles):
        """
            df: pd.DataFrame
            quantiles: list 
                ex: [0.05, 0.95]
        """
        #TODO
        return 
```
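As a warm-up before filling in the template, the two building blocks named above can be tried on their own. The sketch below is only an illustration on hypothetical toy data (the `toy` frame and its `value` column are not part of the exercise). It assumes the default linear interpolation of `np.percentile`. Note that `np.percentile` expects percentiles on a 0-100 scale, whereas the template receives fractions such as 0.05; `np.quantile` accepts fractions directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, unrelated to the exercise data set.
toy = pd.DataFrame({"value": [0, 10, 20, 30, 100]})

# np.percentile works on a 0-100 scale and interpolates linearly by default.
# np.quantile(toy["value"], [0.25, 0.75]) would give the same bounds.
lower, upper = np.percentile(toy["value"], [25, 75])

# DataFrame.clip raises values below `lower` to `lower`
# and lowers values above `upper` to `upper`.
clipped = toy.clip(lower=lower, upper=upper)
print(clipped)
```

Here the bounds are 10 and 30, so the 0 is raised to 10 and the 100 is lowered to 30, while the middle values are untouched.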
Here is what the function should output:
```
    df = pd.DataFrame(range(1,11), columns=['sequence'])
    print(winsorize(df, [0.20, 0.80]).to_markdown())
```
|    |   sequence |
|---:|-----------:|
|  0 |        2.8 |
|  1 |        2.8 |
|  2 |        3   |
|  3 |        4   |
|  4 |        5   |
|  5 |        6   |
|  6 |        7   |
|  7 |        8   |
|  8 |        8.2 |
|  9 |        8.2 |

Now we consider that each value belongs to a group. The goal is to apply the winsorizing to each group separately. In this question we use the common winsorizing percentiles [0.05, 0.95]. Here is the new data set:
```
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])

df = pd.DataFrame(data=zip(groups, range(1, 51)),
                  columns=["group", "sequence"])
```
The expected output (first rows) is:

|    |   sequence |
|---:|-----------:|
|  0 |       1.45 |
|  1 |       2    |
|  2 |       3    |
|  3 |       4    |
|  4 |       5    |
|  5 |       6    |
|  6 |       7    |
|  7 |       8    |
|  8 |       9    |
|  9 |       9.55 |
| 10 |      11.45 |
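
The group-wise pattern itself (split the rows by group, apply a function to each piece, combine the results) can be sketched with a different function, so the exercise is not given away. The example below is a minimal illustration on hypothetical toy data. The `center` helper and the `value` column are invented for this sketch, and passing `group_keys=False` is one possible way to keep the original row index in the result.

```python
import pandas as pd

# Hypothetical toy data: two groups of three values each.
toy = pd.DataFrame({"group": [1, 1, 1, 2, 2, 2],
                    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0]})


def center(group_df):
    """Hypothetical helper: subtract the group's own mean from each value."""
    return group_df - group_df.mean()


# Selecting [["value"]] makes apply pass one sub-DataFrame per group to
# `center`; group_keys=False keeps the original row index in the result.
centered = toy.groupby("group", group_keys=False)[["value"]].apply(center)
print(centered)
```

Each group is centered around its own mean (2 for group 1, 30 for group 2), which shows that the function only ever sees the rows of one group at a time. A per-group winsorizing function can be plugged in the same way.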