Branch-AI/subjects/ai/data-wrangling-with-pandas/audit/README.md

#### Exercise 0: Environment and libraries

##### The exercise is validated is all questions of the exercise are validated.

##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.

##### Run `python --version`.

###### Does it print `Python 3.x`? x >= 8

##### Does `import jupyter`, `import numpy` and `import pandas` run without any error?

---

---

#### Exercise 1: Concatenate

##### This question is validated if the outputted DataFrame is:

    |    | letter   |   number |
    |---:|:---------|---------:|
    |  0 | a        |        1 |
    |  1 | b        |        2 |
    |  2 | c        |        1 |
    |  3 | d        |        2 |

---

---

#### Exercise 2: Merge

##### The exercise is validated is all questions of the exercise are validated.

##### The question 1 is validated if the output is:

    |    |   id | Feature1_x   | Feature2_x   | Feature1_y   | Feature2_y   |
    |---:|-----:|:-------------|:-------------|:-------------|:-------------|
    |  0 |    1 | A            | B            | K            | L            |
    |  1 |    2 | C            | D            | M            | N            |

##### The question 2 is validated if the output is:

    |    |   id | Feature1_df1   | Feature2_df1   | Feature1_df2   | Feature2_df2   |
    |---:|-----:|:---------------|:---------------|:---------------|:---------------|
    |  0 |    1 | A              | B              | K              | L              |
    |  1 |    2 | C              | D              | M              | N              |
    |  2 |    3 | E              | F              | nan            | nan            |
    |  3 |    4 | G              | H              | nan            | nan            |
    |  4 |    5 | I              | J              | nan            | nan            |
    |  5 |    6 | nan            | nan            | O              | P              |
    |  6 |    7 | nan            | nan            | Q              | R              |
    |  7 |    8 | nan            | nan            | S              | T              |

    Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name.

---

---

#### Exercise 3: Merge MultiIndex

##### The exercice is validated is all questions of the exercice are validated.

##### The question 1 is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a table as below. One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`

|                                                      |      Open |    Close | Close_Adjusted |     Twitter |    Reddit |
| :--------------------------------------------------- | --------: | -------: | -------------: | ----------: | --------: |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 |       0.634787 | -0.00159041 |   1.06053 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB')   | -0.123753 |  1.00269 |       0.713264 |   0.0142127 | -0.487028 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE')   |  -1.37775 | -1.01504 |         1.2858 |    0.109835 |   0.04273 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') |   1.06324 | 0.841241 |      -0.799481 |   -0.805677 |  0.511769 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI')  | -0.603453 | -2.06141 |      -0.969064 |     1.49817 |  0.730055 |

##### The question 2 is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`

---

---

#### Exercise 4: Groupby Apply

##### The exercise is validated is all questions of the exercise are validated and if the for loop hasn't been used. The goal is to use `groupby` and `apply`.

##### The question 1 is validated if the output is:

```python
        df = pd.DataFrame(range(1,11), columns=['sequence'])
        print(winsorize(df, [0.20, 0.80]).to_markdown())
```

    |    |   sequence |
    |---:|-----------:|
    |  0 |        2.8 |
    |  1 |        2.8 |
    |  2 |        3   |
    |  3 |        4   |
    |  4 |        5   |
    |  5 |        6   |
    |  6 |        7   |
    |  7 |        8   |
    |  8 |        8.2 |
    |  9 |        8.2 |

##### The question 2 is validated if the output is a Pandas Series or DataFrame with the first 11 rows equal to the output below. The code below give a solution.

    |    |   sequence |
    |---:|-----------:|
    |  0 |       1.45 |
    |  1 |       2    |
    |  2 |       3    |
    |  3 |       4    |
    |  4 |       5    |
    |  5 |       6    |
    |  6 |       7    |
    |  7 |       8    |
    |  8 |       9    |
    |  9 |       9.55 |
    | 10 |      11.45 |

```python
    def winsorize(df_series, quantiles):
    """
        df: pd.DataFrame or pd.Series
        quantiles: list [0.05, 0.95]

    """
    min_value = np.quantile(df_series, quantiles[0])
    max_value = np.quantile(df_series, quantiles[1])

    return df_series.clip(lower = min_value, upper = max_value)


    df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
```

- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e

---

---

#### Exercise 5: Groupby Agg

##### The question is validated if the output is as below. The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})`

| product      | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
| :----------- | ---------------: | ---------------: | ----------------: |
| chair        |            22.89 |            32.12 |            27.505 |
| mobile phone |              100 |           111.22 |            105.61 |
| table        |            20.45 |            99.99 |             51.22 |

---

---

#### Exercise 6: Unstack

##### The question 1 is validated if the output is similar (as the values are generated randomly, it's obvious the audit doesn't require to match the values below) to what `unstacked_df.head()`returns:

    | Date                |   ('Prediction', 'AAPL') |   ('Prediction', 'AMZN') |   ('Prediction', 'DAI') |   ('Prediction', 'FB') |   ('Prediction', 'GE') |
    |:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
    | 2021-01-01 00:00:00 |                 0.382312 |                -0.072392 |               -0.551167 |             -0.0585555 |                1.05955 |
    | 2021-01-04 00:00:00 |                -0.560953 |                 0.503199 |               -0.79517  |             -3.23136   |                1.50271 |
    | 2021-01-05 00:00:00 |                 0.211489 |                 1.84867  |                0.287906 |             -1.81119   |                1.20321 |

##### The question 2 is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
feat(ai-branch): add ai specialist branch - reformatted subjects/audits according to new conventions 1 year ago			`#### Exercise 0: Environment and libraries`

			`##### The exercise is validated is all questions of the exercise are validated.`

			##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.

			##### Run `python --version`.

			###### Does it print `Python 3.x`? x >= 8

			##### Does `import jupyter`, `import numpy` and `import pandas` run without any error?

			`---`

			`---`

			`#### Exercise 1: Concatenate`

			`##### This question is validated if the outputted DataFrame is:`

			`\| \| letter \| number \|`
			`\|---:\|:---------\|---------:\|`
			`\| 0 \| a \| 1 \|`
			`\| 1 \| b \| 2 \|`
			`\| 2 \| c \| 1 \|`
			`\| 3 \| d \| 2 \|`

			`---`

			`---`

			`#### Exercise 2: Merge`

			`##### The exercise is validated is all questions of the exercise are validated.`

			`##### The question 1 is validated if the output is:`

			`\| \| id \| Feature1_x \| Feature2_x \| Feature1_y \| Feature2_y \|`
			`\|---:\|-----:\|:-------------\|:-------------\|:-------------\|:-------------\|`
			`\| 0 \| 1 \| A \| B \| K \| L \|`
			`\| 1 \| 2 \| C \| D \| M \| N \|`

			`##### The question 2 is validated if the output is:`

			`\| \| id \| Feature1_df1 \| Feature2_df1 \| Feature1_df2 \| Feature2_df2 \|`
			`\|---:\|-----:\|:---------------\|:---------------\|:---------------\|:---------------\|`
			`\| 0 \| 1 \| A \| B \| K \| L \|`
			`\| 1 \| 2 \| C \| D \| M \| N \|`
			`\| 2 \| 3 \| E \| F \| nan \| nan \|`
			`\| 3 \| 4 \| G \| H \| nan \| nan \|`
			`\| 4 \| 5 \| I \| J \| nan \| nan \|`
			`\| 5 \| 6 \| nan \| nan \| O \| P \|`
			`\| 6 \| 7 \| nan \| nan \| Q \| R \|`
			`\| 7 \| 8 \| nan \| nan \| S \| T \|`

			`Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name.`

			`---`

			`---`

			`#### Exercise 3: Merge MultiIndex`

			`##### The exercice is validated is all questions of the exercice are validated.`

			##### The question 1 is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a table as below. One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`

			`\| \| Open \| Close \| Close_Adjusted \| Twitter \| Reddit \|`
			`\| :--------------------------------------------------- \| --------: \| -------: \| -------------: \| ----------: \| --------: \|`
			`\| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') \| 0.0991792 \| -0.31603 \| 0.634787 \| -0.00159041 \| 1.06053 \|`
			`\| (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') \| -0.123753 \| 1.00269 \| 0.713264 \| 0.0142127 \| -0.487028 \|`
			`\| (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') \| -1.37775 \| -1.01504 \| 1.2858 \| 0.109835 \| 0.04273 \|`
			`\| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') \| 1.06324 \| 0.841241 \| -0.799481 \| -0.805677 \| 0.511769 \|`
			`\| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') \| -0.603453 \| -2.06141 \| -0.969064 \| 1.49817 \| 0.730055 \|`

			##### The question 2 is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`

			`---`

			`---`

			`#### Exercise 4: Groupby Apply`

			##### The exercise is validated is all questions of the exercise are validated and if the for loop hasn't been used. The goal is to use `groupby` and `apply`.

			`##### The question 1 is validated if the output is:`

			```python
			`df = pd.DataFrame(range(1,11), columns=['sequence'])`
			`print(winsorize(df, [0.20, 0.80]).to_markdown())`
			```

			`\| \| sequence \|`
			`\|---:\|-----------:\|`
			`\| 0 \| 2.8 \|`
			`\| 1 \| 2.8 \|`
			`\| 2 \| 3 \|`
			`\| 3 \| 4 \|`
			`\| 4 \| 5 \|`
			`\| 5 \| 6 \|`
			`\| 6 \| 7 \|`
			`\| 7 \| 8 \|`
			`\| 8 \| 8.2 \|`
			`\| 9 \| 8.2 \|`

			`##### The question 2 is validated if the output is a Pandas Series or DataFrame with the first 11 rows equal to the output below. The code below give a solution.`

			`\| \| sequence \|`
			`\|---:\|-----------:\|`
			`\| 0 \| 1.45 \|`
			`\| 1 \| 2 \|`
			`\| 2 \| 3 \|`
			`\| 3 \| 4 \|`
			`\| 4 \| 5 \|`
			`\| 5 \| 6 \|`
			`\| 6 \| 7 \|`
			`\| 7 \| 8 \|`
			`\| 8 \| 9 \|`
			`\| 9 \| 9.55 \|`
			`\| 10 \| 11.45 \|`

			```python
			`def winsorize(df_series, quantiles):`
			`"""`
			`df: pd.DataFrame or pd.Series`
			`quantiles: list [0.05, 0.95]`

			`"""`
			`min_value = np.quantile(df_series, quantiles[0])`
			`max_value = np.quantile(df_series, quantiles[1])`

			`return df_series.clip(lower = min_value, upper = max_value)`


			`df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])`
			```

			`- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e`

			`---`

			`---`

			`#### Exercise 5: Groupby Agg`

			##### The question is validated if the output is as below. The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})`

			`\| product \| ('value', 'min') \| ('value', 'max') \| ('value', 'mean') \|`
			`\| :----------- \| ---------------: \| ---------------: \| ----------------: \|`
			`\| chair \| 22.89 \| 32.12 \| 27.505 \|`
			`\| mobile phone \| 100 \| 111.22 \| 105.61 \|`
			`\| table \| 20.45 \| 99.99 \| 51.22 \|`

			`---`

			`---`

			`#### Exercise 6: Unstack`

			##### The question 1 is validated if the output is similar (as the values are generated randomly, it's obvious the audit doesn't require to match the values below) to what `unstacked_df.head()`returns:

			`\| Date \| ('Prediction', 'AAPL') \| ('Prediction', 'AMZN') \| ('Prediction', 'DAI') \| ('Prediction', 'FB') \| ('Prediction', 'GE') \|`
			`\|:--------------------\|-------------------------:\|-------------------------:\|------------------------:\|-----------------------:\|-----------------------:\|`
			`\| 2021-01-01 00:00:00 \| 0.382312 \| -0.072392 \| -0.551167 \| -0.0585555 \| 1.05955 \|`
			`\| 2021-01-04 00:00:00 \| -0.560953 \| 0.503199 \| -0.79517 \| -3.23136 \| 1.50271 \|`
			`\| 2021-01-05 00:00:00 \| 0.211489 \| 1.84867 \| 0.287906 \| -1.81119 \| 1.20321 \|`

			##### The question 2 is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.