# Hands-on Session 1: Statistics 101

## Getting Started

Language: Python https://www.python.org/

Package manager: Pip 

Alternative: Anaconda with Conda https://www.anaconda.com/ 

### Clean environment

Create a folder to hold the ML hands-on sessions.

```mkdir ~/path_to_folder```

Now move to the new folder.

```cd ~/path_to_folder```

Create a local Python environment.

```python3 -m venv env```

You can activate it at any time with

```source ~/path_to_folder/env/bin/activate```

and deactivate it with

```deactivate```.
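
To confirm from Python that the environment is really active, one option is to compare `sys.prefix` with `sys.base_prefix`. The helper name `in_virtualenv` below is made up for this illustration, and the check assumes an environment created with `venv` as above (a Conda environment is detected differently).

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the environment has probably not been activated in the current shell.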



### Packages

Numpy https://numpy.org/

Scipy https://www.scipy.org/

Matplotlib https://matplotlib.org/

Pandas https://pandas.pydata.org/

Scikit-learn https://scikit-learn.org/stable/index.html

Jupyter https://jupyter.org/

Once in your local env you can install them with 

```pip install numpy scipy matplotlib pandas scikit-learn jupyter```.
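
A quick way to check that the installation worked is to ask Python for the installed version of each package. This is only a sketch: the helper name `installed_versions` is hypothetical, and it relies on the standard-library module `importlib.metadata` (Python 3.8 or later).

```python
from importlib import metadata

# Distribution names as passed to pip above
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas", "scikit-learn", "jupyter"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>12}: {version or 'NOT INSTALLED'}")
```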

### Notebook

Jupyter notebooks let you mix code with text and run the code interactively, one cell at a time.
In your local env, use

```jupyter notebook```

to start it. This automatically starts a local server that you can open in your browser
(it works with Firefox and should work with the other major browsers).

The two major cell types that you are going to use are code cells and markdown cells.
A cell can be executed with SHIFT + ENTER.

### Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Displays matplotlib figures directly inside the notebook
# No need to import scipy or sklearn for now

## Hands-on 1: Mean or Median?

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
#Gaussian samples
mc_runs = 100
mu = 1

ns = np.power(10, np.arange(7))
mean_mses = []
median_mses = []

for n in ns:
    mean_mse = 0
    median_mse = 0
    
    for _ in range(mc_runs):
        X = np.random.normal(mu, 1, n)
        mean_mse += (mu - np.mean(X))**2
        median_mse += (mu - np.median(X))**2
    
    mean_mse /= mc_runs
    median_mse /= mc_runs 
    
    mean_mses.append(mean_mse)
    median_mses.append(median_mse)
    
plt.plot(ns, mean_mses, label='Mean')
plt.plot(ns, median_mses, label='Median')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.title('L2 Risk for Gaussian samples')
plt.xlabel('n')
plt.ylabel('L2 Risk')
plt.show()
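
For Gaussian samples with unit variance, the L2 risk of the sample mean is exactly 1/n, while the risk of the sample median is asymptotically π/(2n). The sketch below compares Monte Carlo estimates with these two reference values. The function name `gaussian_risks`, the seed, and the choice of 2000 runs are assumptions made for this illustration. It also draws all runs at once as a matrix instead of looping.

```python
import numpy as np

def gaussian_risks(n, mc_runs, rng, mu=1.0, sigma=1.0):
    """Monte Carlo L2 risk of the mean and the median for N(mu, sigma^2) samples."""
    # One row per Monte Carlo run, one column per observation
    X = rng.normal(mu, sigma, size=(mc_runs, n))
    mean_risk = np.mean((X.mean(axis=1) - mu) ** 2)
    median_risk = np.mean((np.median(X, axis=1) - mu) ** 2)
    return mean_risk, median_risk

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
for n in [10, 100, 1000]:
    mean_risk, median_risk = gaussian_risks(n, mc_runs=2000, rng=rng)
    print(f"n={n:5d}  mean: {mean_risk:.5f} (theory {1 / n:.5f})  "
          f"median: {median_risk:.5f} (asymptotic {np.pi / (2 * n):.5f})")
```

On the log-log plot both curves should therefore appear as parallel lines of slope -1, with the median curve sitting above the mean curve.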

In [None]:
#Laplace samples
mc_runs = 100
mu = 1

ns = np.power(10, np.arange(7))
mean_mses = []
median_mses = []

for n in ns:
    mean_mse = 0
    median_mse = 0
    
    for _ in range(mc_runs):
        X = np.random.laplace(mu, 1, n)
        mean_mse += (mu - np.mean(X))**2
        median_mse += (mu - np.median(X))**2
    
    mean_mse /= mc_runs
    median_mse /= mc_runs 
    
    mean_mses.append(mean_mse)
    median_mses.append(median_mse)
    
plt.plot(ns, mean_mses, label='Mean')
plt.plot(ns, median_mses, label='Median')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.title('L2 Risk for Laplace samples')
plt.xlabel('n')
plt.ylabel('L2 Risk')
plt.show()
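
For Laplace samples the ordering flips: with scale 1 the variance is 2, so the mean has risk 2/n, while the median has asymptotic risk 1/n. One way to see both regimes side by side is to estimate the ratio of the two risks for each distribution. The helper `risk_ratio`, the seed, and the sample size are assumptions chosen for this sketch.

```python
import numpy as np

def risk_ratio(sampler, mu, n, mc_runs=2000):
    """Estimate (median risk) / (mean risk) by Monte Carlo."""
    # sampler must return an array of the requested shape (runs x observations)
    X = sampler(size=(mc_runs, n))
    mean_risk = np.mean((X.mean(axis=1) - mu) ** 2)
    median_risk = np.mean((np.median(X, axis=1) - mu) ** 2)
    return median_risk / mean_risk

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
mu, n = 1.0, 1001

gauss = risk_ratio(lambda size: rng.normal(mu, 1.0, size), mu, n)
laplace = risk_ratio(lambda size: rng.laplace(mu, 1.0, size), mu, n)

# Asymptotic values: pi/2 for the Gaussian case, 1/2 for the Laplace case
print(f"Gaussian ratio: {gauss:.3f} (asymptotic {np.pi / 2:.3f})")
print(f"Laplace ratio:  {laplace:.3f} (asymptotic {0.5:.3f})")
```

A ratio above 1 means the mean is the better estimator, and a ratio below 1 means the median wins.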

## Hands-on 2: Linear Regression with scikit-learn

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

print("Example of input data:")
print(diabetes_X[0])

print("Example of output data:")
print(diabetes_y[0])

# Use only a subset of four features (columns 1, 2, 4 and 8)
diabetes_X = diabetes_X[:, [1, 2, 4, 8]]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs (disabled: this scatter/line plot only makes sense with a single feature)
#plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
#plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

#plt.xticks(())
#plt.yticks(())

#plt.show()
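
To see what `LinearRegression` computes, the fit can be reproduced as an ordinary least-squares problem with `np.linalg.lstsq`. This is a sketch under the same setup as the cell above (same four columns, last 20 rows held out). The intercept is handled by appending a column of ones, which is one common way to do it and not necessarily how scikit-learn does it internally.

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, [1, 2, 4, 8]]                 # same four columns as above
X_train, y_train = X[:-20], y[:-20]    # same training split as above

# Reference fit with scikit-learn
regr = linear_model.LinearRegression().fit(X_train, y_train)

# Ordinary least squares by hand: the extra column of ones models the intercept
A = np.column_stack([X_train, np.ones(len(X_train))])
beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

print("sklearn coefficients:", regr.coef_, "intercept:", regr.intercept_)
print("lstsq   coefficients:", beta[:-1], "intercept:", beta[-1])
```

Both lines should print the same numbers up to floating-point error.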