Tutorial 2#
Machine Learning III: Clustering & Classification
[insert your name]
Important reminders: Before starting, click “File -> Save a copy in Drive”. Produce a PDF for submission via “File -> Print”, then choose “Save to PDF”.
We use the dataset and some of the text/code from NMA W1D4 T2.
Imports
# @markdown Imports
# Imports
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets # interactive display
import math
from sklearn.cluster import KMeans
Plotting functions
# @markdown Plotting functions
import numpy
from numpy.linalg import inv, eig
from math import ceil
from matplotlib import pyplot, ticker, get_backend, rc
from mpl_toolkits.mplot3d import Axes3D
from itertools import cycle
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle")
def plot_function(f, name, var, points=(-10, 10)):
  """Evaluate f() on linear space between points and plot.

  Args:
    f (callable): function that maps scalar -> scalar
    name (string): Function name for axis labels
    var (string): Variable name for axis labels.
    points (tuple): Args for np.linspace to create eval grid.
  """
  x = np.linspace(*points)
  ax = plt.figure().subplots()
  ax.plot(x, f(x))
  ax.set(
      xlabel=f'${var}$',
      ylabel=f'${name}({var})$'
  )
Helper functions
# @markdown Helper functions
The data#
We will use a dataset that includes recordings of neurons as mice perform a decision task.
Mice had the task of turning a wheel to indicate whether they perceived a Gabor stimulus to the left, to the right, or not at all. Neuropixels probes measured spikes across the cortex. Today we’re going to decode the decision from neural data using logistic regression. We will only consider trials where the mouse chose “Left” or “Right” and ignore NoGo trials.
Data format#
In the hidden Data retrieval and loading cell, there is a function that loads the data:

spikes: an array of normalized spike rates with shape (n_trials, n_neurons)

choices: a vector of 0s and 1s, indicating the animal’s behavioral response, with length n_trials.
Data retrieval and loading#
#@title Data retrieval and loading
import os
import requests
import hashlib

url = "https://osf.io/r9gh8/download"
fname = "W1D4_steinmetz_data.npz"
expected_md5 = "d19716354fed0981267456b80db07ea8"

if not os.path.isfile(fname):
  try:
    r = requests.get(url)
  except requests.ConnectionError:
    print("!!! Failed to download data !!!")
  else:
    if r.status_code != requests.codes.ok:
      print("!!! Failed to download data !!!")
    elif hashlib.md5(r.content).hexdigest() != expected_md5:
      print("!!! Data download appears corrupted !!!")
    else:
      with open(fname, "wb") as fid:
        fid.write(r.content)

def load_steinmetz_data(data_fname=fname):
  with np.load(data_fname) as dobj:
    data = dict(**dobj)
  return data

data = load_steinmetz_data()
for key, val in data.items():
  print(key, val.shape)
spikes (276, 691)
choices (276,)
We will rename our variables to make clear which is the input and which is the output variable. We will also split into training and test sets.
n_trials = data["choices"].shape[0]
inds = np.arange(n_trials)
np.random.shuffle(inds)
train_inds = inds[:int(.8*n_trials)]
test_inds = inds[int(.8*n_trials):]
y = data["choices"][train_inds]
X = data["spikes"][train_inds]
print(X.shape)
print(y.shape)
(220, 691)
(220,)
Exercise 1: Coding & applying logistic regression#
We will write our own logistic regression code, instead of using sklearn, to make sure we really understand what’s going on with this model.
If you want challenge mode, don’t follow steps A-D below; just try doing it yourself without the guided steps! Join back in for steps E and F. We’ll accept that for submission as long as you write clearly what you’ve done.
A) Compute z#
For our first step, we want to code a function that takes in the input and outputs z (the value that will go into the sigmoid):

\(z = X\theta\)
First figure out what size \(\theta\) should be and create a fake \(\theta\) full of zeros.
Answer#
Code below
def transform_inputs(theta, X):
  return ...

theta = np.zeros(...)

z = transform_inputs(theta, X)
print(z.shape)
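One possible completion, as a sketch (it assumes \(z = X\theta\) with no separate intercept term, so \(\theta\) has one entry per neuron, matching the shapes above):

def transform_inputs(theta, X):
  # z = X @ theta: a weighted sum of each trial's neural rates, one value per trial
  return X @ theta

theta = np.zeros(X.shape[1])  # one weight per neuron, initialized to zero

z = transform_inputs(theta, X)
print(z.shape)  # should be (n_trials,), here (220,)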
B) Sigmoid#
Now, we need to code our sigmoid function:

\(\sigma(z) = \frac{1}{1 + e^{-z}}\)
Please complete the function below and make sure the resulting plot looks correct to you.
Answer#
Code below
def sigmoid(z):
  """Return the logistic transform of z."""
  return ...

plot_function(sigmoid, r"\sigma", "z", (-10, 10))
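A minimal completion for reference; the resulting plot should be the classic S-shaped curve saturating at 0 and 1:

def sigmoid(z):
  """Return the logistic transform of z."""
  # sigma(z) = 1 / (1 + exp(-z))
  return 1 / (1 + np.exp(-z))

plot_function(sigmoid, r"\sigma", "z", (-10, 10))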
C) Compute h#
We want to compute h now as:

\(h = \sigma(z) = \sigma(X\theta)\)
Complete the code below to do this (call the functions you’ve created in parts A and B)!
Answer#
Code below
def compute_h(theta, X):
  # Returns h after passing through sigmoid
  return ...

h = compute_h(theta, X)
print(h.shape)
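One way to complete this, simply composing the functions from parts A and B:

def compute_h(theta, X):
  # Returns h = sigma(X @ theta): one predicted probability per trial
  return sigmoid(transform_inputs(theta, X))

h = compute_h(theta, X)
print(h.shape)  # (220,)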
D) Compute the cost function#
Now we want to be able to compute the cost function with theta, X, and y as our inputs.
Remember that:

\(J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h^{(i)} + (1 - y^{(i)})\log(1 - h^{(i)})\right]\)
Answer#
Code below
def logistic_loss(theta, X, y):
  # Computes the cost function for all the training samples
  ...
  return total_cost

total_cost = logistic_loss(theta, X, y)
print(total_cost)
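One possible completion, assuming the mean cross-entropy form given above (the small eps is an extra guard against log(0) when h saturates):

def logistic_loss(theta, X, y):
  # Computes the mean cross-entropy cost over all the training samples
  h = compute_h(theta, X)
  eps = 1e-12  # avoid log(0) if h reaches exactly 0 or 1
  total_cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
  return total_cost

total_cost = logistic_loss(theta, X, y)
print(total_cost)  # with theta = 0, h = 0.5 everywhere, so this should be log(2) ≈ 0.693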
E) Seeing the results#
Now we can use scipy.optimize.minimize to numerically minimize the loss (by default it uses the gradient-based BFGS method) and see our results! This will take some time to run.
from scipy.optimize import minimize
outs = minimize(logistic_loss, theta, (X, y))
theta_hat = outs.x
print(theta_hat)
Use \(\hat{\theta}\) to determine the accuracy of the classification. Remember that you need to get h, then transform this to \(\hat{y}\) (where \(\hat{y}_i\) is 1 if h is greater than or equal to 0.5, otherwise it is 0), then compare \(\hat{y}\) and \(y\).
Answer#
Code below
def evaluate_accuracy(theta_hat, X, y):
  ...
  return ...

acc = evaluate_accuracy(theta_hat, X, y)
print(acc)
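One possible completion, following the thresholding recipe described above:

def evaluate_accuracy(theta_hat, X, y):
  h = compute_h(theta_hat, X)     # predicted probabilities
  y_hat = (h >= 0.5).astype(int)  # threshold at 0.5 to get predicted labels
  return np.mean(y_hat == y)      # fraction of trials classified correctly

acc = evaluate_accuracy(theta_hat, X, y)
print(acc)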
Exercise 2: Assessing accuracy#
How do we know if this accuracy is above chance? Let’s shuffle the trial decisions 1000 times, compute accuracies of logistic regression fit on the shuffled data, and compare.
We will actually use a sklearn logistic regression model to fit our models, as it’s a little faster. See the cell below for how to use sklearn.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
acc = model.score(X, y)
print(acc)
A) Shuffle test#
Now complete the code below to shuffle the decisions taken on each trial. Fit a model on that shuffled data and record the accuracy in shuffled_accuracies. We plot a histogram of the accuracies for all these shuffled samples and compare it to the true accuracy.
n_shuffles = 1000
shuffled_accuracies = np.zeros((n_shuffles,))

for i_shuffle in range(n_shuffles):
  ...  # your code here to shuffle the labels
  model = LogisticRegression()
  model.fit(...)
  shuffled_accuracies[i_shuffle] = model.score(...)

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.hist(shuffled_accuracies, 100);
ax.plot([acc, acc], ax.get_ylim(), 'r', linewidth=2, label='Unshuffled accuracy')
ax.legend()
ax.set(xlabel='Accuracy', ylabel='Count')
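One way to fill in the loop, as a sketch: it shuffles the labels with np.random.permutation, then fits and scores on the same shuffled labels, which estimates the null distribution of training accuracy when there is no real trial-label relationship.

n_shuffles = 1000
shuffled_accuracies = np.zeros((n_shuffles,))

for i_shuffle in range(n_shuffles):
  y_shuffled = np.random.permutation(y)  # break the pairing between trials and labels
  model = LogisticRegression()
  model.fit(X, y_shuffled)
  shuffled_accuracies[i_shuffle] = model.score(X, y_shuffled)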
B) Interpreting results#
Are the accuracies for the shuffled data samples higher than you expected? Do you think you can conclude from the true accuracy that the neural responses have a significant amount of information about the decision? Why or why not?
Any initial thoughts about why we couldn’t just check whether our true accuracy is above 50%, which we may naively assume is chance performance?
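As a hint for thinking about chance level, it may help to check the class balance of the labels (using the y already defined above); if the two choices are not equally frequent, always guessing the majority class already beats 50%:

# Fraction of trials with choice 1, and the majority-class baseline accuracy
p1 = np.mean(y)
print(p1, max(p1, 1 - p1))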
Your answers here