
Tutorial 2#

Machine Learning III: Clustering & Classification

[insert your name]

Important reminders: Before starting, click “File -> Save a copy in Drive”. To produce a PDF for submission, choose “File -> Print” and then “Save to PDF”.

We use the dataset and some of the text/code from NMA W1D4 T2.

Imports

# @markdown Imports

# Imports
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets  # interactive display
import math
from sklearn.cluster import KMeans

Plotting functions

# @markdown Plotting functions
import numpy
from numpy.linalg import inv, eig
from math import ceil
from matplotlib import pyplot, ticker, get_backend, rc
from mpl_toolkits.mplot3d import Axes3D
from itertools import cycle


%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle")


def plot_function(f, name, var, points=(-10, 10)):
    """Evaluate f() on linear space between points and plot.

    Args:
      f (callable): function that maps scalar -> scalar
      name (string): Function name for axis labels
      var (string): Variable name for axis labels.
      points (tuple): Args for np.linspace to create eval grid.
    """
    x = np.linspace(*points)
    ax = plt.figure().subplots()
    ax.plot(x, f(x))
    ax.set(
      xlabel=f'${var}$',
      ylabel=f'${name}({var})$'
    )

Helper functions

# @markdown Helper functions

The data#

We will use a dataset that includes recordings of neurons as mice perform a decision task.

Mice had the task of turning a wheel to indicate whether they perceived a Gabor stimulus to the left, to the right, or not at all. Neuropixels probes recorded spikes across the cortex. Today we’re going to decode the decision from the neural data using logistic regression. We will only consider trials where the mouse chose “Left” or “Right” and ignore NoGo trials.

Data format#

In the hidden Data retrieval and loading cell, there is a function that loads the data:

  • spikes: an array of normalized spike rates with shape (n_trials, n_neurons)

  • choices: a vector of 0s and 1s, indicating the animal’s behavioral response, with length n_trials.

Data retrieval and loading#

#@title Data retrieval and loading
import os
import requests
import hashlib

url = "https://osf.io/r9gh8/download"
fname = "W1D4_steinmetz_data.npz"
expected_md5 = "d19716354fed0981267456b80db07ea8"

if not os.path.isfile(fname):
  try:
    r = requests.get(url)
  except requests.ConnectionError:
    print("!!! Failed to download data !!!")
  else:
    if r.status_code != requests.codes.ok:
      print("!!! Failed to download data !!!")
    elif hashlib.md5(r.content).hexdigest() != expected_md5:
      print("!!! Data download appears corrupted !!!")
    else:
      with open(fname, "wb") as fid:
        fid.write(r.content)

def load_steinmetz_data(data_fname=fname):
  with np.load(data_fname) as dobj:
    data = dict(**dobj)
  return data

data = load_steinmetz_data()
for key, val in data.items():
  print(key, val.shape)
spikes (276, 691)
choices (276,)

We will rename our variables to make clear which is the input and which is the output variable. We will also split into training and test sets.

n_trials = data["choices"].shape[0]
inds = np.arange(n_trials)
np.random.shuffle(inds)

# Use 80% of trials for training and hold out the remaining 20%
train_inds = inds[:int(.8 * n_trials)]
test_inds = inds[int(.8 * n_trials):]

y = data["choices"][train_inds]
X = data["spikes"][train_inds]

# Held-out test set
y_test = data["choices"][test_inds]
X_test = data["spikes"][test_inds]

print(X.shape)
print(y.shape)
(220, 691)
(220,)

Exercise 1: Coding & applying logistic regression#

We will write our own logistic regression code, instead of using sklearn, to make sure we really understand what’s going on with this model.

If you want challenge mode, don’t follow steps A-D below: just try doing it yourself without the guided steps! Rejoin us for steps E and F. We’ll accept that for submission as long as you write clearly what you’ve done.

A) Compute z#

For our first step, we want to code a function that takes in the input and outputs z (the value that will go into the sigmoid):

\[ z = X\theta\]

First figure out what size \(\theta\) should be and create a fake \(\theta\) full of zeros.

Answer#

Code below

def transform_inputs(theta, X):

  return ...

theta = np.zeros(...)  

z = transform_inputs(theta, X)
print(z.shape)
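One possible completion is sketched below (an assumption, not the only valid answer): since X has shape (n_trials, n_neurons), \(\theta\) must be a vector of length n_neurons so that \(z = X\theta\) has one entry per trial.

def transform_inputs(theta, X):
  # Weighted sum of each trial's spike rates: z = X @ theta
  return X @ theta

theta = np.zeros(X.shape[1])  # one weight per neuron

z = transform_inputs(theta, X)
print(z.shape)  # (220,): one value per training trial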

B) Sigmoid#

Now, we need to code our sigmoid function:

\[ h = \frac{1}{1 + \textrm{exp}(-z)}\]

Please complete the function below and make sure the resulting plot looks correct to you.

Answer#

Code below

def sigmoid(z):
  """Return the logistic transform of z."""

  return ...

plot_function(sigmoid, r"\sigma", "z", (-10, 10))
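A minimal sketch of the completed function is below; np.exp broadcasts over arrays, so it works for scalars and vectors alike. The plot should be an S-shaped curve rising from 0 toward 1, crossing 0.5 at \(z = 0\).

def sigmoid(z):
  """Return the logistic transform of z."""
  return 1 / (1 + np.exp(-z))

plot_function(sigmoid, r"\sigma", "z", (-10, 10))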

C) Compute h#

We want to compute h now as:

\[ z = X\theta\]
\[h = \frac{1}{1 + \textrm{exp}(-z)}\]

Complete the code below to do this (call the functions you’ve created in parts A and B)!

Answer#

Code below

def compute_h(theta, X):

    # Returns h after passing through sigmoid
    return ...

h = compute_h(theta, X) 
print(h.shape)
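A sketch that simply chains the functions from parts A and B:

def compute_h(theta, X):
    # Pass z = X @ theta through the sigmoid
    return sigmoid(transform_inputs(theta, X))

h = compute_h(theta, X)
print(h.shape)  # (220,)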

D) Compute the cost function#

Now we want to be able to compute the cost function with theta, X, and y as our inputs.

Remember that:

\[J = - \frac{1}{m} \sum_i \left[ y_i \log(h_i) + (1 - y_i) \log(1-h_i) \right]\]

Answer#

Code below

def logistic_loss(theta, X, y):
    # Computes the cost function for all the training samples
    ...

    return total_cost

total_cost = logistic_loss(theta, X, y)
print(total_cost)
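A sketch of one possible completion. The small constant inside the logarithms is our own safeguard (an assumption, not part of the exercise): if any \(h_i\) reaches exactly 0 or 1, log would return -inf.

def logistic_loss(theta, X, y):
    # Computes the cost function for all the training samples
    h = compute_h(theta, X)
    eps = 1e-12  # keep log() finite when h hits exactly 0 or 1
    total_cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return total_cost

total_cost = logistic_loss(theta, X, y)
print(total_cost)  # with theta = 0, h = 0.5 everywhere, so J = log(2) ≈ 0.693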

E) Seeing the results#

Now we can use scipy.optimize.minimize to find the \(\theta\) that minimizes the cost and see our results! This will take some time to run.

from scipy.optimize import minimize

outs = minimize(logistic_loss, theta, args=(X, y))

theta_hat = outs.x

print(theta_hat)
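Note that args=(X, y) is forwarded to logistic_loss on every evaluation. Since we supply no gradient and no bounds, minimize defaults to the BFGS method and estimates the gradient by finite differences, which with 691 parameters is part of why the fit is slow.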

Use \(\hat{\theta}\) to determine the accuracy of the classification. We won’t use a built-in function for this. Remember that you need to get h, then transform this to \(\hat{y}\) (where \(\hat{y}_i\) is 1 if \(h_i\) is greater than or equal to 0.5, and 0 otherwise), then compare \(\hat{y}\) and \(y\).

Answer#

Code below

def evaluate_accuracy(theta_hat, X, y):
  ...
  return ...

acc = evaluate_accuracy(theta_hat, X, y)
print(acc)
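One possible completion (a sketch; the 0.5 threshold follows the exercise text):

def evaluate_accuracy(theta_hat, X, y):
  h = compute_h(theta_hat, X)
  y_hat = (h >= 0.5).astype(int)  # threshold h to get predicted choices
  return np.mean(y_hat == y)      # fraction of trials predicted correctly

acc = evaluate_accuracy(theta_hat, X, y)
print(acc)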

Exercise 2: Assessing accuracy#

How do we know if this accuracy is above chance? Let’s shuffle the trial decisions 1000 times, compute the accuracy of a logistic regression fit on each shuffled dataset, and compare.

We will actually use an sklearn logistic regression model to fit these models, as it’s a little faster than our own implementation. See the cell below for how to use sklearn.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X, y)

acc = model.score(X, y)
print(acc)
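Here model.score(X, y) returns the mean classification accuracy on the data you pass in. Note that we are scoring on the same trials we fit, so this is training accuracy, which flatters the model; the shuffle test below gives us a baseline to compare it against.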

A) Shuffle test#

Now complete the code below to shuffle the decisions taken on each trial. Fit a model on that shuffled data and record its accuracy in shuffled_accuracies. We then plot a histogram of the accuracies across all the shuffled samples and compare it to the true accuracy.

n_shuffles = 1000

shuffled_accuracies = np.zeros((n_shuffles,))

for i_shuffle in range(n_shuffles):

    ... # your code here to shuffle the labels

    model = LogisticRegression()

    model.fit(...)

    shuffled_accuracies[i_shuffle] = model.score(...)

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

ax.hist(shuffled_accuracies, 100);
ax.plot([acc, acc], ax.get_ylim(), 'r', linewidth=2, label='Unshuffled accuracy')
ax.legend()
ax.set(xlabel='Accuracy', ylabel='Count')
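A sketch of one way to fill in the loop (np.random.permutation returns a shuffled copy, leaving y itself intact):

for i_shuffle in range(n_shuffles):

    # Shuffle the labels, destroying any real relationship between X and y
    y_shuffled = np.random.permutation(y)

    model = LogisticRegression()
    model.fit(X, y_shuffled)

    shuffled_accuracies[i_shuffle] = model.score(X, y_shuffled)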

B) Interpreting results#

Are the accuracies for the shuffled data samples higher than you expected? Do you think you can conclude from the true accuracy that the neural responses have a significant amount of information about the decision? Why or why not?

Any initial thoughts about why we couldn’t just check whether our true accuracy is above 50%, which we may naively assume is chance performance?

Your answers here