
2.3 Neural Encoding: Feature Selection


In the last section we argued that a good basic coding model for many neural systems is a combination of a linear filter, or feature, that extracts some component from the stimulus, and a nonlinear input-output function that maps the filtered stimulus onto the firing rate.

Our goal in this section is to understand how to find the components of such a model.

You'll be doing this for yourself in the homework.

We'll then go on to think about how to modify this model to incorporate other important
neuronal properties.

Let's step back to the original problem, which is to build a model like this.

To build this general model our problem is dimensionality.

Let's cast our minds back to the case of the movie we showed the retina.

We can define a movie in terms of the intensity of three colors at every pixel in, say, a one-megapixel image.

And to capture any time dependence, we'll also need to keep enough frames of the movie to go back for maybe a second.

So each example of a stimulus is given by 3 million values times maybe 100 time points, or on the order of 300 million values.

That's just one stimulus.

Sampling the distribution of possible stimuli, when each one is specified by hundreds of millions of values, is just impossible.

It would be impossible to fill up that response distribution even if our stimulus were just 100-dimensional.

The amount of data needed is unmanageable.

So we need a strategy to find a way to pull out one or two or a few meaningful components in that image, so that we have any hope of even computing this response function.

So to proceed at all we need to find the feature that drives the neuron.

To do this, we'll sample the responses of this system to many stimuli, not enough to build the complete model, but just enough that we can learn what it is that really drives the cell.

That will let us go from a model that depends on arbitrary characteristics of the input to one that depends only on the key characteristics.

So, we're going to start with a very high dimensional description.

Let's say, a time-varying waveform or an image.

And pick out a small set of relevant dimensions; that's our goal.

So how do we think about an arbitrary stimulus as a high-dimensional vector?

So we start with our s(t).

What we're going to do is discretize it: we take time t1.

We take the value of the stimulus at that time and call it s1; then at time t2 we take the value of the stimulus again, and we plot these two points in this 2-dimensional space.

As we keep taking more and more time points, that gives us more and more axes in this diagram in which
we're now plotting that stimulus.

So this is s(t), plotted as the components of its representation at these different time points.
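
As a concrete, toy illustration of that discretization, here is a minimal Python/NumPy sketch; the particular stimulus function and the 10 ms time step are invented for the example, not taken from the lecture:

```python
import numpy as np

# Hypothetical continuous stimulus s(t); any time-varying signal would do here.
def s(t):
    return np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 11 * t)

dt = 0.01                         # assumed time step (10 ms)
times = np.arange(0.0, 1.0, dt)   # sample times t1, t2, ..., covering one second
s_vec = s(times)                  # the stimulus as a point in a 100-dimensional space

print(s_vec.shape)                # (100,) -- one coordinate per time point
```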

So now we want to sample the responses of the system to a variety of stimuli so we can characterize what it is about the input that triggers responses.

One common and useful method to use is Gaussian white noise.

Gaussian white noise is a randomly varying input, which is generated by choosing a new Gaussian random number
at each time step.

In practice, the time step sets a cut-off on the highest frequency that's represented in the signal.

White noise therefore contains a very broad spectrum of frequencies, and in fact, depending on how the noise is smoothed in practical applications, essentially all frequencies up to that cutoff are present in the signal with equal power.

Here's an example of a white noise input that's been smoothed a little bit.

You'll be using an example of white noise in your problem sets.
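
If it helps to see it concretely, here is a minimal sketch of how such a smoothed white-noise stimulus might be generated; the time step, length, and boxcar kernel are arbitrary choices for illustration, not the problem set's:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

dt = 0.001                             # assumed time step; sets the highest frequency present
n_steps = 10_000
noise = rng.standard_normal(n_steps)   # a new Gaussian random number at each time step

# Light smoothing with a short boxcar kernel, as is often done in practice.
kernel = np.ones(5) / 5.0
smoothed = np.convolve(noise, kernel, mode="same")
```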

Now each chunk of white noise, let's say a hundred time units long, can be plotted in a hundred dimensional space.

The axes I've drawn here might describe the value at time t1, the value at time t2, et cetera.

As we continue to stimulate with new examples of white noise, the different examples are plotted in these different points.

And they start to fill up a distribution.

Remember that each one of these examples is chosen at random.

The prior distribution is the distribution of stimulus points, independent of what the neural system is doing.

Because we've constructed our white noise from Gaussian random numbers, the distribution of the stimulus along any axis is Gaussian.

If we were to take this multidimensional distribution and project it onto just this one axis, we would find that all of those stimuli fill out a one-dimensional Gaussian distribution.

Now, a multidimensional distribution that's Gaussian in all directions is called a multivariate Gaussian.

The beauty of such a distribution is that it's Gaussian no matter how you look at it.

Suppose we chose to look at the distribution of stimuli projected onto some other dimension, not one of our original time points, but maybe some linear combination of them.

Let's take a new vector and now project our stimuli onto that new vector.

We would find that even along that new vector, the distribution is again Gaussian.

Now let's take a look at the stimuli that trigger spikes to happen.

Here's one and let's say there's a bunch more.

You'll notice that there's some structure in this group of points.

Ordinarily, if I were really plotting an arbitrary choice of three of the hundred possible dimensions, I wouldn't be able to see this.

I need to search for the right way to rotate this hundred dimensional cloud, so that I can see that structure.

One way to find a good coordinate axis is to take the average of these points.

Then the vector defined by that average, the spike-triggered average, is in general a good direction in which to see structure.

So let's imagine we now take this vector through the data.

And let's project all these spike-triggering points onto that vector.

They're all going to have projections that are large and similar to one another, so this will be the distribution of points projected onto the spike-triggered average.

I wanted to give you a geometrical perspective on what you're doing, though it might seem a little abstract.

Operationally it's quite straightforward and intuitive.

Let's say you gave this system a long random white noise stimulus like this one, which is just a scalar quantity that varies randomly in time.

And the system, this neuron spiked during this presentation several times.

Here's a spike, here's another spike, here's another spike.

For every time there's a spike, we look back in time, at the chunk of stimulus preceding that spike, and grab it.

Put it down in this list.

This will be one example of your spike-triggering stimulus set.

We repeat that for every spike in our data, and then the spike-triggered average is just the average of all of these examples.

That's drawn over here.

So what you're doing is approximating whatever is common to all of the stimuli that triggered a spike.
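
In code, the procedure just described might look something like this (a sketch with assumed variable names, not the course's own implementation):

```python
import numpy as np

def spike_triggered_average(stimulus, spike_indices, window):
    """Average the chunk of stimulus preceding each spike.

    stimulus      : 1-D array of stimulus values, one per time step
    spike_indices : time-step indices at which spikes occurred
    window        : how many time steps to look back before each spike
    """
    chunks = [stimulus[i - window:i]     # grab the stimulus chunk preceding spike i
              for i in spike_indices
              if i >= window]            # skip spikes too early to have a full window
    return np.mean(chunks, axis=0)       # the spike-triggered average
```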

So if all goes well, you'll see that this average is much less noisy than the examples.

And it's generally quite sensible looking.

So what this system apparently likes to see is an input that generally ramps up a bit, and then goes down.

That is the feature that triggers this system to fire.

Here's an example of the same procedure, but when the stimulus is not just a scalar value, but more like an image.

Every column here is an image with pixels of different colors, maybe one that's been unwrapped into a single vector of values.

The spike-triggered average, now averaged over these chunks of spatiotemporal data that precede every spike, has both a time dimension and a space dimension.

Now let's go back to dealing only with time. In the time representation that we introduced before, our spike-triggered average is some vector.

We'll take it to be a unit vector.
Let's call it f. This is the object of our desire, the single feature that captures a relevant component of the stimulus.

Now, recall the previous section of the lecture.

What do we do with this identified feature?

We used it as a linear filter. Linear filtering, we said, is the same as convolution.

And it's also the same as projection.

Let's take some arbitrary stimulus s, remember we can represent it as a point in this high dimensional space.

And if we filter it by this spike-triggered average, that's the same as projecting it as a vector onto the spike-triggered average, which is also a vector.

So what does that mean? We have this vector of our stimulus s. To project it onto f means that we take its component that's aligned along the direction of f.
This is s·f.

So that filtering operation takes the high-dimensional s and extracts from it by projection only this value.

Only its length along the vector defined by f.
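
In code, that filtering-as-projection is just a dot product; in this sketch, f is assumed to be the unit-norm spike-triggered average and s a stimulus chunk of the same length (both invented here for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

f = rng.standard_normal(100)
f = f / np.linalg.norm(f)        # unit-vector feature, as in the lecture

s = rng.standard_normal(100)     # an arbitrary stimulus chunk of the same length
s1 = np.dot(s, f)                # the single number the filter extracts: s . f
```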

Okay. Now we've seen that a good way to find a feature that drives the neural system is to stimulate with white noise and use reverse correlation to compute the spike-triggered average.

This is a good approximation to our feature, f1.

Now, how do we proceed to compute the input/output function of the system with respect to this feature?

Remember that we're trying to find the probability of a spike, given the stimulus, but where the stimulus, here, is now replaced only by the component of the stimulus that's extracted by the linear filter that we've identified.

We can find this relationship from quantities we can measure in data by rewriting it using an identity about conditional distributions known as Bayes' rule.

We can rewrite this probability of a spike, given s1, in terms of the probability of s1 given a spike,

times the probability that there's a spike, divided by the probability of s1.
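
Written out, the identity being used is Bayes' rule applied to the filtered stimulus value s1:

$$ P(\mathrm{spike} \mid s_1) = \frac{P(s_1 \mid \mathrm{spike}) \, P(\mathrm{spike})}{P(s_1)} $$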

Let's see what this means.

We now have this in terms of two distributions: here the prior, though now it's the prior only with respect to that one variable we've extracted from the stimulus.

And here, what we call the spike-conditional distribution.

We run a long stimulus and collect a bunch of spikes.

We project the stimulus onto our feature, f1, extracting component s1.

Here's s1 and here are the spikes.

We use this long stimulus run to make a histogram of s1 here.

For our white noise experiments, that histogram is just going to be Gaussian.

We then pick out the values of s1 at the times of spikes.

Here they appear to occur when s1 is particularly large, and we make a distribution of those.

Hopefully, that distribution is different from the prior.

We take their ratio, as we see here, and scale it by the overall probability of a spike.
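
A minimal sketch of that histogram-ratio estimate (the bin count, variable names, and handling of empty bins are assumptions made for illustration):

```python
import numpy as np

def estimate_nonlinearity(s1_all, s1_at_spikes, p_spike, n_bins=30):
    """Estimate P(spike | s1) as P(s1 | spike) * P(spike) / P(s1)."""
    edges = np.linspace(s1_all.min(), s1_all.max(), n_bins + 1)
    prior, _ = np.histogram(s1_all, bins=edges, density=True)              # P(s1)
    conditional, _ = np.histogram(s1_at_spikes, bins=edges, density=True)  # P(s1 | spike)
    ratio = np.divide(conditional, prior,
                      out=np.zeros_like(conditional), where=prior > 0)
    return edges, p_spike * ratio                                          # scaled by P(spike)
```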

So now we have a method to calculate our nonlinear input/output function that's associated with f1.

Let's look at a couple of examples.

Let's say that our neuron fires at random times. When we build a histogram of our stimulus, that is, the prior distribution, and a histogram of the special stimuli that trigger spikes, we'll find that those stimuli are actually not so special: the stimulus points associated with spike times look like a random sampling from the Gaussian prior, so their distribution is just the same as the prior.

This could mean either that the stimulus had nothing to do with the firing of the neuron in the first place, or else that we chose the wrong component and filtered out whatever it was about the stimulus that this neuron is actually responding to.

So we get an input/output curve which is just flat, with no variation in response as a function of s1.

What we want to see is a nice difference between the prior and the spike conditional distribution, which is going to result in an input/output curve that has structure that's interesting.

So here our input/output curve tells us that the neuron, as we saw previously, tends to fire, that is, the model predicts a high firing rate, when the projection onto our identified feature is large.

This is success.

So now let's go back to the basic coding model that we developed and think about what's missing here.

We managed to get our dimensionality all the way down to 1.

Was that necessarily a good idea?
Let's relax that a bit and add back something potentially important:

The possibility of sensitivity to multiple features.

Now the need for this should be intuitive. We base all our decisions on many input features and here's one of the most important ones.

Unless you have a brain full of Pamela Anderson neurons (though personally I think I have only one Pamela Anderson neuron).

Generally we choose a partner or a friend on the basis of many characteristics: flexibility, generosity, the ability to cook, political affinity.

There are also many characteristics that enter into the description of a person that may not matter to you at all for their suitability as a friend, maybe their eye color or their height or their typing speed.

These are all filtered out of this specific decision.

We select some subset of relevant features from the whole sea of possible descriptors.

To express this in terms of the models that we've been looking at so far, what we mean is that now we want to consider that there's not just one but several filters, each selecting a different component of the input.

The non-linear response function now combines the responses of those different components in maybe non-trivial ways.

Let's take a simple auditory example.

Let's imagine we have a chord-detecting neuron.

So f1, the first feature, selects frequency one, f2 selects the second frequency.

But only when both frequency one and frequency two are present in the input will we get a large firing rate,
given this nonlinearity. One could imagine many other possible ways of combining features in such a nonlinearity.
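
As a toy sketch of such a two-feature model (the filters and the AND-like nonlinearity below are invented for illustration, not taken from the lecture):

```python
import numpy as np

# Two assumed features: filters tuned to two different frequencies (values invented).
t = np.arange(100) * 0.001
f1 = np.sin(2 * np.pi * 50 * t)
f1 = f1 / np.linalg.norm(f1)
f2 = np.sin(2 * np.pi * 120 * t)
f2 = f2 / np.linalg.norm(f2)

def firing_rate(s, threshold=1.0, r_max=100.0):
    """High rate only when the stimulus projects strongly onto BOTH features."""
    s1, s2 = np.dot(s, f1), np.dot(s, f2)
    return r_max if (s1 > threshold and s2 > threshold) else 0.0
```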

Let's go back to our picture of the white noise experiment to think how we could find these features in data.

So we saw that we could take the average of the points and compute the spike-triggered average.

But we can extract more information from that cloud of points.

One could also, for example, compute the next-order moment of that cloud: its covariance.

To do this, we apply a method something like principal component analysis, or PCA.

I realize that most of you probably aren't familiar with this technique, and we don't really have time here to build up the
tools that we need to derive it properly.

So I'll just describe a little bit about what it does.

Its job is to find low dimensional structure, in that cloud of points.

So PCA is a general, famous, and kind of magical tool for discovering low dimensional structure in high dimensional data.

Here's an illustration of what it gives you.

Let's say you have a cloud of data where each data point has an x, y, z coordinate, and we plot it in this three-dimensional space.

But in fact, unbeknownst to us, all the data actually lie on a two dimensional plane.

So if we run PCA on this data, we'll discover that there are two so-called principal components, and these components correspond to an orthogonal set of vectors that span that two-dimensional cloud.

So this feat of discovery doesn't look super-impressive when all we're doing is reducing three dimensions to two.

We could have just rotated our axes around and noticed that.

But what if, as is generally the case, we start with hundreds of dimensions and we're hoping that our data has some low-dimensional structure?

We'll never find it by plotting one coordinate against another.

The dimensions that are important are some unknown linear combination of the original coordinates.

Here we had x, y, and z and our plane is defined by some linear combination of our original axes.

Generally the dimensions that pick out the relevant structure in the data will be some linear combination of our stimulus
coordinates in their original basis, perhaps time or space.

For those of you with some linear algebra, PCA gives us a new basis set in which to represent our data; a basis set that generally is a lot smaller than our original representation.

So we get a lot of compression.
And also, it's a basis set that's well matched to our particular data set, unlike a standard basis set, like a Fourier basis,
for example.
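
For a flavor of what PCA actually computes, here is a minimal sketch using a plain eigendecomposition of the covariance matrix; library routines (for example in scikit-learn) would do the same job, and the function name and interface here are invented:

```python
import numpy as np

def principal_components(X, k=2):
    """Top-k principal components of data X (rows = samples, columns = dimensions)."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh, since the covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # sort directions by decreasing variance
    components = eigvecs[:, order[:k]]       # the k directions of largest variance
    return components, Xc @ components       # components and the projected data
```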

So here's a fun example that uses PCA. Although it takes a lot of pixels to make a picture of a face, it turns out that faces have a lot of common structure.

And most faces can be pretty well reconstructed from a small set, maybe seven or eight, of principal components computed from a big bunch of faces.

So these are called eigenfaces.

If we have a new face that we want to fit with these eigenfaces, we can construct it as a linear combination of, say, Fred, George, Bob, and Bill.

So if we represent any new face in terms of sums of these computed eigenfaces, instead of the intensity values of each pixel in the image, we can describe the face using seven or eight numbers instead of hundreds.

Dimensionality reduction using PCA has a lot of practical uses in neuroscience experiments too.

For example, it can be used to sort out spike waveforms that were recorded on the same electrode from two or more different neurons.

Let's say one neuron would give a spike that has a nice clean signal that looks like this.

The other neuron would give a clean signal that has a different shape, maybe a little broader.

Any particular recording is going to look like a noisy example of one or the other of these waveforms.

PCA can pick out two components that capture the largest amount of variance in the data.

Now you project each noisy data point, each example of a recording onto these two components.

Usually, this will keep the two components that span the waveforms of the two neurons' spikes.

All the components that get thrown away are just noise.

You can then plot all of the different data points that were recorded, projected onto those two features, so now you're seeing every data point in the space defined by feature one and feature two.

And in this new two-dimensional coordinate frame, the waveforms from the two cells are now clearly separable.

So let's go back to white noise and neural coding.

Here's an example from neural coding where PCA was used to find multiple features and where that turned out to be very helpful.

Here you're looking at a scatter plot of all the stimuli that drove a retinal ganglion cell to fire. Each stimulus, each blue dot, was 100 time steps of a white-noise flicker.

So just a scalar that varied in time.

But now we've reduced each one of those stimuli to a point in two dimensions by projecting it onto the two features that we found, feature one and feature two.

For this particular retinal ganglion cell, the spike-triggered average was close to zero and this picture shows you why.

When we look at the stimuli that trigger spikes, it turns out that two groups of stimuli drove the neuron, and the average of the entire set is approximately here.

It's near zero.

But suppose we just take the average of the right-hand group.

So we take this point in the middle.

We can then look at what that point looks like as a feature; what we found was that it looks something like this.

This is a feature that likes to see the light initially go down, and then go up.

This is what we might call an on feature: a transition toward a brighter light.

If we now look at the average of the other group of stimuli, that turns out to be a feature that is almost the same as the on feature, except that it is an off feature.

So this neuron both likes it when the light goes on, and it likes it when the light goes off.

If we average all of those stimuli together, we'd get nothing.

But if we use this technique, where we now could pull out two different features, and plot our stimuli in that two dimensional space, now that structure is revealed.

It's important to realize that the two features, f1 and f2, that we found here are not themselves the on and the off feature, but the analysis allowed us to find a coordinate system in which we could see that structure.

Okay. I've been making a lot of use of your linear algebra neurons.

Let's give them a bit of downtime with the relaxing view of a little eigenpuppy.

Although we were not necessarily able to go into the details, I hope you got the flavor of the construction of these kinds of models, and a sense for why multidimensional models can be useful.

There are a lot of good resources to learn more about these techniques, and we will post them on the website.