*Value* composite type

The basic *Value* composite type looks like this:

```
mutable struct Value{opType} <: Number
    data::Float64
    grad::Float64
    op::opType
end
```

The `<: Number` part means that `Value` is a subtype of `Number`. Don't worry about this for now, but we'll discuss it more later.

There are three fields: `data`, `grad`, and `op`. We've seen *two* of these fields before, in the Usage section – `Value.data` and `Value.grad`, representing the number being stored in the *Value* and its gradient.

`Value.op` is something new that we'll be using behind the scenes as part of the gradient tracking. Basically, we'll use it to keep track of what operations and operands were used to create a *Value* object. To do this, we'll also need to define a new composite type to keep track of these operations. Here's what that looks like:

```
struct Operation{FuncType,ArgTypes}
    op::FuncType
    args::ArgTypes
end
```

`Operation.op` will tell us the operation type (addition, multiplication, etc.) and `Operation.args` will point to the operands used in the operation, so that we can access them if we want to.
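Just to make the structure concrete, here's a tiny, purely illustrative snippet that constructs an *Operation* by hand and inspects its fields (in practice we'll never do this ourselves – the arithmetic functions below will create these for us):

```
op = Operation(+, (1.0, 2.0))
println(op.op === +)  # true -- the stored operation is the + function itself
println(op.args)      # (1.0, 2.0) -- the operands that were used
```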

Next, we need a constructor so that we can initialize *Values*:

```
# constructor -- Value(data, grad, op)
Value(x::Number) = Value(Float64(x), 0.0, nothing)
```

Looks a bit complicated, I know, but let's break this down. We can initialize a *Value* object with `Value(x)`, where `x` is some number. The `Value(Float64(x), 0.0, nothing)` part means that when we initialize a *Value* with `Value(x)`, this will set `Value.data = Float64(x)` (casting `x` to a Float64 if it's not already), `Value.grad = 0.0`, and `Value.op = nothing`. The reason that the operation is set to "nothing" here is because we have initialized this *Value* ourselves rather than creating it as the result of an operation.
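As a quick (purely illustrative) sanity check, the casting means we can also pass an integer and still end up with a `Float64` inside:

```
v = Value(3)             # integer input
println(v.data)          # 3.0
println(typeof(v.data))  # Float64
```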

Next, a bit of code so that we can print out values and take a look at them.

```
import Base: show
function show(io::IO, value::Value)
    print(io, "Value(", value.data, ")")
end
```

This lets us print a *Value* and see the number that it's storing. The `import Base: show` at the top means that we're using a base Julia function called "show" and defining what it will do when we pass a *Value* as an input. We'll be doing this a lot, for many different base functions. In the source code, we import all of these functions with one line at the top of the file.

One more quick formality:

```
import Base.==
function ==(a::Value, b::Value)
    return a === b
end
```

The triple-equal-sign `===` checks if two variables are pointing to the same object in memory, so this code means that for two values `a` and `b`, the equality check `a==b` will return `true` only if they are pointing to the same *Value* object, and will return `false` if they're pointing to two different objects (even if both store the same number, gradient, and operation history).

Ok, so that's our basic setup for *Values*. At this point, we should be able to run the following code:

```
x = Value(4.0)
println(x)
# output: Value(4.0)
println(x.data)
# output: 4.0
println(x.grad)
# output: 0.0
println(x.op)
# output:
y = Value(4.0)
println(x==y)
# output: false
z = x # z and x point to the same Value object
println(x==z)
# output: true
```

## Defining *Value* addition

Alright, so we have our basic building block, but now we want to be able to actually do some calculations with it.

Let's start with addition. Bear with me for a second, I'm gonna give you the full block of code and then we'll go through it bit by bit:

```
import Base.+
function +(a::Value, b::Value)
    out = a.data + b.data
    result = Value(out, 0.0, Operation(+, (a, b))) # Value(data, grad, op)
    return result
end
```

The `import Base.+` means that we're importing the base addition function, and `+(a::Value, b::Value)` means that we're defining what the `+` operator will do when used on two *Values*, which we call `a` and `b` for the purpose of the function definition. Basically this will allow us to do `x + y` where `x` and `y` are *Values* rather than regular numbers. Again, in the source code all these imports are in one line at the top.

`out = a.data + b.data` is how we calculate the actual sum of the two input *Values* that will be stored in the output *Value*. Then we create a new *Value* with `result = Value(out, 0.0, Operation(+, (a, b)))`. Hopefully this part looks familiar, since we're using the same constructor syntax as before. This will set `result.data = out` and `result.grad = 0.0`. The only new part here is that instead of setting `result.op = nothing`, we're setting `result.op = Operation(+, (a, b))` to specify that this *Value* was created from an addition operation, and pointing to `a` and `b` as the operands, so that we can access them if we want to.

Alright, I know things are getting a little complicated, but setting things up like this will give us a lot of power to go backwards through operations. For example, using only the parts we've written so far you should be able to run this code:

```
# define two Values
x = Value(2.0)
y = Value(3.0)
# add them together to get a new Value
z = x + y
println(z)
# output: Value(5.0)
# inspect the new Value to see what operation produced it
println(z.op.op)
# output: + (generic function with 194 methods)
# access the Values that were used as operands
println(z.op.args)
# output: (Value(2.0), Value(3.0))
```

## Defining *Value* backpropagation

Alright, now let's try to implement backpropagation for the addition operation. Basically the goal here is to be able to calculate the derivative of the output with respect to each of the inputs in the operation.

Before we actually write the code for this, I'll first show you what we want the end result to look like:

```
# define two Values
x = Value(2.0)
y = Value(3.0)
# add them together to get a new Value
z = x + y
# calculate the derivative of z with respect to the inputs
backward(z)
# the gradient of x tells us the derivative of z with respect to x
println(x.grad)
# output: 1.0
# dz/dx = 1, meaning an increase of 1 in x will lead to an increase of 1 in z.
# we can also check y.grad if we want to
println(y.grad)
# output: 1.0
```

Alright so that's how the end result should look, but now we need to actually write the code to get there. To do this, we're going to define a function called `backprop!()` that takes in a *Value* as an input, and then computes the gradients of the operands that were used to create the *Value*. This will be an internal function (not actually called by the user), but pretty soon we'll also define another function called `backward()` which will perform the full backward pass, calling `backprop!()` along the way.

One of the cool things about Julia is something called "multiple dispatch" – this means that you can define functions with the same name that do things differently based on the type of input that's passed in. If you recall, when we originally defined our *Value* object, we made it so that the object type contains information about the operation that was used to create it: `Value{opType}`.

For example:

```
x = Value(2.0)
println(typeof(x))
# output: Value{Nothing}
```

We'll begin with the `backprop!()` function for this simple case, where the *Value* was not created by an operation, but rather defined by the user. In this case, we will just have the `backprop!()` function do nothing:

`backprop!(val::Value{Nothing}) = nothing`

Now we'll do the harder case, where `backprop!()` is applied to the result of an addition operation, to calculate the gradients of the operands. Let's look at the full code first, and then we'll discuss what each part is doing:

```
function backprop!(val::Value{Operation{FunType, ArgTypes}}) where {FunType<:typeof(+), ArgTypes}
    val.op.args[1].grad += val.grad # update gradient of first operand
    val.op.args[2].grad += val.grad # update gradient of second operand
end
```

I know, looks pretty confusing! Let's start with the function definition line: `backprop!(val::Value{Operation{FunType, ArgTypes}}) where {FunType<:typeof(+), ArgTypes}`. This is just saying that we're defining what the `backprop!()` function will do when the input is a *Value* called `val` that was created in an addition operation.

Then, the function updates two things: `val.op.args[1].grad` and `val.op.args[2].grad`. This is how we access the gradients of the operands that were used to create `val`, so that we can update their gradients.

So how do we update the gradients? Well as we mentioned before, for a simple addition operation $z = x + y$ the derivatives of $z$ with respect to both variables are $\frac{dz}{dx} = 1$ and $\frac{dz}{dy} = 1$. This is because increasing either variable by some amount will cause $z$ to increase by the same amount.

But wait a minute... the code in our `backprop!()` function looks way more complicated than that. We're *not* saying `val.op.args[1].grad = 1` and `val.op.args[2].grad = 1` (setting both gradients equal to 1). Instead we're saying `val.op.args[1].grad += val.grad` and `val.op.args[2].grad += val.grad` – incrementing the operand gradients by the current value of the input *Value*'s gradient. The reason we're doing this is because we need to think ahead a little bit. All this complication isn't necessary for our simple $z = x + y$ example, but we're trying to write this function in a general way so that it'll also work for more complicated examples in the future.

Here's a more complicated example we want to be able to handle (although still using only addition):

```
x = Value(2.0)
y = Value(3.0)
z = x + y
w = z + x # using x a second time
backward(w)
println(x.grad)
# output: 2.0
println(y.grad)
# output: 1.0
println(z.grad)
# output: 1.0
```

This introduces two complications: it has two layers to the calculation, and $x$ is used *twice*. We use $z$ as an intermediate variable to store the result of $x+y$, but ultimately we're interested in $w = z + x$, and we want to find the derivatives $\frac{dw}{dx}$ and $\frac{dw}{dy}$.

This example helps explain the rationale for writing `backprop!()` the way we did. We're calculating the gradients of the operands using the gradient of the input *Value* so that we can take advantage of the chain rule: we can calculate $\frac{dw}{dy} = \frac{dw}{dz}\frac{dz}{dy}$. This will involve two calls to `backprop!()`. First, we'll call `backprop!(w)`, which will calculate the gradient of `z`, and then we'll call `backprop!(z)`, which will calculate the gradient of `y`.

Getting the gradient of `x` is a little more complicated. First, let's quickly prove to ourselves that $\frac{dw}{dx} = 2$. The full equation for $w$ is $w = z + x$. We know that $z = x + y$, so we can rewrite the equation for $w$ as $w = (x + y) + x$. Then, taking the derivative with respect to $x$ gives us $\frac{dw}{dx} = 2$. Since `x` contributes to the value of `w` twice, increasing `x` by some amount will increase `w` by *twice* that amount.

This is the rationale for why our `backprop!()` function increments the gradients it's updating, rather than just setting them to some number. This lets us account for situations where the same *Value* contributes more than once to the final sum. In our example, `x.grad` will be updated twice by the `backprop!()` function – once during the `backprop!(w)` call and once during the `backprop!(z)` call. Both of these updates will increase `x.grad` by 1, leaving us with our final answer of $\frac{dw}{dx} = 2$.

Still with me? Alright, last part. We said before that `backprop!()` is an internal function that won't actually be called by the user. Rather, the user will call a wrapper function `backward()` on the final sum *Value*, and that function will do the full backward pass by calling `backprop!()` as many times as required to calculate the derivatives for all of the input *Values*. So now we need to write the `backward()` function.

First let's take a look at the full code, and then we'll discuss what each part is doing:

```
function backward(a::Value)
    function build_topo(v::Value, visited=Value[], topo=Value[])
        if !(v in visited)
            push!(visited, v)
            if v.op != nothing
                for operand in v.op.args
                    if operand isa Value
                        build_topo(operand, visited, topo)
                    end
                end
            end
            push!(topo, v)
        end
        return topo
    end
    topo = build_topo(a)
    a.grad = 1.0
    for node in reverse(topo)
        backprop!(node)
    end
end
```

When we call `backward(a)` (where `a` is the final result of some operations between *Values*) we want two things to happen. First, we want to fill up an array with `a` itself, and all of the other *Values* that were used to create `a`, sorted in topological order so that a *Value* comes after all of the dependencies used to calculate that *Value* – meaning that `a` should be the last element in the array since everything else is a dependency of `a`.

We can build this array with a recursive depth-first search. This is what the nested function `build_topo()` is doing. `build_topo()` returns the topologically sorted array of all the *Values*, with `a` at the end. Then we set `a.grad = 1`, since the derivative of a variable with respect to itself is 1. Finally, we iterate backwards through the list of *Values* and call `backprop!()` on each one to update the gradients of its operands.

That's it! We're done! With the code we've written up to this point, we can do as many addition operations between *Values* as we want, then do a backward pass on the final sum to calculate its derivative with respect to all the inputs that went into it. We did it!

Now some of you are probably thinking *Wait a minute, we're not done! All we have is an addition operation for Values, and we haven't even started with Tensors yet!* Ok, yeah, that's true. I guess what I mean is we're done with the *difficult part* – the *Value* object structure, the logic of operation-tracking, and gradient-updating through backpropagation. Now that we're done with all that, adding more *Value* operations is easy. All we need to know is what the operation does, and how to calculate the derivative for it. Then we can use almost the exact same code we've already written, with only those parts changed. When we finally get to *Tensors*, the code will be almost exactly the same, except the operations and derivative calculations will be for matrix/vector form.

## Adding some robustness

Ok, time for a short digression. Let's take another look at our *Value* definition:

```
mutable struct Value{opType} <: Number
    data::Float64
    grad::Float64
    op::opType
end
```

As we mentioned before, the `<: Number` part means that `Value` is a subtype of `Number`. This subtyping isn't strictly necessary (and in fact an earlier version of this package didn't have it), but it will make it easy for us to add some robustness to *Values*.

For example, with our code so far, we can only do addition operations between *Values*. This is a good start, but ideally we'd also be able to do operations between *Values* and regular numbers, and have the output of those operations be a *Value*. Luckily, we can implement this with one line of code:

`Base.promote_rule(::Type{<:Value}, ::Type{T}) where {T<:Number} = Value`

This line defines a promotion rule, which specifies that for operations involving a *Value* and a *Number*, the *Number* should be converted to a *Value* for the purposes of the operation, and the result should be a *Value*. Now we should be able to run the following code without a problem:

```
test = Value(2.0) + Value(3.0) # Value + Value
println(test)
# output: Value(5.0)
test = Value(2.0) + 3.0 # Value + Number
println(test)
# output: Value(5.0)
test = 2.0 + Value(3.0) # Number + Value
println(test)
# output: Value(5.0)
```

## More *Value* operations

Alright, so let's add some more *Value* operations. We'll start with multiplication. Here's the code, for both the operation and the backward pass:

```
import Base.*
function *(a::Value, b::Value)
    out = a.data * b.data
    result = Value(out, 0.0, Operation(*, (a, b))) # Value(data, grad, op)
    return result
end
# backprop for multiplication operation
function backprop!(val::Value{Operation{FunType, ArgTypes}}) where {FunType<:typeof(*), ArgTypes}
    val.op.args[1].grad += val.op.args[2].data * val.grad
    val.op.args[2].grad += val.op.args[1].data * val.grad
end
```

That's it! Told you it was easy! The `*(a::Value, b::Value)` function is almost exactly the same as the addition function we wrote before, except that we're setting `out = a.data * b.data` and recording the operation as `Operation(*, (a, b))`.

The `backprop!()` function is also very similar to the one we wrote for addition, with just a couple of small changes. First of all, we're now using `where {FunType<:typeof(*), ArgTypes}` in the function definition to specify that this is the version of `backprop!()` to use when the input variable was created with a multiplication operation (again, the cool thing about multiple dispatch is that we can define several versions of a function with different input types).

The second minor difference is that we need to change the way the derivatives are calculated, since we're dealing with multiplication rather than addition. For a multiplication operation $z = xy$, the derivatives of $z$ with respect to $x$ and $y$ are $\frac{dz}{dx} = y$ and $\frac{dz}{dy} = x$. The two lines inside `backprop!()` are just saying this in code – the gradient of each operand is incremented by the value of the other operand multiplied by `val.grad` (the gradient of the operation's result), which is what gives us the chain rule.

With the code we've written so far, we can do things like this:

```
x = Value(2.0)
m = Value(4.0)
b = Value(7.0)
y = m*x + b
backward(y)
println(m.grad)
# output: 2.0
println(x.grad)
# output: 4.0
println(b.grad)
# output: 1.0
```

By the way, Julia will still take care of the order of operations for us here, so we could have written `y = b + m * x` and gotten the same answer.

A lot of the operations will be like the multiplication case, where we'll need to write a new `backprop!()` method. However, sometimes we can find a clever way to do things that avoids this. For example, this is how we'll implement *Value* subtraction:

```
import Base.-
# negation
function -(a::Value)
    return a * -1
end
# subtraction
function -(a::Value, b::Value)
    return a + (-b)
end
```

The first function, `-(a::Value)`, allows us to negate *Values* with a minus sign. This can be done by multiplying the *Value* by $-1$, which already works thanks to our `*(a::Value, b::Value)` function and the promotion rule we just defined. The second function, `-(a::Value, b::Value)`, allows us to do subtraction with *Values* by negating the second *Value* and then adding them together.

Pretty clever, right? This way we don't need to write a new `backprop!()` function for subtraction, because we've turned the subtraction operation into a combination of multiplication and addition.

Anyway, from here it's just a matter of adding more operations so that we can do more calculations with our *Values*. These are the operations currently supported:

- **Addition**
- **Subtraction**
- **Multiplication**
- **Division**
- **Exponents**
- **e^x**
- **log()**
- **tanh()**

If you've understood everything up to this point, you should be able to read all the source code for the *Values* and make sense of it. If there are any operations you'd like to see added, either let me know and I'll try to add them, or you can also write them yourself and submit a pull request!

*Tensor* composite type

*Tensors* work almost exactly the same way as *Values*, except with a few extra complications that come with dealing with vectors and matrices. But the fundamentals are basically the same. We'll track operations with our *Operation* objects, override several base Julia functions to work for *Tensor* operations, and implement the backward pass with an internal `backprop!()` function and a user-facing `backward()` function.

The *Operation* object structure is the same as before:

```
struct Operation{FuncType,ArgTypes}
    op::FuncType
    args::ArgTypes
end
```

And here's our definition of the *Tensor* object structure:

```
mutable struct Tensor{opType} <: AbstractArray{Float64, 2}
    data::Array{Float64, 2}
    grad::Array{Float64, 2}
    op::opType
end
```

As you can see, it's very similar to the *Value* object, except that the `Tensor.data` and `Tensor.grad` fields are arrays rather than numbers. Note also that just as `Value` is a subtype of `Number`, `Tensor` is a subtype of `AbstractArray{Float64, 2}`, a 2-dimensional array of type `Float64`.

We're going to have two different constructors for our *Tensor* type - one to create a *Tensor* from a 2D array, and another to create a *Tensor* from a 1D array. Here's the 2D constructor:

`Tensor(x::Array{Float64,2}) = Tensor(x, zeros(Float64, size(x)), nothing)`

Again, same basic idea as the *Value* constructor, except that we're dealing with arrays instead of numbers. Note that the constructor requires a `Float64` array as input and won't accept other number types. Maybe I'll change that later to make it more robust.

Besides the 2D constructor, we also want the option to pass a 1D array and have it create either a row vector or column vector *Tensor*. For example, we want to be able to write this code, where we pass in a 1D array with shape (3,) and get a row vector *Tensor* with shape (1,3):

```
x = [1.0, 2.0, 3.0]
println(size(x))
# output: (3,)
x = Tensor(x)
println(size(x))
# output: (1,3)
```

Here's the 1D constructor that will let us do that:

```
function Tensor(x::Array{Float64, 1}; column_vector::Bool=false)
    if column_vector
        # column vector - size (N,1)
        data_2D = reshape(x, (length(x), 1))
    else
        # DEFAULT row vector - size (1,N)
        data_2D = reshape(x, (1, length(x)))
    end
    Tensor(data_2D, zeros(Float64, size(data_2D)), nothing) # Tensor(data, grad, op)
end
```

Pretty simple: we just take in the 1D array and reshape it to a row vector with `reshape(x, (1, length(x)))` by default, or to a column vector with `reshape(x, (length(x), 1))` if the user sets `column_vector=true`.

Just a couple more quick things, which are also similar to our *Value* setup. The following code lets us print out *Tensors*, makes `backprop!()` do nothing in cases where a *Tensor* was defined by the user rather than being created in an operation, and specifies that the equality check `==` will return `true` only if the two variables actually reference the same *Tensor* object:

```
import Base.show
function show(io::IO, tensor::Tensor)
    print(io, "Tensor(", tensor.data, ")")
end
backprop!(tensor::Tensor{Nothing}) = nothing
import Base.==
function ==(a::Tensor, b::Tensor)
    return a === b
end
```

So that covers the necessary parts. Now we can also add a couple more things just for convenience. Here's a line of code that will allow us to call `size(Tensor)` to get the shape of the `Tensor.data` field (which should be the same as the `Tensor.grad` field):

`Base.size(x::Tensor) = size(x.data)`

Here's a line of code that allows us to use standard array index notation to access elements of `Tensor.data`:

`Base.getindex(x::Tensor, i...) = getindex(x.data, i...)`

So for example, if we have a *Tensor* called `a`, we can access elements of `a.data` by writing `a[i,j]` instead of having to write `a.data[i,j]`.

Here's a similar line that allows us to *set* elements using standard array index notation:

`Base.setindex!(x::Tensor, v, i...) = setindex!(x.data, v, i...)`

So if we want to set an element of `a` to `7`, for example, we can just write `a[i,j] = 7`, rather than having to write `a.data[i,j] = 7`. So with those three lines added, we should now be able to run the following code:

```
x = Tensor([1.0 2.0; 3.0 4.0])
println(size(x))
# output: (2,2)
x[2,2] = 5.0
println(x[2,2])
# output: 5.0
```

Ok, now let's try defining a *Tensor* operation. When we were learning about how *Values* work we started with addition because that seemed like the easiest. But for *Tensors*, addition will actually be a little tough because of some shape-broadcasting we'll need to do. So we'll start with matrix multiplication, since that will be easier. Here's the code:

```
import Base.*
function *(a::Tensor, b::Tensor)
    out = a.data * b.data
    result = Tensor(out, zeros(Float64, size(out)), Operation(*, (a, b))) # Tensor(data, grad, op)
    return result
end
```

Very similar to what we were doing with *Values*, except that this time it's matrix multiplication. But same idea. We do the matrix multiplication with `out = a.data * b.data`. Then we store the resulting matrix `out`, along with an all-zero gradient of the same shape, in a new *Tensor* called `result`, and record the operation as `Operation(*, (a, b))`.

And here's the `backprop!()` function:

```
function backprop!(tensor::Tensor{Operation{FunType, ArgTypes}}) where {FunType<:typeof(*), ArgTypes}
    tensor.op.args[1].grad += tensor.grad * transpose(tensor.op.args[2].data)
    tensor.op.args[2].grad += transpose(tensor.op.args[1].data) * tensor.grad
end
```

Again, basically the same idea as with the *Value* `backprop!()` functions. The only difficult part is that now everything has to be done for matrices, which makes the actual calculations more complicated and less intuitive. If you want, you can always work out a simple example on paper to prove to yourself that the gradient updates in this `backprop!()` function are actually correct.

Finally, here's the code for the full backward pass:

```
function backward(a::Tensor)
    function build_topo(v::Tensor, visited=Tensor[], topo=Tensor[])
        if !(v in visited)
            push!(visited, v)
            if v.op != nothing
                for operand in v.op.args
                    if operand isa Tensor
                        build_topo(operand, visited, topo)
                    end
                end
            end
            push!(topo, v)
        end
        return topo
    end
    topo = build_topo(a)
    a.grad .= 1.0
    for node in reverse(topo)
        backprop!(node)
    end
end
```

Again, almost exactly the same as for the *Values*.
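Here's a small end-to-end sketch putting the pieces together (assuming all of the *Tensor* code above has been loaded); the expected gradients are easy to verify by hand:

```
W = Tensor([1.0 2.0; 3.0 4.0])
x = Tensor([1.0, 1.0]; column_vector=true)  # shape (2, 1)
y = W * x                                   # data is [3.0; 7.0]
backward(y)
# W.grad = y.grad * transpose(x.data), which is all ones here (2x2)
println(W.grad)
# x.grad = transpose(W.data) * y.grad, which is the column [4.0; 6.0]
println(x.grad)
```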

## More *Tensor* operations

For the sake of completeness, I'm going to give you the code for all of the *Tensor* operations needed to write a basic neural network. For now, working through the details to understand them will be left as an exercise for the reader, although I'll probably try to come back to this section to write a more complete description when I have some more free time.

Here's addition:

```
import Base.+
function +(a::Tensor, b::Tensor)
    # broadcasting happens automatically for row-vector
    out = a.data .+ b.data
    result = Tensor(out, zeros(Float64, size(out)), Operation(+, (a, b))) # Tensor(data, grad, op)
    return result
end
function backprop!(tensor::Tensor{Operation{FunType, ArgTypes}}) where {FunType<:typeof(+), ArgTypes}
    if size(tensor.grad) == size(tensor.op.args[1].data)
        tensor.op.args[1].grad += ones(size(tensor.op.args[1].data)) .* tensor.grad
    else
        # reverse broadcast
        tensor.op.args[1].grad += ones(size(tensor.op.args[1].grad)) .* sum(tensor.grad, dims=1)
    end
    if size(tensor.grad) == size(tensor.op.args[2].data)
        tensor.op.args[2].grad += ones(size(tensor.op.args[2].data)) .* tensor.grad
    else
        # reverse broadcast
        tensor.op.args[2].grad += ones(size(tensor.op.args[2].grad)) .* sum(tensor.grad, dims=1)
    end
end
```

Hint about this one: the complicated parts are related to broadcasting, so that we can add a row vector to a matrix and have it automatically be added to every row (like adding biases in a neural net).
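Here's a small (illustrative) example of that broadcasting in action – adding a row-vector "bias" to a matrix and checking that the bias gradient gets summed over the rows:

```
A = Tensor([1.0 2.0; 3.0 4.0])
b = Tensor([10.0, 20.0])  # row vector, shape (1, 2)
C = A + b                 # b is broadcast across both rows of A
backward(C)
println(b.grad)
# output: [2.0 2.0]  -- the incoming gradient (all ones) summed over the rows
```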

Here's the ReLU activation function:

```
function relu(a::Tensor)
    out = max.(a.data, 0)
    result = Tensor(out, zeros(Float64, size(out)), Operation(relu, (a,))) # Tensor(data, grad, op)
    return result
end
function backprop!(tensor::Tensor{Operation{FunType, ArgTypes}}) where {FunType<:typeof(relu), ArgTypes}
    tensor.op.args[1].grad += (tensor.op.args[1].data .> 0) .* tensor.grad
end
```
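A quick (illustrative) check of the ReLU forward and backward behavior:

```
a = Tensor([-1.0 2.0; 3.0 -4.0])
r = relu(a)
backward(r)
println(r)
# output: Tensor([0.0 2.0; 3.0 0.0])
# a.grad is 1.0 wherever a.data > 0 and 0.0 elsewhere:
println(a.grad)
# output: [0.0 1.0; 1.0 0.0]
```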

Here's the combined softmax activation and cross entropy loss:

```
function softmax_crossentropy(a::Tensor, y_true::Union{Array{Int,2},Array{Float64,2}}; grad::Bool=true)
    ## implementing softmax activation and cross entropy loss separately leads to very complicated gradients
    ## but combining them makes the gradient a lot easier to deal with
    ## credit to Sendex and his textbook for teaching me this part
    ## great textbook for doing this stuff in Python, you can get it here:
    ## https://nnfs.io/
    # softmax activation
    exp_values = exp.(a.data .- maximum(a.data, dims=2))
    probs = exp_values ./ sum(exp_values, dims=2)
    probs_clipped = clamp.(probs, 1e-7, 1 - 1e-7)
    # deal with 0s and 1s
    # basically just returns an array with the probability of the correct answer for each batch
    correct_confidences = sum(probs_clipped .* y_true, dims=2)
    # negative log likelihood
    sample_losses = -log.(correct_confidences)
    # loss mean
    out = [sum(sample_losses) / length(sample_losses)]
    if grad
        # it's easier to do the grad calculation here because doing it separately will involve redoing a lot of calculations
        samples = size(probs, 1)
        # convert from one-hot to index list
        y_true_argmax = argmax(y_true, dims=2)
        a.grad = copy(probs)
        for samp_ind in 1:samples
            a.grad[samp_ind, y_true_argmax[samp_ind][2]] -= 1
            ## this syntax y_true_argmax[i][2] is just to get the column index of the true value
        end
        a.grad ./= samples
    end
    # reshape out from (1,) to (1,1)
    out = reshape(out, (1, 1))
    result = Tensor(out, zeros(Float64, size(out)), Operation(softmax_crossentropy, (a,))) # Tensor(data, grad, op)
    return result
end
# softmax_crossentropy backprop is empty because gradient is easier to calculate during forward pass
function backprop!(tensor::Tensor{Operation{FunType, ArgTypes}}) where {FunType<:typeof(softmax_crossentropy), ArgTypes}
end
```

Note: this is the most complicated one by far, and is also the odd one out in that the gradient is actually calculated during the forward pass (unless told not to), so the `backprop!()` function just does nothing.
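To make the usage a little more concrete, here's a tiny (illustrative) sketch with made-up numbers – two samples, three classes, one-hot labels:

```
logits = Tensor([2.0 1.0 0.1; 0.5 2.5 0.3])
y_true = [1 0 0; 0 1 0]   # one-hot labels, one row per sample
loss = softmax_crossentropy(logits, y_true)
println(size(loss))
# output: (1, 1) -- the mean loss, stored as a 1x1 Tensor
backward(loss)
# backprop! for this op is a no-op; logits.grad was already filled in during the
# forward pass with (softmax(logits) .- y_true) ./ number_of_samples
```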

So yeah, sorry to leave you guys with a complicated one here. I'll probably come back later and try to write a more thorough description of this one when I have some more free time in the future.