Hello all!
I'm glad to be here...
I have written a paper on a new type of network for NLP, but I have lingering questions about how efficient and effective it is. This is a link to it on my blog...
ABSTRACT submitted
So the following is a more formal framing of the efficiency and effectiveness of this network, which I would appreciate input on.
I have a type of network. It has one output node connected to all 4 hidden-layer nodes, which are connected to the 12 input-layer nodes in the following way: each hidden node is connected to exactly 9 input nodes, and the sets of 3 input nodes *not* used by each hidden node are pairwise disjoint. This just means we can't have any 2 hidden nodes connected to the exact same nine input nodes; in general, no two hidden nodes share any missing input node between them. (Since the 4 hidden nodes each exclude 3 of the 12 inputs, and the exclusions are disjoint, the exclusions partition the input layer.)
If this is clear so far, there is another condition: the input to each input node is a number, and each node in the hidden layer simply sums the numbers from the input nodes it is connected to.
*This is not a neural network.* All a node in one layer does is collect a number from each of the nodes it is connected to, sum them, and pass the result on to the next layer. No bias, no weights, and no activation function; it's really simple (for now), lol. Now we could input random numbers into the input layer and compute a value at the end. So just imagine we had an expected output number for a particular set of input numbers and were to use this to calculate some form of loss... but we don't really have variables to adjust...
So we will say the variables we want to adjust ARE the input numbers, and we want to adjust them in the direction that qualifies the output of a forward pass according to whatever criterion determines what we expect. Say we expect the number 10 and we get 5 with a random set of input numbers; then we adjust the variables (the numbers) by backpropagation with gradient descent until they produce the number 10 at the output.
My first question is: is this a convex optimisation problem, and how can we tell that the global minimum error is zero?
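To make the setup concrete, here is a minimal sketch in Python/NumPy. The particular mask below is just one arbitrary choice satisfying the disjoint-exclusion condition, and the learning rate and step count are illustrative only:

```python
import numpy as np

# One arbitrary connectivity satisfying the condition: row j of the mask
# has ones for the 9 inputs hidden node j sums, and hidden node j excludes
# inputs 3j, 3j+1, 3j+2 (so the exclusions partition the 12 inputs).
mask = np.ones((4, 12))
for j in range(4):
    mask[j, 3 * j:3 * j + 3] = 0.0

def forward(x):
    """Hidden node j sums its 9 inputs; the output node sums the 4 hidden values."""
    return (mask @ x).sum()

target = 10.0
x = np.random.randn(12)   # the inputs themselves are the trainable variables

lr = 1e-3
for step in range(300):
    y = forward(x)
    # Squared loss (y - target)**2; each input feeds exactly 3 of the 4
    # hidden nodes, so d(loss)/dx_i = 2 * (y - target) * 3.
    x -= lr * 2.0 * (y - target) * mask.sum(axis=0)

print(forward(x))   # converges to ~10.0
```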
Another question: what if we use a more complex expression to calculate the value of a node? For example, each node in the hidden layer, receiving input from 9 nodes, scales its sum by multiplying it with the values of the three input nodes left out of the 12.
The layer then effectively becomes fully connected, but in some connections (3 of them for each hidden node) the connected node's value multiplies, rather than adds to, the value the other nodes contribute. Will the loss still be a convex function?
Note that there is an interplay of features at play: each number in the input contributes to the output by being added in some hidden nodes and by multiplying the value of others.
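Concretely, the multiplicative rule I have in mind looks like this (again just a sketch; the mask is the same arbitrary choice as before, and each input is summed by the 3 hidden nodes that include it while multiplying the 1 that excludes it):

```python
import numpy as np

mask = np.ones((4, 12))
for j in range(4):
    mask[j, 3 * j:3 * j + 3] = 0.0   # same disjoint-exclusion pattern as before

def forward_mult(x):
    """Hidden node j: (sum of its 9 inputs) * (product of its 3 excluded inputs)."""
    out = 0.0
    for j in range(4):
        s = (mask[j] * x).sum()          # additive contribution of the 9 inputs
        p = x[mask[j] == 0.0].prod()     # multiplicative contribution of the 3 excluded
        out += s * p
    return out

x = np.random.randn(12)
print(forward_mult(x))
```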
If you have come this far: could we further have one of the three numbers cubed, another squared, and the last left alone before multiplying?
Would the loss still be convex, or just get even more convoluted?
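One thing we can do cheaply is try to falsify convexity numerically: a convex loss must satisfy loss((u+v)/2) <= (loss(u)+loss(v))/2 for every pair of points, so a single violating pair is a counterexample. Here is that probe applied to the cubed/squared/plain variant (a sanity check only; finding no violations would not prove convexity):

```python
import numpy as np

mask = np.ones((4, 12))
for j in range(4):
    mask[j, 3 * j:3 * j + 3] = 0.0

def forward_pow(x):
    """Hidden node j: (sum of its 9 inputs) * a**3 * b**2 * c over its 3 excluded inputs."""
    out = 0.0
    for j in range(4):
        s = (mask[j] * x).sum()
        a, b, c = x[mask[j] == 0.0]
        out += s * a**3 * b**2 * c
    return out

def loss(x, target=10.0):
    return (forward_pow(x) - target) ** 2

rng = np.random.default_rng(0)
violations = 0
for _ in range(10_000):
    u, v = rng.normal(size=12), rng.normal(size=12)
    # Convexity requires loss((u+v)/2) <= (loss(u) + loss(v)) / 2 for all u, v.
    if loss((u + v) / 2) > (loss(u) + loss(v)) / 2 + 1e-9:
        violations += 1
print(f"midpoint-convexity violations: {violations} / 10000")
```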
How can we generalise which changes to the system take us further from a convex loss function than others?
Ultimately
======
Is there guaranteed to be one shared function for each hidden node's computation that gives us a convex loss function? And what would be the general form of such a function?
======
Part two of this
===
We connect 100 parallel networks of this type. We will not attach their output nodes to one another, so we will have 100 output nodes. We also do not merge the hidden-layer units of the 100 networks, but we keep a note of which nodes belonged to the hidden layer of each of the 100 networks. Each of those 100 networks has between 1 and 10 nodes in its hidden layer.
We need to keep this note in memory because we will delete ALL the input neurons and set the input layer up again as follows, so that the 100 networks are connected only through shared samples of the input-layer nodes, with each sample set intersecting many others.
To clarify:
We select a random set of 20 nodes from across the hidden-layer partitions of ALL 100 networks and label them "A". We then omit them from the next step, which is to perform the same selection on the remaining nodes and label that batch "B". Then we omit all A's and B's and repeat the process until there are no nodes left to label.
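In code, the labelling pass might look like this (hypothetical bookkeeping; I use integer labels 0, 1, 2, ... to stand in for "A", "B", "C", ... since there can be more than 26 batches):

```python
import random

random.seed(0)
# Hypothetical setup: network i has 1..10 hidden nodes, and a hidden node is
# a (network_id, node_index) pair, so we never forget which network it came from.
hidden = [(net, k) for net in range(100) for k in range(random.randint(1, 10))]

label_of = {}                 # (network_id, node_index) -> label
pool = hidden[:]
random.shuffle(pool)          # shuffling then slicing = random batches of 20
label = 0                     # 0, 1, 2, ... standing in for "A", "B", "C", ...
while pool:
    for node in pool[:20]:    # the last batch may hold fewer than 20
        label_of[node] = label
    pool = pool[20:]
    label += 1
print(f"{label} labels over {len(hidden)} hidden nodes")
```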
Now we will connect the "A", "B", "C", ... nodes to the input layer in the following way.
Remember, we first removed all the input-layer nodes; we repopulate the input layer as follows:
======
Create 3 nodes and place them in the input layer.
Connect ALL "A" nodes to those 3 nodes (these will be the 3 nodes used to modify the sum from the others by multiplying instead of adding).
Create ANOTHER 3 nodes and place them in the input layer.
Connect ALL "B" nodes to those three nodes.
Repeat for all A, B, C, ...
=======
Then:
Take each of the hidden nodes from each of the 100 networks we joined together.
Remember, we still know which hidden-layer nodes came from which of the 100 networks.
So we take all nodes of, say, type B and connect them to the input nodes of the remaining nodes in each particular set of nodes they came with in a particular network of the hundred.
E.g. if network 2 has nodes labelled "F, O, B, S" and network 35 has nodes labelled "P, B, E, C, B, G", then we take both nodes labelled B and connect them to the input-layer nodes of all the remaining nodes, i.e. "F, O, S, P, E, C, G".
If B was also in network 15, which is "Y, B", then we include Y in the list of remaining nodes.
We do this for all A, B, C, ... nodes. Note that B's are not in their own list of remaining nodes by definition, so there is no connecting all the B's in each network together.
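Putting the whole part-two wiring together in one runnable sketch (all names hypothetical; the labelling step repeats the sketch above so this block stands alone, and step 2 covers the triple-per-label construction from the block of steps earlier):

```python
import random
from collections import defaultdict

random.seed(0)
hidden = [(net, k) for net in range(100) for k in range(random.randint(1, 10))]

# 1) Label the pooled hidden nodes in random batches of 20, as sketched above.
pool = hidden[:]
random.shuffle(pool)
label_of, n_labels = {}, 0
while pool:
    for node in pool[:20]:
        label_of[node] = n_labels
    pool = pool[20:]
    n_labels += 1

# 2) Rebuilt input layer: one fresh triple of input nodes per label; every
#    hidden node with that label connects to its label's triple.
triple_of = {lbl: (3 * lbl, 3 * lbl + 1, 3 * lbl + 2) for lbl in range(n_labels)}

# 3) Cross-connections: for each label, collect every label it co-occurs with
#    in any of the 100 networks, then connect all of its nodes to those labels'
#    triples. A label is never its own companion, matching the note above that
#    the B's are not connected to each other.
labels_in_net = defaultdict(set)          # network -> labels present in it
for (net, k), lbl in label_of.items():
    labels_in_net[net].add(lbl)

companions = defaultdict(set)             # label -> labels it shares a network with
for labels in labels_in_net.values():
    for lbl in labels:
        companions[lbl] |= labels
for lbl in companions:
    companions[lbl].discard(lbl)

edges = set()
for node, lbl in label_of.items():
    for other in companions[lbl]:
        for inp in triple_of[other]:
            edges.add((node, inp))
print(f"{n_labels} labels, {3 * n_labels} input nodes, {len(edges)} cross edges")
```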
====== We have now completed the network. :)
=============
*pat on back*
So if we randomly assign an expected value of either 0 or 1 to each of the 100 output nodes, we can then backpropagate each output's loss separately.
This would be the equivalent of classifying each of 100 sentences as either spam or not spam. Of course there is no pattern in how the A, B, C, ... are distributed, but if this still works, it means each word A, B, C, ... has a range of nuances being captured by the system; in this example we are installing the definition of spam and non-spam into the system, i.e. we are forcing the words to follow a pattern.
If we then take another random sample of words and classify them as, say, spam, they will be made to fit this definition. Of course the real definition is not spam versus non-spam but 0-ness versus 1-ness, meaning we could deliberately set up a system of real words arranged like this and give the spam examples an expected output of 0 and the non-spam ones an expected output of 1.
Upon convergence, with enough examples, the system will have captured the relationships between the words that belong to a spam sentence by adjusting the input-layer variables.
If we store those variables, we can use them on a new sentence to see whether it is spam or not.
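The training loop itself is easy to state even with the node computation left open. In the sketch below, each output's computation is a placeholder masked sum over its own subset of inputs (just so the loop runs; the real system would use whatever sum/product rule part one settles on), and gradient descent adjusts the shared input values against each output's random 0/1 target separately:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_outputs = 150, 100
x = rng.normal(size=n_inputs)            # shared input values = the variables
targets = rng.integers(0, 2, size=n_outputs).astype(float)   # random 0/1 labels

# Placeholder per-output computation: output k is a masked sum over its own
# random subset of the shared inputs (a stand-in, not the real node rule).
masks = rng.integers(0, 2, size=(n_outputs, n_inputs)).astype(float)

lr = 1e-2
for epoch in range(500):
    for k in range(n_outputs):           # backpropagate each output's loss separately
        diff = masks[k] @ x - targets[k]
        x -= lr * 2.0 * diff * masks[k]  # gradient of (output_k - target_k)**2 w.r.t. x

residuals = masks @ x - targets
print(abs(residuals).max())   # shrinks toward 0: 150 unknowns, 100 consistent equations
```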
Now for the ultimate questions (why I put you through all this):
Does this network pass all the requirements for convexity and a global minimum with a value of zero, for any expected value?
And what type of computation involving the 3 nodes connected to each A, B, ... and their complement should we use to ensure we do in fact pass these two requirements?