I’ve read some articles on CNNs vs. Capsule Networks and am still wrapping my head around the difference between the two.
This is a quote from Geoffrey Hinton from when he introduced CapsNet:
“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”
I wanted to dive more into what this max pooling step is. From my understanding, a CNN works by having convolutional layers which use kernels, like the Sobel operator, to identify different features such as edges. We could have different kernels for more complex features like a nose, an ear, etc.
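To make the kernel idea concrete, here is a minimal sketch in plain Python of sliding a Sobel kernel over a tiny image. The helper name `correlate2d` and the toy image are made up for illustration. Note also that in a trained CNN the kernel weights are typically learned from data rather than fixed by hand like Sobel, so this only shows the sliding-window mechanics.

```python
# Horizontal-gradient Sobel kernel: responds strongly to vertical edges.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Hypothetical helper for illustration; this is the cross-correlation that
    deep learning libraries usually call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x6 image: dark left half (0), bright right half (1),
# so there is a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

edge_map = correlate2d(image, SOBEL_X)
for row in edge_map:
    print(row)
```

The response is non-zero only where the 3x3 window straddles the edge, which is the sense in which a kernel "detects" a feature at a particular location.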
Is max pooling, at a high level, just combining all the features that are found throughout the different convolutional layers (for example face, nose, and ear during facial recognition)?
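For reference, max pooling as it is commonly defined does something narrower than combining features across layers: within a single feature map, it keeps only the largest activation in each small window and discards where inside the window it occurred. The sketch below, in plain Python, assumes non-overlapping 2x2 windows; the helper `max_pool2d` and the two toy feature maps are made up for illustration.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


# Two toy feature maps with the same activations, each shifted
# to a different position inside its 2x2 window.
map_a = [
    [9, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
map_b = [
    [0, 0, 0, 5],
    [0, 9, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
]

pooled_a = max_pool2d(map_a)
pooled_b = max_pool2d(map_b)
print(pooled_a)
print(pooled_b)
print(pooled_a == pooled_b)
```

Both maps pool to the same output even though the activations sit in different places, which is the kind of positional information loss the Hinton quote is complaining about.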
Because of this, a CNN predicts both of these images to be faces: https://hackernoon.com/hn-images/1*R5sjf4XtV9FApC7N39mPew.png even though in one of them the facial features are clearly not in the right positions.
I believe CapsNet was developed to solve this problem (position, translational/orientation equivariance). Can someone give a high-level ELI5 overview of how CapsNet solves this problem and how it differs from ConvNets? From my reading, is it because CapsNet uses vectors (which maintain position information) while ConvNets use scalars (matrix multiplication)?
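As a small illustration of the vector-versus-scalar distinction, the sketch below implements the "squash" nonlinearity described in the paper "Dynamic Routing Between Capsules" (Sabour, Frosst, and Hinton, 2017), in plain Python. It rescales a capsule's output vector so that its length lies between 0 and 1 while its direction is unchanged. Reading the length as "how likely the feature is present" and the direction as "its pose" follows the paper's description; the function name and the example vector are made up here, and this is only one piece of a capsule layer, not the routing algorithm.

```python
import math


def squash(vector):
    """Shrink `vector` so its length is in [0, 1) while keeping its direction.

    squash(s) = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return [0.0 for _ in vector]
    scale = (norm * norm) / (1.0 + norm * norm) / norm
    return [scale * x for x in vector]


raw = [3.0, 4.0]  # hypothetical raw capsule output, length 5
out = squash(raw)
out_length = math.sqrt(sum(x * x for x in out))

print(out)         # same direction as `raw`
print(out_length)  # close to 1: the feature is judged very likely present
```

A scalar activation can only say how strongly a feature fired, whereas a vector like `out` carries both a strength (its length) and extra properties of the feature (its direction).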