Let's say we have 3 channels in the input layer and 64 channels/feature maps in the output layer. How are 3 channels converted to 64 channels? How many filters do we need?

To produce one feature map in the output, you need 3 filters (2D kernels), one for each input channel:

    R - (one filter conv) - |
    G - (one filter conv) - | --> (summed after conv) --> one output
    B - (one filter conv) - |

So to generate 64 channels/feature maps in the output, you'll need 3x64 = 192 filters/kernels. Thus, simply:

Number of Filters = No. of Input Channels x No. of Output Channels

Or, you can think of the filter as not 2D at all. Say we have an input of HxWxN, where HxW is the height and width of the image and N is the depth, i.e. the number of channels. Our filter is then AxBxN, so it is not a 2D filter but a 3D one spanning all N channels, and we need one such filter per output channel (64 here). The convolution at each position is basically a dot product between the filter and the AxBxN patch of the input underneath it, whatever the dimension.

[PS: Each filter has a bias, so AxBxN + 1 = number of params for 1 output]

[Source]
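To make the counting concrete, here is a minimal pure-Python sketch. It assumes a 3x3 kernel, a small 5x5 input with random integer values, and reads the PS formula as A x B x N + 1 parameters per output feature map. All function names (`count_2d_kernels`, `params_per_output`, `conv_at`, `conv_at_per_channel`) are made up for illustration and are not from any library. It also checks that "one 2D filter per input channel, then sum" gives the same number as "one 3D filter, one dot product".

```python
import random

def count_2d_kernels(in_channels, out_channels):
    # One 2D kernel per (input channel, output channel) pair.
    return in_channels * out_channels

def params_per_output(a, b, in_channels):
    # One A x B x N filter plus a single bias for each output feature map.
    return a * b * in_channels + 1

def conv_at(image, kernel, bias, row, col):
    # View 1: a single 3D filter. image is [N][H][W], kernel is [N][A][B].
    # The output pixel is one dot product over the whole A x B x N patch.
    total = bias
    for n in range(len(kernel)):
        for i in range(len(kernel[n])):
            for j in range(len(kernel[n][i])):
                total += image[n][row + i][col + j] * kernel[n][i][j]
    return total

def conv_at_per_channel(image, kernel, bias, row, col):
    # View 2: convolve each channel with its own 2D kernel, then sum.
    per_channel = []
    for n in range(len(kernel)):
        s = 0
        for i in range(len(kernel[n])):
            for j in range(len(kernel[n][i])):
                s += image[n][row + i][col + j] * kernel[n][i][j]
        per_channel.append(s)
    return sum(per_channel) + bias

# Assumed sizes for the demo: N=3 input channels, 5x5 image, 3x3 kernel.
N, H, W, A, B = 3, 5, 5, 3, 3
random.seed(0)
image = [[[random.randint(-5, 5) for _ in range(W)] for _ in range(H)]
         for _ in range(N)]
kernel = [[[random.randint(-5, 5) for _ in range(B)] for _ in range(A)]
          for _ in range(N)]
bias = 1

# Both views agree at every valid (no padding, stride 1) position.
views_agree = all(
    conv_at(image, kernel, bias, r, c)
    == conv_at_per_channel(image, kernel, bias, r, c)
    for r in range(H - A + 1)
    for c in range(W - B + 1)
)

print(count_2d_kernels(3, 64))         # -> 192
print(params_per_output(3, 3, 3))      # -> 28
print(64 * params_per_output(3, 3, 3)) # -> 1792
print(views_agree)                     # -> True
```

With these assumed sizes, a 3-to-64 channel layer with 3x3 kernels holds 64 x 28 = 1792 parameters: 192 2D kernels of 9 weights each, plus 64 biases.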