Come on, satisfying results can be obtained without using ANN / RNAs, compare amplitude (mentioned in an answer here) will never work the way proposed, the question is very broad, I can not simply write an article here with all the steps, although it seems complicated if you have a good mathematical / algebraic base and in signal processing you will see that it is not that complicated, it can be laborious but not very complex , so being familiar with these fields is more than essential, as well as a solid basis in deterministic and stochastic processes.
Maybe I'll start talking the steps here and you do not understand bolhufas , so it's up to you to go deeper, the steps are:
Extract the characteristics of each bird (audio from each corner), this can be done by extracting the MFCC - Honey Frequency Cepstral Coefficients
The MFCC extracts the envelope / formants (contour) of the frequencies of a signal in the frequency domain, this consistently tells us the shape of the vocal tract in the envelope of the spectrum, we will have the frequency bands equally spaced on the scale honey, which approximates the response of the human auditory system more narrowly than the linearly spaced frequency bands used in the normal cepstrum, in general 12 coefficients are sufficient, roughly a spectrum filter bank:
Atthispointyouwillhavea12-positionvectorrepresentingthecharacteristicsofthecornerforeachbirdyouwant,Idonotwanttogodeep,butfromnowallyouhavetodoiscompareyourpre-recordedvectorwithanewoneunknown)andtoratewhichofthemhasthebestsimilarity,youcanstartwithsimplercomparisonssuchas Euclidiana or try something else but developed for example Dynamic Time Warping