Identifying birdcalls with deep learning

How we can leverage advances in machine learning for biodiversity conservation, among other applications
EDA
pytorch
biodiversity
conservation
spectrograms
bioacoustics
Published

January 8, 2026

Identifying birdcalls with deep learning

Over a million species face extinction right now, and we’re losing species faster than we can count them. While rainforests capture the spotlight, a quieter extinction is happening all around us—in the soundscapes we barely notice. Organizations such as the Earth Species Initiative, Hula, and Biometrio aren’t waiting for boots on the ground to disappear into the wilderness; they’re training AI to listen instead.

Why & How?

What if you could monitor an entire ecosystem without setting foot in it? A single passive acoustic monitor—a battery-powered device no bigger than a shoebox—can listen 24/7, capturing the calls of species too shy to let humans near them. Deploy a network of them, pair their recordings with machine learning models trained on repositories like Xeno-canto’s 1+ million bird recordings, and suddenly you can track biodiversity across landscapes faster and cheaper than traditional field surveys ever allowed. That’s the promise of passive acoustic monitoring: turning the invisible language of nature into quantifiable conservation data. In this guide, we’ll try to build our own deep learning model to do exactly that.

Deep learning already leverages acoustic data to detect known bird calls—the presence and diversity of birds and other animals is a sign of healthy biodiversity—and increasingly, we are also trying to distinguish sounds within a species.

The following tutorial is based on this GitHub repo that I created, where you can find more information.

Code
import os
os.chdir('..')

import gc
import torch
import pandas as pd
import IPython.display as ipd

from ast import literal_eval
from src.modeling import Model
from src.processing import plot_batch
from src.utils import get_resp, get_metadata_for_birds
from src.audio import load_audio, get_melspec, plot_spec

To download metadata of recordings programmatically from Xeno-canto, you’ll need an API key, which is freely available once you register.
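As a rough sketch, a request against the Xeno-canto API might look like the following. Note that `fetch_recordings` is a hypothetical stand-in for the repo’s `get_resp` wrapper, and the exact endpoint layout and parameter names are assumptions—check the Xeno-canto API documentation for the authoritative form.

```python
import os
import json
import urllib.request

def fetch_recordings(query: str, page: int = 1, per_page: int = 100) -> dict:
    """Hypothetical stand-in for the repo's get_resp(): build (and optionally
    issue) a Xeno-canto API request. Endpoint/params are assumptions."""
    key = os.environ.get("XC_API_KEY", "demo")  # free key after registering
    url = (f"https://xeno-canto.org/api/3/recordings"
           f"?query={query}&page={page}&per_page={per_page}&key={key}")
    # Uncomment to actually hit the API:
    # with urllib.request.urlopen(url) as resp:
    #     return json.load(resp)
    return {"url": url}

print(fetch_recordings("cnt:germany+grp:birds")["url"])
```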

I’m interested in identifying bird species in a small region, and for that we can train a simple deep learning model. Instead of training the model only on recordings from that small region, for which we may or may not have enough data, we first find out which birds are typically found in the region and then gather recordings of those species from a much wider area (e.g., a country or continent). We can then test the model on our narrow region.

I’m interested in the city of Munich, but first let’s find out how many bird recordings we have for Germany…

EDA

query = "cnt:germany+grp:birds"
data = get_resp(query)
for k, v in data.items():
    if k != 'recordings':
        print(k, v)
numRecordings 36264
numSpecies 340
page 1
numPages 726

Wow! We have ~36K recordings of 340 different bird species… very interesting, but we can’t analyse them all here. Let’s take a subset from Munich by drawing a bounding box:

query = "box:48,11.34,48.282,11.851+grp:birds"
data = get_resp(query, per_page=500)
for k, v in data.items():
    if k != 'recordings':
        print(k, v)
numRecordings 102
numSpecies 58
page 1
numPages 1

Much more manageable… We’ll get 10 sample recordings for each of these 58 species from a much wider area and store them locally.

df = pd.DataFrame(data['recordings'])
df.shape
(102, 37)
df.sample(3).T
83 54 39
id 652760 366346 706139
gen Saxicola Phylloscopus Poecile
sp rubicola collybita montanus
ssp
grp birds birds birds
en European Stonechat Common Chiffchaff Willow Tit
rec Mathias Götz Michele Peron Mathias Götz
cnt Germany Germany Germany
loc München (near Garching bei München), Oberbaye... Bavaria (near Garching bei München), Oberbaye... Munich (near Perlacher Forst), Upper Bavaria,...
lat 48.2184 48.2489 48.0704
lon 11.6062 11.6532 11.5591
alt 500 480 550
type call call song
sex
stage
method field recording field recording field recording
url //xeno-canto.org/652760 //xeno-canto.org/366346 //xeno-canto.org/706139
file https://xeno-canto.org/652760/download https://xeno-canto.org/366346/download https://xeno-canto.org/706139/download
file-name XC652760-Schwarzkehlchen - Kontaktruf.mp3 XC366346-Zilpzalp_sweeoo_einsilbig.mp3 XC706139-Perlacher Forst: Grossheseloher_Brück...
sono {'small': '//xeno-canto.org/sounds/uploaded/FY... {'small': '//xeno-canto.org/sounds/uploaded/DB... {'small': '//xeno-canto.org/sounds/uploaded/FY...
osci {'small': '//xeno-canto.org/sounds/uploaded/FY... {'small': '//xeno-canto.org/sounds/uploaded/DB... {'small': '//xeno-canto.org/sounds/uploaded/FY...
lic //creativecommons.org/licenses/by-nc-sa/4.0/ //creativecommons.org/licenses/by-nc-sa/4.0/ //creativecommons.org/licenses/by-nc-sa/4.0/
q B B B
length 0:42 1:27 0:35
time 12:30 11:00 16:30
date 2021-05-24 2017-04-23 2022-02-28
uploaded 2021-05-29 2017-04-25 2022-03-05
also [] [] []
rmk sweeoo caller Type H2, bird on this site and c... Recorded with iPhone 8
animal-seen yes yes yes
playback-used no no no
temp
regnr
auto no no no
dvc
mic
smp 44100 44100 44100
Code
# df.to_csv('data/munich_birds.csv', index=False)
# df = pd.read_csv('data/munich_birds.csv')
# df.shape

# df_recs.to_csv('data/munich_birds_recordings_samples_metadata.csv', index=False)
# df_recs = pd.read_csv('data/munich_birds_recordings_samples_metadata.csv')
df_recs = get_metadata_for_birds(df)
df_recs.sample(3).T
151 2309 2494
id 1062867 947332 1031229
gen Alauda Poecile Scolopax
sp arvensis palustris rusticola
ssp NaN NaN NaN
grp birds birds birds
en Eurasian Skylark Marsh Tit Eurasian Woodcock
rec David Darrell-Lambert Ulf Elman Esperanza Poveda
cnt United Kingdom Sweden Sweden
loc Lower Abbey Farm, Suffolk Härjarö, Trögden, Uppland Norrköping Municipality, Östergötland County
lat 52.2355 59.4628 58.6948
lon 1.5998 17.4141 16.4759
alt 4 0 80
type call call call
sex uncertain NaN female, male
stage uncertain NaN adult
method field recording field recording field recording
url //xeno-canto.org/1062867 //xeno-canto.org/947332 //xeno-canto.org/1031229
file https://xeno-canto.org/1062867/download https://xeno-canto.org/947332/download https://xeno-canto.org/1031229/download
file-name XC1062867-Eurasian-Skylark-call-very-good-2020... XC947332-241110_2680C-Entita-stjärtmes.mp3 XC1031229-scolopax-rusticola-1-lovsjon-08-04-2...
sono {'small': '//xeno-canto.org/sounds/uploaded/AY... {'small': '//xeno-canto.org/sounds/uploaded/LK... {'small': '//xeno-canto.org/sounds/uploaded/PR...
osci {'small': '//xeno-canto.org/sounds/uploaded/AY... {'small': '//xeno-canto.org/sounds/uploaded/LK... {'small': '//xeno-canto.org/sounds/uploaded/PR...
lic //creativecommons.org/licenses/by-nc/4.0/ //creativecommons.org/licenses/by-nc-nd/4.0/ //creativecommons.org/licenses/by-nc-sa/4.0/
q A A B
length 0:13 0:34 0:33
time 10:00 09:47 ?
date 2020-04-24 2024-11-10 2025-04-08
uploaded 2025-12-05 2024-11-10 2025-08-20
also [] ['Aegithalos caudatus'] []
rmk NaN [SNR 50dB(ISO/ITU)](https://xeno-canto.org/art... Viaje de grabación con José Carlos Sires y Gre...
animal-seen yes no no
playback-used no no no
temp NaN NaN NaN
regnr NaN NaN NaN
auto no no yes
dvc Sound devices mix pre 3 Tascam DR-100 mkIII olympus dm770
mic Telinga pro 8 MKII Parabol C no
smp 48000 48000 44100
code alauda_arvensis poecile_palustris scolopax_rusticola
df_recs.q.value_counts(dropna=False, normalize=True)
q
A           0.987018
B           0.011930
C           0.000702
no score    0.000351
Name: proportion, dtype: float64

Nice, almost all of them are high-quality recordings…

df_recs.cnt.value_counts(dropna=False, normalize=True).head(10)
cnt
France            0.203860
Sweden            0.187719
Spain             0.092281
Poland            0.083860
United Kingdom    0.069825
Germany           0.053684
Netherlands       0.041754
Italy             0.035789
Norway            0.025965
Portugal          0.019298
Name: proportion, dtype: float64

Interesting! Most of the recordings for the birds I’m interested in come not from Germany but from neighbouring countries… this provides insight into the habitat spread of a species of interest.

We will now store these recordings…

get_data(df_recs)

Spectrograms

sample_audio = 'data/train_audio/asio_otus/592319.mp3'
ipd.Audio(sample_audio)

We will use librosa to handle loading audio in various formats. Make sure ffmpeg is installed on your system.

y, sr = load_audio(sample_audio)
y.shape, sr
((2009408,), 44100)

This 45 second clip contains more than 2M data points because of the high sampling rate.
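A quick sanity check on that number—dividing the sample count by the sampling rate recovers the clip’s duration:

```python
n_samples, sr = 2_009_408, 44_100
duration_s = n_samples / sr
print(round(duration_s, 1))  # ≈ 45.6 seconds of audio
```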

We shall now extract a mel-spectrogram from the wave data. There are many articles out there explaining these, so I’ll refrain from going into detail. In short, audio is an analog signal that we convert into digital format at a certain sampling rate. Because it is a signal, we can run Fourier Transforms (FT) to decompose it into individual frequencies. But given the non-periodic nature of human speech, bird calls, etc., we instead do an STFT (Short-Time FT) over a sliding window. Many frequencies can overlap at any given point in time, and these can be visualized as a spectrogram. Because of the non-linear nature of our innate hearing capabilities and constraints, we convert these frequencies to the Mel scale. Mel-spectrograms are just that.
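To make the STFT step concrete, here is a minimal NumPy sketch on a synthetic tone. librosa performs this windowed FFT internally (before applying the mel filterbank); this is an illustration of the idea, not the repo’s `get_melspec`.

```python
import numpy as np

# Synthetic 1-second "call": a pure 1 kHz tone sampled at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000 * t)

# Short-Time Fourier Transform: FFT over sliding, windowed frames.
n_fft, hop = 1024, 512
window = np.hanning(n_fft)
frames = [y[i:i + n_fft] * window for i in range(0, len(y) - n_fft, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrogram

# Each FFT bin covers sr / n_fft Hz; the tone should dominate its bin.
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * sr / n_fft
print(round(peak_hz))  # close to 1000 Hz
```

A mel-spectrogram then maps these linear frequency bins onto the perceptually motivated Mel scale via a filterbank.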

mel_dB = get_melspec(y, sr, plot=True)

Let’s zoom into a small section to see more details of the harmonics or overtones…

plot_spec(mel_dB[:, :500])

df_recs.smp.value_counts(normalize=True, dropna=False)
smp
48000    0.500702
44100    0.420702
32000    0.037544
96000    0.023158
24000    0.013684
22050    0.002807
16000    0.000702
8000     0.000351
88200    0.000351
Name: proportion, dtype: float64

Good to see that most of the recordings have high sampling rates.

df_recs[df_recs.also != "[]"].also.apply(lambda x: ', '.join(literal_eval(x))).values
array(['Loxia curvirostra', 'Loxia curvirostra, Troglodytes troglodytes',
       'Fringilla coelebs, Turdus merula', ...,
       'Fringilla coelebs, Prunella modularis, Homo sapiens, Bos taurus',
       'Columba palumbus, Turdus philomelos, Gallus gallus, Corvus corone, Erithacus rubecula, Troglodytes troglodytes',
       'Gallus gallus, Corvus corone, Coloeus monedula, Erithacus rubecula, Troglodytes troglodytes, Bos taurus'],
      shape=(1090,), dtype=object)

The metadata also includes other bird (and even non-bird animal) species that appear in a given recording. We can leverage these when doing multi-label classification.
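As a sketch of how the `also` column could feed a multi-label target: the column names mimic the metadata above, but the toy data and the name-to-code mapping are hypothetical, not taken from the repo.

```python
import numpy as np
import pandas as pd
from ast import literal_eval

# Toy frame mimicking the metadata: 'code' is the primary species,
# 'also' lists background species heard in the same recording.
df = pd.DataFrame({
    "code": ["corvus_corone", "parus_major"],
    "also": ["['Parus major']", "[]"],
})
classes = ["corvus_corone", "parus_major"]
to_code = {"Parus major": "parus_major"}  # scientific name -> class code

def multi_hot(row):
    """Combine the primary label with any known background species."""
    extras = {to_code[s] for s in literal_eval(row["also"]) if s in to_code}
    labels = {row["code"]} | extras
    return [float(c in labels) for c in classes]

y = np.array(df.apply(multi_hot, axis=1).to_list(), dtype=np.float32)
print(y)  # row 0 gets both species, row 1 only its own
```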

Train Model

We will build a very simple starter ResNet-based model to identify bird calls/songs, but this can be adapted to any species. It works the following way:

  1. Obtain MelSpectrograms of the audio recordings of your species of interest
  2. During training:
    1. randomly extract n seconds of audio data from the whole audio file
    2. generate mel-spectrogram of it
    3. pre-process and transform it into a single channel image (a square array)
    4. train on these patches

We keep the pretrained weights, freeze the ResNet backbone, and adjust the last linear layer’s number of output features to the number of classes in our dataset. We also change the number of input channels of the first CNN layer, since we have single-channel images.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
mdl = Model(device, pretrained=True, freeze_backbone=True)
mdl.set_dataloaders("data/train_audio/", batch_size=32)
for x, y in mdl.train_dl:
    break
x.shape, y.shape
(torch.Size([32, 1, 256, 256]), torch.Size([32, 57]))
plot_batch(x[:8], y[:8])

We can now train for some epochs… You can also load the model from an earlier training checkpoint to continue training or to run inference.

mdl.train(num_epochs=5)
mdl.load_from_chkpt('data/model_checkpoints/chkpt_12.pth')

Let’s load a sample audio file for inference…

fp = 'data/train_audio/corvus_corone/592284.mp3'
mdl.make_preds(fp)
'corvus_corone'

cool!

HF Spaces Demo

The model’s also hosted as a demo on Hugging Face Spaces: munich-bird-identifier, although keep in mind that the results aren’t very accurate since it’s just a PoC.

Next Steps:

Our current model barely scratches the surface. It represents just one of many approaches to applying deep learning to bioacoustics. However, the field is advancing rapidly, and there are numerous directions in which to extend and enhance our approach.

Real-world bird classification demands moving beyond single-channel, single-label models toward greater sophistication. We can do that by incorporating multi-channel spectrograms that capture complementary acoustic features (Mel-spectrograms at varying resolutions, delta features, chromagrams), then expanding to multi-label scenarios where multiple species vocalize simultaneously. On top of that, we can develop multimodal models fusing acoustic data with temporal (time of day, season), spatial, and weather metadata, information that dramatically improves predictions since bird activity patterns are highly context-dependent. Google’s Massive Sound Embedding Benchmark (MSEB), presented at NeurIPS 2025, provides an extensible framework for evaluating such multimodal sound models, with BirdSet demonstrating how bioacoustics integrates into broader auditory intelligence research.

We also have to consider resource constraints such as compute and power when deploying models onto edge devices, where lightweight models such as EfficientNet shine.

Standardized benchmarks like BirdSet become essential for validating progress against state-of-the-art approaches. BirdSet aggregates diverse datasets into a unified framework exposing models to both focal recordings (isolated calls from Xeno-canto) and soundscape recordings (complex passive monitoring data), explicitly testing for the “covariate shift” problem where models trained on clean recordings struggle with real-world acoustic environments. By evaluating against BirdSet’s standardized pipeline and baseline results, we can quickly identify strengths and weaknesses in our model. Additionally, MSEB’s inclusion of BirdSet enables evaluation across downstream tasks beyond classification, such as acoustic retrieval, clustering unknown species, and temporal segmentation of vocalizations within long recordings. Google’s recently released Perch 2.0 model achieved SoTA results on this benchmark.

Apart from that, Kaggle also hosts the annual BirdCLEF competitions for bioacoustics, wherein massive soundscape datasets get dissected by top teams who then publish their full solution write-ups in the open. This knowledge sharing lets cutting-edge techniques seep straight from notebooks into real-world conservation tools. With models and datasets like these, only our imagination limits what we can achieve for the conservation of biodiversity.

Thanks a lot for reading! Auf Wiedersehen 👋!