I am a research fellow, conducting research into automatic analysis of bird sounds using machine learning.
I'm happy to say I'm now supervising two PhD students, Pablo and Veronica. Veronica is working on my project all about birdsong and machine learning - so I've got some notes here about recommended reading for someone starting on this topic. It's a niche topic but it's fascinating: sound in general is fascinating, and birdsong in particular is full of many mysteries, and it's amazing to explore these mysteries through the craft of trying to get machines to understand things on our behalf.
If you're thinking of starting in this area, you need to get acquainted with: (a) birds and bird sounds; (b) sound/audio and signal processing; (c) machine learning methods. You don't need to be an expert in all of those - a little naivete can go a long way!
But here are some recommended reads. I don't want to give a big exhaustive bibliography of everything that's relevant. Instead, here's some choice reading, selected because I think each paper satisfies three criteria: it's readable, it's relevant, and it's representative of a different idea/method that I think you should know. They're all journal papers, which is good because they're quite short and focused, but if you want a more complete intro I'll mention some textbooks at the end.
Wang (2003), "An industrial strength audio search algorithm"
This paper tells you how the well-known "Shazam" music recognition system works. It uses a clever idea about what is informative and invariant about a music recording. The method is not appropriate for natural sounds but it's interesting and elegant.
Bonus question: Take some time to think about why this method is not appropriate for natural sounds, and whether you could modify it so that it is.
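To make the core idea concrete, here's a toy sketch of landmark-style fingerprinting in the spirit of Wang's paper - this is my own illustration, not Shazam's actual implementation, and the parameter choices are arbitrary. The idea: pick out salient spectrogram peaks, then hash pairs of nearby peaks, so each hash is invariant to overall level and fairly robust to added noise (which rarely displaces the strongest peaks).

```python
# Toy sketch of landmark-style audio fingerprinting, in the spirit of
# Wang (2003) -- NOT Shazam's actual implementation.
import numpy as np

def spectrogram(audio, nperseg=512, hop=448):
    """Log-magnitude spectrogram via windowed FFT frames (numpy only)."""
    window = np.hanning(nperseg)
    nframes = (len(audio) - nperseg) // hop + 1
    frames = np.stack([audio[i*hop:i*hop+nperseg] * window
                       for i in range(nframes)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)).T + 1e-10)

def landmark_hashes(audio, fan_out=5, max_dt=32):
    """Return (hash, anchor_time) fingerprints for an audio clip."""
    spec = spectrogram(audio)
    # Crude peak picking: louder than all 8 neighbours and above a threshold
    thresh = spec.mean() + 2 * spec.std()
    peaks = [(f, t)
             for f in range(1, spec.shape[0] - 1)
             for t in range(1, spec.shape[1] - 1)
             if spec[f, t] == spec[f-1:f+2, t-1:t+2].max()
             and spec[f, t] > thresh]
    peaks.sort(key=lambda p: p[1])  # order by time
    # Pair each "anchor" peak with a few later peaks in a target zone;
    # the (freq1, freq2, time-delta) triple is the level- and
    # translation-invariant hash
    hashes = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i+1 : i+1+fan_out]:
            if 0 < t2 - t1 < max_dt:
                hashes.append(((f1, f2, t2 - t1), t1))
    return hashes

# Demo: two steady tones plus a little noise yield repeatable landmarks
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(2 * sr) / sr
audio = (np.sin(2*np.pi*440*t) + np.sin(2*np.pi*1200*t)
         + 0.01 * rng.standard_normal(t.shape))
prints = landmark_hashes(audio)
```

Matching a query then amounts to looking up these hashes in a database and checking whether many hits agree on a consistent time offset - which is one reason the scheme works so well for studio recordings and (bonus question!) less well elsewhere.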
Stowell and Plumbley (2014), "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning"
Lots of powerful machine learning right now uses deep learning. There's lots to read on the topic. Here's a blog post that I think gives a good introduction to deep learning. Also, for this article DO read the comments! The comments contain useful discussion from some experts such as Yoshua Bengio. Then after that, this recent Nature paper is a good introduction to deep learning from some leading experts, which goes into more detail while still at the conceptual level. When you come to do practical application of deep learning, the book "Neural Networks: Tricks of the Trade" is full of good practical advice about training and experimental setup, and you'll probably get a lot out of the tutorials for the tool you use (for example I used Theano's deep learning tutorials).
Theunissen and Shaevitz (2006), "Auditory processing of vocal sounds in birds"
This one is not computer science, it's neuroscience - it tells you how birds recognise sounds!
A question for you: should machines listen to bird sounds in the same way that birds listen to bird sounds?
O'Grady and Pearlmutter (2006), "Convolutive non-negative matrix factorisation with a sparseness constraint"
Kershenbaum et al (2014), "Acoustic sequences in non-human animals: a tutorial review and prospectus"
Benetos et al (2013), "Automatic music transcription: challenges and future directions"
Domingos (2012), "A few useful things to know about machine learning"
I'm having problems understanding people. More specifically, I'm having problems now that people are using emoji in their messages. Is it just me?
OK so here's what just happened. I saw this tweet which has some text and then 3 emoji. Looking at the emoji I think to myself,
"Right, so that's: a hand, a beige square (is the icon missing?), and an evil scary face. Hmm, what does he mean by that?"
I know that I can mouseover the images to see text telling me what the actual icons are meant to be. So I mouse over the three images in turn, and I get:
So it turns out I've completely misunderstood the emotion that was supposed to be on that face icon. Note that you probably see a different image than I do anyway, since different systems show different images for each glyph.
Clapping hands, OK fine, I can deal with that. Clapping hands and grinning face must mean that he's happy about the thing.
But "(white skin)"? WTF?
Is it just me? How do you manage to interpret these things?
Lancashire hotpot is a classic dish where I come from. Lamb, onion, potatoes, slow-cooked.
There's a short version of this post: Felicity Cloake's "perfect Lancashire hotpot" article in the Guardian is correct. Read that article.
Really the main way you can mess up Lancashire hotpot is by trying to fancy it up. As Cloake says, don't pre-cook the potatoes or the onions, or the meat. With the meat, lamb neck is a good choice, easy to find in supermarkets and good for slow cooking. (I bet Cloake is right that mutton is more traditional and would suit it well, but I don't tend to find that in the shops.) Cut the meat into BIG pieces - not "bite-size" pieces as in many stews, and not the bite-size pieces you get in supermarket ready-diced meat. Bigger than that. At least an inch thick.
I'm pretty sure I remember there being carrots in the regular school hotpot, so I add carrot (in big chunks so it stands up to the long cooking). Floury potatoes (not waxy) are definitely the right choice - and for the reasons mentioned by Cloake: "the potatoes that have come into contact with the gravy dissolve into a rich, meaty mash, while those on top go crisp and golden – for which one needs a floury variety such as, indeed, a maris piper." I've got a standard recipe book here which says to put some potatoes on the bottom as well as the top; that seems a bit odd at first glance, but it gives you a good ratio of crispy potato to melted potato...
In a sense this is basically just a stew/casserole and you can do what you like, so I'll try not to be too dogmatic - but it's one of those minimalist recipes where if you mess about with it too much you get "just another stew" rather than this particular flavour. It's traditional to use kidneys as well as meat (my grandma did that), but we didn't have that at school, and certainly when I'm cooking just for me I'm not going to bother. However, I'm shocked to see Jane Horrocks suggest putting black pudding in underneath the potatoes! It's also mentioned by commenters on the Guardian article, so I assume it must be a habit in some bits of Lancashire... but not my bit.
That aside, the recipe to look at is Felicity Cloake's "perfect Lancashire hotpot" article in the Guardian.
One of the interesting threads of discussion was about deep learning and modern neural networks, and how best to use them for audio signal processing. The deep learning revolution has already had a big impact on audio: famously, deep learning gets powerful results on speech recognition and is now used pervasively in industry for that task. It's also widely studied in image and video processing.
But that doesn't mean the job is done. Firstly, speech recognition is only one of many ways we get information out of audio, and other "tasks" are not direct analogies: they have different types of inputs and outputs. Secondly, there are many different neural net architectures, and still much lively research into which architectures are best for which purposes. Part of the reason that big companies get great results for speech recognition is that they have masses and masses of data. In cases where we have modest amounts of data, or data without labels, or data with fuzzy labels, getting the architecture just right is an important thing to focus on.
And audio signal processing insights are important for getting neural nets right for audio. This was one of the main themes of Emmanuel Vincent's WASPAA keynote, titled "Is audio signal processing still useful in the era of machine learning?" He mentioned, for example, that the intelligent application of data augmentation is a good way for audio insight to help train deep nets well. I agree, but in the long term I think the more important point is that our expertise should be used to help get the architectures right. There's also the thorny question (and hot topic in deep learning) of how to make sense of what deep nets are actually doing: in a sense this is the flip-side of the architecture question, making sense of an architecture once it's been found to work!
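To illustrate the data-augmentation point (my own minimal sketch, not an example from Vincent's talk): the audio insight is in choosing label-preserving transformations that mimic real acoustic variability, such as small time shifts and added background noise at a plausible signal-to-noise ratio.

```python
# A minimal sketch of label-preserving audio data augmentation (my own
# illustration). Each transform mimics variability we expect in real
# recordings, so the training label stays valid while the input changes.
import numpy as np

def augment(waveform, rng, max_shift=1000, snr_db=20.0):
    """Return a randomly time-shifted copy of `waveform` with added noise."""
    shift = rng.integers(-max_shift, max_shift + 1)
    shifted = np.roll(waveform, shift)  # circular shift, for simplicity
    # Add white noise at the requested signal-to-noise ratio
    sig_power = np.mean(shifted ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=shifted.shape)
    return shifted + noise

rng = np.random.default_rng(42)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = augment(clean, rng)
```

In practice you'd apply a fresh random transform every time an example is drawn during training, effectively multiplying the size of a modest dataset.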
It's common knowledge that convolutional nets (ConvNets) and recurrent neural nets (specifically LSTMs) are powerful architectures, and in principle LSTMs should be particularly appropriate for time-series data such as audio. Lots of recent work confirms this. At the SANE workshop Tuomas Virtanen presented results showing strong performance at sound event detection (recovering a "transcript" of the events in an audio scene), and Ron Weiss presented impressive deep learning that could operate directly on raw waveforms to perform beamforming and speech recognition from multi-microphone audio. Weiss was using an architecture combining convolutional units (to create filters) and LSTM units (to handle temporal dependencies). Pablo Sprechmann discussed a few different architectures, including one "unfolded NMF"-type architecture. (The "deep unfolding" approach is certainly a fruitful idea for deep learning architectures. It was introduced a couple of years ago by Hershey et al. [EDIT: It's been pointed out that the unfolding idea was first proposed by Gregor and LeCun in 2010, and unfolded NMF was described by Sprechmann et al. in 2012. The contribution of Hershey et al. comes from the curious step of untying the unfolded parameters, which turns a truncated iterative algorithm into something more like a deep network.])
I'd like to focus on a couple of talks at SANE that exemplified how domain issues inform architectural issues:
Paris Smaragdis presented "NMF? Neural Nets? It’s all the same...". Smaragdis is well-known for his work on NMF (non-negative matrix factorisation) methods. He presented a great narrative arc of how you might start with NMF and then throw away the things that irritate you about it - such as the spectrogram, instead working with convolutional filters learnt from raw audio.
(Note this other recurring theme: I already mentioned that Ron Weiss was also talking about waveform-based methods. Others have worked on this before, such as Sander Dieleman's paper on "end-to-end" deep learning for music audio. It's still not clear if ditching the spectrogram is actually that beneficial. Certainly if you do, you need lots of data in order to train successfully, as Weiss demonstrated empirically. I don't think I'd recommend ditching the spectrogram yet unless you're really sure what you're doing...)
The really surprising thing about Smaragdis' talk (given his previous work) was that by the time he'd deconstructed NMF and built up a neural net having similar source-separation properties, the end result was a surprisingly recognisable autoencoder - however, with a nicely principled architecture, and also some specific modifications (some "skip" connections and choices about tying/untying parameters). This autoencoder is not the same as NMF - it doesn't have the same non-negativity constraints, for example - but is inspired by similar motivations.
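For readers who haven't met NMF: it approximates a non-negative matrix V (say, a magnitude spectrogram) as a product W·H of two non-negative matrices, W holding spectral templates and H their activations over time. Here's a toy sketch of the classic multiplicative-update recipe (the standard Lee & Seung algorithm, i.e. the starting point of Smaragdis' deconstruction, not his autoencoder formulation):

```python
# Toy NMF via the classic multiplicative updates (Lee & Seung) -- the
# standard starting point, not Smaragdis' autoencoder reformulation.
# V is approximated as W @ H with all entries non-negative: W holds
# spectral templates, H holds their activations over time.
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        # Multiplicative updates reduce ||V - WH||_F^2 while
        # automatically preserving non-negativity (no step size to tune)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Demo: a synthetic "spectrogram" built from 2 non-negative templates
# should be recovered almost exactly with rank 2
rng = np.random.default_rng(1)
V = rng.random((64, 2)) @ rng.random((2, 100))
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Seen this way, one NMF update pass looks rather like one layer of a network mapping V to activations H through templates W - which is roughly where the "it's all the same" argument, and the unfolding idea above, get going.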
Happily, these discussions relate to some work I've been involved in this year. I spent some time visiting Rich Turner in Cambridge, and we had good debates and a small study about how to design a neural network well for audio. We have a submitted paper about denoising audio without access to clean data using a partitioned autoencoder, which is the first fruit of that visit. The paper focuses on the "partitioning" issue, but the design of the autoencoder itself has some similarities to what Paris Smaragdis was describing, and for similar reasons.
There's sometimes a temptation to feel despondent about deep learning: the sense of foreboding that a generic deep network will always beat your clever insightful method, just because of the phenomenal amounts of data and compute-hours that some large company can throw at it. All of the above discussions feed into a more optimistic interpretation, that domain expertise is crucial for getting machine-learning systems to learn the right thing - as long as you can learn to jettison some of your cherished habits (e.g. MFCCs!) at the right moment.
I live in Tower Hamlets, the London borough with the largest proportion of Muslims in the UK. I see plenty of women every day who wear a veil of one kind or another. I don't have any kind of Muslim background so what could I do to start understanding why they wear what they do?
I went on a book hunt and luckily I found a book that gives a really clear background: "A Quiet Revolution" by Leila Ahmed. It's a book that describes some of the twentieth-century back-and-forth of different Islamic traditions, trends and politics, and how they relate to veils. The book has a great mix of historical overview and individual voices.
So, while of course there's lots I still don't understand, this book gives a really great grounding in what's going on with Muslim women, veils, and Western society. It's compulsory reading before launching into any naive feminist critique of Islam and/or veils. I'm sure feminists within Islam still have a lot to work out, and I don't know what the balance of "progress" is like there - please don't mistake me for thinking all is rosy. (There are some obvious headline issues, such as those countries which legally enforce veiling. I think to some Western eyes those headlines can obscure the fact that there are feminist conversations happening within Islam, and good luck to them.)
A couple of things that the book didn't cover, that I'd still like to know more about:
The Long Mynd is a range of hills in Shropshire. Very beautiful area this time of year. Lots of birds too. People often comment on the birds of prey: the buzzard, red kite and kestrel, soaring silently above and occasionally plummeting to pounce on something. Of course I'm more interested in the birds making the sounds all around.
I was most taken by the meadow pipits - as you walk around on the Mynd, they often leap surprised out of the heather and flitter away making alarmed "peeppeeppeep" sounds (or maybe more whistly than that, "pssppssppssp"). I saw a skylark too, ascending from the ground about 20 metres in front of me. It's great to witness it when they do that: an unhurried circling ascent, all the while burbling out their famously complex melodious song, like a little enraptured fax machine going to heaven.
While hanging around in the forest I noticed how many non-vocal bird sounds you can hear. The most common example is wing flutter: I heard it from lots of different species, and the sound can often be very deliberate. The most surprising sound of all came when I was walking past a tree and heard a knocking sound. "Oh, is that a woodpecker starting up?" I thought - but it wasn't. I could see the little bird on a branch a few metres away: it was a coal tit, doing a bit of a woodpecker impression. It would peck at the branch hard, about four times in a row, repeatedly, giving me the impression it might have been attempting some DIY of some sort. It also tried it on a second branch.
Lots of gangs of ravens around too - their curious adaptable calls reminding me of the ones I saw recently at Seewiesen. I often heard (from a distance) the song of the nuthatch - that nice simple ascending note that I first encountered when camping in Dorset. Now and again a jay, lovely orangey and cyan colouring contrasting with its raspy magpie-ish yell. The jays seem to be shy around here, unlike the one that used to hang around in our garden in London.
Of course all the usual gang was there too: lots of robins singing, jackdaw, magpie, wren, house sparrow, blackbird, stock pigeon, one wood pigeon, the occasional chiffchaff. I think I heard a goldcrest at one point but I'm unsure. One willow warbler down by the reservoir.
Our journal paper "Detection and Classification of Acoustic Scenes and Events" is now out in IEEE Transactions on Multimedia! It evaluates many different methods for detecting/classifying in everyday audio recordings.
I'm highlighting this paper because it covers the whole process of the IEEE DCASE evaluation challenge that we ran a little while ago, with many international research teams submitting systems either for audio event detection or audio scene classification.
It was a big team effort, with various people putting many months of time in, from 2012 through to 2015 (even though it was essentially an unfunded initiative!). Specific thanks to Dimitrios and Emmanouil, who I know put lots of manual effort in, repeatedly, to get this right.
The International Bioacoustics Congress 2015 was a fantastic conference. Lots of fascinating research, in a great place (Murnau, Bavaria, Germany), and very well organised! In this note I want to capture some thoughts that it triggered, about the practical organisation of a conference.
The staff that facilitated the conference made it run very smoothly. There were helpful people in the downstairs office almost all week, available for questions and the like. I particularly appreciated the facilitation for conference speakers: downstairs, the organisers loaded our presentations onto the laptop and checked they worked; then upstairs, there was a sound engineer who very efficiently fitted us with the radio mic and opened the presentations. This kind of support was crucial to make such a busy schedule possible: many sessions had only 15 minutes per speaker! So no time for messing around.
Various IBAC people said, and I agree, that it's vital to keep it as a single-track conference: that seems to be part of its friendly community atmosphere. This is tricky, as IBAC has grown so that the schedule is now tightly-packed, and one "easy" way to reduce the pressure would be to go multi-track. I suspect the biggest risk there is of splitting the community into taxa (birds, marine, anurans, etc). So if parallel sessions were to be used (not my preferred solution), it'd be better to do that with the "open" rather than themed sessions, as someone at the AGM suggested. (The mix of open and themed sessions was well-balanced here in 2015.)
Every day opened with a 60-minute keynote, which is a great and widely-used pattern. We then had 20-minute slots in the themed sessions, and 15-minute slots in the open sessions. In my home discipline I've never seen 15-minute talk slots, and I think that's too short. I think that 20-minute slots are good, as long as the chair insists on keeping some time for questions, since I personally believe that public discussion with conference speakers is a really important part of what conference presentations are for. The IBAC chairs didn't really insist on this, which is a shame. That aside, the sessions were well hosted.
The poster sessions were lively and very interesting, but physically they were too full! It was often very difficult to even read the titles of posters, let alone talk to the person standing there, if one or two people were discussing a nearby poster. This could have been improved by having 4 separate sessions of 40 posters, rather than 2 sessions of 80 which were each repeated for two days.
So, as I've already implied, IBAC was very highly subscribed, with many talks and posters, and I've been suggesting it could be better if the programme was a bit less tightly-packed. How could this be done (without going multi-track)? One answer is to be more selective, i.e. to accept fewer abstracts. Immediately I want to highlight a risk of this: it's great at IBAC to have lots of student and early-postgrad presenters, so we would want to avoid a selection process that favoured big names or experienced abstract-submitters. (We'd also want to maintain a decent balance across taxa.) I'd suggest a simple quota: minimum 50% student or recently-graduated people, both for talks and for posters.
Being selective has a cost: interesting things get rejected. The quality of IBAC 2015 was high; there's no need to be selective for quality purposes. IBAC is currently every two years. I wonder if the IBAC community would be interested in having IBAC every year? There's clearly enough content for that. Would it suit the rhythm of the community? Could the IBAC steering committee cope with the doubled workload?
I find a printed programme absolutely essential. The 2015 organisers decided that many people don't want one because they use electronic versions, so printing it would be wasteful. That's fine, but many of us, me included, need something on paper. I think the ideal would simply be a tick-box on the conference registration form: "Would you like a printed programme?" Simple to handle, and it reduces unnecessary printing.
A few other miscellaneous thoughts:
Of course almost everything I've written is about general conference organisation, not just IBAC. These thoughts are spurred by conversations we had at IBAC, and spurred by the overall extremely good conference organisation. Massive thanks to the IBAC 2015 organisers and staff!
P.S. I previously blogged about the research at IBAC 2015.