Reading list: excellent papers for birdsong and machine learning

I'm happy to say I'm now supervising two PhD students, Pablo and Veronica. Veronica is working on my project all about birdsong and machine learning - so I've got some notes here about recommended reading for someone starting on this topic. It's a niche topic but it's fascinating: sound in general is fascinating, birdsong in particular is full of mysteries, and it's amazing to explore those mysteries through the craft of trying to get machines to understand things on our behalf.

If you're thinking of starting in this area, you need to get acquainted with: (a) birds and bird sounds; (b) sound/audio and signal processing; (c) machine learning methods. You don't need to be an expert in all of those - a little naivete can go a long way!

But here are some recommended reads. I don't want to give a big exhaustive bibliography of everything that's relevant. Instead, here's some choice reading, selected because I think it satisfies all of these criteria: each paper is readable, is relevant, and is representative of a different idea/method that I think you should know. They're all journal papers, which is good because they're quite short and focused, but if you want a more complete intro I'll mention some textbooks at the end.

  • Briggs et al (2012) "Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach"

    • This paper describes quite a complex method but it has various interesting aspects, such as how they detect individual bird sounds and how they modify the classifier so that it handles multiple simultaneous birds. To my mind this is one of the first papers that really gave the task of bird sound classification a thorough treatment using modern machine learning.
  • Lasseck (2014) "Large-scale identification of birds in audio recordings: Notes on the winning solution of the LifeCLEF 2014 Bird Task"

    • A clear description of one of the modern cross-correlation classifiers. Many people have tried to identify bird sounds by template cross-correlation - basically, taking known examples and detecting whether the shape matches well. The simple approach to cross-correlation fails in various situations, such as when sounds vary organically. The modern approach, introduced to bird classification by Gabor Fodor in 2013 and developed further by Lasseck and others, still uses cross-correlation, but not to guess the answer directly: it uses it to generate new data that gets fed into a classifier (a minimal sketch of this idea follows this entry). At the time of writing (2015), this type of classifier tends to win bird classification contests.
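      As a minimal sketch of that "cross-correlation as feature generation" idea (my own illustration, not Lasseck's exact pipeline - the spectrograms, templates and labels are hypothetical placeholders):

      ```python
      import numpy as np
      from scipy.signal import correlate2d
      from sklearn.ensemble import RandomForestClassifier

      def template_features(spec, templates):
          # For each template, record the peak cross-correlation anywhere in
          # the spectrogram. The classifier, not the correlation itself,
          # decides what those peaks mean.
          feats = []
          for t in templates:
              c = correlate2d(spec, t - t.mean(), mode='valid')
              feats.append(c.max() / (np.linalg.norm(t) + 1e-9))
          return np.array(feats)

      # `spectrograms` is a list of 2D arrays (freq x time), each larger than
      # any template; `templates` are patches cut from training spectrograms.
      X = np.vstack([template_features(s, templates) for s in spectrograms])
      clf = RandomForestClassifier(n_estimators=500).fit(X, labels)
      ```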
  • Wang (2003), "An industrial strength audio search algorithm"

    • This paper tells you how the well-known "Shazam" music recognition system works. It uses a clever idea about what is informative and invariant about a music recording. The method is not appropriate for natural sounds but it's interesting and elegant - see the sketch after the bonus question below.

      Bonus question: Take some time to think about why this method is not appropriate for natural sounds, and whether you could modify it so that it is.
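      To help with that question, here's a rough sketch of the core trick (my own reconstruction of the idea, not code from the paper): pick out spectrogram peaks, then hash *pairs* of nearby peaks, because a (freq1, freq2, time-gap) triple survives noise and doesn't depend on absolute timing.

      ```python
      import numpy as np
      from scipy.ndimage import maximum_filter

      def landmark_hashes(spec, fan_out=5, max_dt=64):
          # `spec` is a hypothetical magnitude spectrogram (freq x time).
          # 1. Keep local maxima that stand out from the background.
          peaks = (spec == maximum_filter(spec, size=(20, 20))) & (spec > spec.mean())
          fs, ts = np.nonzero(peaks)
          order = np.argsort(ts)
          fs, ts = fs[order], ts[order]
          # 2. Pair each peak with a few peaks just after it.
          hashes = []
          for i in range(len(ts)):
              for j in range(i + 1, min(i + 1 + fan_out, len(ts))):
                  dt = ts[j] - ts[i]
                  if 0 < dt <= max_dt:
                      hashes.append(((fs[i], fs[j], dt), ts[i]))  # (hash key, anchor time)
          return hashes
      ```

      Recognition then amounts to counting matching hash keys between the query and the database, keeping only matches whose anchor times line up with one consistent offset.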

  • Stowell and Plumbley (2014), "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning"

    • This is our paper about large-scale bird species classification. In particular, it introduces a "feature-learning" method which seems to work well (sketched in broad strokes below). There are some analogies between our feature-learning method and deep learning, and also between our method and template cross-correlation. These analogies are useful to think about.
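      In broad strokes, the feature-learning pipeline looks something like this sketch (simplified - see the paper for the real details; `frames` is a hypothetical matrix whose rows are Mel spectrogram frames):

      ```python
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import normalize

      # Whiten the frames, then cluster them on the unit sphere.
      pca = PCA(whiten=True, n_components=40).fit(frames)
      white = normalize(pca.transform(frames))
      km = KMeans(n_clusters=500, n_init=4).fit(white)
      bases = normalize(km.cluster_centers_)

      def learnt_features(new_frames):
          # Project each frame onto the learnt bases, rectify, then
          # summarise over time before feeding a classifier.
          acts = np.maximum(0, normalize(pca.transform(new_frames)) @ bases.T)
          return np.hstack([acts.mean(axis=0), acts.max(axis=0)])
      ```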
  • Lots of powerful machine learning right now uses deep learning. There's lots to read on the topic. Here's a blog post that I think gives a good introduction to deep learning. Also, for this article DO read the comments! The comments contain useful discussion from some experts such as Yoshua Bengio. After that, this recent Nature paper is a good overview from some leading experts, which goes into more detail while staying at the conceptual level. When you come to apply deep learning in practice, the book "Neural Networks: Tricks of the Trade" is full of good practical advice about training and experimental setup, and you'll probably get a lot out of the tutorials for the tool you use (for example I used Theano's deep learning tutorials).

    • I would strongly recommend NOT diving in with deep learning until you have spent at least a couple of months reading around different methods. The reason is that there's a lot of "craft" to deep learning, and a lot of current best practice that changes literally month by month - anyone who dives straight in could easily spend three years just tweaking parameters.
  • Theunissen and Shaevitz (2006), "Auditory processing of vocal sounds in birds"

    • This one is not computer science, it's neuroscience - it tells you how birds recognise sounds!

      A question for you: should machines listen to bird sounds in the same way that birds listen to bird sounds?

  • O'Grady and Pearlmutter (2006), "Convolutive non-negative matrix factorisation with a sparseness constraint"

    • An example of analysing a spectrogram using "non-negative matrix factorisation" (NMF), an interesting and popular technique for identifying repeated components in a spectrogram. NMF is not widely used for bird sound, but it certainly could be useful - maybe for feature learning, or for decoding, who knows. It's a tool that anyone analysing audio spectrograms should be aware of (minimal example below).
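      Here's the basic (non-convolutive) version in a few lines, using scikit-learn; O'Grady and Pearlmutter's paper adds convolution across time and a sparseness constraint on top of this idea. `S` is a hypothetical non-negative magnitude spectrogram:

      ```python
      import numpy as np
      from sklearn.decomposition import NMF

      model = NMF(n_components=8, init='nndsvd', max_iter=500)
      W = model.fit_transform(S)   # spectral templates (freq x components)
      H = model.components_        # activation of each template over time (components x time)
      approx = W @ H               # low-rank reconstruction of S; inspect W and H
      ```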
  • Kershenbaum et al (2014), "Acoustic sequences in non-human animals: a tutorial review and prospectus"

    • A good overview from a zoologist's perspective on animal sound considered as sequences of units. Note, while you read it, that sequences-of-units is not the only way to think about these things. It's common to analyse animal vocalisations as if they were items from an alphabet "A B A BBBB B A B C" (toy example below), but that way of thinking ignores the continuous (as opposed to discrete) variation of the units, as well as any ambiguity in what constitutes a unit. (Ambiguity is not just failure to understand: it's used constructively by humans, and probably by animals too!)
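      As a toy illustration of that discrete-alphabet view - and of how much it throws away, since every unit is forced to be exactly "A", "B" or "C" - here's a first-order transition table over the made-up unit sequence from the text:

      ```python
      from collections import Counter
      from itertools import pairwise  # Python 3.10+

      sequence = list("ABABBBBBABC")   # invented unit labels for one vocal bout
      counts = Counter(pairwise(sequence))
      units = sorted(set(sequence))
      for a in units:
          total = sum(counts[(a, b)] for b in units) or 1
          print(a, {b: round(counts[(a, b)] / total, 2) for b in units})
      ```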
  • Benetos et al (2013), "Automatic music transcription: challenges and future directions"

    • This is a good overview of methods used for music transcription. In some ways it's a similar task to identifying all the bird sounds in a recording, but there are some really significant differences (e.g. the existence of tempo and rhythmic structure, the fact that musical instruments usually synchronise in pitch and timing whereas animal sounds usually do not). A big difference from "speech recognition" research is that speech recognition generally starts from the idea of there just being one voice. The field of music transcription has spent more time addressing problems of polyphony.
  • Domingos (2012), "A few useful things to know about machine learning"

    • Lots of sensible, clearly-written advice for anyone getting involved in machine learning.

And finally, the textbooks I promised for a more complete intro:

  • "Machine learning: a probabilistic perspective" by Murphy - a thorough modern reference for the machine-learning side.
  • "Nature's Music: the Science of Birdsong" by Marler and Slabbekoorn - a great comprehensive textbook about bird vocalisations.
Friday 13th November 2015 | science

Emoji understanding fail

I'm having problems understanding people. More specifically, I'm having problems now that people are using emoji in their messages. Is it just me?

OK so here's what just happened. I saw this tweet which has some text and then 3 emoji. Looking at the emoji I think to myself,

"Right, so that's: a hand, a beige square (is the icon missing?), and an evil scary face. Hmm, what does he mean by that?"

I know that I can mouse over the images to see text telling me what the actual icons are meant to be. So I mouse over the three images in turn and I get:

  • "Clapping hands sign"
  • "(white skin)"
  • "Grinning face with smiling eyes"

So it turns out I've completely misunderstood the emotion that was supposed to be on that face icon. Note that you probably see a different image than I do anyway, since different systems show different images for each glyph.

Clapping hands, OK fine, I can deal with that. Clapping hands and grinning face must mean that he's happy about the thing.

But "(white skin)"? WTF?

Is it just me? How do you manage to interpret these things?

Tuesday 10th November 2015 | technology

How to cook the perfect Lancashire hotpot

Lancashire hotpot is a classic dish where I come from. Lamb, onion, potatoes, slow-cooked.

There's a short version of this post: Felicity Cloake's "perfect Lancashire hotpot" article in the Guardian is correct. Read that article.

Really the main way you can mess up Lancashire hotpot is by trying to fancy it up. As Cloake says, don't pre-cook the potatoes or the onions, or the meat. With the meat, lamb neck is a good choice, easy to find in supermarkets and good for slow cooking. (I bet Cloake is right that mutton is more traditional and would suit it well, but I don't tend to find that in the shops.) Cut the meat into BIG pieces - not "bite-size" pieces as in many stews, and not the bite-size pieces you get in supermarket ready-diced meat. Bigger than that. At least an inch thick.

I'm pretty sure I remember there being carrots in the regular school hotpot, so I add carrot (in big chunks so it stands up to the long cooking). Floury potatoes (not waxy) are definitely the right way to do it - and for the reasons mentioned by Cloake: "the potatoes that have come into contact with the gravy dissolve into a rich, meaty mash, while those on top go crisp and golden – for which one needs a floury variety such as, indeed, a maris piper." I've got a standard recipe book here which says to put some potatoes on the bottom as well as the top; that seems a bit odd at first glance, but it gives you a good ratio of crispy potato to melted potato...

In a sense this is basically just a stew/casserole and you can do what you like, so I try not to be too dogmatic - but it's one of those minimalist recipes where if you mess about with it too much you have "just another stew" rather than this particular flavour. It's traditional to use kidneys as well as the meat (my grandma did that), but we didn't have that at school, and certainly when I'm cooking just for me I'm not going to bother. However, I'm shocked to see Jane Horrocks suggest putting black pudding in underneath the potatoes! It's also mentioned by commenters on the Guardian article, so I assume it must be a habit in some bits of Lancashire... but not my bit.

That aside, the recipe to look at is Felicity Cloake's "perfect Lancashire hotpot" article in the Guardian.

Sunday 1st November 2015 | food

Getting neural networks and deep learning right for audio (WASPAA/SANE)

I'm just back from a conference visit to the USA, to attend WASPAA and SANE. Lots of interesting presentations and discussions about intelligent audio analysis.

One of the interesting threads of discussion was about deep learning and modern neural networks, and how best to use them for audio signal processing. The deep learning revolution has already had a big impact on audio: famously, deep learning gets powerful results on speech recognition and is now used pervasively in industry for that task. It's also widely studied in image and video processing.

But that doesn't mean the job is done. Firstly, speech recognition is only one of many ways we get information out of audio, and other "tasks" are not direct analogies: they have different types of inputs and outputs. Secondly, there are many different neural net architectures, and still much lively research into which architectures are best for which purposes. Part of the reason that big companies get great results for speech recognition is that they have masses and masses of data. In cases where we have modest amounts of data, or data without labels, or data with fuzzy labels, getting the architecture just right is an important thing to focus on.

And audio signal processing insights are important for getting neural nets right for audio. This was one of the main themes of Emmanuel Vincent's WASPAA keynote, titled "Is audio signal processing still useful in the era of machine learning?" (Slides.) He mentioned, for example, that the intelligent application of data augmentation is a good way for audio insight to help train deep nets well (a toy sketch below). I agree, but in the long term I think the more important point is that our expertise should be used to help get the architectures right. There's also the thorny question (and hot topic in deep learning) of how to make sense of what deep nets are actually doing: in a sense this is the flip-side of the architecture question - making sense of an architecture once it's been found to work!
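To make the augmentation point concrete, here's a toy sketch (my own illustration, not an example from the talk) of label-preserving perturbations applied to training audio:

```python
import numpy as np

def augment(wav, sr, rng=np.random.default_rng()):
    out = wav * rng.uniform(0.5, 1.5)             # gain variation
    out = out + rng.normal(0, 0.005, len(out))    # low-level background noise
    shift = int(rng.integers(0, sr // 4))         # up to 250 ms circular time shift
    return np.roll(out, shift)
```

The audio insight lies in choosing perturbations the label is genuinely invariant to - mixing in realistic background recordings, say, rather than arbitrary distortions.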

It's common knowledge that convolutional nets (ConvNets) and recurrent neural nets (specifically LSTMs) are powerful architectures, and in principle LSTMs should be particularly appropriate for time-series data such as audio. Lots of recent work confirms this. At the SANE workshop Tuomas Virtanen presented results showing strong performance at sound event detection (recovering a "transcript" of the events in an audio scene), and Ron Weiss presented impressive deep learning that could operate directly from raw waveforms to perform beamforming and speech recognition from multi-microphone audio. Weiss was using an architecture combining convolutional units (to create filters) and LSTM units (to handle temporal dependencies). Pablo Sprechmann discussed a few different architectures, including one "unfolded NMF"-type architecture. (The "deep unfolding" approach is certainly a fruitful idea for deep learning architectures. It was introduced a couple of years ago by Hershey et al. [EDIT: It's been pointed out that the unfolding idea was first proposed by Gregor and LeCun in 2010, and unfolded NMF was described by Sprechmann et al. in 2012. The contribution of Hershey et al. comes from the curious step of untying the unfolded parameters, which turns a truncated iterative algorithm into something more like a deep network.])

I'd like to focus on a couple of talks at SANE that exemplified how domain issues inform architectural issues:

  • John Hershey presented "Deep clustering: discriminative embeddings for single-channel separation of multiple sources". The task being considered was source separation of two or more speaking people recorded in a mixture, which is usually handled by applying binary masking to a spectrogram of the mixture. The task then becomes how to identify which "pixel" of the spectrogram should be assigned to which speaker. In some sense, it's a big multilabel classification task, with each pixel needing a label. Except, as John pointed out, it's not really a classification task but a clustering task, because when we get a mixture of two speakers and we want to separate them, we usually have no prior labels and no reason to care who is "Speaker 1" and who is "Speaker 2". Motivated by this, Hershey described an approach where a deep learning system is trained to cluster the pixels in a latent space. The objective function happens to be the same as the K-means objective, except that instead of learning which items go in which cluster, the net is being trained to move the items around in the latent space so that the cluster separation is maximised. (The work is described in this arxiv preprint; see the loss sketch after this list.)
  • Paris Smaragdis presented "NMF? Neural Nets? It’s all the same...". Smaragdis is well-known for his work on NMF (non-negative matrix factorisation) methods. He traced a great narrative arc of how you might start with NMF and then throw away the things that irritate you about it - such as the spectrogram, instead working with convolutional filters learnt from raw audio.

    (Note this other recurring theme: I already mentioned that Ron Weiss was also talking about waveform-based methods. Others have worked on this before, such as Sander Dieleman's paper on "end-to-end" deep learning for music audio. It's still not clear if ditching the spectrogram is actually that beneficial. Certainly if you do, you need lots of data in order to train successfully, as Weiss demonstrated empirically. I don't think I'd recommend ditching the spectrogram yet unless you're really sure what you're doing...)

    The really surprising thing about Smaragdis' talk (given his previous work) was that by the time he'd deconstructed NMF and built up a neural net with similar source-separation properties, the end result was a recognisable autoencoder - however, with a nicely principled architecture, and also some specific modifications (some "skip" connections and choices about tying/untying parameters). This autoencoder is not the same as NMF - it doesn't have the same non-negativity constraints, for example - but it's inspired by similar motivations.
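To make the deep clustering objective concrete, here's a numpy sketch of the loss as I read it from the arxiv preprint (in practice the embeddings V come from a network, and this quantity is minimised by backpropagation):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    # V: (n_pixels, emb_dim), a unit-norm embedding for each spectrogram pixel.
    # Y: (n_pixels, n_speakers), one-hot "ideal" assignments from training data.
    # |V V^T - Y Y^T|_F^2 asks the pairwise affinities between pixels to match
    # the ideal "same speaker or not" affinities. Expanding the norm avoids
    # ever forming the huge n_pixels x n_pixels matrices:
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```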

Happily, these discussions relate to some work I've been involved in this year. I spent some time visiting Rich Turner in Cambridge, and we had good debates and a small study about how to design a neural network well for audio. We have a submitted paper about denoising audio without access to clean data using a partitioned autoencoder, which is the first fruit of that visit. The paper focuses on the "partitioning" issue, but the design of the autoencoder itself has some similarities to what Paris Smaragdis was describing, and for similar reasons.

There's sometimes a temptation to feel despondent about deep learning: the sense of foreboding that a generic deep network will always beat your clever insightful method, just because of the phenomenal amounts of data and compute-hours that some large company can throw at it. All of the above discussions feed into a more optimistic interpretation, that domain expertise is crucial for getting machine-learning systems to learn the right thing - as long as you can learn to jettison some of your cherished habits (e.g. MFCCs!) at the right moment.

Monday 26th October 2015 | science

Understanding why some Muslim women wear a veil

I live in Tower Hamlets, the London borough with the largest proportion of Muslims in the UK. I see plenty of women every day who wear a veil of one kind or another. I don't have any kind of Muslim background so what could I do to start understanding why they wear what they do?

I went on a book hunt and luckily I found a book that gives a really clear background: "A Quiet Revolution" by Leila Ahmed. It's a book that describes some of the twentieth-century back-and-forth of different Islamic traditions, trends and politics, and how they relate to veils. The book has a great mix of historical overview and individual voices.

So, while of course there's lots I still don't understand, this book gives a really great grounding in what's going on with Muslim women, veils, and Western society. It's compulsory reading before launching into any naive feminist critique of Islam and/or veils. I'm sure feminists within Islam still have a lot to work out, and I don't know what the balance of "progress" is like there - please don't mistake me for thinking all is rosy. (There are some obvious headline issues, such as those countries which legally enforce veiling. I think to some Western eyes those headlines can obscure the fact that there are feminist conversations happening within Islam, and good luck to them.)

A couple of things that the book didn't cover, that I'd still like to know more about:

  1. The UK/London perspective. The book is written by an Egyptian-American, so its Western chapters are all about things happening in North America. I'm sure there are connections but I'm sure there are big differences too. (I am told that Deobandi Islam is pertinent in the UK; it's not mentioned in the book.)
  2. The full-covering face veils, those ones that hide all of the face apart from the eyes. Ahmed's book focuses mainly on the hijab style promoted by Islamists such as the Muslim Brotherhood, so we don't hear much about where those full face-coverings come from or what the women who wear them think.
Sunday 4th October 2015 | books

Bird diary: Meadow pipits on the Long Mynd

The Long Mynd is a range of hills in Shropshire. Very beautiful area this time of year. Lots of birds too. People often comment on the birds of prey: the buzzard, red kite and kestrel, soaring silently above and occasionally plummeting to pounce on something. Of course I'm more interested in the birds making the sounds all around.

I was most taken by the meadow pipits - as you walk around on the Mynd, they often leap surprised out of the heather and flitter away making alarmed "peeppeeppeep" sounds (or maybe more whistly than that, "pssppssppssp"). I saw a skylark too, ascending from the ground about 20 metres in front of me. It's great to witness it when they do that: an unhurried circling ascent, all the while burbling out their famously complex melodious song, like a little enraptured fax machine going to heaven.

While hanging around in the forest I noticed how many non-vocal bird sounds you can hear. The most common example is wing flutter sounds: I heard them from lots of different species, and the sound can often be very deliberate. The most surprising sound of all was when I was walking past a tree and heard a knocking sound. I thought, "Oh, is that a woodpecker starting up?" - but it wasn't. I could see the little bird on a branch a few metres away, and it was a coal tit doing a bit of a woodpecker impression. It would peck at the branch hard, about four times in a row, repeatedly, giving me the impression it might have been trying some sort of DIY. It also tried it on a second branch.

Lots of gangs of ravens around too - their curious adaptable calls reminding me of the ones I saw recently at Seewiesen. I often heard (from a distance) the song of the nuthatch - that nice simple ascending note that I first encountered when camping in Dorset. Now and again a jay, lovely orangey and cyan colouring contrasting with its raspy magpie-ish yell. The jays seem to be shy around here, unlike the one that used to hang around in our garden in London.

Of course all the usual gang was there too: lots of robins singing, jackdaw, magpie, wren, house sparrow, blackbird, stock pigeon, one wood pigeon, the occasional chiffchaff. I think I heard a goldcrest at one point but I'm unsure. One willow warbler down by the reservoir.

Friday 2nd October 2015 | birds