This season, I'm lead organiser for two special conference sessions on machine listening for bird/animal sound: EUSIPCO 2017 in Kos, Greece, and IBAC 2017 in Haridwar, India. I'm very happy to see the diverse selection of work that has been accepted for presentation - the diversity of the research itself, yes, but also the diversity of research groups and countries from which the work comes.
The official programmes haven't been announced yet, but as a sneak preview here are the titles of the accepted submissions, so you can see just how lively this research area has become!
Accepted talks for IBAC 2017 session on "Machine Learning Methods in Bioacoustics":
A two-step bird species classification approach using silence durations in song bouts
Automated Assessment of Bird Vocalisation Activity
Deep convolutional networks for avian flight call detection
Estimating animal acoustic diversity in tropical environments using unsupervised multiresolution analysis
JSI sound: a machine-learning tool in Orange for classification of diverse biosounds
Prospecting individual discrimination of maned wolves’ barks using wavelets
Accepted papers for EUSIPCO 2017 session on "Bird Audio Signal Processing":
(This session is co-organised with Yiannis Stylianou and Herve Glotin)
Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection preprint
Densely Connected CNNs for Bird Audio Detection preprint
Classification of Bird Song Syllables Using Wigner-Ville Ambiguity Function Cross-Terms
Convolutional Recurrent Neural Networks for Bird Audio Detection preprint
Joint Detection and Classification Convolutional Neural Network (JDC-CNN) on Weakly Labelled Bird Audio Data (BAD)
Rapid Bird Activity Detection Using Probabilistic Sequence Kernels
Automatic Frequency Feature Extraction for Bird Species Delimitation
Two Convolutional Neural Networks for Bird Detection in Audio Signals
Masked Non-negative Matrix Factorization for Bird Detection Using Weakly Labelled Data
Archetypal Analysis Based Sparse Convex Sequence Kernel for Bird Activity Detection
Automatic Detection of Bird Species from Audio Field Recordings Using HMM-based Modelling of Frequency Tracks
Please note: this is a PREVIEW - sometimes papers get withdrawn or plans change, so these lists should be considered provisional for now.
People love to take the vegans down a peg or two. I guess they must unconsciously agree that the vegans are basically correct and doing the right thing, hence the defensive mud-slinging.
There's a bullshit article "Being vegan isn’t as good for humanity as you think". Like many bullshit articles, it's based on manipulating some claims from a research paper.
The point that the article is making is summarised by this quote:
"When applied to an entire global population, the vegan diet wastes available land that could otherwise feed more people. That’s because we use different kinds of land to produce different types of food, and not all diets exploit these land types equally."
This is factually correct, according to the original research paper which itself seems a decent attempt to estimate the different land requirements of different diets. The clickbaity inference, especially as stated in the headline, is that vegans are wrong. But that's where the bullshit lies.
Why? Look again at the quote. "When applied to an entire global population." Is that actually a scenario anyone expects? The whole world going vegan? In the next ten years, fifty years, a hundred? No. It's fine for the research paper to look at full-veganism as a comparison against the 9 other scenarios they consider (e.g. 20% veggy, 100% veggy), but the researchers are quite clear that their model is about what a whole population eats. You can think of it as what "an average person" eats, but no it's not what "each person should" eat.
The research concludes that a vegetarian diet is "best", judged on this specific criterion of how big a population can the USA's farmland support. And since that's for the population as a whole, and there's no chance that meat-eating will entirely leave the Western diet, a more sensible journalistic conclusion is that we should all be encouraged to be a bit more vegetarian, and the vegans should be celebrated for helping balance out those meat-eaters.
Plus, of course, the usual conclusion: more research is needed. This research was just about land use, it didn't include considerations of CO2 emissions, welfare, social attitudes, geopolitics...
The research illustrates that the USA has more than enough land to feed its population and that this could be really boosted if we all transition to eat a bit less meat. Towards the end of the paper, the researchers note that if the USA moved to a vegetarian diet, "the dietary changes could free up capacity to feed hundreds of millions of people around the globe."
People who do technical work with sound use spectrograms a heck of a lot. This standard way of visualising sound becomes second nature to us.
As you can see from these photos, I like to point at spectrograms all the time:
(Our research group even makes some really nice software for visualising sound which you can download for free.)
It's helpful to transform sound into something visual. You can point at it, you can discuss tiny details, etc. But sometimes, the spectrogram becomes a stand-in for listening. When we're labelling data, for example, we often listen and look at the same time. There's a rich tradition in bioacoustics of presenting and discussing spectrograms while trying to tease apart the intricacies of some bird or animal's vocal repertoire.
But there's a question of validity. If I look at two spectrograms and they look the same, does that mean the sounds actually sound the same?
In strict sense, we already know that the answer is "No". Us audio people can construct counterexamples pretty easily, in which there's a subtle audio difference that's not visually obvious (e.g. phase coherence -- see this delightfully devious example by Jonathan le Roux.) But it could perhaps be even worse than that: similarities might not just be made easier or harder to spot, in practice they could actually be differently arranged. If we have a particular sound X, it could audibly be more similar to A than B, while visually it could be more similar to B than A. If this was indeed true, we'd need to be very careful about performing tasks such as clustering sounds or labelling sounds while staring at their spectrograms.
So - what does the research literature say? Does it give us guidance on how far we can trust our eyes as a proxy for our ears? Well, it gives us hints but so far not a complete answer. Here are a few relevant factoids which dance around the issue:
- Agus et al (2012) found that people could respond particularly fast to voice stimuli vs other musical stimuli (in a go/no-go discrimination task), and that this speed wasn't explained by the "distance" measured between spectrograms. (There's no visual similarity judgment here, but a pretty good automatic algorithm for comparing spectrograms [actually, "cochleagrams"] acts as a proxy.)
- Another example which Trevor Agus sent me - I'll quote him directly: "My favourite counterexample for using the spectrogram as a surrogate for auditory perception is Thurlow (1959), in which he shows that we are rubbish at reporting the number of simultaneous pure tones, even when there are just 2 or 3. This is a task that would be trivial with a spectrogram. A more complex version would be Gutschalk et al. (2008) in which sequences of pure tones that are visually obvious on a spectrogram are very difficult to detect audibly. (This builds on a series of results on the informational masking of tones, but is a particularly nice example and audio demo.)"
- Zue (1985) gives a very nice introduction and study of "spectrogram reading of speech" - this is when experts learn to look at a spectrogram of speech and to "read" from it the words/phonemes that were spoken. It's difficult and anyone who's good at it will have had to practice a lot, but as the paper shows, it's possible to get up to 80-90% accuracy on labelling the phonemes. I was surprised to read that "There was a tendency for consonants to be identified more accurately than vowels", because I would have thought the relatively long duration of vowels and the concentration of energy in different formants would have been the biggest clue for the eye. Now, the paper is arguing that spectrogram reading is possible, but I take a different lesson from it here: 80-90% is impressive but it's much much worse than the performance of an expert who is listening rather than looking. In other words, it demonstrates that looking and listening are very different, when it comes to the task of identifying phonemes. There's a question one can raise about whether spectrogram reading would reach a higher accuracy if someone learned it as thoroughly as we learn to listen to speech, but that's unlikely to be answered any time soon.
- Rob Lachlan pointed out that most often we look at standard spectrograms which have a linear frequency scale, whereas our listening doesn't really treat frequency that way - it is more like a logarithmic scale, at least at higher frequencies. This could be accommodated by using spectrograms with log, mel or ERB frequency scales. People do have a habit of using the standard spectrogram, though, perhaps because it's the common default in software and because it's the one we tend to be most familiar with.
- We know that listening can be highly accurate in many cases. This is exploited in the field of "auditory display" in which listening is used to analyse scatter plots and all kinds of things. Here's a particularly lovely exmaple quoted from Barrett & Kramer (1999): "In an experiment dating back to 1945, pilots took only an hour to learn to fly using a sonified instrument panel in which turning was heard by a sweeping pan, tilt by a change in pitch, and speed by variation in the rate of a “putt putt” sound (Kramer 1994a, p. 34). Radio beacons are used by rescue pilots to home-in on a tiny speck of a life-raft in the vast expanse of the ocean by listening to the strength of an audio signal over a set of radio headphones."
- James Beauchamp sent me their 2006 study - again, they didn't use looking-at-spectrograms directly, but they did compare listening vs spectrogram analysis, as in Agus et al. The particularly pertinent thing here is that they evaluated this using small spectral modifications, i.e. very fine-scale differences. He says: "We attempted to find a formula based on spectrogram data that would predict percent error that listeners would incur in detecting the difference between original and corresponding spectrally altered sounds. The sounds were all harmonic single-tone musical instruments that were subjected to time-variant harmonic analysis and synthesis. The formula that worked best for this case (randomly spectrally altered) did not work very well for a later study (interpolating between sounds). Finding a best formula for all cases seems to still be an open question."
Really, what does this all tell us? It tells us that looking at spectrograms and listening to sounds are different in so many myriad ways that we definitely shouldn't expect the fine details to match up. We can probably trust our eyes for broad-brush tasks such as labelling sounds that are quite distinct, but for the fine-grained comparisons (which we often need in research) one should definitely be careful, and use actual auditory perception as the judge when it really matters. How to know when this is needed? Still a question of judgment, in most cases.
My thanks go to Trevor Agus, Michael Mandel, Rob Lachlan, Anto Creo and Tony Stockman for examples quoted here, plus all the other researchers who kindly responded with suggestions.
Know any research literature on the topic? If so do email me - NB there's plenty of literature on the accuracy of looking or of listening in various situations; here the question is specifically about comparisons between the two modalities.
Last year I took part in the Dagstuhl seminar on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR). Many fascinating discussions with phoneticians, roboticists, and animal behaviourists (ethologists).
One surprisingly difficult topic was to come up with a basic data model for describing multi-party interactions. It was so easy to pick a hole in any given model: for example, if we describe actors taking "turns" which have start-times and end-times, then are we really saying that the actor is not actively interacting when it's not their turn? Do conversation participants really flip discretely between an "on" mode and an "off" mode, or does that model ride roughshod over the phenomena we want to understand?
I was reminded of this modelling question when I read this very interesting new journal article by a Japanese research group: "HARKBird: Exploring Acoustic Interactions in Bird Communities Using a Microphone Array". They have developed this really neat setup with a portable microphone array attached to a laptop which does direction-estimation and decodes which birds are heard from which direction. In the paper they use this to help annotate the time-regions in which birds are active, a bit like on/off model I mentioned above. Here's a quick sketch:
From this type of data, Suzuki et al calculate a measure called the transfer entropy which quantifies the extent to which one individual's vocalisation patterns contain information that predicts the patterns of another. It gives them a hypothesis test for whether one particular individual affects another, in a network: who is listening to whom?
That's a very similar question to the question we were asking in our journal article last year, "Detailed temporal structure of communication networks in groups of songbirds". I talked about our model at the Dagstuhl event. Here I'll merely emphasise that our model doesn't use regions of time, but point-like events:
So our model works well for short calls, but is not appropriate for data that can't be well-described via single moments in time (e.g. extended sounds that aren't easily subdivided). The advantage of our model is that it's a generative probabilistic model: we're directly estimating the characteristics of a detailed temporal model of the communication. The transfer-entropy method, by contrast, doesn't model how the birds influence each other, just detects whether the influence has happened.
I'd love to get the best of both worlds. a generative and general model for extended sound events influencing one another. It's a tall order because for point-like events, we have point process theory; for extended events I don't think the theory is quite so well-developed. Markov models work OK but don't deal very neatly with multiple parallel streams. The search continues.
A colleague pointed out this new review paper in the journal "Animal Behaviour": Applications of machine learning in animal behaviour studies.
It's a useful introduction to machine learning for animal behaviour people. In particular, the distinction between machine learning (ML) and classical statistical modelling is nicely described (sometimes tricky to convey that without insulting one or other paradigm).
The use of illustrative case studies is good. Most introductions to machine learning base themselves around standard examples predicting "unstructured" outcomes such as house prices (i.e. predict a number) or image categories (i.e. predict a discrete label). Two of the three case studies (all of which are by the authors themselves) similarly are about predicting categorical labels, but couched in useful biological context. It was good to see the case study relating to social networks and jackdaws. Not only because it relates to my own recent work with colleagues (specifically: this on communication networks in songbirds and this on monitoring the daily activities of jackdaws - although in our case we're using audio as the data source), but also because it shows an example of using machine learning to help elucidate structured information about animal behaviour rather than just labels.
The paper is sometimes mathematically imprecise: it's incorrect that Gaussian mixture models "lack a global optimum solution", for example (it's just that the global optimum can be hard to find). But the biggest omission, given that the paper was written so recently, is any real mention of deep learning. Deep learning has been showing its strengths for years now, and is not yet widely used in animal behaviour but certainly will be in years to come; researchers reading a review of "machine learning" should really come away with at least a sense of what deep learning is, and how it sits alongside other methods such as random forests. I encourage animal behaviour researchers to look at the very readable overview by LeCun et al in Nature.
InterSpeech 2016 was a very interesting conference. I have been to InterSpeech before, yes - but I'm not a speech-recognition person so it's not my "home" conference. I was there specifically for the birds/animals special session (organised by Naomi Harte and Peter Jancovic), but it was also a great opportunity to check in on what's going on in speech technology research.
Here's a note of some of the interesting papers I saw. I'll start with some of the birds/animals papers:
- Localizing Bird Songs Using an Open Source Robot Audition System with a Microphone Array by Suzuki et al. Some really interesting work from a Japanese group I hadn't met before - they use a simple USB multi-microphone device together with source-localisation software, and then use that to recover a multi-species transcript of activity. Potentially very useful.
- Recognition of Multiple Bird Species Based on Penalised Maximum Likelihood and HMM-Based Modelling of Individual Vocalisation Elements by Jancovic and Kokuer - I like their modulated-sinusoid detector - here it's incorporated into a system for multi-species recognition in polyphonic audio. Good to see this being developed. It'd be great if others could make use of this technique (I don't think source code is available...).
- Bird Song Synthesis Based on Hidden Markov Models by Bonada et al - this is work Jordi Bonada developed while on a research visit to our lab (C4DM), with a colleague of mine Rob Lachlan. Good probabilistic generation of birdsong needs to be developed, you see, because at the moment a lot of bird behaviour experiments use a small number of WAV files as play back, and the unnatural repetitiveness that results from this makes me uncomfortable! Hence we need work like this on flexible bird sound synthesis.
- Cost Effective Acoustic Monitoring of Bird Species by Ciira wa Maina. A neat paper about a prototype 'full solution', little Raspberry Pi sound recorders with a straightforward single-species sound detector.
- Feature Learning and Automatic Segmentation for Dolphin Communication Analysis by Kohlsdorf et al. I don't know what audio tracking methods dolphin researchers currently favour, but this is an interesting highly-scalable method based on feature-learning in spectrogram data.
- Robust Detection of Multiple Bioacoustic Events with Repetitive Structures by Frank Kurth. To be honest I don't grok this yet - it's a kind of enhanced autocorrelation, but I'm not sure what leads to it being said to be robust.
- A Real-Time Parametric General-Purpose Mammalian Vocal Synthesiser by Roger K Moore: a PureData patch for general-purpose mammal sound synthesis. Fun, and maybe useful for someone out there working on mammal sounds.
That's not all the bird/animal papers, sorry, just the ones I have comments about.
And now a sampling of the other papers that caught my interest:
- Retrieval of Textual Song Lyrics from Sung Inputs by Anna Kruspe. Nice to see work on aligning song lyrics against audio recordings - it's something that the field of MIR is in need of. The example application here is if you sing a few words, can a system retrieve the right song audio from a karaoke database?
- The auditory representation of speech sounds in human motor cortex - this journal article has some of the amazing findings presented by Eddie Chang in his fantastic keynote speech, discovering the way phonemes are organised in our brains, both for production and perception.
- Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech by Sofia StrÃ¶mbergsson. This survey is a great service for the community. The general conclusion is that Praat's pitch detection is really among the best off-the-shelf recommendations (for speech analysis, here - the evaluation hasn't been done for non-human sounds!).
- Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering by Heck et al - "zero-resource" speech analysis is interesting to me because it could be relevant for bird sounds. "Zero resource" means analysing languages for which we have no corpora or other helpful data available - all we have is audio recordings. (Sounds familiar?) In this paper the authors used some adaptation techniques to improve a method introduced last year based on unsupervised nonparametric clustering.
- Speech reductions cause a de-weighting of secondary acoustic cues by Varnet et al: a study of some niche aspects of human listening. Through tests of people listening to speech examples in noise they found that people's use of secondary cues - i.e. clues that help to distinguish one phoneme from another, which clues are embedded elsewhere in the word than the phoneme itself - changes according to the nature of the stimulus. Yet more evidence that perception is an active, context-sensitive process etc.
One thing you won't realise from my own notes is that InterSpeech was heavily dominated by deep learning. Convolutional neural nets (ConvNets), recurrent neural nets (RNNs), they were everywhere. Lots of discussion about connectionist temporal classification (CTC) - some people say it's the best, some people say it requires too much data to train properly, some people say they have other tricks so they can get away without it. It will be interesting to see how that discussion evolves. However, many of the other deep-learning based papers were much of a muchness: lots of people use a ConvNet or an RNN and, as we all know, in many cases they can get good results. They apply these to many tasks in speech technology. However, in many cases there was application without a whole lot of insight. That's the way the state of the art is at the moment, I guess. Therefore, many of my most interesting moments at InterSpeech were deep-learning-less :) see above.
(Also, I had to miss the final day, to catch my return flight. Wish I'd been able to go to the VAD and Audio Events session, for example.)
Another aspect of speech technology is the emphasis on public data challenges - there are lots of them! Speech recognition, speaker recognition, language recognition, distant speech recognition, zero-resource speech recognition, de-reverberation... Some of these have been running for years and the dedication of the organisers is worth praising. Useful to check in on how these things are organised, as we develop similar initiatives in general and natural sound scene analysis.
MLSP 2016 - i.e. the IEEE International Workshop on Machine Learning for Signal Processing - was a great, well-organised workshop, held last week on Italy's Amalfi coast. (Yes, lovely place to go for work - if only I'd had some spare time for sightseeing on the side! Anyway.)
Here are ...
I just read this new paper, "Metrics for Polyphonic Sound Event Detection" by Mesaros et al. Very relevant topic since I'm working on a couple of projects to automatically annotate bird sound recordings.
I was hoping this article would be a complete and canonical reference that I could use ...
On Monday I did a bit of what-you-might-call standup - at the Science Showoff night. Here's a video - including a bonus extra bird-imitation contest at the end!
I'm nearing the end of a great three-week research visit to the Max-Planck Institute for Ornithology at Seewiesen (Germany). It's a lovely place dedicated to the study of birds. Full of birds and ornithologists:
I'm visiting Manfred Gahr's group. We had some ideas in advance and ...