How Does the Brain Understand Speech? An Overview

In this video, I explore some of the basics of auditory neuroscience, with an emphasis on speech perception. The video briefly explains how sound works and how it is transduced into electrical signals that the brain can work with, before touching on how the brain processes speech.
________
This channel's Patreon (thank you to anybody who donates): / simonroper

Comments: 121

  • @himynameisben95
    @himynameisben95 2 months ago

    babe wake up, Simon Roper posted a 20-minute video

  • @Pupshah
    @Pupshah 2 months ago

    Babe time to gently drift off to a Simon Roper 20-minute video

  • @oravlaful
    @oravlaful 2 months ago

    me with every upload

  • @dinosaurandnapkin
    @dinosaurandnapkin 1 month ago

    Yuuuuuuuup

  • @stephanieparker1250
    @stephanieparker1250 2 months ago

    What I find amazing is that our brain can be physically “hearing” numerous sounds at once but know which is language and which is just noise.. which to pay attention to and which sounds to ignore and even “tune out”.

  • @amandachapman4708
    @amandachapman4708 2 months ago

    And, like the app Merlin, it can listen to birdsong and learn to identify the different birds, even when two or more different ones are singing at the same time

  • @stephanieparker1250
    @stephanieparker1250 2 months ago

    @@amandachapman4708 True! I love that app. 🙌

  • @FrozenMermaid666
    @FrozenMermaid666 1 month ago

    If only my hern could understand what natives say in other languages that I am learning when they speak fast and without properly articulating and enunciating each syllable and word, including Icelandic and Dutch, which aren’t easy to understand, unless they are spoken very clearly... I am advanced level in Icelandic and Norwegian and upper advanced level in Dutch and upper intermediate level in Norse and German etc, and I can understand lots of words and sometimes full sentences when natives speak more clearly, but when they don’t speak clearly I cannot understand a word, but others seem to understand them... If I see the written text / sub, I can understand over 95% of the words, so I feel like I should be able to understand more when natives speak, even when not speaking as clearly as those that teach languages...

  • @blackholesun4942
    @blackholesun4942 2 months ago

    00:24 Part 1: Sound
    04:46 Part 2: Hearing
    08:17 Part 3: Neurons
    12:35 Part 4: Human Research

  • @samdickinson9302
    @samdickinson9302 2 months ago

    Thank you for your beautiful work, Simon.

  • @ac87uk
    @ac87uk 2 months ago

    I've been despairing at the general state of YouTube lately, so I feel a renewed appreciation. Thanks for providing a sense of wonder without the clickbait.

  • @amillar7
    @amillar7 2 months ago

    This was a nice refresher of my speech-language pathology coursework. The entrainment section reminded me how one of my professors evaluated a client’s stuttering in Chinese even though she didn’t know the language. Her brain understood what a fluent syllable sounded like, regardless of meaning. Thanks!

  • @Matt19970
    @Matt19970 2 months ago

    Hey mate, love the content. I have been seeing a lot of quite viral TikToks recently using your content, particularly on historical accents. Just thought you'd like to know in case you weren't aware.

  • @modalmixture
    @modalmixture 2 months ago

    3Blue1Brown has a fantastic series on neural networks that really helped me understand the part around 11:00 - how layers of neurons might process information at different levels of abstraction.

  • @sevomat
    @sevomat 2 months ago

    Also the birds eating pasta at the end was the perfect visual accompaniment.

  • @stephanieparker1250
    @stephanieparker1250 2 months ago

    I’m betting many of your viewers are also fans of 3B1B. He’s pretty amazing. 👍

  • @katarzynabiel8798
    @katarzynabiel8798 2 months ago

    I soooo wish your channel had existed when I was at uni studying all of those topics

  • @SimonRGates
    @SimonRGates 2 months ago

    Re: words and the brain. There was a point when learning Japanese that words, even words I didn't know, became apparent in the stream of sounds. This happened quite early on, and way before I could work out where the words were when reading (no spaces in Japanese text) so presumably there is some clue in the sound that marks the boundary of words. Although, I'm not sure it's words per se, because 'words' in Japanese are often made up of a word plus a number of helper words (like concatenating the adjective 'nai' for negation etc), so maybe some sort of meaningful wordish unit.

  • @CaptainWumbo
    @CaptainWumbo 2 months ago

    all languages do have this phenomenon of words tending to end and start a certain way. In Japanese your biggest cues would be the particles and the very consistent form of verbs and adjectives. In addition to this people tend to speak in a way that organizes words into phrases, which gives listeners more clues where words are beginning and ending. It may be easier to think of this in the reverse though, there are certain sounds that CANNOT end a kind of word. You gradually become sensitive to this. But in general, unless the utterance is simple and brief, we will get lost if there's several unknown words even in our native languages. I might argue that れる せる ます ない are heard clearly by most people as separate words even if they are taught incorrectly as conjugations. Trying to think of them as belonging to the word they attach to feels a bit overwhelming, since the word begins to sound incredibly complicated compared to its dictionary form. Certainly most people hear 込む as its own word although it is almost always attached.

  • @SimonRGates
    @SimonRGates 2 months ago

    @@CaptainWumbo Yeah, I guess everything does have a limited set of ending sounds, except for the nouns which are usually marked. I expect most people would hear the helper words as words in their base form quite quickly - although my family can't pick out words in Japanese. Whether they are part of the word maybe depends on how you look at it; when I hear something like きかせらなかったら it feels like I'm treating it like a conjugated verb, rather than a set of linked words... although possibly how my head feels about it isn't a good guide to what's actually going on.

  • @MuffinHop
    @MuffinHop 2 months ago

    Prairie dogs might be the only animals that can closely challenge human language, being able to use verbs, subjects and adjectives for predators; they also have social chatter, which we lack the Rosetta Stone for. But people feel uneasy about prairie dog lab testing when one realizes their communication is close to ours. Makes you value them a lot more as animals.

  • @lukeharrison1655
    @lukeharrison1655 2 months ago

    your channel is fascinating to me

  • @samcousins3204
    @samcousins3204 2 months ago

    congrats on the MSc!

  • @oliverlavers2880
    @oliverlavers2880 1 month ago

    Honestly one of the most impressive channels on the planet. This is what the internet is for. I fooled around with this stuff in undergrad, and this was a very enjoyable and educational video to watch!

  • @helenamcginty4920
    @helenamcginty4920 1 month ago

    When I used to have music lessons as a child at one level one had to sing a note in a chord, eg the middle of 3. I also played the viola in amateur orchestras and still can tune in to different instruments as well as concentrate on the separate sounds at the same time as hearing the whole. I expect all professional musicians do this all the time without thinking. Has anyone studied this? Animals, like bats and whales that rely so much on sound must have amazing brain systems.

  • @C_In_Outlaw3817
    @C_In_Outlaw3817 2 months ago

    Hey Simon , I’m a medical student in the states interested in neurology. I can’t wait to watch this as I find the language parts of the brain fascinating 😊

  • @randzopyr1038
    @randzopyr1038 20 days ago

    Thank you, I was listening because I enjoyed your voice and have only the slightest interest in linguistics, and you sent me down a rabbit hole on neurons - so much more interesting and complex than I had been led to believe. I can't wait to learn more.

  • @askarufus7939
    @askarufus7939 2 months ago

    Recently I've been watching many "ghost hunting" videos where they "communicate" with ghosts using some kind of device that produces waves that sound like old broken TV. When they ask questions and we hear those waves they can really sound like human responses. The authors usually put the phrases that they interpreted written on the screen and that always made me think "wow, it really sounds that way!" As my native language is Polish, I decided to make a little experiment. I tried to interpret only the audio, without reading what is written on screen and therefore is someone's else interpretation. What turned out was that I started hearing these sounds like it was someone speaking Polish. No way I could make any English speech out of this. It's like my brain was wired to "fill in the gaps" only in the language of my daily use. Brain hears a bunch of audio mess and "corrects" it in the frames of patterns that it is used to. The interesting thing is also (and I've heard it from many non native English speakers) that in order to better understand what's being said I have to listen to English at at least 130% volume of what I would be able to hear in my native Polish. In Polish I can understand even things I can barely hear. It makes you think of the scale of work that our brain does when helping us to make something out of what comes to our ears.

  • @tunneloflight
    @tunneloflight 2 months ago

    Great presentation Simon. There are very parallel structures to these involved in vision. The superior colliculus serves a similar function to the inferior colliculus. The SC directs movement of the eyes to find movement and to integrate the function of both eyes. It seems likely that the IC does a similar function using sound. In evolutionary history I would be unsurprised to learn that positioning of the ears to localize sound changes is controlled by the IC in animals with movable and shapable ears. Similarly, the lateral geniculate nucleus (nuclei) parallels the medial geniculate nucleus, with the medial specialized for sound and the lateral for vision. The layer cake structure of the LGN do many functions, most of which are as yet not understood. It does do two major functions that we know - One is receiving the differential signal across the visual surface, from which the LGN directly extracts and refines all of the visual edges in the field of view. These are sent to the visual cortex separately from the color and black and white 'images'. The visual cortex then integrates this information to find surfaces and extract shapes and present the whole lot to the conscious mind and the subconscious mind integrated with time delayed information from the ears, touch and other sense organs. The LGN also finds the movement of edges within the visual field. These represent movement in the world - > potential danger or opportunity. The LGN prioritizes sending this information both to the SC to automatically direct gaze to look at the movement in advance of image formation and conscious or subconscious awareness, and to the default mode network via a specialized part of the visual cortex to alert the higher minds to danger or opportunity. I would be unsurprised to learn that the MGN performs similar functions for sound. We know that the brain processes the audio from matched signals within about 35 hz from both ears to find common sounds differing in frequency (doppler effects), and then produces the average and difference between these as separate 'sounds'. Neither was actually heard. The differential signal can be used to localize a sound source. It also 'resonates' with various parts of the brain and can interact with frequencies that the brain uses for other functions - such as going into deep resting states, disconnecting the brain from the spine to allow dreaming without thrashing, and much more. It would be very interesting to learn what other parallels may exist between the visual and auditory circuitry. And in the case of dolphins with their sonar, and image formation from that. It seems likely that 'speech' in their minds is likely processed in ways more similar to vision and presented as such. No doubt they also then do that in reverse, presenting a visual signal outward through sonar to speak in images. We are primitive by comparison. There is another interesting aspect that may give us more clues. Some of us (likely 1-2% of the population) are extremely sensitive to visual flicker in the 100-120 Hz range. I am one such. This visual flicker is quite common from LED lighting. LED lighting quite commonly uses extremely inexpensive power conversion circuitry to convert 50 or 60 Hz AC power into DC power that has a 100 or 120 Hz ripple. This higher frequency ripple is faster than the visual field for image processing can handle. It is lost from the visual information as a result through fusion. 
However it is well within the range of the eyes, the superior colliculus and the lateral geniculate nuclei to detect and process. When LED lights are operating and flickering they result in the entire visual field flickering. In sound this would be the equivalent of having a tone present in the environment. When the brain is unable to localize this apparent source of danger because it is literally everywhere in the visual field, the LGN then signals the visual cortex, the precuneus and the default mode network that there is danger. When that danger does not resolve, the brain then seems to attempt to localize the apparent source of movement by directing that the ears listen more closely. This then results in the default mode network directing the medial geniculate nucleus to then work the circuitry backward that you've described including the fusiform cells and particularly the inner hair cells to tell the ears to listen more closely. In doing so the brain seems to make a mistake and the inner hair cells reactivate circuitry of outer hair cells that have died. You'll know these as the high-pitched sounds that you hear every now and again that then fade away. That is an outer hair cell dying and then being removed from circuitry involved in hearing. The consequence then is that these now dead cells send signals to the brain which are not real sounds anywhere but in the mind. These can sound like very different things but quite often they can be things like very loud tri-tonal sounds at high frequencies, or crashing waves or burbling or other sounds. If this is true then it may present a better understanding allowing better treatment for tinnitus. And also a greater understanding of the parallels between the visual and auditory systems and the integration between them.

  • @chriflu
    @chriflu 2 months ago

    Awesome video! In case you're reading the comments, this led me to a few "neurolinguistic" questions that you could touch upon in future videos. For the sake of brevity I'll focus on voiced vs. unvoiced consonants first. - Foreign language learning: For context, my native dialects are Bernese German and Viennese German (two very Southern dialects of German) and my realization of Standard German is a quite neutral, but distinctly Southern one: i.e. complete devoicing of all consonants regardless of position, independence of vowel length and vowel quality, over-average rhoticity etc., but otherwise quite "standard". As a child some part of the neural network in my brain clearly made a distinction between voiced and unvoiced consonants because I could tell if some-one had a Northern or "German" accent, but that was not linked to meaning. The only information I got from this distinction was where some-one was from. Then, when I learnt foreign languages as a teenager, I understood that "face" and "phase" (English), "ruse" and "russe" (French) or "fredda" and "fretta" (Italian) weren't the same thing semantically. And I remember this was quite a challenge for me. What happened in my brain back then? Something must have been rewired between the areas that process sound, the areas that process syllables, a lot of in-between areas and the areas that process meaning, right? How does that rewiring exactly work? - AI - why is it so bad at some of this? For example, I use Google Car Play and I need to talk to it in a caricatural North-German accent for it to understand me. Otherwise, my "s" becomes "ss" or "c", my "d" becomes "t", my "b" becomes "p" in the transcription etc. It, therefore, misunderstands every single address and it's just sad for every-one - whilst even people from Hamburg can understand me perfectly well because the language processing part of their brain switches to "Southern accent" mode within a milli-second (the same way some-one from, say, New Zealand can immediately understand some-one from South Carolina and vice versa once they've figured out each others' accents. Why is AI language recognition so bad at this?) - Why are tonal languages the exception rather than the rule? Why do we differentiate two of the three basic musical phenomena (melody, harmony, rhythm) by pitch first and foremost rather than timbre/attack profile etc. while we usually differentiate phonemes by the latter two characteristics - which do not play as much of a role in music? What happens in the brain there? And is it a bug or a feature? Could it be that music and song emerged so early in human history that languages evolved in a way that would permit us to set ANY text to music? I would love to learn about your thoughts about these topics. Great channel!

  • @sgriggl
    @sgriggl 2 months ago

    I can venture a guess at your AI question, at least partially. AI only "knows" about the world thru its training data. And it tends to be trained on "standard" languages and dialects. So when it's fed a bunch of data on German, it probably has very little Southern-accented data --- just as when it's trained on English, it gets little to no data for non-standard dialects (this is definitely a problem across languages). Human speakers probably have both the advantage of having more "data" (more experiences listening to, and comprehending, other accents and dialects), as well as the ability to take in new data and begin processing / training on it on the fly, as they are exposed to it (usually in context, where they have other clues as to the meaning). The computer just has the audio data, and doesn't "realize" it's "hearing" it wrong --- b/c it also isn't trying to "make sense" of the speech data. For speech recognition specifically, it just sees the task as converting audio data to orthographic data. The human, on the other hand, is trying to "communicate" --- a very different task.

  • @Snazzysneferu
    @Snazzysneferu 2 months ago

    The Atomic Shrimp of linguistics

  • @superbodypop7289
    @superbodypop7289 2 months ago

    exactly

  • @joefization
    @joefization 2 months ago

    Hello fellow Shrimp fans! Never considered how Simon and Mike's viewers might overlap but I can't say I'm surprised finding you here. I'd like to see a collaboration video haha!

  • @Snazzysneferu
    @Snazzysneferu 2 months ago

    ​@@joefization They overlap in many things, not least in how lovely their comment sections are!

  • @DavideDF
    @DavideDF 2 months ago

    I love this comparison, Atomic Shrimp is amazing

  • @weirdlanguageguy
    @weirdlanguageguy 2 months ago

    That's a perfect comparison, thank you for bringing it to my attention

  • @glitteraapje7329
    @glitteraapje7329 2 months ago

    simon normally: i'm not a linguist UㅅU simon now: i am a neuro scientist OwO

  • @revolution1237
    @revolution1237 2 months ago

    It's quite astonishing, isn't it, how our brain can process numerous sounds simultaneously, yet still distinguish between conversation and mere noise... what to focus on and what sounds to avoid or even ignore completely.

  • @daveesons196
    @daveesons196 2 months ago

    Fascinating video, Simon!

  • @carlinberg
    @carlinberg 2 months ago

    Super interesting, great video!

  • @jimlay9312
    @jimlay9312 2 months ago

    Very cool and well done!

  • @watleythewizard2381
    @watleythewizard2381 2 months ago

    That was a good explanation. Thanks

  • @rollinwithunclepete824
    @rollinwithunclepete824 2 months ago

    Very Interesting, Simon! Thank you

  • @tedrussell1128
    @tedrussell1128 2 months ago

    Long time follower here. I'm not a YouTube fanatic, but yours are always worth the listen, and just the right length. This is one of your best yet. Some of it made me think of A Thousand Brains, by Jeff Hawkins, a brilliant (and he never hesitates to remind the reader of it) neuroscientist. Have you read it?

  • @ZBisson
    @ZBisson 2 months ago

    Thanks for the good video, Simon.

  • @myouatt5987
    @myouatt5987 2 months ago

    Absolutely fascinating - thank you! 😄

  • @sheilam4964
    @sheilam4964 2 months ago

    Very interesting and informative. Gives something to think about. Thx for doing this, filming it and sharing it with us.

  • @frankharr9466
    @frankharr9466 2 months ago

    That was pretty neat. Thank you. I've never had that level of detail before.

  • @jhonbus
    @jhonbus 2 months ago

    Brilliant. Looking forward to future videos on this theme!

  • @vitamins-and-iron
    @vitamins-and-iron 1 month ago

    this is a really interesting video. thank you!

  • @DocEmCee
    @DocEmCee 2 months ago

    I enjoyed this video even more than usual (which is already a lot!). Can't wait for the next one.

  • @dianetheone4059
    @dianetheone4059 2 months ago

    Excellent-sidebar-to-your-usual-content!

  • @laan_the_man7577
    @laan_the_man7577 2 months ago

    Thanks Simon

  • @ddboss2590
    @ddboss2590 1 month ago

    very informative and interesting! Hope you make more videos in this series!

  • @fbkintanar
    @fbkintanar 2 months ago

    Although I am mainly focused on higher level linguistic processing, I have become increasingly interested in the neurocognitive dimension of things. A lot of the introductory material in cognitive neuroscience focuses on visual processing as somehow paradigm. My recent thinking is that perception is layered on the control loops of perceptually-modulated movement and higher-level cognitive "actions", including speech acts. That said, perception is obviously very important, and audition and phonemic perception provide an interesting contrast to visual and tactile-haptic perception and control loops. From the perspective of linguistic understanding, I wonder whether you have given consideration to the universally available but seldom developed human competence for linguistically sophisticated sign language production and perception. For a couple of decades, the hypothesis has been circulating that sign language may have preceded speech, or evolved simultaneously. Important phenomena of language, like reference and conceptually-structuring actions in predicating, seem to have natural roots in gesture and the perception of gesture. One can even think of vocalizations as a kind of audible (and visible in facial expression) gesture. The possibility of lip reading seems to support this, as does the McGurk effect. I hope you continue to produce content like this, leading up to not only the evolution of phonemic systems, but broader cognitive systems of higher understanding of speech and sign languages.

  • @MaximSchoemaker
    @MaximSchoemaker 2 months ago

    The videography in this is stunning ✨

  • @ralphwortley1206
    @ralphwortley1206 2 months ago

    Obviously people with some degree of deafness, like me, would be particularly interested in this - and in further in-depth analysis. I am therefore grateful for the text, but ask whether in any future presentation you could run the text along the bottom of the frame. Thanks for an interesting presentation. (Ralph Wortley, PhD Psychology, Wits U.)

  • @chickendoyle
    @chickendoyle 2 months ago

    Could listen to you all day

  • @danielesantospirito5743
    @danielesantospirito5743 2 months ago

    Very interesting, good video! Especially the part concerning neural connections made a lot of intuitive sense, it just feels correct, at least as a simplified explanation.

  • @LimeyRedneck
    @LimeyRedneck 2 months ago

    Very clear video and looking forward to Part II 🤠💜

  • @Kargoneth
    @Kargoneth 2 months ago

    Fascinating.

  • @ArturoStojanoff
    @ArturoStojanoff 2 months ago

    Bro, I just started the semester at Uni and I'm taking Psycholinguistics II and it's about this. You're gonna make my class redundant.

  • @beepboop204
    @beepboop204 2 months ago

    i think Sellars whole "Myth of Jones" works well as a general explanation (in the manifest image of the world, as he would probably put it), but its cool we are progressing to the sort of scientific image of the world his students like Churchland are advocates of. i got to drink some beer with Churchland once, it was a very satisfying experience.

  • @rembo96
    @rembo96 2 months ago

    Very interesting video! Thank you, Simon.

  • @Ogma3bandcamp
    @Ogma3bandcamp 2 months ago

    Completely unrelated but I keep meaning to ask you, why does the Lowestoft accent sound quite similar to me to a West country Bristol accent? Extreme west England and extreme east England sharing a similar lilt and general feel.

  • @michaeldeloatch7461
    @michaeldeloatch7461 2 months ago

    Simon - in future, with this sort of video, please adopt a more peremptory tone. Even though the topic is of great interest to me, as I have a degree of hearing loss and sometimes struggle to understand speech in my dotage now, nevertheless toward the end you ASMR'd me to sleep in my office chair and I snapped awake almost falling into the floor when you said "thanks for watching" haha.

  • @cykkm
    @cykkm 2 months ago

    Simon, thank you. I've been in the field of speech production and understanding for… well, pretty much forever. Your presentation is extremely well structured, and I don't see any obvious mistakes or oversimplifications. I don't think I would be able to do such an informative sci-ed 20-minute-long video: I'd probably get stuck in not quite relevant details, which you so masterfully glossed over, such as starting with the cochlea and omitting other ear anatomy details, irrelevant here. I'm afraid to think how much work it must have taken to prepare such a concise lecture. Kudos to your excellent presentation, can't wait for the next part! Since I ought to nit-pick at least one thing, however little, here you go: Broca's name is stressed on the last syllable; he was French. :-)

  • @MrVibrating
    @MrVibrating 2 months ago

    Delving a little deeper, there's a fundamental relationship between energy and information. From a physics point of view, there's a minimum energy cost associated with changing one bit of information, known as the Landauer limit. From a practical, biological perspective, processing information is obviously a brain's raison d'etre, ours consuming some 20% of all the energy we metabolise. Consequently, there's strong selection pressure on processing efficiency, and hence the intricacies of how we process information are highly attuned to and shaped by these associated energy costs. At this stage, 'information' falls broadly into two categories; primary sensation of stimuli, and then the meta-information - the information _about_ that information, which provides all the context and meaning. As humans we're uniquely dispositioned towards the latter form of processing, language of course forming a kind of universal abstraction layer, able to model, record and convey whatever we can muster into words. You can see where this is going - there's a guiding, formative relationship between how we processes meta-information generally - and language more specifically - and its energy cost of processing. This convergence is brought into sharp relief by a particular cognitive quirk we all share - namely, the perception of *octave equivalence.* We have around ten octaves of resolving bandwidth between 20 Hz and 20 kHz, though only around eight octaves of accurate pitch discrimination, which tapers off towards the upper and lower extremities. Within this range, all factors of two of any given frequency are octaves of that fundamental, producing this perception of ineffable parity that is maximum harmonic consonance. Octaves are the simplest-possible frequency relationship with respect to a processing system in which there is a fundamental convergent axis between informational and thermodynamic entropies. The former can be thought of in terms of Shannon difference - the unpredictability of a signal - the latter in terms of network connectivity, impulse rates and energy efficiency. One might argue that the unison at 1:1 is simpler and more consonant, but in terms of pure-tone stimuli a unison is simply the same signal superimposed - the phase angle may add a few decibels or cancel out entirely, but it's a harmonic interval of zero. The 'first harmonic' in every sense is thus the octave, resolving to the shortest temporal integration windows (TIW) in the respective processing nuclei. The same convergence of entropic minima in the time domain results in rhythm induction - so your basic 4/4 or 3/6 time signatures are the simplest rhythms with factors of two between bass, snare and hat beats, all the actual artistry built thereon. Octaves sound 'the same' as each other, such that pitch-class can be a thing - effectively stratifying a big chunk of bandwidth into discrete bands like lines superimposed over a page. The octave can thus be subdivided into smaller chunks of increasing entropy, relative to this equilibrium baseline of minimum entropy. And so for language, as for tonal systems; octave equivalence represents peak harmonic consonance, all subsequent, necessarily more-complex intervals increasingly inequivalent or dissonant, resolving to longer TIWs and costing more energy to process, and hence language too - all the meta-information we process in fact - reduces to spatiotemporal modulation of log(2) spectra or factor of two symmetry. The male vocal range spans around an octave, the female, two. 
Moreover, the processing principles underlying resolution of the octave equivalence paradox are more fundamental than audition itself, playing a similar role in all other modalities - even in V1 visual cortex, scene depth is mapped to hexagonal matrices delineated by octave bandwidths between the resolving pyramidal cells. Most sensory modalities have at least a couple of octaves of bandwidth. Arguably, reproduction of these principles in automata poses a more distinguishing benchmark than merely passing a Turing test; true artificial sentience should perceive octaves - and thus everything else within them - in much the same way us animals do. In conclusion, the rhythmic and pitch characteristics of language are mediated by these thermodynamic expediencies, consciousness is largely an entropy-reduction process, and simultaneous extraction of corresponding spatial and temporal components likely forms the basis of lateralisation asymmetry between cerebral hemispheres.
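
A minimal Python sketch of the octave-equivalence idea discussed above (the 440 Hz reference and the example frequencies are arbitrary illustration values, not anything from the video): frequencies related by a factor of two collapse onto the same pitch class.

```python
import math

def pitch_class(freq_hz, ref_hz=440.0):
    """Map a frequency to a pitch class in [0, 1); octaves collapse to the same value."""
    return math.log2(freq_hz / ref_hz) % 1.0

# 110, 220, 440 and 880 Hz are octaves of the same pitch, so they share a pitch class;
# 660 Hz (a fifth above 440 Hz) lands somewhere else.
for f in (110.0, 220.0, 440.0, 880.0, 660.0):
    print(f"{f:7.1f} Hz -> pitch class {pitch_class(f):.3f}")
```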

  • @AM-sw9di
    @AM-sw9di 2 months ago

    Very interesting stuff. It did make me think about how when someone calls out (usually in a public area where there is a lot of sound anyway) you don't always register what they have said, only that it has been said, the amplitude, and the direction where its coming from. I wonder if this is because the sound is distorted in some way like an unrecognisable accent speaking the same language, and the brain can't recognise it? Or maybe because its startling something goes on with the amygdala which interferes and makes us only register the fact that it was loud and potentially a sign of danger? The amygdala can fire faster than its possible to consciously acknowledge something, and animals will often look quickly for sound regardless of what it is if it is alarming and unfamiliar enough. I suppose this is obvious. Though I have found that usually when someone calls for help in an alarming way it can take a few seconds for people to register what is being communicated, I have no real example of this though.

  • @ErikScott128
    @ErikScott128 2 months ago

    5:07 Minor (pedantic) point here. Technically, if you want to preserve the time dimension, you would use a wavelet transform or short-time Fourier transform. This is a very similar idea, and is basically a local Fourier transform focused on a narrow window of time, repeatedly performed at all points in time. A classic Fourier transform looks at the entire signal at once, for all values of time, and determines the magnitude and phase (encoded as a complex number) of frequencies for the entire signal. Time information is lost. Time and frequency are inherently related (one being a reciprocal of the other when considering the units). So, there is inherent uncertainty when discussing frequency at a point in time. The level of uncertainty is associated with the time interval and the frequency in question. For instance, you need a larger time interval be be more certain about a lower frequency than you would a high frequency. This is reflected in the output of a wavelet transform. The 3b1b video you linked is fantastic, but unfortunately he doesn't have one on wavelets (since they're much less commonly used). This is a video I happened across a while back which does a great job explaining wavelets, while also doing a good job explaining some of the more complex (pun intended) mathematical background required: kzread.info/dash/bejne/nKKs06qcf8W0e5c.html 7:30 This is interesting to me. Temporal frequency in sound (on the order of 10s of Hz to 10s of kHz) is transduced into a (discrete) spatial representation (different nerves), while magnitude has been transduced into a temporal frequency representation in terms of neuron firing. I would be interested to know how this actually works. From what little biology I've learned (and from your later explanation), I know the neurons don't really have intermediate states; they're quite binary (either they're firing, or they aren't) so I guess it does have to kind of be this way. I'm just interested in the physical/mechanical/biological/chemical/whatever mechanism that turns an amplitude modulated (AM) signal into what is essentially a pulse-frequency modulation (PFM) signal. Might be a bit deeper of an explanation then you were intending for this video, though, and I can appreciate that. Overall, I enjoyed and learned a lot from this video. Looking forward to the follow-up.
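
For readers unfamiliar with the short-time Fourier transform mentioned in this comment, a small sketch assuming NumPy and SciPy are available (the two-tone test signal and the 25 ms window are made-up illustration values): unlike a single whole-signal Fourier transform, each output column carries a time stamp, so you can see which frequency dominates when.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                       # sample rate in Hz
t = np.arange(0, 1.0, 1 / fs)    # one second of samples

# Toy signal: a 300 Hz tone for the first half second, an 800 Hz tone for the second half.
x = np.where(t < 0.5, np.sin(2 * np.pi * 300 * t), np.sin(2 * np.pi * 800 * t))

# Short-time Fourier transform with 25 ms windows: time information is preserved.
freqs, times, Z = stft(x, fs=fs, nperseg=400)

# Report the strongest frequency bin for each time frame.
dominant = freqs[np.abs(Z).argmax(axis=0)]
print(times[:3], dominant[:3])    # early frames -> close to 300 Hz
print(times[-3:], dominant[-3:])  # late frames  -> close to 800 Hz
```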

  • @lunkel8108
    @lunkel8108 2 months ago

    Neurons are kind of half AM and half PFM. Before the axon, so in the dendrites and the main body (soma), the "value" of the neuron manifests as the magnitude of the membrane potential, which can vary continuously. The conversion to pulse frequency takes place at the axon hillock, which initiates the pulses (called action potentials) which then travels down the axon. (note: all of this is of course simplified). To understand how the conversion works, you have to understand what a pulse looks like. The resting state of the membrane is -70 mV, so the inside of the cell is a bit more negative than the outside. An action potential is initiated when the membrane reaches -55 mV. First a lot of positive sodium ions flow into the cell until the potential reaches about +40 mV (this positive potential is what causes the next bit of membrane to activate). Then a lot of positive potassium ions flow out of the cell to make the potential negative again, but overshoot by a bit (to something like -90 mV instead of -70 mV), before the membrane returns back to the resting state. Only then can the next action potential be triggered. This so-called hyperpolarization is the important part. If the membrane potential in the soma is more positive, the axon hillock returns from that inactive hyperpolarized state more quickly (because positive charges from the soma can flow into that region) and so the next action potential can happen sooner. The signal is then converted into an amplitude modulated signal again at the synapse. I hoped this helped to provide some insight! If anything is unclear, I'd be happy to clarify. Note that I'm not a neuroscientist, just a biochemist and I obviously left out a lot of details.
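
A toy leaky integrate-and-fire simulation along the lines of this explanation (a textbook simplification, not a model of real auditory neurons; all constants are illustrative) shows amplitude being re-expressed as spike rate: a weak input never reaches threshold, while stronger inputs fire progressively faster.

```python
def spike_rate(input_current, duration_s=1.0, dt=0.001):
    """Leaky integrate-and-fire neuron: count spikes per second for a constant input."""
    v_rest, v_thresh, v_reset = -70.0, -55.0, -90.0   # mV, roughly the values in the comment
    tau = 0.02                                        # membrane time constant in seconds
    v, spikes = v_rest, 0
    for _ in range(int(duration_s / dt)):
        v += dt * (-(v - v_rest) / tau + input_current)   # leak toward rest plus injected input
        if v >= v_thresh:        # threshold reached: fire, then hyperpolarize
            spikes += 1
            v = v_reset
    return spikes / duration_s

for current in (500.0, 1000.0, 2000.0):   # arbitrary input strengths; larger input = "louder" sound
    print(current, "->", spike_rate(current), "spikes/s")   # 500 stays subthreshold and never fires
```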

  • @samanthaclaire3114
    @samanthaclaire3114 2 months ago

    this was a great 20 minute revision sesh as someone who has just started their Master of Speech Pathology, thank you Simon keep up the good work!

  • @johannageisel5390
    @johannageisel5390 2 months ago

    Came here to learn something about robin language. I hope you don't disappoint.

  • @frostdova
    @frostdova 2 months ago

    fantastic video! I remember a bit of how speech is processed from a book I read a long time ago called "How the Mind Works". Looking forward to the next video!

  • @SOOKIE42069
    @SOOKIE42069 2 months ago

    This is a cool video! For a much funnier explanation of speech read Plato's Cratylus.

  • @modalmixture
    @modalmixture 2 months ago

    I wonder if the ease of syllable detection compared to phoneme detection is why the first sound-representational writing systems were syllabaries /abugidas, and alphabets only appeared later.

  • @ivaylostoyanov2515
    @ivaylostoyanov2515 2 months ago

    I thought you used to be studying archaeology at some point? How did you make the switch to cognitive neuroscience? Greetings from a fellow PaLS graduate! (BSc PaLS here)

  • @lunkel8108
    @lunkel8108 2 months ago

    I'm fairly certain it's the other way around. He started out with neuroscience and then switched to archaeology after he got his MSc in that

  • @jangtheconqueror
    @jangtheconqueror 2 months ago

    Wait I thought you were working towards a Masters in Archaeology, did you switch?

  • @AbhNormal
    @AbhNormal 2 months ago

    Rather hot take here but I honestly believe that sound is so essential to the human experience that, given the choice between being deaf or blind, I would much rather be blind.

  • @Leafbd
    @Leafbd 2 months ago

    at 18:06 you say syllables are reconstructable from a spectrogram, but then why do people disagree on where syllables begin and end within words or utterances? i think in most languages where you have a ~VCV~ sequence within a word it would be analyzed as having a break before the consonant, but in english it's usually analyzed as being after the consonant. is this actually measurable or is it only a convention? in my mind i separate english syllables the first way, different from the usual conventions for english

  • @antoninbesse795
    @antoninbesse795 2 months ago

    Interesting to watch this and then for comparison watch a good video explanation of how AI large language models interpret phrases

  • @bensabelhaus7288
    @bensabelhaus7288 2 months ago

    Wow, As an autistic with a verbal processing disorder, brain damage literally in the exact spot and side you highlighted in blue (3 separate incidents to boot) and selective mutism... You managed to say what I literally cannot. Wow. Yeah. Unless you deal with this stuff too, you're clearly paying attention in class and you're being brought up to date lol I understand this and due to it being part of my autism special interests I was actually able to mostly process it too. Something to consider from a neurodivergent perspective. Processing can be altered depending on subject. If it's something the mind wants, it magically functions better. If you can improve that impulse spike threshold, you could "cure" much of autism downsides. But would that then take away what can also be a superpower where we can actually focus on very complex things? Hard to say. But smoothing that transition (for me Cannabis does that. No high, just a nice clear mind that is functional) for all input would be nice. I suspect a lot of our anxiety is caused by processing issues like this. It prioritizes the odd noise as it could be a sabre tooth tiger trying to eat me. That's a good thing, situationally, but not when turned on 24/7 because a tone is set to high priority. Just some thoughts

  • @PureImprov
    @PureImprov 1 month ago

    I'm also autistic and found this interesting! I read that the Superior olivary complex is often different in the autistic brain. I did not know such a low level process was behind it and was only really aware of the differences in the prefrontal cortex involving working memory differences that are much more abstract. I agree, I seem to have to make up for this 'audio filtering' by applying more conscious focus or altering my environment. Definitely starting to understand there's a lot more I have to do consciously. Others seem to automatically get it done for them by their 'functional' brains. Doesn't this essentially mean autistics hear closer to the actual sound? If that's the case I really don't mind having more 'control' over what my brain presents to me as opposed to it doing the stuff for me because of some evolutionary adaptation! Takes up more energy though...

  • @ianremsen
    @ianremsen 2 months ago

    If I can ask a big basic question: is there any evidence that spoken language and written language are "processed" very differently by the brain? I lapsed briefly into a phonocentric view of language for a bit that I've since upended, and I'm curious if there's physiological evidence to suggest that writing is more than a representation of speech.

  • @artugert
    @artugert 2 months ago

    I would think there would be some difference in how the two are processed. I think in Chinese, the difference is even greater.

  • @dbass4973
    @dbass4973 2 months ago

    i tried and tried to switch off the linguistic part of my brain to save some energy for more important stuff but alas, to no particular avail

  • @mesechabe
    @mesechabe 2 months ago

    OK it’s all right to lap up more screen time. Simon is back.

  • @cleon_teunissen
    @cleon_teunissen 2 months ago

    I think the expression 'all mangled together' that you use at 4:16 is most unfortunate. The amazing thing is that the various frequencies that are present in sound co-exist without affecting each other. As sound propagates in air there is no loss of information. All constituent frequencies co-exist. A high performance frequency separation system can and will recover all of the propagating information. That is what the cochlea does. In effect the cochlea is a biomechanical spectrum analyser. (It is only at extremely large amplitude that non-linear effects are significant. Once in non-linear circumstances frequencies do affect each other, and then an expression such as 'mangled together' is descriptive.) The key point: superposition of sound frequencies is linear. Here 'linear' means that if at the source the amplitude of one particular frequency is increased then for receivers only that frequency will be louder, in the original proportion. (About an exception to that 'linear' that we can all experience: the sound energy of low frequencies dissipates slower than the energy of high frequencies. When there is a lightning strike very nearby the sound we experience is a high pitched sound, a sharp crack, because there is a lot of high frequency in that noise. When the lightning strike is far away, and the sound travels multiple seconds then by the time it arrives the energy of the high frequency components has dissipated, but the energy of the low frequency components is mostly still there, and we experience a rumbling sound. So that is an example of 'non-linear'; for different frequencies attenuation with distance is different.)
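
A quick numerical check of the linearity point, assuming NumPy (the three tone frequencies and amplitudes are made up): mix several sine waves, take an FFT, and each component's amplitude comes back out unchanged; doubling one component at the source changes only that one peak.

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)   # one second, so the FFT bins are exactly 1 Hz apart

def mixture(amp_440):
    """Sum of three tones; only the 440 Hz amplitude is adjustable."""
    return (amp_440 * np.sin(2 * np.pi * 440 * t)
            + 0.5 * np.sin(2 * np.pi * 1000 * t)
            + 0.2 * np.sin(2 * np.pi * 2500 * t))

def peak_amplitudes(x):
    spectrum = np.abs(np.fft.rfft(x)) * 2 / len(x)   # scaled so a sine of amplitude A reads ~A
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return {f: round(float(spectrum[np.argmin(np.abs(freqs - f))]), 3) for f in (440, 1000, 2500)}

print(peak_amplitudes(mixture(1.0)))   # roughly {440: 1.0, 1000: 0.5, 2500: 0.2}
print(peak_amplitudes(mixture(2.0)))   # only the 440 Hz component doubles
```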

  • @jacobpast5437
    @jacobpast5437 2 months ago

    It seems to me that Simon said exactly the same thing as you, or other way round, you said exactly the same thing as Simon. Sad then that you would get stuck on the phrase "mangled together" which Simon uses to emphasize the "compressed" state of acoustic information entering the ear (after all, as you say yourself "a high performance frequency seperation system" is needed to untangle the information) and interpret it to mean something that - taking context into consideration - obviously wasn't intended that way.

  • @cleon_teunissen
    @cleon_teunissen 2 months ago

    @@jacobpast5437 Indeed I should have watched the entire video first. Simon proceeded to explain the biomechanics of the cochlea, and I should have figured he was headed for that. Still, I find the choice of words awkward. Let me make a comparison. Let's say you have several sentences, about 10 words each. Of each sentence each constituent word is written on a separate card. As long as that stack of cards remains in order the information represented in the sentences is still there. If you shuffle that deck of cards a large proportion of the information is lost; the original sentences can't be reconstructed any more. I argue that the expression 'all mangled together' is in the same league as that shuffled state. By contrast, superposition of sounds involves very little loss of information. A better expression would have been for example 'all stacked together'. While the title of the video is 'how do brains process speech?', Simon of course brings up that the resolution into constituent frequencies does not require neuronal processing; it happens biomechanically. By contrast, disambiguation of homonyms requires cognitive processing.

  • @jacobpast5437
    @jacobpast5437 2 months ago

    @@cleon_teunissen It is of course your right to criticise the choice of words and I understand what you're getting at. Simon is probably the first to give you a point. I think part of the issue is that it is very important to you to emphasize that (almost) no information is lost, whereas I as a layman am more astonished at the amount of "compression" this information has undergone - several conversations, the music of the stereo playing, glasses and cutlery clattering, the doorbell ringing, the dog barking: everything is reduced to some specific position of the eardrum at any point in time. So I guess that's why I tend to be less "offended" at the phrase "mangled together", especially since Simon explains that information isn't lost. So, I guess, two sides of the same coin. I also have to admit that I am amazed at the simplicity (at least theoretically) of the biomechanical solution to this problem.

  • @cleon_teunissen
    @cleon_teunissen 2 months ago

    @@jacobpast5437 Indeed you have put the word 'compression' in quotes to indicate metaphoric use. By contrast: physical compression is irreversible, as in compression of aluminium cans into bales for transport. My guess is that in the age of telephones connected by wire the transfer of the information becomes associated with the imagery of a single line. So I can see how people may be inclined to use the word 'compressed' (in a metaphorical sense). To give you an idea of my outlook another comparison: a player piano playing music according to holes punched in a long sheet of paper that is fed along the mechanism. The holes in the paper are a representation of the music. The pattern of holes exists in a simultaneous form, and the holes don't interact. The sound propagating through the air is not compressed, in the sense that nothing irreversible has occurred as far as information is concerned. For me the metaphor is the whole width of that sheet of paper; it's all there. Sound is very, very rich in information. To illustrate that: comparison of video and sound digital formats. Video compression is lossy, in the sense that while to the naked eye the decompressed video file results in the same look as the raw image data, it is not identical. Likewise MP3 sound compression is lossy, in the sense that information that tends to be not registered by human hearing is discarded. The comparison: of a standard 720p video file with sound about half of it is compressed video information and the other half MP3 compressed audio information. That is: the amount of information required to adequately reproduce the sound is *at par* with the amount of informaton required to adequately reproduce the visuals. (Incidentally: sound information on a CD versus MP3 compressed sound information. The file on the CD is intended to allow full reconstruction of the sound signal as it was registered with the microphone. As mentioned in an earliner paragraph: MP3 compression is designed to leave out elements of the sound that tend to not be picked up by human hearing. The usual compression ratio is 1-to-11 )
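
The quoted 1-to-11 figure can be sanity-checked from the standard CD parameters (44.1 kHz sampling, 16 bits per sample, two channels) against a common 128 kbit/s MP3; a back-of-the-envelope calculation under those typical settings:

```python
# Uncompressed CD audio bitrate versus a typical 128 kbit/s MP3.
cd_bitrate = 44_100 * 16 * 2            # samples/s * bits/sample * channels = 1,411,200 bit/s
mp3_bitrate = 128_000                   # a common MP3 rate, in bit/s
print(cd_bitrate / 1000, "kbit/s on CD")              # 1411.2
print(round(cd_bitrate / mp3_bitrate, 1), ": 1")      # ~11.0, matching the ratio quoted above
```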

  • @jacobpast5437
    @jacobpast5437 2 months ago

    @@cleon_teunissen I was thinking of lossless compression (PNG, FLAC, ZIP etc.) but didn't know if this was actually a precise/fitting comparison to what happens during the "mixdown" (again in quotes...) of the soundwaves emitted from various sources in the environment reaching the ears.

  • @Bildgesmythe
    @Bildgesmythe 2 months ago

    The human brain seems to search for speech. Inanimate drones and screeches can be interpreted as speech. We hear birds saying "Katy did"; I heard my dying dryer saying "help me".

  • @pansepot1490
    @pansepot1490 2 months ago

    The squeaky wheel cries “oil me!”. ;)

  • @kargaroc386
    @kargaroc386 2 months ago

    mfw the ear contains a mechanical spectrometer

  • @cynocephalusw
    @cynocephalusw 2 months ago

    Parsing seems to me the crucial functionality in every digital system and language is a digital system. A parser can convert a sequence of symbols into a higher structured entity, for instance a tree. Ribosomes can parse RNA into a proteine. But proteines can't be serialised into a RNA again. For billions of years the DNA/RNA system was the only parsing system. The human brain is now able to parse, but also serialise already parsed structures like a FromJSON and a ToJSON system in a computer language. @SimonRGates described the critical point, when a parser got armed. The at first unstructured stream of sounds gets magically structure, despite there is no complete understanding. Syllables are good units for parsing to set in. Often words are only one syllable long. Take the compound word "Achsschenkelbolzen". The primary elements are "Achs-", "Schenk-" and "Bolz-". "Achs" is reduced from "Achs-e", "Schenk-el" and "Bolz-en" are expanded with general stuff. "el", "-ent", "-ung", "-er" or "-in", are very common to refine the meaning of a syllable. A "Bolz-er" is different from "Bolz-en" but the injective, aggressive connotation is in the first syllable. You can even create a new meaning by saying "Bolz-ung", the injection of a Bolz-en into something. So there is a hierarchical relation between syllables. In "Ent-Bolzung" despite the word is not in use, the parser will induce a notion of disconnection. In this case the prefix is an environment for things to come. Latin and Greek are famous for their pre-fixes. Greek accomplished very sophisticated ones like "ana-", "kata-", "meta-", "para-" or "hypo-".

  • @cykkm
    @cykkm 2 months ago

    I noticed that computer scientists often tend to see the world in the terms of their field, which is fallacious in perhaps almost all cases. I should perhaps note that the brain is extremely far from a digital system. I just wanted to warn you that analogies with the digital computer are misleading in this area.

  • @cynocephalusw
    @cynocephalusw 2 months ago

    @@cykkm Parsing is an overarching trait, that is not restricted to computer science. Text is an unidimensional chain. A representation in the brain is a multidimensional network. So parsing is inevitable and also serialisation. You are right, that applying models of the computer realm to the human brain is not valid. But certain tasks have to be solved in every digital system. Interestingly some constructs in language are workarounds for easing the pain of parsing.

  • @cykkm
    @cykkm 2 months ago

    ​@@cynocephalusw I just wanted to warn you not to make this error, in good faith. Whether to persist in making it is your choice. Speech sound is not a 1-dimentional digital sequence. The neuron is not a computing element, it's a living cell whose behaviour is affected by its environment-the soup of nutrients and ions-as much as its dendritic activation. Brain does not do serialization, every neuron signals in its own time. Your integrated perception of self is a grand illusion. When you press a button and feel the touch, see a flash of light and hear a beep, the time when we see the activation in respective cortical areas may be off by as much as 500ms, yet you perceive them as simultaneous. When you read, you don't scan text on page with your eyes sequentially, yet you perceive it as if you do. You can think of "parsing" and "serialization", but you're thinking of a computer, not related to the brain in any sense at all. These abstractions are utterly useless for understanding of speech and language processing. And please, I'm not telling you how not to think about the CNS-I'm just pointing that this thinking, while extremely common among computer scientists, yet it's an error. In other words, and I'm sorry for being perhaps irritatingly straightforward, but I'm not going to be convinced otherwise. I've been doing this thing for 30 years, and I used to make the same mistake, too. All attempts to model large scale information processing in the brain (even at the level of a cortical column) fail spectacularly. Markram has been the latest but not the last victim of this thinking: his project successfully digitally modelled a single cortical column (order of 10⁵ neurons) in 2015, followed by 9 years of non-progress. It's understandable, as the brain is inherently chaotic, opportunistically self-organising system with complex dynamics (in the systems theory sense), and attempts to model it by reducing to neurons or even columns are doomed to be about as successful as predicting daily weather for a year ahead.

  • @cynocephalusw
    @cynocephalusw 2 months ago

    @@cykkm Oh, that was a charge. But don't get me wrong. I don't make any assumptions how this parsing process is achieved in human brains, but it's obvious, that input and output are sequential. A written text is a sequence and the number of different symbols is limited. They are packed in syllables and words, but anyways: The whole arrangement is a sequence. And when we are writing, the performance is sequential. I have no doubts, that you are right by saying "the brain is inherently chaotic, opportunistically self-organising system with complex dynamics." But that doesn't affect the notion of less structured input and output. A proteine has a higher kinded structure, than its informational RNA. A proteine can be functional, the RNA only in a very limited sense. A written text will be translated in multimodal and multidimensional associations. I don't want to press biochemical systems with its totally different dynamics in a technical straightjacket, but digital systems have many many traits in common. For instance they make use of encoding (not in a DNA/RNA driven system) and decoding. The interesting point for me is the fact, that the the DNA/RNA model had a monopoly in digitality for billions of years. One reason for the ubiquitous serial structure in DNA, language/text and programming languages and textual files is portability. You can't copy and teleport the structure of a entire brain or the states of a computer from one to another, but you can carry serialised entities. Furthermore: Digital information can be translated in finegrained operations. Language (even in bees) is able to instruct another individual to do things in a different place and time without the presence of the instructor. That's a temporal and spatial decoupling process, that's mostly overlooked. Digital information has longevity. It is able to travel through time, while its incarnations are vanishing. There are many more examples for the advantages digital systems have over analogue ones, but its bedtime.

  • @cykkm
    @cykkm 2 months ago

    ​@@cynocephalusw He who wields a hammer, sees but nails. “There are many more … advantages digital systems have over analogue ones” - You've just told the Nature how wrong she has been and that she must instead start making brains in the "correct", i.e., digital way with a single CLK input pin-with a justification that the "correct" brains look to you more like nails. N-n-nice!!! I watch in an utter astonishment that you haven't even noticed _what_ you in fact said! Let's better leave it at that. I shall respectfully bow out.

  • @RealUlrichLeland
    @RealUlrichLeland 2 months ago

    16:14 Are hearing people with muteness who have grown up in contact with spoken language still able to identify the syllables that language is composed of? If they aren't then that might imply that a recognition of syllables in sound relies upon an understanding of how those sounds are produced with the mouth and vocal cords, and isn't a part of the brain that you're born with.

  • @saltpony
    @saltpony 24 days ago

    If the sound waves are interpreted in different sized cochlea… then does your Chihuahua hear your voice differently than your Great Dane? So all sound is interpretive and not fundamental in its decoding?

  • @antonioamatruda4579
    @antonioamatruda4579 2 months ago

    Shouldn't "phrase" and "sentence" be the same thing, with each of them needing at least a subject and a verb?

  • @kargaroc386
    @kargaroc386 2 months ago

    So what we're seeing here is basically an analog computer. And by "analog computer" I mean, there's no erasable program here, the "program" (as it were) is hard-wired, as in the makeup of how the processing elements are physically connected together *is* the program. But it is still a computer. If anyone here played minecraft and knows redstone, or something more complicated, this will be familiar to you.

  • @jacobpast5437
    @jacobpast5437 2 months ago

    I have never played Minecraft, but wanted to point out that we are talking about a _self-rewiring_ analog computer - the circuitry is constantly strengthened in some places, weakened in others, and even new connections are established and old ones disconnected (keeping in mind that these connections could be excitatory or inhibitory, as Simon mentions). So I'm not sure if the phrase "hard-wired" is really that fitting.
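
A crude sketch of the "self-rewiring" idea in this thread, using a textbook Hebbian-style update (the network size, learning rate and decay are arbitrary, and this is not how the video described the biology): connections between units that are repeatedly active together are strengthened, while unused ones decay toward zero.

```python
import numpy as np

weights = np.zeros((3, 4))                  # 3 "output" neurons fully connected to 4 "inputs"

def hebbian_step(w, pre, post, lr=0.1, decay=0.02):
    """Strengthen connections between co-active units; let the rest slowly fade."""
    return (1 - decay) * w + lr * np.outer(post, pre)

pre = np.array([1.0, 0.0, 1.0, 0.0])        # which inputs repeatedly fire together
post = np.array([1.0, 0.0, 0.0])            # which output fires along with them

for _ in range(50):
    weights = hebbian_step(weights, pre, post)

print(np.round(weights, 2))
# Only the connections from inputs 0 and 2 onto output 0 have been strengthened;
# everything else stays at zero (or would decay toward it).
```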

  • @antrewt
    @antrewt 1 month ago

    Of greater import is the question: how does the brain process toxic pathogens? Not by examining neuroscience, which is replete with such pathogens. Perception is seeing, feeling, understanding. This is love. This is your love of language. It's 100% in the seeing, feeling and understanding of the language. And in that there is love. The brain chops up love like chopped liver, and thereby chops itself into liver. Look around you mate. The evidence is everywhere.

  • @johnnyroyal6404
    @johnnyroyal6404 1 month ago

    fourier is french, it doesnt matter anyway but you pronounced it wrong, great video very enjoyable

  • @nsf001-3
    @nsf001-3 2 months ago

    How do we even know the brian understands speech in the first place?

  • @cykkm
    @cykkm 2 months ago

    Brian and I have been friends for 20+ years, and I can assure you that he's able to understand speech. Unless we get mortally pissed, indeed.

  • @rchas1023
    @rchas1023 2 months ago

    With all respect, this seems very simplistic. Generally, a mammal has two ears, and the sound streams should be merged, and merged with the image from visual inputs. From this, 'voices' need to be recognised ( I extend the concept of 'voice' to include the natural sounds of the environment, such as a running engine, birdsong, and so on ). Each 'voice' must be analysed for speech content, tagged with direction, location and possibly owner. The language of the speech content needs to be identified, if possible. Only then can a start be made on understanding speech.

  • @dillanelliot8313
    @dillanelliot8313 1 month ago

    "Promo SM" 🙄

  • @bun197
    @bun197 2 months ago

    it wasn’t “ethical” to do that to animals either. and they will have almost certainly carried out these experiments on people, its just not public (yet).

  • @evolagenda
    @evolagenda 2 months ago

    I'm interested in this in terms of learning a new language. With your example of the cochlea and how responsive it is to new sounds, or separating new sounds, the way the system of thresholds and synapses fires and cascades from one synapse to another sounds like it would be absolutely essential to mimicking speech and trying to generate an authentic accent. In terms of comprehension, are we saying that when learning a new language the hardware we have in our heads needs to be trained to distinguish and better separate these sounds at the ear before signals ever reach the brain? How would these new thresholds form and what sort of impact would this have on language comprehension?

  • @cmtwei9605
    @cmtwei9605 1 month ago

    Synapse used to be always pronounced as ˈsaɪ.næps (from Cambridge dictionary) when I was at medical school. Has this pronunciation shifted somewhat recently in Britain? 😮 People with motor dysphasia (impaired speaking) after strokes sometimes can still sing with words when there is music.