Debate: Do Language Models Need Sensory Grounding for Meaning and Understanding?

Science & Technology

A debate on the need for sensory grounding in language models: wp.nyu.edu/consciousness/do-l...
Linked with a workshop on the philosophy of deep learning: wp.nyu.edu/consciousness/the-...

Comments: 26

  • @labsanta
    @labsanta 1 year ago

    My notes:

    Debate on whether language models need sensory grounding for meaning and understanding. Three yeses and three nos presented.
    * Two issues discussed: whether certain kinds of reasoning require sensory grounding, and whether abstract content needs sensory and motor primitives.
    * Language models need sensory grounding for meaning and understanding.
    * Animals and humans can learn new tasks quickly using background knowledge.
    * Large language models have billions to trillions of parameters but still do not reach human-level intelligence.

    Auto-regressive language models lack knowledge of the underlying reality and have limitations, such as making factual errors. Future AI systems should aim for a world model that allows for hierarchical planning in abstract representations.
    * As models get bigger and are trained on more data, they exhibit emergent properties.
    * AI needs a world model that allows for hierarchical planning.

    The speaker advocates for an architecture called JEPA (Joint Embedding Predictive Architecture) to handle uncertainty in prediction (see the sketch after these notes).
    * Current popular generative models don't work for video.

    Adding sensory information doesn't improve language models' performance; language models' representations reflect what they learn through language, not perception.
    * Language models can learn associations and inherit grounded meaning.
    * Adding sensory grounding doesn't improve models.
    * Language model representations are similar to real-world representations.
    * Language models using only language perform similarly to models with added sensory information.
    * Language models perform well in analogical reasoning tasks with color.
    * Models have a conceptual space similar to what they would have had if they had perceived color directly.

    Two models, one for images and one for language, can be connected via a linear projection. Grounding is not necessary for meaning, but it is for understanding words.
    * Analogical reasoning is not the same as perception.
    * Models take an image and learn a vector representation.
    * A language model predicts the next word.
    * A linear projection connects the image and language models.
    * Ungrounded models learn about word usage and relations.

    The concept of word meaning and its relationship to mental representations and flexible behaviors; the importance of a word's ability to stretch its meaning and of AI's ability to model human-like word understanding.
    * A word is a mental representation of a category in the world that supports flexible behaviors.
    * Language models need grounding and visual encoders to describe scenes accurately.
    * Human concepts of words are richer and more abstract than AI systems' abilities.
    * Word representation should support responding to instructions, choosing words based on internal desires, and changing beliefs about the world.
    * AI's modeling of human-like word understanding should provide the basis for flexible physical, verbal, and behavioral actions.
    * Text-based models lack connection to the outside world and do not support certain abilities.
    * Word representations should support internal desires and changing beliefs based on input.
    * Developing flexible word representations is a worthwhile enterprise for machines to understand us better.

    The debate over whether sensory grounding is required for thinking, with comparisons to language models and the history of philosophy.
    * Thomas Aquinas's sensism vs. Avicenna's argument, from centuries ago, that thinking does not require sensory perception.
    * Language models potentially lack sensory and environmental grounding, making their understanding "potentially meaningless."
    * AI systems without sensory input can still think and understand, but likely have a limited understanding of sensory or embodied knowledge.
    * Multimodal models process information similarly to human senses and could even have quasi-embodied outputs.

    Debate on the capabilities of language and multimodal models in understanding and reasoning; non-linguistic processes may not be as easy as assumed.
    * Language vs. multimodal models in understanding and reasoning.
    * Capacity for bodily action and low-level perceptual understanding.
    * Comparison of performance on text-only tasks and on tests of cognition and reasoning.
    * Unexpected performance of language models on non-linguistic tasks.
    * Conceptual engineering is required to answer whether language models understand.

    Different representational formats exist, including linguistic, imagistic, and distributed; the linguistic format has limitations, imagistic reasoning and skillful know-how matter, and language can also be seen as a tool for social communication.
    * Cognition is essentially linguistic, but some thoughts do not rely on perception.
    * Rudolf Carnap developed a logical construction of the world that translated perceptual experience into logical sentences.
    * Different types of representations exist: linguistic, imagistic, and distributed.
    * The linguistic format is limited; other forms of representation, such as imagistic reasoning and skillful know-how, are essential.
    * Language can also be viewed as a tool for social communication.

    Language models depend on background knowledge, which is not easily captured through language alone. Sensory grounding is complex and involves more than linguistic understanding alone. There are many types of words in language, including abstract words, and language models can rate abstract words even though such words would seem to require sensory grounding.

    Grounded and ungrounded language models and the understanding of perceptual word meanings.
    * Language models lack extended understanding of concepts.
    * Human semantic judgments are better for grounded concepts.
    * Blind people have a sophisticated understanding of perceptual word meanings.
    * Metaphorical meanings can be understood without visual perception.

    Discussion on large language models and meaning.
    * Wrapping food in clean water allows for transparency.
    * Large language models use a fixed amount of computation per token, unlike human reasoning.
    * Debate on whether grounding is necessary for meaning.
    * Clarification sought on reasoning abilities and grounding.

    Discussion on language models and understanding, and the challenges of multimodal learning.
    * Words as pointers to concepts, used for reasoning and tasks.
    * Challenges of multimodal learning and linking modalities.
    * Understanding requires broad reasoning; knowledge is patchy.
    * Humans also fail to understand; language models have a singular focus.
    * Language models don't have sensory grounding, but the vector is in the right spot.

    Discussion of language models' limitations in understanding abstract concepts and achieving global consistency through planning.
    * Various ways of handling edge cases and abstract concepts.
    * Changing the distribution or points to influence the response.
    * Separating the world model, the goal/task, and text generation.
    * An architecture where the world model is independent makes for a more general system.
    * Auto-regressive language models can exhibit planning.

    Discussion on language model planning and sensory grounding.
    * Language models plan and correct as they unfold.
    * Sensory grounding is indirectly present in language models.
    * Direct interaction with the environment may improve language models.
    * Language models need to establish the effects of their actions through observation or experimentation.

    Discussion on the role of sensory data in reasoning, language models, and morality.
    * Sensory data is important for learning about the world but may not always lead to causal models for action.
    * Language models can function without sensory data but may lack certain types of information.
    * Morality may not necessarily require sensory grounding, but some disagree.
    * Higher-bandwidth sensory data does not always lead to better cognition.
    * Language can provide streamlined, useful relational data for certain types of cognition.

    AI language models and their potential for acquiring knowledge and understanding; whether multimodal models can attain meaning that text-only models cannot.
    * Language is an entrée to abstract reasoning.
    * AI's recent success is coming from language models.
    * The debate is about meaning and understanding.
    * Multimodal data could in principle be replaced with linguistic data.
    * The debate is about text versus non-text data.

    Discussion on efficiency and reinforcement learning in language models, and their impact on knowledge and truthfulness.
    * Efficiency in language models doesn't directly impact their ability to understand and possess knowledge.
    * Language models are fine-tuned with reinforcement learning from human feedback to meet normative goals.
    * Reinforcement learning may impact the knowledge and truthfulness of language models in specific domains.
    * Correcting answers for common questions may help improve performance, but there will always be a long tail of difficult questions with no direct feedback.
    * Redesigning language models to optimize different objectives may be necessary to mitigate hallucination issues.
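The JEPA bullet above is easy to misread without a concrete picture. Here is a minimal, illustrative sketch of the joint-embedding predictive idea (predict the target's representation rather than the raw target); all module shapes and names are placeholders for this sketch, not details taken from the talk.

```python
# Minimal sketch of a JEPA-style (joint-embedding predictive) training step.
# Encoders and sizes are illustrative placeholders.
import torch
import torch.nn as nn

D = 256  # embedding dimension (arbitrary for this sketch)

context_encoder = nn.Sequential(nn.Linear(784, D), nn.ReLU(), nn.Linear(D, D))
target_encoder  = nn.Sequential(nn.Linear(784, D), nn.ReLU(), nn.Linear(D, D))
predictor       = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

def jepa_step(x_context, x_target):
    """Predict the *representation* of the target from the context.

    The loss lives in embedding space, so the model never has to predict
    every low-level detail of the input -- which is how this kind of
    architecture handles uncertainty in prediction.
    """
    s_ctx = context_encoder(x_context)
    with torch.no_grad():                 # target encoder is not trained by this loss
        s_tgt = target_encoder(x_target)  # (often an EMA copy of the context encoder)
    s_pred = predictor(s_ctx)
    return nn.functional.mse_loss(s_pred, s_tgt)

# usage: two views/crops of the same observation, flattened to 784 dims
loss = jepa_step(torch.randn(8, 784), torch.randn(8, 784))
loss.backward()
```

Computing the loss in latent space, rather than pixel space, is the design choice behind the "handles uncertainty" and "current generative models don't work for video" points in the notes.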

  • @pokerandphilosophy8328

    @pokerandphilosophy8328

    1 year ago

    This is a very neat summary. It captures the salient points and arguments better than the GPT-4 caption-based summary that I posted more recently. GPT-4 distills the dialectical structure of the debate better but glosses over many points. (It must be thinking, "meh... I already know all of that.")

  • @1hf325bsa
    @1hf325bsa 1 year ago

    Thank you so much for making this debate public. I learned a lot! I really struggled with the audio quality at times; I hope there is room to improve that in the future.

  • @TomHutchinson5
    @TomHutchinson5 1 year ago

    What a wonderful discussion! Thank you

  • @pokerandphilosophy8328
    @pokerandphilosophy8328 1 year ago

    I've downloaded the caption file of this video and asked GPT-4 to summarize it. The caption file is barely intelligible, with no sentence structure, no punctuation, and no indication of who is talking when speakers overlap and interrupt each other. I also had to cut it into eleven arbitrary segments that GPT-4 had to summarize without the benefit of knowing most of the previous segments or its own summaries of them. GPT-4 nevertheless saw through all of that and produced the following distillation:

    In this video conference titled "Debate: Do Language Models Need Sensory Grounding for Meaning and Understanding?" there are three speakers in favor (Yann LeCun, Brenden Lake, Jacob Browning) and three against (Ellie Pavlick, David Chalmers, Gary Lupyan). The debate is centered around two main issues: whether sensory and motor connections are needed for certain types of reasoning behavior, and whether there is abstract content that doesn't require sensory and motor primitives.

    Yann LeCun, the first speaker, argues that language models, such as GPT-4, do need sensory grounding for meaning and understanding. He acknowledges the limitations of supervised learning and the recent success of self-supervised learning, which works well for text but not yet for other sensory information like video. He argues that current large language models, while impressive, still lack a true understanding of the world and make really stupid mistakes. This is because they do not have any knowledge of the underlying reality that humans share and assume as part of intelligence. LeCun further emphasizes that auto-regressive large language models have limited reasoning abilities and are prone to errors, especially when producing long answers. He believes that this problem is not fixable and that the current approach of these models cannot lead to human-level intelligence. (continuation in next reply)

  • @pokerandphilosophy8328

    @pokerandphilosophy8328

    1 year ago

    2. In this segment the speaker argues that the debate on whether grounding is necessary for meaning in AI models should be based on empirical evidence rather than abstract principles. They emphasize that although current large language models are imperfect and make mistakes, these shortcomings do not necessarily mean that grounding is essential for meaning. The speaker presents two conclusions supported by current empirical data:
    1. Adding sensory grounding to existing models does not lead to significant improvements: while it is true that language models are imperfect, the available evidence suggests that adding sensory grounding to these models has not yielded substantial improvements in their performance.
    2. Existing models demonstrate an impressive understanding of meaning without grounding: large language models seem to have a considerable degree of understanding and meaning even without grounding. This understanding is often derived from the associations between concepts and words that the models learn during training.
    In conclusion, the speaker argues that the necessity of grounding for meaning in AI models should be assessed based on empirical data, and the current evidence does not strongly support the need for grounding.

    3. In this segment, the speaker discusses the difference between text-only models and models with sensory grounding. They explain that there seems to be no significant benefit from adding sensory data to language models. Through various tests and comparisons, they found that models with sensory grounding did not perform significantly better than text-only models. The speaker then talks about the structure of the representations that models learn, arguing that it resembles the meaning that would be gained through sensory grounding. They use color as an example domain, demonstrating that language models can perform well in tasks related to color, even without direct sensory grounding. The speaker also acknowledges potential criticism that reasoning about RGB codes might not be the same as perceptual reasoning, as RGB codes are already a symbolic representation. To address this, they experimented with using raw image encodings, such as pixels, as input to the models instead of RGB codes. They found that a linear projection from the space of the image encoder into the input space of the language model worked well, and there was no need for full multimodal training. Overall, the speaker's main point is that there is not strong evidence that sensory grounding makes a significant difference in language models, and the structure of the representations they learn resembles the meaning that would be gained through sensory grounding.
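For readers trying to picture the "linear projection" setup described in segment 3, here is a minimal sketch under assumed dimensions; the encoder stand-in, sizes, and function names are illustrative, not the speaker's actual models.

```python
# Minimal sketch: a frozen image encoder's output is mapped by one linear layer
# into the input (token-embedding) space of a frozen language model.
import torch
import torch.nn as nn

d_image, d_lm = 512, 768                   # assumed encoder / LM embedding sizes

image_encoder = nn.Linear(2048, d_image)   # stand-in for a pretrained vision model
projection    = nn.Linear(d_image, d_lm)   # the only piece that gets trained

def image_as_token(image_features: torch.Tensor) -> torch.Tensor:
    """Turn precomputed image features into a vector the language model can
    consume as if it were a word embedding; no full multimodal training."""
    with torch.no_grad():                  # vision model stays frozen
        v = image_encoder(image_features)
    return projection(v)                   # shape: (batch, d_lm)

# usage: the projected vector is fed to the LM alongside its normal token embeddings
soft_token = image_as_token(torch.randn(1, 2048))
```

The point of the experiment, as summarized above, is that only this single linear map needs to be learned; the image encoder and the language model themselves stay fixed.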

  • @pokerandphilosophy8328

    @pokerandphilosophy8328

    1 year ago

    4. In this segment, the speaker discusses the importance of grounding for language models to understand words as people do. He refers to a paper called "Word Meaning in Minds and Machines," which explores what people know when they know the meaning of a word. To illustrate the richness and flexibility of human concepts, he provides examples of umbrella usage, highlighting that just adding a vision encoder or fine-tuning it doesn't necessarily mean the job is done. Furthermore, he presents an ad hoc umbrella test, where animals in nature construct makeshift umbrellas. Although the image captioning system struggles to recognize these scenarios, it shows the importance of grounding in language models for understanding words like humans. The speaker then presents four desiderata, which he believes are necessary for models of human-like word understanding:
    1. Word representation should support describing scenes and understanding their descriptions.
    2. Word representation should support responding to instructions and requests appropriately.
    3. Word representation should support choosing words on the basis of internal desires, goals, and plans.
    4. Word representation should support changing one's belief about the world based on linguistic input.
    These desiderata highlight the significance of grounding in enabling language models to understand words as people do, emphasizing the importance of sensory grounding for meaning and understanding.

    5. In this section of the debate, the speaker discusses the long-standing philosophical debate on whether sensory grounding is necessary for meaning and understanding. This debate dates back to figures like Thomas Aquinas and Avicenna, with some arguing that sensing and perception are foundational to cognition, while others argue that thinking can occur without sensing or sensory capacities. Bringing this debate into the context of AI and language models, the speaker outlines the arguments for and against the necessity of sensory grounding in AI systems. Proponents of sensory grounding argue that AI systems need to have connections to the external world, bodily grounding in action, or grounding in internal states and goals to truly understand language and have meaningful representations. Critics, however, argue that AI systems may be able to develop meaningful understanding through exposure to linguistic input alone, without sensory grounding. The speaker concludes that the question of whether AI systems require sensory grounding to truly understand language remains open, and as AI research advances, this debate will continue to be a central topic of discussion and investigation in the field.

  • @pokerandphilosophy8328

    @pokerandphilosophy8328

    1 year ago

    6. In this segment of the debate, the speaker highlights the historical background of AI research and the assumptions made in the field. They mention MIT's Summer Vision Project from 1966, which aimed to teach a computer to see, and how it failed to recognize that not all thoughts are linguistic. The speaker then outlines three different kinds of representational formats: linguistic, imagistic, and distributed. The limitations of each format are discussed, and it is noted that the linguistic format was overestimated, leading to blind spots in understanding animal and infant cognition. The speaker also discusses the pragmatist approach, which considers cognition as grounded in action and prediction, and sees language as a method for coordinating behavior. This approach suggests that language-only systems will possess only a shallow understanding, as language is meant for capturing high-level abstract information. In conclusion, the speaker supports the engineering model, which involves solving everyday problems requiring intuitive physics, biology, psychology, and social knowledge. They argue that this model is a more plausible approach to understanding and a better path to more capable AI.

    7. In this segment of the video, the speaker argues that although language models can perform linguistic tasks like writing poetry, they struggle with simple engineering problems. This suggests that linguistic understanding is an important but limited part of their broader multimodal intuitive understanding of the world. A survey was posted on social media, and respondents were mostly from cognitive science, linguistics, and computer science backgrounds. The results show a split in opinion on whether large language models can derive true word meanings only from observing how words are used in language. The speaker highlights the distinction between concrete and abstract words, stating that abstract words play a central role in language. The question of sensory grounding is raised, and examples are given from child language and the knowledge language models can obtain from text alone. The speaker also discusses the notion of perceptual word meanings and whether a person born blind can understand words like "glistening" or "transparent." Evidence is presented that shows correlations between the semantic judgments of congenitally blind people and sighted people. This suggests that even without direct sensory experience, people can gain understanding of certain concepts through language alone.

    8. In this segment, the panel discusses whether large language models can predict and plan multiple tokens ahead. One panelist explains that the amount of computation for producing each token is limited, and there is no way for the system to think longer or shorter. This limitation is compared to human reasoning, which can think longer and have more computational power. Another panelist argues that large language models might not always use the same amount of processing for every token, and sometimes they commit to a token earlier or later depending on the difficulty of the prediction. They also question the connection between the need for sensory grounding and weaknesses in neural networks, as some arguments seem to focus on criticizing neural networks rather than addressing the issue of grounding. The panelists discuss the importance of reasoning abilities in understanding language and how concepts are used for multiple tasks. They touch on the patchy nature of current technologies and the need for new examples of multimodal learning to link these systems together more effectively. Finally, they debate the standards for understanding in philosophy, suggesting that broader reasoning capacities are necessary for true understanding rather than just successful use of language.

  • @pokerandphilosophy8328

    @pokerandphilosophy8328

    1 year ago

    9. In this segment, the panelists discuss how human knowledge is patchy and context-dependent, comparing this to the behavior of large language models. They argue that language models do have goals, primarily to predict the next word. The panelists mention that language models may have a good understanding of meaning, but their lack of sensory grounding is a limitation. They also consider the challenge of handling abstract concepts and edge cases. The discussion then turns to the importance of goals and objectives in human cognition and the limitations of current language models in accomplishing multiple tasks or goals. The panelists also touch on the richness of learning from sensory grounding and the limitations of language as an approximate representation of reality. The audience raises questions about different forms of grounding and the role of planning in understanding. The panelists acknowledge that human planning is not always perfect, and global consistency may not always be achieved in language models.

    10. In this segment of the debate, participants continue to discuss sensory grounding and its role in language models. One speaker points out that the debate isn't about whether language models have sensory grounding but rather whether they need it for meaning. Another speaker emphasizes that language models, like GPT-3, have no direct contact with the external world beyond text, which is different from how humans learn and interact. A participant highlights the potential importance of other sensory modalities, such as smell, and wonders how a lack of sensory grounding in those areas could affect language models. Another participant raises the question of whether meaning necessarily requires agency, as some speakers have been discussing planning and interacting with the world. They point out that there is a hierarchy in learning, with different levels of interaction and agency, and note that current language models may lack important structural aspects of human cognition, such as long-term prediction. In response to a question about the accumulation of errors in language models, a speaker clarifies that their assumption is not that there is a single correct answer for each token prediction, but rather that the errors are independent.

    11. In the final segment of the "Debate: Do Language Models Need Sensory Grounding for Meaning and Understanding?" video, the panelists discuss topics such as the role of sensory grounding in morality, how sensing can boost or impair thinking, and whether multimodal models can attain meanings that text-only models cannot. Some panelists argue that language is an entry into more abstract reasoning and that models exposed to only language may receive more streamlined, pure relational data. Others point out that language can distill and clarify certain types of knowledge, leading to multiple paths for acquiring genuine knowledge. One panelist distinguishes between knowledge and reasoning on the one hand and meaning and understanding on the other, stating that a system without sensory grounding might not be good at scientific reasoning but could still understand certain concepts and have meaning. A question is raised about whether there are aspects of meaning that a multimodal model can attain that a text-only model cannot, considering that in principle one could replace image data with linguistic data that describes the image data in perfect fidelity. The panelists conclude that efficiency is a key factor, but they also note that the architecture of the model plays a role in understanding the context of images. The panelists also discuss the impact of reinforcement learning from human feedback on the models' ability to learn a more involved objective than simple next-word prediction. While this approach may help mitigate some issues, one panelist believes that it will not completely fix hallucination issues or misinformation, and that a different approach to model design and optimization objectives may be needed. The debate concludes with the panelists thanking each other and the audience for their participation.

  • @treflatface
    @treflatface 1 year ago

    Thank you for making this public!

  • @i_forget
    @i_forget 1 year ago

    Why was there so much visual content used in these debates? If we as humans didn’t _need_ an understanding of the world through visual communication, then these “visuals” are not necessary. Same with ears … Senses give us some kind of understanding that just an organization of representations for “real” or “un-real” “things” couldn’t provide. More sensors more measurements baby!

  • @stefl14
    @stefl14 1 year ago

    The opposition is mired in a bygone era where it's 1921 and the Tractatus just dropped. While it's true that LLMs can form meaningful representations of concepts such as colors or distances between cities like London and Turin, this is hardly the resounding victory they imagine. Language, both natural and formal, indeed places limits on the potential universes of reference, but this is far from the full story. Consider, for example, an SVG script accompanied by annotations such as "place a yellow triangle atop a larger blue square," followed by corresponding formal code specifying coordinates and RGB values (a toy version is sketched after this comment). The natural language descriptions become grounded in a formally defined semantics, providing a kind of sensory grounding via the medium of code rather than, say, vision. In either case, coarse-grained and mostly abstract natural language acquires meaningful semantics. It is undeniable that LLMs improve with such grounding, as evidenced by their enhanced performance on natural language tasks when trained on code.

    The opposition might argue that grounding in code is ultimately language-based and not a separate sense, but this misses the point. For many concepts, code is an inefficient means of expression compared to vision or other modalities, depending on the context. As a result, only a small fraction of our human sensory experience can or will be formalized in this manner. Attempting to express the concept of a "dog" via a runout of Schrödinger's equation may be technically accurate, but it is certainly not practical or effective. Natural language evolved to facilitate high-bandwidth communication between humans, often leaving out crucial semantics due to the assumption of shared cognitive frameworks and experiences. Achieving complete grounding through mappings between natural and formal languages is a futile endeavor, requiring a prohibitive volume of data that doesn't even exist.

    Some additional points:
    - The idea that blind individuals possess comparable semantics to sighted individuals for concepts they cannot see is superficially compelling, but ultimately flawed. A blind person's conception of "glistening" is not ungrounded; they can experience smooth, glistening surfaces through touch and can discern positive affect in voices reacting to glistening objects just like anyone else. These tactile and auditory associations are, in fact, more salient to blind individuals. Hardly ungrounded.
    - The assertion that LLMs do not currently benefit significantly from multimodality is incorrect when we consider code a separate modality. Even if we strictly define modalities in terms of human senses like vision and audition, the claim will not age well. Multimodal benefits are already appearing in LLMs and have been evident in other ML subfields for a while (e.g., vision + audio). Present limitations likely stem from constraints in training data and architecture.

    Language alone won't deliver the understanding we seek. The very structure of language, including formal aspects, offers insufficient constraints on the nature of the universe. While language does provide some limitations, it is unclear what purpose such an argument serves if our goal is to develop better systems with deeper understanding. Philosophers continue to overemphasize language's importance, perhaps due to its role as our primary means of communication with fellow humans, leading to anthropomorphisation. The failure of the logicist program and the general problem of GOFAI should have taught us valuable lessons about language. It is high time philosophy internalized the wisdom of those, like Wittgenstein himself, who saw beyond the linguistic mirage.
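A toy rendering of the SVG example from the comment above ("place a yellow triangle atop a larger blue square"); the coordinates and colors are arbitrary choices made for illustration.

```python
# Pair a natural-language annotation with formal SVG code whose coordinates
# and RGB values pin down its semantics.
annotation = "place a yellow triangle atop a larger blue square"

svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
  <rect x="50" y="80" width="100" height="100" fill="rgb(0,0,255)"/>
  <polygon points="100,30 70,80 130,80" fill="rgb(255,255,0)"/>
</svg>"""

# The annotation rides along as a comment; open scene.svg in a browser to check
# that the formal code really does depict what the sentence says.
with open("scene.svg", "w") as f:
    f.write(f"<!-- {annotation} -->\n{svg}")
```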

  • @JimmyFingers
    @JimmyFingers 1 year ago

    The motivation of mortality and the incentive to avoid uncomfortable situations (hunger, predation, etc.) drive the advancement in learning ability. Being alive creates a context: both fear and love raise the stakes. We rely on an intelligence in fight or flight, but even we can't predict what course of action we will take in a sudden emergency.

  • @kevon217
    @kevon217 1 year ago

    love the “straight to the horse’s mouth” slide. bonus points for sure.

  • @daewoongnim
    @daewoongnim 1 year ago

    Question. Which language model?

  • @daewoongnim
    @daewoongnim 1 year ago

    Btw, Chalmers says no? Hey, which model? Complete or not?

  • @ItsSid66
    @ItsSid66 10 months ago

    Is P(correct) = (1-e)^n at 8:52 correct?

  • @louisdeclee7301

    @louisdeclee7301

    10 months ago

    Does anyone have a reference to the article proving this?
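Not a citation, but a minimal sketch of the reasoning behind that slide's formula, assuming (as one panelist notes in the summaries above) that per-token errors are independent:

```latex
% e: probability that a single generated token drifts off any acceptable
% continuation; n: number of tokens in the answer. Assuming independence,
P(\mathrm{correct}) = (1 - e)^{n} \approx e^{-ne} \qquad \text{for small } e
```

Under that assumption, the probability of a fully correct answer decays exponentially with its length; the contentious part is the independence assumption itself, which is exactly what gets questioned in the discussion.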

  • @countofst.germain6417
    @countofst.germain6417 1 year ago

    Very nice

  • @a5154
    @a5154 1 year ago

    This presentation is misleading in that it presents an older version of GPT. See 1:13:43, for example; here is what the current version of GPT answers: Moving a sofa onto the roof of your house without using a pulley, ladder, crane, or other typical lifting devices can be a challenge. However, there are a few unconventional methods that might work, depending on the resources you have available. Keep in mind that these methods may be less safe or less practical, so take necessary precautions and use your best judgment. Ramps and Inclines: Create a long, sturdy ramp using wooden planks or metal beams. The ramp should reach from the ground to the roof, with a gradual incline. You may need additional support structures to hold the ramp in place. With the help of a few friends or neighbors, push or pull the sofa up the ramp to the roof. Be cautious about the angle and stability of the ramp to avoid accidents. === So it seems that the model understands perfectly well, as much as a human does.

  • @daewoongnim
    @daewoongnim 1 year ago

    That’s the reason why I don’t like debate. (But I won it.)

  • @dr.mikeybee
    @dr.mikeybee 1 year ago

    Generative models need to be able to generate counterfactuals. Grounding needs to come from another part of the cognitive architecture.

  • @mooncop
    @mooncop 10 months ago

    420 likes! LFG Hawaiiiii ❤

  • @k14pc
    @k14pc 1 year ago

    yes side got rekt

  • @daewoongnim
    @daewoongnim 1 year ago

    Charles Darwin versus Terminator
