Conference on Computational Natural Language Learning (CoNLL) 2024
Date:
Keynote 1: Lorna Quandt
Integrating AI-driven sign language technologies in education: recognition, generation, and interaction. There are ~300 sign languages in the world, many understudied. Iconic signs resemble the concept they represent, like a sign that mimics a plant growing out of the ground, but there are still significant differences across languages.
ASL grammar does not match English grammar, and the same holds for other sign languages around the world; some don't mark tense in the same way, etc. Spatial relationships carry grammar, e.g., the movement of a sign can indicate the subject of a verb. Co-articulation occurs when multiple ideas are expressed together, such as the sign for 5 being merged with the sign for "years" to represent "5 years."
Some aspects of sign languages have no direct equivalents in spoken or written languages, such as spatial grammar and morphology that combines different signs into a single one; spatial orientation and facial expression also affect meaning. Spatial grammar can change depending on how you are standing relative to another person.
Challenges in sign language understanding for AI and NLP include the fact that spatial information depends on the viewpoint of the person signing, e.g., when giving directions. Facial cues are also difficult for AI/NLP understanding of signing: raised eyebrows may turn a statement into a question or convey the emotional content of a statement.
ASL Citizen is a dataset of sign language videos produced by Microsoft, with many examples of signs recorded in home settings. *** Question: are there any examples in the dataset of the issues you talked about, where facial expressions can impact the meaning of a sentence?
ASL learning and education can be enhanced with AI-powered sign language tools informed by the cognitive science of linguistics and perception. Gloves for recognizing ASL only provide one-sided communication and very basic sign language, and people just don't like wearing the gloves; they are not of interest to actual deaf and signing people. AI systems should be relevant to the cultural requirements and the linguistics of the language being translated.
Goals within AI and NLP sign language research are sign language recognition and sign language generation. Generation looks like creating an avatar with emotional expression so that difficult-to-sign concepts can be understood.
*** Question: What are the requirements of multimodality in sign language understanding?
Sign language understanding can work by taking a video recording of someone signing, extracting pose estimates of their facial features, shoulders, arms, and hands, and mapping these onto a dictionary of sign poses. One issue with this approach is that there is no massive dataset of ASL, let alone of other sign languages.
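To make that pipeline concrete, here is a toy sketch of the dictionary-lookup idea, assuming pose keypoints have already been extracted per frame by some pose estimator; the feature layout, normalization, and distance metric are my simplifications, not the presenter's system.

```python
# Toy sketch of dictionary-lookup sign recognition over pose sequences.
# Assumes keypoints (hands, arms, face) were already extracted per video frame;
# the shapes and the simple Euclidean matching are illustrative assumptions.
import numpy as np

def normalize(seq: np.ndarray) -> np.ndarray:
    """Center and scale a (frames, keypoints, 2) sequence to reduce camera effects."""
    centered = seq - seq.mean(axis=(0, 1), keepdims=True)
    scale = np.abs(centered).max() or 1.0
    return centered / scale

def resample(seq: np.ndarray, n_frames: int = 32) -> np.ndarray:
    """Linearly resample a sequence to a fixed number of frames."""
    idx = np.linspace(0, len(seq) - 1, n_frames)
    lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
    frac = (idx - lo)[:, None, None]
    return (1 - frac) * seq[lo] + frac * seq[hi]

def recognize(query: np.ndarray, dictionary: dict[str, np.ndarray]) -> str:
    """Return the gloss whose template pose sequence is closest to the query."""
    q = resample(normalize(query))
    scores = {
        gloss: np.linalg.norm(q - resample(normalize(template)))
        for gloss, template in dictionary.items()
    }
    return min(scores, key=scores.get)

# Usage with random stand-in data (real templates would come from a sign dictionary):
rng = np.random.default_rng(0)
dictionary = {"HELLO": rng.normal(size=(40, 54, 2)), "THANK-YOU": rng.normal(size=(35, 54, 2))}
query = dictionary["HELLO"] + 0.05 * rng.normal(size=(40, 54, 2))
print(recognize(query, dictionary))  # -> "HELLO"
```

Real systems would use learned sequence models rather than nearest-neighbor matching, but the data bottleneck noted above applies either way.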
Most avatar videos of signing are actually created using motion capture of a real person signing and then turning them into an avatar. Truly AI-generated signing is not as advanced and there is limited training data available; avatars tend to be robotic and lack good grammar and facial grammar.
Limited-domain datasets, like talking about the weather, can significantly decrease the difficulty of generating signing avatars. Also, the deployment context of the AI system can inform the design and simplify it in ways that make the end result better.
One major challenge in learning ASL is that it requires feedback from someone else who already knows ASL. Interaction with a teacher is highly relevant. ASL learning games with interactive learning sessions and fluent signing avatars can solve some of these issues.
Sign recognition and generation using VR can allow people to learn without requiring feedback from human teachers. These use real-life settings like coffee shops, where a teacher avatar teaches proper signing through instruction and feedback. The recognition model has relatively high accuracy, and the generated teacher can teach properly. The experiment had participants use their own style of signing, which produces harder-to-recognize signers.
*** Question: is right-handed (RH) ASL different from left-handed (LH) ASL?
The target audience is hearing people who do not sign ASL. Think-aloud narratives show that students associated better recognition with less negative sentiment, and difficulty increased negative sentiment. Participants who found the computer more complicated still had a more positive sentiment toward the game due to its naturalistic environment and novelty.
An Edpuzzle experiment compared the avatar-based method of education with a video- and quiz-based method of teaching. Pleasantness was much better with Edpuzzle. Both conditions had very high training performance. Measurement of both short-term and long-term outcomes is coming in a follow-up study.
Another issue currently being investigated is that deaf students struggle with STEM ASL terminology because it is so variable. At the presenter's university there are issues with learning STEM: different students may have learned significantly different signs for the detailed vocabulary in STEM classes. There could be 2 to 4 different signs in a single classroom for the same concept.
One way to address this is to use augmented reality to support different vocabularies within the same classroom. Science-of-learning research shows that signs describing concepts are better for learning than fingerspelling them. The system recognizes and generates key STEM terms. Imagine two students in a classroom who sign the same terms in different ways, both wearing AR goggles: a signing avatar and an English caption can augment the visual information when watching someone sign something you may not be familiar with. Deaf students don't have a shared vocabulary in the same way that hearing students do.
*** My question: Does the common grounding require AR, vs. just having a teacher who says we are going to use this term for a concept? Does the AR-based method require the teacher to use a single way of representing a concept?
*** Answer: Yes, but recent research in education suggests that some ways of signing are better for educational outcomes, meaning it can be better for students to have a single way of signing some concepts.
Oral Session 1: Psycholinguistic Session (chair: Libby Barak)
- Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content
Tyler Malloy, Maria José Ferreira, Fei Fang, Cleotilde Gonzalez
My presentation.
- SPAWNing Structural Priming Predictions from a Cognitively Motivated Parser
Grusha Prasad, Tal Linzen
Example: "The cat examined by the doctor was skittish." Psycholinguistics of sentence structure. Structural priming is the idea that the processing or production of target sentences is facilitated when they are preceded by prime sentences. If we have an ambiguous target, e.g., what is being skittish, is the interpretation determined by the priming sentences?
Many different hypotheses can be formed about which properties are shared between prime and target sentences, too many to analyze, so shared structure is instead used to limit the space of hypotheses under consideration. One way to do this is to ask how the shared structure facilitates human behavior.
Serial Parsing in ACT-R With Null elements (SPAWN). It uses ACT-R retrieval to determine the similarity between the current sentence and previously experienced sentences. Predictions about the current word being processed (for example, a transitive vs. intransitive phrase) are based on similarity to past experiences. When the parser reaches a state where the only available category is impossible, it goes back and looks for where an error could have occurred.
Priming results from the base-level activation of categories and inhibition from past failed parses. Reanalysis indexing chooses where to fix the error in the current parse; options for re-indexing include going back to the start, selecting an element at random, or using an information-theoretic measure to select one.
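For reference, the standard ACT-R base-level activation equation, which a parser built on ACT-R retrieval presumably uses in some form (the paper's exact formulation may differ):

```latex
B_i = \ln\!\left(\sum_{j=1}^{n} t_j^{-d}\right)
```

where t_j is the time since the j-th use of category i and d is the decay parameter; more frequent and more recent uses raise a category's activation, which is one route by which priming can fall out of retrieval.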
*** Question: Do we only do this backwards-looking process when there is an impossible categorization, or also when the categorization is very unlikely? How does this model differentiate between instances in experience and the ability to process garden-path sentences? Is the question of how to add experience related to modeling the few participants who produced reduced relative clauses?
Comparison with and without priors, where no priors means uniform activation across categories?
- Lossy Context Surprisal Predicts Task-Dependent Patterns in Relative Clause Processing
Kate McCurdy, Michael Hahn
Subject relative clauses vs. object relative clauses have different levels of processing difficulty. Surprisal theory says this is because of the frequency of these types of clauses in English, while the memory-based account says it is due to the demands of retrieving information when processing more complex sentences.
Previous analyses of eye tracking while reading suggest that the memory-based account is more accurate, but this can depend on how the behavioral data are collected: using a maze task instead of natural reading shows a reversal, favoring surprisal theory.
This work uses a lossy-context surprisal (LCS) model to account for both types of human behavioral results observed previously. LCS computes surprisal with respect to a noisy memory representation of the context rather than the full context. The resource-rational LCS uses a retention-rate hyperparameter that varies the memory resources available.
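Roughly, lossy-context surprisal replaces the true context with a noisy memory sample (my paraphrase of the general framework, not necessarily this paper's exact equation):

```latex
D(w_t \mid c) = \mathbb{E}_{r \sim q_\theta(r \mid c)}\left[-\log p(w_t \mid r)\right]
```

where c is the true preceding context, r is a lossy memory representation of it, and the retention rate (θ here) controls how much of c survives in r; ordinary surprisal is the limit of perfect retention.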
*** Question: Is the retention-rate hyperparameter kept static for an individual? It seems like it would vary based on experience; a physicist could retain many complex physics terms in their head vs. a biologist trying to do that.
Different phenomena require different retention rates; does that suggest reading-time data and the maze task correspond to different retention rates? Why would the maze task have a higher retention rate? The results suggest the model can produce both types of effects, but one case requires a low retention rate and the other requires a high retention rate.
- Multimodal Large Language Models “Foresee” Objects Based on Verb Information But Not Gender
Shuqi Wang, Xufeng Duan, Zhenguang Cai
Eye-tracking studies where participants hear sentences while viewing objects on a screen: as participants listen, attention focuses on the objects that fit the sentence at the current time, based on gender congruency and verb congruency. For example, hearing the word "wear" and looking at either a tie or a dress, then hearing the name "James" and looking more at the tie.
Prediction is two-stage: verb-based prediction first and gender-based prediction afterwards. This raises the question of whether machine attention in multimodal models reflects this verb and gender congruency of attentional focus.
Is there a decision being made in the task? For example, if the task were to click on the related object, how might this impact the attentional focus of the model and of the human performing the task?
Pretests showed that MLLMs can identify objects as being more related to a specific verb (e.g., "wearing" corresponding to clothes), and MLLMs can also associate specific clothes with genders. In terms of behavioral output, they can also predict the upcoming word using verb and gender information.
*** Question: Are you interested in, or have you studied, gender-neutral names and how they impact attention?
The attention weights of MLLMs on the visual information related to the sentences they are processing were compared to observations of human attention in an eye-tracking study. LLaVA could predictively attend to verb-relevant objects.
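A minimal sketch of the kind of region-level attention comparison described here; the region masks, aggregation, and toy data are my assumptions, not the paper's analysis code.

```python
# Compare how much of a model's image attention falls on each labeled region
# while it processes a word. Masks, shapes, and data are illustrative only.
import numpy as np

def region_attention(attn: np.ndarray, masks: dict[str, np.ndarray]) -> dict[str, float]:
    """Share of total attention mass on each labeled image region.

    attn  : (H, W) attention map over image patches for the current word.
    masks : region name -> boolean (H, W) mask (e.g. "tie", "dress").
    """
    total = attn.sum()
    return {name: float(attn[mask].sum() / total) for name, mask in masks.items()}

# Toy example: attention while processing the verb "wear" with a male subject.
rng = np.random.default_rng(1)
attn = rng.random((24, 24))
masks = {"tie": np.zeros((24, 24), dtype=bool), "dress": np.zeros((24, 24), dtype=bool)}
masks["tie"][2:8, 2:8] = True
masks["dress"][14:20, 14:20] = True

shares = region_attention(attn, masks)
# A verb-congruency effect would put more mass on clothing regions than elsewhere;
# a gender-congruency effect would additionally favor "tie" over "dress" after "James".
print(shares)
```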
*** Question: Could this result from alignment-based training, i.e., RLHF or other approaches that teach LLMs not to be biased based on gender? Is this version of LLaVA trained to be more aligned in this way?
*** Answer: Because the behavioral output did show gender differences, we might not expect RLHF to be what removes the attentional bias.
There is a distinction between the models' behavior, which successfully predicted gender information, and the attention analysis, which did not. Was the pre-experiment assessment of behavioral output correct?
Using real-world images meant that the models' attention results were closer to human attention than when using cartoon-like images.
Oral Session 2: Syntax and Structure Session (chair: Omri Abend)
- Is Structure Dependence Shaped for Efficient Communication?: A Case Study on Coordination
Kohei Kajikawa, Yusuke Kubota, Yohei Oseki
What kinds of pressures shaped universals in natural language? One example is efficiency: language can be optimized under a simplicity-informativeness tradeoff. How much can this hypothesis explain? It explains some universals, but we don't know whether it can explain more abstract universals like compositionality; this work focuses on the specific universal of structure dependence.
Domain specificity is relevant for structure dependence, vs. domain-general efficient communication. Grammatical operations are applied structurally rather than linearly. Coordinate structures are also thought of as an instance of structure dependence. This work investigates structure dependence in relation to communicative efficiency.
They made artificial languages with different types of reduction in coordination: no reduction, structure-based reduction, and linear reduction. They measure the communicative efficiency of these languages in terms of predictability and parsability, information-theoretic properties derived from word-by-word surprisal. These quantities are calculated with recurrent neural network grammars (RNNGs).
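One way to write the two quantities down, as a sketch of the general idea rather than the paper's exact definitions: with an RNNG defining a joint distribution p(x, y) over sentences x and trees y,

```latex
s(x_t) = -\log p(x_t \mid x_{<t}), \qquad
H(y \mid x) = -\sum_{y} p(y \mid x)\,\log p(y \mid x)
```

where predictability corresponds to low average word surprisal s, parsability to low uncertainty H about the tree given the string, and communicative efficiency to a balance of the two.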
*** Question: In the study you used 3 discrete types of languages, is it possible to think of this as more of a continuum where linear reduction is more or less applicable?
Structure-based reduction is best in terms of communicative efficiency as a balance of parsability and predictability; the other two options may be optimal for parsability alone or predictability alone, but not for a balance of the two. In some theories there is an optimization of efficient computation rather than communication, and communication is considered an epiphenomenon.
- NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication
Yuchen Lian, Tessa Verhoef, Arianna Bisazza
Dynamical systems analysis of language: the effects of group factors on language systems are difficult to understand because causality can't be inferred and there is no experimental control. Conclusions can be drawn about the systematicity and complexity of languages depending on the number of speakers in the community.
A neural language learning and communication framework: agents interact with each other using predefined artificial languages and emergent communication, playing a meaning reconstruction game via reinforcement learning.
Listening agents in the communication games attempt to map utterances onto meanings, and if they succeed both agents receive a reward. Role alternation means there is self-communication in the more recent version of the model; self-play allows the speaker-listener to have a meaningful internal dialogue. Agents then interact in communication either in pairs or in groups.
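A toy sketch of this meaning-reconstruction game, with tabular softmax policies and a REINFORCE-style update standing in for the neural agents of NeLLCom-X; everything in it is an illustrative simplification.

```python
# Toy speaker-listener meaning-reconstruction game with a shared reward.
import numpy as np

rng = np.random.default_rng(0)
MEANINGS, UTTERANCES = 4, 4                   # tiny discrete spaces
speaker = np.zeros((MEANINGS, UTTERANCES))    # logits: meaning -> utterance
listener = np.zeros((UTTERANCES, MEANINGS))   # logits: utterance -> meaning
LR = 0.5

def sample(logits: np.ndarray) -> tuple[int, np.ndarray]:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

for step in range(2000):
    m = rng.integers(MEANINGS)                # meaning to communicate
    u, p_s = sample(speaker[m])               # speaker produces an utterance
    m_hat, p_l = sample(listener[u])          # listener reconstructs a meaning
    reward = 1.0 if m_hat == m else 0.0       # shared reward on success

    # REINFORCE-style update (no baseline): push up the sampled actions on success.
    grad_s = -p_s
    grad_s[u] += 1.0
    grad_l = -p_l
    grad_l[m_hat] += 1.0
    speaker[m] += LR * reward * grad_s
    listener[u] += LR * reward * grad_l

# After training, the pair usually converges on a shared meaning<->utterance mapping.
success = np.mean([
    sample(listener[sample(speaker[m])[0]])[0] == m
    for m in range(MEANINGS) for _ in range(25)
])
print(f"reconstruction accuracy: {success:.2f}")
```

Role alternation and group play would then correspond to letting each agent act as both speaker and listener, and sampling interaction partners from a group instead of a fixed pair.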
*** Question: is there any internal dialogue during the speaker-listener version of the communication game?
They compared artificial languages with different requirements on word order for determining meaning. The question is what happens in pairwise interactions between agents that have learned two different such artificial languages when they interact with each other.
*** Question: How long are the sentences in this game?
When agents with different production preferences play this game, there is language drift during communication towards mutual understanding, with a push towards regularizing word order instead of marking. Changing the setup in the direction of preferring marking over word order results in the opposite movement of preferences.
Group communication experiment: groups of different sizes have differences in production preferences (question: is there any difference within a group in production preferences, and how could this relate to bilinguals?). Increasing group size means less variance (variance within groups or between groups?); larger groups tend towards lower entropy. Larger groups develop more reliance on word order rather than marker use (or was it the other way around?).
- Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution
Ian Porada, Jackie CK Cheung
Real-world pronominal ambiguity may be more difficult than constructed examples like the Winograd Schema Challenge (WSC). Models that are better than others at WSC may be worse at real-world ambiguity, which is counterintuitive since the point of the challenge set is that it is highly difficult and more challenging than real-world examples.
A fine-tuned 7B LLaMA is better than an un-fine-tuned 70B LLaMA on the constructed Winograd datasets, but worse than the larger model on real-world examples of pronominal coreference resolution.
They compared different prompts for LLMs to indicate which of the coreferents a pronoun refers to. The SoTA coreference model, which uses a very small number of parameters, is better than the LLMs at this task on real-world data.
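For concreteness, the kind of prompt being compared might look like the following; the wording and answer format are illustrative assumptions, not the authors' actual prompts.

```python
# Illustrative prompt template for LLM-based pronominal coreference resolution.
def build_prompt(text: str, pronoun: str, candidates: list[str]) -> str:
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(candidates))
    return (
        f"Text: {text}\n"
        f"In the text above, which of the following does the pronoun \"{pronoun}\" refer to?\n"
        f"{options}\n"
        "Answer with a single letter."
    )

print(build_prompt(
    "The trophy didn't fit in the suitcase because it was too big.",
    "it",
    ["the trophy", "the suitcase"],
))
```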
Keynote 2: Thamar Solorio
Large language model research for making humans better global social beings. We currently have high access to knowledge via the internet and other knowledge resources; however, this hasn't translated into an ability to interact in a global society, as shown by ongoing social conflicts and wars. Ideally we can build towards social intelligence and allow knowledge resources to improve social interactions.
Theory of mind, social skills, and competence allow for better social awareness. The needs of social intelligence guided the development of language, culture, and society. Example applications are facilitating communication, supporting neurodivergent people, and improving global social interactions. Differences in the ability to communicate using language as well as non-verbal cues motivate the investigation of support for neurodivergence. Conflicts can be caused by an inability to model another person's or group's mind, and thus a misunderstanding of their goals, beliefs, and intentions.
The steps of these research projects are to first evaluate SOTA models on their ability to facilitate social intelligence, and, assuming they are not perfect, to develop models better able to address the issues that arise from social intelligence requirements. Conversation management strategies can be used to redirect conversations when one person is not acting with high social intelligence. Social IQa research tests the ability of LLMs and MLLMs to reason about human emotion and social interactions.
Theory-of-mind evaluation tests typically involve things like the Sally-Anne task, with additional complexity in the order of false beliefs. These are difficult challenges for LLMs, as they have a hard time keeping track of inconsistent beliefs, with characters coming in and out of the environment and differences in who can see which changes in the environment. Third- and higher-order theory of mind deals with people's beliefs about other agents' beliefs about states of the environment, where some people may be deceiving each other. Uncertainty is also an aspect of ToM: we need to be able to understand the partial beliefs or uncertainty of other people.
The presenter's work deals with emotion recognition, conversation management, and theory of mind in a multiple-choice question-answering setting. Model performance on these theory-of-mind and related problems is very low, and models often make up information. Detecting single emotional states and answering simple theory-of-mind questions are easy for LLMs and MLLMs, but additional levels of complexity make it much more difficult for the models, e.g., multiple different emotions or emotional progression within a story.
What are the goals of AI systems? Often people think the goal is to replicate human behavior to a very high degree. Other approaches to building AI aim at surpassing human performance in some areas, like translation, while being worse at others, like theory of mind. The goal of social-intelligence-inspired artificial intelligence is to support interaction between humans rather than focusing on performance on particular tasks.
*** Question: Most of the tasks in the talk focused on answering social intelligence questions about stories and similar material. How does this relate to the idea of supporting socially intelligent interactions between humans? I can see how social intelligence support systems would need to reason about social interactions, but how would that then be applied to making humans better at interacting with each other?
*** Answer: Examples include work settings where people are communicating and there is miscommunication or misalignment in goals, or in the understanding of others' goals. This can be problematic for decision-making, and there could be devices that support people in high-stakes negotiations, where the model needs to be highly accurate.
*** Question: I am really interested in the high-stakes scenario and in deception; it may be that humans try to deceive the model and the other humans, potentially in different ways. How could models deceive humans, or humans try to deceive models?
Oral Session 3
- Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances
Alina Wróblewska
Filtering speech-specific noise; wondering what the most challenging speech-specific phenomenon is for LLMs to filter. The most challenging phenomenon is restarts, because depending on when a restart happens it can carry important meaning for what is being said. Reparanda are also challenging to filter, while discourse tokens are rare and thus easier for LLMs to filter.
If LLMs can't filter speech-specific phenomena, they may have only a superficial or defective understanding of language compared to humans, who perform this task easily. LLMs are currently being applied to spoken dialogues with humans, and handling these types of disfluencies is important for human communication. Humans often produce spontaneous speech, and it can carry meaning for the conversation between the human and the LLM.
- The Effect of Word Predictability on Reading Times in Information Seeking and Repeated Reading
Keren Gruteke Klein, Yoav Meiri, Omer Shubi, Yevgeni Berzak
The probability of words is compared to their processing difficulty as measured by reading time. We want to find a function that predicts reading difficulty from probability. Processing difficulty is logarithmic in probability, i.e., there is an increasing linear relationship between surprisal and difficulty.
Does this effect also extend to non-ordinary reading regimes, such as information seeking, compared to ordinary reading? There are major differences in eye movements between information seeking and normal reading. Another factor is whether the reader has already read the text: repeated reading shows more skips and fewer fixations on words.
This raises the question of how surprisal relates to reading time in different reading regimes. A regression model is used to predict reading time from surprisal, frequency, and length. Additional linear and non-linear models include smoothing over the impact of surprisal on reading, to compare which model best fits the data.
The delta log-likelihood compares the log-likelihood of the model with baseline predictors plus surprisal against the baseline without surprisal; this is the metric of predictive power, which can be related to the surprisal measure used. LLMs with higher log complexity result in better predictive ability in the ordinary reading regime, but in the non-ordinary regimes the relationship is not the normal one for rereading and information seeking.
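Written out, the predictive-power metric is roughly the following (my reconstruction from the description, not the paper's notation):

```latex
\Delta LL = \frac{1}{N}\sum_{i=1}^{N}\left[\log p_{\text{base+surprisal}}(RT_i) - \log p_{\text{base}}(RT_i)\right]
```

where the baseline regression predicts reading time RT_i from word frequency and length, the full model adds surprisal, and a larger ΔLL means surprisal contributes more predictive power.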
The linear effect of surprisal extends to non-ordinary regimes, and this is consistent across different reading measures and language models. But we could change the prompting of the LLM, having it perform a similar question-answering task: reading the question before the paragraph and then the question again. The hypothesis is that prompting with this context will lead to better fits when predicting human reading time.
Results show that this hypothesis is not always true. In information seeking on a first reading, there is no significant improvement. Adding instructions telling the model that the question will be answered in the following paragraph still gives no significant benefit. However, in the repeated reading paradigm, the model's predictive power goes to zero. This demonstrates issues with LLM surprisal measures, especially in non-ordinary reading paradigms.
- Multi-Cultural Norm Base: Frame-based Norm Discovery in Multi-Cultural Settings
Viet Thanh Pham, SHILIN QU, Farhad Moghimifar, Suraj Sharma, Yuan-Fang Li, Weiqing Wang, Reza Haf
An example of a sociocultural norm is the etiquette of eating with your right hand and avoiding the left in India. Multi-cultural norm bases can help humans better understand diverse cultures and enhance the cultural competency of LLMs. Norm discovery through human annotation can be problematic due to cost and issues with cultural specificity: an annotator from one culture may not be able to judge the social norms of another.
One task for dialogue-based norm discovery is to take a conversation between multiple people and ask a human or LLM what the relevant social norms in the conversation are. This work presents a dataset of conversations and other textual information that allows comparison of how well different large language models understand diverse cultural norms. Ideally this could also be used to create more culturally diverse and aware large language models.