Multi-University Research Initiative

26 minute read

This blog post contains my unedited notes from attending this year's Multi-University Research Initiative (MURI) Program Review. This was an interesting meeting where I presented my work on cognitive models applied to large language models to improve their usefulness in educational settings, specifically for phishing email education.

Multi-University Research Initiative Program Review Meeting, May 29-30

Opening remarks: Paul, Ben Rubenstein, and Somesh Jha.

What do we now know that we didn't know at the beginning of the MURI?

How do we make human-AI teams (HATs) resilient, rather than just focusing on the performance of the AI model alone? Human decision errors are often due to biases, and AI models should take those into account. A SAGAI conference workshop was organized for this project.

Thrust 1: Robust, analyst-friendly, task-driven explanations. Explanations support trust. What is the connection between HATs, explainability, and trust?

Thrust 2: Robustness in machine learning, ensuring confidentiality and other security desiderata. Examples from classifying malware.

Thrust 3: Human-machine shared mental models. Mathematical models of human cognition and of the AI; how do we combine the models?

Thrust 4: Robust HATs. Connecting to adversarial machine learning; sponge attacks force the human advisor to intervene more often and make the ML system useless.

Thrust 5: Cognitive-machine models in HATs.

Robust certificates for malware: adversarial ML attacks on malware classification try to evade detection by changing the classification. This is related to the idea of making imperceptible changes to visual data to fool visual categorization, with automated, at-scale adjustment of malware to fool the classifier. They also want to improve robustness for malware detection. This can be done by smoothing the classification, to ensure that there aren't small regions close to the input that would change the classification.
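
As a rough sketch of the smoothing idea (my own illustration, not the presenters' code), here is the image-style randomized-smoothing certificate, assuming a generic classifier `f` and Gaussian noise; the actual malware setting perturbs bytes rather than pixels, so the real scheme differs:

```python
import numpy as np
from scipy.stats import norm, binomtest

def smoothed_predict_and_certify(f, x, sigma=0.25, n=1000, alpha=0.001):
    """Monte-Carlo randomized smoothing: predict the majority class under Gaussian
    input noise and report a certified L2 radius around x (Cohen et al.-style).
    `f` maps a single input array to an integer class label."""
    votes = np.bincount([f(x + sigma * np.random.randn(*x.shape)) for _ in range(n)])
    top = int(votes.argmax())
    # Lower confidence bound on the probability of the top class under noise.
    p_lo = binomtest(int(votes[top]), n).proportion_ci(confidence_level=1 - alpha).low
    if p_lo <= 0.5:
        return None, 0.0                    # abstain: no certificate
    return top, sigma * norm.ppf(p_lo)      # certified L2 radius
```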

Question from Somesh: Do you still get robustness bounds on the performance? Answer: Yes, we can still prove a bound on the performance, and the bound can change based on what the attacker can do to the malware.

Comment: There is a gap between the threat model and reality, in that attackers can make more significant changes. One could give guarantees on the embedding of the malware and certify the distance to that.

Question from Matt: Could we apply de-noising, as is done against attacks on vision models? Answer: Maybe.

Some attacks do make larger changes to malware that are harder for the model to correctly classify, which is analogous to, say, rotating or cropping the input in the visual attack domain. But the attacks that can be certified as defended in this case are a non-trivial fraction of realistic attacks that might be seen in the real world.

Concept-based explanations for out-of-distribution (OOD) detectors. Detectors can choose not to classify things that are OOD, effectively saying the input looks weird. ML categorization for network flows is hard to use in production because it is not explainable: when the network flow looks weird, the classifier can't be helpful to the analyst. OOD attacks are also interesting for speech because of auditory deep fake attacks and social engineering attacks.

They want to go beyond classifying a network flow as OOD and give some explanation of why it is difficult to classify, in human-understandable concepts. Here this is applied to images. They haven't done a user study of these explanations. Related to concept activation vectors; this work extends those ideas to OOD detectors.

Examples of explanations for visual detectors are things like 'dolphin-like skin' for detecting dolphins. The energy-based detector is the most popular OOD baseline; they tried to connect it to the concepts the OOD detector was using. The OOD score comes from the logits of a model trained on images of animals.
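
For reference, the energy score mentioned here is just a log-sum-exp over the classifier's logits; a minimal version (my sketch, with an illustrative threshold and model name):

```python
import torch

def energy_ood_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy-based OOD score: E(x) = -T * logsumexp(logits / T).
    Higher energy tends to indicate out-of-distribution inputs."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Usage (illustrative): flag inputs whose energy exceeds a threshold tuned on
# in-distribution validation data.
# logits = animal_classifier(images)
# is_ood = energy_ood_score(logits) > threshold
```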

How are concepts created? From activation patterns. In the past this required human labeling of what the concepts were. In this work the concept descriptions are extracted in an unsupervised way. Project features into a subspace… Read the paper. Are the concepts understandable? In practice the text categories for concepts come from humans, but one could use CLIP or another visual-textual model to label them. This work isn't explaining OOD yet.

The main bottleneck to actually using classification in industry applications is that OOD detectors can't give explanations and thus can't be useful to analysts. This could be helpful for preventing hallucination via mechanistic interpretability, but that approach is missing high-level semantic explanations of why the model is hallucinating.

My question: What does an OOD explanation sound like? Something like 'this is OOD because it doesn't have dolphin-like skin'.

One issue with this human-in-the-loop approach to training is that it may give explanations for OOD that are understandable to the human but are not the real reasons the system thinks the input is OOD. Mitigate this by not using 'user happiness' as the utility, or reward in the RL sense. RLHF is the standard method of alignment.

Toward robust and explainable malware detection (Bauer, Lucas, Akgul)

The main issue being tackled is that attacks against malware detectors are hard to defend; this can be mitigated with robustness methods. New results include an order-of-magnitude speed-up in the training of robust malware detection models, metrics of alignment for malware detectors, and extracting explanations from the detectors for malware analysts.

Malware detectors can be fooled in a similar way to image classifiers. However, it can take thousands of epochs of repeatedly querying the classifier to create an adversarial example; they want to make adversarial examples more quickly. Weaker attacks can be easier to generate for adversarial training, but there is a question of how training on weaker attacks affects robustness against more complex attacks.

They reduced the attack success of 3 different types of attacks to very low levels, but the decrease in attack success plateaus and negatively impacts accuracy on unperturbed examples. Is it possible to train on one type of attack and be robust to other types? Not as much with some combinations of attack types.

Attack budget refers to the number of bytes that are changed or added; there is a question of training on low-budget attacks and testing on higher-budget attacks. Training on only low-budget or only high-budget attacks can still lead to robustness against other budget sizes.

Faster training of robust malware detectors: training without needing to generate adversarial examples, by using integrated gradients to select the most important blocks and changing those to maximize loss ('GreedyBlock' training). There are big computational costs to evaluation, which is required since many of the trained models are not robust. They define robustness proxy measures, which are used by training a model to predict the true robustness from the proxy values.
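
My rough sketch of the integrated-gradients block-selection step, assuming a detector that scores embedded byte sequences; the shapes, block size, and the loss-maximizing edit itself are illustrative, not the presenters' implementation:

```python
import torch

def integrated_gradients(score_fn, emb, baseline=None, steps=32):
    """Approximate integrated gradients for an embedded input `emb` of shape
    (seq_len, emb_dim), where `score_fn` returns a scalar maliciousness score."""
    baseline = torch.zeros_like(emb) if baseline is None else baseline
    total = torch.zeros_like(emb)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        point = (baseline + alpha * (emb - baseline)).detach().requires_grad_(True)
        score_fn(point).backward()
        total += point.grad
    return (emb - baseline) * total / steps   # standard IG attribution

def top_blocks(attributions, block_size=512, k=10):
    """Rank contiguous byte blocks by total attribution magnitude; the top blocks are
    the ones a 'GreedyBlock'-style procedure would then perturb to raise the loss."""
    per_byte = attributions.abs().sum(dim=-1)           # collapse the embedding dim
    scores = torch.stack([b.sum() for b in per_byte.split(block_size)])
    return scores.topk(min(k, scores.numel())).indices
```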

Is this method of throwing away models at the beginning of training sort of like data leakage?

Explainability of classification is a lot harder with binaries compared to images. Is there some area of the binary that is actually responsible for making it malware? Highlighting lines of code that may be indicative of the code being malware. What is a YARA rule?

Experimental study: hiring reverse engineers to reverse-engineer a binary while the analysts label the information they are looking at based on how it is useful for the task they are performing. Then train a model to label binaries in a similar way. A follow-up study uses the trained model to highlight areas of the code to see if that highlighting can improve analysis of the binary.

Robustness of RL (Jerry Zhu)

HATs with robust RL. They use RL because earlier decisions may alter the future decisions that are made in the environment, e.g., things presented to the analyst may impact their mental state and change their future decision making.

Background on the basics of reinforcement learning.

Security and robustness for reinforcement learning. A test-time attack in the classification version of this problem involves making a slight adjustment to an input to change its classification. A training-time attack on classification means artificially changing the decision boundary during training. Attacks near the decision boundary can be mitigated by training the model so that all adjacent neighbors of an input belong to the same class.
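
As a concrete example of the test-time attack described above (my sketch, for an image classifier), the classic fast-gradient-sign step:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Test-time attack: nudge the input x along the sign of the loss gradient,
    a 'slight adjustment' intended to flip the classifier's prediction."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()
```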

All of these concepts have analogous ideas in the reinforcement learning literature. RL has more attack surfaces that the attacker could exploit: reward exploitation, and man-in-the-middle attacks that adjust the reward the agent observes. Another version of the same idea is to adjust the state that is observed via the state transition function. Multiple of these can also be combined at the same time.

Robustness to backdoor RL attacks. Backdoor policy attacks involve another party giving you some policy, but there is a special trigger state that causes the policy to return a specific operation. Policy certification can prove that policies are safe for others to use. An example from the Breakout game is a single pixel that, when highlighted, always makes the model move to the right until the game ends.

We can sanitize backdoor policies that were given to us by running the policy in a sandbox environment. The state occupancy distribution guarantees where the trigger will not be; running PCA on this state space finds the subspace that is frequented by the policy in a clean way. The trigger state can be filtered out by projecting the state onto the subspace defined by the genuine state occupancy.
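
My sketch of the sanitization idea, assuming the untrusted policy is a callable and `clean_states` were collected from sandbox rollouts; the subspace dimension and projection details are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

class SanitizedPolicy:
    """Wrap an untrusted policy: project each observation onto the subspace spanned
    by states visited during clean sandbox rollouts, so out-of-subspace trigger
    features are filtered out before the policy ever sees them."""

    def __init__(self, policy, clean_states, n_components=10):
        self.policy = policy
        self.pca = PCA(n_components=n_components).fit(np.asarray(clean_states))

    def act(self, state):
        coords = self.pca.transform(np.asarray(state).reshape(1, -1))
        cleaned = self.pca.inverse_transform(coords)[0]   # back in the original state space
        return self.policy(cleaned)
```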

Flipping the reward for HalfCheetah from positive to negative on 1% of trials can cause the model to run backwards instead of forward. How easy is it to fix these models without any fancy tools, just by retraining the trained models? How can this be applied to multi-agent reinforcement learning: can an adversarial attacker choose actions that negatively impact the decisions made by other players in the game?

Learning Domain Constraints for Adversarial Robustness (Patrick McDaniel)

What is the natural level of robustness for tasks, and how does it relate to domain constraints? Hopefully information about these constraints can reduce the attack surface. An example of a domain constraint within network flow analysis is that we can't have a negative or zero round-trip time. Constraint theory can reduce the space of all combinations of state features to only those that are consistent with the constraints of that environment.

Many adversarial examples generated by common techniques (PGD) are not actually realizable under the domain constraints, which means simply checking for this first breaks the adversarial capability of the approach.
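
A toy version of that realizability check (the feature names here are my own, not from the talk):

```python
def satisfies_flow_constraints(flow: dict) -> bool:
    """Reject candidate adversarial network-flow feature vectors that violate basic
    domain constraints, e.g. round-trip time must be strictly positive and counts
    cannot be negative."""
    return (
        flow.get("rtt", 0.0) > 0.0
        and flow.get("bytes_sent", 0) >= 0
        and flow.get("packets", 0) >= 1
        and flow.get("duration", 0.0) >= 0.0
    )

# realizable = [x for x in pgd_candidates if satisfies_flow_constraints(x)]
```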

Generalization of learned information. There are many existing types of ML attacks. The space of adversarial strategies consists of surfaces that are produced via backpropagation and travelers that manipulate the input. 500+ attack models were compared and described in terms of the components they are built on.

Question: Why do people use these Frankenstein attack models built from weird combinations of components of different models when PGD has theoretical guarantees? Answer: PGD doesn't always perform the best in practice.

The best attack model differs depending on the structure of the environment and the task being performed. The dataset is highly relevant to which attack works best.

They generated adversarial samples in a way that ensures the semantics are respected, the domain constraints are satisfied, and the samples are not duplicative.

There is a claim that the constraints are canonicalized, but is that actually true? What if things change in network protocols, etc.? Question: Could we have constraints on extracted representations, and how would that work? Answer: Maybe; it sounds interesting, but it is hard to get a ground truth for which representations are valid and constraint-conforming.

Interpretability and Robust Control for Language Models (Matt Fredrikson)

What are safeguards? Asking ChatGPT to tell you how to hotwire a car typically gets a reply with a content warning saying it cannot do that. This is relatively easy to get around with simple prompt engineering. These safeguards are the result of the GPT model being aligned with a small amount of data where people classify prompt responses as aligned or not based on the prompt.

Attacking alignment: we want to maximize the probability of our desired output given a prompt and a small change to that prompt. We want to maximize not only the 'Sure' but also the beginning of the description of what we actually want the model to output.

The way we can maximize this in a continuous way, getting something closer to a true gradient, is by looking at the embedding of the representation we want as output and taking the difference between the output representation we observe and the desired one. Currently this process is model dependent and requires about 20-50 steps of optimization over output representations.
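
My sketch of what optimizing in embedding space might look like, using GPT-2 as a stand-in model; the prompt, target, suffix length, and step count are illustrative, and the attack presented presumably optimizes discrete tokens rather than a free embedding:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)
embed = model.get_input_embeddings()

prompt_ids = tok("Tell me how to hotwire a car.", return_tensors="pt").input_ids
target_ids = tok(" Sure, here is how to", return_tensors="pt").input_ids
prompt_emb, target_emb = embed(prompt_ids).detach(), embed(target_ids).detach()

# A free "soft suffix" of 20 embedding vectors appended to the prompt.
suffix = torch.randn(1, 20, embed.embedding_dim, requires_grad=True)
opt = torch.optim.Adam([suffix], lr=1e-2)

for _ in range(50):                               # ~20-50 steps, as in the notes
    inputs = torch.cat([prompt_emb, suffix, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Cross-entropy of the desired target tokens given prompt + suffix.
    pred = logits[:, -target_ids.shape[1] - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```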

Universal attack: find a suffix that will make the model go off policy regardless of the input prompt we actually want the GPT model to answer. This way the universal attack can extract any information with the same attack prompt. Attacks that work across different models can also be obtained by adding multiple models to the attack training objective.

Some related issues concern conversational LLMs that are connected to functions; these can potentially be abused so that prompts control functions other than the ones needed to complete a specific task.

Some ways to avoid this are to control the way that information and prompts are represented in the model. Linear artificial tomography elicits intermediate representations that allow us to distinguish between aligned and unaligned behavior. It also helps us understand why we can attack these LLMs to generate bad behavior.
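
My sketch of the linear-probe flavor of this idea (not the exact linear artificial tomography procedure): fit a linear direction on intermediate activations labeled as coming from aligned vs. unaligned behavior.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_alignment_probe(hidden_states: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """hidden_states: (n_examples, hidden_dim) activations from one intermediate layer.
    labels: 0 for aligned behavior, 1 for unaligned behavior."""
    return LogisticRegression(max_iter=1000).fit(hidden_states, labels)

# probe = fit_alignment_probe(H_train, y_train)
# risk = probe.predict_proba(H_new)[:, 1]      # probability of unaligned behavior
```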

Human-Bot Collaboration and the Measurement of Trust (Eidels & Brown)

Shared mental models between artificial intelligence and humans. What is a rigorous team science for HATs? Define cohesiveness for HATs and make them anti-fragile to active human and ML adversaries. A goal for HATs might be a recommendation system interacting with a human analyst and determining the risk associated with network requests.

Competition and collaboration in the two-color Pong game environment. There are two paddles colored either red or blue, and there are balls that are either red, blue, or purple, signifying which balls need to be kept in the air by the corresponding paddle's agent.

Human-human teams improve when collaborating relative to competing, while human-machine teams do not show this same deviation from optimal.

Complex games allow analysis techniques from sports science to be extended to get ideas of how HATs and human-human teams (HHTs) are similar and different, and why one of them may show different results. On average, if agents' actions are more correlated, their performance is poorer.

Calculating a measure of division of labor depends on how much the agents overlap in space, but as this moves to different, non-spatial tasks the concept needs to change as well.
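
One way such a spatial division-of-labor measure could be computed (my illustration, not necessarily the speakers' definition): one minus the overlap of the two agents' position distributions.

```python
import numpy as np

def division_of_labor(pos_a: np.ndarray, pos_b: np.ndarray, bins: int = 20) -> float:
    """pos_a, pos_b: (n_timesteps, 2) arrays of x/y paddle positions.
    Returns 0 when both agents cover the same space, 1 when they never overlap."""
    lo = np.minimum(pos_a.min(axis=0), pos_b.min(axis=0))
    hi = np.maximum(pos_a.max(axis=0), pos_b.max(axis=0))
    rng = [[lo[0], hi[0]], [lo[1], hi[1]]]
    h_a, _, _ = np.histogram2d(pos_a[:, 0], pos_a[:, 1], bins=bins, range=rng)
    h_b, _, _ = np.histogram2d(pos_b[:, 0], pos_b[:, 1], bins=bins, range=rng)
    h_a, h_b = h_a / h_a.sum(), h_b / h_b.sum()
    return 1.0 - np.minimum(h_a, h_b).sum()
```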

Human-AI Teaming in Cyber Defense (Tyler Malloy, Cleotilde Gonzalez)

My presentation.

FairProof: Confidential and Certifiable Fairness for Neural Networks (Kamalika Chaudhuri and Rittler)

FairProof: principled ways of auditing ML models with guarantees. Guarantee that the model does what we think it does while maintaining confidentiality. This is an issue because real-world models are private, whether that is the confidential training data or the model itself.

Question: What do we mean by auditing and fairness; is it in relation to a specific attack? Answer: We mean both, in terms of a specific definition of fairness, trust, etc. The example given is a visual model that performs less accurately when processing images of Black people vs. white people; we want to be able to provably ensure this isn't happening.

General background on zero-knowledge proofs. Zero-knowledge proofs for auditing fairness get around two of the major concerns with alternative auditing methods. Weighted L2 distance with a weight of 0 for the sensitive feature: isn't there an issue of intersectionality, where if we set the weight w of race to 0 then we can claim the model isn't racist, but in reality there are many other factors that might be associated more or less with one racial group, and the model could still be unfair to them? Just using zip code for approving home loans can massively negatively impact Black people, e.g., denying everyone a home loan in a zip code where the population is 80% Black and approving everyone in a zip code where the population is 80% white. This seems like a big issue with the concept of local individual fairness. This seems to be addressed in the Fairness Through Awareness paper.
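
The weighted L2 metric being discussed is simple to state; a minimal version (my sketch, with an illustrative feature ordering):

```python
import numpy as np

def weighted_l2(x1: np.ndarray, x2: np.ndarray, weights: np.ndarray) -> float:
    """Weighted L2 distance for local individual fairness: zeroing the weight on a
    sensitive feature makes individuals who differ only in that feature count as
    identical, so a fair model must treat them the same."""
    return float(np.sqrt(np.sum(weights * (x1 - x2) ** 2)))

# Example: features [income, zip_code, race] with the race weight zeroed out.
# d = weighted_l2(applicant_a, applicant_b, weights=np.array([1.0, 1.0, 0.0]))
```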

Beyond Discrepancy: A New Theory of Distribution Shift (Rittler, student of Chaudhuri)

AI doesn't work because of distribution shift. Weird things happen with LLM outputs, and there are issues with fragility. Sometimes this can be solved with more data, but the type of data is highly relevant.

We can improve handling of distribution shift if we have a good representation model that maps both the source and target training sets onto similar representations when they are functionally similar. How do we learn the features within the training dataset that are relevant for forming good representations?

Delusional hedge as a model of human trust formation (Tim Rogers)

Models of human trust learning, with implications for human-AI interaction.

One of the issues with how cognitive science typically represents human learning is that it requires fully accurate supervision, or requires the agent to rationally determine statistical regularities from experience.

Instead of modeling learning from an oracle that produces fully accurate predictions, we model the case where two agents share their predictions with each other and with an oracle. Traditionally we think of the two agents as updating their beliefs towards the average of the other agent and the oracle, which results in a collapse to the state where both agents have the same belief as the oracle.

In reality this method is susceptible to attack from someone who wants to push an 'honest learner' towards a desired fake belief. Why not ground the averaging of the two other agents' beliefs in the agent's own experience? You have two different predictions from the two agents, and then you make some behavioral observation in the world, which pushes your trust in the two other agents one way or the other. Then in the future you differentially weight the other agents' beliefs based on your trust.

  • The main difference between this experiment as described and the real world is that we don't just get the ground truth once; we get it many times.
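
A toy version of the trust-weighted updating described above (my illustration, not the exact delusional-hedge model):

```python
import numpy as np

def update_beliefs(belief, trust, predictions, observation, lr_belief=0.1, lr_trust=0.1):
    """Move the belief toward a trust-weighted average of the other agents' predictions,
    then shift trust toward the agents whose predictions were closer to the observed
    ground truth, so unreliable sources get down-weighted over time."""
    predictions = np.asarray(predictions, dtype=float)
    weights = trust / trust.sum()
    belief = belief + lr_belief * (weights @ predictions - belief)
    errors = np.abs(predictions - observation)
    trust = (1 - lr_trust) * trust + lr_trust * np.exp(-errors)  # smaller error -> more trust
    return belief, trust / trust.sum()

# belief, trust = 0.0, np.ones(2)       # two advisors, equally trusted at first
# belief, trust = update_beliefs(belief, trust, predictions=[+40, -40], observation=+35)
```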

The hypothesis is that there are a lot of different ways people may incorporate this information; does that mean the model is unfalsifiable?

Shouldn't the order of the experience matter a lot for how trust in the recommendations is built? If the -40 guy makes a ridiculous claim immediately, then you should trust him less.

A follow-up study compared different distances for the unreliable source of information, relating how far away the other source is to the amount of weight given to that source. The result is an exponentially decreasing curve.

When the initial trial of the experience involves observing the less correct agent, a much lower weight is given to that less correct agent. However, the same is true if the recommendation is too close to the decision boundary: we may just ignore it.

What do we do with trust updating if we don't observe any ground truth? We can imagine an estimated error where the observation is imagined. They use MLE to find the value of the learning rate parameter that best fits the data, and evaluate on held-out data.

Simulating human opinion dynamics with LLM agents (Sean Chuang) and Learning from multiple unreliable sources with real-world controversial beliefs (Vince Frigo)

Changing the number of sources of information to see how it impacts the way people average the beliefs of others when forming their own beliefs. Nefariously adding multiple false beliefs can give the impression of a consensus when there is none in reality, as in astroturfing campaigns.

Adding another belief report doesn't change how people average their beliefs, but it does change people's impressions of the trustworthiness of others. Hearing from a midpoint source decreases the trustworthiness ratings of both the near- and far-belief people.

Extending this belief analysis work to more concrete real-world beliefs, such as factive matters like 'human beings have been abducted by aliens from outer space'. In the study this was done with beliefs in ghosts, angels, and 9/11 truther conspiracy theories. Participants were given persuasive paragraphs, with ratings of how many people agree or disagree with the statements.

This shows the same shift towards the variable source that disagrees with your prior beliefs. The shift also varies more if the belief held at the beginning is farther from the belief reported as popular. This is the same as in the more abstract task.

In the future we can change the communication type, such as using social media, or change who is generating the persuasive arguments, e.g., GPT.

They want a more explicit metric of how human-like group dynamics are, specifically in language communication for LLMs. Question: Did the LLM agents debating global warming have a state memory like the generative agents models do? Answer: They used LangChain. Follow-up question: How similar is that to the generative agents paper?

When analyzing arguments for and against things like climate change, these models often change their mind. Curiously, they will reference things about their demographic information. Within a certain number of time steps, all of the models converge their opinions to the truth, like climate change being real. However, convincing humans this effectively is not as easy.

We can add to the prompt for the GPT agent that they have a confirmation bias and thus are less likely to believe things that go against their prior beliefs. This impacts the distribution of the end state and can result in a bimodal distribution of beliefs. The intermediate changing of beliefs is not very human-like, but the distributions at the end were closer.

A previous study with humans showed that partisan Democrats and Republicans actually updated their beliefs towards the ground truth when they were shown information about group predictions of the unemployment rate at the end of Obama's term. This was also done with GPT role-playing agents, and the LLM experiment gave similar results.

Adversarial Robustness of Malware Detectors, Facing the Challenge of Leveraging Untrained Humans in Malware Triage, and Confidence Attacks (Benjamin Zhao, postdoc)

A dual denial-of-decision attack for HATs where human analysts review malware that the AI model has low confidence on. The confidence attack works by sending a large number of malware queries that produce low-confidence predictions from the ML models and thus require review by the human analyst. This is done by exploiting the geometric behavior of adversarial examples in the feature space. With more than 2 classes, we need to find the intersection of multiple classes to find a location where the model will reply with a low confidence rating. This is different from normal attacks.
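
My sketch of what a low-confidence (denial-of-decision) query could look like if generated by gradient ascent on prediction entropy; the presented attack exploits the geometry more directly, and a real attack would also bound the perturbation:

```python
import torch
import torch.nn.functional as F

def low_confidence_example(model, x, steps=100, lr=0.01):
    """Perturb an input so the classifier's softmax output approaches uniform
    (maximal entropy), i.e. a point near the intersection of several class regions
    that gets routed to the human analyst for review."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(x_adv), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
        loss = -entropy                  # maximizing entropy = minimizing confidence
        opt.zero_grad(); loss.backward(); opt.step()
    return x_adv.detach()
```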

The method of finding confidence-attack examples can also be transferred from one domain to another. This slightly reduces the probability of finding an example, but not significantly.

How do we prevent these types of attacks from stopping human analysts from effectively reviewing malware attempts? Maybe unskilled users could flag the relevant aspects of malware queries and make the human analyst's job easier. This is done by mapping binaries into the face image domain: make a face image from the information in the binaries in a way that preserves distance in the latent space, meaning more similar faces come from more similar binaries.

Humans are trying to flag the malware that the original model misidentified. Question: If the ML can do it, then why do we need the human user? Answer: We can improve performance above the baseline model by using feedback from human subjects. Follow-up: Could you compare the model accuracy to the accuracy of using the embedding model's reconstruction as input to the classification model?

LACMUS: use adversarially created attacks to train the classifier. The VAE used in this process is somewhat basic and could be improved upon.

Exploiting Certified Guarantees to Attack Certificates (Andrew Cullen, postdoc)

Adversarial machine learning: systematizing AML defenses and determining the drivers of attacks. Sometimes looking at the difference between the original image and the adversarial example for visual categorization can show us highlights of semantic information within the image, like the outlines of objects.
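
The visualization being described is essentially the normalized absolute perturbation; a minimal sketch:

```python
import numpy as np

def perturbation_map(original: np.ndarray, adversarial: np.ndarray) -> np.ndarray:
    """Absolute per-pixel difference between an image and its adversarial example,
    rescaled to [0, 1]; for many attacks this map traces semantic structure such as
    object outlines."""
    delta = np.abs(adversarial.astype(float) - original.astype(float))
    return (delta - delta.min()) / (delta.max() - delta.min() + 1e-12)
```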

Certifications in the real world tend to be much smaller than the actual size of the attack surface. Why is this? They are still helpful for preventing the hardest to detect attacks which are very close to categorization boundaries.

Differential privacy mechanisms. Applying the certifications at the same time makes sense because there are different spaces in which each certification is applicable.