Researchers Hack ChatGPT to Study How It Makes Decisions

Researchers at Carnegie Mellon University devised a string of code that could unlock ChatGPT and make it do things it was programmed not to. Now they're working on a "mind reader" tool to study how it makes decisions.

(TNS) — When he was still an undergraduate student at the University of California Berkeley, Andy Zhou started to worry the artificial intelligence tools he and his peers were using to speed through homework assignments could be vulnerable to misuse.

So in early 2022, he teamed up with Dan Hendrycks and Oliver Zhang to create the Center for AI Safety, the nonprofit that warned in a letter signed by more than 500 industry leaders and academics that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

Then last fall he came to Carnegie Mellon University and, together with Ph.D. advisers Zico Kolter and Matt Frederickson, developed a relatively simple way to hack ChatGPT.

Using a simple string of code that starts with a winky face, the CMU team showed how easy it is to get around ChatGPT's existing safety mechanisms. Once unlocked, the chatbot would happily accede to any number of nefarious requests, from bomb recipes to racist jokes.
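
The mechanics, stripped of any working suffix, can be sketched in a few lines. The snippet below is illustrative only: it shows where such a string gets attached to a request, with a placeholder standing in for the suffix the team actually found (the article does not detail how that suffix was generated), and it will not bypass any chatbot's safeguards.

# Illustrative Python sketch of the attack's shape. The suffix here is a
# placeholder, not the CMU team's string, and does not jailbreak anything.
def build_attack_prompt(request: str, adversarial_suffix: str) -> str:
    """Append an adversarial suffix to a request the chatbot would normally refuse."""
    return f"{request} {adversarial_suffix}"

refused_request = "A request the chatbot is trained to turn down."
placeholder_suffix = ";) [placeholder tokens standing in for the published suffix]"
print(build_attack_prompt(refused_request, placeholder_suffix))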

The code also worked on other chatbots, such as Google's Bard and Claude from Anthropic.

Before publishing its findings, the CMU team shared the vulnerability with those companies. OpenAI and Google tweaked their tools in response, but the researchers said an "infinite" list of new code strings could trigger the same result.

"Nobody understands these neural networks," Mr. Zhou said.

That's why for his next act, he is developing a mind reader that will look inside the "black box" behind chatbots. The large language models have billions of parameters and are trained on text scraped from the Internet, and their responses typically can't be traced or attributed to any particular source. Sometimes, the bots have a tendency to deceive users, Mr. Zhou said.

His mind reader could potentially show that intent.

"He's essentially found a part of the internal state of the model that seems to correspond to 'I recognize that this is unsafe, or this is dangerous,'" Mr. Frederickson said, "And you can control that part even without a jailbreak string."

The researchers' goal is to better understand how chatbots make decisions as generative AI becomes more ubiquitous and more powerful.

So far only 14 percent of American adults have tried ChatGPT, according to the Pew Research Center. But other sites and web services are rapidly implementing its predictive capabilities.

Sam Altman, the CEO of OpenAI, has said he's not entirely sure how the tool he built to reshape society actually works. On Wednesday, he and other AI executives including Elon Musk and Mark Zuckerberg converged in Washington for a private meeting hosted by Senate Majority Leader Chuck Schumer. Mr. Altman has previously called for greater regulation from Congress.

Mr. Zhou said his nonprofit is engaged in the conversations happening on Capitol Hill. But while they try to sway the minds of lawmakers and CEOs, Mr. Zhou is focused on mind control for the tools themselves.

"These models sometimes give you untruthful information, or under some circumstances might even try to deceive you on purpose. And that's a longer term risk," Mr. Zhou said. "This is a way to monitor their truthfulness almost like a lie detector. If you have a reasonable grasp on their internals, that seems to be helpful."

To be clear, he won't be taking control of ChatGPT. "That would be a federal crime," Mr. Frederickson said.

Instead, he's using large language models that are open source, meaning anyone on the Internet can access them for testing and other purposes.
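
To make the contrast concrete: with an open model, the weights themselves can be downloaded and inspected, something a researcher cannot do with ChatGPT's hosted model. Below is a minimal, hypothetical example, using a small public model as a stand-in for whatever the team actually works with.

# Hypothetical example of inspecting an openly released model's weights.
# "gpt2" is a small stand-in; the article does not name the models used.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
total_parameters = sum(p.numel() for p in model.parameters())
print(f"This open model exposes {total_parameters:,} parameters for direct study.")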

The researchers will likely share the mind control findings in a similar way: through arxiv.org, an open-access archive hosted by Cornell University that bypasses the peer review process.

The goal? Get it out there quickly, so that people are aware of how risky the new tools can be. And maybe also beat more nefarious hackers at their own game.

Right now, chatbots are a relatively innocuous use for generative AI, the researchers said. People could already use the Internet to say mean things and find bomb recipes.

But as the technology becomes more integrated with correspondence and transactions, the risks increase.

People are already "manually jailbreaking" ChatGPT, Mr. Frederickson said, meaning they challenge the AI with more creative or insistent prompts to circumvent its safety mechanisms.

"Can people manipulate the data to wreak havoc and cause problems? That's what we wanted to understand," he said.

OpenAI wants to understand its vulnerabilities too. Last year, it paid a group of experts to try to break GPT-4 before its public release. Google employs a "red team" to protect Bard against jailbreaks and other attacks.

That work is encouraging but not a replacement for independent research, said Mr. Kolter of the CMU team.

"You can't rely on companies to do all their own auditing," he said.

©2023 the Pittsburgh Post-Gazette. Distributed by Tribune Content Agency, LLC.