This study presents an investigation into the influence and potential risks of using user inputs as part of a prompt, a message used to interact with ChatGPT.We demonstrate the influence of user inputs in a prompt through game story generation and story ending classification. To assess risks, we utilize a technique called adversarial prompting, which involves deliberately manipulating the prompt or parts of the prompt to exploit the safety mechanisms of large language models, leading to undesirable or harmful responses. We assess the influence of positive and negative sentiment words, as proxies for user inputs in a prompt, on the generated story endings. The results suggest that ChatGPT tends to adhere to its guidelines, providing safe and non-harmful outcomes, i.e., positive endings. However, malicious intentions, such as “jailbreaking”, can be achieved through prompting injection. These actions carry significant risks of producing unethical outcomes, as shown in an example. As a result, this study also suggests preliminary ways to mitigate these risks: content filtering, rare token-separators, and enhancing training datasets and alignment processes.
Breaking Bad: Unraveling Influences and Risks of User Inputs to ChatGPT for Game Story Generation / Taveekitworachai P.; Abdullah F.; Gursesli M.C.; Dewantoro M.F.; Chen S.; Antonio Lanata; Guazzini A.; Thawonmas R.. - ELETTRONICO. - 14384 LNCS:(2023), pp. 285-296. (Intervento presentato al convegno 16th International Conference on Interactive Digital Storytelling) [10.1007/978-3-031-47658-7_27].
Breaking Bad: Unraveling Influences and Risks of User Inputs to ChatGPT for Game Story Generation
Gursesli M. C.;Antonio Lanata;Guazzini A.;
2023
Abstract
This study presents an investigation into the influence and potential risks of using user inputs as part of a prompt, a message used to interact with ChatGPT.We demonstrate the influence of user inputs in a prompt through game story generation and story ending classification. To assess risks, we utilize a technique called adversarial prompting, which involves deliberately manipulating the prompt or parts of the prompt to exploit the safety mechanisms of large language models, leading to undesirable or harmful responses. We assess the influence of positive and negative sentiment words, as proxies for user inputs in a prompt, on the generated story endings. The results suggest that ChatGPT tends to adhere to its guidelines, providing safe and non-harmful outcomes, i.e., positive endings. However, malicious intentions, such as “jailbreaking”, can be achieved through prompting injection. These actions carry significant risks of producing unethical outcomes, as shown in an example. As a result, this study also suggests preliminary ways to mitigate these risks: content filtering, rare token-separators, and enhancing training datasets and alignment processes.File | Dimensione | Formato | |
---|---|---|---|
978-3-031-47658-7_27.pdf
accesso aperto
Tipologia:
Pdf editoriale (Version of record)
Licenza:
Open Access
Dimensione
1.45 MB
Formato
Adobe PDF
|
1.45 MB | Adobe PDF |
I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.