This study presents an investigation into the influence and potential risks of using user inputs as part of a prompt, a message used to interact with ChatGPT.We demonstrate the influence of user inputs in a prompt through game story generation and story ending classification. To assess risks, we utilize a technique called adversarial prompting, which involves deliberately manipulating the prompt or parts of the prompt to exploit the safety mechanisms of large language models, leading to undesirable or harmful responses. We assess the influence of positive and negative sentiment words, as proxies for user inputs in a prompt, on the generated story endings. The results suggest that ChatGPT tends to adhere to its guidelines, providing safe and non-harmful outcomes, i.e., positive endings. However, malicious intentions, such as “jailbreaking”, can be achieved through prompting injection. These actions carry significant risks of producing unethical outcomes, as shown in an example. As a result, this study also suggests preliminary ways to mitigate these risks: content filtering, rare token-separators, and enhancing training datasets and alignment processes.

Breaking Bad: Unraveling Influences and Risks of User Inputs to ChatGPT for Game Story Generation / Taveekitworachai P.; Abdullah F.; Gursesli M.C.; Dewantoro M.F.; Chen S.; Antonio Lanata; Guazzini A.; Thawonmas R.. - ELETTRONICO. - 14384 LNCS:(2023), pp. 285-296. (Intervento presentato al convegno 16th International Conference on Interactive Digital Storytelling) [10.1007/978-3-031-47658-7_27].

Breaking Bad: Unraveling Influences and Risks of User Inputs to ChatGPT for Game Story Generation

Gursesli M. C.;Antonio Lanata;Guazzini A.;
2023

Abstract

This study presents an investigation into the influence and potential risks of using user inputs as part of a prompt, a message used to interact with ChatGPT.We demonstrate the influence of user inputs in a prompt through game story generation and story ending classification. To assess risks, we utilize a technique called adversarial prompting, which involves deliberately manipulating the prompt or parts of the prompt to exploit the safety mechanisms of large language models, leading to undesirable or harmful responses. We assess the influence of positive and negative sentiment words, as proxies for user inputs in a prompt, on the generated story endings. The results suggest that ChatGPT tends to adhere to its guidelines, providing safe and non-harmful outcomes, i.e., positive endings. However, malicious intentions, such as “jailbreaking”, can be achieved through prompting injection. These actions carry significant risks of producing unethical outcomes, as shown in an example. As a result, this study also suggests preliminary ways to mitigate these risks: content filtering, rare token-separators, and enhancing training datasets and alignment processes.
2023
16th International Conference on Interactive Digital Storytelling
16th International Conference on Interactive Digital Storytelling
Taveekitworachai P.; Abdullah F.; Gursesli M.C.; Dewantoro M.F.; Chen S.; Antonio Lanata; Guazzini A.; Thawonmas R.
File in questo prodotto:
File Dimensione Formato  
978-3-031-47658-7_27.pdf

accesso aperto

Tipologia: Pdf editoriale (Version of record)
Licenza: Open Access
Dimensione 1.45 MB
Formato Adobe PDF
1.45 MB Adobe PDF

I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificatore per citare o creare un link a questa risorsa: https://hdl.handle.net/2158/1346082
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? ND
social impact