3D Talking Heads: Advancing Realism and Generalization
Federico Nocentini
2026
Abstract
The human face plays a central role in communication, conveying not only spoken content but also social signals such as emotions, identity, and individual speaking style. Realistic modeling and animation of facial behavior are therefore critical for applications in computer graphics, virtual and augmented reality, telepresence, human–computer interaction, and the entertainment industry. Speech-driven 3D facial animation, commonly referred to as 3D talking heads, requires precise synchronization between speech and facial movements, while capturing natural expressions, micro-expressions, eye gaze, and head gestures. Despite substantial progress, several key challenges remain. Existing approaches often struggle to generate emotionally expressive faces, handle variations in language and speaking style, generalize across diverse identities, and operate on meshes with varying topologies. Furthermore, conventional evaluation protocols, often based on geometric error, fail to adequately capture temporal consistency, motion dynamics, and perceptual quality, limiting the reliability of comparisons between methods. This dissertation addresses these challenges along four complementary directions. First, it investigates methods for generating realistic and emotionally expressive facial animations, capturing subtle cues beyond lip movements. Second, it explores cross-lingual and style-aware approaches, accounting for the effects of language, prosody, and individual speaking style on facial motion. Third, it studies topological generalization, enabling animation of diverse 3D meshes without requiring fixed correspondences. Finally, it develops rigorous evaluation frameworks that measure temporal coherence, articulation accuracy, and overall perceptual realism, providing principled metrics for comparison.
Taken together, the contributions of this dissertation advance the state of the art in speech-driven 3D facial animation by enabling expressive, generalizable, and perceptually convincing talking heads, laying the foundation for robust and versatile applications in communication, interactive media, and virtual environments. Beyond the technological and creative domains, speech-driven facial animation has broader societal and communicative implications. In accessibility, realistic talking heads can serve as visual speech aids for individuals with hearing impairments, improving lip-reading and comprehension in noisy environments. In education and language learning, expressive avatars can provide interactive feedback and cross-cultural communication support. Moreover, with the increasing integration of avatars and digital humans in everyday digital communication, from customer service to telemedicine, ensuring natural, emotionally congruent, and ethically responsible generation of facial behavior has become increasingly critical. This dissertation therefore also reflects on the societal and ethical dimensions of such technology, recognizing that realism and expressiveness must be balanced with transparency and trustworthiness.
| File | Type | License | Size | Format |
|---|---|---|---|---|
| Nocentini_PhD_thesis.pdf (open access) | Editorial PDF (Version of record) | Open Access | 48.31 MB | Adobe PDF |
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.