Addressing Complex Sampling Mechanisms in Survey Statistics and Causal Inference / Braito Lisa. - (2026).
Addressing Complex Sampling Mechanisms in Survey Statistics and Causal Inference
Braito Lisa
2026
Abstract
Complex sampling mechanisms, understood here as departures from fully specified, unconfounded, probabilistic designs, arise in many forms: when inclusion depends on the variable of interest (confounded/endogenous sampling), when selection rules are only partially known or unrecorded, or when some units lack a well-defined, nonzero inclusion probability, so that inclusion is not probabilistic. In such settings, the sampling process itself must be accounted for to ensure valid inference. This issue is central in both survey statistics and causal inference, where the sampling mechanism determines what data become observable and how missingness is generated; neglecting a confounded or unknown mechanism induces systematic selection bias, because the observed data no longer reflect the descriptive or causal structure of the target population. This thesis investigates the role of complex sampling mechanisms and examines how their characteristics shape both data integration and causal inference, and why they require explicit adjustment. It is organized around two connected lines of research: (i) drawing valid inferences from non-probability samples that generalize to the target population, and (ii) causal inference under confounded sampling mechanisms, with emphasis on principal stratification under outcome-dependent sampling (ODS). Chapter 1 introduces notation and bridges the missing-data and survey-sampling perspectives, clarifying the conditions under which the sampling mechanism can be ignored for inference, and offering a taxonomy of common data-collection schemes by ignorability and by whether the mechanism is known or unknown. Chapter 2 studies population mean estimation from non-probability samples, comparing estimators in terms of bias, variability, and robustness to misspecification through simulation designs that vary the magnitude and structure of selection bias as well as the degree and form of model misspecification.
In this context, several classes of doubly robust estimators are proposed and discussed, and simple diagnostics are provided to guide estimator choice under potential misspecification. Chapter 3 applies these methods to web-scraped data combined with population registers for Tuscan firms to measure innovation and Industry 4.0 adoption. Chapter 4 addresses causal inference when data are obtained from a known but confounded sampling design (e.g., ODS or case-noncase design, also referred to as case-control in the epidemiological literature). Within this setting, principal stratification is formalized under ODS. The analysis clarifies the role of sampling mechanisms in causal inference, establishes identification conditions for principal causal effects, and develops estimation strategies based on principal scores and likelihood-based methods tailored to the ODS design. Simulation studies confirm that the proposed adjustment reduces bias and variability relative to naive analyses. Chapter 5 implements the framework in the E3N cohort of French women, estimating principal causal effects of menopausal hormone therapy on breast cancer risk in the presence of a post-treatment intermediate variable by leveraging a principal score method. A simulation study anchored to E3N cohort data treats the cohort as a finite population and repeatedly draws case-noncase samples to assess finite-sample performance, providing empirical evidence for the method proposed in Chapter 4.

| File | Size | Format | |
|---|---|---|---|
| Lisa_PhD_Thesis.pdf (embargoed until 14/04/2027). Description: Addressing Complex Sampling Mechanisms in Survey Statistics and Causal Inference. Type: Publisher's PDF (Version of record). License: Open Access | 29.27 MB | Adobe PDF | Request a copy |
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.



