Co-limitation towards lower latitudes shapes global forest diversity gradients

The latitudinal diversity gradient (LDG) is one of the most recognized global patterns of species richness exhibited across a wide range of taxa. Numerous hypotheses have been proposed in the past two centuries to explain LDG, but rigorous tests of the drivers of LDGs have been limited by a lack of high-quality global species richness data. Here we produce a high-resolution (0.025° × 0.025°) map of local tree species richness using a global forest inventory database with individual tree information and local biophysical characteristics from ~1.3 million sample plots. We then quantify drivers of local tree species richness patterns across latitudes. Generally, annual mean temperature was a dominant predictor of tree species richness, which is most consistent with the metabolic theory of biodiversity (MTB). However, MTB underestimated LDG in the tropics, where high species richness was also moderated by topographic, soil and anthropogenic factors operating at local scales. Given that local landscape variables operate synergistically with bioclimatic factors in shaping the global LDG pattern, we suggest that MTB be extended to account for co-limitation by subordinate drivers. Examining drivers of the latitudinal biodiversity gradient in a global database of local tree species richness, the authors show that co-limitation by multiple environmental and anthropogenic factors causes steeper increases in richness with latitude in tropical versus temperate and boreal zones.

I dentifying which mechanisms moderate global biodiversity patterns 1,2 has perplexed the scientific community for more than two centuries 3,4 . The most noticeable pattern, the latitudinal diversity gradient (LDG), is a trend of declining local species richness (alpha diversity) from low to high latitudes. This trend has been observed for many taxonomic groups and across land, freshwater and marine environments 5,6 . More than 30 hypotheses have been proposed 3,4,7,8 to explain LDG 9 , but few can be reconciled with existing observational data for predicting biodiversity decline towards the poles. To test these varied hypotheses, biodiversity data must be assembled that are global in scope with sufficient sample coverage across all ecoregions and biomes.
In addition to biodiversity data, testing these varied hypotheses also requires data on a wide spectrum of potential drivers that may moderate biodiversity at local scales 9,10 , such as climate, soil and land features and anthropogenic factors. For instance, environmental temperature (that is, ambient temperature of the air, represented by annual mean temperature) is largely responsible for the generation and maintenance of biodiversity through the effects of solar radiation on demographic rates (for example, growth and mortality), ecological interactions (for example, predation and competition) and evolutionary rates of change (for example, speciation and extinction) 11,12 . Soil and topographic heterogeneity facilitate niche partitioning via inducing microclimatic variation, contributing to compositional variation 13 and biodiversity maintenance 14,15 . Furthermore, humans have a long history of reshaping biodiversity through the selective use of natural resources and the modification of native species composition 16 . In addition, multiple subordinate factors jointly affecting biodiversity could potentially increase the diversity of niche opportunities, thereby resulting in speciesrich assemblages.
Here we quantified the relative contribution of a wide range of environmental factors across space on local tree species richness in forested areas around the world. To accomplish this, we standardized a global tree species richness (that is, as alpha diversity) database ( Fig. 1) and quantified the relative contribution of 47 explanatory variables including bioclimatic conditions (for example, annual mean temperature), vegetation and survey attributes (for example, sample plot size), topographic covariates (for example, terrain roughness), soil covariates (for example, bulk density) and anthropogenic spatial features (for example, size of roadless areas) in an attempt to test whether local co-limitation exists when multiple subordinate drivers co-dominate (Figs. 2 and 3). We conducted a three-stage analysis ( Fig. 1 and Methods in Supplementary Information for details) based on two independent ground-sourced forest inventory datasets (Phase I and Phase II; Extended Data  Fig. 1), with a total of ~55 million sample trees representing more than 32,000 species.

Results and discussion
Global patterns of local tree species richness and LDG. Our analyses confirmed, with a high level of accuracy, one general spatial trend in local tree species richness worldwide that has led us to three conclusions regarding the mechanisms underlying patterns of tree species richness. We found that LDG for tree species richness was consistent with that of most other groups of organisms, with a decline from the tropics to the poles (Figs. 2 and 4). In the Northern Hemisphere, tree species richness dropped sharply from the equator (98 species per ha) to 10° N with an average rate of decline of 6 species per ha per 1° increase in latitude, after which the decline diminished and stabilized at 4 species per ha at 50° N. In the Southern Hemisphere, tree species richness declined from the equator to 25° S on average by 3 species per ha per 1° increase in latitude, after which tree species richness fluctuated before another steep drop from 25 species per ha (43° S) to 4 species per ha (50° S). We were able to detect and map regional patterns and global peaks of tree species diversity, with a high spatial resolution (0.025° × 0.025°). The Amazonian, Southeast Asian and Melanesian rainforests are the regions with the greatest local tree species richness worldwide, containing >200 tree species per ha above the 5 cm diameter-at-breast-height (DBH) threshold, confirming previous findings 17,18 . Tropical African rainforests generally contain 50% fewer tree species per hectare than Amazonian rainforests. In the temperate forests of the Northern Hemisphere, the Changbai Mountains in Northeast Asia (up to ~28 species per ha) and the

Co-limitation towards lower latitudes shapes global forest diversity gradients
The latitudinal diversity gradient (LDG) is one of the most recognized global patterns of species richness exhibited across a wide range of taxa. Numerous hypotheses have been proposed in the past two centuries to explain LDG, but rigorous tests of the drivers of LDGs have been limited by a lack of high-quality global species richness data. Here we produce a high-resolution (0.025° × 0.025°) map of local tree species richness using a global forest inventory database with individual tree information and local biophysical characteristics from ~1.3 million sample plots. We then quantify drivers of local tree species richness patterns across latitudes. Generally, annual mean temperature was a dominant predictor of tree species richness, which is most consistent with the metabolic theory of biodiversity (MTB). However, MTB underestimated LDG in the tropics, where high species richness was also moderated by topographic, soil and anthropogenic factors operating at local scales. Given that local landscape variables operate synergistically with bioclimatic factors in shaping the global LDG pattern, we suggest that MTB be extended to account for co-limitation by subordinate drivers.
Central Appalachian forests in the Eastern United States (up to ~20 species per ha) display high local species richness. In the Southern Hemisphere, the sclerophyllous and Nothofagus-dominated forests in south-central Chile are among the most species-rich temperate communities (up to 50 species per ha). Boreal forest communities are consistently low in local tree species richness, with typically five or fewer tree species per hectare.
The above LDG pattern of tree species richness was generally consistent with the metabolic theory of biodiversity (MTB) 19,20 , except at low latitudes (Fig. 5). According to MTB, environmental temperature is largely responsible for the generation and maintenance of biodiversity 12,21,22 , and the natural logarithm of species richness is linearly associated with 1,000 T −1 , where T is the absolute environmental temperature in Kelvin (mean annual temperature +273.15 K), with a slope ranging from −7.5 K to −9.0 K. Our global tree species richness gradient was largely consistent with MTB, with a slope of −8.0 K (P < 0.001) and a coefficient of determination of 0.82 (Metabolic Theory of Biodiversity in Methods), indicating that environmental temperature is generally a good predictor of LDG. However, at low latitudes, MTB substantially underestimated LDG.
In fact, near the equator where the actual LDG peaked (98 species per ha), observed tree species richness was almost twice as high as predicted by MTB (56 species per ha) (Fig. 4a). Our results suggest that within this low latitudinal range, other factors are also important to the maintenance of biodiversity.
The under-estimation of local tree species richness by MTB at low latitudes is attributable, in part, to the lack of a definite dominant environmental factor, suggesting a co-limitation of multiple subordinate drivers at low latitudes (Fig. 5). In general, bioclimatic factors predominantly determined species richness in 82.6% of the forested areas, while co-limitation (that is, absence of any dominating factor) occurred in 11.7% of forested areas globally. However, in the low-latitude range between 5° N and 15° S, the percentage area of co-limitation increased to 37.1%, more than three times the global average. Furthermore, forested areas under co-limitation contained on average 81.1 ± 0.1 species per ha, much higher than the average local tree species richness of forested areas predominantly determined by topographic (43.9 ± 0.1), anthropogenic (35.6 ± 0.2), soil (33.9 ± 0.2) and bioclimatic (19.4 ± 0.02) factors (Fig. 5b). This suggests that the pattern of co-limitation is pervasive  1) were standardized into a global tree-level dataframe and aggregated into a global species abundance matrix. On the basis of plot locations, we merged the abundance matrix with 47 explanatory variables (Fig. 3) into a standardized plot-level dataframe. Stage 2: We compared three candidate models (rF, random forests; XGb, XGboost; OLS, ordinary least squares) trained from the Phase I plot-level dataframe, using random and spatial cross validation based on Phase I data and post-sample validation based on Phase II data (extended Data Fig. 2). the final model was then selected and re-calibrated with both Phase I and Phase II data. Stage 3: Using the final model, we standardized and mapped local tree species richness per hectare across the global forest range. On the basis of this globally continuous map, we quantified the associated LDG ( Fig. 4a) and tested for the Mtb (Fig. 2). We further developed the global map of co-limitation ( Fig. 5a) based on model sensitivity analysis and quantified the contribution of key factors to local species richness patterns using variance partitioning (Fig. 6). Dotted boxes represent processes or models and dashed ones represent data or results. Methods provide details.
in species-rich tropical forests. In South America, transitional areas between Amazonia and savanna formations nearby are subject to co-limitation that is partly attributable to a dynamic equilibrium between closed forest and savanna 23 , edaphic conditions and natural fire regimes 24 . In Africa, anthropogenic influences such as selective timber extraction and fuelwood collection, together with large-scale degradation 25 affect local tree species richness ( Fig. 5 and Extended Data Fig. 7). In Central Africa, the evolution of anthropogenic influences from prehistoric to present times has imposed a substantial effect on species diversity 26 and resulted in the development of a complex system of mixes with light-demanding and old-growth tree species.
Bioclimatic dominance and co-limitation. In addition to an overall positive response of local tree species richness to the rise of annual mean temperature (partial dependence plot of C1 in Fig. 3 and Extended Data Fig. 3), the importance of environmental temperature (2.7%) was topped by the total annual precipitation (C12, 7.6%) (Fig. 3). Our findings are consistent with previous discoveries of a joint role of water and temperature/energy-as a proxy for net primary productivity 27 -on plant species richness, with water dominating particularly at warmer, lower latitudes 22,28 . Predicted tree species richness accelerated exponentially with temperature and rainfall, although independently, as shown in the cold-dry quadrant and the convex contours of the 2D partial dependence plot (Extended Data Fig. 3), until each has reached its respective threshold (1,500 mm for total annual precipitation and 10 °C for annual mean temperature). Beyond one of these thresholds, species richness is limited only by the predictor below its threshold (that is, by annual mean temperature in the cold-wet quadrant or by annual precipitation in the hot-dry quadrant). When both predictors have reached their thresholds, that is, in the hot-wet quadrant, co-limitation predominates in most tropical forests. Net primary productivity in the tropics, thus, requires co-limitation of other factors besides only temperature and rainfall 29 . As the response of carbon flux mirror the low-latitude co-limitation pattern for tree species diversity, the matching determinants for both diversity and productivity may explain the similar latitudinal gradient in productivity and the positive diversity-productivity relationship 30,31 .
Our findings also indicate that under climate change, intensified droughts coupled with increased annual mean temperature 32 can potentially trigger declines of tree species richness, although possible increases in water-use efficiency from elevated CO 2 and the dominance of highly contingent co-limiting factors may partially buffer this effect in the tropics 33 .
Here we articulate evidence for co-limitation in LDG. Resource co-limitation is a common concept in ecology (for example, refs. 34,35 ), often used to describe how the synergistic interactions of two or more factors limit ecological productivity 36 . Our use of the term co-limitation emphasizes the reduced effect of bioclimate on tree species richness at low latitudes, although bioclimate is the globally predominant driver of species richness, recognizing that several local subordinate factors synergistically contribute to increased tree species richness in this latitudinal range. We thus argue that the inclusion of co-limitation could substantially improve the explanatory power of biodiversity models in estimating alpha diversity by considering multiple subordinate factors where single-factor dominance is lacking, especially in the tropics. At high latitudes, bioclimatic conditions, particularly environmental temperature, are the major limiting factors and thus the dominant drivers of tree species diversity. As the latitude declines, the influence of bioclimatic conditions dwindles and the maintenance of tree species richness is moderated by many interacting drivers without a clear dominance, which is especially well expressed between 5° N and 15° S (Fig. 5). This prevalence of co-limiting factors is thus not a mere coincidence as to why the observed LDG at low latitudes is almost double that predicted by MTB (Fig. 2). While each of the existing hypotheses underpinning LDG addresses a certain process 10,12 (for example, selection, drift, dispersal or speciation), the evidence of co-limitation highlights synergistic interactions of local processes across the latitudinal gradient.

Concluding remarks.
More research is needed to fully elucidate patterns of LDG driven by climatic and other influences, especially those outlined in competing hypotheses. First, our analyses lack explicit consideration of some evolutionary, ecological and historical factors. These include mid-domain stochastic effects 37 , the legacies of the poleward expansion of tree species after the Last Glacial Maximum 38,39 and recent human land use/management. Alternative hypotheses, such as niche conservatism or climatic history, are more difficult to test due to data limitations. In addition, long-term effects at geological and millennial time scales also play a role, but it is difficult to disentangle these effects due to collinearity 40 . A major source of uncertainty in our results (Fig. 4b) came from an uneven sample coverage between developed and developing countries (Extended Data Fig. 1). To address this gap, we argue that there needs to be a shared responsibility among forestry agencies at various levels of government, scientists, indigenous communities and other biodiversity monitoring groups to improve sample coverage of forest inventories in developing countries. Innovative biodiversity funding mechanisms, for example, forest inventories funded by carbon initiatives such as REDD+, should be incorporated into a comprehensive global forest biodiversity database. Meanwhile, the severe shortage of experts and database management infrastructures, especially in developing countries, poses another major challenge to address this gap 41 . The education and training of new generations of forest scientists, taxonomists and foresters can bring tangible benefits to biodiversity monitoring while improving local economies as well.
Considering co-limitation in addition to MTB enables a refined description of the biogeographic distribution of biodiversity and mechanisms underlying LDG. Our analysis has resulted in the production of a high-resolution map of tree species richness across the global forest range, along with visuals of those factors responsible for the moderation of local tree species richness. Such tools are necessary for conservation management which requires assessments of factors responsible for biodiversity patterns at multiple scales that matter-from local and regional to global scales. Patterns of local tree species richness and associated drivers may provide insights into how and why the diversity of other forest flora, fauna and microbes 42,43 vary across space and time. Furthermore, the high-resolution map of local tree species richness presented here provides a benchmark for evaluating the impact of biodiversity loss on the productivity and functioning of forest ecosystems 31,44 . Finally, aligned with current international calls for spatially explicit monitoring of ecosystem attributes 45 , this study delivers detailed biogeographic information to support international endeavours 46 focused on valuing natural capital and advancing global conservation.

Methods
As illustrated in Fig. 1, we conducted data analyses and modelling in three stages.

Stage 1.
For this study, we compiled individual in situ tree data from all the regional and national GFBi forest inventory datasets (Supplementary Table 2) into a standardized GFBi dataframe, that is, the GFBi tree list. In this standardized GFBi dataframe, each row represents an individual tree, and columns represent nine key tree-and plot-level attributes. These attributes are tree ID (FID), a unique number assigned to each individual tree; plot ID (PLT), a unique string assigned to each plot; plot coordinates (LAT and LON); tree species name (SPCD); DBH or above buttress; year of measurement; and dataset name (DSN), a unique number assigned to each forest inventory dataset (Supplementary Table 2). With a total of 56 million trees surveyed, GFBi individual-based dataframe represents 1/50,000 of the approximately 2.7 trillion trees 47 worldwide. Because all trees in each sample plot were identified and measured, GFBi data make it possible to quantify forest community structure, composition and species distribution.
To ensure consistency and maximum accuracy in species names, we standardized observations from different forest inventory datasets with the following protocol. First, all multi-stem trees were divided so that each stem represents an individual tree. The scientific names were extracted from original datasets, keeping only the genus and species (authority names were removed). Next, all the species names were compiled into five general species lists, one for each continent. We verified individual species names against 23 online taxonomic databases or web application programming interfaces (API) using the gnr_resolve() function from the 'taxize' package 48 of R 49 . We then manually verified and corrected all the names that did not match with the majority of the online taxonomic databases, that is, the names with a matching score lower than 0.9. For individuals denoted by morphospecies, we assigned each a unique name comprising the genus name and a unique species code. The unique species code consisted of the string 'spp' , plus the dataset name followed by a unique number denoting if two individuals belong to the same species. For example, 'Aidia sppCDi1' and 'Aidia sppCDi2' represented two different species under the genus 'Aidia' , and both species have been observed in a forest inventory of the Democratic Republic of   Min., minimum; temp., temperature; Precip., precipitation; t., temperature; Pot., potential; Max., maximum; dbh, diameter at breast height; part. deriv., partial derivative; topo., topographic; HF, human footprint; el., electrical.
the Congo named 'CDi' . To maximize our species coverage, a tree was defined in this study as a perennial plant with an elongated woody stem that supports branches and leaves, including woody angiosperms, gymnosperms and taller palms (Arecaceae). Tree ferns (Cyatheales) and bamboos (Bambusoideae) were excluded from our analysis. From the GFBi individual tree-level dataframe, we derived a global species abundance matrix. The global species abundance matrix consisted of the number of individuals by species (column vectors) within individual sample plots (row vectors). The global species abundance matrix consisted of two complementary datasets: Phase I dataset contained 1,255,444 sample plots and Phase II dataset contained 22,131 sample plots, most of which are located in unsampled and under-sampled regions of Phase I dataset. Phase I sample plots cover 394 ecoregions across the world, and Phase II sample plots cover an additional 30 ecoregions in Africa, South America, Southeast Asia, Mexico, India and Japan. Together, our ground-based forest sample plots cover 424 of 435 (97.5%) forested ecoregions across the world. The global species abundance matrix contains ~1.3 million rows (plots) by 32,608 columns (species). Key plot-level information was added to the matrix, including PLT, DSN, plot coordinates, basal area (B), the total cross-sectional areas (m 2 ) of living trees per ha calculated from DBH and TPH (expansion factor) and the year of measurement. TPH denotes the number of trees per hectare represented by each sampled individual. It ranged from 1 to 5,244 across the GFBi data, with a mean of 48 trees per ha.
We quantified for each sample plot tree species richness (S), which is the total number of tree species in a community. Due to the difference in plot size (standard deviation = 0.09 ha) and threshold DBH values (standard deviation = 2.52 cm) across GFBi sample plots, we developed machine learning models to standardize tree species richness for a common basis of 1 ha in area and 5 cm in threshold DBH. The models incorporated both plot area (A) and threshold DBH (D) as predictors to account for the underlying species-area relationship 50-52 and species-individual size distribution 53 in a rarefaction-based approach 54 . This standardization approach justifies compiling direct tree species diversity estimates from GFBi in situ data of different sources and sampling protocols [55][56][57] , an issue highlighted in earlier large-scale-although less extensive-forest biodiversity studies 57,58 . To evaluate the accuracy of this standardization approach, we tested the machine learning models using cross-sample validation and compared our global maps of estimated tree species diversity against other standardization approaches based on sample completeness (Model Evaluation below).
The machine learning models employed 47 environmental covariates to predict tree species richness. These covariates, derived from satellite-based remote sensing and ground-based survey data, can be summarized into five general categories: bioclimatic (for example, annual mean temperature, total annual precipitation, potential evapotranspiration and indexed annual aridity); soil (bulk density, pH, electrical conductivity, C/N ratio and total nitrogen); topographic, including elevation, slope, aspect and terrain features; vegetation and survey attributes  Fig. 4 | estimated tree species richness per hectare in forested areas worldwide. a, tree species richness per hectare was first derived for the ~1.3 million GFbi plots across the world and then imputed to the global forest extent. top left, curves (solid lines representing the mean and shaded areas representing the standard deviation of the mean) represent the observed LDG (black) of tree species diversity in comparison with LDG (red) predicted by the Mtb based on local mean annual temperatures (Fig. 2a). b, Width of the 95% confidence interval (c.I.) for the estimated tree species richness per hectare. All map layers are displayed at a 0.025° × 0.025° resolution with an equirectangular projection (Plate-carrée) for better illustration of the latitudinal gradients.
(plot size, basal area, threshold diameter and percent forest canopy cover); and anthropogenic variables (human footprint, roadless areas and size of protected areas) (Supplementary Table 1). We extracted all geospatial covariate values from raster datasets to point locations of GFBi plots using ArcMap 10.3 (ref. 59 ) and R 3.4.1 (ref. 49 ), to build a standardized plot-level dataframe.

Stage 2.
We trained random forests (RF) 60 , an ensemble learning method that detects general trends present in the data using a multitude of decision trees, to estimate standardized community-level tree species diversity. The RF algorithm applies the general technique of bootstrap aggregating (bagging) with a modified tree learning algorithm that selects at each candidate split in the learning process a random subset of the features (that is, feature bagging). Because a random subset of variables is chosen for each tree, the RF algorithm based on bagged tree ensembles avoids overfitting 60 and mitigates the multicollinearity issue 61 posed by high correlations between some of the predictors variables (Fig. 3). Using subsamples of GFBi data as the training set (that is, training dataframe) with response S, bagging repeatedly for B times selects a random sample with replacement of the training set and trains a regression tree f b . After training, RF can predict for unseen samples Xʹ with the response variable S being tree species richness per ha: For rigorous model evaluation, we employed three very different cross validation approaches: randomized cross validation (RCV), spatial cross validation (SCV) and post-sample validation (PSV). In RCV, a model was trained for each continent with a random subsample that accounted for 90% of the training data from that continent, and the remaining 10% of the training data were used as the testing set. This process was repeated 20 times with sample replacement to examine the accuracy of estimated tree species diversity values. In SCV, all sample data from an ecoregion 62 were reserved for testing the model that was trained with the remaining samples from the larger continent within which the ecoregion is situated. We decided to use ecoregions as spatial blocks because (1) unlike political units such as countries and provinces, ecoregions are delineated based on ecological and bioclimatic conditions; and (2) with a total of ~700 terrestrial ecoregions across the world, each ecoregion encompasses 1,800 sample plots on average, which is a large enough sample size for training RF models. This process was repeated until all the forested ecoregions across the world had been tested. SCV was more rigorous than RCV because samples from an entire ecoregion rather than random samples were withheld for validation. PSV was the most rigorous among the three validation processes. For PSV, we have collated an independent sample dataset from 22,131 forest sample plots, which we named Phase II sample plots to highlight their independence from the original GFBi dataset (that is, Phase I sample plots). In PSV, we used Phase II data as the testing set to evaluate the accuracy of the predictive models that were trained for each continent with the Phase I data.
Using these three cross validation processes, we also evaluated the performance of the RF model against two other predictive models, including multiple regression with ordinary least squares (OLS) and Extreme Gradient Boosting (XGBoost). For each model, we derived predicted values of tree species richness of the testing sets and compared these predicted values against observed data using mean absolute error, root-mean-squared error (RMSE), and coefficient of determination (R 2 ) (ref. 63 ). The process was repeated 20 times to select the best model for each continent.
The OLS model estimated values of standardized point diversity for non-sampled point location s, based on spatially explicit values of covariates: where Y(s) is tree species richness at location s; X a design matrix for the predictor variables at location s; α is a vector of coefficients; and e is a random vector  Fig. 5 | Dominant drivers of tree species richness in forested areas worldwide. a, Driver dominance was derived for each pixel from four driver categories (that is, bioclimatic, topographic, anthropogenic and soil) and co-limitation, which represents a lack of clear dominance among the four foregoing categories. the pixel-level drivers were then aggregated by 0.5° latitudinal bins to show the percentage prevalence of dominant drivers by latitude (top left). b, the violin charts show the kernel probability density of tree species richness per ha for different drivers. Inside boxes indicate the median (line in the centre) and interquartile range (bounds of boxes). the numbers on top of the violin charts indicate the percentage of forested pixels globally that corresponds to each driver category. the red line represents the mean and 95% confidence interval of tree species richness per ha (81.1 ± 0.1) for all the 0.025° × 0.025° pixels of co-limitation. the vertical axis is on a logarithmic scale for better illustration.
following a Gaussian probability density function, with an expected value of zero and variance of σ 2 . Spatial autocorrelation 64 was not accounted for here due to computational limitations. GFBi data collected from sample plots of various sizes were harmonized to represent local forest community populations per ha using the expansion factor 65 , and we used the standardized species richness per ha values for the response variables. We fit a model (2) for each continent. To mitigate the multicollinearity issue 66 , we selected for the OLS model the best subset of predictor variables for each continent from the predictor variables used in the RF models, using step-wise regression and Akaike information criterion 67 . XGBoost is a scalable machine learning system 68 that implements the gradient boosting decision tree algorithm 69 . With this ensemble technique, an initial model was trained, with new models added sequentially to correct for errors made by each existing model until no further improvements could be made. Then, new and initial models were merged to make a final prediction that minimized errors. With its algorithm engineered for efficiency in computing time and memory resources, XGBoost is widely used by data scientists to achieve state-of-the-art results on a number of machine learning challenges 68 . In this study, the XGBoost model estimated tree species diversity values in three steps. First, an initial model F 0 was defined to predict the target variable Y. This model was associated with a residual (Y -F 0 ). Second, a new model h 1 was fit to the residuals from the previous step, and F 0 and h 1 were combined to form the boosted model F 1 : of which the mean squared error was lower than that from F 0 . Finally, to improve the performance of F 1 , we modelled after the residuals of F 1 to create a new model, F 2 , and repeated it for m iterations until the mean squared error converged: Before training RF and XGBoost models, we fine tuned four key hyper-parameters, two for each model. Using 20 bootstrapping iterations on random training sets consisting of 90% of the samples, we first evaluated the sensitivity of RMSE of the testing sets (consisting of the remaining 10% of the samples) to the number of trees to grow and the number of variables randomly sampled as candidates at each split for the RF model and selected the optimal hyper-parameter values (Extended Data Fig. 5). Similarly, we selected the optimal values of the maximum number of boosting iterations (that is, number of rounds) and the maximum depth of a tree for the XGBoost model (Extended Data Fig. 6). As a result, we obtained a preliminary RF model. Because the RF model emerged as the most accurate model from all three cross validation processes (Extended Data Fig. 2), we selected the RF as the final model, and re-calibrated the final RF model using all the sample data (Phase I and Phase II data).

Stage 3. Global map of local tree species richness.
To map community-level tree species richness over the global forest range, we first derived the global forest range map from version 1.3 of the Global Forest Change database 70 (years 2000-2015). To ensure consistency with the definition of forest 71 by the Food and Agriculture Organization of the United Nations (FAO), the global forest range in this study was defined as forested areas with ≥10% tree crown coverage per unit area. The tiled 'treecover2000' , 'loss' and 'gain' datasets were integrated to obtain current forest cover estimates for the year 2015. To minimize processing artefacts, the ~1 arcsecond spatial resolution tiles were spatially aggregated to an even multiple of their native resolution that approximated the resolution of our covariates. The datasets were then converted to vector point files before being reconverted to raster format with the exact resolution and origin of our covariates. After mosaicking each set of tiles, we computed 'treecover' (scaled) -'loss' + 'gain' to obtain the 2015 global forest cover, represented as percent forest cover per ~30 arcsecond pixel. Artefacts in the original data led to 0.08% of all terrestrial pixels having forest cover estimates greater than 100% and 1.9% of terrestrial pixels having estimates less than 0%. These values were truncated to 100% and inflated to 0%, respectively. Finally, the global forest range consisted of those pixels with a percent forest cover ≥10% in 2015. In total, each map consisted of 9,944,908 pixels of 0.025° × 0.025° (hereafter, the pixel) of forested areas. This range is rather conservative and potentially underestimates many remnant forests in drylands and grasslands 72 . We then estimated tree species richness at a 1 ha scale for all pixels within a continent based on the final RF model trained for that continent, using both Phase I and Phase II data. Spatially explicit local environmental covariate data across the global forest range were used for the imputation, except that plot size and threshold DBH were set as 1 ha and 5 cm, respectively. For ecoregions with extremely low sample coverage, we further fine tuned the RF model using samples of similar environment characteristics from other continents. More specifically, we first identified two ecoregions of extremely low sample coverage, that is, the temperate forests in South America and the tropical forests in Oceania, as there were fewer than 1,000 sample plots for the entire biome on those continents. We then trained a new RF model for each ecoregion, using all the sample data from the same biome across the world and fine tuned the mapping data for that ecoregion using the biome-specific RF model.
We computed and mapped the width of the 95% confidence interval for our local estimates of tree species richness per ha across the global forest range. To this end, we employed a rigorous spatial-block approach, analogous to the spatial cross validation, to derive the 95% confidence interval. More specifically, we computed the width of the 95% confidence interval for each 0.025° × 0.025° mapping pixel by ecoregion. For a pixel p in ecoregion e, we trained 20 RF models using random subsamples that accounted for 90% of the training data from the same continent, which included all samples except those from ecoregion e. We then derived the standard error and the width of the 95% confidence interval for this pixel p in ecoregion e from the predictions of the 20 RF models trained for this ecoregion. This process was repeated until all the forested ecoregions across the world had been assessed and mapped.
Uncertainty in our global diversity estimates was caused by two types of error. The first was measurement error from in situ forest inventories. We mitigated this type of error by implementing stringent species name check and data standardization protocols (Stage 1 Data Standardization). The second arose from the imputation process to map tree species diversity. We minimized this type of error using the three cross validation approaches introduced in Stage 2.
Metabolic theory of biodiversity. Using the global standardized tree species richness values predicted from the final RF models, we quantified the global LDG of tree species richness and tested the effect of environmental temperature based on the MTB 19 : where S represents species richness and T env here represents absolute environmental temperature (mean annual temperature +273.15 K); α and β represent coefficients to be estimated by ordinary least squares. According to both the original and extended MTB 19,20 , the slope α is expected to range between −7.5 K and −9.0 K, under the assumption that tree community abundance per area does not vary with latitude.
Variance partitioning. We used variance partitioning 73 based on the sample data from ~1.3 million plots to quantify the unique and joint fractions of spatial variance in tree species richness explained by environmental factors and latitude. Due to the correlation between species and environment and between the spatially explicit environmental factors, the variance partitioning approach mitigates type I error inflated by spatial autocorrelation 74 where n − n A and n C − n A stand for the degree of freedom for the full model and the difference in the degrees of freedom between the full model and the reduced model II, respectively. The significance of bioclimatic factors, with the effect of latitude being controlled, was tested in an one-tailed F-test by comparing RSS of model (A) and model (B): where n B − n A = 21 stands for the difference in the degrees of freedom between the full model and the reduced model I. We partitioned the spatial variance in observed species richness into four components: a represents the fraction of variance uniquely explained by environmental factors (that is, bioclimatic, topographic, anthropogenic and soil variables), after latitudinal effects have been taken into account; b represents the fraction of variance jointly explained by environmental factors and latitudinal effects; c represents the fraction of variance explained by latitudinal effects after removing environmental effects; and d represents the fraction of variance not explained by the full RF model. Then, the total fraction of variance explained by both environmental factors and latitude was a + b + c, the fraction of variance explained by environmental factors was a + b, and the fraction of variance explained by latitude was b + c. Components a + b + c, a + b and b + c were estimated by the R 2 statistics from the RF models trained for each continent using all factors, environmental factors and latitude, respectively (Stage 2 Model Training and Evaluation). Components a, b and c were computed from the previous components using arithmetic relationships that ensure that Model sensitivity. Based on the final RF models and sample data from ~1.3 million plots, we mapped the dominant drivers of tree species richness with a 0.025° × 0.025° resolution (that is, global map of co-limitation), following a standard procedure for model sensitivity analysis 75 : Step 1: using the full RF model, and the values of environmental factors X(s) specific to a 0.025°-pixel s, we had already estimated local tree species richness S full (s): where f() represents the RF model, and X(s) environmental factors in four categories, namely E1: bioclimatic, E2: topographic, E3: anthropogenic and E4: soil.
Step 2: for the above-mentioned pixel, we estimated a new local tree species richness value S−E1 (s), using a reduced RF model in which all E1 (bioclimatic) variables were removed: where f −E1 () represents the RF model trained with all but 21 bioclimatic variables, and (X − E1) (s) encompassed environmental factors in three categories, namely E2: topographic, E3: anthropogenic and E4: soil.
Step 3: for a given pixel, we calculated the relative sensitivity of predicted species richness to E1: Step 4: we repeated steps 2 and 3 to calculate, for a given pixel, the relative sensitivity of each of the following categories (that is, E2: topographic, E3: anthropogenic and E4: soil), respectively. The dominant driver (that is, limiting factor) for this pixel was then the category with the highest relative sensitivity, provided that this relative sensitivity was greater than or equal to 0.2.
Step 5: if the relative sensitivities were less than 0.2 for all categories, we considered that this was a scenario of joint effects of multiple categories of factors (that is, co-limitation), rather than dominance of a single category. Where clear dominance of a single category was lacking, we denoted the dominant driver of this pixel as 'E5: co-limitation. ' Step 6: we repeated the steps above to calculate, for all the remaining pixels of the global grid, the relative sensitivity of each of the five categories of environmental factors, namely E1: bioclimatic, E2: topographic, E3: anthropogenic, E4: soil and E5: co-limitation. On the basis of these values, we created a wall-to-wall map of dominant drivers of tree species richness across the global forest range by labelling the category with the highest relative sensitivity for each pixel (Fig. 5a).
Step 7: on the basis of the relative sensitivity obtained from the steps 1-6, we computed percent prevalence (0-100%) of bioclimatic, topographic, anthropogenic and soil factors and a lack of dominance (co-limitation) in all the forested pixels along each latitudinal band.
Inclusion and ethics statement. The international research collaboration leading to this research paper was conducted via Science-i.org, a transparent and FAIR (Findable, Accessible, Interoperable and Reusable) web platform for international research collaboration. Through this platform and our partner initiatives including the Global Forest Biodiversity Initiative (GFBI), we pursue excellence and high standards of performance, professionalism and ethical conduct. Science-i strictly prohibits any form of discrimination against individuals on the basis of gender, race, age, religion, sexual orientation, veteran status or disability status. Science-i continuously seeks and encourages underrepresented and underprivileged people and groups and the unique voices in global scientific research collaboration.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The global map of tree species richness is available under license CC BY 4.0, with the identifier: 10.6084/m9.figshare.17232491. This map can be downloaded in two formats. One is a geoTIFF file (S_mean_raster.tif) containing the fully geo-referenced map of tree species richness worldwide at a 0.025° × 0.025° resolution. The other is a comma-separated file (S_mean_grid.csv) with the following attributes: S is local average tree species richness per ha and x, y are centroid coordinates of all 0.025° × 0.025° pixels. The global map of co-limitation is available under license CC BY 4.0, with the identifier: 10.6084/ m9.figshare.17234339. The metadata of the entire training dataframeincluding the characteristics and references of all the in situ Phase I and Phase II datasets and the definitions, units and summary statistics of the environmental covariates-is available under license CC BY 4.0 with the identifier: 10.6084/ m9.figshare.19733449.v1. The public version of the training dataframe, including the plot-level species richness and all the covariates, which is needed to reproduce the models and results presented here, is available at: https://doi.org/10.6084/ m9.figshare.20055488. The maps and dataframe are also available on the international web research platform: Science-i (https://science-i.org/). Raw forest inventory data are commonly subject to a wide array of confidentiality clauses in regard to open access policies. Despite recent efforts to make some of these data fully open 76,77 , some governments and private data owners, especially those from the developing countries generally have decided to keep their data confidential. This decision is based on well-founded arguments to protect certain trees or forests (because of their large size or protected taxonomic status) from illegal logging or trespassing and to protect landowners' privacy against the misuse of plot information such as the geographic coordinates. The sensitive information in the training dataframe, including the plot coordinates and tree-level information, will be available from the corresponding author (albeca.liang@gmail.com) upon a request via Science-i (https://science-i.org/) or GFBI (https://www.gfbinitiative. org/) and an approval from data contributors.

Code availability
All the models in this study were constructed using command line applications written in the R programming language, which processed and restructured the input data, trained the model and performed cross validation. Due to the massive amount of data, we used Purdue University's Brown supercomputing cluster to accelerate the training process. The development of the GFBi database, tabular data cleaning, creation of species abundance matrices, evaluation of diversity determinants and geostatistical imputation were conducted in R 49 (v.3.4.2) through the use of several Linux-based high-performance computing resources at Purdue University and a custom HPC interface developed using Amazon Web Services, each designed for batch processing, scalable resource distribution, embarrassingly parallel computations and/or large RAM jobs. Compute nodes with up to 1 TB of RAM and clusters of up to 64 nodes were employed in this study. Portions of the covariate preparation, mapping and quality control assessment were conducted on Windows-based operating systems with up to 128 GB of RAM. Final continental-level RF models and the R codes we developed to train the models are available under license MIT with the identifier: 10.6084/ m9.figshare.17234729.

acknowledgements
The team collaboration and manuscript development are supported by the web-based team science platform: science-i.org, with the project number 202205GFB2. We thank the following initiatives, agencies, teams and individuals for data collection and other technical support: the Global Forest Biodiversity Initiative (GFBI) for establishing the data standards and collaborative framework; United States Department of Agriculture, Forest Service, Forest Inventory and Analysis (FIA) Program; University of Alaska Fairbanks; the SODEFOR, Ivory Coast; University Félix Houphouët-Boigny (UFHB, Ivory Coast); the Queensland Herbarium and past Queensland Government Forestry and Natural Resource Management departments and staff for data collection for over seven decades; and the National Forestry Commission of Mexico (CONAFOR). We thank M. Baker (Carbon Tanzania), together with a team of field assistants (Valentine and Lawrence); all persons who made the Third Spanish Forest Inventory possible, especially the main coordinator, J. A

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code Data collection No software was used

Data analysis
The development of the GFBi database, tabular data cleaning, creation of species abundance matrices, evaluation of diversity determinants, and geostatistical imputation were conducted in R49 (v.3.4.2) through the use of several Linux-based high-performance computing (HPC) resources at Purdue University, and a custom HPC interface developed using Amazon Web Services, each designed for batch processing, scalable resource distribution, embarrassingly parallel computations, and/or large RAM jobs. Compute nodes with up to 1TB of RAM and clusters of up to 64 nodes were employed in this study. Portions of the covariate preparation, mapping, and quality control assessment were conducted on Windows-based operating systems with up to 128 GB of RAM. Final continental-level RF models and the R codes we developed to train the models are available under license MIT, with the identifier: 10.6084/m9.figshare.17234729.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

March 2021
Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A description of any restrictions on data availability -For clinical datasets or third party data, please ensure that the statement adheres to our policy Data availability • The global map of tree species richness is available under license CC BY 4.0, with the identifier: 10.6084/m9.figshare.17232491. This map can be downloaded in two formats. One is a geoTIFF file (S_mean_raster.tif) containing the fully geo-referenced map of tree species richness worldwide at a 0.025°×0.025° resolution. The other is a comma-separated file (S_mean_grid.csv) with the following attributes: S is local average tree species richness per hectare x, y are centroid coordinates of all 0.025°×0.025° pixels; • The global map of co-limitation is available under license CC BY 4.0, with the identifier: 10.6084/m9.figshare.17234339.
• The metadata of the entire training dataframe -including the characteristics and references of all the in situ Phase-I and Phase-II datasets, as well as the definitions, units, and summary statistics of the environmental covariates -is available under license CC BY 4.0, with the identifier: 10.6084/ m9.figshare.19733449.v1 • The public version of the training dataframe including the plot-level species richness and all the covariates, which is needed to reproduce the models and results presented here, will be deposited to a public data repository Figshare upon the final acceptance of this paper. Meanwhile, we will also publish this dataframe on two international web research platforms: science-i.org, and gfbinitiative.org.
• Raw forest inventory data are commonly subject to a wide array of confidentiality clauses in regards to open access policies. Despite recent efforts to make some of these data fully open76,77, some governments and private data owners, especially those from the developing countries generally have decided to keep their data confidential. This decision is based on well-founded arguments to protect certain trees or forests (because of their large size or protected taxonomic status) from illegal logging or trespassing, and to protect landowners' privacy, against the misuse of plot information such as the geographic coordinates. The sensitive information in the training dataframe, including the plot coordinates and tree-level information, will be available from the corresponding author (albeca.liang@gmail.com) upon a request via Science-I or GFBI, and an approval from data contributors.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Ecological, evolutionary & environmental sciences study design
All studies must disclose on these points even when the disclosure is negative.

Study description
We compiled a ground-sourced forest inventory database containing ca. 55 million trees in ~1.3 million forest sample plots, based on which we developed the first global maps of tree species diversity and global map of co-limilation

Research sample
In situ measurement of ~1.3 million forest sample plots spanning across 110 countries and territories

Sampling strategy
Sampling strategy is reported separately by the 96 national, regional, and local forest inventories underlying our database. See Table  S2 for a complete list of all the forest inventories and references to the associated reports.

Data collection
Data collection is reported separately by the 96 national, regional, and local forest inventories underlying our database. See Table S2 for a complete list of all the forest inventories and references to the associated reports.
Timing and spatial scale The overall GFBi database has a global coverage.

Reproducibility
This study uses observational data from natural forest ecosystems.

Randomization
Random forests model was trained to estimate global forest tree species richness and co-limilation Blinding Blinding was not relevant to this study, because this study is based primarily on observational tree data from natural forest ecosystems.
Did the study involve field work?
Yes No