For model building, we used soil profile data from ca. 150,000 unique sites spread over all continents (Fig 3; see acknowledgments for a full list). These have been imported, cleaned and merged into a single global compilation of soil points with unique column names and IDs.
Preparing the global compilation of standardized soil training points took several months of work, much of it spent translating and cleaning up soil properties and soil classes. About 15–20% of the original soil profile data were reported only in a national classification system, e.g. the Canadian or Brazilian classification systems. Since some information is better than none, we translated national classification systems to the two international classification systems (World Reference Base, WRB, and USDA) where possible. For translation we used published correlation tables, either those reported in Krasilnikov et al. [22] or those published on agency websites; see e.g. the correlation of the Canadian Soil Taxonomy (http://sis.agr.gc.ca/cansis/taxa/) and the correlation of the Brazilian classification system (http://www.pedologiafacil.com.br/classificacao.php). We also consulted numerous local soil classification experts and requested their feedback and corrections on the online correlation tables (distributed via Google spreadsheets). Some national classification systems, such as the Australian soil classification system, are simply too different from the USDA and WRB systems to allow satisfactory correlation; data reported only in those systems were therefore not used. The full list of correlation tables is available from ISRIC’s GitHub account at https://github.com/ISRICWorldSoil.
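The table-based translation described above can be sketched as a simple lookup. This is a minimal illustration in Python, not the authors' R code: the names `Correlation`, `CORRELATION_TABLE`, `UNCORRELATED_SYSTEMS` and `translate` are hypothetical, and the two table entries are commonly cited approximate correlations used here only as placeholders, not rows copied from the published correlation tables.

```python
from typing import NamedTuple, Optional


class Correlation(NamedTuple):
    """International equivalents of one national soil class."""
    wrb: str   # World Reference Base reference soil group
    usda: str  # USDA Soil Taxonomy order


# Hypothetical, illustrative entries only; the real tables are far larger
# and were reviewed by local classification experts.
CORRELATION_TABLE = {
    ("brazil", "latossolo"): Correlation(wrb="Ferralsols", usda="Oxisols"),
    ("canada", "podzolic"): Correlation(wrb="Podzols", usda="Spodosols"),
}

# Systems judged too different from WRB/USDA to correlate (assumed encoding).
UNCORRELATED_SYSTEMS = {"australia"}


def translate(system: str, national_class: str) -> Optional[Correlation]:
    """Return the international classes, or None if no correlation exists."""
    key_system = system.strip().lower()
    if key_system in UNCORRELATED_SYSTEMS:
        return None  # such profiles are left without an international class
    key_class = national_class.strip().lower()
    return CORRELATION_TABLE.get((key_system, key_class))


if __name__ == "__main__":
    print(translate("Brazil", "Latossolo"))
    print(translate("Australia", "Vertosol"))
```

Returning `None` rather than a best guess mirrors the decision to leave out profiles whose classes cannot be correlated satisfactorily.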
Another time-consuming operation was merging laboratory measurements and field observations and harmonizing them to a standard format. In some cases missing values in the original tables had been coded as "0" values, which can seriously distort prediction models; in other cases we implemented and applied functions to locate and correct typos and other gross errors. Some variables, such as soil organic carbon, had to be derived either from soil organic matter (e.g. by dividing by 1.724) and/or by subtracting the carbon held in CaCO3 (calcium carbonate) from total carbon. Nevertheless, the majority of soil variables from the various national soil profile databases appeared to be compatible and relatively easy to merge: soil scientists across continents do measure similar things, but often express the results using different measurement units, vocabularies and standards.
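The conversions mentioned above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R functions: all function and constant names are hypothetical, every quantity is assumed to be a mass percentage, and the factor 0.12 (the mass fraction of carbon in CaCO3) is our assumption about how the carbonate correction is applied. Only the divisor 1.724 comes from the text.

```python
import math

SOM_TO_SOC_DIVISOR = 1.724        # organic matter -> organic carbon (from the text)
CARBON_FRACTION_OF_CACO3 = 0.12   # assumed: 12.01 / 100.09, mass fraction of C in CaCO3


def zero_to_missing(value, zero_means_missing):
    """Return NaN where a table used 0 (or None) to encode a missing value."""
    if value is None:
        return math.nan
    if zero_means_missing and value == 0:
        return math.nan
    return float(value)


def soc_from_som(som_percent):
    """Estimate soil organic carbon (%) from soil organic matter (%)."""
    return som_percent / SOM_TO_SOC_DIVISOR


def soc_from_total_carbon(total_c_percent, caco3_percent):
    """Estimate soil organic carbon (%) by removing carbonate carbon."""
    if math.isnan(total_c_percent) or math.isnan(caco3_percent):
        return math.nan  # cannot correct without both measurements
    inorganic_c = caco3_percent * CARBON_FRACTION_OF_CACO3
    # Measurement error can push the difference below zero; clip at 0.
    return max(total_c_percent - inorganic_c, 0.0)


if __name__ == "__main__":
    print(soc_from_som(3.448))
    print(soc_from_total_carbon(3.0, 10.0))
```

Whether a zero means "missing" has to be decided per source table, which is why `zero_to_missing` takes an explicit flag instead of guessing.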
We imported all original tables as-is and then documented all conversion functions in R scripts (available via ISRIC’s GitHub account), both to support reproducible research and so that the conversion functions can be further modified and improved in the future. The majority of the points (excluding LUCAS points and other data sets with restrictive terms of use) and the legends used for model building and for producing SoilGrids are also available for public use via ISRIC’s WoSIS Web Feature Service (http://www.isric.org/data/wosis) and/or ISRIC’s institutional GitHub account.