# The future of predictive soil mapping
*Edited by: R. A. MacMillan and T. Hengl*
## Introduction
This chapter presents some opinions and speculation about how predictive
soil mapping (PSM) may evolve, develop and improve in the near term
future. These thoughts were originally prepared for a discussion document about
whether national to provincial scale soil inventory programs in Canada
could, or indeed should, be reinvented and reinvigorated and, if so, how
this reinvention might be best designed and accomplished.
The solutions proposed for reinvigorating presently moribund soil
inventory programs in Canada were largely based on adopting new methods
and ideas associated with PSM within a new collaborative, collective and
open operational framework. These earlier thoughts were considered to be
relevant to the more general topic of the future of predictive soil
mapping (PSM). As such, the original discussion document was slightly
modified, extended and included as a chapter in this book.
This chapter addresses the following two main issues:
- What caused past national to state-level conventional
soil, and other terrestrial resource, inventory programs to atrophy and disappear globally, and can
they now be renewed and resurrected?
- How can the methods and ideas behind PSM be adopted and applied to
accomplish the goal of renewing and reviving conventional soil and
terrestrial resource inventory programs?
## Past conventional terrestrial resource inventories
### Why have most national resource inventories been discontinued?
Historically, almost all past terrestrial resource inventory agencies
were slow, expensive to maintain and failed to produce complete,
consistent, current and correct map coverage (4Cs) for entire areas of
political jurisdiction or interest. Most national agencies were unable
to completely map an entire administrative area affordably at any useful
scale using a single consistent methodology applied over a relatively
short time span to produce a single wall-to-wall map product. Instead,
almost all past inventory efforts have been piecemeal and incomplete.
This resulted in what we consider to be *“the embarrassment of the index
map”*. Virtually every jurisdiction produced an index map to illustrate
which parts of the jurisdiction had been mapped at all, which different
mapping methods were used to map different parts, which eras or years
each bit of mapping represented and what scale of mapping had been
carried out in any part. This index map was often proudly displayed and
circulated to illustrate how much progress had been made towards mapping
an entire jurisdiction of interest. In actual fact, the index map
represented a powerful demonstration of all that was wrong with mapping
and mapping progress in that jurisdiction.
The first thing that index maps clearly demonstrated was that there was
no complete map coverage for the area at any scale. The second thing
highlighted was that there was no consistency in scale, methods or
legend across even areas that had been mapped. Different areas had been
mapped at different scales, over different times, using different
concepts and legends and no effort had been expended to achieve
consistency across the entire area. The third thing that would also
become immediately obvious was that, at current rates, complete mapping
of any jurisdiction would never be achieved in anyone’s lifetime. Not
particularly encouraging information to impart. And yet, every agency
maintained an index map and loved to share it.
Another significant historical misjudgement was the failure to make the
information and services provided by terrestrial inventory service
agencies critically important and absolutely necessary to at least one
essential decision making process, preferably a legally mandated one.
Where inventory agencies still survive, they have linked their products
and services intimately to one or more clearly defined, and legally
mandated, decision making processes that involve the expenditure of
considerable sums of (usually public) money. Soil survey survives in the
USA (at least for now) largely because county scale soil maps are a
critical requirement for calculating eligibility for financial support
payments for many agricultural subsidy and support payment programs. You
cannot apply for, or obtain, a subsidy payment unless you have a soil
survey map to justify your eligibility for the payment and to document
where and how the required supported activity will be implemented.
It can be argued that many terrestrial resource inventory programs
failed (and disappeared) because they viewed themselves as their own
primary customer and designed and delivered products and services meant
to satisfy their own desires and expectations and not those of real,
downstream users. They became convinced of the rightness, value and
importance of their maps and databases the way they wanted to make them
and did not effectively listen to, or respond to, criticism of these
products. Users would criticize conventional soil polygon maps and
reports filled with complicated jargon and impenetrable legends and be
dismissed as simply not being able to understand a soil map and to
appreciate how complicated and complex it was to portray the spatial
variation in soils in a simple way. Rather than trying to design and
make simpler representations of more easily understood spatial patterns,
terrestrial inventory agencies would suggest that an expert in making
the maps was required to assist users in interpretation and use of any
map.
### Is there a future for conventional terrestrial inventory programs?
We have asked ourselves, *“can conventional comprehensive soil and similar terrestrial inventory programs be saved or renewed?”*
The short answer is: probably not, at least not in their present format.
Conventional resource inventory programs have become too expensive, too
slow to deliver needed outputs and too slow to change to produce
innovative and needed products. There is now probably insufficient will,
money, demand or benefit to support continuation, or re-establishment,
of conventional, government-funded, comprehensive inventory programs as
we have known them in the past. However, that does not mean that all
needs and benefits previously provided by comprehensive inventory
programs are being met now or that they do not need to be met. There are
a large number of benefits associated with the existence of
comprehensive inventories and we ask if these may not be important to
continue to service and if they might still be provided under some newly
redesigned framework.
### Can terrestrial inventory programs be renewed and revived?
One of our key hopes (which we especially try to achieve through the OpenGeoHub
Foundation), is to contribute to a discussion of how comprehensive
terrestrial resource inventory programs (or equivalent frameworks) might
be re-imagined, re-designed, re-invented, re-implemented and renewed at
regional to national to global scales, for the next generation, and by
whom.
We consider here that we are now at a nexus where it has become possible
to address and redress many of the past inconsistencies and oversights
in terrestrial resource mapping. It is now completely feasible to aspire
to affordably and expeditiously produce new predictive maps that achieve
the 4 Cs and are:
- **Complete** (e.g. cover entire areas of interest),
- **Consistent** (e.g. are made using a single methodology, applied at a
single scale and over a single short period of time),
- **Current** (e.g. represent conditions as they are today, at a specific
moment in time),
- **Correct** (e.g. are as accurate as is possible to achieve given
available data and methods),
We consider that it is also now possible to redesign any new output maps
so that they are capable of acting directly as inputs to well
established processes and programs for planning and decision making at
national to regional to operational scales. And we consider that we have
a unique opportunity to work collaboratively with numerous actual and
potential users of spatial inventory data to ensure that new output
products directly meet their spatial data needs.
### How can terrestrial inventory programs be renewed and revived and by whom?
In light of developments in science, technology, methods of societal
interaction and new models of funding and cooperative action, we suggest
that looking back at how things were done in the past no longer provides
the most appropriate model for how inventory activities ought to be
designed and conducted in the future. We argue that it is preferable to
re-imagine an entirely new framework for cooperation, which takes
advantage of new scientific and organizational advances and within which
many of the acknowledged benefits of previous, government-funded,
programs can be delivered within a new model of cooperative, collective
action and sharing.
In this age of Facebook and Twitter and Wikipedia and Google Earth, it
is no longer the purview, or responsibility, of any single, government
funded, agency to collect, produce, maintain and distribute
comprehensive information about the spatial distribution of soils,
eco-systems, terrain units, wetlands or any other terrestrial
attributes. We believe that it should instead become a collective
responsibility, for a large variety of overlapping groups and
institutions, to create, maintain and share spatial information of
common interest. It is incumbent on these diverse interest groups to
identify mechanisms by which willing collaborators can join together to
produce, maintain and distribute basic levels of spatially distributed
land resource information jointly and collectively.
## The future of PSM: Embracing scientific and technical advances
### Overview
We consider that any new, future collaborative PSM activity should take
advantage of recent scientific and technical advances in the following
areas:
- Collection of field observations and samples:
- Collating and harmonizing existing legacy soils data,
- New field sampling designs and programs and new data collection
strategies,
- Characterization of soils in the field and in the laboratory:
- New field sensors for characterizing soils in situ,
- New faster, cheaper and more accurate methods of laboratory
analysis,
- Creation, collation and distribution of comprehensive sets of
environmental covariates:
- Introduce new covariate data sets based on new remote, air and
space sensors,
- Include new varieties and resolutions of DEM and other
environmental covariate data,
- Maximize use and relevance of existing data sets of
environmental covariates,
- Automated spatial prediction models:
- Replace previous qualitative and subjective mental models with
new quantitative and objective statistical models,
- Adopt new methods of automated space-time modelling and
prediction,
- New options for hosting, publishing, sharing and using spatial data
via cloud services:
- Develop new platforms for collaborative data sharing and
geo-publishing,
- Develop open services to deliver on-demand, real time online
mapping,
### Collection of field observations and samples
We can improve how we locate and obtain data on field observations and
measurements. These O&M field data provide the evidence that is
essential for developing all spatial prediction models and outputs.
First consider the challenges and opportunities associated with
identifying, obtaining and using existing, or legacy, field observations
and measurements.
Legacy field data refers to any field observations or measurements that
were collected in the past and that remain discoverable and available
for present use. Typically, these legacy field data consist of either
field observations and classifications made at point locations to
support the development of conventional, manually prepared maps or of
laboratory analysed samples, collected to support characterization of
point locations considered to be typical or representative of a
particular soil class or individual. Legacy field data may already be in
digital format and stored in digital databases. More often, legacy data
are found in paper reports, manuals, scientific publications and other
hard copy formats that require the data to first be transformed into
digital format and then harmonized into a standardized format before
they can be used effectively.
Legacy field data typically possess several characteristics which can
make their use for producing new inventory map products problematic.
Some common limitations of legacy field data are:
- They are rarely collected using any kind of rigorous, statistically
valid, sampling design,
- Their locations in space (geolocations) are often not measured or
reported accurately,
- Their locations in time (sampling dates) are often unknown or are
spread over decades,
- The methods used in description or analysis can vary greatly by
source, location or time,
- They can be difficult and costly to find, to obtain, to digitize and
to harmonize,
Despite these limitations, legacy field data have other attributes that
make them valuable and worth assembling, collating, harmonizing and using.
The advantages associated with using legacy field data can be summarized
as follows:
- Legacy point data provide the only source of baseline information
about past time periods:
- We can’t go back in time to collect new samples or make new
observations applicable to past time periods,
- They establish prior probabilities which are essentially
starting points that describe what we know now before we start
making new predictions and new maps using new data,
- Legacy point data are all we have initially to work with until new
field data can be obtained:
- Use of legacy field data can help us to learn and to improve
methods and approaches,
- Working through the full cycle required to produce predictive
maps using legacy data lets us learn a lot about how to do it and, more
importantly, how we might do it better the next time around,
- They give us something to work with to provide real-world,
worked examples, for ourselves and for potential users, of the
kinds of maps and other products that can now be produced
using modern automated prediction methods,
- Legacy point data help us to illustrate problems, weaknesses and
opportunities for improvement:
- Gaps in existing legacy data (missing data in space and time)
help to illustrate the need to have samples that
comprehensively cover all areas of space and time of interest,
- Errors and uncertainties in initial predictive maps based on
legacy field data provide a clear illustration of the need for
more and better field data to improve future mapping,
- The spatial distribution of uncertainties computed for initial
maps created using legacy data can identify locations where
new observations and samples are most needed and will
contribute most to improving subsequent predictions,
Legacy point data can be surprisingly difficult and costly to find,
obtain, harmonize and digitize [@arrouays2017soil]. One can only imagine how many hundreds
of thousands, even millions, of site observations may have been made by
field personnel undertaking many different types of inventories for many
different agencies over the years. Similarly, laboratories have
definitely analyzed millions of soil samples over the years for samples
collected by government agencies, private sector consulting companies,
NGOs, advocacy groups, farmers or landowners. Unfortunately, very few of
these observations or samples have survived to enter the public domain
where they can now be easily located and obtained.
In an ideal world, it would be possible to identify and obtain hundreds
of thousands to perhaps even millions of laboratory analysed results for
point locations globally. These samples surely were taken and analysed
but they no longer remain accessible. Instead, best efforts to date have
resulted in rescuing some 300,000 to 350,000 records globally for which
soil analytical data exist for geolocated point locations. What has
happened to all of the thousands to millions of other analysed samples
that were undeniably collected and analysed? Essentially they may be
considered to be lost in the mists of time, victims of lack of will and
lack of resources to support maintaining a viable archive of observation
and sample results over the years. Unfortunately, no entity or agency
had the mandate to maintain such a comprehensive global archive and no one had the
vision or resources to take on such a challenge.
The world can do a much better job of locating, harmonizing, archiving
and sharing global legacy field and laboratory data than it has done to
date [@arrouays2017soil]. It is incumbent on agencies, companies, organizations and
individuals that hold, or are aware of, collections of legacy field data
to step forward to offer to contribute such data to a comprehensive and
open repository of field observations and laboratory measurements. We
would hope that the evidence of beneficial use of legacy point data by
OpenGeoHub to produce concrete examples of needed and useful spatial
outputs would encourage entities that hold field O&M data that are not
currently publicly available to contribute them for future use by a
community of global mappers. Techniques developed by OpenGeoHub to
collate and harmonize legacy point data could be applied to any new,
previously overlooked, data sets contributed, in the future, by
interested parties.
### Collecting new field O&M data
The Africa Soil Information Service (AfSIS) project
(http://www.africasoils.net) provides a
powerful example of how new field observations and laboratory analysed
field data can be collected in a manner that is reliable, feasible and
affordable. AfSIS is one of the very few global examples of an entity
that has not accepted that collection of new field data is too difficult
and too expensive to contemplate. Instead, AfSIS asked the question *“how
can we make it feasible and affordable to collect new, high quality,
field data?”* And then AfSIS (and several partner countries) went ahead
and collected new field data using modern, robust and affordable methods
of field sampling and laboratory analysis.
Following the example of AfSIS, we can identify the following major
considerations for how the collection of new field O&M data can be made
both more affordable and more effective.
- Select locations for field sampling using a formal, rigorous
sampling design [@brown2015spatially; @stumpf2017uncertainty; @BRUS2019464],
- Design based sampling schemes:
- Random sampling,
- Stratified random sampling,
- Systematic sampling (confluence point or grid sampling),
- Nested, multi-scale hierarchical sampling,
- Spatially-based sampling,
- Model based sampling schemes:
- Conditioned Latin Hypercube (cLHC) sampling [@Malone2019PeerJ],
- Multi-stage sampling at locations of maximum uncertainty,
- Systematize and automate all field sampling and recording procedures
as much as possible,
- Create custom tools and apps to support:
- Locating sample sites and recording observations,
- Assigning unique identifier sample numbers to all locations
and samples,
- Tracking progress of samples from the field through the lab
to the database,
Adopting formal sampling designs to identify where to best collect new
field O&M samples offers several significant advantages.
Firstly, statistically valid sampling schemes ensure that the fewest
number of samples are required to achieve the most correct and
representative values to characterize any area of interest. This
minimizes field data collection costs while maximizing usefulness of the
samples. Secondly, there is rapidly growing interest in, and need for,
measuring and monitoring of changes in environmental conditions through
time (e.g. carbon sequestration or losses, fertility changes).
Quantitative statements can only be made about the accuracy of changes
in values for any given area if there is an ability to replicate those
values with a subsequent comparable sampling effort. The ability to
return to any given area at some later time to collect a second set of
statistically representative field samples is essential to any effort to
quantify and monitor changes through time. Only statistically based
sampling frameworks support repeat sampling.
Design based sampling schemes generally require little to no advance
knowledge about the patterns of spatial variation within an area to be
sampled. They are best used for situations where there is little
existing knowledge about spatial variation and where there is a need to
collect a representative sample with the fewest possible sample points.
Of the available design based options, a nested,
multi-scale sampling design based on a stratified random sampling
framework, or a spatially-based sampling design, appears most suitable.
In these nested sampling approaches, explicit attention is given to ensuring
that multiple samples are collected at a succession of point locations
with increasingly large interpoint separation distances (e.g. 1 m, 10 m,
100 m, 1 km). These multiple points support construction of
semi-variograms that quantify the amounts of variation in any attribute
that occur across different distances. Knowing how much of the total
observed variation occurs across different distances can be very helpful
for identifying and selecting the most appropriate grid resolution(s) to
use for predictive mapping. If 100% of the observed variation occurs
over distances shorter than the minimum feasible grid resolution, then
there is really no point in trying to map the attribute spatially at
that resolution. Similarly, if most of the observed variation occurs
across longer distances, there is really little point in using a very
fine resolution grid for prediction. Most past purposive sampling
undertaken for conventional inventories was not particularly well suited
to supporting geostatistics and the production of semi-variograms.
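As a simple, hedged illustration of this idea, the sketch below computes a sample semi-variogram and fits a basic model to it in R using the gstat package; the point data object `pts` and the property name `soc` are placeholders introduced only for this example.
```{r, eval=FALSE}
# Minimal sketch: quantify how much variation occurs across which distances
# ('pts' is assumed to be a data.frame with coordinates x, y in metres and a
# measured property 'soc'; package and column names are illustrative only)
library(sp)
library(gstat)

coordinates(pts) <- ~ x + y

# Empirical semi-variogram of the target property
v.emp <- variogram(soc ~ 1, pts)

# Fit a simple exponential model; initial parameters are derived automatically
v.fit <- fit.variogram(v.emp, model = vgm("Exp"))
plot(v.emp, v.fit)

# A large nugget relative to the total sill indicates that much of the
# variation occurs below the shortest sampling distance, while the fitted
# range suggests which prediction grid resolutions are worth considering
v.fit
```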
Model based sampling frameworks are recommended for situations where
there is some existing (*a-priori*) knowledge about the spatial pattern
of distribution of properties or classes of interest. Conditioned Latin
Hypercube (cLHC) sampling is based on first identifying all significant
combinations of environmental conditions that occur in an area based on
overlay and intersection of grid maps that depict the spatial
distribution of environmental covariates
[@stumpf2016incorporating; @Malone2019PeerJ]. Potential point sample
locations are then identified and selected in such a way that they
represent all significant combinations of environmental conditions in an
area. Point samples are typically selected so that the numbers of
samples taken are more or less proportional to the frequency of
occurrence of each significant combination of environmental covariates.
This ensures that samples cover the full range of combinations of
environmental conditions (e.g. the covariate space) in an area and
sample numbers are proportional to the relative extent of each major
combination of environmental conditions in an area.
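A minimal sketch of how cLHC sampling might be run in R with the clhs package follows; the data frame `covs.df` (covariate values extracted at candidate grid cells) and the sample size are assumptions made only for illustration.
```{r, eval=FALSE}
# Conditioned Latin Hypercube sampling sketch ('covs.df' has one row per
# candidate location and one column per environmental covariate)
library(clhs)

set.seed(2023)
idx <- clhs(covs.df, size = 200, iter = 10000, simple = TRUE)

# 'idx' holds the rows (candidate locations) selected so that the sample
# covers the multivariate distribution of the covariate space
sample.sites <- covs.df[idx, ]
```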
Field sampling programs can also be designed to collect new point
samples at locations of maximum uncertainty or error in a current set of
spatial predictions [@stumpf2017uncertainty]. The spatially located measures of uncertainty
computed as one output of a prediction model can be used to provide an
indication of the locations where it may be most beneficial to collect
new samples to reduce uncertainty to the maximum extent possible. This
type of sampling approach can proceed sequentially, with predictions
updated for both estimated values and computed uncertainty at all
locations after any new point sample data have been included in a new
model run. It is often not efficient to collect just one new point
sample prior to rerunning a model and updating all predictions of values
and uncertainties. So, it is often recommended to collect a series of
new point observations at a number of locations that exhibit the largest
estimates of uncertainty and then update all predictions based on this
series of new field point data. Collecting a series of new multistage
samples can be repeated as many times as is deemed necessary to achieve
some specified maximum acceptable level of uncertainty everywhere.
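The few lines of base R below sketch how such a batch of new sampling locations might be selected; the object `pred.grid`, with coordinates and a prediction-error column `pred.sd`, and the batch size are illustrative assumptions.
```{r, eval=FALSE}
# Multi-stage sampling sketch: pick the most uncertain grid cells first
n.new <- 50

# Rank grid cells by prediction uncertainty and keep the top candidates
ord <- order(pred.grid$pred.sd, decreasing = TRUE)
new.sites <- pred.grid[ord[1:n.new], c("x", "y", "pred.sd")]

# After the new observations have been collected, they are appended to the
# training data and the model is refitted, updating both the predictions and
# their uncertainty; the cycle repeats until uncertainty is acceptable everywhere
```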
Field sampling can also be made more efficient, and less expensive, by
creating and adopting more systematic and automated procedures to
support field description and sampling. Custom apps can be developed to
help to choose, and then locate, sampling points in the field rapidly
and accurately. These field apps can be extended to automate and
systematize most aspects of making and recording observations in the
field, thereby increasing speed and accuracy and reducing costs. Unique
sample numbers can be generated to automatically assign unique and
persistent identifiers to every site and to every soil sample collected
in the field. This can reduce costs and errors associated with assigning
different sample IDs at different stages in a sampling campaign (e.g.
field, lab, data entry). Persistent and unique machine readable identifiers can help to
support continuous, real-time tracking of the progress of field
descriptions and soil samples from initial collection in the field
through laboratory analysis to final collation in a soil information
system. This consistency and reliability of tracking can also improve
efficiency, decrease errors and reduce costs for field description and
laboratory analysis. Taken all together, improvements that automate and
systematize field descriptions and field sampling can make it much more
affordable and feasible to collect new field data through new field
sampling programs.
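As one small, hypothetical example, persistent identifiers can be generated programmatically at the moment a site is registered; the sketch below assumes the uuid package is available and that the site naming pattern shown is purely illustrative.
```{r, eval=FALSE}
# Sketch: generate persistent, machine-readable site and sample identifiers
library(uuid)

n.sites <- 25
# A human-readable site code (hypothetical naming pattern) ...
site.id <- sprintf("FLD-%s-%03d", format(Sys.Date(), "%Y%m%d"), 1:n.sites)
# ... paired with a universally unique identifier (UUID) per soil sample,
# which avoids collisions when samples from many campaigns are merged
sample.uuid <- replicate(n.sites, UUIDgenerate())

sample.register <- data.frame(site.id, sample.uuid)
head(sample.register)
```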
@BRUS2019464 provides a systematic overview of sampling techniques and [how
to implement them in R](https://github.com/DickBrus/TutorialSampling4DSM).
The author also recognizes that *“further research is recommended on sampling
designs for mapping with machine learning techniques, designs that are robust
against deviations of modeling assumptions”*.
### Characterization of soils in the field and the laboratory
Characterization of field profiles and samples can be made more
affordable and feasible again by making maximum use of new technologies
that enable field descriptions and laboratory analyses to be completed
more rapidly, more affordably and more accurately.
Field characterizations can be improved by making use of a number of new
technologies. Simply taking geotagged digital photos of soil profiles
and sample sites can provide effective information that is located with
accuracy in both space and time. New sensors based on handheld
spectrophotometers are just beginning to become available. These may
soon support fast, efficient and accurate characterization of many soil
physical and chemical attributes directly in the field. Other field
instruments such as ground penetrating radar [@gerber2010applicability], electrical conductivity
and gamma ray spectroscopy [@rouze2017understanding] are also becoming increasingly available and
useful. Field sensors for monitoring soil moisture and soil temperature
in real time and transmitting these data to a central location are also
becoming increasingly common and affordable to deploy.
Portable MIR scanners achieve almost the same accuracy as laboratories [@s18040993].
Simple field description protocols based on using mobile phones to crowdsource a set
of basic observations and measurements could enable massive public
participation in collecting new field data.
Recent developments in the use of new, rapid and accurate pharmaceutical
grade analytical devices have reduced the costs of typical laboratory
analyses dramatically, while, at the same time, significantly improving
on reproducibility and accuracy [@shepherd2002development; @ShepherdWalsh2007JNIS]. A modern soil laboratory now entails
making use of mid- and near-infrared spectrophotometers, X-ray
diffraction and X-ray diffusion, and laser-based particle size analysis.
Using these new instruments, it has been demonstrated that total costs
for running a complete set of common soil analyses on a full soil
profile can be reduced from a current cost of US\$ 2,000 to as little as
US\$ 2–10 per profile [@ShepherdWalsh2007JNIS; @ViscarraRossel2016198].
This reduction in cost, along with the associated improvement in
reproducibility is a game changer. It makes it, once again, feasible and
affordable to consider taking new field soil samples and analyzing them
in the laboratory.
### Creation, collation and distribution of effective environmental covariates
Any future soil inventory activities will inevitably be largely based on
development and application of automated predictive soil mapping (PSM)
methods. These methods are themselves based on developing statistical
relationships between environmental conditions that have been mapped
extensively, over an entire area of interest (e.g. environmental
covariates), and geolocated point observations that provide objective
evidence about the properties or classes of soils (or any other
environmental attribute of interest) at specific sampled locations.
The quality of outputs generated by predictive mapping models is
therefore highly dependent on the quality of the point evidence and also
on the environmental covariates available for use in any model. For
environmental covariates to be considered effective and useful, they
must capture and describe spatial variation in the most influential
environmental conditions accurately and at the appropriate level of
spatial resolution (detail) and spatial abstraction (generalization).
They must also describe those specific environmental conditions that
exhibit the most direct influence on the development and distribution of
soils or soil properties (or of whatever else one wishes to predict).
The degree to which available environmental covariates can act as
reliable and accurate proxies for the main (scorpan) soil forming
factors has a profound influence on the success of PSM methods. If
available covariates describe the environment comprehensively,
accurately and correctly, it is likely that predictive models will also
achieve high levels of prediction accuracy and effectiveness, if
provided with sufficient suitable point training data.
Fortunately, advances in remote sensing and mapping continue to provide
us with more and better information on the global spatial distribution
of many key (scorpan) environmental conditions. Climate data (c) is
becoming increasingly detailed, accurate and available. Similarly, many
currently available kinds of remotely sensed imagery provide
increasingly useful proxies for describing spatial patterns of
vegetation (o) and land use. Topography, or relief (r), is being
described with increasing detail, precision and accuracy by ever finer
resolution global digital elevation models (DEMs).
Unfortunately, several key environmental conditions are still not as
well represented, by currently available environmental covariates, as
one would wish. Improvements need to be made in acquiring global
covariates that describe parent material (p), age (a) and spatial
context or spatial position (n) better than they currently are. In
addition, the scorpan model recognizes that available information about
some aspect of the soil (s) can itself be used as a covariate in
predicting some other (related) aspect of the soil. Only recently have
we begun to see complete and consistent global maps of soil classes and
soil properties emerge that can be used as covariates to represent the
soil (s) factor in prediction models based on the scorpan concept.
Advances are being made in developing new covariates that provide
improved proxies for describing parent material (p). Perhaps the best
known of these, and the most directly relevant, is airborne gamma ray
spectroscopy [@wilford1997application; @viscarra2007multivariate; @rouze2017understanding].
This sensor can provide very direct and interpretable
information from which inferences can be made about both the mineralogy
and the texture of the top few centimeters of the land surface. A number
of countries (e.g. Australia, Uganda, Ireland) already possess complete,
country-wide coverage of gamma ray spectroscopy surveys. More are likely
to follow. Similarly, advances are being made in interpreting satellite
based measurements of spatio-temporal variations in ground surface
temperature and near surface soil moisture to infer properties of the
parent material such as texture, and to a lesser extent, mineralogy [@liu2012soil].
These act as very indirect proxies but they do help to distinguish
warmer and more rapidly drying sands, for example, from colder and
slower drying wet clays. Identifying and acquiring more detailed and
more accurate covariates from which parent material type and texture can
be inferred is a major ongoing challenge for which progress has been
slow.
Only recently have a number of investigators begun to suggest a variety
of covariates that can be calculated and used as proxies to describe
spatial context or spatial position (n) in the scorpan model [@Behrens2018EJSS]. These
measures of spatial context or position can help to account for the effects
of spatial autocorrelation in prediction models for many soil properties
and attributes. They also help to coax out effects related to spatial
context and spatial scale. The old adage that “what you see depends upon
how closely you look” certainly applies to predictive soil mapping. If
one only looks at the finest detail, one overlooks the broader context
and broader patterns. Similarly, if one only looks at broad patterns
(coarser resolutions) one can easily miss seeing, and predicting,
important shorter range variation. Soils are known to form in response
to a number of different soil forming processes and these processes are
themselves known to operate over quite different ranges of process
scales (or distances). So, if one looks only at the most detailed scales
(e.g. finest spatial resolution) one can easily fail to observe,
describe and account for important influences that operate across longer
distances and larger scales. Increasingly, it is becoming evident that
prediction models generate more accurate results when they incorporate
consideration of a hierarchical pyramid of environmental covariates
computed across a wide range of resolutions to represent a wide range of
process scales and formative influences [@Behrens2018EJSS; @behrens2018multi].
A final, and very significant, consideration, for environmental
covariates is one of degree of availability and ease of use. For
covariates to be effective, they must be relatively easy to identify,
locate and use. Many existing spatial data sets need some form of
preprocessing or transformation in order to become useful inputs as
environmental covariates in predictive mapping. Difficulties and costs
involved in locating, downloading and transforming these source data
sets can severely restrict their effective use. Equally, many of these
same covariates are often located, downloaded and processed multiple
times by multiple entities for use in isolated projects and then archived
(or disposed of) and not made easily available for subsequent use and
reuse. A mentality of “protecting my data” leads to limitations on
sharing and reuse of spatial data with large resulting costs from
redoing the same work over and over for each new project. Significant
improvements could be realized if spatial data sets, once assembled,
corrected and preprocessed, could be widely shared and widely used.
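To give a sense of what such preprocessing typically involves, the sketch below resamples one covariate onto the grid of another and writes out a shared stack using the terra package; the file names and layers are assumptions made only for illustration.
```{r, eval=FALSE}
# Harmonizing covariates to a common grid (file names are placeholders)
library(terra)

dem  <- rast("dem_250m.tif")                  # reference grid, e.g. a 250 m DEM
clim <- rast("mean_annual_temperature.tif")   # covariate on a different grid

# Reproject/resample the covariate so it matches the reference grid exactly
clim.250m <- project(clim, dem, method = "bilinear")

# Stack the layers so they share extent, resolution and projection,
# then save the stack so it can be reused and shared
covs <- c(dem, clim.250m)
names(covs) <- c("dem", "mat")
writeRaster(covs, "covariates_250m.tif", overwrite = TRUE)
```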
In many PSM projects, as much as 80% of the time and effort expended can
go into preparing and collating the environmental covariates used in the
modelling process. If modelers could work collectively and
collaboratively to share entire collections of relevant covariates at
global to regional to national scales, considerable efficiencies could
be realized. Time and effort now spent in assembling covariates could
instead be devoted to locating and assembling more and better point O&M
data and on discovering and applying improved models. So, one key way in
which future inventory activities could be made much more efficient and
cost-effective would be to develop mechanisms and platforms whereby
comprehensive stacks of environmental covariates, covering entire
regions of interest, could be jointly created, collated and freely shared.
OpenGeoHub aims to provide a fully worked example of such a platform for
sharing geodata.
### Automated spatial prediction models (PSM)
Rapid adoption of new, automated, spatial prediction methods is the most
fundamental change envisaged as being central to all efforts to redesign
land resource inventories such that they can, once again, become
affordable and feasible to conduct. These models are quantitative,
objective, repeatable and updateable. They capture and codify
understanding of how soils are arranged spatially in the landscape, and
why, in ways that are systematic, rigorous and verifiable. Results of
models can be updated regularly and easily, as new O&M point data, new
covariates, or even new modelling algorithms become available. The time
and costs associated with constructing prediction models is minimal in
comparison with traditional manual mapping methods. Even more
dramatically, once constructed, models can be rerun, against improved
data, to update predictions regularly or to track changes in conditions
through time.
Prediction models have changed, and improved, quite substantially, over
the last few years. Most initial PSM models were linear (simple) and
universal (applied equally to entire areas). Newer PSM models are
increasingly non-linear and hierarchical with different mathematical
combinations of predictors operating in different ways under different
regional combinations of environmental conditions. More powerful methods
involving Deep Learning and Artificial Intelligence have recently
demonstrated improved prediction accuracies, compared to earlier, more
simple, linear regression or tree models.
Automated prediction models have several other clear advantages over
conventional manual mapping methods. Consider again the previously
discussed manual approaches of top-down versus bottom-up mapping. Up
until now, almost all previous manual (or indeed automated) mapping
programs have been bottom-up approaches applicable to bounded areas of
some defined and limited extent such as individual farm fields, map
sheets, counties, provinces, states or, at a maximum, entire countries.
Any project that applies only to a bounded area of limited extent will,
as a consequence, only collect, analyse and use observations and data
that exist within the boundaries of the defined map extent.
Automated mapping methods have the advantage that they can be truly
global. That is, they can use, and consider, all available point data,
everywhere in the world, as evidence when constructing prediction rules.
This means that all possible point data get used and no data go to
waste. Global models, that use all available global point data are, in
fact, an elegant and simple way of implementing the concept of Homosoil
that has been advanced by @Mallavan2010PSS. The Homosoil concept
suggests that, if O&M data are not available for any particular point of
interest in the world, then one should search to identify and locate a
point somewhere else in the world that has the most similar possible
combination of environmental conditions as the current unsampled point
but that has also been sampled. Data for this sampled site are then used
to characterize the current unsampled site. Global models simply reverse
this search process by 180 degrees while at the same time making it much
more efficient and simpler to implement. Global models take all
available point data and then identify all other locations anywhere in
the world that possess similar combinations of environmental conditions.
All these similar locations are then assigned, via application of the
prediction model, values for soil properties or soil classes that are
similar to those observed at the sampled reference location, or multiple
similar locations.
Global models not only make use of all available point data to develop
rules, they also capture and quantify variation in soil classes and soil
properties that operates over longer distances (10s to 100s of km) and
coarser scales. This longer range variation is usually related to soil
forming processes that themselves operate over longer distances, such as
gradual, long distance variation in climate, vegetation or even
topography (at the level of regional physiography). Long range variation
may require consideration of patterns that express themselves over very
large distances that may exist partially, or entirely, outside the
boundaries of some current bounded area of interest. Local, bounded
studies can easily fail to observe and quantify this long range
variation.
```{r landgis-soil, echo=FALSE, fig.cap="General workflow of the spatial prediction system used to produce soil property and class maps via LandGIS.", out.width="100%"}
knitr::include_graphics("figures/Fig_LandGIS_workflow_soil.png")
```
We can consider global models as providing a kind of elegant
implementation of top down mapping (Fig. \@ref(fig:landgis-soil)). Global models capture, describe and
quantify that portion of the total spatial variation in soil properties
and soil classes that occurs over longer distances in response to longer
range soil forming processes. This longer range variation may
constitute only a rather small percentage of the total local spatial
variation in a property (typically some 10–30% of total variation).
But it does represent a component of the total variation that would
likely be missed, and not properly observed or accounted for, by local,
bounded, models that do not consider patterns of spatial variation that
extend outside their maximum boundaries or that occur entirely outside
the boundaries of a contained study area.
In a top down mapping approach based on automated mapping, predictions
made globally, using all globally available point data, can be used to
account for longer range patterns of variation and can provide initial,
*a priori,* estimates of the most likely values for soil properties or
soil classes at a point. These initial, *a priori,* estimates can
subsequently be updated and improved upon by more detailed local studies
that have access to much larger volumes of local O&M point data. The
values computed for soil properties by global models can be merged with
values estimated by local models to create some form of merged weighted
average. Alternately, the global estimates of soil property values can
be used to represent soil type covariates (s) in a scorpan prediction
model. Here, globally estimated property values are used as s-type
covariates in predicting equivalent soil property values at local scales
using local models.
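One possible form of such a merge is inverse-variance weighting, sketched below; the objects `glob` and `loc` (global and local predictions with standard errors on the same grid) are assumptions for illustration, and the independence of errors implied by the formula is a strong simplification.
```{r, eval=FALSE}
# Merge global and local predictions, weighting each by the inverse of its
# prediction error variance ('glob' and 'loc' each hold 'pred' and 'se')
w.glob <- 1 / (glob$se^2)
w.loc  <- 1 / (loc$se^2)

merged.pred <- (w.glob * glob$pred + w.loc * loc$pred) / (w.glob + w.loc)

# Approximate standard error of the merged prediction, assuming independent errors
merged.se <- sqrt(1 / (w.glob + w.loc))
```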
Automated spatial prediction models also permit us to recognize that
otherwise similar soils develop and express different properties under
different types of human management. They don't just permit this
recognition; they force us to recognize differences in soils that arise
from differences in land use. This is because automated prediction
models are driven by the data that are fed to them and field O&M data
collected from managed landscapes will invariably report different
values for key soil properties than would be reported for otherwise
similar soils under natural or unmanaged conditions. Thus, for automated
predictive models to actually work, they have to observe and then
predict differences in soils and soil properties between managed and
natural landscapes. This was never something that was considered
feasible to do with manual soil mapping. Consequently managed soils were
usually named and mapped as if they were identical to their natural
(unmanaged) equivalents. Differences might be described in reports or
tables of laboratory analyses, but the two variations of the same soil
(managed and natural) were rarely, if ever, mapped as separately
described entities.
In a similar way, automated prediction methods force us to recognize and
account for temporal variations that arise from changes in soil
conditions or soil attributes at the same locations over time. The
models will predict values similar to those provided to them as input
from field observations and measurements. If we have point O&M data for
the same point location that is separated in time and that reflects
changes in soil property values through time, we need to be able to
recognize this and adapt to it. We need to recognize that all
predictions apply to a specific time period and that different
predictions (maps) need to be produced for different time periods, if
the available point O&M data reference widely different time periods.
In the context of automated mapping and High Performance Computing,
opportunities for producing high quality soil maps using Open Source software
are becoming more and more attractive. However, not all Open Source Machine
Learning packages are equally applicable for processing large national or
international data sets at resolutions of 250 m or better. LandGIS predictions
are, for example, possible only thanks to the following packages that can be
fully parallelized and are ready for upscaling predictions
(all written in C++ in fact):
- **ranger** (https://github.com/imbs-hl/ranger),
- **xgboost** (https://xgboost.readthedocs.io/en/latest/),
- **liquidSVM** (https://github.com/liquidSVM/liquidSVM),
These can then be efficiently combined with accuracy assessment
and fine-tuning packages (also ready for parallelization):
- **SuperLearner** (https://cran.r-project.org/web/packages/SuperLearner/),
- **caret** (https://topepo.github.io/caret/),
- **mlr** (https://mlr.mlr-org.com/),
Beyond that, it is not trivial to use R for production of large rasters where
millions of points with hundreds of covariates are used for model building.
So it is important to realize that Open Source software does not offer out-of-the-box
solutions for PSM projects, but requires active involvement and development.
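As a minimal, hedged example of the kind of workflow these packages support, the sketch below fits a Quantile Regression Forest with ranger and produces predictions with a 90% prediction interval; the regression matrix `reg.matrix` (target `soc` plus covariate columns) and the new-data object `covs.df` are placeholders, not the actual LandGIS configuration.
```{r, eval=FALSE}
# Parallelized random forest sketch with prediction intervals
library(ranger)

m.soc <- ranger(soc ~ ., data = reg.matrix, num.trees = 500,
                quantreg = TRUE,                        # enables quantile predictions
                num.threads = parallel::detectCores())

# Lower bound, median and upper bound for each new location
pred <- predict(m.soc, data = covs.df, type = "quantiles",
                quantiles = c(0.05, 0.5, 0.95))
str(pred$predictions)
```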
### Hosting, publishing, sharing and using spatial data
Finally, we need to consider how future inventory activities can benefit
from improved approaches for hosting, publishing, sharing and using
spatial data, with special attention paid to predictions of soil
properties or soil classes.
The value of data is in its use. Thus, we only get full value for our
data if we can maximize its distribution and use. Developments in
mechanisms and communities for sharing digital data online provide
promise of greatly improved access to, and use of, new digital data
sets, including predictive soil maps.
Major developments in hosting and delivering spatial data online include
new and increased interest in, and adherence to, principles of FAIR
Data, FAST Data and, most importantly, OPEN Data.
FAIR Data principles aim to make data findable, accessible,
interoperable and reusable [@wilkinson2016fair]. The easier data are
to locate and access, the greater the use is likely to be. Similarly,
data that are interoperable are easier to ingest into end user
applications, and so, will receive greater use. Data that are reusable
also ensure maximum benefit by facilitating regular use and reuse.
FAST data is the application of big data analytics to smaller data sets
in near-real or real-time in order to solve a problem or create business
value. The goal of fast data is to quickly gather and mine structured
and unstructured data so that action can be taken
(https://whatis.techtarget.com/definition/fast-data).
Fast data is fundamentally different from Big Data in many ways. Big
Data is most typically data at rest, hundreds of terabytes or even
petabytes of it, taking up lots of space on disk drives. Fast data is
data in motion
(https://www.voltdb.com/why-voltdb/big-data/).
OpenGeoHub aims to use Big Data analytics to rapidly and affordably turn
static and unstructured data into easily and widely used
information. The objective should be to rapidly generate agile, flexible
and user oriented data.
Future soil inventory projects based on application of predictive soil
modelling will also benefit from adopting the following principles of
OPEN Data based on the Sunlight Foundation's *“Ten Principles for Opening up Government Information”*
(https://open.canada.ca/en/open-data-principles#toc95):
**1. Completeness**\
Data sets should be as complete as possible, reflecting the entirety of
what is recorded about a particular subject. All raw information from a
data set should be released to the public, unless there are Access to
Information or Privacy issues. Metadata that defines and explains the
raw data should be included, along with explanations for how the data
was calculated.\
**2. Primacy**\
Data sets should come from a primary source. This includes the original
information collected by the original sources and available details on
how the data was collected. Public dissemination will allow users to
verify that information was collected properly and recorded accurately.\
**3. Timeliness**\
Data sets released should be made available to the public in a timely
fashion. Whenever feasible, information collected by original entities
should be released as quickly as it is gathered and collected. Priority
should be given to data whose utility is time sensitive.\
**4. Ease of Physical and Electronic Access**\
Data sets released by their producers should be as accessible as
possible, with accessibility defined as the ease with which information
can be obtained. Barriers to electronic access include making data
accessible only via submitted forms or systems that require
browser-oriented technologies (e.g., Flash, Javascript, cookies or Java
applets). By contrast, providing an interface for users to make specific
calls for data through an Application Programming Interface (API) makes
data much more readily accessible.\
**5. Machine readability**\
Machines can handle certain kinds of inputs much better than others.
Data sets should be released in widely-used file formats that easily lend
themselves to machine processing (e.g. CSV, XML). These files should be
accompanied by documentation related to the format and how to use it in
relation to the data.\
**6. Non-discrimination**\
Non-discrimination refers to who can access data and how they must do
so. Barriers to use of data can include registration or membership
requirements. Released data sets should have as few barriers to use as
possible. Non-discriminatory access to data should enable any person to
access the data at any time without having to identify him/herself or
provide any justification for doing so.\
**7. Use of Commonly Owned Standards**\
Commonly owned standards refer to who owns the format in which data is
stored. For example, if only one company manufactures the program that
can read a file where data is stored, access to that information is
dependent upon use of that company's program. Sometimes that program is
unavailable to the public at any cost, or is available, but for a fee.
Removing this cost makes the data available to a wider pool of potential
users. Released data sets should be in freely available file formats as
often as possible.\
**8. Licencing**\
All data sets should be released under a recognized Open Data Licence.
Such licences are designed to increase openness and minimize
restrictions on the use of the data.\
**9. Permanence**\
The capability of finding information over time is referred to as
permanence. For best use by the public, information made available
online should remain online, with appropriate version-tracking and
archiving over time.\
**10. Usage Costs**\
All open data should be provided free of charge.
A preferred way of achieving FAIR, FAST and OPEN data distribution is to
develop and maintain new, online platforms that support collaborative
compilation, sharing and geopublishing. OpenGeoHub aims to provide a
viable, worked example of how a new, open and collaborative web-based
platform can deliver soil spatial information on-demand and in nearly
real time.
### New visualization and data analysis tools
Terrestrial resource inventories, and indeed spatial inventories of
almost all environmental conditions, will increasingly benefit from
adopting and using new tools and platforms that enhance interactive,
real time data visualization and data analysis.
Spatial data increasingly needs to be presented in ways that support
interactive, real-time visualization of 3 dimensions plus time, which is
increasingly being referred to as 4D or 3D+ time. We need to help users
visualize, and appreciate, how soils vary with depth as well as in
horizontal space. And, also increasingly, we need to be able to help
users visualize and understand how soils can vary through time.
OpenGeoHub is attempting to demonstrate newly available facilities for
visualizing, and interacting with, 3D and 3D+ time spatio-temporal data.
Every effort needs to be made to facilitate easy use of terrestrial
resource inventory spatial data. This should entail releasing spatial
data that has both the content and the format required for immediate
ingestion into, and use in, critical end user applications. Users should
be able to link their applications to data supplier platforms and simply
call up the data they need.
## The future of PSM: Embracing new organizational and governance models
### Overview
In the same way that new scientific and technological advances can be
embraced to improve future PSM, any new, future PSM activities should
also take advantage of newer organizational models that improve how
collective activities can be organized and managed collaboratively and
cooperatively through innovations such as [@Hengl2018OGH]:
- Open data and platforms and procedures for acquiring and sharing data,
- Open, cloud-based, processing capabilities,