Merge pull request #10 from hotosm/mvum

Add conversion utilities for external highway datasets
hotosm · Aug 2, 2024 · f4a16f0
2 parents 2792f71 + 582de35
commit f4a16f0
Showing 4 changed files with 573 additions and 105 deletions.
diff --git a/docs/utilities.md b/docs/utilities.md
@@ -1,107 +1,46 @@
 # Utility Programs
 
-Conflator includes a few bourne shell scripts used for bulk
-processing of data files. Because of the poor performance when
-processing huge data files, they're split up into manageable
-pieces. There are several assumptions made, namely that the OSM data
-is in postgres database, imported via [these
-instructions](conflation.md). The building footprint file can be
-either in a database, or the downloaded GeoJson formatted data file.
-
-Since all of the output files need to be web accessible, these scripts
-are usually run in the directory containing the data files. Each
-country should be in a separate directory. These scripts take two
-command line arguments, the country to be processed, and optionally a
-single project to process. By default all projects are processed. If a
-consistent naming convention is used, the basename of the current
-directory is to try to guess the proper country name. That same value
-is also used to identify the correct database or data file name.
-
-### For example
-
-	~www/Africa/Kenya
-	~www/Africa/Kenya/kenya-latest.pbf
-	~www/Africa/Kenya/kenya.geojsonl
-	~www/Africa/Nigeria
-	~www/Africa/Nigeria/nigeria-latest.pbf
-	~www/Africa/Nigeria/nigeria.geojsonl
-	~www/Asia/Nepal
-	~www/Asia/Nepal/...
-
-# Tasking Manager Projects
-
-Since the import data is huge, the Tasking Manager is used to
-validation the results of conflation. To further reduce the data size,
-the project boundaries are downlooaded from the Tasking Manager by
-using it's remote API. These are then saved to disk using the naming
-convention *12345-project.geojson*, where **12345** is a project
-ID. The project boundaries can downloaded using the
-[splitter.py](splitter.md) program, which is part of conflator. A
-boundary can be downloaded like this:
-
-> PATH/splitter.py -p 12345
-
-Since a big import requires multiple Tasking Manager projects, to get
-started, download all of the boundaries for this import.
-
-## clipsrc.sh
-
-This script extracts all the buildings in the specified country into a
-data file. This assumes all the data has already been imported into
-postgres. Since there are usually multiple countries imported into
-postgres, this gets just the ones we want for furthur processing. 
-
-This generates two output files from the database, namely the country
-name, postfixed by the data source. For example *kenya-osm.geojson*
-and *kenya-ms.geojson*. These data files are then split into
-smaller files based on a Tasking Manager project boundary. Each of the
-smaller files follows the same naming convention, *12345-osm.geojson*
-or *kenya-ms.geojson*.
-
-> PATH/clipsrc.sh kenya
-
-## update.sh
-
-This script processes the project sized data files for the best
-performance. One again, it looks for any files that follow the naming
-convention, and runs the [conflation script](conflator.md) on each of
-the project boundaries. The generates a single output file, containing
-buildings from the footprint file that are not already in OSM. This
-file is *12345-buildings.geojson*.
-
-> PATH/clipsrc.sh nigeria
-
-## index.sh
-
-This script generates a simple webpage to navigate all the data files,
-so they can be manually downloaded for validation. This script should
-be run in the directory with all the data files. The first section is
-just the project from the Tasking Manager, the rest are all the
-smaller files for each project. Each project has 3 generated data
-files, the two raw data files produced from the database, and the
-conflated building output.
-
-> ./index.sh
-
-## splittasks.sh
-
-This utility splits an existing data file of the results of building
-conflation into smaller pieces. If the project id is specified on the
-command line, only that project is downloaded. Otherwise the current
-directory is scanned for files using the naming convention of
-${projectid}-tasks.geojson. This then uses the X and Y coordinates of
-the task for the default zoom level  This then uses the X and Y
-coordinates of the task for the default zoom level to uniquely name
-the data file so the Tasking Manager can load it.
-
-> PATH/splittasks.sh [project ID]
-
-## getosm
-
-This utility is to download smaller data files than are available
-from GeoFabrik. It requires a boundary polygon from a Tasking Manager
-project. If the project id is specified on the command line, only that
-project is downloaded. Otherwise the current directory is scanned for
-files using the naming convention of ${projectid}-projects.geojson.
-
-> PATH/getosm.sh [project ID]
+To conflate external datasets with OSM, the external data needs to be
+converted to the OSM tagging schema. Otherwise comparing tags gets
+very convoluted. Since every dataset uses a different schema, a few
+utility programs for converting external datasets are included. Currently
+the only supported datasets are for highways. These datasets are available
+from the [USDA](https://www.usda.gov/), and have an appropriate license to
+use with OpenStreetMap. Indeed, some of this data has already been
+imported. The files are available from the
+[FSGeodata Clearinghouse](https://data.fs.usda.gov/geodata/edw/datasets.php?dsetCategory=transportation).
+
+Most of the fields in the dataset aren't needed for OSM, only the
+name and the reference number, if it has one. Most of these highways
+are already in OSM, but the data is a bit of a mess, and mostly
+unvalidated. Most of the problems are related to the TIGER import
+in 2007. So the goal of these utilities is to aid in the [TIGER
+fixup](https://wiki.openstreetmap.org/wiki/TIGER_fixup) work by
+updating or adding the name and a reference number. These utilities
+prepare the dataset for conflation.
+
+There are other fields in the datasets we might want, like surface
+type or whether a road is 4WD only, but often the OSM data is more up
+to date. And to really get that right, you need to ground-truth it.
+
+## mvum.py
+
+This converts the [Motor Vehicle Use Map (MVUM)](https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.Road_MVUM.zip) dataset, which contains
+data on highways more suitable for offroad vehicles. Some require
+specialized offroad vehicles like a UTV or ATV. The data in OSM for
+these roads is really poor. Often the reference number is wrong, or
+lacks the suffix. We assume the USDA data is correct when it comes to
+name and reference number, and any differences from OSM get handled
+later by conflation.
+
+## roadcore.py
+
+This converts the [Road Core](https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.RoadCore_FS.zip) vehicle map. This contains data on all
+highways in a national forest. It's similar to the MVUM dataset.
+
+## Trails.py
+
+This converts the [NPSPublish](https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.TrailNFS_Publish.zip) Trail dataset. These are hiking trails
+not open to motor vehicles. Currently much of this dataset has empty
+fields, but the trail name and reference number are useful. This
+utility supports the OpenStreetMap US [Trails Initiative](https://openstreetmap.us/our-work/trails/).
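
As a rough illustration of the conversion these utilities perform, the sketch below maps the two fields the documentation cares about, the name and the reference number, onto OSM tags. The helper names `expand_name` and `to_osm_tags` are hypothetical and not part of this PR. The abbreviation table and the `FR` prefix mirror what `mvum.py` handles, and the sample property values are made up for the example.

```python
# Abbreviations expanded by mvum.py; OSM discourages abbreviated names.
ABBREVIATIONS = {
    "Cr": "Creek",
    "Cg": "Campground",
    "Rd.": "Road",
    "Mtn": "Mountain",
}


def expand_name(raw: str) -> str:
    """Title-case a dataset name, expand abbreviations, ensure it ends up with 'Road'."""
    words = [ABBREVIATIONS.get(word, word) for word in raw.title().split()]
    if "Road" not in words:
        words.append("Road")
    return " ".join(words)


def to_osm_tags(properties: dict) -> dict:
    """Build the OSM name and ref:usfs tags from the dataset's NAME and ID fields."""
    tags = {}
    if properties.get("ID") is not None:
        tags["ref:usfs"] = f"FR {properties['ID']}"
    if properties.get("NAME"):
        tags["name"] = expand_name(properties["NAME"])
    return tags


if __name__ == "__main__":
    print(to_osm_tags({"ID": "123.1A", "NAME": "ELK MTN"}))
```

Splitting the name into words, rather than searching for abbreviations surrounded by spaces, is a design choice of this sketch: it also expands an abbreviation that ends the name and allows more than one expansion per name.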
diff --git a/utilities/mvum.py b/utilities/mvum.py
@@ -0,0 +1,181 @@
+#!/usr/bin/python3
+
+# Copyright (c) 2021, 2022, 2023, 2024 Humanitarian OpenStreetMap Team
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+import argparse
+import logging
+import sys
+import os
+from sys import argv
+from osm_fieldwork.osmfile import OsmFile
+from geojson import Point, Feature, FeatureCollection, dump, Polygon, load
+import geojson
+from shapely.geometry import shape, LineString, Polygon, mapping
+import shapely
+from shapely.ops import transform
+import pyproj
+import asyncio
+from codetiming import Timer
+import concurrent.futures
+from cpuinfo import get_cpu_info
+from time import sleep
+from thefuzz import fuzz, process
+from pathlib import Path
+from tqdm import tqdm
+import tqdm.asyncio
+
+# Instantiate logger
+log = logging.getLogger(__name__)
+
+# The number of threads is based on the CPU cores
+info = get_cpu_info()
+cores = info['count']
+
+# shut off warnings from pyproj
+import warnings
+warnings.simplefilter(action='ignore', category=FutureWarning)
+
+class MVUM(object):
+    def __init__(self,
+                 filespec: str = None,
+                 ):
+        self.file = None
+        if filespec is not None:
+            self.file = open(filespec, "r")
+
+    def convert(self,
+                filespec: str = None,
+                ) -> FeatureCollection:
+
+        # FIXME: read in the whole file for now
+        if filespec is not None:
+            file = open(filespec, "r")
+        else:
+            file = self.file
+
+        data = geojson.load(file)
+
+        highways = list()
+        for entry in data["features"]:
+            geom = entry["geometry"]
+            id = 0
+            sym = 0
+            op = None
+            surface = str()
+            name = str()
+            props = dict()
+            # print(entry["properties"])
+            if entry is None or entry["properties"] is None:
+                continue
+            if "ID" in entry["properties"]:
+                props["ref:usfs"] = f"FR {entry['properties']['ID']}"
+            if "NAME" in entry["properties"] and entry["properties"]["NAME"] is not None:
+                title = entry["properties"]["NAME"].title()
+                name = title
+                # Fix some common abbreviations
+                if " Cr " in title:
+                    name = title.replace(" Cr ", " Creek ")
+                elif " Cg " in title:
+                    name = title.replace(" Cg ", " Campground ")
+                elif " Rd. " in title:
+                    name = title.replace(" Rd. ", " Road ")
+                elif " Mtn " in title:
+                    name = title.replace(" Mtn ", " Mountain ")
+                # Append "Road" only when the name doesn't already contain it
+                if "Road" not in name:
+                    name = f"{name} Road"
+                props["name"] = name
+            if "OPERATIONA" in entry["properties"] and entry["properties"]["OPERATIONA"] is not None:
+                op = int(entry["properties"]["OPERATIONA"][:1])
+                if op == 1:
+                    props["access"] = "no"
+                elif op == 2:
+                    props["smoothness"] = "very_bad"
+                elif op == 3:
+                    props["smoothness"] = "good"
+                elif op == 4:
+                    props["smoothness"] = "bad"
+                elif op == 5:
+                    props["smoothness"] = "excellent"
+
+            # if "SBS_SYMBOL" in entry["properties"] and op is None:
+            #     if "Not Maintained for" in entry["properties"]["SBS_SYMBOL"]:
+            #         props["smoothness"] = "very bad"
+            #     else:
+            #         sym = entry["properties"]
+            if "SURFACETYP" in entry["properties"]:
+                surface = entry["properties"]["SURFACETYP"]
+                # A missing surface should not drop the whole feature
+                if surface is None:
+                    surface = str()
+                if surface[:3] == "NAT":
+                    props["surface"] = "dirt"
+                elif surface[:3] == "IMP" or surface[:5] == "CSOIL":
+                    props["surface"] = "compacted"
+                elif surface[:3] == "AGG":
+                    props["surface"] = "gravel"
+                elif surface[:2] == "AC":
+                    props["surface"] = "gravel"
+                elif surface[:3] == "BST" or surface[:2] == "P ":
+                    props["surface"] = "paved"
+
+            highways.append(Feature(geometry=geom, properties=props))
+            #print(props)
+
+        return FeatureCollection(highways)
+
+
+async def main():
+    """This main function lets this class be run standalone by a bash script"""
+    parser = argparse.ArgumentParser(
+        prog="mvum",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        description="This program converts MVUM highway data into OSM tagging",
+        epilog="""
+This program processes the MVUM data. It will convert the MVUM dataset
+to use the OSM tagging schema so it can be conflated. Abbreviations are
+discouraged in OSM, so they are expanded. Most entries in the MVUM
+dataset are ignored. For fixing the TIGER mess, all that is relevant
+are the name and the USFS reference number. The surface and smoothness
+tags are also converted, but should never override what is in OSM, as the
+OSM values for these may be more recent. And the values change over time,
+so what is in the MVUM dataset may not be accurate. These tags are converted
+primarily as an aid to navigation when ground-truthing, since it's usually
+good to avoid any highway with a smoothness of "very bad" or worse.
+
+    For Example: 
+        mvum.py -v -c -i WY_RoadsMVUM.geojson
+        """,
+    )
+    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
+    parser.add_argument("-i", "--infile", required=True, help="Input MVUM data file in GeoJSON format")
+    parser.add_argument("-c", "--convert", action="store_true", help="Convert MVUM feature to OSM feature")
+    parser.add_argument("-o", "--outfile", default="out.geojson", help="Output file")
+
+    args = parser.parse_args()
+
+    mvum = MVUM()
+    if args.convert:
+        data = mvum.convert(args.infile)
+        # Only write output when a conversion actually produced data
+        with open(args.outfile, "w") as file:
+            geojson.dump(data, file)
+
+if __name__ == "__main__":
+    """This is just a hook so this file can be run standalone during development."""
+    loop = asyncio.new_event_loop()
+    asyncio.set_event_loop(loop)
+    loop.run_until_complete(main())
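
The `if`/`elif` chains in `convert()` could also be expressed as lookup tables, which makes the mapping easier to review and extend. The following is only a sketch of that idea, not the code in this PR: `condition_tags` and the two table names are hypothetical, the mapping values are copied from `mvum.py` (with the OSM spelling `very_bad`), and the sample field values are illustrative.

```python
# Prefix of the SURFACETYP field -> OSM surface value (values as used in mvum.py).
SURFACE_PREFIXES = [
    ("NAT", "dirt"),
    ("IMP", "compacted"),
    ("CSOIL", "compacted"),
    ("AGG", "gravel"),
    ("AC", "gravel"),
    ("BST", "paved"),
    ("P ", "paved"),
]

# Leading digit of the OPERATIONA field -> (OSM key, OSM value).
MAINTENANCE_LEVELS = {
    1: ("access", "no"),
    2: ("smoothness", "very_bad"),  # OSM spells this value with an underscore
    3: ("smoothness", "good"),
    4: ("smoothness", "bad"),
    5: ("smoothness", "excellent"),
}


def condition_tags(properties: dict) -> dict:
    """Derive access, smoothness and surface tags, tolerating missing or null fields."""
    tags = {}
    level = properties.get("OPERATIONA")
    if level and level[:1].isdigit():
        key_value = MAINTENANCE_LEVELS.get(int(level[:1]))
        if key_value is not None:
            tags[key_value[0]] = key_value[1]
    surface = properties.get("SURFACETYP") or ""
    for prefix, value in SURFACE_PREFIXES:
        if surface.startswith(prefix):
            tags["surface"] = value
            break
    return tags


if __name__ == "__main__":
    sample = {"OPERATIONA": "2 - HIGH CLEARANCE VEHICLES", "SURFACETYP": "NAT - NATIVE MATERIAL"}
    print(condition_tags(sample))
```

Because a null or missing field simply yields no tag here, a feature without a surface value still keeps its name and reference number.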