Merge pull request #10 from hotosm/mvum
Add conversion utilities for external highway datasets
spwoodcock authored Aug 2, 2024
2 parents 2792f71 + 582de35 commit f4a16f0
Showing 4 changed files with 573 additions and 105 deletions.
149 changes: 44 additions & 105 deletions docs/utilities.md
@@ -1,107 +1,46 @@
# Utility Programs

Conflator includes a few Bourne shell scripts used for bulk
processing of data files. Because performance is poor when
processing huge data files, the files are split up into manageable
pieces. Several assumptions are made, namely that the OSM data
is in a Postgres database, imported via [these
instructions](conflation.md). The building footprint file can be
either in a database, or the downloaded GeoJSON formatted data file.

Since all of the output files need to be web accessible, these scripts
are usually run in the directory containing the data files. Each
country should be in a separate directory. These scripts take two
command line arguments, the country to be processed, and optionally a
single project to process. By default all projects are processed. If a
consistent naming convention is used, the basename of the current
directory is used to guess the proper country name. That same value
is also used to identify the correct database or data file name.

### For example

~www/Africa/Kenya
~www/Africa/Kenya/kenya-latest.pbf
~www/Africa/Kenya/kenya.geojsonl
~www/Africa/Nigeria
~www/Africa/Nigeria/nigeria-latest.pbf
~www/Africa/Nigeria/nigeria.geojsonl
~www/Asia/Nepal
~www/Asia/Nepal/...

# Tasking Manager Projects

Since the import data is huge, the Tasking Manager is used to
validate the results of conflation. To further reduce the data size,
the project boundaries are downloaded from the Tasking Manager by
using its remote API. These are then saved to disk using the naming
convention *12345-project.geojson*, where **12345** is a project
ID. The project boundaries can be downloaded using the
[splitter.py](splitter.md) program, which is part of conflator. A
boundary can be downloaded like this:

> PATH/splitter.py -p 12345

Since a big import requires multiple Tasking Manager projects, to get
started, download all of the boundaries for this import.

## clipsrc.sh

This script extracts all the buildings in the specified country into a
data file. This assumes all the data has already been imported into
Postgres. Since there are usually multiple countries imported into
Postgres, this gets just the ones we want for further processing.

This generates two output files from the database, each named for the
country and postfixed by the data source. For example *kenya-osm.geojson*
and *kenya-ms.geojson*. These data files are then split into
smaller files based on a Tasking Manager project boundary. Each of the
smaller files follows the same naming convention, *12345-osm.geojson*
or *12345-ms.geojson*.

> PATH/clipsrc.sh kenya
## update.sh

This script processes the project-sized data files for the best
performance. Once again, it looks for any files that follow the naming
convention, and runs the [conflation script](conflator.md) on each of
the project boundaries. This generates a single output file, containing
buildings from the footprint file that are not already in OSM. This
file is *12345-buildings.geojson*.

> PATH/update.sh nigeria
## index.sh

This script generates a simple webpage to navigate all the data files,
so they can be manually downloaded for validation. This script should
be run in the directory with all the data files. The first section is
just the project from the Tasking Manager, the rest are all the
smaller files for each project. Each project has 3 generated data
files, the two raw data files produced from the database, and the
conflated building output.

> ./index.sh
## splittasks.sh

This utility splits an existing data file of the results of building
conflation into smaller pieces. If the project id is specified on the
command line, only that project is processed. Otherwise the current
directory is scanned for files using the naming convention of
${projectid}-tasks.geojson. This then uses the X and Y coordinates of
the task for the default zoom level to uniquely name
the data file so the Tasking Manager can load it.

> PATH/splittasks.sh [project ID]
## getosm

This utility downloads data files smaller than those available
from GeoFabrik. It requires a boundary polygon from a Tasking Manager
project. If the project id is specified on the command line, only that
project is downloaded. Otherwise the current directory is scanned for
files using the naming convention of ${projectid}-projects.geojson.

> PATH/getosm.sh [project ID]
To conflate external datasets with OSM, the external data needs to be
converted to the OSM tagging schema. Otherwise comparing tags gets
very convoluted. Since every dataset uses a different schema, a few
utility programs for converting external datasets are included. Currently
the only datasets are for highways. These datasets are available from
the [USDA](https://www.usda.gov/), and have an appropriate license to
use with OpenStreetMap. Indeed, some of this data has already been
imported. The files are available from the
[FSGeodata Clearinghouse](https://data.fs.usda.gov/geodata/edw/datasets.php?dsetCategory=transportation).

Most of the fields in the dataset aren't needed for OSM, only the
name and the reference number, if the highway has one. Most of these
highways are already in OSM, but the data is a bit of a mess, and mostly
unvalidated. Most of the problems are related to the TIGER import
in 2007. So the goal of these utilities is to aid the [TIGER
fixup](https://wiki.openstreetmap.org/wiki/TIGER_fixup) work by
updating or adding the name and a reference number. These utilities
prepare the dataset for conflation.

There are other fields in the datasets we might want, like surface
type or whether a highway is 4WD only, but often the OSM data is more
up to date. And to really get that right, you need to ground-truth it.
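One way to express that precedence is sketched below. This is only an illustration and not the conflation code itself: the function `merge_tags` and the `EXTERNAL_WINS` tuple are hypothetical names, and the rule that the external name and `ref:usfs` win while condition tags such as `surface` and `smoothness` never replace an existing OSM value is an assumption drawn from the descriptions of these utilities.

```python
# Tags where the external dataset is assumed to be authoritative.
# This tuple is an assumption made for illustration.
EXTERNAL_WINS = ("name", "ref:usfs")


def merge_tags(osm: dict, external: dict) -> dict:
    """Merge external tags into OSM tags.

    External values replace OSM values only for the keys listed in
    EXTERNAL_WINS. Any other external tag is added only when OSM has
    no value for it, so existing OSM data is never overridden.
    """
    merged = dict(osm)
    for key, value in external.items():
        if key in EXTERNAL_WINS or key not in merged:
            merged[key] = value
    return merged


if __name__ == "__main__":
    osm = {"highway": "track", "name": "Bear Cr Rd", "surface": "gravel"}
    external = {
        "name": "Bear Creek Road",
        "ref:usfs": "FR 123",
        "surface": "dirt",
        "smoothness": "bad",
    }
    print(merge_tags(osm, external))
```

Here the OSM `surface` value survives, while the missing `smoothness` tag is filled in from the external data as a hint for ground-truthing.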

## mvum.py

This converts the [Motor Vehicle Use Map (MVUM)](https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.Road_MVUM.zip) dataset, which contains
data on highways more suitable for offroad vehicles. Some require
specialized offroad vehicles like a UTV or ATV. The data in OSM for
these roads is really poor. Often the reference number is wrong, or
lacks the suffix. We assume the USDA data is correct when it comes to
name and reference number, and any differences get handled later by
conflation.
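Since abbreviations are discouraged in OSM, road names have to be expanded during conversion. The following is a minimal sketch of one way to do that, matching whole words so an abbreviation at the end of a name is also expanded. The function name `expand_name` is hypothetical, and the table holds only a handful of abbreviations chosen for illustration; a real dataset will need more.

```python
# Illustrative abbreviation table; deliberately incomplete.
ABBREVIATIONS = {
    "Cr": "Creek",
    "Cg": "Campground",
    "Rd.": "Road",
    "Mtn": "Mountain",
}


def expand_name(raw: str) -> str:
    """Title-case a road name and expand known abbreviations.

    Words are compared one at a time, so an abbreviation is expanded
    wherever it appears in the name, including as the last word.
    """
    words = raw.title().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


if __name__ == "__main__":
    print(expand_name("BEAR CR RD."))
    print(expand_name("elk mtn cg"))
```

Matching whole words avoids accidentally rewriting a longer word that merely contains an abbreviation, such as "Crest".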

## roadcore.py

This converts the [Road Core](https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.RoadCore_FS.zip) vehicle map. This contains data on all
highways in a national forest. It's similar to the MVUM dataset.

## Trails.py

This converts the [NPSPublish](https://data.fs.usda.gov/geodata/edw/edw_resources/shp/S_USA.TrailNFS_Publish.zip) Trail dataset. These are hiking trails
not open to motor vehicles. Currently much of this dataset has empty
fields, but the trail name and reference number are useful. This
utility supports the OpenStreetMap US [Trails Initiative](https://openstreetmap.us/our-work/trails/).
181 changes: 181 additions & 0 deletions utilities/mvum.py
@@ -0,0 +1,181 @@
#!/usr/bin/python3

# Copyright (c) 2021, 2022, 2023, 2024 Humanitarian OpenStreetMap Team
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

import argparse
import logging
import sys
import os
from sys import argv
from osm_fieldwork.osmfile import OsmFile
from geojson import Point, Feature, FeatureCollection, dump, Polygon, load
import geojson
from shapely.geometry import shape, LineString, Polygon, mapping
import shapely
from shapely.ops import transform
import pyproj
import asyncio
from codetiming import Timer
import concurrent.futures
from cpuinfo import get_cpu_info
from time import sleep
from thefuzz import fuzz, process
from pathlib import Path
from tqdm import tqdm
import tqdm.asyncio

# Instantiate logger
log = logging.getLogger(__name__)

# The number of threads is based on the CPU cores
info = get_cpu_info()
cores = info['count']

# shut off warnings from pyproj
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

class MVUM(object):
    def __init__(self,
                 filespec: str = None,
                 ):
        self.file = None
        if filespec is not None:
            self.file = open(filespec, "r")

    def convert(self,
                filespec: str = None,
                ) -> FeatureCollection:

        # FIXME: read in the whole file for now
        if filespec is not None:
            file = open(filespec, "r")
        else:
            file = self.file

        data = geojson.load(file)

        highways = list()
        for entry in data["features"]:
            # print(entry["properties"])
            # Check the entry itself before looking inside it
            if entry is None or entry["properties"] is None:
                continue
            geom = entry["geometry"]
            sym = 0
            op = None
            surface = str()
            name = str()
            props = dict()
            if "ID" in entry["properties"] and entry["properties"]["ID"] is not None:
                props["ref:usfs"] = f"FR {entry['properties']['ID']}"
            if "NAME" in entry["properties"] and entry["properties"]["NAME"] is not None:
                title = entry["properties"]["NAME"].title()
                # Fix some common abbreviations. The replacement has to be
                # applied to the title, not to the still empty name.
                if " Cr " in title:
                    name = title.replace(" Cr ", " Creek ")
                elif " Cg " in title:
                    name = title.replace(" Cg ", " Campground ")
                elif " Rd. " in title:
                    name = title.replace(" Rd. ", " Road ")
                elif " Mtn " in title:
                    name = title.replace(" Mtn ", " Mountain ")
                else:
                    name = title
                # Append "Road" only when the name lacks it, but always
                # keep the name.
                if "Road" not in name:
                    name = f"{name} Road"
                props["name"] = name
            if "OPERATIONA" in entry["properties"] and entry["properties"]["OPERATIONA"] is not None:
                # The first character is the maintenance level
                op = int(entry["properties"]["OPERATIONA"][:1])
                if op == 1:
                    props["access"] = "no"
                elif op == 2:
                    props["smoothness"] = "very bad"
                elif op == 3:
                    props["smoothness"] = "good"
                elif op == 4:
                    props["smoothness"] = "bad"
                elif op == 5:
                    props["smoothness"] = "excellent"

            # if "SBS_SYMBOL" in entry["properties"] and op is None:
            #     if "Not Maintained for" in entry["properties"]["SBS_SYMBOL"]:
            #         props["smoothness"] = "very bad"
            #     else:
            #         sym = entry["properties"]
            # A missing surface no longer drops the whole feature, since
            # the name and reference number are still useful.
            if "SURFACETYP" in entry["properties"] and entry["properties"]["SURFACETYP"] is not None:
                surface = entry["properties"]["SURFACETYP"]
                if surface[:3] == "NAT":
                    props["surface"] = "dirt"
                elif surface[:3] == "IMP" or surface[:5] == "CSOIL":
                    props["surface"] = "compacted"
                elif surface[:3] == "AGG":
                    props["surface"] = "gravel"
                elif surface[:2] == "AC":
                    props["surface"] = "gravel"
                elif surface[:3] == "BST" or surface[:2] == "P ":
                    props["surface"] = "paved"

            highways.append(Feature(geometry=geom, properties=props))
            # print(props)

        return FeatureCollection(highways)


async def main():
    """This main function lets this class be run standalone by a bash script"""
    parser = argparse.ArgumentParser(
        prog="mvum",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description="This program converts MVUM highway data into OSM tagging",
        epilog="""
This program processes the MVUM data. It will convert the MVUM dataset
to the OSM tagging schema so it can be conflated. Abbreviations are
discouraged in OSM, so they are expanded. Most entries in the MVUM
dataset are ignored. For fixing the TIGER mess, all that is relevant
are the name and the USFS reference number. The surface and smoothness
tags are also converted, but should never override what is in OSM, as the
OSM values for these may be more recent. And the values change over time,
so what is in the MVUM dataset may not be accurate. These tags are converted
primarily as an aid to navigation when ground-truthing, since it's usually
good to avoid any highway with a smoothness of "very bad" or worse.
    For Example:
        mvum.py -v -c -i WY_RoadsMVUM.geojson
        """,
    )
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
    parser.add_argument("-i", "--infile", required=True, help="Input MVUM data file in GeoJSON format")
    parser.add_argument("-c", "--convert", action="store_true", help="Convert MVUM feature to OSM feature")
    parser.add_argument("-o", "--outfile", default="out.geojson", help="Output file")

    args = parser.parse_args()

    # Enable debug logging when verbose output is requested
    if args.verbose:
        logging.basicConfig(level=logging.DEBUG, stream=sys.stdout)

    mvum = MVUM()
    if args.convert:
        data = mvum.convert(args.infile)

        # Only write output when there is converted data to write
        with open(args.outfile, "w") as file:
            geojson.dump(data, file)
        log.info(f"Wrote {args.outfile}")

if __name__ == "__main__":
    """This is just a hook so this file can be run standalone during development."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(main())