From c4cc4a90789efe47dc084a521c802b61c43e8d1b Mon Sep 17 00:00:00 2001
From: Paul Bienkowski
Date: Sun, 26 Mar 2023 20:32:12 +0200
Subject: [PATCH] Docs for new pipeline

---
 README.md                                      |  42 +------
 .../{convert_osm.py => transform_osm.py}       |   0
 docker-compose.yaml                            |   2 +
 docs/osm-import.md                             | 103 ++++++++++++++++++
 4 files changed, 106 insertions(+), 41 deletions(-)
 rename api/tools/{convert_osm.py => transform_osm.py} (100%)
 create mode 100644 docs/osm-import.md

diff --git a/README.md b/README.md
index 0e61270..b23dff6 100644
--- a/README.md
+++ b/README.md
@@ -164,7 +164,7 @@
 You will need to re-run this command after updates, to migrate the database
 and (re-)create the functions in the SQL database that are used when
 generating vector tiles.
 
-You should also import OpenStreetMap data now, see below for instructions.
+You should also [import OpenStreetMap data](docs/osm-import.md) now.
 
 ### Boot the application
@@ -190,46 +190,6 @@
 docker-compose run --rm api alembic upgrade head
 ```
 
-## Import OpenStreetMap data
-
-You need to import road information from OpenStreetMap for the portal to work.
-This information is stored in your PostgreSQL database and used when processing
-tracks (instead of querying the Overpass API), as well as for vector tile
-generation. The process applies to both development and production setups. For
-development, you should choose a small area for testing, such as your local
-county or city, to keep the amount of data small. For production use you have
-to import the whole region you are serving.
-
-* Install `osm2pgsql`.
-* Download the area(s) you would like to import from [GeoFabrik](https://download.geofabrik.de).
-* Import each file like this:
-
-  ```bash
-  osm2pgsql --create --hstore --style roads_import.lua -O flex \
-    -H localhost -d obs -U obs -W \
-    path/to/downloaded/myarea-latest.osm.pbf
-  ```
-
-You might need to adjust the host, database and username (`-H`, `-d`, `-U`) to
-your setup, and also provide the correct password when queried. For the
-development setup the password is `obs`. For production, you might need to
-expose the containers port and/or create a TCP tunnel, for example with SSH,
-such that you can run the import from your local host and write to the remote
-database.
-
-The import process should take a few seconds to minutes, depending on the area
-size. A whole country might even take one or more hours. You should probably
-not try to import `planet.osm.pbf`.
-
-You can run the process multiple times, with the same or different area files,
-to import or update the data. However, for this to work, the actual [command
-line arguments](https://osm2pgsql.org/doc/manual.html#running-osm2pgsql) are a
-bit different each time, including when first importing, and the disk space
-required is much higher.
-
-Refer to the documentation of `osm2pgsql` for assistance. We are using "flex
-mode", the provided script `roads_import.lua` describes the transformations
-and extractions to perform on the original data.
 
 ## Troubleshooting
 
diff --git a/api/tools/convert_osm.py b/api/tools/transform_osm.py
similarity index 100%
rename from api/tools/convert_osm.py
rename to api/tools/transform_osm.py
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 733a51a..7e2db47 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -36,6 +36,8 @@ services:
       - ./tile-generator/data/:/tiles
       - ./api/migrations:/opt/obs/api/migrations
       - ./api/alembic.ini:/opt/obs/api/alembic.ini
+      - ./local/pbf:/pbf
+      - ./local/obsdata:/obsdata
     depends_on:
       - postgres
       - keycloak
diff --git a/docs/osm-import.md b/docs/osm-import.md
new file mode 100644
index 0000000..939f7e1
--- /dev/null
+++ b/docs/osm-import.md
@@ -0,0 +1,103 @@
+# Importing OpenStreetMap data
+
+The application requires a lot of data from OpenStreetMap to work.
+
+The required information is stored in the PostgreSQL database and used when
+processing tracks, as well as for vector tile generation. The process applies
+to both development and production setups. For development, you should choose
+a small area for testing, such as your local county or city, to keep the
+amount of data small. For production use you have to import the whole region
+you are serving.
+
+## General pipeline overview
+
+1. Download OpenStreetMap data as one or more `.osm.pbf` files.
+2. Transform this data to generate geometry data for all roads and regions, so
+   we don't need to look up nodes separately. This step requires a lot of CPU
+   and memory, so it can be done "offline" on a high-powered machine.
+3. Import the transformed data into the PostgreSQL/PostGIS database.
+
+## Community-hosted transformed data
+
+Since the first two steps are the same for everybody, the community will soon
+provide a service where relatively up-to-date transformed data can be
+downloaded for direct import. Stay tuned.
+
+## Download data
+
+[GeoFabrik](https://download.geofabrik.de) kindly hosts extracts of the
+OpenStreetMap planet by region. Download all regions you're interested in from
+there in `.osm.pbf` format, with the tool of your choice, e.g.:
+
+```bash
+wget -P local/pbf/ https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf
+```
+
+## Transform data
+
+To transform downloaded data, you can either use the docker image from a
+development or production environment, or locally install the API into your
+Python environment. Then run the `api/tools/transform_osm.py` script on the
+data:
+
+```bash
+api/tools/transform_osm.py baden-wuerttemberg-latest.osm.pbf baden-wuerttemberg-latest.msgpack
+```
+
+In dockerized setups, make sure to mount your data somewhere in the container
+and also mount a directory where the result can be written. The development
+setup takes care of this, so you can use:
+
+```bash
+docker-compose run --rm api tools/transform_osm.py \
+  /pbf/baden-wuerttemberg-latest.osm.pbf /obsdata/baden-wuerttemberg-latest.msgpack
+```
+
+Repeat this command for every file you want to transform.
+
+## Import transformed data
+
+The command for importing looks like this:
+
+```bash
+api/tools/import_osm.py baden-wuerttemberg-latest.msgpack
+```
+
+This tool reads your application config from `config.py`, so set that up first
+as if you were setting up your application.
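+The following is a hypothetical minimal sketch of such a `config.py`, under
+the assumption that the import tool only needs a database connection string.
+The setting name `POSTGRES_URL` and the credentials shown are assumptions for
+illustration; copy the real keys from the example config that ships with the
+application instead:
+
+```python
+# config.py -- hypothetical minimal sketch, not the authoritative format.
+# POSTGRES_URL is an assumed setting name; the example config shipped with
+# the application defines the real keys and values.
+POSTGRES_URL = "postgresql+asyncpg://obs:obs@localhost/obs"
+```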
+
+In dockerized setups, make sure to mount your data somewhere in the container.
+Again, the development setup takes care of this, so you can use:
+
+```bash
+docker-compose run --rm api tools/import_osm.py \
+  /obsdata/baden-wuerttemberg-latest.msgpack
+```
+
+The import process should take a few seconds to minutes, depending on the area
+size. You can run the process multiple times, with the same or different area
+files, to import or update the data. You can update only one region and leave
+the others as they are, or add more filenames to the command line to
+bulk-import data.
+
+## How this works
+
+* The transformation is done with a Python script that uses
+  [pyosmium](https://osmcode.org/pyosmium/) to read the `.osm.pbf` file. This
+  script then filters the data for only the required objects (such as road
+  segments and administrative areas), and extracts the interesting information
+  from those objects (see the first sketch below).
+* The node geolocations are looked up to generate a geometry for each object.
+  This requires a lot of memory to run efficiently.
+* The geometry is projected to [Web Mercator](https://epsg.io/3857) in this
+  step to avoid continuous transformation when tiles are generated later. Most
+  operations will work fine in this projection. Projection is done with the
+  [pyproj](https://pypi.org/project/pyproj/) library.
+* The output is written to a binary file in a very simple format using
+  [msgpack](https://github.com/msgpack/msgpack-python), which is much more
+  efficient than (Geo-)JSON, for example. This format is streamable, so the
+  generated file is never written or read into memory as a whole.
+* The import script reads the msgpack file and sends it to the database using
+  [psycopg](https://www.psycopg.org/). This is done because it supports
+  PostgreSQL's `COPY FROM` statement, which enables much faster writes to the
+  database than a traditional `INSERT ... VALUES`. The file is streamed
+  directly to the database, so it is never read into memory (see the second
+  sketch below).
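+
+The two sketches below illustrate these steps. They are simplified
+assumptions, not the actual scripts: the `highway` tag filter, the record
+fields, the `road` table with its columns, and the connection string are all
+made up for illustration. Refer to `api/tools/transform_osm.py` and
+`api/tools/import_osm.py` for the real implementation.
+
+```python
+# Hypothetical sketch of the transform step: filter ways with pyosmium,
+# project their coordinates with pyproj, and stream one record per way out
+# as msgpack. Usage: transform_sketch.py input.osm.pbf output.msgpack
+import sys
+
+import msgpack
+import osmium
+from pyproj import Transformer
+
+# EPSG:4326 (lon/lat) to EPSG:3857 (Web Mercator); always_xy keeps the
+# coordinate order as (lon, lat) regardless of the CRS axis definitions.
+PROJECT = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
+
+
+class RoadHandler(osmium.SimpleHandler):
+    def __init__(self, out):
+        super().__init__()
+        self.packer = msgpack.Packer()
+        self.out = out
+
+    def way(self, w):
+        if "highway" not in w.tags:
+            return  # keep only road segments (assumed filter)
+        geometry = [
+            PROJECT.transform(n.location.lon, n.location.lat)
+            for n in w.nodes
+            if n.location.valid()
+        ]
+        # Each record is packed and written out immediately, so the result
+        # never has to be held in memory as a whole.
+        self.out.write(self.packer.pack({
+            "way_id": w.id,
+            "name": w.tags.get("name"),
+            "geometry": geometry,
+        }))
+
+
+with open(sys.argv[2], "wb") as out:
+    # locations=True makes pyosmium cache node coordinates so that way
+    # geometries can be assembled in a single pass; this cache is the
+    # memory-hungry part of the transformation.
+    RoadHandler(out).apply_file(sys.argv[1], locations=True)
+```
+
+And a matching sketch of the import step, streaming the msgpack records into
+the database via `COPY ... FROM STDIN`:
+
+```python
+# Hypothetical sketch of the import step using psycopg 3. The DSN and the
+# "road" table with its columns are assumptions for illustration.
+# Usage: import_sketch.py input.msgpack
+import sys
+
+import msgpack
+import psycopg
+
+with psycopg.connect("postgresql://obs:obs@localhost/obs") as conn:
+    with conn.cursor() as cur:
+        with cur.copy("COPY road (way_id, name, geometry) FROM STDIN") as copy:
+            with open(sys.argv[1], "rb") as f:
+                # The Unpacker yields one record at a time from the file,
+                # so the input is never read into memory as a whole.
+                for record in msgpack.Unpacker(f, raw=False):
+                    # Assumption: the geometry column accepts EWKT input,
+                    # which PostGIS parses on insert.
+                    ewkt = "SRID=3857;LINESTRING(%s)" % ",".join(
+                        "%f %f" % (x, y) for x, y in record["geometry"]
+                    )
+                    copy.write_row((record["way_id"], record["name"], ewkt))
+```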