Docs for new pipeline

Paul Bienkowski 2023-03-26 20:32:12 +02:00
parent ac90d50239
commit c4cc4a9078
4 changed files with 106 additions and 41 deletions


@@ -164,7 +164,7 @@ You will need to re-run this command after updates, to migrate the database and
 (re-)create the functions in the SQL database that are used when generating
 vector tiles.
-You should also import OpenStreetMap data now, see below for instructions.
+You should also [import OpenStreetMap data](docs/osm-import.md) now.
 ### Boot the application
@@ -190,46 +190,6 @@ docker-compose run --rm api alembic upgrade head
 ```
-## Import OpenStreetMap data
-You need to import road information from OpenStreetMap for the portal to work.
-This information is stored in your PostgreSQL database and used when processing
-tracks (instead of querying the Overpass API), as well as for vector tile
-generation. The process applies to both development and production setups. For
-development, you should choose a small area for testing, such as your local
-county or city, to keep the amount of data small. For production use you have
-to import the whole region you are serving.
-* Install `osm2pgsql`.
-* Download the area(s) you would like to import from [GeoFabrik](https://download.geofabrik.de).
-* Import each file like this:
-```bash
-osm2pgsql --create --hstore --style roads_import.lua -O flex \
-  -H localhost -d obs -U obs -W \
-  path/to/downloaded/myarea-latest.osm.pbf
-```
-You might need to adjust the host, database and username (`-H`, `-d`, `-U`) to
-your setup, and also provide the correct password when queried. For the
-development setup the password is `obs`. For production, you might need to
-expose the containers port and/or create a TCP tunnel, for example with SSH,
-such that you can run the import from your local host and write to the remote
-database.
-The import process should take a few seconds to minutes, depending on the area
-size. A whole country might even take one or more hours. You should probably
-not try to import `planet.osm.pbf`.
-You can run the process multiple times, with the same or different area files,
-to import or update the data. However, for this to work, the actual [command
-line arguments](https://osm2pgsql.org/doc/manual.html#running-osm2pgsql) are a
-bit different each time, including when first importing, and the disk space
-required is much higher.
-Refer to the documentation of `osm2pgsql` for assistance. We are using "flex
-mode", the provided script `roads_import.lua` describes the transformations
-and extractions to perform on the original data.
 ## Troubleshooting


@@ -36,6 +36,8 @@ services:
       - ./tile-generator/data/:/tiles
       - ./api/migrations:/opt/obs/api/migrations
       - ./api/alembic.ini:/opt/obs/api/alembic.ini
+      - ./local/pbf:/pbf
+      - ./local/obsdata:/obsdata
     depends_on:
       - postgres
       - keycloak

docs/osm-import.md (new file, 103 lines)

@@ -0,0 +1,103 @@
# Importing OpenStreetMap data
The application requires a lot of data from OpenStreetMap to work.
The required information is stored in the PostgreSQL database and used when
processing tracks, as well as for vector tile generation. The process applies
to both development and production setups. For development, you should choose a
small area for testing, such as your local county or city, to keep the amount
of data small. For production use you have to import the whole region you are
serving.
## General pipeline overview
1. Download OpenStreetMap data as one or more `.osm.pbf` files.
2. Transform this data to generate geometry data for all roads and regions, so
we don't need to look up nodes separately. This step requires a lot of CPU
and memory, so it can be done "offline" on a high-powered machine.
3. Import the transformed data into the PostgreSQL/PostGIS database.
## Community hosted transformed data
Since the first two steps are the same for everybody, the community will soon
provide a service where relatively up-to-date transformed data can be
downloaded for direct import. Stay tuned.
## Download data
[GeoFabrik](https://download.geofabrik.de) kindly hosts extracts of the
OpenStreetMap planet by region. Download all regions you're interested in from
there in `.osm.pbf` format, with the tool of your choice, e.g.:
```bash
wget -P local/pbf/ https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf
```
## Transform data
To transform downloaded data, you can either use the Docker image from a
development or production environment, or install the API locally into your
Python environment. Then run the `api/tools/transform_osm.py` script on the data.
```bash
api/tools/transform_osm.py baden-wuerttemberg-latest.osm.pbf baden-wuerttemberg-latest.msgpack
```
In dockerized setups, make sure to mount your data somewhere in the container
and also mount a directory where the result can be written. The development
setup takes care of this, so you can use:
```bash
docker-compose run --rm api tools/transform_osm.py \
/pbf/baden-wuerttemberg-latest.osm.pbf /obsdata/baden-wuerttemberg-latest.msgpack
```
Repeat this command for every file you want to transform.
## Import transformed data
The command for importing looks like this:
```bash
api/tools/import_osm.py baden-wuerttemberg-latest.msgpack
```
This tool reads your application config from `config.py`, so set that up first
as if you were setting up your application.
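For orientation, here is a minimal sketch of such a config. The key name `POSTGRES_URL` and the example connection string are assumptions for illustration; check the project's `config.py.example` for the actual settings:

```python
# Illustrative sketch only; the authoritative keys are in config.py.example.
# POSTGRES_URL is an assumed name for the connection string the import tool
# uses to reach your PostgreSQL/PostGIS database.
POSTGRES_URL = "postgresql+asyncpg://obs:obs@localhost/obs"
```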
In dockerized setups, make sure to mount your data somewhere in the container.
Again, the development setup takes care of this, so you can use:
```bash
docker-compose run --rm api tools/import_osm.py \
/obsdata/baden-wuerttemberg-latest.msgpack
```
This process should take a few seconds to minutes, depending on the area
size. You can run the process multiple times, with the same or different area
files, to import or update the data. You can update only one region and leave
the others as they are, or add more filenames to the command line to
bulk-import data.
## How this works
* The transformation is done with a Python script that uses
[pyosmium](https://osmcode.org/pyosmium/) to read the `.osm.pbf` file. This
script filters the data for only the required objects (such as road
segments and administrative areas) and extracts the interesting information
from those objects (a sketch of this step follows after this list).
* The node geolocations are looked up to generate a geometry for each object.
This requires a lot of memory to run efficiently.
* The geometry is projected to [Web Mercator](https://epsg.io/3857) in this
step to avoid repeated reprojection when tiles are generated later. Most
operations will work fine in this projection. Projection is done with the
[pyproj](https://pypi.org/project/pyproj/) library.
* The output is written to a binary file in a very simple format using
[msgpack](https://github.com/msgpack/msgpack-python), which is far more
efficient than (Geo-)JSON, for example. This format is streamable, so the
generated file is never held fully in memory while being written or read.
* The import script reads the msgpack file and sends its contents to the database
using [psycopg](https://www.psycopg.org/). This library is used because it supports
PostgreSQL's `COPY FROM` statement, which enables much faster writes to the
database than a traditional `INSERT ... VALUES`. The file is streamed directly
to the database, so it is never read into memory as a whole (see the import
sketch below).
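To make the points above more concrete, here is a heavily condensed sketch of the transform step. It is not the actual `api/tools/transform_osm.py`; the filtering rule and record fields are illustrative assumptions, but the pyosmium filtering, the pyproj projection to Web Mercator and the streaming msgpack output follow the approach described above:

```python
# Condensed sketch, not the real transform_osm.py: filter roads with pyosmium,
# project node locations to Web Mercator with pyproj, stream records as msgpack.
import sys

import msgpack
import osmium
from pyproj import Transformer

# Reused transformer: WGS84 lon/lat -> Web Mercator (EPSG:3857).
PROJECT = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)


class RoadHandler(osmium.SimpleHandler):
    def __init__(self, out):
        super().__init__()
        self.out = out
        self.packer = msgpack.Packer()

    def way(self, w):
        # Keep only road segments; the real script also handles
        # administrative areas and extracts more attributes than shown here.
        if "highway" not in w.tags:
            return
        coords = []
        for n in w.nodes:
            if not n.location.valid():
                return  # node location missing, skip this way
            coords.append(PROJECT.transform(n.location.lon, n.location.lat))
        record = {"way_id": w.id, "name": w.tags.get("name"), "geometry": coords}
        # Pack and write each record immediately, so the output file is
        # streamed instead of being collected in memory.
        self.out.write(self.packer.pack(record))


if __name__ == "__main__":
    with open(sys.argv[2], "wb") as out:
        # locations=True lets pyosmium resolve node coordinates for ways;
        # this node location cache is what makes the step memory-hungry.
        RoadHandler(out).apply_file(sys.argv[1], locations=True)
```

The import side, again only as a rough sketch: records are streamed out of the msgpack file and written to PostgreSQL with `COPY` via psycopg 3. The table and column names are placeholders, and the geometry handling (which the real importer does properly) is omitted:

```python
# Rough sketch of the import side: msgpack streaming into COPY via psycopg 3.
# Table and column names are placeholders; real geometry would be sent as
# WKB/EWKB for PostGIS, which is omitted here.
import sys

import msgpack
import psycopg

with open(sys.argv[1], "rb") as f, psycopg.connect("dbname=obs user=obs") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY road (way_id, name) FROM STDIN") as copy:
            # The Unpacker yields one record at a time, so neither the file
            # nor the full row set is ever held in memory.
            for record in msgpack.Unpacker(f, raw=False):
                copy.write_row((record["way_id"], record["name"]))
```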