Docs for new pipeline

Paul Bienkowski 2023-03-26 20:32:12 +02:00
parent ac90d50239
commit c4cc4a9078
4 changed files with 106 additions and 41 deletions


@@ -164,7 +164,7 @@ You will need to re-run this command after updates, to migrate the database and
 (re-)create the functions in the SQL database that are used when generating
 vector tiles.
-You should also import OpenStreetMap data now, see below for instructions.
+You should also [import OpenStreetMap data](docs/osm-import.md) now.
 ### Boot the application
@@ -190,46 +190,6 @@ docker-compose run --rm api alembic upgrade head
 ```
-## Import OpenStreetMap data
-You need to import road information from OpenStreetMap for the portal to work.
-This information is stored in your PostgreSQL database and used when processing
-tracks (instead of querying the Overpass API), as well as for vector tile
-generation. The process applies to both development and production setups. For
-development, you should choose a small area for testing, such as your local
-county or city, to keep the amount of data small. For production use you have
-to import the whole region you are serving.
-* Install `osm2pgsql`.
-* Download the area(s) you would like to import from [GeoFabrik](https://download.geofabrik.de).
-* Import each file like this:
-```bash
-osm2pgsql --create --hstore --style roads_import.lua -O flex \
-  -H localhost -d obs -U obs -W \
-  path/to/downloaded/myarea-latest.osm.pbf
-```
-You might need to adjust the host, database and username (`-H`, `-d`, `-U`) to
-your setup, and also provide the correct password when queried. For the
-development setup the password is `obs`. For production, you might need to
-expose the containers port and/or create a TCP tunnel, for example with SSH,
-such that you can run the import from your local host and write to the remote
-database.
-The import process should take a few seconds to minutes, depending on the area
-size. A whole country might even take one or more hours. You should probably
-not try to import `planet.osm.pbf`.
-You can run the process multiple times, with the same or different area files,
-to import or update the data. However, for this to work, the actual [command
-line arguments](https://osm2pgsql.org/doc/manual.html#running-osm2pgsql) are a
-bit different each time, including when first importing, and the disk space
-required is much higher.
-Refer to the documentation of `osm2pgsql` for assistance. We are using "flex
-mode", the provided script `roads_import.lua` describes the transformations
-and extractions to perform on the original data.
 ## Troubleshooting


@@ -36,6 +36,8 @@ services:
       - ./tile-generator/data/:/tiles
       - ./api/migrations:/opt/obs/api/migrations
       - ./api/alembic.ini:/opt/obs/api/alembic.ini
+      - ./local/pbf:/pbf
+      - ./local/obsdata:/obsdata
     depends_on:
       - postgres
       - keycloak

docs/osm-import.md (new file, 103 lines)

@@ -0,0 +1,103 @@
# Importing OpenStreetMap data
The application requires a lot of data from OpenStreetMap to work.
The required information is stored in the PostgreSQL database and used when
processing tracks, as well as for vector tile generation. The process applies
to both development and production setups. For development, you should choose a
small area for testing, such as your local county or city, to keep the amount
of data small. For production use you have to import the whole region you are
serving.
## General pipeline overview
1. Download OpenStreetMap data as one or more `.osm.pbf` files.
2. Transform this data to generate geometry data for all roads and regions, so
we don't need to look up nodes separately. This step requires a lot of CPU
and memory, so it can be done "offline" on a high-powered machine.
3. Import the transformed data into the PostgreSQL/PostGIS database.
## Community hosted transformed data
Since the first two steps are the same for everybody, the community will soon
provide a service where relatively up-to-date transformed data can be
downloaded for direct import. Stay tuned.
## Download data
[GeoFabrik](https://download.geofabrik.de) kindly hosts extracts of the
OpenStreetMap planet by region. Download all regions you're interested in from
there in `.osm.pbf` format, with the tool of your choice, e.g.:
```bash
wget -P local/pbf/ https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf
```
## Transform data
To transform downloaded data, you can either use the Docker image from a
development or production environment, or install the API locally into your
Python environment. Then run the `api/tools/transform_osm.py` script on the data.
```bash
api/tools/transform_osm.py baden-wuerttemberg-latest.osm.pbf baden-wuerttemberg-latest.msgpack
```
In dockerized setups, make sure to mount your data somewhere in the container
and also mount a directory where the result can be written. The development
setup takes care of this, so you can use:
```bash
docker-compose run --rm api tools/transform_osm.py \
/pbf/baden-wuerttemberg-latest.osm.pbf /obsdata/baden-wuerttemberg-latest.msgpack
```
Repeat this command for every file you want to transform.
## Import transformed data
The command for importing looks like this:
```bash
api/tools/import_osm.py baden-wuerttemberg-latest.msgpack
```
This tool reads your application config from `config.py`, so set that up first
as if you were setting up your application.
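For orientation, here is a minimal sketch of such a config. The key name `POSTGRES_URL` and the example connection string are assumptions for illustration; check the project's `config.py.example` for the actual settings:

```python
# Illustrative sketch only; the authoritative keys are in config.py.example.
# POSTGRES_URL is an assumed name for the connection string the import tool
# uses to reach your PostgreSQL/PostGIS database.
POSTGRES_URL = "postgresql+asyncpg://obs:obs@localhost/obs"
```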
In dockerized setups, make sure to mount your data somewhere in the container.
Again, the development setup takes care of this, so you can use:
```bash
docker-compose run --rm api tools/import_osm.py \
/obsdata/baden-wuerttemberg-latest.msgpack
```
This process should take a few seconds to minutes, depending on the area
size. You can run the process multiple times, with the same or different area
files, to import or update the data. You can update only one region and leave
the others as they are, or add more filenames to the command line to
bulk-import data.
## How this works
* The transformation is done with a Python script that uses
[pyosmium](https://osmcode.org/pyosmium/) to read the `.osm.pbf` file. This
script filters the data for only the required objects (such as road
segments and administrative areas) and extracts the interesting information
from those objects (a sketch of this step follows after this list).
* The node geolocations are looked up to generate a geometry for each object.
This requires a lot of memory to run efficiently.
* The geometry is projected to [Web Mercator](https://epsg.io/3857) in this
step to avoid repeated reprojection when tiles are generated later. Most
operations will work fine in this projection. Projection is done with the
[pyproj](https://pypi.org/project/pyproj/) library.
* The output is written to a binary file in a very simple format using
[msgpack](https://github.com/msgpack/msgpack-python), which is far more
efficient than (Geo-)JSON, for example. This format is streamable, so the
generated file is never held fully in memory while being written or read.
* The import script reads the msgpack file and sends its contents to the database
using [psycopg](https://www.psycopg.org/). This library is used because it supports
PostgreSQL's `COPY FROM` statement, which enables much faster writes to the
database than a traditional `INSERT ... VALUES`. The file is streamed directly
to the database, so it is never read into memory as a whole (see the import
sketch below).
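To make the points above more concrete, here is a heavily condensed sketch of the transform step. It is not the actual `api/tools/transform_osm.py`; the filtering rule and record fields are illustrative assumptions, but the pyosmium filtering, the pyproj projection to Web Mercator and the streaming msgpack output follow the approach described above:

```python
# Condensed sketch, not the real transform_osm.py: filter roads with pyosmium,
# project node locations to Web Mercator with pyproj, stream records as msgpack.
import sys

import msgpack
import osmium
from pyproj import Transformer

# Reused transformer: WGS84 lon/lat -> Web Mercator (EPSG:3857).
PROJECT = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)


class RoadHandler(osmium.SimpleHandler):
    def __init__(self, out):
        super().__init__()
        self.out = out
        self.packer = msgpack.Packer()

    def way(self, w):
        # Keep only road segments; the real script also handles
        # administrative areas and extracts more attributes than shown here.
        if "highway" not in w.tags:
            return
        coords = []
        for n in w.nodes:
            if not n.location.valid():
                return  # node location missing, skip this way
            coords.append(PROJECT.transform(n.location.lon, n.location.lat))
        record = {"way_id": w.id, "name": w.tags.get("name"), "geometry": coords}
        # Pack and write each record immediately, so the output file is
        # streamed instead of being collected in memory.
        self.out.write(self.packer.pack(record))


if __name__ == "__main__":
    with open(sys.argv[2], "wb") as out:
        # locations=True lets pyosmium resolve node coordinates for ways;
        # this node location cache is what makes the step memory-hungry.
        RoadHandler(out).apply_file(sys.argv[1], locations=True)
```

The import side, again only as a rough sketch: records are streamed out of the msgpack file and written to PostgreSQL with `COPY` via psycopg 3. The table and column names are placeholders, and the geometry handling (which the real importer does properly) is omitted:

```python
# Rough sketch of the import side: msgpack streaming into COPY via psycopg 3.
# Table and column names are placeholders; real geometry would be sent as
# WKB/EWKB for PostGIS, which is omitted here.
import sys

import msgpack
import psycopg

with open(sys.argv[1], "rb") as f, psycopg.connect("dbname=obs user=obs") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY road (way_id, name) FROM STDIN") as copy:
            # The Unpacker yields one record at a time, so neither the file
            # nor the full row set is ever held in memory.
            for record in msgpack.Unpacker(f, raw=False):
                copy.write_row((record["way_id"], record["name"]))
```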