
Import FANTOIR database

Description

rVIPER contains parsing code for the FANTOIR database, a database of all streets, ways, private lots, and "pseudo-voies" such as metro stations in France.

FANTOIR

FANTOIR is the "fichier des voies et lieux-dits" (file of ways and named places) produced by the DGFiP, the French public finances administration.

The database is distributed as a fixed-width text file: each line is a record, and each piece of information is located between characters ... and ..., a format that is difficult to work with:

 ENEVERS                  2022110120223060000000
010        AIN                                             00000000000000 00000000000000
010001    WL'ABERGEMENT-CLEMENCIAT        N  3      000082500000000000000 00000001987001
010001A008WLOT BELLEVUE                   N  3  0          00000000000000 00000002001351               000592   BELLEVUE
010001A015DLOT LES CHARMILLES             N  3  0          00000000000000 00000001998274               000562   CHARMILL
010001A025PLOT LES COQUELICOTS            N  3  0          00000000000000 00000001999300               000572   COQUELIC
010001A028TLOT LES LILAS                  N  3  0          00000000000000 00000002001025               000582   LILAS
010001A030VLOT MUNETVILLE                 N  3  0          00000000000000 00000001991302               000522   MUNETVIL
010001A035ALOT LES MURIERS                N  3  0          00000000000000 00000002003352               000602   MURIERS
010001A100WLOT LES TROIS CHENES           N  3  0          00000000000000 00000001996150               000552   CHENES

Documentation of the format is available at https://data.economie.gouv.fr/api/datasets/1.0/fichier-fantoir-des-voies-et-lieux-dits/attachments/descriptif_du_fichier_national_fantoir_pdf
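
For illustration, a minimal Rust sketch of how a record can be sliced by character positions; the column ranges used below are assumptions for the example, the authoritative layout being the descriptif linked above:

```rust
/// Extract the field between 1-based, inclusive character positions,
/// the convention used by the format description. Out-of-range
/// positions yield an empty field; padding spaces are trimmed.
fn field(line: &str, start: usize, end: usize) -> &str {
    let bytes = line.as_bytes();
    let start = start - 1;
    let end = end.min(bytes.len());
    if start >= end {
        return "";
    }
    std::str::from_utf8(&bytes[start..end]).unwrap_or("").trim()
}

fn main() {
    // Start of one of the sample records above (truncated for brevity).
    let line = "010001A008WLOT BELLEVUE";

    // Column ranges here are illustrative assumptions; the descriptif
    // gives the authoritative positions for each field.
    let departement = field(line, 1, 2); // "01"
    let commune = field(line, 4, 6);     // "001"
    println!("département {departement}, commune {commune}");
}
```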

We'd like to modernize that dataset so we can make further use of it.

The current plan is:

  • Import data from this file into a relational database
  • Build a pipeline (DAG workflow?) to import fresh data when a new file is released
  • See if we can enrich data from Wikidata, OSM, BAN, BANO or other sources

PostgreSQL has been picked because:

  • PostgreSQL is pleasant to use from the Rust ecosystem (the first idea was to use Diesel, but as an ORM isn't really needed here, we use sqlx)
  • It offers fast search on way names, thanks to trigram indexes (see the sketch below)
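
For illustration, a minimal sqlx sketch of the trigram search, assuming a fantoir table with a libelle column and a DATABASE_URL environment variable (placeholders, not the final schema), and sqlx built with the tokio runtime and postgres features:

```rust
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // DATABASE_URL is a standard PostgreSQL DSN,
    // e.g. postgres://user:password@host/database (placeholder).
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL not set");
    let pool = PgPoolOptions::new().max_connections(5).connect(&url).await?;

    // Enable the trigram extension and index the way name column.
    // Table and column names (fantoir, libelle) are placeholders.
    sqlx::query("CREATE EXTENSION IF NOT EXISTS pg_trgm")
        .execute(&pool)
        .await?;
    sqlx::query(
        "CREATE INDEX IF NOT EXISTS idx_fantoir_libelle_trgm
         ON fantoir USING gin (libelle gin_trgm_ops)",
    )
    .execute(&pool)
    .await?;

    // Fuzzy search on a way name; the % operator uses the trigram index.
    let rows: Vec<(String,)> = sqlx::query_as(
        "SELECT libelle FROM fantoir WHERE libelle % $1
         ORDER BY similarity(libelle, $1) DESC LIMIT 10",
    )
    .bind("BELLEVUE")
    .fetch_all(&pool)
    .await?;

    for (libelle,) in rows {
        println!("{libelle}");
    }
    Ok(())
}
```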

This task focuses on dataset building, i.e. creating a PostgreSQL database with all the information, and the tooling to update the data in it.

Next steps

From there, several things can be done:

  • Create a front-end to search it
  • Distribute JSON or XML files with rich information split by department
  • Help Wikidata detect disparities between the values in the FANTOIR file and the values on Wikidata elements
  • Add PostgreSQL support to rVIPER to use our dataset instead of the text file

Event Timeline

rDS now has a comprehensive fantoir-datasource tool.

Components to deploy

The next step is the deployment, with the following components:

  • A PostgreSQL server or cluster
  • fantoir-datasource, installed
  • A pipeline to execute it

PostgreSQL

The server or cluster should be reachable not only by our import tool but also by the rDB front-end.

A specialized Docker container could be fine, or a more generic one like acquisitariat.
But a bare-metal or VM installation is also a possible solution.

Installer

We need to install it either:

  • on the Docker PaaS, with a Docker image + a command wrapper
  • on the PostgreSQL server, if it's dedicated to datasources, as a FreeBSD package to install

The Docker combo, running it with a connection to a specified PostgreSQL server, seems preferable.

Pipeline engine

The pipeline could be driven by Jenkins (at the CD level, cd.nasqueron.org). The initial idea was to install a specific tool like Apache Airflow to run the pipeline as a DAG.

Can we easily build a pipeline on Salt? A reactor with events would be fine, but what about DAG visibility?

If the pipeline runs in a Docker container, how do we run fantoir-datasource? For Jenkins, we could add it to rust_brown (the Rust executor).

For Apache Airflow, full access to our main Docker engine looks dangerous; Apache Airflow with a dedicated Docker/k8s infrastructure is perhaps more relevant.

Pipeline itself

The ideal operation order is documented at https://agora.nasqueron.org/Fantoir-datasource

Two components are missing:

  • Send a notification; we can use T771
  • Configure rDB to use the new table $FANTOIR_TABLE; whether Consul or etcd is the ideal client/API remains to be decided, but in both cases an HTTP API exists (see the sketch below)
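
For illustration, a minimal sketch of that configuration call, assuming Consul is picked and using reqwest; the Consul address and the KV path datasources/fantoir/table are placeholders:

```rust
use reqwest::Client;

/// Publish the freshly imported table name so rDB can switch to it.
/// Consul KV HTTP API: PUT /v1/kv/<key> with the value as the request body.
async fn publish_fantoir_table(table: &str) -> Result<(), reqwest::Error> {
    let client = Client::new();
    client
        .put("http://localhost:8500/v1/kv/datasources/fantoir/table")
        .body(table.to_string())
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // $FANTOIR_TABLE, e.g. the table created by the latest import run.
    let table = std::env::var("FANTOIR_TABLE").expect("FANTOIR_TABLE not set");
    publish_fantoir_table(&table).await
}
```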

To configure Airflow, we may need a PostgreSQL DSN.

If so, see https://devcentral.nasqueron.org/T1791#25765 for how to build it.

Airflow is deployed, so we can test D2754.

Still to be configured: