This repository contains the implementation and materials of the following paper:
Data driven discovery of human mobility models
Hao Guo†, Weiyu Zhang†, Junjie Yang, Yuanqiao Hou, Lei Dong∗, Yu Liu∗
Abstract: Human mobility is a fundamental aspect of social behavior, with broad applications in transportation, urban planning, and epidemic modeling. However, for decades new mathematical formulas to model mobility phenomena have been scarce and usually discovered by analogy to physical processes, such as the gravity model and the radiation model. These sporadic discoveries are often thought to rely on intuition and luck in fitting empirical data. Here, we propose a systematic approach that leverages symbolic regression to automatically discover interpretable models from human mobility data. Our approach finds several well-known formulas, such as the distance decay effect and classical gravity models, as well as previously unknown ones, such as an exponential-power-law decay that can be explained by the maximum entropy principle. By relaxing the constraints on the complexity of model expressions, we further show how key variables of human mobility are progressively incorporated into the model, making this framework a powerful tool for revealing the underlying mathematical structures of complex social phenomena directly from observational data.
FlowSR/
│
├── .gitignore
├── LICENSE
├── README.md
│
├── Data/
│
├── Existing_models_evaluation/
│
├── FlowSR_Julia/
│ ├── geographical_heterogeneity_analysis/
│ ├── symbolic_regression_on_real_data/
│ └── symbolic_regression_on_synthetic_data/
- Data: This folder contains the download link of publicly available datasets (the US and England datasets) in this study.
- Existing_models_evaluation: This directory contains evaluations of existing models, implemented in Python.
- FlowSR_Julia: This directory contains our proposed framework, implemented using Julia and the modified SymbolicRegression.jl package.
  - symbolic_regression_on_real_data: This directory contains the code to replicate our main results on real datasets.
  - geographical_heterogeneity_analysis: This directory contains the code for the Geographical Heterogeneity of Mobility Models section.
  - symbolic_regression_on_synthetic_data: This directory contains the code for experiments on synthetic data in the Discussion.
The project is implemented in Julia. Please install Julia from the official website.
Julia's built-in package manager handles the project environment. First, clone the repository to your local machine. Then, activate the project environment by running the following commands in a terminal:
cd path/to/FlowSR_Julia
julia # start julia REPL
using Pkg
Pkg.activate(".")
Pkg.instantiate()
Our project is based on SymbolicRegression.jl, which we modified in a forked repository to search for mobility flow allocation models.
The modified package requires manual installation. Please clone the modified SymbolicRegression.jl to your local machine. Then run the following commands to activate the FlowSR_Julia environment in the Julia REPL and install the modified package:
cd path/to/FlowSR_Julia
julia
using Pkg
Pkg.activate(".")
Pkg.develop(path="path/to/SymbolicRegression.jl")
After preparing the environment and dependencies, you can run the following command to perform the symbolic regression:
cd path/to/FlowSR_Julia
julia --project="." --threads=4 srflow_us.jl # take the US dataset as an example
The evaluation of existing models is conducted using Python. Execute the benchmark_allocation.py script directly:
python benchmark_allocation.py
The download links of England and US in this study are as follows:
Extract the downloaded zip files to the corresponding folder under Data/; the code then runs without any additional modifications.
This dataset contains the location of usual residence and place of work of employed residents in England, aggregated to MLAD level, as well as the residential and workplace population of each MLAD. Both the commuting flow data and the population data are collected from the 2011 UK Census.
- England_mlad_census11_attr.xlsx
  - respop: residential population of each MLAD (in 10^4 persons), collected from UK Data Service
  - workpop: workplace population of each MLAD (in 10^4 persons), collected from UK Data Service
  - centx/centy: projected coordinates of polygon centroid for each MLAD (in meters, EPSG: 27700), calculated from the official boundary shapefile in 2011
- England_mlad_census11_supp3.pkl: number of commuters from each origin (residence) MLAD to each destination (work) MLAD
- England_mlad_dist.pkl: spherical distances between each pair of MLADs, calculated based on the projected coordinates
- England_mlad_iores.pkl: intervening opportunities from each origin MLAD to each destination MLAD, calculated based on the residential population
- England_mlad_iowork.pkl: intervening opportunities from each origin MLAD to each destination MLAD, calculated based on the workplace population
This dataset contains the location of usual residence and place of work of employed residents in the Contiguous US, aggregated to the county level, as well as the residential and workplace population of each county. Both the commuting flow data and the population data are collected from the American Community Survey (2011-2015 ACS 5-year estimate).
- us_acs15_county_attr.xlsx
  - respop: residential population of each county (in 10^4 persons), collected from Census.gov
  - workpop: workplace population of each county (in 10^4 persons), calculated from the commuting flow data (workers from Contiguous US only)
  - centx/centy: longitude/latitude of polygon centroid for each county (in degrees), calculated from the official boundary shapefile in 2015
- us_acs15_county_flow.pkl: number of commuters from each origin (residence) county to each destination (work) county
- us_county_dist.pkl: spherical distances between each pair of counties, calculated based on the geographical coordinates
- us_county_iores.pkl: intervening opportunities from each origin county to each destination county, calculated based on the residential population
- us_county_iowork.pkl: intervening opportunities from each origin county to each destination county, calculated based on the workplace population
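The spherical distances listed above can be reproduced in spirit with the haversine formula. The following is a minimal sketch, not the script used to build the .pkl files: the Earth radius, the kilometre unit, and the function names `haversine_km` and `pairwise_distances` are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the dataset's unit is not stated


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pairwise_distances(centroids):
    """Build a nested dict dist[a][b] from a mapping {unit_id: (lon, lat)}."""
    return {
        a: {b: haversine_km(*pa, *pb) for b, pb in centroids.items()}
        for a, pa in centroids.items()
    }


# Hypothetical centroids (lon, lat); real values come from centx/centy in the attribute table.
toy_centroids = {"A": (-86.6, 32.5), "B": (-87.7, 30.7)}
toy_dist = pairwise_distances(toy_centroids)
```

The nested-dictionary output mirrors the layout of the distance files, so `toy_dist["A"]["B"]` is read the same way as an entry of the shipped data.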
This dataset contains aggregated inter-county human mobility flows from November 4 to November 10, 2019, provided by China Unicom, as well as population for each county in 2019, collected from official statistical yearbooks.
- BTH_county_attr.xlsx
  - pop: household registered population in 2019 (in 10^4 persons), collected from 2020 official statistical yearbooks
  - lon/lat: longitude/latitude of each county, retrieved from the Geocoder API provided by amap.com
- BTH_county_flow.pkl: number of movements from each origin county to each destination county, provided by China Unicom
- BTH_county_dist.pkl: spherical distances between each pair of counties, calculated based on the geographical coordinates
- BTH_county_io.pkl: intervening opportunities from each origin county to each destination county, calculated based on the household population
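For readers unfamiliar with intervening opportunities, here is a minimal sketch of one common definition: for origin i and destination j, sum the population of every other place that is closer to i than j is. Whether the shipped files include the origin or destination population, and how ties are handled, is not stated here, so treat those choices (and the function name `intervening_opportunities`) as assumptions.

```python
def intervening_opportunities(dist, pop):
    """Return io[i][j]: total population of places k (k != i, k != j) that lie
    closer to origin i than destination j does. Ties follow the sort order."""
    io = {}
    for i in dist:
        # Rank all other places by their distance from origin i.
        others = sorted((k for k in dist[i] if k != i), key=lambda k: dist[i][k])
        io[i] = {}
        running = 0.0
        for k in others:
            io[i][k] = running  # population accumulated before reaching k
            running += pop[k]
    return io


# Toy inputs in the same nested-dict layout as the distance files.
toy_dist = {
    "A": {"A": 0.0, "B": 1.0, "C": 2.0},
    "B": {"A": 1.0, "B": 0.0, "C": 1.5},
    "C": {"A": 2.0, "B": 1.5, "C": 0.0},
}
toy_pop = {"A": 10.0, "B": 20.0, "C": 30.0}
toy_io = intervening_opportunities(toy_dist, toy_pop)
```

Sorting once per origin keeps the cost at O(n log n) per origin instead of scanning all places for every origin-destination pair.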
You can read the .pkl files above using the pickle package in Python (or the Pickle package in Julia). An example in Python:
import pickle
with open("../Data/US/us_acs15_county_flow.pkl", "rb") as flow_file:
    flow_dict = pickle.load(flow_file)
The file structure is a nested dictionary: flow_dict[ID_of_A][ID_of_B] is the flow volume from county A to county B. The distance and intervening-opportunity files are used in the same way. The spatial unit IDs are defined as follows:
- England: MLAD_CODE in England_mlad_census11_attr.xlsx, with the prefix E41 removed.
- US: GEOID in us_acs15_county_attr.xlsx
- Beijing-Tianjin-Hebei: 6-digit code in BTH_county_attr.xlsx
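Once loaded, the nested dictionaries can be joined into flat records for analysis. The sketch below uses small stand-in dictionaries rather than the real files; the key type (strings here), the helper name `flatten_flows`, and the assumption that every flow pair also appears in the distance dictionary are illustrative and should be checked against the actual data.

```python
def flatten_flows(flow_dict, dist_dict):
    """Return a list of (origin, destination, flow, distance) tuples."""
    records = []
    for origin, destinations in flow_dict.items():
        for destination, flow in destinations.items():
            # Look up the matching distance with the same pair of IDs.
            records.append((origin, destination, flow, dist_dict[origin][destination]))
    return records


# Stand-ins for the dictionaries loaded from the .pkl files (hypothetical IDs and values).
toy_flow = {"01001": {"01003": 12, "01005": 3}, "01003": {"01001": 7}}
toy_dist = {
    "01001": {"01003": 210.5, "01005": 130.2},
    "01003": {"01001": 210.5},
}
records = flatten_flows(toy_flow, toy_dist)
total_outflow = {o: sum(d.values()) for o, d in toy_flow.items()}
```

The same loop works for the intervening-opportunity dictionaries, since they share the origin-then-destination layout.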
- Options: Various attributes are added to this class, including allocation, eval_probability, ori_sep, num_places, optimize_hof.
  - If allocation==true, ori_sep is required as an n-dim vector, where n is the number of places; dataset entries ori_sep[i-1]+1:ori_sep[i] correspond to flows with origin i. Alternatively, you may input an n*n adj matrix, which is transformed into ori_sep. num_places will be calculated automatically.
- eval_loss: Generate partition if allocation==true.
- _eval_loss: Perform probability normalization if allocation==true. If eval_probability==true, do not multiply total outflow.
- batch_sample: Sample from 1:num_places instead of 1:dataset.n if allocation==true.
- _equation_search: If optimize_hof==true, Hall-of-Fame equations will be optimized with the entire dataset (even if batching==true) after the last s_r_cycle.
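To make the ori_sep partition and the probability normalization concrete, here is a minimal Python paraphrase. It is not the Julia code of the modified package: the function names `build_ori_sep` and `allocate` are hypothetical, rows are assumed to be sorted by origin, scores are assumed positive, and "total outflow" is taken to mean the sum of observed flows for each origin.

```python
def build_ori_sep(origins):
    """origins: origin index (1..n) of each dataset row, with rows sorted by origin.
    Returns ori_sep such that rows ori_sep[i-1]+1 .. ori_sep[i] (1-based) have origin i."""
    n = max(origins)
    counts = [0] * (n + 1)
    for o in origins:
        counts[o] += 1
    ori_sep, total = [], 0
    for i in range(1, n + 1):
        total += counts[i]
        ori_sep.append(total)  # cumulative end position of origin i's block
    return ori_sep


def allocate(scores, observed, ori_sep, eval_probability=False):
    """Normalize raw scores within each origin block into probabilities.
    Unless eval_probability is set, scale them by that origin's observed outflow."""
    predicted, start = [], 0
    for end in ori_sep:
        block = scores[start:end]
        denom = sum(block)
        scale = 1.0 if eval_probability else sum(observed[start:end])
        predicted.extend(s / denom * scale for s in block)
        start = end
    return predicted


origins = [1, 1, 1, 2, 2, 3]           # three places, six origin-destination rows
scores = [1.0, 1.0, 2.0, 3.0, 1.0, 5.0]  # raw outputs of a candidate expression
observed = [10, 20, 10, 4, 4, 9]         # observed flows for the same rows
ori_sep = build_ori_sep(origins)
flows = allocate(scores, observed, ori_sep)
probs = allocate(scores, observed, ori_sep, eval_probability=True)
```

Storing only the cumulative block ends keeps the partition compact, and sampling whole origins (as batch_sample does with 1:num_places) keeps each normalization block intact within a batch.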