Source code and trained models for the paper "Comparative Analysis of Structure-Property Machine Learning models for Predicting Electrolyte Thermodynamic Windows".
General overview of the modeling framework
Clone this repository, then run setup.sh to create a virtual environment named binfo with the dependencies listed in requirements.txt.
git clone https://github.com/EnthusiasticTeslim/BatteryInformatics.git
cd BatteryInformatics
chmod +x setup.sh
sh setup.sh
source binfo/bin/activate
Important: All scripts for training models are available in Docker mode in the docker folder.
python src/descriptor/trainer.py -h
usage: trainer.py [-h] [--parent_directory PARENT_DIRECTORY] [--data_directory DATA_DIRECTORY] [--result_directory RESULT_DIRECTORY] [--src SRC] [--train_data TRAIN_DATA] [--test_data TEST_DATA] [--scale] [--hyperparameter HYPERPARAMETER] [--iterations ITERATIONS] [--cv CV] [--model MODEL] [--seed SEED]
options:
  --parent_directory PARENT_DIRECTORY
                        Path to main directory
  --data_directory DATA_DIRECTORY
                        where the data is stored in parent directory
  --result_directory RESULT_DIRECTORY
                        Path to result directory
  --src SRC             function source directory
  --train_data TRAIN_DATA
                        Path to train data
  --test_data TEST_DATA
                        Path to test data
  --scale               Scale data
  --hyperparameter HYPERPARAMETER
                        Hyperparameter space
  --iterations ITERATIONS
                        Number of iterations for hyperparameter optimization
  --cv CV               Number of cross-validation folds
  --model MODEL         Model to train
  --seed SEED           Random seed
  --morgan_fingerprint  Use Morgan Fingerprint (MFF) instead of RDKit descriptors
  --nbits NBITS         Number of bits for MFF
  --radius RADIUS       Radius for MFF
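The options listed above could be declared with argparse roughly as follows. This is a hypothetical reconstruction for illustration only, not the actual contents of src/descriptor/trainer.py: the function name build_parser, the argument types, and the absence of default values are all assumptions, while the flag names and help strings are taken from the help output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the descriptor trainer's command-line interface."""
    parser = argparse.ArgumentParser(prog="trainer.py")
    # Paths and file names (types assumed to be plain strings).
    parser.add_argument("--parent_directory", type=str, help="Path to main directory")
    parser.add_argument("--data_directory", type=str,
                        help="where the data is stored in parent directory")
    parser.add_argument("--result_directory", type=str, help="Path to result directory")
    parser.add_argument("--src", type=str, help="function source directory")
    parser.add_argument("--train_data", type=str, help="Path to train data")
    parser.add_argument("--test_data", type=str, help="Path to test data")
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--scale", action="store_true", help="Scale data")
    parser.add_argument("--morgan_fingerprint", action="store_true",
                        help="Use Morgan Fingerprint (MFF) instead of RDKit descriptors")
    # Model selection and hyperparameter search (integer types are assumed).
    parser.add_argument("--hyperparameter", type=str, help="Hyperparameter space")
    parser.add_argument("--iterations", type=int,
                        help="Number of iterations for hyperparameter optimization")
    parser.add_argument("--cv", type=int, help="Number of cross-validation folds")
    parser.add_argument("--model", type=str, help="Model to train")
    parser.add_argument("--seed", type=int, help="Random seed")
    parser.add_argument("--nbits", type=int, help="Number of bits for MFF")
    parser.add_argument("--radius", type=int, help="Radius for MFF")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--model", "SVR", "--scale", "--seed", "42",
    "--iterations", "100", "--cv", "5",
    "--hyperparameter", "hp_descriptor.yaml",
])
print(args.model, args.scale, args.morgan_fingerprint, args.cv)
```

Because --scale and --morgan_fingerprint are modeled here as switches, they take no value on the command line; the real script may handle them differently.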
The model and its predictions are saved in results/<MODEL>. For example, to train an SVR model using RDKit descriptors, you can use:
python src/descriptor/trainer.py --parent_directory YOUR_MAIN_FOLDER --result_directory results --data_directory data --train_data "train_data_cleaned.csv" --test_data "test_data_cleaned.csv" --scale --model SVR --seed 42 --iterations 100 --hyperparameter "hp_descriptor.yaml" --cv 5
To train all of the descriptor-based models (SVR, RandomForest, AdaBoostRegressor, GradientBoostingRegressor), run:
chmod a+x regenerate/descriptor.sh
./regenerate/descriptor.sh
python src/graph/trainer.py -h
usage: trainer.py [-h] [--parent_directory PARENT_DIRECTORY] [--result_directory RESULT_DIRECTORY] [--data_directory DATA_DIRECTORY] [--train_data TRAIN_DATA] [--test_data TEST_DATA] [--add_features]
                  [--skip_cv] [--epochs EPOCHS] [--start-epoch START_EPOCH] [--batch_size BATCH_SIZE] [--lr LR] [--gpu GPU] [--cv CV] [--dim_input DIM_INPUT] [--unit_per_layer UNIT_PER_LAYER] [--seed SEED]
                  [--num_feat NUM_FEAT] [--train]
options:
  --parent_directory PARENT_DIRECTORY
                        Path to main directory
  --result_directory RESULT_DIRECTORY
                        Path to result directory
  --data_directory DATA_DIRECTORY
                        where the data is stored in parent directory
  --train_data TRAIN_DATA
                        name of train data
  --test_data TEST_DATA
                        name of test data
  --add_features        if add features
  --skip_cv             if skip cross validation
  --epochs EPOCHS       number of total epochs to run
  --start-epoch START_EPOCH
                        manual epoch number (useful on restarts)
  --batch_size BATCH_SIZE
                        mini-batch size (default: 256)
  --lr LR               initial learning rate
  --gpu GPU             GPU ID to use.
  --cv CV               k-fold cross validation
  --dim_input DIM_INPUT
                        dimension of input
  --unit_per_layer UNIT_PER_LAYER
                        unit per layer
  --seed SEED           seed number
  --num_feat NUM_FEAT   number of additional features
  --train               if train
  --print_result        if print result
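The --cv and --seed options above point to seeded k-fold cross-validation. As a minimal, library-free sketch of that idea, the index splitting might look like the following. This is not the repository's implementation: the function name k_fold_indices and the shuffling strategy are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper: shuffles sample indices once with a fixed seed,
    then cuts them into k contiguous validation blocks.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Example: 10 samples, 5 folds, seed 42 (mirroring --cv 5 --seed 42).
folds = list(k_fold_indices(n_samples=10, k=5, seed=42))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold_id}: {len(train_idx)} train / {len(val_idx)} validation samples")
```

Each sample appears in exactly one validation block, so every data point is used for validation once and for training in the remaining folds.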
To train the GNN model, run:
chmod a+x regenerate/graph.sh
./regenerate/graph.sh
Its checkpoints and predictions will be saved in results/GNN.
For example, to train a GNN model, you can use:
python src/graph/trainer.py --parent_directory YOUR_MAIN_FOLDER --result_directory results --data_directory data --train_data "train_data_cleaned.csv" --test_data "test_data_cleaned.csv" --seed 42 --iterations 100 --train --cv 5
To test an already trained model, you can use:
python src/graph/trainer.py --parent_directory YOUR_MAIN_FOLDER --result_directory results --data_directory data --train_data "train_data_cleaned.csv" --test_data "test_data_cleaned.csv" --seed 42 --iterations 100 --cv 5
Under construction
- [x] Complete data cleaning.
- [x] Scripts for QSPR with RDKit descriptors.
- [x] Scripts for QSPR with Graph.
- [x] Train ML with RDKit and Graph.
- [ ] Set up ML model with transformer.
- [ ] Evaluate performances.
- [ ] Deploy models as a GUI.
@article{doi,
author = {Teslim Olayiwola and Jose Romagnoli},
title = {Comparative Analysis of Structure-Property Machine Learning models for Predicting Electrolyte Thermodynamic Windows},
journal = {n/a},
year = {n/a},
volume = {n/a},
number = {n/a},
doi = {https://doi.org/},
preprint = {Manuscript in Preparation}
}
BatteryInformatics is released under the MIT license. For use of specific models, please refer to the model licenses found in the original packages.