28 changes: 14 additions & 14 deletions .pre-commit-config.yaml
@@ -2,7 +2,7 @@ default_language_version:
   python: python3
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.6.0
+    rev: v6.0.0
     hooks:
       - id: check-added-large-files
       - id: check-byte-order-marker
@@ -21,69 +21,69 @@ repos:
       - id: trailing-whitespace
         exclude: .gitignore
   - repo: https://github.com/psf/black-pre-commit-mirror
-    rev: 24.4.2
+    rev: 26.1.0
     hooks:
       - id: black-jupyter
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.1
+    rev: v0.14.14
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]
   - repo: https://github.com/rbubley/mirrors-prettier
-    rev: v3.3.2
+    rev: v3.8.1
     hooks:
       - id: prettier
   - repo: https://github.com/google/yamlfmt
-    rev: v0.13.0
+    rev: v0.21.0
     hooks:
       - id: yamlfmt
   - repo: https://github.com/Yelp/detect-secrets
     rev: v1.5.0
     hooks:
       - id: detect-secrets
   - repo: https://github.com/pappasam/toml-sort
-    rev: v0.23.1
+    rev: v0.24.3
     hooks:
       - id: toml-sort-fix
         exclude: poetry.lock
   - repo: https://github.com/codespell-project/codespell
-    rev: v2.3.0
+    rev: v2.4.1
     hooks:
       - id: codespell
         additional_dependencies: [".[toml]"]
   - repo: https://github.com/sqlfluff/sqlfluff
-    rev: 3.1.0
+    rev: 4.0.0
     hooks:
       - id: sqlfluff-fix
   - repo: https://github.com/hadolint/hadolint
-    rev: v2.13.0-beta
+    rev: v2.14.0
     hooks:
       - id: hadolint-docker
   - repo: https://github.com/jsh9/markdown-toc-creator
-    rev: 0.0.6
+    rev: 0.1.3
     hooks:
       - id: markdown-toc-creator
   - repo: https://github.com/jumanjihouse/pre-commit-hooks
     rev: 3.0.0
     hooks:
       - id: check-mailmap
   - repo: https://github.com/python-poetry/poetry
-    rev: 1.8.0
+    rev: 2.3.1
     hooks:
       - id: poetry-check
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.10.1
+    rev: v1.19.1
     hooks:
       - id: mypy
         additional_dependencies:
           - types-requests
   - repo: https://github.com/srstevenson/nb-clean
-    rev: 3.3.0
+    rev: 4.0.1
     hooks:
       - id: nb-clean
         args: [--preserve-cell-outputs, --remove-empty-cells]
   - repo: https://github.com/abravalheri/validate-pyproject
-    rev: v0.18
+    rev: v0.24.1
     hooks:
       - id: validate-pyproject
         additional_dependencies:
22 changes: 22 additions & 0 deletions code_of_conduct.md
@@ -1,5 +1,27 @@
 # Contributor Covenant Code of Conduct
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [Our Pledge](#our-pledge)
+- [Our Standards](#our-standards)
+- [Enforcement Responsibilities](#enforcement-responsibilities)
+- [Scope](#scope)
+- [Enforcement](#enforcement)
+- [Enforcement Guidelines](#enforcement-guidelines)
+  - [1. Correction](#1-correction)
+  - [2. Warning](#2-warning)
+  - [3. Temporary Ban](#3-temporary-ban)
+  - [4. Permanent Ban](#4-permanent-ban)
+- [Attribution](#attribution)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 ## Our Pledge
 
 We as members, contributors, and leaders pledge to make participation in our
1 change: 1 addition & 0 deletions data/tabular/chem_caption_smarts/preprocess.py
@@ -1,5 +1,6 @@
 # flake8: noqa
 """Preprocess the raw outputs from the text-output of chem-caption to a tabular dataset."""
+
 from glob import glob
 
 import pandas as pd
271 changes: 135 additions & 136 deletions data/tabular/ld50_catmos/meta.yaml

Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions data/tabular/mona/example_processing_and_templates.ipynb
@@ -20,7 +20,6 @@
 "from tqdm import tqdm\n",
 "\n",
 "# import datasets\n",
-"import rdkit\n",
 "import rdkit.Chem as Chem\n",
 "import rdkit.RDLogger as RDLogger"
 ]
@@ -1444,7 +1443,7 @@
 " k = md[\"name\"]\n",
 " v = md.get(\"value\", np.nan)\n",
 " df_row[\"md_\" + transform_key(k)] = v\n",
-" if not (v is np.nan):\n",
+" if v is not np.nan:\n",
 " md_keys.append(k)\n",
 " md_key_counter.update(md_keys)\n",
 " compounds = entry.get(\"compound\", [])\n",
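The `if v is not np.nan` change in the second hunk is the style fix ruff enforces (`is not` instead of `not ... is`, rule E714); behavior is identical. Worth noting while touching this line: identity checks against `np.nan` only catch that exact singleton object, so a NaN produced elsewhere slips through. A small illustration (not part of the PR), with `pd.isna` as the value-based alternative:

```python
import numpy as np
import pandas as pd

v1 = np.nan          # the singleton the notebook compares against
v2 = float("nan")    # an equally missing value, but a different object

print(v1 is not np.nan)  # False -> v1 is recognized as missing
print(v2 is not np.nan)  # True  -> the identity check misses this NaN
print(pd.isna(v2))       # True  -> value-based check catches both
```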
4 changes: 2 additions & 2 deletions data/tabular/ocp/transform.py
@@ -21,8 +21,8 @@ def uniCode2Latex(text: str) -> str:
         text = text.replace(chr(code), f"$_{code-8320}$")
 
     text = text.replace("\u0305", "$^-$")
-    text = text.replace("\u207A", "$^+$")
-    text = text.replace("\u207B", "$^-$")
+    text = text.replace("\u207a", "$^+$")
+    text = text.replace("\u207b", "$^-$")
     text = text.replace("\u2074", "$^4$")
     text = text.replace("\u2070", "$^0$")
     text = text.replace("\u2078", "$^1$")
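The two `transform.py` edits only lowercase the hex digits of the escape sequences: hex digits in Python `\u` escapes are case-insensitive, so `"\u207A"` and `"\u207a"` denote the same superscript-plus character and the replacement output is unchanged. A quick check (illustrative, not from the repo):

```python
# Hex digits in \u escapes are case-insensitive: both spell U+207A.
assert "\u207A" == "\u207a" == "⁺"

# The mapping to LaTeX therefore behaves exactly as before the edit.
text = "CO2⁺ + e⁻"
text = text.replace("\u207a", "$^+$")
text = text.replace("\u207b", "$^-$")
print(text)  # CO2$^+$ + e$^-$
```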
5 changes: 0 additions & 5 deletions data/tabular/orbnet_denali/develop_transform.ipynb
@@ -25,11 +25,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from pathlib import Path\n",
 "from rdkit import Chem\n",
-"import matplotlib.pyplot as plt\n",
-"import numpy as np\n",
-"import os\n",
 "import pandas as pd\n",
 "from glob import glob"
 ]
@@ -474,7 +470,6 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from rdkit.Chem import rdDetermineBonds\n",
 "from chemnlp.utils import xyz_to_mol"
 ]
 },
19 changes: 18 additions & 1 deletion docs/CONTRIBUTING.md
@@ -1,5 +1,23 @@
 # Contributing to ChemNLP
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [Getting Started](#getting-started)
+- [Implementing a Dataset](#implementing-a-dataset)
+  - [meta.yaml Structure](#metayaml-structure)
+  - [transform.py Guidelines](#transformpy-guidelines)
+  - [Text Templates](#text-templates)
+- [Testing Your Contribution](#testing-your-contribution)
+- [Submitting Your Contribution](#submitting-your-contribution)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 Thank you for your interest in contributing to ChemNLP! There are many ways to contribute, including implementing datasets, improving code, and enhancing documentation.
 
 ## Getting Started
@@ -17,7 +35,6 @@ One of the most valuable contributions is implementing a dataset. Here's how to
 1. Choose a dataset from our [awesome list](https://github.com/kjappelbaum/awesome-chemistry-datasets) or add a new one there.
 2. Create an issue in this repository stating your intention to add the dataset.
 3. Make a Pull Request (PR) that adds a new folder in `data` with the following files:
-
    - `meta.yaml`: Describes the dataset (see structure below).
    - `transform.py`: Python code to transform the original dataset into a usable form.
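For orientation, most `transform.py` files under `data/` reduce to a small script that fetches the raw source, tidies the columns described in `meta.yaml`, and writes a cleaned table. A minimal sketch, assuming a CSV source: the URL and column names are placeholders, and the `data_clean.csv` output name reflects this repo's usual convention rather than anything mandated above:

```python
"""Sketch of a minimal transform.py for a new dataset folder."""
import pandas as pd


def transform() -> pd.DataFrame:
    # Placeholder source; use the dataset referenced in your meta.yaml.
    df = pd.read_csv("https://example.org/raw_dataset.csv")
    # Normalize column names so text templates can reference them consistently.
    df = df.rename(columns={"smiles": "SMILES", "ld50": "LD50"})
    df = df.dropna(subset=["SMILES"])
    df.to_csv("data_clean.csv", index=False)
    return df


if __name__ == "__main__":
    transform()
```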
20 changes: 20 additions & 0 deletions docs/api/meta_yaml_augmentor.md
@@ -1,5 +1,25 @@
 # Meta YAML Augmenter
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [Overview](#overview)
+- [generate_augmented_meta_yaml](#generate_augmented_meta_yaml)
+- [CLI Interface](#cli-interface)
+  - [Usage](#usage)
+  - [Arguments](#arguments)
+  - [Example](#example)
+- [Augmentation Process](#augmentation-process)
+- [Notes](#notes)
+- [Example Usage in Python](#example-usage-in-python)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 ## Overview
 
 The Meta YAML Augmenter is a tool designed to enhance existing `meta.yaml` files for chemical datasets. It uses Large Language Models (LLMs) to generate additional templates and improve the metadata structure, particularly focusing on advanced sampling methods and template formats.
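Going by the overview, the augmenter rewrites an existing `meta.yaml` with LLM-generated template material. A hypothetical sketch for orientation only: the import path, argument, and `model` keyword are assumptions, not the documented API (the page's own "Example Usage in Python" section has the real call):

```python
# Hypothetical usage sketch -- import path and signature are assumed.
from chemnlp.data.meta_yaml_augmenter import generate_augmented_meta_yaml

augmented = generate_augmented_meta_yaml(
    "data/tabular/ld50_catmos",  # folder holding the meta.yaml to enrich (illustrative)
    model="gpt-4",               # assumed keyword: the overview only says "LLMs"
)
print(augmented)
```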
14 changes: 14 additions & 0 deletions docs/api/meta_yaml_generator.md
@@ -1,5 +1,19 @@
 # Meta YAML Generator
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [Overview](#overview)
+- [`generate_meta_yaml`](#generate_meta_yaml)
+- [Usage Example](#usage-example)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 ## Overview
 
 The Meta YAML Generator is a tool designed to automatically create a `meta.yaml` file for chemical datasets using Large Language Models (LLMs). It analyzes the structure of a given DataFrame and generates a comprehensive metadata file, including advanced sampling methods and template formats.
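The overview fixes the contract: DataFrame in, `meta.yaml` content out. A sketch of plausible usage; only the function name comes from this page, while the import path and keyword names are assumptions:

```python
import pandas as pd

# Assumed import path; generate_meta_yaml itself is named in the TOC above.
from chemnlp.data.meta_yaml_generator import generate_meta_yaml

df = pd.read_csv("data_clean.csv")  # illustrative cleaned dataset
meta = generate_meta_yaml(
    df,
    dataset_name="ld50_catmos",                # illustrative
    description="Acute oral toxicity (LD50)",  # illustrative
)
```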
23 changes: 23 additions & 0 deletions docs/api/sampler.md
@@ -1,5 +1,28 @@
 # Sampler Module
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [Overview](#overview)
+- [TemplateSampler](#templatesampler)
+- [Class: TemplateSampler](#class-templatesampler)
+- [Initialization](#initialization)
+- [Configuration Options](#configuration-options)
+- [Main Methods](#main-methods)
+  - [`sample`](#sample)
+  - [`enable_class_balancing`](#enable_class_balancing)
+  - [`disable_class_balancing`](#disable_class_balancing)
+- [Identifier Wrapping](#identifier-wrapping)
+- [Usage Examples](#usage-examples)
+- [Notes](#notes)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 ## Overview
 
 The `sampler` module provides functionality for generating text samples based on templates and data. It is primarily used for creating datasets for natural language processing tasks in chemistry and related fields. The main class in this module is `TemplateSampler`, which allows for flexible text generation with support for multiple choice questions, class balancing, and identifier wrapping.
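The overview lists the ingredients: a data table, text templates, and optional class balancing, multiple choice, and identifier wrapping. A rough sketch of how these plug together; the method names `sample` and `enable_class_balancing` come from the TOC above, while the import path, constructor, and template syntax are assumptions:

```python
import pandas as pd

from chemnlp.data.sampler import TemplateSampler  # assumed import path

df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"], "is_toxic": [0, 1]})
template = "Is {SMILES} toxic? Answer: {is_toxic}"  # placeholder syntax is illustrative

sampler = TemplateSampler(df)               # assumed minimal constructor
sampler.enable_class_balancing("is_toxic")  # balance classes before sampling
print(sampler.sample(df.iloc[0], template))
```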
26 changes: 26 additions & 0 deletions docs/api/sampler_cli.md
@@ -1,5 +1,31 @@
 # Sampler CLI
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [Overview](#overview)
+- [Usage](#usage)
+- [Arguments](#arguments)
+- [Options](#options)
+- [Detailed Option Descriptions](#detailed-option-descriptions)
+  - [`chunksize`](#chunksize)
+  - [`class_balanced`](#class_balanced)
+  - [`benchmarking`](#benchmarking)
+  - [`multiple_choice`](#multiple_choice)
+  - [`additional_templates`](#additional_templates)
+  - [`use_standard_templates`](#use_standard_templates)
+  - [`wrap_identifiers`](#wrap_identifiers)
+- [Examples](#examples)
+- [Notes](#notes)
+- [Troubleshooting](#troubleshooting)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 ## Overview
 
 The Sampler CLI is a command-line interface tool designed to process chemical datasets using the `TemplateSampler`. It allows for flexible text generation based on templates, with support for various sampling scenarios including class balancing, benchmarking, and multiple-choice questions.
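The option names above are the documented surface; how the tool is launched is not shown in this hunk, so the module path and flag spellings in this sketch are assumptions:

```python
import subprocess

# Hypothetical invocation of the sampler CLI; adjust the module path and flag
# spellings to whatever the repository actually installs.
subprocess.run(
    [
        "python", "-m", "chemnlp.data.sampler_cli",  # assumed entry point
        "data/tabular/ld50_catmos",                  # dataset directory (illustrative)
        "--chunksize", "1000",
        "--class_balanced",
        "--wrap_identifiers",
    ],
    check=True,
)
```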
19 changes: 13 additions & 6 deletions experiments/README.md
@@ -1,5 +1,18 @@
 # Working with the Stability cluster
 
+<!--TOC-->
+
+______________________________________________________________________
+
+**Table of Contents**
+
+- [GPT-Neox](#gpt-neox)
+- [Hugging Face](#hugging-face)
+
+______________________________________________________________________
+
+<!--TOC-->
+
 We currently run our large scale experiments on the Stability AI HPC cluster.
 This subdirectory features a few helpful scripts that can help you get up and
 running on the cluster.
@@ -11,7 +24,6 @@ running on the cluster.
 
 1. [Create Environment](scripts/env_creation_neox.sh) -
    creates a basic conda environment for experiments.
-
    - Creates a conda environment at the prefix `CONDA_ENV_PATH` path.
      > Using the positional argument passed into the script
    - Clones `chemnlp` into your personal cluster `USER` directory.
@@ -28,7 +40,6 @@ running on the cluster.
 
 2. [Training Models](scripts/sbatch_train_neox.sh) -
    runs a GPT-NeoX training pipeline
-
    - creates a conda environment using the `env_creation_neox.sh` script.
    - runs the GPT-NeoX `train.py` script using the user configuration
      > as GPT-NeoX configurations can be combined, the PEFT configurations are held
@@ -48,7 +59,6 @@ running on the cluster.
 
 1. [Create Environment](scripts/env_creation_hf.sh) -
    creates a basic conda environment for experiments.
-
    - Creates a conda environment at the prefix `CONDA_ENV_PATH` path.
      > Using the positional argument passed into the script
    - Clones `chemnlp` into your personal cluster `USER` directory.
@@ -65,7 +75,6 @@ running on the cluster.
 
 2. [Single Node Models](scripts/sbatch_train_hf.sh) -
    runs a Hugging Face training pipeline across devices
-
    - creates a conda environment using the `env_creation_hf.sh` script.
    - runs the Hugging Face `run_tune.py` script with the user configuration
 
@@ -81,7 +90,6 @@ running on the cluster.
 
 3. [Multi Node Models](scripts/sbatch_train_hf_multinode.sh) -
    runs a Hugging Face training pipeline across nodes
-
    - creates a conda environment using the `env_creation_hf.sh` script.
    - runs the Hugging Face `run_tune.py` script with the user configuration
 
@@ -97,7 +105,6 @@ running on the cluster.
 
 4. [Grid Search](scripts/run_grid_search.py) -
    runs a grid search across training pipeline configuration options
-
    - Update the upper-case parameters at the top of the script
    - The script runs an exhaustive set of experiments across all permutations
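Item 4 describes the pattern: upper-case parameter lists at the top of `run_grid_search.py`, expanded into one run per permutation. The core of such a driver is an `itertools.product` loop; a generic sketch (parameter names invented, and job submission stubbed out rather than calling the real sbatch script):

```python
from itertools import product

# Upper-case parameters "at the top of the script", as the README puts it.
LEARNING_RATES = [1e-5, 3e-5, 1e-4]
BATCH_SIZES = [16, 32]

# One training job per element of the Cartesian product.
for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
    config = {"learning_rate": lr, "batch_size": bs}
    print(f"submitting job with {config}")  # real script would shell out to sbatch
```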