The BioCreative VI ChemProt dataset identifies chemical and protein entities and the relations between them. Compounds are generally agonists (activators) or antagonists (inhibitors) of proteins.
The project utilizes three specific entity labels:
- GENE-N
- GENE-Y
- CHEMICAL
There are 11 distinct relation classes used for classification:
| ID | Label |
|---|---|
| 0 | UNDEFINED |
| 1 | PART_OF |
| 2 | REGULATOR_OR_DIRECT_REGULATOR_OR_INDIRECT_REGULATOR_OF |
| 3 | ACTIVATOR_OR_UPREGULATOR_OR_INDIRECT_UPREGULATOR_OF |
| 4 | INHIBITOR_OR_DOWNREGULATOR_OR_INDIRECT_DOWNREGULATOR_OF |
| 5 | AGONIST_OR_AGONIST_ACTIVATOR_OR_AGONIST_INHIBITOR_OF |
| 6 | ANTAGONIST_OF |
| 7 | MODULATOR_OR_MODULATOR_ACTIVATOR_OR_MODULATOR_INHIBITOR_OF |
| 8 | COFACTOR_OF |
| 9 | SUBSTRATE_OR_PRODUCT_OR_SUBSTRATE_PRODUCT_OF |
| 10 | NO_RELATION |
Objective: Construct a knowledge graph from ChemProt abstracts by performing two distinct sub-tasks:
- Named Entity Recognition (NER):
  - Framing: Token classification.
  - Labels: `{B-GENE-N, I-GENE-N, B-GENE-Y, I-GENE-Y, B-CHEMICAL, I-CHEMICAL, O}`.
  - Overlap Handling: The implementation supports double-encoding to handle overlapping entities (tokens with multiple labels).
- Relation Extraction (RE):
  - Framing: Classify pairs of extracted entities into one of the 11 relation classes.
Evaluation Metrics: Average Precision, Recall, and F1 score.
- Tokenizer: `bert-base-uncased` (uncased BERT tokenizer).
- Embeddings: Retrieved from `TFBertModel` pretrained on `bert-base-uncased`.
- Segmentation: Abstracts are segmented into 128-token windows with striding and padding to ensure correct handling across segment boundaries (see the windowing sketch after this list).
- For NER Training: Tokens are labeled using the full tag set. Only valid label combinations present in the training data are encoded.
- For RE Training: Segments are indexed, and pairs of entity token spans are marked and stored in `re_mask`. Rows in `re_mask` are padded with `-1` where no more pairs exist.
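As a minimal illustration of the windowing step, the sketch below uses the Hugging Face fast tokenizer; the stride value, toy abstract, and variable names are illustrative assumptions, not the project's exact settings.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

abstract_text = "Haloperidol is a potent dopamine D2 receptor antagonist ..."  # toy abstract

# Split the abstract into overlapping 128-token windows with padding.
# stride=32 is an illustrative choice; the project's exact stride is not shown here.
encoding = tokenizer(
    abstract_text,
    max_length=128,
    truncation=True,
    padding="max_length",
    stride=32,
    return_overflowing_tokens=True,   # yields one row per 128-token window
    return_offsets_mapping=True,      # offsets are reused later to rebuild raw text
)

input_ids = encoding["input_ids"]            # (num_windows, 128)
offset_mapping = encoding["offset_mapping"]  # per-token character spans
```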
The project implements a Text→Graph pipeline using TensorFlow/Keras (TF 2.x idiom). The architecture consists of three main components.
A custom implementation of Conditional Random Fields for sequence tagging.
Parameters (Trainable):
- Start Transitions: Shape `(num_tags,)`. Initializer: GlorotUniform.
- Transition Matrix: Shape `(num_tags, num_tags)`. Initializer: GlorotUniform.
- End Transitions: Shape `(num_tags,)`. Initializer: GlorotUniform.
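A sketch of how these three trainable parameters could be declared inside a Keras layer's `build` method, assuming the shapes and GlorotUniform initializer listed above:

```python
import tensorflow as tf

class CRF(tf.keras.layers.Layer):
    def __init__(self, num_tags, **kwargs):
        super().__init__(**kwargs)
        self.num_tags = num_tags

    def build(self, input_shape):
        init = tf.keras.initializers.GlorotUniform()
        # Start/end transitions: score of a tag appearing first/last in a sequence.
        self.start_transitions = self.add_weight(
            name="start_transitions", shape=(self.num_tags,), initializer=init)
        self.end_transitions = self.add_weight(
            name="end_transitions", shape=(self.num_tags,), initializer=init)
        # transitions[i, j]: score of moving from tag i to tag j.
        self.transitions = self.add_weight(
            name="transitions", shape=(self.num_tags, self.num_tags), initializer=init)
```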
Key Logic:
- Forward Algorithm: Computes the log-partition function `log_likelihood` using `tf.while_loop` and `tf.reduce_logsumexp`.
- Score Sequence: Computes the `gold_score` of a text sequence with its correct labels.
- CRF Loss: Returns the mean of `-(gold_score - log_likelihood)`.
- Viterbi Decoding:
  - Initializes with the start transitions plus the first emissions.
  - Iteratively computes scores by expanding dimensions and adding the transition matrix.
  - Uses backtracking (via `TensorArray`) to produce the final best path as an integer tensor of shape `(batch_size, seq_len)`.
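For illustration, the sketch below computes the same loss with a plain Python loop instead of `tf.while_loop`, and ignores padding masks; shapes follow the parameter definitions above.

```python
import tensorflow as tf

def crf_neg_log_likelihood(emissions, tags, start_trans, trans, end_trans):
    """emissions: (batch, seq_len, num_tags); tags: (batch, seq_len) int32."""
    seq_len = emissions.shape[1]

    # Forward algorithm: log-partition over all possible tag sequences.
    alpha = start_trans + emissions[:, 0]                        # (batch, num_tags)
    for t in range(1, seq_len):
        # (batch, from_tag, 1) + (from_tag, to_tag) + (batch, 1, to_tag)
        scores = alpha[:, :, None] + trans + emissions[:, t][:, None, :]
        alpha = tf.reduce_logsumexp(scores, axis=1)
    log_partition = tf.reduce_logsumexp(alpha + end_trans, axis=1)

    # Score of the gold (correct) tag sequence.
    gold = tf.gather(start_trans, tags[:, 0])
    gold += tf.reduce_sum(tf.gather(emissions, tags, axis=2, batch_dims=2), axis=1)
    pair_idx = tf.stack([tags[:, :-1], tags[:, 1:]], axis=-1)    # consecutive tag pairs
    gold += tf.reduce_sum(tf.gather_nd(trans, pair_idx), axis=1)
    gold += tf.gather(end_trans, tags[:, -1])

    # Mean negative log-likelihood, i.e. the mean of -(gold_score - log_partition).
    return tf.reduce_mean(log_partition - gold)
```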
A hybrid BERT-BiLSTM-CRF architecture for the NER task.
Architecture Layers:
- BERT Backbone: `TFBertModel` (`bert-base-uncased`). Outputs `last_hidden_state` with shape `(batch_size, seq_len, bert_hidden_size)`.
- L2 Normalization: Applied to the BERT output before the LSTM.
- Bi-LSTM: Bidirectional LSTM wrapper. Output shape is `(batch_size, seq_len, 2 * hidden_dim)`.
- Dense Layer: Projects the LSTM output to `num_tags`. Output shape: `(batch_size, seq_len, num_tags)`.
- CRF: The custom CRF layer described above.
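A condensed sketch of this stack, assuming an illustrative `hidden_dim` and reusing the `CRF` layer sketched earlier:

```python
import tensorflow as tf
from transformers import TFBertModel

class BertBiLSTMCRF(tf.keras.Model):
    def __init__(self, num_tags, hidden_dim=128):  # hidden_dim is an illustrative value
        super().__init__()
        self.bert = TFBertModel.from_pretrained("bert-base-uncased")
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden_dim, return_sequences=True))
        self.dense = tf.keras.layers.Dense(num_tags)
        self.crf = CRF(num_tags)                   # custom CRF layer sketched above

    def emissions(self, input_ids, attention_mask):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x = tf.math.l2_normalize(x, axis=-1)       # L2 normalization before the LSTM
        x = self.bilstm(x)                         # (batch, seq_len, 2 * hidden_dim)
        return self.dense(x)                       # (batch, seq_len, num_tags)
```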
Training Logic (`train_step`):
- Forward: Computes emissions (logits) through BERT → L2 → BiLSTM → Dense.
- Loss: Calculates the CRF loss (`log_likelihood - gold_score`).
- Optimization: Gradients are computed via `tf.GradientTape` and applied via the optimizer.
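The project overrides Keras's `train_step`; for brevity, the sketch below expresses the same logic as a free function over the `BertBiLSTMCRF` and `crf_neg_log_likelihood` sketches above. How the batch unpacks into IDs, attention mask, and tags is an assumption about the input pipeline.

```python
import tensorflow as tf

@tf.function
def ner_train_step(model, optimizer, input_ids, attention_mask, tags):
    """One training step for the BertBiLSTMCRF sketch above."""
    with tf.GradientTape() as tape:
        emissions = model.emissions(input_ids, attention_mask)   # BERT → L2 → BiLSTM → Dense
        loss = crf_neg_log_likelihood(
            emissions, tags,
            model.crf.start_transitions, model.crf.transitions, model.crf.end_transitions)
    grads = tape.gradient(loss, model.trainable_variables)       # backprop through the whole stack
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```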
Inference Logic (`test_step`):
- If validating: Computes CRF loss.
- If testing: Runs Viterbi decoding to get prediction sequences. Updates metrics (TP/TN/FP/FN) by comparing predictions to true labels using boolean logic and reduction.
A relation extraction architecture based on transfer, independent, or joint learning, using Entity Pooling.
Architecture Layers:
- BERT Backbone: `TFBertModel` (`bert-base-uncased`).
- Layer Normalization (Input): Applied to the BERT embeddings.
- Bi-LSTM: Bidirectional LSTM.
  Transfer Learning Note: In transfer learning, weights are explicitly copied from the NER model's LSTM, whereas in joint learning a single Bi-LSTM layer is used for both tasks without layer/L2 normalizations.
- Layer Normalization (Output): Applied to the Bi-LSTM output.
- Dense Layer: Final classification layer with Softmax activation. Input size is `4 * hidden_dim` (concatenation of two pooled entities).
Input Structure (`re_mask`):
Input is a 2D array where each row encodes a candidate pair: `[e1_start, e1_end, relation_label, e2_start, e2_end]`.
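An illustrative `re_mask` for one segment with two candidate pairs (token indices and labels are made up):

```python
import numpy as np

# Each row: [e1_start, e1_end, relation_label, e2_start, e2_end]
re_mask = np.array([
    [ 4,  6, 3, 12, 12],    # entity at tokens 4-6, label 3 (ACTIVATOR_...), entity at token 12
    [12, 12, 6, 20, 22],    # entity at token 12, label 6 (ANTAGONIST_OF), entity at tokens 20-22
    [-1, -1, -1, -1, -1],   # padding: no more candidate pairs in this segment
], dtype=np.int32)
```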
Forward Pass Logic:
- Encoding: Input IDs flow through BERT → LayerNorm → BiLSTM → LayerNorm.
- Entity Pooling (`extract_entity_pairs`):
  - Iterates over valid pairs (where the first column of `re_mask` is not `-1`).
  - Pool Entity: Uses `tf.while_loop` to collect token vectors from the start index to the end index of an entity, then computes the mean vector of the span.
- Combination: Concatenates the pooled embeddings of Entity 1 and Entity 2. Shape: `(4 * hidden_dim,)`.
- Classification: The stacked combined embeddings are passed to the Dense layer to produce logits.
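A simplified, eager-mode sketch of the pooling and combination step for a single segment; the project uses `tf.while_loop`, but slicing and mean-pooling express the same computation.

```python
import tensorflow as tf

def pool_entity_pairs(sequence_output, re_mask):
    """sequence_output: (seq_len, 2 * hidden_dim) for one segment.
    re_mask rows: [e1_start, e1_end, relation_label, e2_start, e2_end]."""
    combined = []
    for row in re_mask:
        if row[0] == -1:                       # padding row: no more candidate pairs
            break
        # Mean-pool each entity span (end index is inclusive).
        e1 = tf.reduce_mean(sequence_output[row[0]:row[1] + 1], axis=0)
        e2 = tf.reduce_mean(sequence_output[row[3]:row[4] + 1], axis=0)
        combined.append(tf.concat([e1, e2], axis=0))   # (4 * hidden_dim,)
    return tf.stack(combined)                          # (num_pairs, 4 * hidden_dim)
```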
Training Logic:
- Extracts valid relation labels from `re_mask`.
- Computes `sparse_categorical_crossentropy` loss between the filtered labels and logits.
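A short sketch of this loss step, assuming `re_mask` as above and `probs` as the stacked Softmax outputs of the Dense layer for the valid pairs:

```python
import tensorflow as tf

valid = re_mask[:, 0] != -1                        # drop the -1 padding rows
labels = tf.boolean_mask(re_mask[:, 2], valid)     # relation label is stored in column 2
loss = tf.reduce_mean(
    tf.keras.losses.sparse_categorical_crossentropy(labels, probs))
```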
Performance is evaluated on the test split using Average Precision, Recall, and F1-score. Inferred entities and relations are then migrated to a graph database for visualization.
- Database Sanitization: Each run begins with a `DETACH DELETE` query to ensure the graph reflects the most recent test inference without stale data.
- Text Reconstruction: Uses BERT's `offset_mapping` to resolve token indices back into the original raw abstract text.
- Filtering: Only relations with functional biological significance are stored; no-relation predictions are discarded.
- Entity Normalization: Node names are stripped of whitespace and capitalized to ensure proper entity resolution.
- Upsert Strategy (Cypher `MERGE`):
  - Nodes: Entities (`:CHEMICAL`, `:GENE-Y`, or `:GENE-N`).
  - Relationships: Established as directed edges using the predicted relation type (e.g., `-[:ANTAGONIST_OF]->`).
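A minimal sketch of the sanitization and `MERGE` upsert using the official Neo4j Python driver; the connection settings, helper name, and example entities are illustrative, not the project's actual values.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder credentials

def upsert_relation(tx, chem, gene, gene_label, rel_type):
    # MERGE acts as an upsert: nodes and edges are created only if they do not already exist.
    query = (
        f"MERGE (c:CHEMICAL {{name: $chem}}) "
        f"MERGE (g:`{gene_label}` {{name: $gene}}) "   # backticks: GENE-Y / GENE-N contain a hyphen
        f"MERGE (c)-[:`{rel_type}`]->(g)"
    )
    tx.run(query, chem=chem, gene=gene)

with driver.session() as session:
    session.run("MATCH (n) DETACH DELETE n")  # database sanitization before loading new inferences
    session.execute_write(upsert_relation, "Haloperidol", "Drd2", "GENE-Y", "ANTAGONIST_OF")
```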
The project is organized into the following directories; `/data` contains the `.keras` models that achieved the highest F1 scores.
```
.
├── src/
│   ├── train/
│   │   ├── NER_RE_transfer_and_independent_learning.ipynb
│   │   └── NER+RE_joint_learning.ipynb
│   └── test/
│       └── Test_NER_RE.ipynb
└── data/
    ├── ner_model.keras
    └── re_model.keras
```