The BioCreative VI ChemProt dataset identifies chemical and protein entities and the relations between them. Compounds are generally agonists (activators) or antagonists (inhibitors) of proteins.
The project utilizes three specific entity labels:
- GENE-N
- GENE-Y
- CHEMICAL
There are 11 distinct relation classes used for classification:
| ID | Label |
|---|---|
| 0 | UNDEFINED |
| 1 | PART_OF |
| 2 | REGULATOR_OR_DIRECT_REGULATOR_OR_INDIRECT_REGULATOR_OF |
| 3 | ACTIVATOR_OR_UPREGULATOR_OR_INDIRECT_UPREGULATOR_OF |
| 4 | INHIBITOR_OR_DOWNREGULATOR_OR_INDIRECT_DOWNREGULATOR_OF |
| 5 | AGONIST_OR_AGONIST_ACTIVATOR_OR_AGONIST_INHIBITOR_OF |
| 6 | ANTAGONIST_OF |
| 7 | MODULATOR_OR_MODULATOR_ACTIVATOR_OR_MODULATOR_INHIBITOR_OF |
| 8 | COFACTOR_OF |
| 9 | SUBSTRATE_OR_PRODUCT_OR_SUBSTRATE_PRODUCT_OF |
| 10 | NO_RELATION |
Objective: Construct a knowledge graph from ChemProt abstracts by performing two distinct sub-tasks:
- Named Entity Recognition (NER):
  - Framing: Token classification.
  - Labels: `{B-GENE-N, I-GENE-N, B-GENE-Y, I-GENE-Y, B-CHEMICAL, I-CHEMICAL, O}`.
  - Overlap Handling: The implementation supports double-encoding to handle overlapping entities (tokens with multiple labels).
- Relation Extraction (RE):
  - Framing: Classify pairs of extracted entities into one of the 11 relation classes.
Evaluation Metrics: Average Precision, Recall, and F1 score.
- Tokenizer: `bert-base-uncased` (uncased BERT tokenizer).
- Embeddings: Retrieved from `TFBertModel` pretrained on `bert-base-uncased`.
- Segmentation: Abstracts are segmented into 128-token windows with striding and padding to ensure correct handling across segment boundaries (see the windowing sketch after this list).
- For NER Training: Tokens are labeled using the full tag set. Only valid label combinations present in the training data are encoded.
- For RE Training: Segments are indexed, and pairs of entity token spans are marked and stored in `re_mask`. Rows in `re_mask` are padded with `-1` where no more pairs exist.
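As a minimal illustration of the windowing step, the sketch below uses the Hugging Face fast tokenizer; the stride value, toy abstract, and variable names are illustrative assumptions, not the project's exact settings.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

abstract_text = "Haloperidol is a potent dopamine D2 receptor antagonist ..."  # toy abstract

# Split the abstract into overlapping 128-token windows with padding.
# stride=32 is an illustrative choice; the project's exact stride is not shown here.
encoding = tokenizer(
    abstract_text,
    max_length=128,
    truncation=True,
    padding="max_length",
    stride=32,
    return_overflowing_tokens=True,   # yields one row per 128-token window
    return_offsets_mapping=True,      # offsets are reused later to rebuild raw text
)

input_ids = encoding["input_ids"]            # (num_windows, 128)
offset_mapping = encoding["offset_mapping"]  # per-token character spans
```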
The project implements a Text→Graph pipeline using TensorFlow/Keras (TF 2.x idiom). The architecture consists of three main components.
A custom implementation of Conditional Random Fields for sequence tagging.
Parameters (Trainable):
- Start Transitions: Shape `(num_tags,)`. Initializer: GlorotUniform.
- Transition Matrix: Shape `(num_tags, num_tags)`. Initializer: GlorotUniform.
- End Transitions: Shape `(num_tags,)`. Initializer: GlorotUniform.
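A sketch of how these three trainable parameters could be declared inside a Keras layer's `build` method, assuming the shapes and GlorotUniform initializer listed above:

```python
import tensorflow as tf

class CRF(tf.keras.layers.Layer):
    def __init__(self, num_tags, **kwargs):
        super().__init__(**kwargs)
        self.num_tags = num_tags

    def build(self, input_shape):
        init = tf.keras.initializers.GlorotUniform()
        # Start/end transitions: score of a tag appearing first/last in a sequence.
        self.start_transitions = self.add_weight(
            name="start_transitions", shape=(self.num_tags,), initializer=init)
        self.end_transitions = self.add_weight(
            name="end_transitions", shape=(self.num_tags,), initializer=init)
        # transitions[i, j]: score of moving from tag i to tag j.
        self.transitions = self.add_weight(
            name="transitions", shape=(self.num_tags, self.num_tags), initializer=init)
```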
Key Logic:
- Forward Algorithm: Computes the log-partition function `log_likelihood` using `tf.while_loop` and `tf.reduce_logsumexp`.
- Score Sequence: Computes the `gold_score` of a text sequence with its correct labels.
- CRF Loss: Returns the mean of `-(gold_score - log_likelihood)`.
- Viterbi Decoding:
  - Initializes with the start transitions plus the first emissions.
  - Iteratively computes scores by expanding dimensions and adding the transition matrix.
  - Uses backtracking (via `TensorArray`) to produce the final best path as an integer tensor of shape `(batch_size, seq_len)`.
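For illustration, the sketch below computes the same loss with a plain Python loop instead of `tf.while_loop`, and ignores padding masks; shapes follow the parameter definitions above.

```python
import tensorflow as tf

def crf_neg_log_likelihood(emissions, tags, start_trans, trans, end_trans):
    """emissions: (batch, seq_len, num_tags); tags: (batch, seq_len) int32."""
    seq_len = emissions.shape[1]

    # Forward algorithm: log-partition over all possible tag sequences.
    alpha = start_trans + emissions[:, 0]                        # (batch, num_tags)
    for t in range(1, seq_len):
        # (batch, from_tag, 1) + (from_tag, to_tag) + (batch, 1, to_tag)
        scores = alpha[:, :, None] + trans + emissions[:, t][:, None, :]
        alpha = tf.reduce_logsumexp(scores, axis=1)
    log_partition = tf.reduce_logsumexp(alpha + end_trans, axis=1)

    # Score of the gold (correct) tag sequence.
    gold = tf.gather(start_trans, tags[:, 0])
    gold += tf.reduce_sum(tf.gather(emissions, tags, axis=2, batch_dims=2), axis=1)
    pair_idx = tf.stack([tags[:, :-1], tags[:, 1:]], axis=-1)    # consecutive tag pairs
    gold += tf.reduce_sum(tf.gather_nd(trans, pair_idx), axis=1)
    gold += tf.gather(end_trans, tags[:, -1])

    # Mean negative log-likelihood, i.e. the mean of -(gold_score - log_partition).
    return tf.reduce_mean(log_partition - gold)
```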
A hybrid BERT-BiLSTM-CRF architecture for the NER task.
Architecture Layers:
- BERT Backbone: `TFBertModel` (`bert-base-uncased`). Outputs `last_hidden_state` with shape `(batch_size, seq_len, bert_hidden_size)`.
- L2 Normalization: Applied to the BERT output before the LSTM.
- Bi-LSTM: Bidirectional LSTM wrapper. Output shape is `(batch_size, seq_len, 2 * hidden_dim)`.
- Dense Layer: Projects the LSTM output to `num_tags`. Output shape: `(batch_size, seq_len, num_tags)`.
- CRF: The custom CRF layer described above.
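A condensed sketch of this stack, assuming an illustrative `hidden_dim` and reusing the `CRF` layer sketched earlier:

```python
import tensorflow as tf
from transformers import TFBertModel

class BertBiLSTMCRF(tf.keras.Model):
    def __init__(self, num_tags, hidden_dim=128):  # hidden_dim is an illustrative value
        super().__init__()
        self.bert = TFBertModel.from_pretrained("bert-base-uncased")
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden_dim, return_sequences=True))
        self.dense = tf.keras.layers.Dense(num_tags)
        self.crf = CRF(num_tags)                   # custom CRF layer sketched above

    def emissions(self, input_ids, attention_mask):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x = tf.math.l2_normalize(x, axis=-1)       # L2 normalization before the LSTM
        x = self.bilstm(x)                         # (batch, seq_len, 2 * hidden_dim)
        return self.dense(x)                       # (batch, seq_len, num_tags)
```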
Training Logic (`train_step`):
- Forward: Computes emissions (logits) through BERT → L2 → BiLSTM → Dense.
- Loss: Calculates the CRF loss (`log_likelihood - gold_score`).
- Optimization: Gradients are computed via `tf.GradientTape` and applied via the optimizer.
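The project overrides Keras's `train_step`; for brevity, the sketch below expresses the same logic as a free function over the `BertBiLSTMCRF` and `crf_neg_log_likelihood` sketches above. How the batch unpacks into IDs, attention mask, and tags is an assumption about the input pipeline.

```python
import tensorflow as tf

@tf.function
def ner_train_step(model, optimizer, input_ids, attention_mask, tags):
    """One training step for the BertBiLSTMCRF sketch above."""
    with tf.GradientTape() as tape:
        emissions = model.emissions(input_ids, attention_mask)   # BERT → L2 → BiLSTM → Dense
        loss = crf_neg_log_likelihood(
            emissions, tags,
            model.crf.start_transitions, model.crf.transitions, model.crf.end_transitions)
    grads = tape.gradient(loss, model.trainable_variables)       # backprop through the whole stack
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```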
Inference Logic (`test_step`):
- If validating: Computes CRF loss.
- If testing: Runs Viterbi decoding to get prediction sequences. Updates metrics (TP/TN/FP/FN) by comparing predictions to true labels using boolean logic and reduction.
A relation extraction architecture based on transfer, independent, or joint learning, using Entity Pooling.
Architecture Layers:
- BERT Backbone: `TFBertModel` (`bert-base-uncased`).
- Layer Normalization (Input): Applied to the BERT embeddings.
- Bi-LSTM: Bidirectional LSTM.
  Transfer Learning Note: In transfer learning, weights are explicitly copied from the NER model's LSTM, whereas in joint learning a single Bi-LSTM layer is used for both tasks without layer/L2 normalizations.
- Layer Normalization (Output): Applied to the Bi-LSTM output.
- Dense Layer: Final classification layer with Softmax activation. Input size is `4 * hidden_dim` (concatenation of two pooled entities).
Input Structure (`re_mask`):
Input is a 2D array where each row encodes a candidate pair: `[e1_start, e1_end, relation_label, e2_start, e2_end]`.
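An illustrative `re_mask` for one segment with two candidate pairs (token indices and labels are made up):

```python
import numpy as np

# Each row: [e1_start, e1_end, relation_label, e2_start, e2_end]
re_mask = np.array([
    [ 4,  6, 3, 12, 12],    # entity at tokens 4-6, label 3 (ACTIVATOR_...), entity at token 12
    [12, 12, 6, 20, 22],    # entity at token 12, label 6 (ANTAGONIST_OF), entity at tokens 20-22
    [-1, -1, -1, -1, -1],   # padding: no more candidate pairs in this segment
], dtype=np.int32)
```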
Forward Pass Logic:
- Encoding: Input IDs flow through BERT → LayerNorm → BiLSTM → LayerNorm.
- Entity Pooling (`extract_entity_pairs`):
  - Iterates over valid pairs (where the first column of `re_mask` is not `-1`).
  - Pool Entity: Uses `tf.while_loop` to collect token vectors from the start index to the end index of an entity, then computes the mean vector of the span.
- Combination: Concatenates the pooled embeddings of Entity 1 and Entity 2. Shape: `(4 * hidden_dim,)`.
- Classification: The stacked combined embeddings are passed to the Dense layer to produce logits.
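A simplified, eager-mode sketch of the pooling and combination step for a single segment; the project uses `tf.while_loop`, but slicing and mean-pooling express the same computation.

```python
import tensorflow as tf

def pool_entity_pairs(sequence_output, re_mask):
    """sequence_output: (seq_len, 2 * hidden_dim) for one segment.
    re_mask rows: [e1_start, e1_end, relation_label, e2_start, e2_end]."""
    combined = []
    for row in re_mask:
        if row[0] == -1:                       # padding row: no more candidate pairs
            break
        # Mean-pool each entity span (end index is inclusive).
        e1 = tf.reduce_mean(sequence_output[row[0]:row[1] + 1], axis=0)
        e2 = tf.reduce_mean(sequence_output[row[3]:row[4] + 1], axis=0)
        combined.append(tf.concat([e1, e2], axis=0))   # (4 * hidden_dim,)
    return tf.stack(combined)                          # (num_pairs, 4 * hidden_dim)
```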
Training Logic:
- Extracts valid relation labels from `re_mask`.
- Computes `sparse_categorical_crossentropy` loss between the filtered labels and logits.
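A short sketch of this loss step, assuming `re_mask` as above and `probs` as the stacked Softmax outputs of the Dense layer for the valid pairs:

```python
import tensorflow as tf

valid = re_mask[:, 0] != -1                        # drop the -1 padding rows
labels = tf.boolean_mask(re_mask[:, 2], valid)     # relation label is stored in column 2
loss = tf.reduce_mean(
    tf.keras.losses.sparse_categorical_crossentropy(labels, probs))
```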
Performance is evaluated on the test split using Average Precision, Recall, and F1-score. Inferred entities and relations are then migrated to a graph database for visualization.
- Database Sanitization: Each run begins with a `DETACH DELETE` query to ensure the graph reflects the most recent test inference without stale data.
- Text Reconstruction: Uses BERT's `offset_mapping` to resolve token indices back into the original raw abstract text.
- Filtering: Only relations with functional biological significance are stored; no-relation predictions are discarded.
- Entity Normalization: Node names are stripped of whitespace and capitalized to ensure proper entity resolution.
- Upsert Strategy (Cypher `MERGE`):
  - Nodes: Entities (`:CHEMICAL`, `:GENE-Y`, or `:GENE-N`).
  - Relationships: Established as directed edges using the predicted relation type (e.g., `-[:ANTAGONIST_OF]->`).
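A minimal sketch of the sanitization and `MERGE` upsert using the official Neo4j Python driver; the connection settings, helper name, and example entities are illustrative, not the project's actual values.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder credentials

def upsert_relation(tx, chem, gene, gene_label, rel_type):
    # MERGE acts as an upsert: nodes and edges are created only if they do not already exist.
    query = (
        f"MERGE (c:CHEMICAL {{name: $chem}}) "
        f"MERGE (g:`{gene_label}` {{name: $gene}}) "   # backticks: GENE-Y / GENE-N contain a hyphen
        f"MERGE (c)-[:`{rel_type}`]->(g)"
    )
    tx.run(query, chem=chem, gene=gene)

with driver.session() as session:
    session.run("MATCH (n) DETACH DELETE n")  # database sanitization before loading new inferences
    session.execute_write(upsert_relation, "Haloperidol", "Drd2", "GENE-Y", "ANTAGONIST_OF")
```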
The project is organized into the following directories; `/data` contains the `.keras` models that achieved the highest F1 scores.
```
.
├── src/
│   ├── train/
│   │   ├── NER_RE_transfer_and_independent_learning.ipynb
│   │   └── NER+RE_joint_learning.ipynb
│   └── test/
│       └── Test_NER_RE.ipynb
└── data/
    ├── ner_model.keras
    └── re_model.keras
```