This project builds an end-to-end framework for converting text into a knowledge graph, with an empirical analysis on the ChemProt dataset.


Text-to-Knowledge Graph Construction (ChemProt)

1. Dataset Overview

The BioCreative VI ChemProt dataset annotates chemical and protein entities in biomedical abstracts, along with the relations between them. Chemical compounds typically act as agonists (activators) or antagonists (inhibitors) of proteins.

Entities

The project utilizes three specific entity labels:

  • GENE-N
  • GENE-Y
  • CHEMICAL

Relations

There are 11 distinct relation classes used for classification:

ID   Label
 0   UNDEFINED
 1   PART_OF
 2   REGULATOR_OR_DIRECT_REGULATOR_OR_INDIRECT_REGULATOR_OF
 3   ACTIVATOR_OR_UPREGULATOR_OR_INDIRECT_UPREGULATOR_OF
 4   INHIBITOR_OR_DOWNREGULATOR_OR_INDIRECT_DOWNREGULATOR_OF
 5   AGONIST_OR_AGONIST_ACTIVATOR_OR_AGONIST_INHIBITOR_OF
 6   ANTAGONIST_OF
 7   MODULATOR_OR_MODULATOR_ACTIVATOR_OR_MODULATOR_INHIBITOR_OF
 8   COFACTOR_OF
 9   SUBSTRATE_OR_PRODUCT_OR_SUBSTRATE_PRODUCT_OF
10   NO_RELATION

2. Task Objectives

Objective: Construct a knowledge graph from ChemProt abstracts by performing two distinct sub-tasks:

  1. Named Entity Recognition (NER):

    • Framing: Token classification.
    • Labels: {B-GENE-N, I-GENE-N, B-GENE-Y, I-GENE-Y, B-CHEMICAL, I-CHEMICAL, O}.
    • Overlap Handling: The implementation supports double-encoding to handle overlapping entities (tokens with multiple labels).
  2. Relation Extraction (RE):

    • Framing: Classify pairs of extracted entities into one of the 11 relation classes.

Evaluation Metrics: Average Precision, Recall, and F1 score.


3. Data Preprocessing

  • Tokenizer: bert-base-uncased (the uncased BERT WordPiece tokenizer).
  • Embeddings: Taken from a TFBertModel loaded from the pretrained bert-base-uncased checkpoint.
  • Segmentation: Abstracts are split into 128-token segments with striding and padding so that entities are handled correctly across segment boundaries (see the sketch below).
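
A minimal sketch of this segmentation step, assuming the Hugging Face fast-tokenizer API; the stride value and the abstract_text variable are illustrative, not taken from the repository.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# abstract_text is one ChemProt abstract (string); the stride of 32 is an assumed value.
encoding = tokenizer(
    abstract_text,
    max_length=128,                  # 128-token segments
    stride=32,                       # overlap between consecutive segments
    truncation=True,
    padding="max_length",
    return_overflowing_tokens=True,  # yields several 128-token segments per abstract
    return_offsets_mapping=True,     # used later to map tokens back to the raw text
)
# encoding["input_ids"] is a list of segments, each exactly 128 ids long.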

Encoding Details

  • For NER Training: Tokens are labeled using the full tag set. Only valid label combinations present in the training data are encoded.
  • For RE Training: Segments are indexed, and pairs of entity token spans are marked and stored in re_mask. Rows in re_mask are padded with -1 where no more pairs exist.

4. Modeling Approach

The project implements a Text→Graph pipeline using TensorFlow/Keras (TF 2.x idiom). The architecture consists of three main components.

Component A: Custom CRF Layer

A custom implementation of Conditional Random Fields for sequence tagging.

Parameters (Trainable):

  • Start Transitions: Shape (num_tags,). Initializer: GlorotUniform.
  • Transition Matrix: Shape (num_tags, num_tags). Initializer: GlorotUniform.
  • End Transitions: Shape (num_tags,). Initializer: GlorotUniform.

Key Logic:

  • Forward Algorithm: Computes the log-partition function (stored as log_likelihood) using tf.while_loop and tf.reduce_logsumexp; a sketch follows this list.
  • Score Sequence: Computes gold_score, the score of the input sequence under its gold labels.
  • CRF Loss: Returns the mean of -(gold_score - log_likelihood).
  • Viterbi Decoding:
    • Initializes with start transitions + emissions.
    • Iteratively computes scores by expanding dimensions and adding the transition matrix.
    • Uses backtracking (via TensorArray) to produce the final best path as an integer tensor of shape (batch_size, seq_len).
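
A minimal sketch of the forward algorithm, gold-sequence score, and resulting loss, ignoring padding masks for brevity; the function names are illustrative and the repository's implementation may differ in detail.

import tensorflow as tf

def crf_log_partition(emissions, transitions, start_trans, end_trans):
    # emissions: (batch, seq_len, num_tags); transitions: (num_tags, num_tags)
    # alpha[b, j] = log-sum-exp of the scores of all partial paths ending in tag j
    alpha = start_trans[None, :] + emissions[:, 0, :]
    seq_len = tf.shape(emissions)[1]

    def step(t, alpha):
        # alpha[:, i] + transitions[i, j] + emissions[:, t, j], log-summed over i
        scores = alpha[:, :, None] + transitions[None, :, :] + emissions[:, t, :][:, None, :]
        return t + 1, tf.reduce_logsumexp(scores, axis=1)

    _, alpha = tf.while_loop(lambda t, a: t < seq_len, step, [tf.constant(1), alpha])
    return tf.reduce_logsumexp(alpha + end_trans[None, :], axis=-1)  # (batch,)

def crf_gold_score(emissions, tags, transitions, start_trans, end_trans):
    # Score of the gold tag sequence: emission + transition + start/end scores.
    emit = tf.reduce_sum(tf.gather(emissions, tags, axis=2, batch_dims=2), axis=1)
    trans = tf.reduce_sum(
        tf.gather_nd(transitions, tf.stack([tags[:, :-1], tags[:, 1:]], axis=-1)), axis=1)
    return emit + trans + tf.gather(start_trans, tags[:, 0]) + tf.gather(end_trans, tags[:, -1])

def crf_loss(emissions, tags, transitions, start_trans, end_trans):
    # Mean negative log-likelihood: -(gold_score - log_partition).
    log_z = crf_log_partition(emissions, transitions, start_trans, end_trans)
    gold = crf_gold_score(emissions, tags, transitions, start_trans, end_trans)
    return tf.reduce_mean(-(gold - log_z))

Viterbi decoding follows the same recursion with reduce_max/argmax in place of reduce_logsumexp, plus backtracking to recover the best path.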

Component B: NER Model

A hybrid BERT-BiLSTM-CRF architecture.

Architecture Layers:

  1. BERT Backbone: TFBertModel (bert-base-uncased). Outputs last_hidden_state with shape (batch_size, seq_len, bert_hidden_size).
  2. L2 Normalization: Applied to the BERT output before the LSTM.
  3. Bi-LSTM: Bidirectional LSTM wrapper. Output shape is (batch_size, seq_len, 2 * hidden_dim).
  4. Dense Layer: Projects LSTM output to num_tags. Output shape: (batch_size, seq_len, num_tags).
  5. CRF: The Custom CRF layer described above.
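
A minimal Keras sketch of the layer stack above, producing emission logits for the CRF; hidden_dim and the class name are assumptions, not names from the repository.

import tensorflow as tf
from transformers import TFBertModel

class NEREncoder(tf.keras.Model):
    # BERT -> L2 normalization -> Bi-LSTM -> Dense; the custom CRF layer consumes the logits.
    def __init__(self, num_tags, hidden_dim=128):
        super().__init__()
        self.bert = TFBertModel.from_pretrained("bert-base-uncased")
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden_dim, return_sequences=True))
        self.dense = tf.keras.layers.Dense(num_tags)

    def call(self, input_ids, attention_mask):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        hidden = tf.math.l2_normalize(hidden, axis=-1)  # L2 normalization before the LSTM
        hidden = self.bilstm(hidden)                    # (batch, seq_len, 2 * hidden_dim)
        return self.dense(hidden)                       # emissions: (batch, seq_len, num_tags)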

Training Logic (train_step):

  1. Forward: Computes emissions (logits) through BERT → L2 → BiLSTM → Dense.
  2. Loss: Calculates CRF loss (log_likelihood - gold_score).
  3. Optimization: Gradients are computed via tf.GradientTape and applied via the optimizer.
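
A sketch of the custom training step, reusing crf_loss from the CRF sketch above; transitions, start_trans, and end_trans stand for the CRF layer's tf.Variable parameters, and the argument names are illustrative.

import tensorflow as tf

def train_step(encoder, transitions, start_trans, end_trans, optimizer,
               input_ids, attention_mask, tags):
    with tf.GradientTape() as tape:
        emissions = encoder(input_ids, attention_mask)  # BERT -> L2 -> BiLSTM -> Dense
        loss = crf_loss(emissions, tags, transitions, start_trans, end_trans)
    variables = encoder.trainable_variables + [transitions, start_trans, end_trans]
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss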

Inference Logic (test_step):

  • If validating: Computes CRF loss.
  • If testing: Runs Viterbi decoding to get prediction sequences. Updates metrics (TP/TN/FP/FN) by comparing predictions to true labels using boolean logic and reduction.
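
An illustrative token-level count update after Viterbi decoding, treating every non-O tag as a positive; the repository's exact boolean logic may differ.

import tensorflow as tf

def update_counts(pred_tags, true_tags, o_id):
    # pred_tags, true_tags: integer tensors of shape (batch, seq_len); o_id is the id of "O".
    pred_entity = tf.not_equal(pred_tags, o_id)
    true_entity = tf.not_equal(true_tags, o_id)
    correct = tf.equal(pred_tags, true_tags)
    tp = tf.reduce_sum(tf.cast(pred_entity & true_entity & correct, tf.int32))
    fp = tf.reduce_sum(tf.cast(pred_entity & ~correct, tf.int32))
    fn = tf.reduce_sum(tf.cast(true_entity & ~correct, tf.int32))
    tn = tf.reduce_sum(tf.cast(~pred_entity & ~true_entity, tf.int32))
    return tp, fp, fn, tn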

Component C: Relation Extraction Model

An Entity Pooling architecture trained under three regimes: transfer, independent, and joint learning.

Architecture Layers:

  1. BERT Backbone: TFBertModel (bert-base-uncased).
  2. Layer Normalization (Input): Applied to BERT embeddings.
  3. Bi-LSTM: Bidirectional LSTM.

    Transfer Learning Note: In transfer learning, the weights are explicitly copied from the NER model's Bi-LSTM; in joint learning, a single shared Bi-LSTM layer serves both tasks, without the layer/L2 normalizations.

  4. Layer Normalization (Output): Applied to the Bi-LSTM output.
  5. Dense Layer: Final classification layer with Softmax activation. Input size is 4 * hidden_dim (concatenation of two pooled entities).

Input Structure (re_mask): Input is a 2D array where each row encodes a candidate pair: [e1_start, e1_end, relation_label, e2_start, e2_end].
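
For illustration only; the indices, labels, and the maximum number of pairs per segment below are hypothetical.

import numpy as np

# Each row: [e1_start, e1_end, relation_label, e2_start, e2_end]
re_mask = np.array([
    [12, 14,  6, 40, 41],   # tokens 12-14 ANTAGONIST_OF tokens 40-41
    [55, 55, 10, 60, 62],   # candidate pair labeled NO_RELATION
    [-1, -1, -1, -1, -1],   # padding: no further pairs in this segment
    [-1, -1, -1, -1, -1],
])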

Forward Pass Logic:

  1. Encoding: Input IDs flow through BERT → LayerNorm → BiLSTM → LayerNorm.
  2. Entity Pooling (extract_entity_pairs):
    • Iterates over valid pairs (where the first column of re_mask is not -1).
    • Pool Entity: Uses tf.while_loop to collect token vectors from the start index to the end index of an entity.
    • Computes the mean vector of the span.
  3. Combination: Concatenates the pooled embedding of Entity 1 and Entity 2. Shape: (4 * hidden_dim,).
  4. Classification: Stacked combined embeddings are passed to the Dense layer to produce logits.
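
A simplified sketch of the pooling and combination steps for a single segment; the repository uses tf.while_loop where plain slicing is shown here, and the span end index is assumed to be inclusive.

import tensorflow as tf

def pool_span(hidden, start, end):
    # hidden: (seq_len, 2 * hidden_dim) Bi-LSTM output for one segment.
    span = hidden[start:end + 1]         # token vectors from start to end (inclusive)
    return tf.reduce_mean(span, axis=0)  # mean vector of the span: (2 * hidden_dim,)

def combine_pair(hidden, row):
    # row: one valid re_mask row [e1_start, e1_end, relation_label, e2_start, e2_end].
    e1_start, e1_end, _, e2_start, e2_end = tf.unstack(row)
    e1 = pool_span(hidden, e1_start, e1_end)
    e2 = pool_span(hidden, e2_start, e2_end)
    return tf.concat([e1, e2], axis=-1)  # (4 * hidden_dim,), fed to the softmax Dense layer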

Training Logic:

  • Extracts valid relation labels from re_mask.
  • Computes sparse_categorical_crossentropy loss between filtered labels and logits.
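
An illustrative loss computation, assuming probs holds the stacked softmax outputs for the valid pairs only (as produced by the forward pass above).

import tensorflow as tf

def re_loss(re_mask, probs):
    # Keep rows that encode real pairs (first column != -1) and read their relation labels.
    valid = tf.not_equal(re_mask[:, 0], -1)
    labels = tf.boolean_mask(re_mask[:, 2], valid)
    # probs: softmax outputs for the valid pairs, shape (num_valid, 11).
    return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(labels, probs))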

5. Evaluation and Test Result Storage

Performance is evaluated on the test split using Average Precision, Recall, and F1-score. The inferred entities and relations are then migrated to a graph database for visualization.

Neo4j Graph Integration Logic

  • Database Sanitization: Each run begins with a DETACH DELETE query to ensure the graph reflects the most recent test inference without stale data.
  • Text Reconstruction: Uses the tokenizer's offset_mapping to resolve token indices back into the original raw abstract text.
  • Filtering: Only relations with functional biological significance are stored; pairs predicted as NO_RELATION are discarded.
  • Entity Normalization: Node names are stripped of whitespace and capitalized to ensure proper entity resolution.
  • Upsert Strategy (Cypher MERGE):
    • Nodes: Entities (:CHEMICAL or :GENE-Y or :GENE-N).
    • Relationships: Established as directed edges using the predicted relation type (e.g., -[:ANTAGONIST_OF]->).
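
A minimal sketch of this upsert logic with the official Neo4j Python driver; the connection details and the example entity pair are placeholders, and the repository's actual Cypher may differ.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

UPSERT = """
MERGE (c:CHEMICAL {name: $chem})
MERGE (g:`GENE-Y` {name: $gene})
MERGE (c)-[:ANTAGONIST_OF]->(g)
"""

with driver.session() as session:
    session.run("MATCH (n) DETACH DELETE n")  # wipe stale data from the previous run
    session.run(UPSERT, chem="Naloxone", gene="Mu opioid receptor")  # one predicted relation

driver.close()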

6. Project Directory Structure

The project is organized into the following directories; /data contains the .keras models that achieved the highest F1 scores.

File Tree

.
├── src/
│   ├── train/
│   │   ├── NER_RE_transfer_and_independent_learning.ipynb
│   │   └── NER+RE_joint_learning.ipynb
│   └── test/
│       └── Test_NER_RE.ipynb
└── data/
    ├── ner_model.keras
    └── re_model.keras
