Skip to content

Conversation

@galvana
Copy link
Contributor

@galvana galvana commented Jan 17, 2026

Description Of Changes

This PR significantly improves the performance of graph traversal operations by replacing O(N²) algorithms with O(N+E) alternatives.

Performance Improvements

Optimization Before After Improvement
traverse() in traversal.py O(N²) O(N+E) ~100x faster at 3000 nodes
compute_all_descendants() O(N²) O(N+E) ~100x faster at 3000 nodes
Dataset-level after deps O(N²) O(N) Linear lookup

Key Changes:

  1. traverse() in traversal.py - O(N²) → O(N+E)

    • Legacy used MatchingQueue.pop_first_match() which scans the queue on each iteration
    • New version uses Kahn's algorithm with in-degree tracking for optimal complexity
  2. compute_all_descendants() in create_request_tasks.py - O(N²) → O(N+E)

    • Was calling networkx.descendants() for each node in a loop
    • Now pre-computes all descendants in a single reverse topological pass
  3. Dataset-level after dependencies - O(N²) → O(N)

    • Was scanning all nodes to find collections in a dataset
    • Now uses pre-built _collections_by_dataset index for O(1) lookups

Performance comparison

Collections Legacy O(N²) Optimized O(N+E) Speedup
100 ~12 ms ~1 ms 12x
1,000 ~1.1 s ~7 ms 161x
5,000 ~27 s ~30 ms 900x
10,000 ~2.2 min ~80 ms 1,680x
25,000 ~14 min* ~500 ms 1,688x
50,000 ~57 min* ~1 s 3,415x

* Extrapolated from O(N²) growth curve

Code Changes

  • Added optimized traverse() method using Kahn's algorithm with in-degree tracking
  • Added compute_all_descendants() function for O(N+E) descendant computation
  • Pre-indexed edges by node address (edges_by_node) for O(1) edge lookups
  • Pre-built dataset-to-collections index (_collections_by_dataset) for O(1) lookups
  • Added skip_verification parameter to BaseTraversal to avoid redundant traversal during reachability checks
  • Retained _traverse_legacy() method for test comparison purposes

Tests

New test file test_traversal_optimization_comparison.py validates that the optimized algorithm produces identical results to the legacy implementation:

Test Class Purpose
TestTraversalComparison Validates equivalence on deterministic graph topologies (linear chains, star graphs, diamond graphs, graphs with after dependencies)
TestRandomGraphEquivalence Validates equivalence on randomly generated graphs with varying densities and sizes
TestTraversalErrorEquivalence Ensures both algorithms raise the same errors for invalid graphs (disconnected nodes, unreachable collections)

Steps to Confirm

Existing tests should pass.

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration This indicates that a change includes a database migration label to the entry if your change includes a DB migration
    • Add a high-risk This issue suggests changes that have a high-probability of breaking existing code label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • All UX related changes have been reviewed by a designer
    • No UX review needed
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

@vercel
Copy link
Contributor

vercel bot commented Jan 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
fides-plus-nightly Ready Ready Preview, Comment Jan 29, 2026 1:01am
1 Skipped Deployment
Project Deployment Review Updated (UTC)
fides-privacy-center Ignored Ignored Jan 29, 2026 1:01am

Request Review

@galvana galvana marked this pull request as ready for review January 24, 2026 07:03
@galvana galvana requested a review from a team as a code owner January 24, 2026 07:03
@galvana galvana requested review from adamsachs and removed request for a team January 24, 2026 07:03
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 24, 2026

Greptile Overview

Greptile Summary

This PR implements significant performance optimizations for graph traversal operations, achieving ~100x speedup at 3000 nodes by replacing O(N²) algorithms with O(N+E) alternatives.

Key optimizations:

  • traverse() method: Replaced MatchingQueue.pop_first_match() scanning with Kahn's algorithm variant using in-degree tracking for O(N+E) complexity
  • compute_all_descendants(): Added function to pre-compute all descendants in single reverse topological pass instead of calling networkx.descendants() per node
  • Dataset indexing: Pre-built _collections_by_dataset index for O(1) dataset-to-collections lookups
  • Edge indexing: Pre-indexed edges by node address (edges_by_node) for O(1) edge lookups
  • Verification optimization: Added skip_verification parameter to avoid redundant traversal during reachability checks

Code quality:

  • Comprehensive test suite validates equivalence between optimized and legacy implementations across multiple graph topologies
  • Legacy _traverse_legacy() method retained for reference and comparison
  • Clear documentation of algorithmic approach and complexity analysis

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - well-tested performance optimization
  • Score reflects thorough testing approach and algorithmic correctness. The optimized implementation is validated against the legacy implementation across multiple test scenarios. The only minor style issue is a single-character variable name in the legacy code (which is intentionally kept for reference). The changes are well-documented with clear complexity analysis.
  • No files require special attention - the implementation is solid and well-tested

Important Files Changed

Filename Overview
src/fides/api/graph/traversal.py Replaced O(N²) traversal algorithm with O(N+E) Kahn's algorithm variant using in-degree tracking; added pre-indexing optimizations for edges and dataset collections; retained legacy implementation for comparison
src/fides/api/task/create_request_tasks.py Added compute_all_descendants() function for O(N+E) descendant computation; updated task creation functions to pre-compute descendants once instead of calling networkx.descendants per node
src/fides/api/graph/node_filters.py Updated OptionalIdentityFilter to use skip_verification parameter and explicit traverse() call to avoid redundant verification overhead
tests/ops/graph/test_traversal_optimization_comparison.py Comprehensive test suite comparing optimized and legacy traversal algorithms across deterministic topologies, random graphs, and error cases

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 24, 2026

Additional Comments (1)

src/fides/api/graph/traversal.py
single-character variable name n - use descriptive name like current_node

Context Used: Rule from dashboard - Use full names for variables, not 1 to 2 characters (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@galvana galvana added the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Jan 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run unsafe ci checks Runs fides-related CI checks that require sensitive credentials

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants