Basic Graph Analysis

This example demonstrates the basic NEExT workflow for analyzing graph data: loading data, computing node features, creating graph embeddings, training a classifier, and analyzing feature importance.

Loading Graph Data

First, we’ll load some graph data from CSV files. We’re using the NCI1 dataset, which is a collection of chemical compounds represented as graphs, where each graph is labeled as either active or inactive against non-small cell lung cancer.

from NEExT import NEExT
import numpy as np

# Initialize NEExT
nxt = NEExT()
nxt.set_log_level("INFO")

# Define paths to data files
edge_file = "https://raw.githubusercontent.com/AnomalyPoint/NEExT_datasets/refs/heads/main/real_world_networks/csv_format/NCI1/edges.csv"
node_graph_mapping_file = "https://raw.githubusercontent.com/AnomalyPoint/NEExT_datasets/refs/heads/main/real_world_networks/csv_format/NCI1/node_graph_mapping.csv"
graph_label_file = "https://raw.githubusercontent.com/AnomalyPoint/NEExT_datasets/refs/heads/main/real_world_networks/csv_format/NCI1/graph_labels.csv"

# Load data with node reindexing and largest component filtering
graph_collection = nxt.read_from_csv(
    edges_path=edge_file,
    node_graph_mapping_path=node_graph_mapping_file,
    graph_label_path=graph_label_file,
    reindex_nodes=True,
    filter_largest_component=True,
    graph_type="networkx"
)
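
To make the two loading options concrete, here is a minimal pure-Python sketch of what keeping the largest connected component and reindexing nodes mean for a single graph. It is an illustration of the concepts only, not NEExT's implementation; the helper names `largest_component` and `reindex` and the toy edge list are hypothetical.

```python
from collections import deque


def largest_component(edges):
    """Return the node set of the largest connected component (undirected)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def reindex(edges, keep):
    """Drop edges outside `keep` and relabel the remaining nodes 0..n-1."""
    mapping = {node: index for index, node in enumerate(sorted(keep))}
    return [(mapping[u], mapping[v]) for u, v in edges if u in keep and v in keep]


# Toy graph (hypothetical): a triangle with one pendant node, plus a stray edge.
edges = [(10, 11), (11, 12), (12, 10), (12, 13), (40, 41)]
keep = largest_component(edges)
new_edges = reindex(edges, keep)
print(sorted(keep))
print(new_edges)
```

Filtering to one component guarantees that every node in a graph can reach every other node, and contiguous node ids keep per-graph feature tables compact.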

Computing Node Features

Next, we’ll compute various node-level features for each graph. These features capture both local and global structural properties of the nodes.

# Compute node features
features = nxt.compute_node_features(
    graph_collection=graph_collection,
    feature_list=["all"],  # Compute all available features
    feature_vector_length=3,  # Number of hops for neighborhood aggregation
    show_progress=True
)

# Normalize features for better model performance
features.normalize(type="StandardScaler")
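
As a rough illustration of what standard scaling does to a single feature column, the sketch below rescales a list of values to zero mean and unit variance using the population standard deviation, which is the convention scikit-learn's StandardScaler follows. The helper `standardize` and the sample degree values are hypothetical, and how NEExT applies the scaler internally is not shown here.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of values to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (divides by n)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical node degrees for one small graph.
degrees = [1, 2, 2, 3, 7]
scaled = standardize(degrees)
print([round(value, 3) for value in scaled])
```

After scaling, features measured on very different ranges (for example, degree versus a centrality score between 0 and 1) contribute on a comparable scale.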

Creating Graph Embeddings

Now we’ll create graph-level embeddings using the computed node features. These embeddings will represent each graph as a fixed-size vector, making them suitable for machine learning.

# Compute graph embeddings
embeddings = nxt.compute_graph_embeddings(
    graph_collection=graph_collection,
    features=features,
    embedding_algorithm="approx_wasserstein",
    embedding_dimension=3,
    random_state=42
)
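
The intuition behind a Wasserstein-based embedding is that each graph is treated as a distribution of node feature values, and graphs are compared by how far apart those distributions are. The sketch below computes the 1-Wasserstein distance for the simplest case, two equal-sized one-dimensional samples. It is only meant to convey the idea; it is not the "approx_wasserstein" algorithm itself, and the helper `wasserstein_1d` and the sample values are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """1-Wasserstein distance between two equal-sized 1D samples.

    For equal-sized samples the optimal matching pairs the sorted values,
    so the distance is the mean absolute difference of the sorted lists.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Hypothetical per-node feature values (e.g. degrees) for three graphs.
graph_a = [1, 2, 2, 3]
graph_b = [2, 2, 3, 3]
graph_c = [1, 1, 6, 6]

d_ab = wasserstein_1d(graph_a, graph_b)
d_ac = wasserstein_1d(graph_a, graph_c)
print(d_ab, d_ac)
```

Here graph_a ends up closer to graph_b than to graph_c, which is the kind of relationship a fixed-size embedding is meant to preserve.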

Training and Evaluating Models

With our graph embeddings, we can now train a machine learning model to classify the graphs.

# Train a classification model
model_results = nxt.train_ml_model(
    graph_collection=graph_collection,
    embeddings=embeddings,
    model_type="classifier",
    sample_size=50,  # Number of train/test splits
    balance_dataset=False
)

# Print model results
print(f"Average Accuracy: {np.mean(model_results['accuracy']):.4f}")
print(f"Average F1 Score: {np.mean(model_results['f1_score']):.4f}")
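
Because the scores come from many train/test splits, it can help to report their spread alongside the average. The sketch below assumes that each entry such as model_results['accuracy'] is a sequence with one score per split, as the averaging above suggests. The helper `summarize` is hypothetical, and the numbers are placeholders for illustration, not results from NCI1.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the mean, sample standard deviation, and range of a score list."""
    return {
        "mean": mean(scores),
        "std": stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder per-split scores (NOT real results), standing in for
# model_results["accuracy"] under the assumption stated above.
example_scores = [0.70, 0.72, 0.68, 0.74, 0.71]
summary = summarize(example_scores)
print(f"Accuracy: {summary['mean']:.4f} +/- {summary['std']:.4f} "
      f"(min {summary['min']:.2f}, max {summary['max']:.2f})")
```

A small standard deviation across splits indicates that the reported average does not depend heavily on which graphs happened to land in the test set.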

Analyzing Feature Importance

Finally, we’ll analyze which node features are most important for the classification task. We’ll use the fast supervised method, which is more efficient than the greedy approach.

# Compute feature importance
importance_df = nxt.compute_feature_importance(
    graph_collection=graph_collection,
    features=features,
    feature_importance_algorithm="supervised_fast",
    embedding_algorithm="approx_wasserstein",
    n_iterations=5
)

# Print feature importance results
print("\nFeature Importance Results:")
print(importance_df)

The feature importance results show which node features contribute most to the model’s performance, ranked from most to least important. This can help in feature selection and understanding which structural properties are most relevant for the task.