Basic Graph Analysis
==================

This example demonstrates the basic workflow of using NEExT to analyze graph data, including:
loading data, computing node features, creating graph embeddings, and analyzing feature importance.

Loading Graph Data
----------------

First, we'll load some graph data from CSV files. We're using the NCI1 dataset, which is a collection
of chemical compounds represented as graphs, where each graph is labeled as either active or inactive
against non-small cell lung cancer.

.. code-block:: python

    from NEExT import NEExT
    import numpy as np

    # Initialize NEExT
    nxt = NEExT()
    nxt.set_log_level("INFO")

    # Define paths to data files
    edge_file = "https://raw.githubusercontent.com/AnomalyPoint/NEExT_datasets/refs/heads/main/real_world_networks/csv_format/NCI1/edges.csv"
    node_graph_mapping_file = "https://raw.githubusercontent.com/AnomalyPoint/NEExT_datasets/refs/heads/main/real_world_networks/csv_format/NCI1/node_graph_mapping.csv"
    graph_label_file = "https://raw.githubusercontent.com/AnomalyPoint/NEExT_datasets/refs/heads/main/real_world_networks/csv_format/NCI1/graph_labels.csv"

    # Load data with node reindexing and largest component filtering
    graph_collection = nxt.read_from_csv(
        edges_path=edge_file,
        node_graph_mapping_path=node_graph_mapping_file,
        graph_label_path=graph_label_file,
        reindex_nodes=True,
        filter_largest_component=True,
        graph_type="networkx"
    )

Computing Node Features
---------------------

Next, we'll compute various node-level features for each graph. These features capture both local
and global structural properties of the nodes.

.. code-block:: python

    # Compute node features
    features = nxt.compute_node_features(
        graph_collection=graph_collection,
        feature_list=["all"],  # Compute all available features
        feature_vector_length=3,  # Number of hops for neighborhood aggregation
        show_progress=True
    )

    # Normalize features for better model performance
    features.normalize(type="StandardScaler")

Creating Graph Embeddings
-----------------------

Now we'll create graph-level embeddings using the computed node features. These embeddings
will represent each graph as a fixed-size vector, making them suitable for machine learning.

.. code-block:: python

    # Compute graph embeddings
    embeddings = nxt.compute_graph_embeddings(
        graph_collection=graph_collection,
        features=features,
        embedding_algorithm="approx_wasserstein",
        embedding_dimension=3,
        random_state=42
    )

Training and Evaluating Models
----------------------------

With our graph embeddings, we can now train a machine learning model to classify the graphs.

.. code-block:: python

    # Train a classification model
    model_results = nxt.train_ml_model(
        graph_collection=graph_collection,
        embeddings=embeddings,
        model_type="classifier",
        sample_size=50,  # Number of train/test splits
        balance_dataset=False
    )

    # Print model results
    print(f"Average Accuracy: {np.mean(model_results['accuracy']):.4f}")
    print(f"Average F1 Score: {np.mean(model_results['f1_score']):.4f}")

Analyzing Feature Importance
-------------------------

Finally, we'll analyze which node features are most important for the classification task.
We'll use the fast supervised method which is more efficient than the greedy approach.

.. code-block:: python

    # Compute feature importance
    importance_df = nxt.compute_feature_importance(
        graph_collection=graph_collection,
        features=features,
        feature_importance_algorithm="supervised_fast",
        embedding_algorithm="approx_wasserstein",
        n_iterations=5
    )

    # Print feature importance results
    print("\nFeature Importance Results:")
    print(importance_df)

The feature importance results show which node features contribute most to the model's
performance, ranked from most to least important. This can help in feature selection
and understanding which structural properties are most relevant for the task.