Query Engine

1. What is the Query Engine?

The GraphRecord Query Engine enables users to efficiently find the indices of nodes and edges stored in the graph structure. Its intuitive interface supports complex queries that filter nodes and edges by their properties and relationships. This section introduces the basic concepts of querying GraphRecords and explores advanced techniques for working with complex datasets.

2. Example Dataset

An example dataset for the following demonstrations is created manually with users, products, and their relationships.

import pandas as pd

import graphrecords as gr

users = pd.DataFrame(
    [
        ["pat_0", 20, "M"],
        ["pat_1", 30, "F"],
        ["pat_2", 40, "M"],
        ["pat_3", 50, "F"],
        ["pat_4", 60, "M"],
    ],
    columns=["index", "age", "gender"],
)

products = pd.DataFrame(
    [
        ["drug_0", "fentanyl injection"],
        ["drug_1", "aspirin tablet"],
        ["drug_2", "insulin pen"],
    ],
    columns=["index", "description"],
)

user_product = pd.DataFrame(
    [
        ["pat_0", "drug_0", 100, 1, "2020-01-01"],
        ["pat_1", "drug_0", 150, 2, "2020-02-15"],
        ["pat_1", "drug_1", 50, 1, "2020-03-10"],
        ["pat_2", "drug_1", 75, 12, "2020-04-20"],
        ["pat_2", "drug_2", 200, 1, "2020-05-05"],
        ["pat_3", "drug_2", 180, 12, "2020-06-30"],
        ["pat_4", "drug_0", 120, 1, "2020-07-15"],
        ["pat_4", "drug_1", 60, 2, "2020-08-01"],
    ],
    columns=["source", "target", "cost", "quantity", "time"],
)

graphrecord = (
    gr.GraphRecord.builder()
    .add_nodes((users, "index"), group="user")
    .add_nodes((products, "index"), group="product")
    .add_edges((user_product, "source", "target"), group="user_product")
    .build()
)

This example dataset includes a set of users and products. For this section, we will use the users, products and the edges that connect these two groups.

users_df.head(10)

      gender age
pat_0      M  20
pat_1      F  30
pat_2      M  40
pat_3      F  50
pat_4      M  60

products_df.head(10)

               description
drug_0  fentanyl injection
drug_1      aspirin tablet
drug_2         insulin pen

user_product_edges.head(10)

source  target  cost  time        quantity
pat_0   drug_0  100   2020-01-01  1
pat_1   drug_0  150   2020-02-15  2
pat_1   drug_1  50    2020-03-10  1
pat_2   drug_1  75    2020-04-20  12
pat_2   drug_2  200   2020-05-05  1
pat_3   drug_2  180   2020-06-30  12
pat_4   drug_0  120   2020-07-15  1
pat_4   drug_1  60    2020-08-01  2

3. Node Queries

The NodeOperand querying class allows you to define specific criteria for selecting nodes within a GraphRecord. Node operands enable flexible and complex queries by combining multiple conditions, such as group membership, attribute selection and querying, attribute values, and relationships to other nodes or edges. This section introduces the basic usage of node operands as a foundation for your data queries.

The function query_nodes() and its counterpart query_edges() are the main entry points for these queries. They can retrieve different types of data from the GraphRecord, such as the indices of the nodes that fulfill given criteria (using index()) or the mean age of those nodes (using mean()).

from graphrecords.querying import NodeIndicesOperand, NodeOperand


# Basic node query
def query_node_in_user(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")

    return node.index()


graphrecord.query_nodes(query_node_in_user)

['pat_1', 'pat_0', 'pat_2', 'pat_3', 'pat_4']

The same result can often be reached through different queries, which makes the query engine versatile and adaptable to your specific needs. The next example is slightly more involved and combines several conditions.

# Intermediate node query
def query_node_user_older_than_30(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.index().contains("pat")

    node.has_attribute("age")
    node.attribute("age").greater_than(30)

    return node.index()


graphrecord.query_nodes(query_node_user_older_than_30)

['pat_2', 'pat_3', 'pat_4']

Note

The has_attribute() method is not needed in this example, since attribute() already checks whether the nodes have the attribute. It is included purely for illustration. Other examples in this user guide do the same so that as many methods as possible are shown.
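To make the semantics of the chained conditions concrete, here is a minimal plain-Python sketch of the same filter. It does not use the GraphRecord API: the `users` dictionary is an illustrative mirror of the example data from Section 2, and the mapping of each condition to a query method (shown in the comments) is an assumption based on the method names and the output above.

```python
# Illustrative plain-Python mirror of the user nodes from Section 2.
users = {
    "pat_0": {"age": 20, "gender": "M"},
    "pat_1": {"age": 30, "gender": "F"},
    "pat_2": {"age": 40, "gender": "M"},
    "pat_3": {"age": 50, "gender": "F"},
    "pat_4": {"age": 60, "gender": "M"},
}

# Every condition must hold; each line mirrors one statement of the query.
selected = [
    index
    for index, attributes in users.items()
    if "pat" in index  # node.index().contains("pat")
    and "age" in attributes  # node.has_attribute("age")
    and attributes["age"] > 30  # node.attribute("age").greater_than(30)
]

print(selected)
```

In this sketch pat_1 (age 30) is not selected, which matches the query output above and suggests that greater_than(30) is a strict comparison.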

3.1. Reusing Node Queries

As you can see, the query engine is useful for finding nodes that fulfill different criteria, and these criteria can be as specific as we like. A key feature of the query engine is that previous queries can be reused in new ones. For instance, the previous query can be written as follows:

# Reusing node query
def query_node_reused(node: NodeOperand) -> NodeIndicesOperand:
    query_node_in_user(node)
    node.index().contains("pat")

    node.has_attribute("age")
    node.attribute("age").greater_than(30)

    return node.index()


graphrecord.query_nodes(query_node_reused)

['pat_2', 'pat_3', 'pat_4']

3.2. Neighbors

Another very useful method is neighbors(), which lets you apply conditions to the neighbors of the selected nodes, that is, the nodes connected to them by edges.

In the following example, we select the nodes that fulfill these criteria:

They are in the group user.
Their node index contains the string “pat”.
Their attribute age is greater than 30.
They are connected to nodes whose attribute description contains the word “fentanyl” in either upper or lower case.

# Node query with neighbors function
def query_node_neighbors(node: NodeOperand) -> NodeIndicesOperand:
    query_node_user_older_than_30(node)

    description_neighbors = node.neighbors().attribute("description")
    description_neighbors.lowercase()
    description_neighbors.contains("fentanyl")

    return node.index()


graphrecord.query_nodes(query_node_neighbors)

['pat_4']
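The result can be cross-checked without the query engine. The following plain-Python sketch mirrors the example data from Section 2 in ordinary dictionaries (the names `ages`, `descriptions` and `edges` are illustrative). It assumes that a neighbor is the node at the target end of one of the user's edges, which fits this dataset because every edge runs from a user to a product.

```python
# Illustrative plain-Python mirror of the example data from Section 2.
ages = {"pat_0": 20, "pat_1": 30, "pat_2": 40, "pat_3": 50, "pat_4": 60}
descriptions = {
    "drug_0": "fentanyl injection",
    "drug_1": "aspirin tablet",
    "drug_2": "insulin pen",
}
# (source, target) pairs of the user_product edges.
edges = [
    ("pat_0", "drug_0"), ("pat_1", "drug_0"), ("pat_1", "drug_1"),
    ("pat_2", "drug_1"), ("pat_2", "drug_2"), ("pat_3", "drug_2"),
    ("pat_4", "drug_0"), ("pat_4", "drug_1"),
]

# Users whose index contains "pat" and whose age is greater than 30.
older_than_30 = {user for user, age in ages.items() if "pat" in user and age > 30}

# Products whose lowercased description contains "fentanyl".
fentanyl_products = {
    product for product, text in descriptions.items() if "fentanyl" in text.lower()
}

# Assumption: a neighbor is the target node of one of the user's edges.
selected = sorted(
    {
        source
        for source, target in edges
        if source in older_than_30 and target in fentanyl_products
    }
)

print(selected)
```

Three users are connected to the fentanyl product (pat_0, pat_1 and pat_4), but only pat_4 is also older than 30.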

4. Edge Queries

The querying class EdgeOperand provides a way to query the edges contained in a GraphRecord. Edge operands offer the same functionality as node operands, and together they form a powerful pair for querying your data. This section shows different ways in which edge operands can be used.

from graphrecords.querying import EdgeIndicesOperand, EdgeOperand


# Basic edge query
def query_edge_user_product(edge: EdgeOperand) -> EdgeIndicesOperand:
    edge.in_group("user_product")
    return edge.index()


edges = graphrecord.query_edges(query_edge_user_product)
edges[0:5]

[6, 7, 0, 3, 5]

The edge operand follows the same principles as the node operand, with some extra queries applicable only to edges like source_node() or target_node() (instead of neighbors()).

# Advanced edge query
def query_edge_old_user_cheap_item(edge: EdgeOperand) -> EdgeIndicesOperand:
    edge.in_group("user_product")
    edge.attribute("cost").less_than(200)

    edge.source_node().attribute("age").is_max()
    edge.target_node().attribute("description").contains("insulin")
    return edge.index()


graphrecord.query_edges(query_edge_old_user_cheap_item)

[]
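The empty result may look surprising, so it helps to trace the conditions step by step. The plain-Python sketch below does this on a mirror of the example data from Section 2. Two assumptions are made and are not confirmed by this guide: the position of an edge in the list stands in for its edge index, and is_max() is evaluated over the source nodes of the edges that remain after the cost filter.

```python
# Illustrative plain-Python mirror of the example data from Section 2.
ages = {"pat_0": 20, "pat_1": 30, "pat_2": 40, "pat_3": 50, "pat_4": 60}
descriptions = {
    "drug_0": "fentanyl injection",
    "drug_1": "aspirin tablet",
    "drug_2": "insulin pen",
}
# (source, target, cost); the list position is used as a stand-in edge index.
edges = [
    ("pat_0", "drug_0", 100),
    ("pat_1", "drug_0", 150),
    ("pat_1", "drug_1", 50),
    ("pat_2", "drug_1", 75),
    ("pat_2", "drug_2", 200),
    ("pat_3", "drug_2", 180),
    ("pat_4", "drug_0", 120),
    ("pat_4", "drug_1", 60),
]

# Step 1: edges with a cost below 200.
cheap = [i for i, (_, _, cost) in enumerate(edges) if cost < 200]

# Step 2: keep edges whose source user has the maximum age among those edges.
max_age = max(ages[edges[i][0]] for i in cheap)
from_oldest = [i for i in cheap if ages[edges[i][0]] == max_age]

# Step 3: keep edges whose target description contains "insulin".
result = [i for i in from_oldest if "insulin" in descriptions[edges[i][1]]]

print(cheap, max_age, from_oldest, result)
```

The oldest user, pat_4, only has edges to drug_0 and drug_1, so no edge survives the final insulin condition and the result is empty.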

5. Combining Node & Edge Queries

The full power of the query engine appears once you combine both operands in a single query. The following query selects nodes that:

Are in the group user.
Have an attribute age greater than 30 and an attribute gender equal to “M”.
Have at least one edge in the user_product group whose attribute cost is less than 200 and whose attribute quantity is equal to 1.

# Combined node and edge query
def query_edge_combined(edge: EdgeOperand) -> EdgeIndicesOperand:
    edge.in_group("user_product")
    edge.attribute("cost").less_than(200)
    edge.attribute("quantity").equal_to(1)

    return edge.index()


def query_node_combined(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.attribute("age").is_int()
    node.attribute("age").greater_than(30)
    node.attribute("gender").equal_to("M")

    query_edge_combined(node.edges())

    return node.index()


graphrecord.query_nodes(query_node_combined)

['pat_4']
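As a cross-check of the combined query, the sketch below applies the same node and edge conditions in plain Python to a mirror of the example data from Section 2. It assumes that node.edges() refers to the edges that start at the node, which is sufficient here because users only appear as edge sources. The names `users`, `edges` and `sources_with_match` are illustrative.

```python
# Illustrative plain-Python mirror of the example data from Section 2.
users = {
    "pat_0": {"age": 20, "gender": "M"},
    "pat_1": {"age": 30, "gender": "F"},
    "pat_2": {"age": 40, "gender": "M"},
    "pat_3": {"age": 50, "gender": "F"},
    "pat_4": {"age": 60, "gender": "M"},
}
# (source, cost, quantity) of the user_product edges.
edges = [
    ("pat_0", 100, 1),
    ("pat_1", 150, 2),
    ("pat_1", 50, 1),
    ("pat_2", 75, 12),
    ("pat_2", 200, 1),
    ("pat_3", 180, 12),
    ("pat_4", 120, 1),
    ("pat_4", 60, 2),
]

# Edge side: users with at least one edge of cost below 200 and quantity 1.
sources_with_match = {
    source for source, cost, quantity in edges if cost < 200 and quantity == 1
}

# Node side: integer age above 30, gender "M", and a matching edge.
selected = [
    index
    for index, attributes in users.items()
    if isinstance(attributes["age"], int)
    and attributes["age"] > 30
    and attributes["gender"] == "M"
    and index in sources_with_match
]

print(selected)
```

pat_2 satisfies the node conditions, but its only edge with quantity 1 costs exactly 200 and therefore fails the strict less-than condition, which leaves pat_4 as the single match.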

6. Clones

Since the statements in the query engine are additive, every operation modifies the state of the query. That means that it is not possible to revert to a previous state unless the entire query is rewritten from scratch for that intermediate step. This can become inefficient and redundant, particularly when multiple branches of a query or comparisons with intermediate results are required.

To address this limitation, the clone() method was introduced. This method allows users to create independent copies - or clones - of operands or computed values at any point in the query chain. Clones are completely decoupled from the original object, meaning that modifications of the clone do not affect the original, and vice versa. This functionality applies to all types of operands.

# Clone query
def query_node_clone(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.index().contains("pat")

    mean_age_original = node.attribute("age").mean()
    mean_age_clone = mean_age_original.clone()  # Clone the mean age

    # Subtract 5 from the cloned mean age (original remains unchanged)
    mean_age_clone.subtract(5)

    node.attribute("age").less_than(mean_age_original)  # Mean age
    node.attribute("age").greater_than(mean_age_clone)  # Mean age minus 5

    return node.index()


graphrecord.query_nodes(query_node_clone)

[]
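The decoupling between an operand and its clone, and the reason the query above returns an empty list, can be illustrated with a toy model. The `Value` class below is a hypothetical stand-in written for this sketch, not the engine's operand class; it only mimics in-place modification and an independent clone().

```python
from statistics import mean


class Value:
    """Toy mutable value; a hypothetical stand-in, not the engine's operand."""

    def __init__(self, number):
        self.number = number

    def subtract(self, amount):
        # Mutates in place, like an additive step in a query.
        self.number -= amount

    def clone(self):
        # Returns an independent copy that no longer tracks the original.
        return Value(self.number)


# Ages of the five users from Section 2.
ages = {"pat_0": 20, "pat_1": 30, "pat_2": 40, "pat_3": 50, "pat_4": 60}

mean_age_original = Value(mean(ages.values()))
mean_age_clone = mean_age_original.clone()
mean_age_clone.subtract(5)  # only the clone changes

# Users strictly between the lowered clone and the original mean.
selected = [
    index
    for index, age in ages.items()
    if mean_age_clone.number < age < mean_age_original.number
]

print(mean_age_original.number, mean_age_clone.number, selected)
```

With a mean age of 40 and a lower bound of 35, no user falls strictly inside the interval, which is consistent with the empty result above. With these ages, the offset would have to exceed 10 before pat_1 (age 30) qualified.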

7. Full Example Code

The full code examples for this chapter can be found here:

import pandas as pd

import graphrecords as gr
from graphrecords.querying import (
    EdgeIndicesOperand,
    EdgeOperand,
    NodeIndicesOperand,
    NodeOperand,
)

# Create example dataset manually
users = pd.DataFrame(
    [
        ["pat_0", 20, "M"],
        ["pat_1", 30, "F"],
        ["pat_2", 40, "M"],
        ["pat_3", 50, "F"],
        ["pat_4", 60, "M"],
    ],
    columns=["index", "age", "gender"],
)

products = pd.DataFrame(
    [
        ["drug_0", "fentanyl injection"],
        ["drug_1", "aspirin tablet"],
        ["drug_2", "insulin pen"],
    ],
    columns=["index", "description"],
)

user_product = pd.DataFrame(
    [
        ["pat_0", "drug_0", 100, 1, "2020-01-01"],
        ["pat_1", "drug_0", 150, 2, "2020-02-15"],
        ["pat_1", "drug_1", 50, 1, "2020-03-10"],
        ["pat_2", "drug_1", 75, 12, "2020-04-20"],
        ["pat_2", "drug_2", 200, 1, "2020-05-05"],
        ["pat_3", "drug_2", 180, 12, "2020-06-30"],
        ["pat_4", "drug_0", 120, 1, "2020-07-15"],
        ["pat_4", "drug_1", 60, 2, "2020-08-01"],
    ],
    columns=["source", "target", "cost", "quantity", "time"],
)

graphrecord = (
    gr.GraphRecord.builder()
    .add_nodes((users, "index"), group="user")
    .add_nodes((products, "index"), group="product")
    .add_edges((user_product, "source", "target"), group="user_product")
    .build()
)


# Basic node query
def query_node_in_user(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")

    return node.index()


graphrecord.query_nodes(query_node_in_user)


# Intermediate node query
def query_node_user_older_than_30(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.index().contains("pat")

    node.has_attribute("age")
    node.attribute("age").greater_than(30)

    return node.index()


graphrecord.query_nodes(query_node_user_older_than_30)


# Reusing node query
def query_node_reused(node: NodeOperand) -> NodeIndicesOperand:
    query_node_in_user(node)
    node.index().contains("pat")

    node.has_attribute("age")
    node.attribute("age").greater_than(30)

    return node.index()


graphrecord.query_nodes(query_node_reused)


# Node query with neighbors function
def query_node_neighbors(node: NodeOperand) -> NodeIndicesOperand:
    query_node_user_older_than_30(node)

    description_neighbors = node.neighbors().attribute("description")
    description_neighbors.lowercase()
    description_neighbors.contains("fentanyl")

    return node.index()


graphrecord.query_nodes(query_node_neighbors)


# Basic edge query
def query_edge_user_product(edge: EdgeOperand) -> EdgeIndicesOperand:
    edge.in_group("user_product")
    return edge.index()


edges = graphrecord.query_edges(query_edge_user_product)
edges[0:5]


# Advanced edge query
def query_edge_old_user_cheap_item(edge: EdgeOperand) -> EdgeIndicesOperand:
    edge.in_group("user_product")
    edge.attribute("cost").less_than(200)

    edge.source_node().attribute("age").is_max()
    edge.target_node().attribute("description").contains("insulin")
    return edge.index()


graphrecord.query_edges(query_edge_old_user_cheap_item)


# Combined node and edge query
def query_edge_combined(edge: EdgeOperand) -> EdgeIndicesOperand:
    edge.in_group("user_product")
    edge.attribute("cost").less_than(200)
    edge.attribute("quantity").equal_to(1)

    return edge.index()


def query_node_combined(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.attribute("age").is_int()
    node.attribute("age").greater_than(30)
    node.attribute("gender").equal_to("M")

    query_edge_combined(node.edges())

    return node.index()


graphrecord.query_nodes(query_node_combined)


# Either/or query
def query_edge_either(edge: EdgeOperand) -> None:
    edge.in_group("user_product")
    edge.attribute("cost").less_than(200)
    edge.attribute("quantity").equal_to(1)


def query_edge_or(edge: EdgeOperand) -> None:
    edge.in_group("user_product")
    edge.attribute("cost").less_than(200)
    edge.attribute("quantity").equal_to(12)


def query_node_either_or(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.attribute("age").greater_than(30)

    node.edges().either_or(query_edge_either, query_edge_or)

    return node.index()


graphrecord.query_nodes(query_node_either_or)


def query_node_either_or_component(node: NodeOperand) -> None:
    node.in_group("user")
    node.attribute("age").greater_than(30)

    node.edges().either_or(query_edge_either, query_edge_or)


# Exclude query
def query_node_exclude(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.exclude(query_node_either_or_component)

    return node.index()


graphrecord.query_nodes(query_node_exclude)


# Clone query
def query_node_clone(node: NodeOperand) -> NodeIndicesOperand:
    node.in_group("user")
    node.index().contains("pat")

    mean_age_original = node.attribute("age").mean()
    mean_age_clone = mean_age_original.clone()  # Clone the mean age

    # Subtract 5 from the cloned mean age (original remains unchanged)
    mean_age_clone.subtract(5)

    node.attribute("age").less_than(mean_age_original)  # Mean age
    node.attribute("age").greater_than(mean_age_clone)  # Mean age minus 5

    return node.index()


graphrecord.query_nodes(query_node_clone)

# Node queries as function arguments
graphrecord.unfreeze_schema()
graphrecord.add_group("old_male_user", nodes=query_node_user_older_than_30)
graphrecord.groups

graphrecord.node[query_node_either_or]
graphrecord.groups_of_node(query_node_user_older_than_30)
graphrecord.edge_endpoints(query_edge_old_user_cheap_item)
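The listing above also uses either_or() and exclude(), which the earlier sections do not walk through. The plain-Python sketch below shows one plausible reading of both on a mirror of the example data: either_or() is assumed to keep elements that satisfy at least one of the two sub-queries, and exclude() is assumed to drop the elements that satisfy the given sub-query. Both readings are inferred from the method names and are not confirmed by this guide; the helper names are illustrative.

```python
# Illustrative plain-Python mirror of the example data from Section 2.
ages = {"pat_0": 20, "pat_1": 30, "pat_2": 40, "pat_3": 50, "pat_4": 60}
# (source, cost, quantity) of the user_product edges.
edges = [
    ("pat_0", 100, 1),
    ("pat_1", 150, 2),
    ("pat_1", 50, 1),
    ("pat_2", 75, 12),
    ("pat_2", 200, 1),
    ("pat_3", 180, 12),
    ("pat_4", 120, 1),
    ("pat_4", 60, 2),
]


def matches_either(cost, quantity):
    # Mirrors query_edge_either: cost below 200 and quantity equal to 1.
    return cost < 200 and quantity == 1


def matches_or(cost, quantity):
    # Mirrors query_edge_or: cost below 200 and quantity equal to 12.
    return cost < 200 and quantity == 12


older_than_30 = {user for user, age in ages.items() if age > 30}

# Assumed either_or semantics: at least one of the two branches holds.
either_or_users = sorted(
    {
        source
        for source, cost, quantity in edges
        if source in older_than_30
        and (matches_either(cost, quantity) or matches_or(cost, quantity))
    }
)

# Assumed exclude semantics: all users except those matched above.
excluded_users = sorted(set(ages) - set(either_or_users))

print(either_or_users, excluded_users)
```

Under these assumptions, the either/or branch selects pat_2, pat_3 and pat_4, and the exclude query keeps the remaining users pat_0 and pat_1.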