GraphRecord
1. Preface
Every major library has a central object that constitutes its core. For PyTorch, it is the torch.Tensor, whereas for NumPy, it is the np.ndarray. GraphRecords is likewise built around a single foundational structure, the GraphRecord.
GraphRecords provides advanced data analytics methods out of the box by storing data in a structured way. This is enabled by the GraphRecord class, which organizes data of any complexity within a graph structure. Its backend is implemented in Rust, which is intended to keep operations fast even on very large datasets.
import pandas as pd
import polars as pl
import graphrecords as gr
2. Adding Nodes to a GraphRecord
Let’s begin by introducing some sample data:
| ID | Age | Type | Region |
|---|---|---|---|
| User 01 | 72 | M | USA |
| User 02 | 74 | M | USA |
| User 03 | 64 | F | GER |
This data, stored for example in a Pandas DataFrame, looks like this:
users = pd.DataFrame(
[
["User 01", 72, "M", "USA"],
["User 02", 74, "M", "USA"],
["User 03", 64, "F", "GER"],
],
columns=["ID", "Age", "Type", "Region"],
)
In the example below, we create a new GraphRecord using the builder pattern. We instantiate a GraphRecordBuilder and instruct it to add the Pandas DataFrame as nodes, using the ‘ID’ column for indexing. Additionally, we assign these nodes to the group ‘Users’.
The Builder Pattern simplifies creating complex objects by constructing them step by step. It improves flexibility, readability, and consistency, making it easier to manage and configure objects in a controlled way.
record = gr.GraphRecord.builder().add_nodes((users, "ID"), group="Users").build()
Methods used in the snippet
builder(): Creates a new GraphRecordBuilder instance to build a GraphRecord.
add_nodes(): Adds nodes to the GraphRecord from different data formats and optionally assigns them to a group.
build(): Constructs a GraphRecord instance from the builder’s configuration.
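As an aside on the builder pattern described above, the following is a minimal, library-independent sketch of the idea. The names ToyGraph and ToyGraphBuilder are hypothetical and exist only for illustration; this is not how GraphRecordBuilder is implemented. It only shows why chained calls work (each step returns the builder) and that the object is created in the final build() call.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Hypothetical stand-in for a graph container; not the GraphRecord class."""

    nodes: dict = field(default_factory=dict)   # node index -> attribute dict
    groups: dict = field(default_factory=dict)  # group name -> set of node indices


class ToyGraphBuilder:
    """Collects configuration step by step and creates the object in build()."""

    def __init__(self):
        self._pending = []  # (rows, group) pairs queued until build() is called

    def add_nodes(self, rows, group=None):
        self._pending.append((rows, group))
        return self  # returning self is what allows method chaining

    def build(self):
        graph = ToyGraph()
        for rows, group in self._pending:
            for index, attributes in rows:
                graph.nodes[index] = dict(attributes)
                if group is not None:
                    graph.groups.setdefault(group, set()).add(index)
        return graph


toy = (
    ToyGraphBuilder()
    .add_nodes([("User 01", {"Age": 72}), ("User 02", {"Age": 74})], group="Users")
    .build()
)
print(sorted(toy.nodes))  # ['User 01', 'User 02']
```

Nothing is constructed until build() runs, so any number of add_nodes() calls can be queued beforehand.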
The GraphRecords GraphRecord object, record, now contains three users. Each user is identified by a unique index and has specific attributes, such as age, type, and region. These users serve as the initial nodes in the graph structure of our GraphRecord.
We can now proceed by adding additional data, such as the following products.
products = pd.DataFrame(
[["Product 01", "Item A"], ["Product 02", "Item B"]], columns=["ID", "Name"]
)
Using the builder pattern to construct the GraphRecord allows us to pass as many nodes and edges as needed. If nodes are not added during the initial graph construction, they can easily be added later to an existing GraphRecord by calling add_nodes(), where you provide the DataFrame and specify the column containing the node indices.
record.add_nodes((products, "ID"), group="Products")
Methods used in the snippet
add_nodes(): Adds nodes to the GraphRecord from different data formats and optionally assigns them to a group.
This expands the GraphRecord by two new product nodes. However, these nodes are not yet connected to the users, so let’s establish relationships between them!
Note
Nodes can be added to the GraphRecord in many different formats, such as a Pandas DataFrame (as previously shown), but also from a list of NodeTuples, each pairing a node index with a dictionary of attributes:
user_tuples = [
("User 04", {"Age": 45, "Type": "F", "Region": "CHI"}),
("User 05", {"Age": 26, "Type": "M", "Region": "SPA"}),
]
record.add_nodes(user_tuples, group="Users")
Or from a Polars DataFrame:
user_polars = pl.DataFrame(
[
["User 06", 55, "F", "GER"],
["User 07", 61, "F", "USA"],
["User 08", 73, "M", "CHI"],
],
schema=["ID", "Age", "Type", "Region"],
orient="row",
)
record.add_nodes((user_polars, "ID"), group="Users")
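The tabular and tuple input formats shown in this note carry the same information: one value serves as the node index, and the remaining columns become attributes. The sketch below makes that correspondence explicit in plain Python. The helper rows_to_node_tuples is hypothetical and not part of GraphRecords, and whether the library converts DataFrames this way internally is not stated in this guide.

```python
# Column names and rows copied from the sample user table in this chapter.
columns = ["ID", "Age", "Type", "Region"]
rows = [
    ["User 01", 72, "M", "USA"],
    ["User 02", 74, "M", "USA"],
    ["User 03", 64, "F", "GER"],
]


def rows_to_node_tuples(rows, columns, index_column):
    """Turn table rows into (index, attributes) pairs (hypothetical helper)."""
    index_position = columns.index(index_column)
    node_tuples = []
    for row in rows:
        # Every column except the index column becomes a node attribute.
        attributes = {
            name: value
            for position, (name, value) in enumerate(zip(columns, row))
            if position != index_position
        }
        node_tuples.append((row[index_position], attributes))
    return node_tuples


node_tuples = rows_to_node_tuples(rows, columns, index_column="ID")
print(node_tuples[0])  # ('User 01', {'Age': 72, 'Type': 'M', 'Region': 'USA'})
```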
3. Adding Edges to a GraphRecord
To capture meaningful relationships between nodes, such as linking users to purchased products, we add edges to the GraphRecord. These edges must be specified in a relation table, such as the one shown below:
| User_ID | Product_ID | Date |
|---|---|---|
| User 02 | Product 01 | 2020/06/07 |
| User 02 | Product 02 | 2018/02/02 |
| User 03 | Product 02 | 2019/03/02 |
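With the relation table stored in a Pandas DataFrame named user_product (its definition is included in the full example at the end of this chapter), we can then add these edges to our GraphRecord: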
record.add_edges((user_product, "User_ID", "Product_ID"))
Methods used in the snippet
add_edges(): Adds edges to the GraphRecord from different data formats and optionally assigns them to a group.
The graph now contains three edges, each connecting a user to a product and carrying the purchase date as a Date attribute.
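To illustrate what a relation table encodes, here is a small plain-Python sketch that turns its rows into edges with a source, a target, and an attribute dictionary. The dictionary layout and the name toy_edges are hypothetical. The assumption that edges receive consecutive integer indices in insertion order is inferred from the record.edge[0] example later in this chapter, not from a documented guarantee.

```python
from datetime import datetime

# Rows copied from the relation table above: (source, target, date).
relation_rows = [
    ("User 02", "Product 01", "2020/06/07"),
    ("User 02", "Product 02", "2018/02/02"),
    ("User 03", "Product 02", "2019/03/02"),
]

# Assumption: edges are numbered 0, 1, 2, ... in the order they are added.
toy_edges = {
    index: {
        "source": source,
        "target": target,
        "attributes": {"Date": datetime.strptime(date, "%Y/%m/%d")},
    }
    for index, (source, target, date) in enumerate(relation_rows)
}

print(toy_edges[0]["attributes"])  # {'Date': datetime.datetime(2020, 6, 7, 0, 0)}

# Following edges from a node: all products linked to "User 02".
products_of_user_02 = [
    edge["target"] for edge in toy_edges.values() if edge["source"] == "User 02"
]
print(products_of_user_02)  # ['Product 01', 'Product 02']
```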
4. Adding Groups to a GraphRecord
For certain analyses, we may want to define specific subcohorts within our GraphRecord for easier access. We can do this by defining named groups within our GraphRecord.
record.add_group("US-Users", nodes=["User 01", "User 02"])
Methods used in the snippet
add_group(): Adds a group to the GraphRecord instance with an optional list of node and/or edge indices.
This group will include all the defined nodes, allowing for easier access during complex analyses. Both nodes and edges can be added to a group, with no limitations on group size. Additionally, nodes and edges can belong to multiple groups without restriction.
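The point that a node can belong to several groups at once can be pictured with a simple mapping from group names to sets of node indices. This is only a conceptual sketch with hypothetical names (toy_groups, groups_of); it makes no claim about how GraphRecord stores groups internally.

```python
# Hypothetical in-memory model: group name -> set of node indices.
toy_groups = {
    "Users": {"User 01", "User 02", "User 03"},
    "Products": {"Product 01", "Product 02"},
}

# Adding a named subcohort does not remove its members from other groups.
toy_groups["US-Users"] = {"User 01", "User 02"}


def groups_of(node, groups):
    """Return the sorted names of all groups that contain the given node."""
    return sorted(name for name, members in groups.items() if node in members)


print(groups_of("User 01", toy_groups))  # ['US-Users', 'Users']
print(groups_of("User 03", toy_groups))  # ['Users']
```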
5. Saving and Loading GraphRecords
When building a GraphRecord, you may want to save it to create a persistent version. This can be done by storing it as a RON (Rusty Object Notation) file. The GraphRecord can then be reloaded, allowing you to create a new instance from the saved RON file.
record.to_ron("record.ron")
new_record = gr.GraphRecord.from_ron("record.ron")
Methods used in the snippet
to_ron(): Writes the GraphRecord instance to a RON file.
from_ron(): Creates a GraphRecord instance from a RON file.
6. Overview Tables
The GraphRecord class is designed to handle large datasets efficiently while maintaining a standardized data structure that supports complex analysis methods. As a result, the structure within a GraphRecord can become intricate and hard to keep track of. To address this, GraphRecords offers tools for inspecting the graph-based data. One such tool is the overview() method, which prints an overview of all nodes and edges in the GraphRecord. The output below was produced by the full example at the end of this chapter, which additionally adds one ungrouped node, User 09, before calling overview().
record.overview()
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                         Node Overview                                          │
├───────────┬────────────┬───────────┬────────────────┬────────────────┬─────────────────────────┤
│ Group     │ Node Count │ Attribute │ Attribute Type │ Data Type      │ Details                 │
├───────────┼────────────┼───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │           │                │                │ Min: 65                 │
│           │            │ Age       │ Continuous     │ Option[Int]    │ Mean: 65                │
│           │            │           │                │                │ Max: 65                 │
│ Ungrouped │ 1          ├───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │ Type      │ Unstructured   │ Option[String] │ Distinct value count: 1 │
│           │            ├───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │ Region    │ Unstructured   │ Option[String] │ Distinct value count: 1 │
├───────────┼────────────┼───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │           │                │                │ Min: 72                 │
│           │            │ Age       │ Continuous     │ Int            │ Mean: 73                │
│           │            │           │                │                │ Max: 74                 │
│ US-Users  │ 2          ├───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │ Type      │ Unstructured   │ String         │ Distinct value count: 1 │
│           │            ├───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │ Region    │ Unstructured   │ String         │ Distinct value count: 1 │
├───────────┼────────────┼───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │           │                │                │ Min: 26                 │
│           │            │ Age       │ Continuous     │ Int            │ Mean: 58.75             │
│           │            │           │                │                │ Max: 74                 │
│ Users     │ 8          ├───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │ Region    │ Unstructured   │ String         │ Distinct value count: 4 │
│           │            ├───────────┼────────────────┼────────────────┼─────────────────────────┤
│           │            │ Type      │ Unstructured   │ String         │ Distinct value count: 2 │
├───────────┼────────────┼───────────┼────────────────┼────────────────┼─────────────────────────┤
│ Products  │ 2          │ Name      │ Unstructured   │ String         │ Distinct value count: 2 │
└───────────┴────────────┴───────────┴────────────────┴────────────────┴─────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                           Edge Overview                                           │
├───────────┬────────────┬───────────┬────────────────┬──────────────────┬──────────────────────────┤
│ Group     │ Edge Count │ Attribute │ Attribute Type │ Data Type        │ Details                  │
├───────────┼────────────┼───────────┼────────────────┼──────────────────┼──────────────────────────┤
│ Ungrouped │ 3          │ Date      │ Temporal       │ Option[DateTime] │ Min: 2018-02-02 00:00:00 │
│           │            │           │                │                  │ Max: 2020-06-07 00:00:00 │
└───────────┴────────────┴───────────┴────────────────┴──────────────────┴──────────────────────────┘
Methods used in the snippet
overview(): Gets a summary for all nodes and edges in groups and their attributes.
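To help read the Details column, the following plain-Python sketch recomputes the summary shown for the Users group from the eight user nodes added in this chapter. It is not the implementation of overview(); it only mirrors the pattern visible in the table, where numeric attributes are summarised by minimum, mean, and maximum and string attributes by their number of distinct values. The names user_attributes and summary are hypothetical.

```python
from statistics import mean

# Attributes of the eight nodes that were added to the 'Users' group.
user_attributes = {
    "User 01": {"Age": 72, "Type": "M", "Region": "USA"},
    "User 02": {"Age": 74, "Type": "M", "Region": "USA"},
    "User 03": {"Age": 64, "Type": "F", "Region": "GER"},
    "User 04": {"Age": 45, "Type": "F", "Region": "CHI"},
    "User 05": {"Age": 26, "Type": "M", "Region": "SPA"},
    "User 06": {"Age": 55, "Type": "F", "Region": "GER"},
    "User 07": {"Age": 61, "Type": "F", "Region": "USA"},
    "User 08": {"Age": 73, "Type": "M", "Region": "CHI"},
}

ages = [attributes["Age"] for attributes in user_attributes.values()]

summary = {
    # Numeric attribute: minimum, mean, and maximum.
    "Age": {"min": min(ages), "mean": mean(ages), "max": max(ages)},
    # String attributes: number of distinct values.
    "Type": {"distinct": len({a["Type"] for a in user_attributes.values()})},
    "Region": {"distinct": len({a["Region"] for a in user_attributes.values()})},
}

print(summary["Age"])     # {'min': 26, 'mean': 58.75, 'max': 74}
print(summary["Type"])    # {'distinct': 2}
print(summary["Region"])  # {'distinct': 4}
```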
7. Accessing Elements in a GraphRecord
Now that we have stored some structured data in our GraphRecord, we might want to access certain elements of it. The main ways to do this are selecting elements by their indices or by the groups they belong to.
We can, for example, get all available nodes:
record.nodes
['User 08', 'User 01', 'User 05', 'User 02', 'User 06', 'User 09', 'Product 01', 'User 03', 'Product 02', 'User 04', 'User 07']
Or access the attributes of a specific node:
record.node["User 01"]
{'Age': 72, 'Region': 'USA', 'Type': 'M'}
Or a specific edge:
record.edge[0]
{'Date': datetime.datetime(2020, 6, 7, 0, 0)}
Or get all available groups:
record.groups
['US-Users', 'Users', 'User-Product', 'Products']
Or get all nodes that belong to a certain group:
record.nodes_in_group("Products")
['Product 02', 'Product 01']
Methods used in the snippet
nodes: Lists the node indices in the GraphRecord instance.
node[]: Provides access to node information within the GraphRecord instance via an indexer, returning a dictionary with node indices as keys and node attributes as values.
edge[]: Provides access to edge attributes within the GraphRecord via an indexer, returning a dictionary with edge indices as keys and edge attributes as values.
groups: Lists the groups in the GraphRecord instance.
nodes_in_group(): Retrieves the node indices associated with the specified group(s) in the GraphRecord.
The GraphRecord also supports more advanced queries that find specific nodes based on time, relations, neighbors, or other criteria. These querying methods are covered in a later section of the user guide, Query Engine.
8. Full Example Code
The full code examples for this chapter can be found here:
import pandas as pd
import polars as pl
import graphrecords as gr
# Users DataFrame (Nodes)
users = pd.DataFrame(
[
["User 01", 72, "M", "USA"],
["User 02", 74, "M", "USA"],
["User 03", 64, "F", "GER"],
],
columns=["ID", "Age", "Type", "Region"],
)
# Products DataFrame (Nodes)
products = pd.DataFrame(
[["Product 01", "Item A"], ["Product 02", "Item B"]], columns=["ID", "Name"]
)
# User-Product Relation (Edges)
user_product = pd.DataFrame(
[
["User 02", "Product 01", pd.Timestamp("20200607")],
["User 02", "Product 02", pd.Timestamp("20180202")],
["User 03", "Product 02", pd.Timestamp("20190302")],
],
columns=["User_ID", "Product_ID", "Date"],
)
record = gr.GraphRecord.builder().add_nodes((users, "ID"), group="Users").build()
record.add_nodes((products, "ID"), group="Products")
user_tuples = [
("User 04", {"Age": 45, "Type": "F", "Region": "CHI"}),
("User 05", {"Age": 26, "Type": "M", "Region": "SPA"}),
]
record.add_nodes(user_tuples, group="Users")
user_polars = pl.DataFrame(
[
["User 06", 55, "F", "GER"],
["User 07", 61, "F", "USA"],
["User 08", 73, "M", "CHI"],
],
schema=["ID", "Age", "Type", "Region"],
orient="row",
)
record.add_nodes((user_polars, "ID"), group="Users")
record.add_edges((user_product, "User_ID", "Product_ID"))
record.add_group("US-Users", nodes=["User 01", "User 02"])
record.add_nodes(
(
pd.DataFrame(
[["User 09", 65, "M", "USA"]], columns=["ID", "Age", "Type", "Region"]
),
"ID",
),
)
record.overview()
# Adding edges to a certain group
record.add_group("User-Product", edges=record.edges)
# Getting all available nodes
record.nodes
# Accessing a certain node
record.node["User 01"]
# Accessing a certain edge
record.edge[0]
# Getting all available groups
record.groups
# Getting the nodes that are within a certain group
record.nodes_in_group("Products")
record.to_ron("record.ron")
new_record = gr.GraphRecord.from_ron("record.ron")