Creating an AWS Setup for Testing GraphDB and Generative AI Applications

Creating an AWS Setup for Testing GraphDB and Generative AI Applications

AWS Neptune + AWS Bedrock and Generative AI

Introduction

This article will discuss configuring an environment to work with Generative AI on a Graph Database.

My use case is not yet here. I want to develop a closed system to analyze and advise on project dependencies. Long story short - "Talk to your dependencies". Currently, the main blocker is the ability to import maven dependecy:tree graphml to AWS Neptune. I already submitted the bug to Issues with Importing GraphML File in AWS Neptune Engine 1.2.1.0 · Issue #335 · awslabs/amazon-neptune-tools (github.com)

The good news is that using Gremlin query language to create your graph is working as a charm.

Result

For your dear reader, I want to show the result that I got so you can decide if this article is for you.
After building a graph about "English Premier League" from graph-notebook/src/graph_notebook/notebooks/02-Visualization/EPL-Gremlin.ipynb at main · aws/graph-notebook (github.com)
I asked what AWS Bedrock (Claude 2) may say about this Graph and got an answer:

Based on the information provided, I have the following details about the graph:

The graph contains nodes representing English Premier League teams, their home stadiums, and the cities where they are located.

There are 20 teams: Arsenal, Chelsea, Tottenham Hotspur, West Ham United, Liverpool, Everton, Manchester United, Manchester City, Wolverhampton Wanderers, Burnley, Sheffield United, Crystal Palace, Southampton, Newcastle United, Brighton and Hove Albion, Watford, Bournemouth, Aston Villa, Leicester City, Norwich City

Each team node has properties like founded year, full name, nickname etc.

There are stadium nodes for each team's home stadium with properties like opening year, capacity, name.

There are city nodes for the city where each team is located.

All nodes have labels like 'Team', 'Stadium', 'City' to categorize them.

The nodes and properties provide information about EPL teams, their home stadiums and cities. The graph models real-world entities and relationships between them.

⚠️☢️☣️!!! WARNING !!!⚠️☢️☣️

Note 1. UNSECURE. This setup shows you how to overcome AWS Neptune connection to instances inside the same VPC. Limiting your Security Group to your IP and enabling IAM permissions is the way to mitigate this issue.

Note 2. MONEY. Because AI needs access to graph schema, thus requires access to statistics, which requires RAM, cheap/free tier instances t3 and t4 are NOT an option, please check Managing statistics for the Neptune DFE to use - Amazon Neptune

⚠️☢️☣️!!! WARNING !!!⚠️☢️☣️

Tooling

Setting up an efficient workflow is crucial in today's fast-paced tech world. Below are essential tools that can significantly enhance your workflow, especially when dealing with cloud services, databases, and AI.

0. AWS Bedrock - Hosted LLMs Services from AWS

AWS Bedrock stands out for its robust Large Language Models (LLMs) services. This platform offers unparalleled scalability and reliability, making it an ideal choice for businesses seeking powerful cloud-based AI solutions.

1. Graph Database - AWS Neptune

AWS Neptune is a fully managed graph database service. It excels over local graph database solutions. Its managed nature means less time spent on setup and maintenance, allowing teams to focus more on development and innovation.

2. TinkerPop Gremlin Query Language

The TinkerPop Gremlin query language is a versatile tool for querying graph data in a technology-agnostic manner. Its ability to work across various graph databases makes it an invaluable asset in a developer's toolkit, ensuring flexibility and adaptability in your data handling.

3. TinkerPop Gremlin Driver for Java

For Java developers, the TinkerPop Gremlin driver is a game-changer. It simplifies the process of creating and manipulating graph databases within Java applications, making it a must-have for any Java-based project involving graph data.

4. AWS Graph Notebook Project

The AWS Graph Notebook project (https://github.com/aws/graph-notebook) offers a powerful way to interact with graph databases. These notebooks facilitate not just querying but also visualizing graph data, enhancing understanding and presentation of complex relationships.

5. Jupyter Lab for Notebooks

Jupyter Lab is an advanced version of Jupyter Notebooks. It provides a flexible and interactive environment for running code, visualizing data, and writing narrative content. Its integration with other tools like AWS Graph Notebook makes it an essential part of any data scientist’s arsenal.

6. HAProxy for Proxy Connections

HAProxy serves as an effective solution for proxying connections from AWS Virtual Private Cloud (VPC) to external machines. It ensures secure and efficient handling of traffic, making it a key component in managing cloud-based resources.

7. Langchain for AI-Powered Querying

Finally, Langchain is a cutting-edge tool for querying graph databases using Generative AI. It opens up new possibilities in data analysis and interaction, leveraging the power of AI to streamline complex queries and data processing tasks.

Each of these tools contributes uniquely to the overall efficiency and capability of a modern tech stack, particularly in environments where cloud services, AI, and data handling play a central role. Integrating them into your workflow can significantly enhance your team's productivity and the quality of your output.

Bumps on the road

AWS Neptune graph database access

While creating Neptune instance, please keep in mind two aspects:

  • It's important to use engine 1.2 to get graph schema

  • It's important to use bigger instances like db.r5.large, NOT t3 or t4, because by default graph statistics is disabled on this instances

HAProxy role

According to AWS, your Neptune instance is reachable from VPC only, all AWS examples are based on running EC2 instances in the same VPC that your Neptune instance.

If you want to use your external machine with Neptune, according to Connecting to Amazon Neptune from Clients Outside the Neptune VPC | aws-dbs-refarch-graph (aws-samples.github.io) proxying queries is option, UNSECURE by default option.

So to configure this option, you need an AWS EC2 instance, a free tire t2 instance is OK.

Install HAproxy and add these lines to your config:

frontend gremlin_proxy
    bind *:8182
    default_backend neptune_gremlin

backend neptune_gremlin
    mode http
    server neptune_server db-endpoint.cluster-xxxxx.region.neptune.amazonaws.com:8182 ssl verify none

Maybe you want to add Elastic IP to your EC2 instance so you can reuse same config after EC2 restarts.

Please follow me to configure your app with HAProxy

Gremlin Console and Java app

The connection configuration is similar here.

Please use the same minor versions between your server and client. You can use a URL for your HAProxy like ec2-xxxxx.region.compute.amazonaws.com:8182/status to get a version of the server gremlin. It's important because of serializers.

For example, in version 3.7 gremlin-driver changes the package name of serializers in comparison to version 3.6.

To connect to your AWS Neptune Gremlin Server you need two files:

remote-graph.properties for java

gremlin.remote.remoteConnectionClass=org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection
gremlin.remote.driver.clusterFile=conf/remote-objects.yaml
gremlin.remote.driver.sourceName=g

remote-objects.yaml

hosts: [ "haproxy-ec2-instance.compute.amazonaws.com" ]
port: 8182
connectionPool: { enableSsl: false }
serializer: {
  className: org.apache.tinkerpop.gremlin.util.ser.GraphBinaryMessageSerializerV1,
  config: { ioRegistries: [ org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry ] } }

Please notice that I'm using raw HTTP and className is a serializer class path that is still outdated for the latest gremlin 3.7.0 (don't use it yet), use 3.6.

Graph Notebook with Jupyter Lab

Here is is simple:

  1. Don't mess with your Python environment - use pyenv/pyenv: Simple Python version management (github.com)

  2. Install all dev packages so your Python has all features enabled

  3. Install python version 3.10, NOT NOT NOT 3.11

  4. Follow aws/graph-notebook: Library extending Jupyter notebooks to integrate with Apache TinkerPop, openCypher, and RDF SPARQL. (github.com) Readme to pip install "jupyterlab>=3,<4", version 3 is stable, NOT 4

Finally something interesting

After you manage to install and configure all moving parts here sample notebook to configure the connection to AWS Neptune:

%graph_notebook_config
{
  "host": "ec2-haproxy.region.compute.amazonaws.com",
  "port": 8182,
  "ssl": False,
  "gremlin": {
    "traversal_source": "g",
    "message_serializer": "graphsonv3"
  }
}
%status
%%gremlin
g.V().count()

Then please use graph-notebook/src/graph_notebook/notebooks/02-Visualization/EPL-Gremlin.ipynb at main · aws/graph-notebook (github.com) to create your first graph.

Here comes AI

After you build your foundation you can "copy/paste" example from Neptune Open Cypher QA Chain | 🦜️🔗 Langchain

from langchain.graphs import NeptuneGraph

host = "ec2-haproxy.region.compute.amazonaws.com"
port = 8182
use_https = False
graph = NeptuneGraph(host=host, port=port, use_https=use_https)

import boto3
boto3_bedrock  = boto3.client("bedrock-runtime", 'us-east-1')
from langchain.llms.bedrock import Bedrock
llm = Bedrock(
    model_id="anthropic.claude-v2",
    model_kwargs={
        "temperature": 0,
        "top_k": 250,
        "top_p": 1,  
    },
    client=boto3_bedrock,
)

chain = NeptuneOpenCypherQAChain.from_llm(llm=llm, graph=graph)

graph = NeptuneGraph(host=host, port=port, use_https=use_https)
chain.run("What information do you have about graph? ")

Summary

My article discusses setting up an environment for working with Generative AI on a Graph Database using AWS Neptune and AWS Bedrock. I cover essential tools such as the TinkerPop Gremlin Query Language, AWS Graph Notebook, and Jupyter Lab. I also address challenges like connecting to AWS Neptune instances and utilizing HAProxy for proxy connections. I provide a detailed step-by-step guide on configuring and using these tools to create a graph and leverage AI for querying and analyzing data.