21 July 2021
We are building a Relational Knowledge Graph Management System. We have a big vision for this product, and you can learn more about that vision in other videos on our website, relational.ai.
Today we are going to focus on Data Applications built on Knowledge Graphs.
We’re going to walk through some highlights of a demo that uses a Knowledge Graph to solve a business problem.
This demo is called the “Knowledge Graph to Learn Knowledge Graphs”, or KGLKG. It parallels my own journey from programming in imperative languages like Java, C++, and Python to RAI's declarative language, Rel. For many years, I used those legacy languages plus SQL in an application-centric, database-oriented way of solving business problems. Today I use Rel and a data-centric business modeling approach to create business solutions.
If you have seen the Overview video, or had a presentation from our Sales team, you will recognize this reference architecture diagram for Data Apps built on the RAI platform.
Here is how the KGLKG Data App fits into this architecture.
A full demo would walk through these components, but we’re going to jump ahead to the demo itself.
A Jupyter notebook will serve as the front-end for today's demo.
The business problem revolves around choosing the most relevant learning materials to bring a new Sales or Sales Engineering hire up to speed. Documents are suggested based on their conceptual content, quality, degree of difficulty, and appropriateness for a sequenced learning plan.
We model the business problem with this Knowledge Graph. Let’s build it up piece by piece.
We begin with the core nodes and relationships (or edges) that center around the learning materials and the concepts or topics they contain. The learning materials include PDFs, HTML pages such as blog posts and e-zine articles, videos, and so on. We’re going to focus on PDFs and HTML pages.
We ingest a lemma CSV containing document names, concept names, and concept weights (the number of occurrences of that concept in the document).
We define relations in Rel for the document and concept nodes. These are based on the lemma CSV. Here are the documents…
The “About” edges link the Document and Concept nodes. A document is about some set of concepts, and each concept is mentioned by some number of documents. Each such “About” relationship has a weight attribute: the number of times that concept appears in that document.
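To make the shape of this data concrete, here is a minimal Python sketch of the same idea (the demo itself defines these relations in Rel). The sample rows, the column names `document`, `concept`, and `weight`, and the helper `load_lemmas` are all assumptions made for illustration; they are not taken from the actual lemma CSV.

```python
import csv
import io

# Hypothetical sample of the lemma CSV; the real file's column names
# and contents are not shown in this walkthrough.
LEMMA_CSV = """document,concept,weight
intro_to_kgs.pdf,graph,12
intro_to_kgs.pdf,ai,3
rel_primer.html,graph,4
rel_primer.html,relation,9
"""


def load_lemmas(text):
    """Parse lemma rows into (document, concept, weight) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["document"], row["concept"], int(row["weight"])) for row in reader]


rows = load_lemmas(LEMMA_CSV)

# Node sets derived from the ingested rows.
documents = {doc for doc, _, _ in rows}
concepts = {concept for _, concept, _ in rows}

# "About" edges keyed by (document, concept), carrying the weight attribute.
about_weight = {(doc, concept): weight for doc, concept, weight in rows}
```

Each key in `about_weight` plays the role of one About edge, and its value is the occurrence count for that concept in that document.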
We can list all the About edges or about_weight attributes, but here let’s query the about_weight attribute and use a regular expression to find the documents and about_weights for concept words that contain the substring “graph”.
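The same kind of filter can be sketched in Python with the standard `re` module. The sample `about_weight` data and the helper name `concepts_matching` are hypothetical, and this is only an analogue of the Rel query, not the query used in the demo.

```python
import re

# Hypothetical About edges: (document, concept) -> weight.
about_weight = {
    ("intro_to_kgs.pdf", "graph"): 12,
    ("intro_to_kgs.pdf", "subgraph"): 2,
    ("rel_primer.html", "graphs"): 4,
    ("rel_primer.html", "relation"): 9,
}

# Match any concept word that contains the substring "graph".
pattern = re.compile(r"graph")


def concepts_matching(about_weight, pattern):
    """Return sorted (document, concept, weight) rows whose concept matches."""
    return sorted(
        (doc, concept, weight)
        for (doc, concept), weight in about_weight.items()
        if pattern.search(concept)
    )


matches = concepts_matching(about_weight, pattern)
```

Because `pattern.search` looks for the substring anywhere in the word, “subgraph” and “graphs” match alongside “graph”, while “relation” does not.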
Here’s the Knowledge Graph fragment we’ve created so far annotated with example data.
In addition to querying attributes from the loaded data, we can compute knowledge from the nodes and relationships.
These computations can be executed as runtime queries.
Or they can be generalized and declaratively defined as relations, so that this computed knowledge becomes part of our Knowledge Graph and is available to simple queries.
Our diagram highlights the new computed attributes added to our Knowledge Graph. These are automatically recomputed as the underlying data changes, say when a new document and its concepts are added to the graph. Contrast this with other graph databases which require you to explicitly run queries to recompute attributes that may be affected by data updates.
To suggest documents based on their content, we compute the “focus” of each document on each of the concepts it mentions. The about_focus attribute becomes a queryable part of our knowledge graph. For convenience, we also create an additional relation, called “suggested”, that gives us a Top-N list of documents for a specified concept.
The definition shown here is unbounded, too generalized to be pre-computed. So we mark it with “@inline” to defer evaluation of the relation until it is used in a query for a specific concept. You might even think of the suggested relation as similar to a procedural language function.
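Following the function analogy, here is a rough Python sketch of a focus computation and a Top-N lookup. The formula for focus is not spelled out in this walkthrough, so the sketch assumes focus is a concept's weight divided by the total weight of all concepts in the same document. The sample data and the function names are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical About edges: (document, concept) -> weight.
about_weight = {
    ("intro_to_kgs.pdf", "graph"): 12,
    ("intro_to_kgs.pdf", "ai"): 3,
    ("rel_primer.html", "graph"): 4,
    ("rel_primer.html", "relation"): 12,
    ("ai_overview.pdf", "ai"): 9,
    ("ai_overview.pdf", "graph"): 1,
}


def about_focus(about_weight):
    """ASSUMED definition: a concept's share of its document's total weight."""
    totals = defaultdict(int)
    for (doc, _), weight in about_weight.items():
        totals[doc] += weight
    return {
        (doc, concept): weight / totals[doc]
        for (doc, concept), weight in about_weight.items()
    }


def suggested(focus, concept, n):
    """Return up to n documents for one concept, highest focus first."""
    scored = [(score, doc) for (doc, c), score in focus.items() if c == concept]
    # Sort by descending focus, breaking ties by document name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in scored[:n]]


focus = about_focus(about_weight)
top_graph = suggested(focus, "graph", 2)
```

Like the inlined relation, `suggested` does no work until it is called with a specific concept and a specific N.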
Here’s a query that joins the top-N list for “graph” with the top-N list for “ai” to get a top-N list of documents about BOTH topics.
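One way to picture that join in Python is as an intersection of two Top-N lists. This is a sketch under stated assumptions: the focus scores are made up, and ranking the shared documents by the sum of their two focus scores is our own choice for illustration, not necessarily how the demo orders its results.

```python
# Hypothetical focus scores: (document, concept) -> focus.
about_focus = {
    ("intro_to_kgs.pdf", "graph"): 0.8,
    ("intro_to_kgs.pdf", "ai"): 0.2,
    ("rel_primer.html", "graph"): 0.25,
    ("ai_overview.pdf", "ai"): 0.7,
    ("ai_overview.pdf", "graph"): 0.1,
}


def top_n(focus, concept, n):
    """Map each of the top-n documents for a concept to its focus score."""
    scored = {doc: s for (doc, c), s in focus.items() if c == concept}
    best = sorted(scored, key=lambda d: (-scored[d], d))[:n]
    return {doc: scored[doc] for doc in best}


def suggested_both(focus, concept_a, concept_b, n):
    """Documents that appear in BOTH top-n lists, best combined focus first."""
    top_a = top_n(focus, concept_a, n)
    top_b = top_n(focus, concept_b, n)
    shared = top_a.keys() & top_b.keys()
    # ASSUMPTION: rank shared documents by the sum of their two focus scores.
    return sorted(shared, key=lambda d: (-(top_a[d] + top_b[d]), d))[:n]


both = suggested_both(about_focus, "graph", "ai", 3)
```

Here `rel_primer.html` drops out because it never mentions “ai”, so it cannot appear in both lists.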
So far we have based our suggestions purely on statistical analysis of the documents. But now we want to include knowledge from Reviewers who rank and rate these documents.
We add a Person node with a dynamic role attribute that can track their status as learner, employee, or curator of the learning materials.
The full demo shows how we dynamically add data, perform ELT, and use integrity constraints to guarantee our data conforms to our business rules. Right now, we’re going to skip ahead to the Reviewer role.
Any employee can review documents, and they are not confined to the curated library of materials. In fact, reviewing documents outside the curated library is how new material is found for the library.
In this example, we have two reviewers, Ben and Steve. Ben leads a sales team and reviews documents with an eye towards onboarding new sales hires. He tracks his reading in a Google Sheet called Sales Onboarding Plan, or SOP for short.
Steve leads a sales engineering team and reviews documents with an eye towards onboarding new sales engineering hires. He tracks his reading in a Google Sheet called Hersker’s Learning Curve, or HLC for short.
Reflecting the real world, there was no standard format for these tracking sheets, so Ben and Steve each created different formats that must be reconciled for our analysis. In his sheet, Ben created document titles that were linked to the URL. Steve used separate columns for the title and the URL. Both used some Google Drive URLs which convey no human-readable information, but which can be resolved to the actual on-drive file names.
We define brief and detailed views of the input CSV for each reviewer’s Google Sheet. Here’s the brief data for each.
The detailed views include extracted URLs and Names generated during data preparation. But the data ingest process left some entries we don’t want.
For example, there are many empty strings in the extractedName relation, which we want to eliminate because they violate 6NF.
So, we perform some ELT to clean up our reviewer data.
Now we have a clean extractedName relation.
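The cleanup step amounts to dropping rows whose extracted name is empty. A minimal Python sketch follows; the row layout (sheet, row number, name), the sample values, and the decision to treat whitespace-only names as empty are assumptions for illustration, and the demo performs this step in Rel.

```python
# Hypothetical extractedName rows: (sheet, row number, extracted name).
extracted_name = [
    ("SOP", 1, "Intro to Knowledge Graphs"),
    ("SOP", 2, ""),
    ("HLC", 1, "   "),
    ("HLC", 2, "Rel Primer"),
]


def drop_empty_names(rows):
    """Keep only rows with a non-empty name, trimming surrounding whitespace.

    ASSUMPTION: whitespace-only names are treated as empty.
    """
    return [
        (sheet, row, name.strip())
        for sheet, row, name in rows
        if name.strip()
    ]


clean = drop_empty_names(extracted_name)
```

Removing the row entirely, rather than keeping a placeholder value, means a missing name is represented simply by the absence of a tuple.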
To reconcile these reviewer lists, we want to match names and find the intersections between the two lists and with the curated library. We will explore these next steps in other videos.