In early April this year, the Pinterest Engineering Team shared how they internally implemented a Text-to-SQL solution to enable data users to retrieve data without writing SQL; they also shared an astonishing ROI on the solution provided to their users, quoted below.
In our real world data (which importantly does not control for differences in tasks), we find a 35% improvement in task completion speed for writing SQL queries using AI assistance.
The team at Wren AI has been working on the Text-to-SQL challenge for several months, and the post gave us many new insights. We’d like to share what we learned from it, together with our own learnings from developing Wren AI, to help the community accelerate the work of bringing LLM comprehension capabilities to data.
Before we start, if you haven’t checked out Wren AI, check us out on GitHub! If you love what we are doing, don’t forget to give us a star ⭐️ !
First, let’s look at the architectural similarities between Pinterest’s Text-to-SQL solution and the Wren AI architecture.
Here’s the architecture Pinterest shared in their post.
Let’s break down the architecture phase by phase; when working on a Text-to-SQL implementation, there are three critical phases: indexing, retrieval, and augmentation & generation.
Let’s map these three phases onto the architecture below and delve deeper into each.
In the architecture, they applied an offline job to generate indexes for the vector store based on tables’ summaries and historical queries against them. Currently, they rely on two types of document embeddings in the vector indexing process: Table summarization and Query summarization.
The indexing phase is closely related to how Text-to-SQL searches for the most relevant tables in your database; the more context you can provide to the vector store, the better the final result will be.
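To make the indexing phase more concrete, here is a minimal sketch of embedding table summaries and query summaries into a vector store, using the Haystack and Qdrant stack that Wren AI builds on. The summary texts, embedding model, and store settings are illustrative assumptions, not Pinterest’s actual offline job.

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Assumed summaries; in practice these would be produced by an LLM in an offline job.
documents = [
    Document(content="Table orders: one row per order with purchase timestamp, size, and balance.",
             meta={"type": "table_summary", "table": "orders"}),
    Document(content="Query: daily revenue by customer city for the last 30 days.",
             meta={"type": "query_summary", "table": "orders"}),
]

document_store = QdrantDocumentStore(":memory:", embedding_dim=384, recreate_index=True)

# Embed each summary and write it to the vector index used later during retrieval.
embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
embedder.warm_up()
document_store.write_documents(embedder.run(documents=documents)["documents"])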
During the retrieval phase, if the user does not specify tables, their question is converted into embeddings. A few methods are applied to conduct the retrieval phase, including NLP Table Search and Table Re-selection. The table re-selection results will be returned to the user for validation before being generated into SQL.
We usually use post-retrieval methods to filter out low-relevance contexts based on additional information, or to apply different weights to the results.
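As a rough illustration of such a post-retrieval step, the sketch below drops low-scoring hits and re-weights table summaries slightly above query summaries; the threshold and weights are arbitrary assumptions for demonstration only.

from haystack import Document

def post_retrieval_filter(documents: list[Document], min_score: float = 0.3) -> list[Document]:
    # Drop low-relevance hits, then boost table summaries over query summaries when re-ranking.
    weights = {"table_summary": 1.0, "query_summary": 0.8}
    kept = [doc for doc in documents if (doc.score or 0.0) >= min_score]
    return sorted(
        kept,
        key=lambda doc: (doc.score or 0.0) * weights.get(doc.meta.get("type"), 1.0),
        reverse=True,
    )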
In the post, they didn’t share many details about the augmentation and generation phase.
Here’s what the Wren AI Text-to-SQL architecture looks like.
Wren AI consists of three core services:
Let’s see how Wren AI designs and implements those three phases in our text-to-SQL architecture so you can compare it with the Pinterest text-to-SQL architecture we just mentioned.
In Wren AI, we have implemented a semantic engine architecture to provide the LLM with the context of your business; you can easily establish a logical presentation layer on top of your data schema that helps the LLM learn more about your business context.
What is included in our indexing is the semantic model you define: metadata, schema, relationships, and calculations. For business terms like “revenue” and “profit”, you want the LLM to understand how your company defines them; this is where “calculation” comes in. You can easily define calculations in the semantic modeling layer, and the LLM can generate the right SQL based on your definitions without hallucination.

In contrast, Pinterest currently relies only on table metadata and query logs, but they have implemented table summarization and query summarization methods to improve context.
Lesson Learned: Our team plans to implement “Table summarization” and “Query summarization” in the Wren AI project.
In Wren AI, during the retrieval phase, the user question is transformed into embeddings, and a similarity search is conducted against the vector index to infer the top K suitable tables, as below.
from typing import Any

from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever


def init_retriever(document_store: Any, top_k: int = 10):
    # Return a retriever that fetches the top_k most similar documents from the Qdrant vector store.
    return QdrantEmbeddingRetriever(document_store=document_store, top_k=top_k)
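For completeness, here is a hedged sketch of how this retriever might be wired into a Haystack query pipeline with a text embedder. The embedding model, the in-memory document store, and the sample question are illustrative assumptions, not Wren AI’s exact configuration.

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Illustrative store; in practice this points at the index built during the indexing phase.
document_store = QdrantDocumentStore(":memory:", embedding_dim=384, recreate_index=True)

pipeline = Pipeline()
pipeline.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipeline.add_component("retriever", init_retriever(document_store))
pipeline.connect("embedder.embedding", "retriever.query_embedding")

# The question is embedded and the most relevant table documents are returned.
result = pipeline.run({"embedder": {"text": "What is the monthly revenue by customer city?"}})
candidate_tables = result["retriever"]["documents"]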
In Pinterest’s case, the Text-to-SQL implementation uses the table search method above, plus Table Re-selection.
Table Re-selection:
Upon retrieving the top N tables from the vector index, we engage an LLM to choose the most relevant K tables by evaluating the question alongside the table summaries.
Lesson Learned: Re-selecting tables with an LLM after the initial selection from the vector store could help improve performance.
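As a rough sketch of what such a re-selection step might look like (not Pinterest’s actual prompt), the example below asks an LLM to keep only the tables needed to answer the question, given the retrieved candidates and their summaries. The prompt wording and model name are assumptions.

from haystack import Document
from haystack.components.generators import OpenAIGenerator

def reselect_tables(question: str, candidates: list[Document], top_k: int = 3) -> str:
    # Show the LLM each candidate table's summary and ask it to keep only the relevant ones.
    summaries = "\n".join(f"- {doc.meta['table']}: {doc.content}" for doc in candidates)
    prompt = (
        f"Question: {question}\n\n"
        f"Candidate tables:\n{summaries}\n\n"
        f"Return the names of at most {top_k} tables needed to answer the question, "
        "as a comma-separated list."
    )
    generator = OpenAIGenerator(model="gpt-4o-mini")  # assumes OPENAI_API_KEY is set
    return generator.run(prompt=prompt)["replies"][0]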
In Wren AI, we have unique designs and techniques to help LLMs understand more of the business context to augment information in prompts. We also provide SQL validation and auto-correction while in the generation phase.
Augmentation
One of Wren AI's key designs is the semantic engine, the Wren Engine. This engine is responsible for the intricate processing that defines and maps metadata, schema, terminology, data relationships, and the logic behind calculations and aggregations through an analytics-as-code design approach.
With Wren Engine, you can define semantics through the “Modeling Definition Language” (MDL). This definition encapsulates calculation formulas, metrics definitions, and semantic relationships, reducing duplicate coding and handling of data joins.
Here’s an example; below is a definition of a “customers” data model.
{
"name": "customers",
"refSql": "select * from main.customers",
"columns": [
{
"name": "City",
"type": "VARCHAR",
"isCalculated": false,
"notNull": false,
"properties": {
"description": "The Customer City, where the customer company is located. Also called 'customer segment'."
}
},
{
"name": "Id",
"type": "VARCHAR",
"isCalculated": false,
"notNull": false,
"properties": {
"description": "A unique identifier for each customer in the data model."
}
},
{
"name": "State",
"type": "VARCHAR",
"isCalculated": false,
"notNull": false,
"properties": {
"description": "A field indicating the state where the customer is located."
}
},
{
"name": "orders",
"type": "orders",
"relationship": "CustomersOrders",
"isCalculated": false,
"notNull": false,
"properties": {}
},
{
"name": "LatestRecord",
"type": "DATE",
"isCalculated": true,
"expression": "max(orders.PurchaseTimestamp)",
"notNull": false,
"properties": {}
},
{
"name": "FirstRecord",
"type": "DATE",
"isCalculated": true,
"expression": "min(orders.PurchaseTimestamp)",
"notNull": false,
"properties": {}
},
{
"name": "VIP",
"type": "BOOLEAN",
"isCalculated": true,
"expression": "sum(orders.Size) > 2",
"notNull": false,
"properties": {}
},
{
"name": "OrderCount",
"type": "BIGINT",
"isCalculated": true,
"expression": "count(orders.OrderId)",
"notNull": false,
"properties": {}
},
{
"name": "Debit",
"type": "DOUBLE",
"isCalculated": true,
"expression": "sum(orders.OrderBalance)",
"notNull": false,
"properties": {}
},
{
"name": "ReviewRate",
"type": "DOUBLE",
"isCalculated": true,
"expression": "count(orders.IsReviewed = TRUE) / count(DISTINCT orders.OrderId)",
"notNull": false,
"properties": {}
}
],
"primaryKey": "Id",
"cached": false,
"refreshTime": "30.00m",
"properties": {
"schema": "main",
"catalog": "memory",
"description": "A table of customers who have made purchases, including their city"
}
}
While generating the prompt, we will augment the Data Definition Language (DDL) with the semantics defined in the file, as below.
/* {"schema": "main", "catalog": "memory", "description": "A table of customers who have made purchases, including their city"} */
CREATE TABLE customers (
-- {"description": "The Customer City, where the customer company is located. Also called \'customer segment\'."}
City VARCHAR,
-- {"description": "A unique identifier for each customer in the data model."}
Id VARCHAR PRIMARY KEY,
-- {"description": "A field indicating the state where the customer is located."}
State VARCHAR,
-- This column is a Calculated Field
-- column expression: max(orders.PurchaseTimestamp)
LatestRecord DATE,
-- This column is a Calculated Field
-- column expression: min(orders.PurchaseTimestamp)
FirstRecord DATE,
-- This column is a Calculated Field
-- column expression: sum(orders.Size) > 2
VIP BOOLEAN,
-- This column is a Calculated Field
-- column expression: count(orders.OrderId)
OrderCount BIGINT,
-- This column is a Calculated Field
-- column expression: sum(orders.OrderBalance)
Debit DOUBLE,
-- This column is a Calculated Field
-- column expression: count(orders.IsReviewed = TRUE) / count(DISTINCT orders.OrderId)
ReviewRate DOUBLE
)
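To illustrate how the MDL could be rendered into annotated DDL like the snippet above, here is a simplified sketch of that augmentation step; it only covers the fields shown in the “customers” example and is not Wren AI’s actual implementation.

import json

def mdl_to_ddl(model: dict) -> str:
    # Render one MDL model as a CREATE TABLE statement with its semantics embedded as comments.
    header = [f"/* {json.dumps(model.get('properties', {}))} */", f"CREATE TABLE {model['name']} ("]
    columns = []
    for column in model["columns"]:
        if "relationship" in column:
            continue  # relationship columns become joins, not physical columns
        comments = []
        if column.get("isCalculated"):
            comments.append("  -- This column is a Calculated Field")
            comments.append(f"  -- column expression: {column['expression']}")
        description = column.get("properties", {}).get("description")
        if description:
            comments.append(f"  -- {json.dumps({'description': description})}")
        suffix = " PRIMARY KEY" if column["name"] == model.get("primaryKey") else ""
        columns.append("\n".join(comments + [f"  {column['name']} {column['type']}{suffix}"]))
    return "\n".join(header) + "\n" + ",\n".join(columns) + "\n)"

# Usage: print(mdl_to_ddl(json.loads(mdl_text))) with the "customers" model shown above.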
Generation
Here’s how Wren AI deals with the generation phase, which we separate into two distinct scenarios:
In Pinterest’s case, the final stage of SQL generation is relatively simple; they didn’t share many details about this part in their blog post.
In the post, they mentioned a few areas they are researching and working on in their roadmap; some of these topics are already implemented in Wren AI, and some are works in progress or in planning. Let’s check them below:
For metadata management, we’ve implemented the “Modeling Definition Language” (MDL) in Wren AI, which lets the LLM take semantic relationships, domain context, tags, descriptions, and more into account; this significantly improves filtering accuracy during the retrieval phase.
In Wren AI, the index is updated automatically while you are modeling through our user interface; click the “Deploy” button, and the system will automatically synchronize the vector store with the semantic engine.
The current Wren AI scoring strategy is also pretty basic; in the future, we plan to fine-tune it to improve the relevance of retrieved results.
Query validation is implemented in Wren AI: we dry-run the query in Wren Engine to ensure it is executable, validating the SQL and asking the LLM to correct it if it is invalid.
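As a rough sketch of that validate-and-correct loop: the dry_run callable below is hypothetical and stands in for submitting the query to Wren Engine, and the prompts and model name are assumptions for illustration.

from haystack.components.generators import OpenAIGenerator

def generate_valid_sql(question: str, ddl: str, dry_run, max_attempts: int = 3) -> str:
    # dry_run(sql) is a hypothetical callable that raises an exception if the SQL is not executable.
    generator = OpenAIGenerator(model="gpt-4o-mini")
    prompt = f"{ddl}\n\nWrite a SQL query that answers: {question}\nReturn only the SQL."
    sql = generator.run(prompt=prompt)["replies"][0]
    for _ in range(max_attempts):
        try:
            dry_run(sql)
            return sql  # the query is executable as-is
        except Exception as error:
            # Feed the error back to the LLM and ask for a corrected query.
            fix_prompt = (
                f"{ddl}\n\nThe following SQL failed with: {error}\n{sql}\n"
                "Return a corrected SQL query only."
            )
            sql = generator.run(prompt=fix_prompt)["replies"][0]
    return sql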
This is one of the most important development areas we are designing in Wren AI; we will soon release a new version with user feedback! We’ll share more details once we release it, so make sure you subscribe to our Medium to stay posted!
This is also one of the most important development areas we are designing in Wren AI. We will soon release an end-to-end evaluation process in Wren AI! We’ll share more details once it’s released; make sure you subscribe to our Medium to stay posted!
This is pretty much it! Finally, we thank the Pinterest Engineering team for sharing such an amazing and informative post. We hope more engineering teams can share the knowledge they have implemented in their company! Onwards!
If you haven’t checked out Wren AI yet, check us out!
👉 GitHub: https://github.com/Canner/WrenAI
👉 X: https://twitter.com/getwrenai
Supercharge Your Data with AI Today?!