Retrieval-augmented generation (RAG) has opened a new opportunity for LLMs: leveraging their ability to comprehend users’ intentions to search internal database knowledge and generate SQL.
LLM frameworks such as LangChain and LlamaIndex provide tutorials to help developers implement text-to-SQL on their data sources. However, when these frameworks are deployed in production, users quickly find it challenging to implement the necessary features.
In this post, I’ll give a brief walkthrough of implementing text-to-SQL with LangChain as an example, break down the challenges you will encounter, and offer solutions for them.
This is based on the tutorial from LangChain’s official documentation; check the full tutorial here. Here’s a high-level concept of building a text-to-SQL solution in LangChain, and how it works:
First, when a user asks a business question, the LLM comprehends the question and generates SQL based on the DDL that accompanies the question in the prompt. If you want to enhance semantic understanding, you usually also attach semantics to the prompt.
Start by installing the LangChain-related libraries:
%pip install --upgrade --quiet langchain langchain-community langchain-openai
Next, import SQLDatabase from langchain_community. SQLDatabase is an SQLAlchemy wrapper around a database, which provides a SQL toolkit on top of it.
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
Use create_sql_query_chain to generate SQL in the dialect of the connected database.
chain = create_sql_query_chain(llm, db)
response = chain.invoke({"question": "How many employees are there"})
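The chain returns a SQL string, which you still have to execute; in LangChain that step is db.run(response). Here’s a sketch of that step against a tiny in-memory SQLite stand-in for Chinook (the exact SQL text depends on the model; this is a typical result):

```python
import sqlite3

# A typical SQL string the chain returns for "How many employees are there";
# the exact text depends on the model.
generated_sql = "SELECT COUNT(*) AS EmployeeCount FROM Employee"

# In LangChain you would call db.run(generated_sql); here we execute the
# same step against an in-memory stand-in for Chinook's Employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany(
    "INSERT INTO Employee (LastName) VALUES (?)",
    [("Adams",), ("Edwards",), ("Peacock",)],
)
count = conn.execute(generated_sql).fetchone()[0]
print(count)  # 3
```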
Here’s an example of what the prompt looks like:
from langchain_core.prompts import PromptTemplate
template = '''Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Use the following format:
Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"
Only use the following tables:
{table_info}.
Question: {input}'''
prompt = PromptTemplate.from_template(template)
Here, we see we are asking the LLM to generate SQL from a string template: the {dialect} placeholder inserts the current database dialect into the prompt, and we rely on the LLM to produce SQL in that dialect.
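To make the mechanism concrete, here is a sketch of what the template substitution produces before the prompt reaches the LLM; the dialect and DDL values below are illustrative:

```python
# A shortened copy of the tutorial's template, to show the substitution.
template = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.

Only use the following tables:
{table_info}

Question: {input}"""

# PromptTemplate performs this substitution before calling the LLM.
final_prompt = template.format(
    dialect="SQLite",
    table_info="CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)",
    input="How many employees are there",
)
print(final_prompt)
```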
For {table_info}, you need to insert the DDL of every table you want the LLM to understand. But what if you have many tables? We can’t dump the full information about our database into every prompt. LangChain’s official tutorial suggests the following.
Simplify our model’s job by grouping the tables together.
from langchain.chains.openai_tools import create_extraction_chain_pydantic
from langchain_core.pydantic_v1 import BaseModel, Field

class Table(BaseModel):
    """Table in SQL database."""
    name: str = Field(description="Name of table in SQL database.")

system = """Return the names of the SQL tables that are relevant to the user question. \
The tables are:

Music
Business"""
category_chain = create_extraction_chain_pydantic(Table, llm, system_message=system)
category_chain.invoke({"input": "What are all the genres of Alanis Morissette songs"})
You can group tables by categories:
from typing import List

def get_tables(categories: List[Table]) -> List[str]:
    tables = []
    for category in categories:
        if category.name == "Music":
            tables.extend(
                [
                    "Album",
                    "Artist",
                    "Genre",
                    "MediaType",
                    "Playlist",
                    "PlaylistTrack",
                    "Track",
                ]
            )
        elif category.name == "Business":
            tables.extend(["Customer", "Employee", "Invoice", "InvoiceLine"])
    return tables
table_chain = category_chain | get_tables # noqa
table_chain.invoke({"input": "What are all the genres of Alanis Morissette songs"})
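The grouping logic can also be condensed into a plain lookup table. The Table dataclass below is a hypothetical stand-in for the pydantic model used by the extraction chain, so this sketch runs without an LLM:

```python
from dataclasses import dataclass

@dataclass
class Table:  # stand-in for the pydantic Table model in the extraction chain
    name: str

# The same category-to-tables mapping as get_tables, as a dictionary.
CATEGORY_TABLES = {
    "Music": ["Album", "Artist", "Genre", "MediaType", "Playlist", "PlaylistTrack", "Track"],
    "Business": ["Customer", "Employee", "Invoice", "InvoiceLine"],
}

def get_tables(categories):
    tables = []
    for category in categories:
        tables.extend(CATEGORY_TABLES.get(category.name, []))
    return tables

# A genre question would be classified as "Music" by the LLM.
print(get_tables([Table(name="Music")]))
```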
Another common way is to store table schema in a vector database and perform a semantic search to retrieve relevant DDLs from certain business inquiries. However, many challenges remain when moving to production.
Building a text-to-SQL tool with LangChain appears simple, but several common challenges arise when integrating it into production use cases, as listed below.
The problem with basing text-to-SQL solely on the data schema is that when users ask business questions through a chat interface, they usually speak in business language, not in data-structure definitions such as table names and column names. When asking through AI, users use the business terminology, definitions, and relationships defined within their company, so you need to consider not only data structure but also semantics.
The LangChain tutorial demonstrates putting all your table schemas into the prompt. When you connect to a production database, however, it can easily contain thousands or tens of thousands of tables.
You can’t fit all the tables into a prompt, so you need to embed the tables and metadata from your metadata store into a vector database. When a user asks a question, you perform a semantic search in the vector database to retrieve the most relevant vectors.
The process described above mostly operates offline as vector-index creation. Pinterest shared how they deal with this problem in a recent post about how they built text-to-SQL internally.
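To illustrate the retrieval flow, here is a toy sketch. A real system would use an embedding model and a vector database; bag-of-words cosine similarity stands in for both here, just to show how a question selects the most relevant DDLs. The DDL strings are simplified Chinook stand-ins:

```python
import math
import re
from collections import Counter

# Simplified stand-ins for the table DDLs stored in the "vector database".
DDLS = {
    "Track": "CREATE TABLE Track (TrackId INTEGER, Name TEXT, GenreId INTEGER)",
    "Genre": "CREATE TABLE Genre (GenreId INTEGER, Name TEXT)",
    "Invoice": "CREATE TABLE Invoice (InvoiceId INTEGER, CustomerId INTEGER, Total REAL)",
}

def vectorize(text: str) -> Counter:
    # A real system would call an embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 2) -> list:
    q = vectorize(question)
    ranked = sorted(DDLS, key=lambda t: cosine(q, vectorize(DDLS[t])), reverse=True)
    return ranked[:k]

# Only the k most relevant DDLs would go into the prompt.
print(retrieve("What is the genre of this track?"))
```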
Different databases speak different dialects. LangChain’s text-to-SQL tutorial relies on a popular Python library called SQLAlchemy; under the hood, SQLAlchemy provides a standard toolkit and ORM for talking to different databases, but the LLM still needs to generate the specific dialect for each database.
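For example, a question like “the five most recent invoices” must be rendered differently per database. The table and column names below are from Chinook; the dialect labels are illustrative:

```python
# The same intent, written in four SQL dialects.
queries = {
    "sqlite": "SELECT * FROM Invoice ORDER BY InvoiceDate DESC LIMIT 5",
    "postgresql": "SELECT * FROM Invoice ORDER BY InvoiceDate DESC LIMIT 5",
    "mssql": "SELECT TOP 5 * FROM Invoice ORDER BY InvoiceDate DESC",
    "oracle": "SELECT * FROM Invoice ORDER BY InvoiceDate DESC FETCH FIRST 5 ROWS ONLY",
}
for dialect, sql in queries.items():
    print(f"{dialect}: {sql}")
```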
At first glance, this pattern sounds reasonable and could easily extend LLM capability to many databases through SQLAlchemy.
When moving to production, using the same SQL syntax with predefined aggregations and calculations is important for better and consistent retrieval performance across data sources.
Above, I listed some obvious challenges in building a production-ready text-to-SQL solution. This is why our team built Wren AI, an open-source AI data assistant for your databases. You can set up an internal AI agent for your text-to-SQL tasks within a few minutes.
Using Wren AI, we automate all the metadata and semantics and help LLMs learn how semantics work in your business without you writing any code. Through its user-friendly interface, you can model your data schema and add business semantics to the modeling layer; we automatically complete all the offline vector-index creation for you.
Mapping a table schema to semantics and ensuring you get the right information through prompts and semantics requires a lot of fine-tuning. With Wren AI, we handle all the optimization and ensure the most relevant results are retrieved when users ask business questions.
Underlying Wren AI, we developed an open-source semantic engine called Wren Engine. The engine transpiles standard ANSI SQL into different SQL dialects and provides semantic encapsulation, letting you define aggregations and calculations in the semantic modeling layer.
The feedback loop is one of the most important designs for AI agents. We want our agent to learn from our history and also teach the agent to perform better in future tasks; this is where the feedback loop comes in.
We built a feedback loop into the Wren AI user interface: when you ask a question and get an answer from Wren AI, you can provide adjustments. The agent learns from your input, regenerates the result, and stores what it learned in the semantic modeling definition, so it generates the right outcome the next time users ask.
Supercharge Your Data with AI Today!