Today, DuckDB and Hugging Face co-authored an announcement of the new `hf://` prefix in DuckDB for accessing datasets in Hugging Face repositories. This opens up a new wave of opportunities to make data more accessible and lightweight for the AI and ML sectors.
You can check out the full announcement here: “Access 150k+ Datasets from Hugging Face with DuckDB”.
In the past, most data in a data warehouse came from within the organization, such as transactional systems, enterprise resource planning (ERP) applications, customer relationship management (CRM) applications, and similar sources.
The structure, volume, and rate of this data were fairly predictable and well known. However, with the rise of cloud technology, an increasing amount of data now comes from external sources that are less controllable, such as application logs, web applications, mobile devices, social media, and sensor data from the Internet of Things. This data often arrives in schema-less, semi-structured formats. Traditional data warehousing solutions struggle to handle this new type of data because they rely on deep ETL (extract, transform, load) pipelines and physical tuning, which assume predictable, slow-moving, easily categorized data from mostly internal sources.
In today’s tutorial, we’ll use DuckDB to load data directly from Hugging Face without downloading it to your computer, and use Wren AI as an interface connected to GPT-4o so that users can ask business questions about the datasets and get answers. With this setup, you can access Hugging Face datasets via the `hf://` path, define semantic meanings through semantic modeling, and ask any question about the datasets; GPT-4o will comprehend your inquiries and generate queries to retrieve the answers.
DuckDB is a fast in-process analytical database that has gained wide adoption in the data and AI community; for example, Hugging Face provides DuckDB integration for Hugging Face Datasets.
Today, DuckDB is one of the most popular databases on GitHub and has also gained great momentum in the DB-Engines Ranking.
With DuckDB, you can easily point at files in formats such as CSV, Excel, JSON, and Parquet stored in remote locations such as Amazon S3, Azure Blob Storage, and Google Cloud Storage, and analyze them without moving them.
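As a quick sketch of what that looks like (the bucket and file names here are hypothetical), you can query a remote file in place after loading DuckDB’s `httpfs` extension:

```sql
-- Enable remote file access over HTTP(S) and S3
INSTALL httpfs;
LOAD httpfs;

-- Query a remote Parquet file in place (hypothetical bucket and path)
SELECT COUNT(*) FROM 's3://my-bucket/events/2024/data.parquet';
```

DuckDB reads only the bytes it needs, so even large remote files can be scanned without a full download.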
The Hugging Face Datasets offers a wide range of datasets from different sources, including academic research, popular benchmark tasks, and real-world applications with more than 150,000 datasets for artificial intelligence. These datasets are curated, processed, and standardized to ensure consistency and ease of use to democratize the access, manipulation, and exploration of datasets used to train and evaluate AI models.
In this tutorial, we uploaded an interesting Billionaires CSV file from the CORGIS project, which collects interesting datasets on topics such as COVID-19, billionaires, and airlines.
Check out the dataset we use in this tutorial on Hugging Face: the Billionaires dataset.
Let’s get started! 🚀
First, check the Hugging Face Billionaires dataset here.
When working with data, you often need to read files in various formats (such as CSV, JSONL, and Parquet).
Now, it is possible to query them using `hf://` paths of the form below:
hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩
For this example, you can use the URL below to get the dataset:
hf://datasets/chilijung/Billionaires/billionaires.csv
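You can try this path directly in DuckDB before wiring anything else up; for example, a quick preview of the file (assuming you have DuckDB installed locally):

```sql
-- Read the Billionaires CSV straight from Hugging Face and preview a few rows
SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv' LIMIT 5;
```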
Wren AI is an open-source text-to-SQL solution for data teams to get results and insights faster by asking business questions without writing SQL.
Next, let’s install Wren AI. Before we start, you need to install Docker.
1. Install Docker Desktop on your local computer.
Please ensure your Docker Desktop version is at least 4.17.
2. Prepare an OpenAI API key
Please ensure that your OpenAI API key has Full Permission (All).
Visit the OpenAI developer platform.
Generate a new API key for Wren AI with full permission.
3. Install Wren AI Launcher
If you are on Mac (if you’re using Windows or Linux, check here), enter the command below to install the latest Wren AI Launcher.
curl -L https://github.com/Canner/WrenAI/releases/latest/download/wren-launcher-darwin.tar.gz | tar -xz && ./wren-launcher-darwin
The launcher will then ask for your OpenAI API key as below; paste your key into the terminal and hit enter.
Now you can select `gpt-4o`, `gpt-4-turbo`, or `gpt-3.5-turbo` as OpenAI’s generation model in Wren AI.
Now you’ll see `docker-compose` running on your computer; after the installation, the tool will automatically open your browser to access Wren AI.
Once the terminal reports a successful installation, it will launch the browser.
In the UI, select `DuckDB`, and it will ask you for the connection details.
Here, you can enter a display name for the dataset, such as `Billionaire`, and in the `Initial SQL statements` field, enter the script below. The URL is the `hf://` path we showed previously:
CREATE TABLE billionaires AS
SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv';
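After the connection is created, a quick sanity check in the same SQL interface confirms that the table loaded; this is just a sketch:

```sql
-- Confirm the table exists and see how many rows were loaded
SELECT COUNT(*) AS row_count FROM billionaires;
```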
Hit `Next`. In the next step, select the tables you want to use in Wren AI.
Select the table and click Next!
In this example, we only have one table, so you can skip this step or just click `Finish`; but if you have multiple tables, you can define semantic relationships here to help LLMs understand them and generate more accurate SQL joins.
Now you’re all set!
In this example, we only have one model (table), so when you click the `Modeling` page at the top, you will see the screen below:
Now click the `billionaires` model, and a drawer will expand from the right.
The CORGIS dataset describes each column comprehensively, and we can add these descriptions to the semantic model.
Adding the semantic context to the model
If you have multiple models, you can model via the interface below.
With the Wren AI UI, you can model your data within a semantic context. This includes adding descriptions, defining relationships, incorporating calculations, and more. By providing this context, you help LLMs understand your business terminology and KPI definitions, reducing errors when combining multiple tables. LLMs can comprehend the data structure hierarchy by learning through relationships, such as whether tables are related `many-to-one`, `one-to-many`, or `many-to-many`.
You can define your business KPIs and formulas via calculations in Wren AI.
Adding semantic relationships between tables.
Now let’s switch to the `Home` page, where you can initiate a new thread and start asking Wren AI questions.
It will then generate the three best options based on your question, as below.
Select one of the options, and it will generate the result below, with a step-by-step breakdown and an explanation.
Ask follow-up questions based on the results
That is about it!
Now, with the `hf://` path, you can connect to more than 150,000 datasets on Hugging Face directly through Wren AI without fussing around with files! Pretty awesome, right?
If you love our work, please support and star us on GitHub!
🚀 GitHub: https://github.com/canner/wrenai
Don’t forget to give Wren AI a ⭐ star on GitHub if you’ve enjoyed this article, and as always, thank you for reading.
Supercharge Your Data with AI Today!