How to directly access 150k+ Hugging Face Datasets with DuckDB and query them using GPT-4o

Howard Chi
Co-founder of Wren AI
May 29, 2024
Updated September 8, 2024
5 min read

Today, DuckDB and Hugging Face co-authored an announcement introducing the new hf:// prefix in DuckDB for accessing datasets hosted in Hugging Face repositories. This opens a new wave of opportunities to make data more accessible and lightweight for the AI and ML sectors.

You can check out the full announcement here: “Access 150k+ Datasets from Hugging Face with DuckDB”.

DuckDB announcement of the collaboration with Hugging Face.

In the past, most data in a data warehouse came from within the organization, such as transactional systems, enterprise resource planning (ERP) applications, customer relationship management (CRM) applications, and similar sources.

The structure, volume, and rate of this data were fairly predictable and well-known. However, with the rise of cloud technology, an increasing amount of data now comes from external sources that are less controllable, such as application logs, web applications, mobile devices, social media, and sensor data from the Internet of Things. This data often arrives in schema-less, semi-structured formats. Traditional data warehousing solutions are struggling to handle this new type of data because they rely on deep ETL (extract, transform, load) pipelines and physical tuning, which assume predictable, slow-moving, easily categorized data from mostly internal sources.

Access 150k+ Datasets from Hugging Face with DuckDB and query the data with GPT-4o

In today’s tutorial, we’ll use DuckDB to load data directly from Hugging Face without downloading it to your computer, and use Wren AI as an interface connected to GPT-4o so that users can ask business questions of the datasets and get answers. With this setup, you can access Hugging Face datasets via hf:// paths, define semantic meaning through semantic modeling, and ask any question about the data; GPT-4o will interpret your inquiries and generate the queries needed to retrieve the answers.

What is DuckDB

DuckDB is a fast in-process analytical database that has gained wide adoption in the data and AI community; Hugging Face, for example, provides DuckDB integration for its Hugging Face Datasets.

Today, DuckDB is one of the most popular databases on GitHub and has strong momentum in the DB-Engines Ranking.

DuckDB surpassed PostgreSQL in GitHub stars in January 2024.
DB-Engines ranking of DuckDB

With DuckDB, you can point directly at remote files in formats such as CSV, Excel, JSON, and Parquet stored in locations such as Amazon S3, Azure Blob Storage, or Google Cloud Storage, and analyze them without moving the files.
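For example, here is a minimal sketch of querying Parquet files sitting in S3 from DuckDB; the bucket and path are placeholders, and this assumes the bucket is public or your S3 credentials are already configured:

INSTALL httpfs;  -- one-time: extension for HTTP/S3 access
LOAD httpfs;

-- Placeholder bucket and path: point this at your own data
SELECT *
FROM read_parquet('s3://my-bucket/events/*.parquet')
LIMIT 10;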

Hugging Face Datasets

Hugging Face Datasets offers more than 150,000 datasets for artificial intelligence, drawn from sources including academic research, popular benchmark tasks, and real-world applications. These datasets are curated, processed, and standardized to ensure consistency and ease of use, democratizing the access, manipulation, and exploration of the data used to train and evaluate AI models.

Hugging Face Datasets

In this tutorial, we uploaded an interesting Billionaires CSV file from the CORGIS project, which collects datasets on topics such as COVID-19, billionaires, and airlines.

Check out the dataset we use in this tutorial on Hugging Face: the Billionaires dataset.

Let’s get started! 🚀

Using GPT-4o to query Hugging Face Datasets with DuckDB

Get the dataset URL from Hugging Face

First, check the Hugging Face Billionaires dataset here.

Hugging Face Dataset Page

Read using hf:// paths

When working with data, you often need to read files in various formats (such as CSV, JSONL, and Parquet).

Now, it is possible to query them using hf:// paths, as shown below:

hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩

For this example, you can use the URL below to get the dataset:

hf://datasets/chilijung/Billionaires/billionaires.csv
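If you have DuckDB installed locally, you can sanity-check the path right away; a minimal preview query (no download needed):

-- Preview the first rows straight from Hugging Face
SELECT *
FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv'
LIMIT 5;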

Installing Wren AI

Wren AI is an open-source text-to-SQL solution for data teams to get results and insights faster by asking business questions without writing SQL.

Next, let’s install Wren AI. Before we start, you need to install Docker.

1. Install Docker Desktop on your local computer.

Please ensure your Docker Desktop version is 4.17 or later.

2. Prepare an OpenAI API key

Please ensure that your OpenAI API key has full permission (All).

Visit the OpenAI developer platform.

Enter the OpenAI API key page

Generate a new API key for Wren AI with full permission.

Generate your OpenAI API key with full permission

3. Install Wren AI Launcher

If you are on a Mac (for Windows or Linux, check here), enter the command below to install the latest Wren AI Launcher.

curl -L https://github.com/Canner/WrenAI/releases/latest/download/wren-launcher-darwin.tar.gz | tar -xz && ./wren-launcher-darwin

The launcher will then ask for your OpenAI API key, as shown below; paste your key into the terminal and hit Enter.

Select gpt-4o

Now you can select gpt-4o, gpt-4-turbo, or gpt-3.5-turbo as the OpenAI generation model in Wren AI.

Install Wren AI with CLI

Now you’ll see Docker Compose running on your computer; after the installation, the tool will automatically open your browser to access Wren AI.

Running Docker Compose on your computer

Setup DuckDB connection

Once the installation completes in the terminal, it will launch the browser.

First-time launching Wren AI

In the UI, select DuckDB and it will ask you for the connection details.

Connect to DuckDB with hf:// path

Here, enter a display name for the dataset, such as Billionaires, and in the Initial SQL statements field enter the script below.

The URL is the hf:// path we showed previously:

CREATE TABLE billionaires AS 
	SELECT * FROM 'hf://datasets/chilijung/Billionaires/billionaires.csv';
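If you’d like to verify the statement before moving on, the same SQL runs in a local DuckDB shell, and a quick count confirms the table was created (assuming the CREATE TABLE above has been run):

-- Sanity check: how many rows landed in the table?
SELECT COUNT(*) AS row_count FROM billionaires;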

And hit Next. In the next step, select the tables you want to use in Wren AI.

Select the `Billionaires` table

Select the table and click Next!

Define relationship

In this example, we only have one table, so you can skip this step or just click Finish; but if you have multiple tables, you can define semantic relationships here to help LLMs understand them and generate more accurate SQL joins.

Now you’re all set!

Home: The ask question page

Semantic Modeling on Wren AI UI

In this example, we only have one model (table), so when you click the Modeling page at the top, you will see the screen below:

The modeling page

Now click the billionaires model, and a drawer will expand from the right.

The CORGIS project comprehensively documents each column of the dataset, and we can add those descriptions to the semantic model.

Descriptions of each column

Adding the semantic context to the model

Adding the descriptions to the modeling page

Data Modeling with Complex Schema

If you have multiple models, you can model via the interface below.

Semantic modeling through Wren AI UI

With the Wren AI UI, you can model your data within a semantic context. This includes adding descriptions, defining relationships, incorporating calculations, and more. By providing this context, you help LLMs understand your business terminology and KPI definitions, reducing errors when combining multiple tables. LLMs can also comprehend the data hierarchy through relationships, such as whether two tables are related many-to-one, one-to-many, or many-to-many.

You can define your business KPIs and formulas via calculations in Wren AI.
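For example, a KPI like average net worth by citizenship could be expressed as an aggregation along these lines; note that the column names here ("location.citizenship", "wealth.worth in billions") are assumptions based on the CORGIS billionaires CSV headers, so adjust them to your actual schema:

-- Hypothetical KPI: average net worth (in billions) by citizenship
SELECT "location.citizenship" AS citizenship,
       AVG("wealth.worth in billions") AS avg_worth_billions
FROM billionaires
GROUP BY citizenship
ORDER BY avg_worth_billions DESC;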

Add calculations in the data model

Adding semantic relationships between tables.

Add relationships in the data model.

Start asking questions

Now let’s switch to the Home page, where you can initiate a new thread and start asking Wren AI questions.

It will then generate the three best candidate queries based on your question, as shown below.

You can start asking any questions to Wren AI

Select one of the options, and it will generate the result below, with a step-by-step breakdown and an explanation.
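For instance, asking something like “Who were the top 10 billionaires by net worth in 2014?” might produce SQL along these lines (illustrative only; the actual generated query and the column names may differ):

-- A plausible query for the question above, not Wren AI's exact output
SELECT name,
       "wealth.worth in billions" AS worth_billions
FROM billionaires
WHERE year = 2014
ORDER BY worth_billions DESC
LIMIT 10;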

Ask follow-up questions based on the results

Follow-up questions

That is about it!

Now, with hf:// paths, you can connect to 150,000+ datasets on Hugging Face directly through Wren AI without fussing around with files! Pretty awesome, right?

If you’ve enjoyed this article, please support us and give ⭐ Wren AI a star ⭐ on GitHub, and as always, thank you for reading.

🚀 GitHub: https://github.com/canner/wrenai
