Comparing Ask-a-Metric and Vanna.ai

by: Poornima Ramesh, Greg Mitchell

TL;DR: We compare Ask-a-Metric (AAM) and vanna.ai on metrics that we regularly test for in AAM use-cases. We find that Ask-a-Metric performs on par with vanna.ai for straightforward queries, but struggles with more complex queries. Vanna.ai also struggles with complex queries and lacks guardrails, but offers a wider range of features than AAM.1

Introduction

In this blog post, we provide a detailed comparison of two text-to-SQL tools: Ask-a-Metric (AAM) and vanna.ai. This is a follow-up expanding on a similar analysis shared by Project Tech4Dev in a previous note. Code to reproduce this analysis, as well as the full set of results, is available in Ask-a-Metric’s open-source repository (under experiments/).

Ask-a-Metric

Ask-a-Metric is a WhatsApp-based LLM tool that answers a user’s questions by querying an SQL database to extract the relevant information and returning it in natural language, in the vernacular of the user’s choice. It is intended to facilitate data access for decision-making in the development sector, and is optimized for use in resource-constrained settings by non-technical users who nevertheless require timely access to data for critical decisions. For more details, please see our blog post here, and our open-source repository here.

Vanna.ai

Vanna.ai is an AI tool designed for technical teams with coding knowledge, focused on generating SQL queries for an array of different SQL databases. It leverages vector stores of documentation, table schemas, and example SQL queries for high accuracy. Users can deploy the interface through various platforms such as Jupyter Notebooks, a Slackbot, and web apps. The tool is open source, with free and premium plans. Vanna.ai reports about 80% accuracy, improving over time by learning from successful queries.
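For readers unfamiliar with vanna.ai, the snippet below sketches its documented usage pattern: "train" a vector store with DDL and documentation, then ask questions. The class and method names follow vanna.ai's public quickstart at the time of writing; the API key, model name, database path, and DDL are placeholders, not our actual configuration.

```python
# A minimal sketch of the vanna.ai usage pattern; names follow its public
# quickstart, and all values below are placeholders.
from vanna.openai import OpenAI_Chat
from vanna.chromadb import ChromaDB_VectorStore


class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)


vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-4o-mini"})
vn.connect_to_sqlite("tn_covid_demo.sqlite")  # hypothetical path to the demo DB

# "Training" adds retrievable context (DDL, documentation, example SQL) to a vector store.
vn.train(ddl="CREATE TABLE bed_vacancies_clinics_11_may (district_name TEXT, available_beds TEXT)")
vn.train(documentation="Each row reports COVID bed availability per district.")

# At question time, the most similar chunks are retrieved and used to generate SQL.
print(vn.generate_sql("Which district has the most vacant beds?"))
```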

Our approach to comparing Vanna.ai and Ask-a-Metric

We note that Vanna.ai and Ask-a-Metric appear to be intended for different end users: Vanna.ai is aimed at technical users and webapp deployments, while Ask-a-Metric is aimed at non-technical users accessing the tool exclusively over messaging services (for the time being). We have nevertheless attempted as fair a comparison as possible between both tools. We also note that our comparison metrics for the two tools may be somewhat biased: we are evaluating based on requirements that we often encounter when we deploy Ask-a-Metric for our partners. However, these might not be the specifications / metrics that Vanna.ai was optimized for.

Comparing features2

| Feature | Vanna.ai | Ask-a-Metric |
| --- | --- | --- |
| Methodology | RAG, and active training on input questions | Few-shot prompting, pseudo-agentic pipeline (i.e. is able to use “tools”), no feedback loop |
| Input requirements for LLM model | System prompts, example queries, DB schema | System prompts, table descriptions, column descriptions |
| LLM model support | OpenAI, Ollama, Mistral, Gemini, and also self-hosted LLM models | Any LLM model / architecture supported by litellm (see the sketch below) |
| SQL database support | Standard SQL databases (Postgres, SQLite, MySQL, etc.); can be extended to other databases | In principle, all databases available via SQLAlchemy; however, only tested with Postgres and SQLite databases |
| Maturity | Available for more than a year (first release in May 2023) | Released in July 2024 (after 4 months of development) |
| Data interactions | Text-to-SQL, visualizations, any language supported by the LLM model, contextual memory | Text-to-SQL and SQL-to-text, any language supported by the LLM model, basic guardrails, chatflow orchestration |
| Deployment | Slack, built-in webapp, Streamlit, etc. | WhatsApp |
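The litellm entry in the table above refers to the provider-agnostic completion call that lets a pipeline swap LLM backends by changing a model string. The sketch below illustrates the mechanism only; it is not AAM's actual code, and the model identifiers are examples.

```python
# Illustrates the provider-agnostic litellm call only; not AAM's code,
# and the model identifiers are examples.
from litellm import completion

messages = [
    {"role": "system", "content": "You translate questions into SQLite SQL."},
    {"role": "user", "content": "How many ICU beds are vacant in Chennai?"},
]

# The same call works for OpenAI, Mistral, Ollama-hosted models, etc. --
# only the model identifier changes.
for model in ["gpt-4o-mini", "mistral/mistral-small-latest", "ollama/llama3"]:
    response = completion(model=model, messages=messages)
    print(model, "->", response.choices[0].message.content)
```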

Comparison metrics for performance

We compare the two on the following metrics:

  1. Query accuracy: whether the responses are accurate and relevant to the user query.
  2. Language: whether appropriate SQL queries can be generated when the language is not English, and whether responses are returned in the same language and script. For many applications of AAM, we expect end-users to use a mix of English and other regional languages (Hindi, Filipino, Swahili, etc.) – we thus want to ensure that the bot responds reliably in the same language and script as the input question.
  3. Guardrails: whether there are guardrails for unsafe queries. In AAM deployments, we find users often ask queries that may not be safe (e.g. deleting data, or requesting data that is not in the database). It is thus important to guard against these kinds of queries, so that there are no adverse effects from misinformation or from running deleterious SQL queries on the DB (a minimal illustrative check is sketched after this list).
  4. Ease of updating context: whether it is easy to modify the bot’s responses when the context switches.
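To make the guardrails requirement concrete, here is a minimal, illustrative read-only check of the kind one might place in front of generated SQL. It is a sketch of a common pattern, not the implementation used by AAM or vanna.ai.

```python
# A minimal, illustrative guardrail that rejects anything other than a single
# read-only SELECT statement; a common pattern, not either tool's internals.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|attach|pragma)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Allow only a single read-only SELECT statement."""
    statements = [s for s in sql.split(";") if s.strip()]
    if len(statements) != 1:
        return False  # reject stacked statements (a classic injection vector)
    stmt = statements[0].strip()
    return stmt.lower().startswith("select") and not FORBIDDEN.search(stmt)


print(is_safe_select("SELECT district_name FROM bed_vacancies_clinics_11_may"))  # True
print(is_safe_select("DELETE FROM bed_vacancies_clinics_11_may"))                # False
print(is_safe_select("SELECT 1; DROP TABLE districts"))                          # False
```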

The previous note from Tech4Dev also highlights issues with JSON columns in the DB, which AAM struggles with. We investigate this issue separately in the Appendix.

Results

We performed all tests on the SQLite TN COVID-19 demo database in AAM’s open-source GitHub repository – we chose this database since it has a relatively simple structure and contains publicly available data, which allows for a fair and reproducible analysis. We note, however, that the database is too limited to encapsulate all potential real-world scenarios, and we would need to do additional testing to compare both tools more thoroughly.

Example validation questions

We compared AAM to Vanna.ai trained with the DB schema, and Vanna.ai trained without the DB schema. AAM sits between these two Vanna.ai modes: we do not give AAM the schema up-front, but its query-processing pipeline can fetch the subset of tables / columns relevant to the input question. Thus, for a given query, AAM may or may not end up with access to the full DB schema, depending on the complexity of the input question (a simplified sketch of this pattern follows).
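The sketch below illustrates the general "fetch only the relevant schema" pattern described above. It is a simplified stand-in rather than AAM's actual pipeline code; the helper functions, model name, and database path are hypothetical.

```python
# A simplified sketch of the "fetch only relevant schema" pattern; not AAM's
# real pipeline, and the helper names, model, and path are hypothetical.
import json
import sqlite3

from litellm import completion


def list_tables(conn: sqlite3.Connection) -> list[str]:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    return [r[0] for r in rows]


def pick_relevant_tables(question: str, tables: list[str]) -> list[str]:
    # Ask the LLM which tables it needs, instead of sending the full schema up-front.
    prompt = (
        f"Question: {question}\nTables: {tables}\n"
        "Reply with a JSON list of the table names needed to answer the question."
    )
    reply = completion(model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return json.loads(reply.choices[0].message.content)  # assumes the model returns bare JSON


conn = sqlite3.connect("tn_covid_demo.sqlite")  # hypothetical path to the demo DB
relevant = pick_relevant_tables("How many ICU beds are vacant?", list_tables(conn))
# Only the schemas of the `relevant` tables would then go into the SQL-writing prompt.
```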

Query accuracy

  • Vanna without the DB schema generates SQL queries, but hallucinates table and column names, and thus does not produce an accurate response for any of the input queries.
  • Vanna with the DB schema is accurate for most questions, but struggles with the most complex query, which requires multiple JOIN / UNION clauses (it only answers the question partially).
  • AAM performance is on par with Vanna with the DB schema when we inspect the generated SQL queries and responses. It fails on the same type of query requiring multiple JOIN / UNION clauses (it raises an error for this query).

We note that we wrote code to use an LLM-based evaluation to test query accuracy and relevancy (table below). However, it was difficult to trust this evaluation, or to compare Vanna and AAM fairly with it, since Vanna.ai’s response is simply the output of running the SQL query on the database, whereas AAM composes a natural language response on the basis of the SQL output.
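For reference, the sketch below shows roughly the kind of LLM-as-judge scoring used to produce the table that follows. The actual evaluation code lives in the repository under experiments/; this prompt, the model name, and the toy example values are simplified stand-ins.

```python
# Simplified stand-in for an LLM-as-judge score; not the exact code in experiments/.
from litellm import completion

JUDGE_PROMPT = """Question: {question}
Expected answer: {expected}
Model answer: {answer}
Score 1 if the model answer is accurate and relevant to the question, else 0.
Reply with only the number."""


def judge(question: str, expected: str, answer: str) -> int:
    reply = completion(
        model="gpt-4o-mini",  # example model name
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, expected=expected, answer=answer),
        }],
    )
    return int(reply.choices[0].message.content.strip())


# Toy example values, purely for illustration.
print(judge("How many ICU beds are vacant in Chennai?", "1,234", "There are 1,234 vacant ICU beds."))
```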

| Metric | Vanna (no schema) | Vanna (schema) | AAM |
| --- | --- | --- | --- |
| Accuracy and Relevancy (0-1) | 0.000000 | 0.526316 | 0.947368 |
| Language and Script Identification (0-1) | 0.000000 | 0.789474 | 1.000000 |

Vanna.ai and AAM responses

Vanna.ai and AAM SQL queries

For the remainder of this post, we compare only Vanna.ai with the DB schema against AAM (although we ran the full analysis for both versions of the Vanna.ai models).

Language

  • Vanna.ai interprets questions in different languages correctly, but only dumps the raw output of the SQL query – it does not translate it.
  • AAM is able to interpret and respond to queries in different languages, and additionally responds in the same script.

Vanna.ai and AAM responses for queries in different languages

Guardrails

  • Vanna.ai identifies irrelevant queries, but allows prompt injection and SQL injection.
  • AAM does not allow SQL injection, prompt injection, or irrelevant queries.

Vanna.ai and AAM guardrails

Cost and response time

  • Vanna is on average faster than AAM at responding to queries.
  • Vanna is also cheaper, but we could not count token costs for training or for the intermediate RAG steps, so this may not be an accurate comparison (a sketch of how per-call costs are counted follows the table below).
| Metric | Vanna (no schema) | Vanna (schema) | AAM |
| --- | --- | --- | --- |
| Cost per query (USD)3 | 0.000836 | 0.000836 | 0.008870 |
| Time per response (s) | 2.536245 | 2.536245 | 7.866015 |
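For transparency, the sketch below shows how per-query cost can be accumulated from the token usage reported by each API call a pipeline makes. The per-million-token prices, model name, and the stand-in pipeline calls are placeholders, not the rates or prompts behind the table above.

```python
# Accumulating per-query cost from token usage; prices and prompts are placeholders.
from litellm import completion

PRICE_PER_M_INPUT = 0.15   # USD per 1M input tokens (placeholder)
PRICE_PER_M_OUTPUT = 0.60  # USD per 1M output tokens (placeholder)

# Hypothetical stand-in for the sequence of LLM calls a pipeline makes for one query.
pipeline_calls = [
    [{"role": "user", "content": "Which tables are relevant to: how many ICU beds?"}],
    [{"role": "user", "content": "Write a SQLite query for: how many ICU beds?"}],
]

total_cost = 0.0
for messages in pipeline_calls:
    reply = completion(model="gpt-4o-mini", messages=messages)  # model name is an example
    usage = reply.usage  # OpenAI-style token accounting returned by litellm
    total_cost += usage.prompt_tokens * PRICE_PER_M_INPUT / 1_000_000
    total_cost += usage.completion_tokens * PRICE_PER_M_OUTPUT / 1_000_000

print(f"Cost for this query: ${total_cost:.6f}")
```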

Ease of switching context / fixing issues

We tested this by updating the system prompt to change the context, adding the following line: “REMEMBER: If the user query does not specify a district, assume it is about Chennai.”

We then re-trained the Vanna.ai models and simply passed this updated system message to AAM, and asked both tools: "How many ICU beds?"
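Concretely, the change looks roughly like this on each side. In the sketch, `vn` is the vanna instance from the earlier quickstart sketch, and the AAM side is purely illustrative, since the real pipeline reads its system prompt from configuration.

```python
# `vn` is the MyVanna instance from the earlier sketch; the AAM side is
# illustrative only -- the real pipeline reads its system prompt from configuration.
CONTEXT_RULE = (
    "REMEMBER: If the user query does not specify a district, "
    "assume it is about Chennai."
)

# Vanna.ai: store the rule as a documentation chunk so it is retrieved at query time.
vn.train(documentation=CONTEXT_RULE)

# AAM: append the rule to the system prompt passed to the pipeline.
system_prompt = "You answer questions about the TN COVID-19 database.\n" + CONTEXT_RULE
```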

We found that both Vanna.ai and AAM are able to accommodate changes in context, and update their responses accordingly:

Vanna.ai and AAM responses when context changes

Conclusions

Vanna.ai is an advanced AI tool designed for tech-savvy users, focusing on converting text to SQL without providing SQL-to-text functionality. It leverages vector stores for high accuracy but lacks safeguards against prompt or SQL injection. Primarily deployed via web applications, it offers visualization and feedback loop training features.

In contrast, AAM targets non-technical users, deploying exclusively over WhatsApp with basic guardrails and a transparent, straightforward pipeline. However, it lacks additional features such as visualization.

If you want to know more about Ask-a-Metric, reach out to us at the DSEM Team. We would love to talk to you and explore how we can help and collaborate with you.

Appendix

Investigating errors with JSON columns

We reproduced the error from the previous note by modifying one table in the TN COVID database to contain a JSON column.

AAM error with JSON column

We can resolve this by updating the system prompt to include the following information: “The database contains JSON columns. Use json_extract to extract the required information.” After this, AAM provides the appropriate query:

SELECT district_name,
       json_extract(available_beds, '$.vacant_beds') AS vacant_beds
FROM bed_vacancies_clinics_11_may
ORDER BY CAST(json_extract(available_beds, '$.vacant_beds') AS INTEGER) DESC
LIMIT 1;
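For readers who want to verify this behaviour locally, here is a self-contained toy reproduction of the query pattern above; the table layout mirrors the example, but the data is made up.

```python
# Self-contained toy reproduction of the json_extract fix; the data is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bed_vacancies_clinics_11_may (district_name TEXT, available_beds TEXT)"
)
conn.executemany(
    "INSERT INTO bed_vacancies_clinics_11_may VALUES (?, ?)",
    [
        ("Chennai", '{"earmarked_beds": 100, "occupied_beds": 60, "vacant_beds": 40}'),
        ("Madurai", '{"earmarked_beds": 80, "occupied_beds": 75, "vacant_beds": 5}'),
    ],
)

row = conn.execute(
    """
    SELECT district_name,
           json_extract(available_beds, '$.vacant_beds') AS vacant_beds
    FROM bed_vacancies_clinics_11_may
    ORDER BY CAST(json_extract(available_beds, '$.vacant_beds') AS INTEGER) DESC
    LIMIT 1
    """
).fetchone()
print(row)  # ('Chennai', 40)
```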

However, we also need to update the DB table description of the relevant table to include information about the JSON column: “Each row identifies a district and a JSON object including the beds earmarked, occupied and available for COVID cases in the district clinics.”

Otherwise, the LLM simply hallucinates column names and the values inside them. When we then try to generate the top-k values for each column in that part of the pipeline, it results in the following errors:

AAM tool error with JSON column

While this is an intermediate fix, we are working on making the pipeline more robust to JSON columns.


  1. Cover Photo by Ana Municio on Unsplash↩︎

  2. This section is a modification of a table provided by Project Tech4Dev in their note comparing AAM and Vanna.ai ↩︎

  3. This might not be a fair comparison, since we only compute input and output token costs for vanna.ai (and not training or intermediate RAG queries). Costs for AAM, however, are computed for every OpenAI API call in the pipeline. ↩︎