Comparing Ask-a-Metric and Vanna.ai
TL;DR: We compare Ask-a-Metric (AAM) and vanna.ai on metrics that we regularly test for AAM use cases. We find that Ask-a-Metric performs on par with vanna.ai for straightforward queries, but struggles with more complex queries. Vanna.ai also struggles with complex queries and lacks guardrails, but has a greater range of features than AAM.[^1]
Introduction
In this blog post, we provide a detailed comparison of two text-to-SQL tools: Ask-a-Metric (AAM) and vanna.ai. This is a follow-up note expanding on a similar analysis shared by Project Tech4Dev in a previous note.
Code to reproduce this analysis, as well as the full set of results, is available in Ask-a-Metric’s open-source repository (under `experiments/`).
Ask-a-Metric
Ask-a-Metric is a WhatsApp-based LLM tool that answers a user’s questions by querying an SQL database for the relevant information and returning it in natural language, in a vernacular of the user’s choice. It is intended to facilitate data access for decision-making in the development sector, and is optimized for use in resource-constrained settings by non-technical users who nevertheless require timely access to data for critical decisions. For more details, please see our blog post here and our open-source repository here.
Vanna.ai
Vanna.ai is an AI tool designed for technical teams with coding knowledge, focused on generating SQL queries for an array of different SQL databases. It leverages vector stores of documentation, table schemas, and example SQL queries for high accuracy. Users can access it through various interfaces, such as Jupyter Notebooks, a Slackbot, and web apps. The tool is open source, with free and premium plans. Vanna.ai reports roughly 80% accuracy, improving over time by learning from successful queries.
Our approach to comparing Vanna.ai and Ask-a-Metric
We note that Vanna.ai and Ask-a-Metric appear to be intended for different end users: Vanna.ai is aimed at technical users and webapp deployments, while Ask-a-Metric is aimed at non-technical users accessing the tool exclusively over messaging services (for the time being). We have nevertheless attempted as fair a comparison as possible between both tools. We also note that our comparison metrics for the two tools may be somewhat biased: we are evaluating based on requirements that we often encounter when we deploy Ask-a-Metric for our partners. However, these might not be the specifications / metrics that Vanna.ai was optimized for.
Comparing features[^2]
Feature | Vanna.ai | Ask-a-Metric |
---|---|---|
Methodology | RAG, and active training on input questions | Few-shot prompting, pseudo-agentic pipeline (i.e. able to use “tools”), no feedback loop |
Input requirements for LLM model | System prompts, example queries, DB schema | System prompts, table descriptions, column descriptions |
LLM model support | OpenAI, Ollama, Mistral, Gemini, and self-hosted LLM models | Any LLM model / architecture supported by litellm |
SQL database support | Standard SQL databases (Postgres, SQLite, MySQL, etc.); can be extended to other databases | In principle, any database available via SQLAlchemy; however, only tested with Postgres and SQLite |
Maturity | Around for more than a year (first release in May 2023) | Released in July 2024 (after 4 months of development) |
Data interactions | Text-to-SQL, visualizations, any language supported by the LLM model, contextual memory | Text-to-SQL and SQL-to-text, any language supported by the LLM model, basic guardrails, chatflow orchestration |
Deployment | Slack, built-in webapp, Streamlit, etc. | WhatsApp |
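The “any database available via SQLAlchemy” claim above can be illustrated with a short sketch. This is not AAM’s actual connection code; the table and data are made up for illustration, and only the database URL would change for a different backend:

```python
from sqlalchemy import create_engine, inspect, text

# In principle any SQLAlchemy-supported URL works; only Postgres and SQLite
# have been tested. Here we use an in-memory SQLite DB for illustration.
engine = create_engine("sqlite://")

with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE bed_vacancies (district_name TEXT, vacant_beds INTEGER)"
    ))
    conn.execute(text(
        "INSERT INTO bed_vacancies VALUES ('Chennai', 120), ('Madurai', 45)"
    ))

# A pipeline can discover the schema at query time instead of
# needing it supplied up front.
tables = inspect(engine).get_table_names()
print(tables)  # ['bed_vacancies']
```

Swapping `"sqlite://"` for, say, a `postgresql://` URL is the only change needed to point the same code at another backend, which is what makes the SQLAlchemy abstraction attractive here.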
Comparison metrics for performance
We compare the two on the following metrics:
- Query accuracy: whether the responses are accurate and relevant to the user query.
- Language: whether appropriate SQL queries can be generated when the language is not English, and whether responses are returned in the same language and script. For many applications of AAM, we expect end-users to use a mix of English and other regional languages (Hindi, Filipino, Swahili, etc.) – we thus want to ensure that the bot responds reliably in the same language and script as the input question.
- Guardrails: whether there are guardrails for unsafe queries. For AAM applications, we find users often ask queries that may not be safe (e.g. deleting data, or requesting data that is not in the database). It is thus important to ensure that there are guardrails against these kinds of queries, so that there are no adverse effects from misinformation or from running deleterious SQL queries on the DB.
- Ease of updating context: whether it is easy to modify the bot’s responses when the context switches.
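As an illustration of the kind of guardrail described above, a minimal check might reject any generated SQL that is not a single read-only SELECT statement. This is a hypothetical sketch, not AAM’s actual guardrail logic:

```python
import re

# Keywords that indicate a write/DDL operation (illustrative, not exhaustive).
UNSAFE_KEYWORDS = {"delete", "drop", "update", "insert", "alter", "truncate"}

def is_safe_query(sql: str) -> bool:
    """Allow only single read-only SELECT statements (illustrative heuristic)."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject stacked statements, a common injection vector
        return False
    tokens = set(re.findall(r"[a-z_]+", stripped.lower()))
    return stripped.lower().startswith("select") and not (tokens & UNSAFE_KEYWORDS)

print(is_safe_query("SELECT district_name FROM beds"))  # True
print(is_safe_query("DELETE FROM beds"))                # False
print(is_safe_query("SELECT 1; DROP TABLE beds"))       # False
```

A production guardrail would go further (e.g. parsing the SQL AST and using a read-only database role), but even a heuristic like this blocks the obvious destructive queries.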
The previous note from Tech4Dev also highlights issues with JSON columns in the DB, which AAM struggles with. We investigate this issue separately in the Appendix.
Results
We performed all tests on the SQLite TN COVID-19 demo database in AAM’s open-source GitHub repository – we chose this database since it has a relatively simple architecture and contains publicly available data, which allows for a fair and reproducible analysis. We note, however, that the database is too limited to encapsulate all potential real-world scenarios, and we would need to do additional testing to compare both tools comprehensively.
We compared AAM to Vanna.ai trained with the DB schema, and trained without the DB schema. AAM sits between the two Vanna.ai modes: we do not give AAM the schema up front, but during query processing it can inspect the subset of tables / columns relevant to the input question. Thus, for a given query, AAM may or may not have access to the full DB schema, depending on the complexity of the input question.
Query accuracy
- Vanna without the DB schema generates SQL queries but hallucinates table and column names, and thus does not produce an accurate response for any of the input queries.
- Vanna with the DB schema is accurate for most questions, but struggles with the most complex query, requiring multiple JOIN / UNION clauses (it only answers the question partially).
- AAM performance is on par with the Vanna with DB schema version, when we inspect the generated SQL queries and responses. It also fails on the same type of query requiring multiple JOIN / UNION clauses (it raises an error for this query).
We also wrote code for an LLM-based evaluation of query accuracy and relevancy (table below). However, it was difficult to trust this evaluation, or to compare Vanna and AAM fairly with it, since Vanna.ai’s response is simply the output of running the SQL query on the database, whereas AAM composes a natural language response on the basis of the SQL output.
Metric | Vanna (no schema) | Vanna (schema) | AAM |
---|---|---|---|
Accuracy and Relevancy (0-1) | 0.000000 | 0.526316 | 0.947368 |
Language and Script Identification (0-1) | 0.000000 | 0.789474 | 1.000000 |
For the remainder of this post, we compare only the Vanna.ai with DB schema version against AAM (although we ran all analyses for both versions of the Vanna.ai models).
Language
- Vanna.ai interprets questions in different languages correctly, but only returns the raw output of the SQL query – it does not translate it.
- AAM is able to interpret and respond to queries in different languages, and additionally responds in the same script.
Guardrails
- Vanna.ai identifies irrelevant queries, but allows prompt and SQL injection.
- AAM does not allow SQL injection, prompt injection, or irrelevant queries
Cost and response time
- Vanna is on average faster than AAM at responding to queries
- Vanna is also cheaper, but we cannot count token costs for training or in the RAG steps, so it may not be an accurate comparison.
Metric | Vanna (no schema) | Vanna (schema) | AAM |
---|---|---|---|
Cost per query (USD)[^3] | 0.000836 | 0.000836 | 0.008870 |
Time per response (s) | 2.536245 | 2.536245 | 7.866015 |
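For context, per-query cost figures like those above are typically computed by summing token counts over every API call a pipeline makes, weighted by per-token prices. The prices below are hypothetical placeholders, not the actual rates used in our analysis:

```python
# Hypothetical per-million-token prices, for illustration only.
PRICE_IN_PER_M = 0.50   # USD per 1M input tokens (assumed)
PRICE_OUT_PER_M = 1.50  # USD per 1M output tokens (assumed)

def query_cost(calls: list[tuple[int, int]]) -> float:
    """Sum cost over every (input_tokens, output_tokens) API call in a pipeline."""
    return sum(
        tin * PRICE_IN_PER_M / 1e6 + tout * PRICE_OUT_PER_M / 1e6
        for tin, tout in calls
    )

# A single LLM call vs. a multi-step pipeline making three calls per query:
single = query_cost([(1200, 150)])
pipeline = query_cost([(1200, 150), (2500, 300), (800, 120)])
print(round(single, 6), round(pipeline, 6))
```

This also illustrates why a multi-call pipeline like AAM’s costs more per query than a single-call setup, and why the comparison is skewed if some of one tool’s calls (e.g. training or RAG retrieval) are not counted.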
Ease of switching context / fixing issues
We tested this by updating the system prompt to change context, with the following line: “REMEMBER: If the user query does not specify a district, assume it is about Chennai.” We re-trained the Vanna.ai models and simply input this updated system message to AAM, and then asked both tools: “How many ICU beds?”
We found that both Vanna.ai and AAM are able to accommodate changes in context, and update their responses accordingly.
Conclusions
Vanna.ai is an advanced AI tool designed for tech-savvy users, focused on converting text to SQL (without providing SQL-to-text functionality). It leverages vector stores for high accuracy but lacks safeguards against prompt or SQL injection. Primarily deployed via web applications, it offers visualization and feedback-loop training features.
In contrast, AAM targets non-technical users, deploying exclusively over WhatsApp, with basic guardrails and a transparent, straightforward pipeline. However, it lacks additional features such as visualization.
If you want to know more about Ask-a-Metric, reach out to us at DSEM Team. We would love to talk to you and see how we can help and collaborate with you.
Appendix
Investigating errors with JSON columns
We reproduced the error from the previous note by modifying one table in the TN COVID database to contain a JSON column.
We can resolve this by updating the system prompt to include the following information: “The database contains JSON columns. Use `json_extract` to extract the required information.” After this, AAM provides the appropriate query:
```sql
SELECT district_name, json_extract(available_beds, '$.vacant_beds') AS vacant_beds
FROM bed_vacancies_clinics_11_may
ORDER BY CAST(json_extract(available_beds, '$.vacant_beds') AS INTEGER) DESC
LIMIT 1;
```
However, we also need to update the table description for the relevant table to include information about the JSON column: “Each row identifies a district and a JSON object including the beds earmarked, occupied and available for COVID cases in the district clinics.” Otherwise, the LLM simply hallucinates column names and the values inside them, and the part of the pipeline that looks for the top-k values in each column raises errors.
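The `json_extract` fix can be checked end-to-end against a small in-memory SQLite table mimicking the modified `bed_vacancies_clinics_11_may` schema. The JSON key names and the data here are made up for illustration, based on the table description above:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bed_vacancies_clinics_11_may"
    " (district_name TEXT, available_beds TEXT)"
)
# Hypothetical rows: each available_beds value is a JSON object with
# earmarked, occupied, and vacant bed counts (assumed key names).
rows = [
    ("Chennai", json.dumps({"earmarked_beds": 200, "occupied_beds": 150, "vacant_beds": 50})),
    ("Madurai", json.dumps({"earmarked_beds": 120, "occupied_beds": 40, "vacant_beds": 80})),
]
conn.executemany("INSERT INTO bed_vacancies_clinics_11_may VALUES (?, ?)", rows)

# The query AAM generates after the system-prompt fix:
result = conn.execute(
    "SELECT district_name, json_extract(available_beds, '$.vacant_beds') AS vacant_beds "
    "FROM bed_vacancies_clinics_11_may "
    "ORDER BY CAST(json_extract(available_beds, '$.vacant_beds') AS INTEGER) DESC LIMIT 1"
).fetchone()
print(result)  # ('Madurai', 80)
```

Note that `json_extract` requires SQLite’s JSON1 functions, which are available in the SQLite builds bundled with recent Python versions.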
While this is an interim fix, we are working on making the pipeline more robust to JSON columns.
[^1]: Cover Photo by Ana Municio on Unsplash.
[^2]: This section is a modification of a table provided by Project Tech4Dev in their note comparing AAM and Vanna.ai.
[^3]: This might not be a fair comparison, since we only compute input and output token costs for vanna.ai (and not costs for training or intermediate RAG queries). Costs for AAM, however, are computed for every OpenAI API call in the pipeline.