Building Survey Accelerator: How we made finding quality survey questions simple

Mark Botterill
Data Scientist at IDinsight's Nairobi office.
Aug 26, 2024 · 9 min read

Creating good surveys is hard. Ask any economist or researcher who’s spent weeks crafting the perfect questionnaire, and they’ll tell you about the countless hours spent hunting through academic papers, existing surveys, and their own memory banks, searching for question formats, flow structures, and methodological approaches that can help them build a cohesive survey. Even when they need to develop original questions, they’re often looking for proven frameworks and industry-standard approaches that can serve as reliable benchmarks for their own work.

At IDinsight, our data science team sat down with our economists in mid-2024 to tackle this problem head-on. What we discovered was that while everyone dreamed of an AI system that could magically generate a perfect SurveyCTO survey from a simple project description, the reality was far more complex—and far more interesting.

The problem with the “magic button” approach

Initially, we were drawn to the pipe dream solution: build an agentic AI system that takes your project context and theory of change, then outputs a fully formatted survey. It sounds brilliant in theory, but there’s a catch—it would require dozens, if not hundreds, of chained LLM calls working under the hood to break down inputs, search for existing questions along the right lines, generate novel questions where needed, and format everything properly at the very end.

The fundamental issue with this approach applies to any pipeline built from chained LLM calls: the more calls you chain together, the higher the probability that the system will misinterpret your original intent. While such a system could potentially complete in minutes what takes human researchers weeks of careful work, you might start with what you believe are crystal-clear requirements and end up with something miles away from what you wanted. Even the most sophisticated agentic systems powered by the best LLMs still fail at complex, end-to-end, real-world tasks like this.
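To make that intuition concrete with a purely illustrative back-of-the-envelope calculation (the 0.95 figure is an assumption, not a measurement of any real system): if each chained call independently preserves the user's intent with some fixed probability, the odds that the whole pipeline does fall off quickly.

```python
# Illustrative arithmetic only: assume each of n chained LLM calls independently
# preserves the user's intent with probability p = 0.95 (an assumed figure).
p = 0.95
for n in (5, 20, 50):
    print(f"{n:>2} chained calls -> {p ** n:.2f} chance the original intent survives")
# 5 -> 0.77, 20 -> 0.36, 50 -> 0.08
```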

Furthermore, one of the largest frustrations people have with AI tools is the inability to understand what’s happening under the hood. You might input your carefully crafted project documents, but after all those LLM calls, get something completely different from what you wanted.


Figure 1: Common frustrations that arise with agentic pipelines

There are ways to counteract this—perhaps building a UI that lets users see into the pipeline and check intermediate outputs—but the engineering required grows exponentially with each subsequent pipeline step, and our goal for Survey Accelerator was to create an MVP people could start using.


Figure 2: Transparent pipelines can involve overwhelming amounts of engineering

So we took a step back and asked: what’s the most painful part of the survey creation process that we could actually solve well?

Our solution: A smarter search engine for survey questions

The answer became clear after talking with our economists: finding high-quality examples of survey questions on specific topics.

Industry-standard surveys like MICS, DHS, and IHDS are published in the public domain and readily accessible, but they’re scattered across various websites with no unified search capability, let alone an intelligent search capability. Researchers waste countless hours manually digging through PDFs, hoping to stumble upon relevant question formats that can inform their own survey design and methodology. Even when they find promising surveys, scanning through hundreds of pages to locate specific question types is tedious and time-consuming.

We decided to build something different—a search engine that uses LLMs in a targeted, controlled way to take someone from “I want to see examples of surveys that cover child health” to viewing the exact page containing the most relevant questions almost instantly, complete with concise justification summaries for search results and AI-powered highlighting. Rather than relying on unconstrained AI generation that could drift from user intent, we use LLMs for specific, well-defined tasks like understanding document context and ranking relevance.

Here’s where things got interesting. Standard practice for document ingestion (or “chunking”) typically involves extracting the raw text of each PDF page, embedding each page independently, and retrieving results with hybrid search, which combines keyword matching with semantic search to capture meaning. But survey pages themselves aren’t independent entities; they’re part of a larger, structured questionnaire where context is everything.
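For reference, here is a minimal sketch of that baseline, page-level hybrid search. It is not our production code: it assumes the rank_bm25 and sentence-transformers libraries, an illustrative embedding model, and a simple 50/50 weighting.

```python
# Minimal hybrid-search sketch (illustrative, not the Survey Accelerator code):
# each PDF page is embedded independently and also indexed for keyword match.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

pages = [
    "Show the person in question the two numbers and ask what comes next.",
    "58 49",  # raw extracted text of a later page, with no context of its own
]

# Keyword side: BM25 over whitespace-tokenised page text
bm25 = BM25Okapi([p.lower().split() for p in pages])

# Semantic side: dense embeddings of the same raw text (model name is an assumption)
model = SentenceTransformer("all-MiniLM-L6-v2")
page_vecs = model.encode(pages, normalize_embeddings=True)

def hybrid_scores(query: str, keyword_weight: float = 0.5) -> np.ndarray:
    kw = bm25.get_scores(query.lower().split())
    kw = kw / (kw.max() or 1.0)  # normalise keyword scores to [0, 1]
    sem = page_vecs @ model.encode(query, normalize_embeddings=True)
    return keyword_weight * kw + (1 - keyword_weight) * sem

# The "58 49" page scores poorly here -- exactly the problem described above.
print(hybrid_scores("children's mathematical ability"))
```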

Consider this example: imagine you’re researching ways to evaluate children’s math abilities. You might encounter a page in a survey that contains only two numbers: “58” and “49.” In isolation, these numbers tell you nothing about mathematical assessment. But several pages earlier, the survey states: “Show the person in question the two numbers and ask what comes next in the sequence.” It becomes clear that these two numbers are actually part of a section designed to test children’s numerical literacy.

This is exactly what we found when examining the MICS Base Questionnaire for Children and Adolescents Age 5–17. A page showing just “58 49” would never show up in basic hybrid search when someone searches for children’s mathematical ability, because those numbers in isolation have zero semantic connection to math assessment.


Figure 3: A page midway through a MICS survey that forms part of a section evaluating education; in isolation, the numbers 58 and 49 have no apparent connection to an assessment of children’s numerical literacy

Implementing contextual chunking

To solve this, we used a technique called contextual chunking, developed by Anthropic. During ingestion, for each page in a survey, we make an LLM API call asking for a brief summary of the page, how it fits into its section, and how it relates to the survey overall.

So instead of embedding just “58 49” into our database, we embed something like: “Context: This page covers mathematical assessment as part of a numeracy test section designed to evaluate children’s arithmetic skills, following questions that ask children to identify missing numbers in sequences. Extracted Text: 58 49.”

This dramatically improved our ability to find relevant sections, even when the raw text seemed completely unrelated to the search query. We found this technique made our search substantially more effective, even while maintaining our page-by-page processing pipeline.
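A minimal sketch of that ingestion step is below. It assumes the Anthropic Python SDK; the model name, the prompt wording, and the idea of passing the full survey text in a single call are illustrative assumptions rather than a copy of our pipeline.

```python
# Sketch of contextual chunking at ingestion time (illustrative: model name,
# prompt wording, and the downstream embed() step are placeholders).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualise_page(survey_text: str, page_text: str) -> str:
    """Ask the LLM how this page fits into its section and the survey overall."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Here is a full survey questionnaire:\n\n" + survey_text +
                "\n\nHere is one page from it:\n\n" + page_text +
                "\n\nBriefly describe what this page covers, how it fits into "
                "its section, and how it relates to the survey overall."
            ),
        }],
    )
    context = response.content[0].text
    # What actually gets embedded: the generated context plus the raw page text.
    return f"Context: {context} Extracted Text: {page_text}"

# e.g. embed the output of contextualise_page(full_survey_text, "58 49")
# instead of embedding "58 49" on its own.
```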

Smart ranking: Beyond simple matches

We discovered that traditional re-rankers weren’t sophisticated enough for this use case. Users needed to find not just contextually relevant pages, but pages that contained actual, usable questions they could adapt for their own surveys.

Our solution uses parallelized LLM calls to quickly score each search result on two distinct dimensions:

  1. Contextual match: How well does this page fit within the broader topic you’re researching?
  2. Direct question score: Does this page contain an actual question you could adapt for your survey?

This dual scoring system handles the nuanced nature of survey research beautifully. For example, our “58 49” page gets a high contextual score (it’s testing math ability) but a low direct score (no actual question visible). Meanwhile, a page from IHDS2 asking “Can this child read and write?” gets high scores on both dimensions—it’s clearly about educational assessment and contains a direct, actionable question.


Figure 4: Each page result for a survey gets its own scoring, quantifying how well it matches the input query both contextually and directly
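A rough sketch of that parallelised scoring is below; the prompts, the model name, and the 0-10 scale are assumptions for illustration, not the exact settings Survey Accelerator uses.

```python
# Sketch of parallelised dual scoring (illustrative; prompts, model name and
# the 0-10 scale are assumptions, not the exact Survey Accelerator settings).
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def score(query: str, page: str, dimension: str) -> int:
    response = await client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Query: {query}\nPage: {page}\n"
                f"On a scale of 0-10, give the {dimension}. Reply with the number only."
            ),
        }],
    )
    return int(response.content[0].text.strip())

async def score_results(query: str, pages: list[str]) -> list[dict]:
    # One pair of concurrent calls per page: contextual match + direct question score.
    tasks = [
        asyncio.gather(
            score(query, p, "contextual match score"),
            score(query, p, "direct question score"),
        )
        for p in pages
    ]
    scored = await asyncio.gather(*tasks)
    return [{"page": p, "contextual": c, "direct": d} for p, (c, d) in zip(pages, scored)]

# asyncio.run(score_results("children's mathematical ability", candidate_pages))
```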

The beauty of this approach is that it gives users flexibility. Sometimes you want to see the exact question format others have used. Other times, you want to understand how a particular topic has been approached conceptually, even if the specific questions aren’t directly applicable.

The user experience: Simple but powerful

Our goal for the UI was to dedicate as much space as possible to the surveys returned by a search query; the entire purpose of the tool is to get users from their query to seeing the raw documents as smoothly as possible. The search panel is tucked away in the top right, with filtering options that let users narrow results by organization (like WHO, World Bank, or specific research institutions) or survey type (household surveys, health assessments, educational evaluations, etc.).

When you search, results appear as cards with AI-generated explanations of why each result matched your query. For instance, if you search for children’s educational assessment, a card might show: “Mentions children’s mathematical abilities and assessments in the educational context” for an IHDS2 result. This immediate context helps users quickly evaluate whether a result is worth exploring further.

But here’s where the real magic happens: when you click a card, you’re taken directly to the relevant page with keywords highlighted automatically. No more scanning walls of text to find the needle in the haystack. If a page contains dense questionnaire formatting or multiple questions, the highlighting immediately draws your eye to the content that triggered the match.


Figure 5: Here we use a dummy query “citrus fruits” to illustrate how related words like “Fruits and vegetables” and the topic header “Farming and Gardening” are highlighted, making it easier to pick your match out of the original PDF page
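As a minimal sketch of that highlighting step (assuming PyMuPDF for the PDF work, and taking the list of terms to highlight as already produced upstream by the LLM), it could look like this:

```python
# Sketch of keyword highlighting on the matched PDF page (illustrative only;
# the terms to highlight are assumed to have been suggested upstream).
import fitz  # PyMuPDF

def highlight_page(pdf_path: str, page_number: int, terms: list[str], out_path: str) -> None:
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    for term in terms:
        # search_for returns the bounding boxes of every occurrence on the page
        for rect in page.search_for(term):
            page.add_highlight_annot(rect)
    doc.save(out_path)
    doc.close()

# e.g. highlight_page("some_survey.pdf", 42,
#                     ["Fruits and vegetables", "Farming and Gardening"],
#                     "highlighted.pdf")   # file name and page are hypothetical
```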

The real strength of Survey Accelerator is that it combines a curated collection of high-quality surveys with an intuitive and rapid search process. If one result isn’t quite right, you can click to the next best match in seconds, seeing the source page exactly as it was created in its full context. This speed and context preservation is crucial for researchers who need to understand not just what questions were asked, but how they fit into the broader survey structure.

What’s next: Building toward automated survey creation

We’ve laid the groundwork for something much bigger. This search engine forms a crucial component of that original, ambitious vision—the agentic survey creation pipeline we initially set out to build. While fully automated survey generation remains a challenging goal, we now have a more pragmatic, stepped approach that builds on what we’ve learned.

The roadmap we’d like to pursue looks like this:

  1. Take starting documents (project context, theory of change, etc.)
  2. Search for high-quality examples (using Survey Accelerator as an external API)
  3. Generate tailored questions (adapting from examples or generating original questions from scratch)
  4. Convert to SurveyCTO format (final formatting)

By solving the search problem first, we’ve built a tool that’s immediately useful to researchers while laying the groundwork for more sophisticated automation. The key insight is that even the most advanced AI systems work better when they can reference high-quality examples rather than generating questions from scratch.

Our curated database of contextually-chunked surveys can now serve as the foundation for future AI-assisted survey creation tools. When we eventually tackle that agentic system, it will be able to search through our database to find relevant question formats, understand their context, and adapt them appropriately—rather than trying to invent questions from thin air.

Try it yourself and stay tuned for next steps

Survey Accelerator is fast, accurate, and completely free to use, so please give it a try and leave us some feedback—and if you have high-quality surveys to contribute, it only takes a few clicks to add them to our growing collection. We’ve also made our entire codebase open source, so researchers and developers can build upon our work or adapt it for their own specialized needs.

We’re excited to have tackled one of the key challenges in the survey creation pipeline, and we’re keen to keep building from here, so stay tuned for more updates!