It’s the GenAI age. Every person and their grandpa is creating AI chatbots based on RAG. For farmers, for mothers, for teachers, for bureaucrats. Hey, we’re doing it too!
But here’s a hot take: you don’t need a RAG AI chatbot. Definitely not at the start. Probably not ever.
TL;DR: In this blog post, we describe a custom clustering algorithm we designed to efficiently cluster grids into enumeration areas for grid-based sampling.
The DSEM team at IDinsight is the technical workhorse for project teams, and nearly every piece of technical work we do involves grouping things by some measure of similarity. Let me explain.
In our previous post, we examined how satellite imagery can be used in the social sector and how the MOSAIKS algorithm enables us to draw out “features” from these images without needing complex image-processing models. But the story doesn’t end with the algorithm.
Satellite imagery has become a valuable tool in global development: from environmental monitoring and disaster response to urban planning and agriculture. With more and more high-resolution satellite imagery available as open-source datasets, information about land usage and populations has become widely accessible. But this data also needs advanced analytical techniques to make sense of it.
In The Open Society and Its Enemies (1945), Karl Popper introduces “piecemeal social engineering,” his framework for building up social institutions incrementally, informed by experimentation and evidence. This contrasts with the more prevalent “utopian social engineering” of his time, which he criticized for its overly lofty, abstract ideals that largely ignored practicality; indeed, today we might regard such methods as colonial and paternalistic. For Popper, the “piecemeal engineer knows, like Socrates, how little he knows. He knows that we can learn only from our mistakes.”1
In that spirit, I want to begin with lessons we have learned in trying to apply engineering principles and methods to help our partners increase their social impact. I hope that our learnings can be helpful for others in the sector.
As data practitioners, we are separated by vast distances from the ground truth. There is, in one sense, the literal physical distance between our laptop screens and the sites of data collection, which can cause losses of fidelity in context and empathy. There is also a representative distance (in some cases, an asymmetry of power) between the reality of researching and practicing machine learning, of publishing papers, of open-source repositories, of commercial applications, and the labor that goes into each row of data, the families represented by vectors, each interaction distilled into a potential flag for data quality. In this blog, I hope to illustrate those minutiae and bring together these two worlds.