The reality behind a machine learning dataset

by: Victor Zhenyi Wang
thumbnail for this post

As data practitioners, we are separated by vast distances from the ground truth. There is, in one sense, the literal physical distance between our laptop screens and the places and sites of data collection which can cause fidelity losses in context and empathy. There is also a representative distance – in some cases, an asymmetry of power – between the reality of researching and practicing machine learning, of publishing papers, of open-source repositories, of commercial applications – and the labor that goes into each row of data; the families represented by vectors; each interaction is distilled into a potential flag for data quality. In this blog, I hope to illustrate those minutiae and bring together these two worlds.

In Katihar, makhana is a relatively new crop. Makhana is similar to the water lily; the plant grows in shallow ponds and its stems burst through the muddy waters erupting into purple and white flowers. Its floating leaves however, unlike the lotus or lily, have sharp thorns that discourage harvest. When mature, the flower produces the foxnut, which is a profitable cash crop frequently viewed as a health food. In order to harvest the nut, one must first knock the nut into the paddies of water, then use long polearms to move the thorny leaves and stems out of the way, and, finally, wade in to collect the foxnut. It’s a laborious process and one that happens usually in late summer, when the mercury climbs until the afternoon, when a sudden flush of rain brings brief respite from stifling haze.

image

About 90 percent of the world’s makhana comes from eastern Bihar. Makhana is a crop uniquely suited to Bihar with hot summers and fields prone to flooding. The conditions are much better for the crop than its native sites in East Asia. In the early 90s, the crop was introduced to Bihar and hundreds of thousands of hectares dedicated to its propagation. Today, driving through the eastern Bihar countryside, paddies of makhana are ubiquitous – largely discrete individual paddies that pockmark the landscape with muddy water – signs of the last storm and flooding ever-present in the fields surrounding them.

image

It was on an eight hour car ride from Patna to Katihar that I saw these fields for the first time. I was traveling with an IDinsight team to observe data collection activities and farmer interviews for a project for a non-profit foundation. The foundation’s mission is to provide high-quality datasets and machine learning resources as public goods. The current project provides a satellite imagery dataset labeled with plot boundaries. Eventually, the foundation hopes that this could be used to train a machine learning classifier to distinguish plots. Crop predictions could be used by governments to make economic forecasts and policy decisions.

The data collection activity in the district of Katihar involves 16 enumerators, three auditors, and two coordinators / managers. Each enumerator is assigned to one of Katihar’s 16 blocks approximately 45km in diameter and designated to visit ~45 households. A household survey takes anywhere from 30 minutes to over two hours. The household survey opens with what sounds like “gupshup” - questions about family context, agricultural practices, recent harvests. The families we visit uphold the region’s generous standards of hospitality with chai invariably served, electric fans brought out from the interior, and chairs splayed out in a circle, prepared for conversation. Then, we set out to visit plots held by the families.

image

During this portion of data collection, the enumerator takes out a Garmin GPS device and walks the boundaries of the fields pointed out by the farmers. After delineating the boundary, they walk into the field itself and record a center point. If there are many plots spread out far in an area, this exercise is repeated until all the plots are accounted for. For the days we were with our enumerators and the farmers, the temperatures were never below 35 degrees Celsius (95 degrees Farenheit) and humidity over 90 percent. Nonetheless, they walk on.

image

Here is one account from a family involved in the survey. On our second day, we visited a home about an hour northwest of the main city. There was an intensely sweet smell of drying chilies that hung in the air like perfume. We were in a small mandir nestled in a grove of bamboo. There was a field immediately adjacent to one of the sides of the main edifice and a couple of bamboo-woven houses further away. When we arrived, a surveyor had already started a conversation with the man that lived there. His wife started to bring us these massive plastic garden chairs and my colleague rushed forward to help her out.

Soon it had dawned on the field coordinator that this family was not actually eligible to be part of the dataset. Although the man did own two fields, he did not cultivate them directly. He told us this land was given to him as part of his military pension but he could no longer farm because of a persistent knee injury that had limited his mobility severely. Instead, they rented the land out to another family.

image

The recent months were not kind to them. The fields had been set aside to grow watermelons, but the storms had smothered the nascent seedlings, leading to catastrophic loss of income. As we started to leave, outside the mandir he came to speak with us. Perhaps thinking that we had connections to the state and some iota of political capital, he told us that he was deeply in debt and begged for us to ask the government to provide assistance. Although we tried to explain that we had no such powers he said “but even still your pen is mightier than my words.”

Two months later, 7000 plots measured, and 3000 families visited, the output is a “flat file” ready for consumption as a public good dataset for machine learning. Lifting the veil behind each machine learning project reveals an entire world of complexity –– of the teams involved in data collection and analysis, and beyond that, the thousands of families who gave up precious time and energy to participate with scant expectations of ever benefiting from the outputs that may be years away. Often these contributions are easily overlooked – even when social sector achievements or new innovations are celebrated widely. The story behind each data point that meaningfully contributes to a change in policies or programs should be something we hold in our minds and actively acknowledge. Many IDinsighters and others across the sector agree there is more to be done in this regard.

This is the reality behind many machine learning datasets, particularly for those involved in the development sector. As do many other IDinsighters, we often ask ourselves how we can better acknowledge and support survey participants. How can we effectively share results and findings that have implications for their lives? Where are our other opportunities address power asymmetries that may exist in research? We are working toward this in our dignity initiative and in how we apply our ethics policy, as well as elsewhere, but hope to learn from all of you as well.