Optimizing data loading using Airflow and Redshift

SurveyStream is a software product developed by IDinsight to support and streamline primary data collection operations. SurveyStream helps data collection teams manage survey operations more efficiently, freeing up their time to focus on other crucial activities both before and during a survey.
SurveyStream functions as a web application backed by a robust data management system that survey teams can use to manage enumerator hiring, assign enumerators to respondents, send periodic emails to enumerators with their assignments and get timely insights into productivity and data quality.
In this blog, we share how we scaled a key functionality of the SurveyStream system to support increasing volumes of data coming in from large-scale surveys. In the social sector, the budget for software costs is limited, so our approach had to prioritize using existing cloud resources more efficiently rather than simply adding more compute or memory resources to our existing setup.
The Challenge
During surveys, enumerators record responses in a data collection tool like SurveyCTO. To enable features such as survey progress tracking and data quality monitoring, this data needs to be regularly transferred or “loaded” into the SurveyStream platform. We run this extraction and loading process every 30 minutes, ensuring teams have timely access to the data.
However, managing survey data at scale presents unique challenges. Surveys at IDinsight often have more than 100,000 submissions and include more than a thousand questions, which can result in over 5000 columns in the final dataset after including repeat groups and “calculate fields” from the SurveyCTO form. Since most database engines impose limits on the number of columns per table, this data is stored in a long format - specifically, using an entity-attribute-value (EAV) model - requiring additional processing before it can be loaded into SurveyStream.
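To make the long format concrete, here is a toy illustration (the submission id and question names are made up): a single wide submission row becomes one row per question in the entity-attribute-value layout.

```python
import pandas as pd

# One wide submission row: one column per question (hypothetical names).
wide = pd.DataFrame(
    [{"submission_id": "uuid-001", "q1_age": "34", "q2_district": "Pune"}]
)

# The same data in long / EAV form: one row per (submission, question) pair,
# so per-table column limits no longer apply.
long_format = wide.melt(
    id_vars="submission_id", var_name="question", value_name="response"
)
print(long_format)
#   submission_id     question response
# 0      uuid-001       q1_age       34
# 1      uuid-001  q2_district     Pune
```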

In this blog, we will discuss how we optimized data processing and loading of survey data within the limits of our existing infrastructure.
Status Quo
The core components of SurveyStream’s data management system include Apache Airflow[1], AWS S3 for storing raw data, and Amazon Redshift as the data warehouse. Data was extracted by Airflow and first stored in S3 in CSV format. We then used pandas with chunked loading to read 5,000 rows at a time, format and pivot each chunk, and append it to a Parquet directory in S3. Once all chunks were processed, the entire Parquet dataset was loaded into Redshift using the COPY command.
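A minimal sketch of that earlier flow, assuming pandas with s3fs and pyarrow available; the bucket paths, key column, and output layout are illustrative rather than our exact code:

```python
import pandas as pd

RAW_CSV = "s3://surveystream-raw/form_data.csv"             # hypothetical paths
LONG_PARQUET_DIR = "s3://surveystream-staging/form_data_long/"

# Read the wide CSV in chunks of 5,000 rows to keep memory bounded.
for i, chunk in enumerate(pd.read_csv(RAW_CSV, chunksize=5000, dtype=str)):
    # Pivot the chunk from wide to long: one row per (submission, question).
    long_chunk = chunk.melt(
        id_vars="submission_id", var_name="question", value_name="response"
    )
    # Append the processed chunk to the Parquet dataset in S3.
    long_chunk.to_parquet(f"{LONG_PARQUET_DIR}part-{i:05d}.parquet", index=False)

# Once all chunks are written, a single Redshift COPY loads the full dataset:
#   COPY survey_data.responses
#   FROM 's3://surveystream-staging/form_data_long/'
#   IAM_ROLE '<copy-role-arn>' FORMAT AS PARQUET;
```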
Our main bottlenecks with this approach were speed and memory: processing larger chunks improved speed but increased memory usage, while processing smaller chunks lowered memory usage but slowed things down.

It is important to note that, at this point, we had already made some tooling changes to perform well at scale:
- For storage, we shifted from Postgres to Redshift. Redshift is optimized for bulk loads and fast analytical query performance on massive datasets. This shift was especially crucial for downstream data analysis and reporting.
- We adopted the Parquet format instead of CSV or JSON for loading data into Redshift. Parquet is a columnar storage file format with better compression and encoding schemes, which reduces storage overhead and I/O operations.
However, while this design worked well for small surveys, as SurveyStream was used for larger and larger surveys (and multiple surveys at once), the pipelines began to slow down and survey teams were not able to get the information they needed at critical times, like during daily enumerator debriefs. It was clear we needed to make further improvements to the scalability of the platform.
Parallelism as a solution
To speed up loading for larger datasets, we needed to process the data in parallel. But to do this without scaling up the virtual machine on which Airflow runs, or risking memory issues, we needed to shift this data wrangling outside of Airflow. Using AWS Lambda might seem like the natural next step, but to avoid the additional cost we decided to explore how to achieve this using existing components.
The basic idea is that instead of Airflow managing a single, large, resource-intensive task, it oversees multiple smaller tasks running outside of Airflow, reducing the load on Airflow itself. We redesigned the data flow to slice the survey data into column-wise chunks, load these slices into temporary tables on Redshift, and process them as parallel tasks, allowing for concurrent execution. And because we run a non-serverless Redshift cluster, we could do this with only minor impact on our infrastructure costs.

We start by extracting the data and storing it in Parquet format in S3. Airflow then reads the column names from the Parquet file’s metadata, reads the data in chunks of 200 columns each, and stores each chunk as a separate Parquet file in S3. This read and write to S3 happens in sequence and is quite fast.
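A sketch of this column-slicing step using pyarrow (the S3 keys and the submission_id key column are assumptions; only the 200-column chunk size comes from our setup):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

WIDE_PARQUET = "surveystream-staging/form_data_wide.parquet"   # hypothetical key
CHUNK_DIR = "surveystream-staging/form_data_chunks"
COLUMNS_PER_CHUNK = 200

s3 = fs.S3FileSystem()

# Read only the Parquet metadata to get the full column list.
schema = pq.read_schema(WIDE_PARQUET, filesystem=s3)
id_cols = ["submission_id"]                  # key column repeated in every chunk
data_cols = [c for c in schema.names if c not in id_cols]

# Write each slice of 200 columns as its own Parquet file.
for i in range(0, len(data_cols), COLUMNS_PER_CHUNK):
    cols = id_cols + data_cols[i : i + COLUMNS_PER_CHUNK]
    table = pq.read_table(WIDE_PARQUET, columns=cols, filesystem=s3)
    chunk_key = f"{CHUNK_DIR}/chunk_{i // COLUMNS_PER_CHUNK:03d}.parquet"
    pq.write_table(table, chunk_key, filesystem=s3)
```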
Then we use Airflow’s Dynamic Task Mapping feature to create as many parallel tasks as there are chunked Parquet files, and each task:
- Loads the file into a temporary table on Redshift using the COPY command
- Processes the data (pivots it to long format) by running an UNPIVOT query
The last step combines the data from each unpivoted long table and populates the final table. The heavy lifting of converting data from wide to long is thus delegated to Redshift.
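A condensed sketch of that flow using Airflow's TaskFlow API and Dynamic Task Mapping (Airflow 2.x with the Amazon provider). The connection id, table names, IAM role placeholder, and the two stand-in question columns are illustrative assumptions, not our production code:

```python
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.redshift_sql import RedshiftSQLHook
import pendulum


def run_redshift_sql(statements: list[str]) -> None:
    # Helper assumed for this sketch: runs a list of SQL statements against the
    # cluster via a hypothetical "redshift_default" Airflow connection.
    RedshiftSQLHook(redshift_conn_id="redshift_default").run(statements, autocommit=True)


@dag(schedule="*/30 * * * *", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def load_survey_data():
    @task
    def list_chunk_files() -> list[str]:
        # In practice this returns the chunked Parquet files written by the
        # column-slicing step described above.
        return [
            f"s3://surveystream-staging/form_data_chunks/chunk_{i:03d}.parquet"
            for i in range(28)
        ]

    @task
    def copy_and_unpivot(chunk_file: str) -> str:
        """COPY one column-chunk into a wide staging table, then UNPIVOT it."""
        chunk_id = chunk_file.rsplit("chunk_", 1)[-1].removesuffix(".parquet")
        wide_table = f"staging.form_data_wide_{chunk_id}"
        long_table = f"staging.form_data_long_{chunk_id}"
        run_redshift_sql([
            # The column DDL is generated from the chunk's Parquet schema in our
            # pipeline; two stand-in questions represent ~200 VARCHAR columns.
            f"CREATE TABLE {wide_table} "
            "(submission_id VARCHAR, q1_age VARCHAR, q2_district VARCHAR)",
            f"COPY {wide_table} FROM '{chunk_file}' "
            "IAM_ROLE '<copy-role-arn>' FORMAT AS PARQUET",
            f"CREATE TABLE {long_table} AS "
            f"SELECT submission_id, question, response FROM {wide_table} "
            "UNPIVOT (response FOR question IN (q1_age, q2_district))",
        ])
        return long_table

    @task
    def populate_final_table(long_tables: list[str]) -> None:
        # Combine every unpivoted chunk into the final long table; the staging
        # tables are dropped afterwards (not shown).
        run_redshift_sql(
            [f"INSERT INTO survey_data.responses SELECT * FROM {t}" for t in long_tables]
        )

    # Dynamic Task Mapping: one parallel task instance per chunked Parquet file.
    long_tables = copy_and_unpivot.expand(chunk_file=list_chunk_files())
    populate_final_table(long_tables)


load_survey_data()
```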

Results
Following the new design, we saw significant improvements in run time for large surveys. For a survey with ~1 GB of data, 5,500+ columns, and 110,000+ rows:
|  | Time taken | Impact on Airflow | Impact on Redshift |
|---|---|---|---|
| Old | ~50 minutes | Very high load, causing memory saturation as the whole unpivoting process multiplies memory usage | Low: the only query running is the COPY command loading the whole dataset into the database |
| New | <10 minutes | Low memory load, but more tasks are running in parallel | High load, as multiple queries are running in parallel |
For smaller surveys with less than 100 MB of data, pandas chunk loading performs better, so we adopted a hybrid approach that chooses the strategy based on the size of the data to be processed.
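One way to implement that routing is a small branching task that checks the raw extract's size in S3; the threshold value, task ids, and S3Hook usage below are illustrative assumptions, not our exact implementation:

```python
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024   # ~100 MB cutoff

@task.branch
def choose_loading_strategy(bucket: str, key: str) -> str:
    """Route small extracts to the pandas path and large ones to the
    Redshift UNPIVOT path, based on the raw file's size in S3."""
    size = S3Hook().head_object(key=key, bucket_name=bucket)["ContentLength"]
    # The returned strings are the (hypothetical) task ids of the two branches.
    return "pandas_chunk_load" if size < SIZE_THRESHOLD_BYTES else "slice_into_column_chunks"
```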
Learnings
Through this work, we identified the following learnings and areas for further improvement:
- Store in Parquet from the start: Writing data in Parquet format from the outset turned out to be more efficient than converting it later during loading tasks. This change, though minor, improved performance and simplified data management.
- Better data typing can reduce storage: We observed that loading the data with all columns cast as VARCHAR consumes a lot of storage in Redshift. We drop these wide tables after loading data into the final tables, so the spike is temporary, but to address it we are considering more appropriate data typing.
- Need to implement proper roll-back and retry strategies: Our approach of processing data in chunks led to a series of dependent queries. To ensure atomicity, we designed the pipeline so that all chunked operations must succeed before committing any changes to the database. If any step fails, the entire operation is rolled back and retried - ensuring data consistency and preventing partial writes (see the sketch after this list).
- Redshift memory limits: We encountered memory limit errors during query execution in Redshift. To mitigate this, we wrapped the query execution in try-except blocks, allowing us to handle these errors gracefully.
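As a rough sketch of that rollback-and-retry behaviour, assuming a direct connection through the redshift_connector driver (the connection details, function name, and retry policy are placeholders):

```python
import redshift_connector


def run_chunk_statements(statements: list[str], max_retries: int = 2) -> None:
    """Run a series of dependent queries as a single transaction: commit only
    if every statement succeeds, otherwise roll back and retry."""
    for attempt in range(max_retries + 1):
        conn = redshift_connector.connect(
            host="<cluster-endpoint>",          # placeholder credentials
            database="surveystream",
            user="<user>",
            password="<password>",
        )
        try:
            cursor = conn.cursor()
            for sql in statements:
                cursor.execute(sql)
            conn.commit()       # all chunked operations succeeded
            return
        except Exception:
            conn.rollback()     # no partial writes reach the final tables
            if attempt == max_retries:
                raise
        finally:
            conn.close()
```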
Conclusion
Our primary objective was to improve the efficiency and speed of data processing without incurring additional costs. By shifting the heavy data transformation tasks to Redshift and executing them in parallel on chunks of columns, we significantly reduced processing time for large surveys. This redesign allowed us to meet our pipeline SLAs while maintaining a lean architecture - leveraging existing infrastructure instead of scaling up or adding new components.
References
1: Apache Airflow is an open-source tool used to author, schedule, and manage workflows, commonly used in data engineering to orchestrate the flow of data.