Creating transformational impact: Lacuna Fund enables local AI solutions by filling data gaps in Africa
Imagine a world where cutting-edge machine learning and AI technologies are not only accessible in low- and middle-income contexts (LMICs) but are also designed by and for those communities. This is the vision of Lacuna Fund, a ground-breaking initiative that empowers data scientists, researchers and social entrepreneurs to harness the power of artificial intelligence (AI) and create solutions tailored to their unique needs.
At a time when AI has the potential to transform every facet of life, Lacuna Fund is working to ensure that the benefits reach everyone, especially in LMIC contexts. The journey began in 2020 when Canada’s International Development Research Centre (IDRC), The Rockefeller Foundation, Google.org and GIZ on behalf of Germany’s Federal Ministry for Economic Cooperation and Development (BMZ) joined forces to create the world’s first collaborative effort to fill critical AI data gaps.
Data forms the building blocks of AI applications. The better the data, the better the AI solution for a real-world problem. However, sometimes the required data fall short; they either do not exist, are outdated or are unrepresentative of underserved populations. This can lead to a biased, inaccurate AI tool. Robust, representative machine learning datasets are particularly absent in LMIC contexts around the world.
That’s where Lacuna Fund comes in. Its thriving, multi-stakeholder engagement involving technical experts, thought leaders and end users works to identify, fund and support high-quality projects in the domains of agriculture, language, health and climate.
Research highlights
- The use of Lacuna-funded datasets is creating transformational impact in projects aimed at improving access to language technologies and making existing agricultural yield datasets more accurate and useful.
- The KenCorpus team developed rich textual and speech data resources for Kenyan languages Kiswahili, Luhya (including dialects Lumarachi, Logooli and Lubukusu) and Dholuo. The dataset has since been downloaded an incredible 250,000 times.
- The High-Accuracy Maize Plot Location and Yield Dataset in East Africa project improved the usability of the most expansive East African crop-cut yield estimation datasets by correcting the geolocations of the fields.
Published datasets by and for local communities
Lacuna Fund addressed gaps in agriculture and language in its first two calls for proposals. Ultimately, 16 projects received funding: 10 for language and six for agriculture. The final funding amount disbursed across all awardees was USD2.1 million. Today, these awardees are concluding their projects, with 13 datasets available and openly accessible so far.
Agriculture awardees produced training datasets across sub-Saharan Africa to support various agricultural needs. Language awardees produced text and speech datasets for natural language processing technologies in East, West and Southern Africa.
Since the first two rounds of funding, Lacuna Fund has selected an additional 42 projects to receive a total of USD7.7 million across agriculture, language, health and climate. These projects range from expanding clean energy access to linking health impacts with environmental and socioeconomic data.
Filling data gaps and creating transformational impact
The utilization of Lacuna-funded datasets is creating transformational impact in project teams’ communities around the globe. Two examples below highlight grantees that are improving access to language technologies and making existing agricultural yield datasets more accurate and useful.
The KenCorpus team developed rich textual and speech data resources for Kenyan languages Kiswahili, Luhya (including dialects Lumarachi, Logooli and Lubukusu) and Dholuo. The dataset has since been downloaded an incredible 250,000 times.
Researchers collected data from Indigenous stories and narratives from student compositions, native language media stations and publishers, with 4,442 texts collected in total. In addition, they collected approximately 176 hours of spontaneous speech data. The team also translated Dholuo and Luhya texts into Kiswahili for machine translation (12,400 sentences translated), developed a Kiswahili question-and-answer (QA) dataset for machine comprehension (7,526 pairs developed) and annotated Dholuo and Luhya texts with part-of-speech (POS) tags (143,000 words tagged). The team also created a lexicon-phone dictionary of 31,759 words.
In 2021, the team held a capacity-building workshop, training 22 research assistants on POS annotation, translation, transcription and QA annotation. The research assistants then recruited and trained others. The KenCorpus team was also able to offer mentorship for 18 undergraduate and six post-graduate students studying linguistics and natural language processing.
The team’s collection methods and annotation schemes are pace setters and can be emulated by other data collectors and annotators in Africa. Their annotation tag set was designed and agreed upon by linguists, offering a great reference point for researchers. The fruits of their labor can be expanded and replicated, opening the doors for more representative datasets for not only Kenyan languages, but also other indigenous African languages.
The High-Accuracy Maize Plot Location and Yield Dataset in East Africa project improved the usability of the most expansive East African crop-cut yield estimation datasets by correcting the geolocations of the fields. This work was conducted by a team from Zindi and the Big Data Platform of the Consultative Group on International Agricultural Research. The initial dataset was collected by the non-profit One Acre Fund from 2015 to 2019. It covers major crop-producing regions in Kenya, Rwanda and Tanzania, and contains approximately 18,000 crop-cut yield data points for maize.
The utility of existing crop-cut yield datasets is often compromised due to location inaccuracies. In fact, the team estimated that out of the 18,481 data points they started with, only 20 percent had accurate field center points. Oftentimes, reported locations can be found on adjacent roads, nearby homes and other locations outside of the crop field. To address this, the team developed a tool to correct the geolocations. They used this tool to annotate 1,700 data points. For each point, they downloaded four satellite images and 12 multi-spectral images at varying time slices, totaling 27,000 satellite images.
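The figures in this paragraph can be checked with some quick arithmetic. The sketch below uses only the numbers quoted above. It assumes that both the satellite and the multi-spectral images count toward the reported total. The result, 27,200, is consistent with the rounded figure of 27,000.

```python
# Figures quoted in the article.
total_points = 18_481        # crop-cut data points the team started with
accurate_share = 0.20        # estimated share with accurate field center points
annotated_points = 1_700     # points annotated with the correction tool
images_per_point = 4 + 12    # satellite + multi-spectral images (assumed to be summed)

accurate_points = round(total_points * accurate_share)
needing_correction = total_points - accurate_points
total_images = annotated_points * images_per_point

print(f"Points with accurate centers: about {accurate_points:,}")
print(f"Points needing correction:    about {needing_correction:,}")
print(f"Images downloaded:            {total_images:,}")
```

In other words, roughly 14,800 of the original data points had reported locations that could not be trusted as field centers.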
The team also invited the public’s input into their project, offering a challenge: Can you design a method to accurately find field locations? Six hundred fifty-six people participated in the challenge, which offered a total of USD10,000 in prizes shared among five winners. The team received 3,377 submissions, with representation from 62 unique countries, and 13 percent of participants were women. The challenge provided a rich diversity of perspectives and the leaderboard helped to vet solutions for accuracy.
Participants also expressed that the competition led to new relationships across the continent. Ultimately, the team created a model that combined the competition’s first and second-place solutions, retrained the model on the full dataset and applied that model to the remaining fields in the dataset.
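The article does not say how the two winning solutions were combined. One common approach is to average the models' predictions, and the sketch below illustrates only that general idea. Everything in it is hypothetical: the function names, the stand-in models, and the assumption that each model predicts an offset from the reported point to the true field center.

```python
from typing import Callable, Sequence, Tuple

# Assumed output format: (dx, dy) shift from the reported point to the field center.
Offset = Tuple[float, float]
Model = Callable[[Sequence[float]], Offset]


def average_ensemble(models: Sequence[Model], features: Sequence[float]) -> Offset:
    """Combine several models by averaging their predicted offsets."""
    predictions = [model(features) for model in models]
    n = len(predictions)
    mean_dx = sum(p[0] for p in predictions) / n
    mean_dy = sum(p[1] for p in predictions) / n
    return (mean_dx, mean_dy)


# Hypothetical stand-ins for the first- and second-place solutions.
def first_place_model(features: Sequence[float]) -> Offset:
    return (12.0, -4.0)


def second_place_model(features: Sequence[float]) -> Offset:
    return (8.0, -6.0)


combined = average_ensemble([first_place_model, second_place_model], [0.1, 0.2])
print(combined)  # (10.0, -5.0)
```

Averaging tends to smooth out the individual errors of each model, which is one reason combined solutions often outperform any single entry on a leaderboard.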
A collaborative process for sustainable impact
Identifying data gaps in underserved regions and mobilizing funding for solutions to urgent problems in a sustainable, inclusive and high-quality manner is no small task. Lacuna Fund’s process and governance structure are collaborative in nature and rooted in a set of guiding principles, allowing it to draw upon the wisdom of a diverse, multi-disciplinary and multi-sectoral group of experts across the globe.
To select high-quality, relevant projects like those of the Zindi and KenCorpus teams, technical advisory panels customized for each call for proposals offer expertise and technical guidance. Lacuna Fund’s steering committee contains a balance of perspectives that provides strategic direction and oversight to ensure its focus, impact and growth. Its funders include a range of development, philanthropic and research institutions.
As the world’s first collaborative effort to directly address the problem of biased and unrepresentative datasets in LMICs, Lacuna Fund supports its grantees and tracks their work and impact throughout the project lifecycle.
It is the goal of each new call for proposals to equip local researchers in LMICs with the resources they need to create machine learning solutions that are scalable and replicable and that address urgent problems in their communities. The benefits of the available Lacuna-funded datasets are already being felt, not least because of the investment that its stakeholders and trailblazing grantees have made in AI’s potential for good.
Check out Lacuna Fund’s Apply page for current requests for proposals. Sign up for their newsletter to stay up to date on awarded projects and newly available datasets. For updates and other news, visit their Twitter and LinkedIn.
Contributors: Selaam Dollisso, Communications Associate, Lacuna Fund and Meridian Institute
With Emma Heth, Project Associate, Lacuna Fund and Meridian Institute; Jennifer Pratt Miles, Project Director, Lacuna Fund and Partner and Practice Director, Meridian Institute; Amy Bray, Zindi Africa; Lilian Wanzare, Maseno University.