IBM Research - Africa

Facility · Johannesburg, South Africa

Research output, citation impact, and the most-cited recent papers from IBM Research - Africa (South Africa). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works

147

Citations

2.1K

h-index

i10-index

Also known as

IBM Research - Africa

Top-cited papers from IBM Research - Africa

Narratives and Counternarratives on Data Sharing in Africa

Rediet Abebe, Kehinde Aruleba, Abeba Birhane, Sara Kingsley +3 more

2021 · 90 citations · doi:10.1145/3442188.3445897

As machine learning and data science applications grow ever more prevalent, there is an increased focus on data sharing and open data initiatives, particularly in the context of the African continent. Many argue that data sharing can support research and policy design to alleviate poverty, inequality, and derivative effects in Africa. Despite the fact that the datasets in question are often extracted from African communities, conversations around the challenges of accessing and sharing African data are too often driven by non-African stakeholders. These perspectives frequently employ deficit narratives, often focusing on lack of education, training, and technological resources in the continent as the leading causes of friction in the data ecosystem.

DeepSource: point source detection using deep learning

A Vafaei Sadr, Etienne Vos, Bruce A. Bassett, Zafiirah Hosenie +2 more

2019 · Monthly Notices of the Royal Astronomical Society · 50 citations · doi:10.1093/mnras/stz131

Point source detection at low signal-to-noise ratio (SNR) is challenging for astronomical surveys, particularly in radio interferometry images where the noise is correlated. Machine learning is a promising solution, allowing the development of algorithms tailored to specific telescope arrays and science cases. We present DeepSource – a deep learning solution – that uses convolutional neural networks to achieve these goals. DeepSource enhances the SNR of the sources in the original map and then uses dynamic blob detection to detect sources. Trained and tested on two sets of 500 simulated 1° × 1° MeerKAT images with a total of 300 000 sources, DeepSource is essentially perfect in both purity and completeness down to SNR = 4 and outperforms PyBDSF in all metrics. For uniformly weighted images, it achieves a Purity × Completeness (PC) score at SNR = 3 of 0.73, compared to 0.31 for the best PyBDSF model. For natural weighting, we find a smaller improvement of ∼40 per cent in the PC score at SNR = 3. If instead we ask where either of the purity or completeness first drops to 90 per cent, we find that DeepSource reaches this value at SNR = 3.6 compared to the 4.3 of PyBDSF (natural weighting). A key advantage of DeepSource is that it can learn to optimally trade off purity and completeness for any science case under consideration. Our results show that deep learning is a promising approach to point source detection in astronomical images.
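The Purity × Completeness (PC) score quoted in this abstract can be illustrated with a minimal sketch. It assumes the usual definitions (purity as the fraction of detections that are real sources, completeness as the fraction of real sources that are detected); the abstract does not spell out how detections are matched to true sources, so the function name `purity_completeness` and the counts below are purely illustrative and are not taken from the paper.

```python
def purity_completeness(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (purity, completeness, PC score) from detection counts.

    Assumed definitions (not taken from the abstract):
      purity       = TP / (TP + FP)  -> share of detections that are real
      completeness = TP / (TP + FN)  -> share of real sources recovered
    """
    detected = true_pos + false_pos
    real = true_pos + false_neg
    purity = true_pos / detected if detected else 0.0
    completeness = true_pos / real if real else 0.0
    return purity, completeness, purity * completeness


# Hypothetical counts: 80 matched detections, 20 spurious, 20 missed.
purity, completeness, pc = purity_completeness(true_pos=80, false_pos=20, false_neg=20)
print(f"purity={purity:.2f} completeness={completeness:.2f} PC={pc:.2f}")
```

Because the score is a product, a detector cannot reach a high PC value by maximising only one of the two quantities.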

Hackathons as a means of accelerating scientific discoveries and knowledge transfer

Amel Ghouila, Geoffrey Siwo, Jean-Baka Domelevo Entfellner, Sumir Panji +4 more

2018 · Genome Research · 48 citations · doi:10.1101/gr.228460.117

Scientific research plays a key role in the advancement of human knowledge and pursuit of solutions to important societal challenges. Typically, research occurs within specific institutions where data are generated and subsequently analyzed. Although collaborative science bringing together multiple institutions is now common, in such collaborations the analytical processing of the data is often performed by individual researchers within the team, with only limited internal oversight and critical analysis of the workflow prior to publication. Here, we show how hackathons can be a means of enhancing collaborative science by enabling peer review before results of analyses are published by cross-validating the design of studies or underlying data sets and by driving reproducibility of scientific analyses. Traditionally, in data analysis processes, data generators and bioinformaticians are divided and do not collaborate on analyzing the data. Hackathons are a good strategy to build bridges over the traditional divide and are potentially a great agile extension to the more structured collaborations between multiple investigators and institutions.

Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge

Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi +4 more

2022 · Journal of Artificial Intelligence Research · 39 citations · doi:10.1613/jair.1.13113

Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated datasets like MS-COCO. Often work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems. This article appears in the special track on AI & Society.

Big Data Analytics and Its Role to Support Groundwater Management in the Southern African Development Community

Zaheed Gaffoor, Kevin Pietersen, Nebo Jovanovic, Antoine Bagula +1 more

2020 · Water · 37 citations · doi:10.3390/w12102796

Big data analytics (BDA) is a novel concept focusing on leveraging large volumes of heterogeneous data through advanced analytics to drive information discovery. This paper aims to highlight the potential role BDA can play to improve groundwater management in the Southern African Development Community (SADC) region in Africa. Through a review of the literature, this paper defines the concepts of big data, big data sources in groundwater, big data analytics, big data platforms and framework and how they can be used to support groundwater management in the SADC region. BDA may support groundwater management in the SADC region by filling in data gaps and transforming these data into useful information. In recent times, machine learning and artificial intelligence have stood out as novel tools for data-driven modeling. Managing big data from collection to information delivery requires critical application of selected tools, techniques and methods. Hence, in this paper we present a conceptual framework that can be used to manage the implementation of BDA in a groundwater management context. Then, we highlight challenges limiting the application of BDA, which include technological constraints and institutional barriers. In conclusion, the paper shows that sufficient big data exist in the groundwater domain and that BDA exists to be used in groundwater sciences, thereby providing the basis to further explore data-driven sciences in groundwater management.

Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach

Gciniwe Dlamini, Stéphanie Muller, Rebone L. Meraba, Richard A. Young +3 more

2020 · IEEE Access · 37 citations · doi:10.1109/access.2020.3031387

The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.
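The feature-extraction step described in this abstract, turning a genome sequence into dinucleotide frequencies, can be sketched as follows. This is only an illustration under stated assumptions: it counts overlapping dinucleotides and normalises by the total count, whereas the paper may use a different normalisation (for example, relative to single-nucleotide frequencies). The function name `dinucleotide_frequencies` and the toy sequence are hypothetical, and the XGBoost classification step is not shown.

```python
from itertools import product

# All 16 possible dinucleotides over the DNA alphabet, in a fixed order.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(sequence: str) -> dict[str, float]:
    """Return normalised counts of overlapping dinucleotides.

    Pairs containing characters outside A/C/G/T (e.g. N) are skipped.
    Assumption: plain normalised counts, which may differ from the
    'relative frequency' definition used in the paper.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


# Toy sequence, not real genome data.
features = dinucleotide_frequencies("ACGTACGT")
print(features["AC"], features["TA"])
```

Each sequence becomes a fixed-length vector of 16 values regardless of genome length, which is what makes it usable as input to a tabular classifier.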

A Comparison of Ensemble and Deep Learning Algorithms to Model Groundwater Levels in a Data-Scarce Aquifer of Southern Africa

Zaheed Gaffoor, Kevin Pietersen, Nebo Jovanovic, Antoine Bagula +3 more

2022 · Hydrology · 36 citations · doi:10.3390/hydrology9070125

Machine learning and deep learning have demonstrated usefulness in modelling various groundwater phenomena. However, these techniques require large amounts of data to develop reliable models. In the Southern African Development Community, groundwater datasets are generally poorly developed. Hence, the question arises as to whether machine learning can be a reliable tool to support groundwater management in the data-scarce environments of Southern Africa. This study tests two machine learning algorithms, a gradient-boosted decision tree (GBDT) and a long short-term memory neural network (LSTM-NN), to model groundwater level (GWL) changes in the Shire Valley Alluvial Aquifer. Using data from two boreholes, Ngabu (sample size = 96) and Nsanje (sample size = 45), we model two predictive scenarios: (I) predicting the change in the current month’s groundwater level, and (II) predicting the change in the following month’s groundwater level. For the Ngabu borehole, GBDT achieved R² scores of 0.19 and 0.14, while LSTM achieved R² scores of 0.30 and 0.30, in experiments I and II, respectively. For the Nsanje borehole, GBDT achieved R² of −0.04 and −0.21, while LSTM achieved R² scores of 0.03 and −0.15, in experiments I and II, respectively. The results illustrate that LSTM performs better than the GBDT model, especially regarding slightly greater time series and extreme GWL changes. However, closer inspection reveals that where datasets are relatively small (e.g., Nsanje), the GBDT model may be more efficient, considering the cost required to tune, train, and test the LSTM model. Assessing the full spectrum of results, we concluded that these small sample sizes might not be sufficient to develop generalised and reliable machine learning models.
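The negative scores reported for the Nsanje borehole are easier to read with the standard definition of the coefficient of determination in mind: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that standard definition (the abstract does not state how the scores were computed); the helper name `r_squared` and the numbers are illustrative only.

```python
from collections.abc import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0]           # hypothetical groundwater level changes
print(r_squared(observed, [1.0, 2.0, 3.0]))  # perfect fit
print(r_squared(observed, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r_squared(observed, [3.0, 2.0, 1.0]))  # worse than predicting the mean
```

Under this definition a score of 0 corresponds to always predicting the mean of the observations, so a negative score means the model did worse than that baseline.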

Machine learning based estimation of Ozone using spatio-temporal data from air quality monitoring stations

Tapiwa M. Chiwewe, Ditsela Jeofrey

2016 · 34 citations · doi:10.1109/indin.2016.7819134

In this paper, models are created to predict the levels of ground level Ozone at particular locations based on the cross-correlation and spatial-correlation of different air pollutants whose readings are obtained from several different air quality monitoring stations in Gauteng province, South Africa. The province includes the City of Johannesburg, which is on the cusp of being one of the world's megacities and is currently the most polluted city in the country. Datasets spanning several years collected from the monitoring stations and transmitted through the Internet-of-Things are used. Big data analytics and cognitive computing are used to get insights on the data and create models that can estimate levels of Ozone without requiring massive computational power or intense numerical analysis.

Fast Convergence Cooperative Dynamic Spectrum Access for Cognitive Radio Networks

Tapiwa M. Chiwewe, Gerhard P. Hancke

2017 · IEEE Transactions on Industrial Informatics · 30 citations · doi:10.1109/tii.2017.2783973

Cognitive radio and dynamic spectrum access can reform the way that radiofrequency spectrum is accessed. Problems of spectrum scarcity, coexistence, and unreliable wireless communication that affect industrial wireless networks can be addressed. In this paper, a game theoretic dynamic spectrum access algorithm that improves upon a hedonic coalition formation algorithm for spectrum sensing and access is presented. The modified algorithm is tailored for faster convergence and scalability and makes use of a novel simultaneous multichannel sensing and access technique. Results to demonstrate the performance improvements of the adapted algorithm are presented and the use of different decision rules is investigated, revealing that a conservative decision rule for exploiting spectrum opportunities performs better than an aggressive decision rule in most scenarios. The algorithm that was developed could be a key enabler for future cognitive radio networks.

Skin Tone Analysis for Representation in Educational Materials (STAR-ED) using machine learning

Girmaw Abebe Tadesse, Celia Cintas, Kush R. Varshney, Peter Staar +4 more

2023 · npj Digital Medicine · 25 citations · doi:10.1038/s41746-023-00881-0

Images depicting dark skin tones are significantly underrepresented in the educational materials used to teach primary care physicians and dermatologists to recognize skin diseases. This could contribute to disparities in skin disease diagnosis across different racial groups. Previously, domain experts have manually assessed textbooks to estimate the diversity in skin images. Manual assessment does not scale to many educational materials and introduces human errors. To automate this process, we present the Skin Tone Analysis for Representation in EDucational materials (STAR-ED) framework, which assesses skin tone representation in medical education materials using machine learning. Given a document (e.g., a textbook in .pdf), STAR-ED applies content parsing to extract text, images, and table entities in a structured format. Next, it identifies images containing skin, segments the skin-containing portions of those images, and estimates the skin tone using machine learning. STAR-ED was developed using the Fitzpatrick17k dataset. We then externally tested STAR-ED on four commonly used medical textbooks. Results show strong performance in detecting skin images (0.96 ± 0.02 AUROC and 0.90 ± 0.06 F1 score) and classifying skin tones (0.87 ± 0.01 AUROC and 0.91 ± 0.00 F1 score). STAR-ED quantifies the imbalanced representation of skin tones in four medical textbooks: brown and black skin tones (Fitzpatrick V-VI) images constitute only 10.5% of all skin images. We envision this technology as a tool for medical educators, publishers, and practitioners to assess skin tone diversity in their educational materials.

Designing Digital Peer Assessment for Second Language Learning in Low Resource Learning Settings

Maletšabisa Molapo, Chané Moodley, Ismail Yunus Akhalwaya, Toby Kurien +2 more

2019 · 23 citations · doi:10.1145/3330430.3333626

In low-resource, over-burdened schools and learning centres, peer assessment systems promise significant practical and pedagogical benefits. Many of these benefits have been realised in contexts like massive open online courses (MOOCs) and university classrooms which share a specific trait with low-resource schools: high learner-teacher ratios. However, the constraints and considerations for designing and deploying peer assessment systems in low-resource classrooms have not been well-researched and understood, especially for high school. In this paper, we present the design of a peer assessment system for second language learning (English as a Second Language) for high school learners in South Africa. We report findings from multiple studies investigating qualitative and quantitative aspects of peer review, as well as the contextual factors that influence the viability of peer assessment systems in these contexts.

Detecting Adversarial Attacks via Subset Scanning of Autoencoder Activations and Reconstruction Error

Celia Cintas, Skyler Speakman, Victor Akinwande, William Ogallo +3 more

2020 · 22 citations · doi:10.24963/ijcai.2020/122

Reliably detecting attacks in a given set of inputs is of high practical relevance because of the vulnerability of neural networks to adversarial examples. These altered inputs create a security risk in applications with real-world consequences, such as self-driving cars, robotics and financial services. We propose an unsupervised method for detecting adversarial attacks in inner layers of autoencoder (AE) networks by maximizing a non-parametric measure of anomalous node activations. Previous work in this space has shown AE networks can detect anomalous images by thresholding the reconstruction error produced by the final layer. Furthermore, other detection methods rely on data augmentation or specialized training techniques which must be asserted before training time. In contrast, we use subset scanning methods from the anomalous pattern detection domain to enhance detection power without labeled examples of the noise, retraining or data augmentation methods. In addition to an anomalous “score” our proposed method also returns the subset of nodes within the AE network that contributed to that score. This will allow future work to pivot from detection to visualisation and explainability. Our scanning approach shows consistently higher detection power than existing detection methods across several adversarial noise models and a wide range of perturbation strengths.

Toward Quantified Small-Scale Farms in Africa

Kala Fleming, Peninah Waweru, Muuo Wambua, Elizabeth Ondula +1 more

2016 · IEEE Internet Computing · 22 citations · doi:10.1109/mic.2016.58

Developing and implementing a frugal Internet of Things can improve the monitoring and management of small-scale farms. The ultimate goal of such work is to bring farmers closer to the market and improve the prospects for food security across Africa.

An integrative analysis of small molecule transcriptional responses in the human malaria parasite Plasmodium falciparum

Geoffrey Siwo, Roger S. Smith, Asako Tan, Katrina A. Button-Simons +2 more

2015 · BMC Genomics · 21 citations · doi:10.1186/s12864-015-2165-1

Transcriptional responses to small molecules can provide insights into drug mode of action (MOA). The capacity of the human malaria parasite, Plasmodium falciparum, to respond specifically to transcriptional perturbations has been unclear based on past approaches. Here, we present the most extensive profiling to date of the parasite’s transcriptional responsiveness to thirty-one chemically and functionally diverse small molecules. We exposed two laboratory strains of the human malaria parasite P. falciparum to brief treatments of thirty-one chemically and functionally diverse small molecules associated with biological effects across multiple pathways based on various levels of evidence. We investigated the impact of chemical composition and MOA on gene expression similarities that arise between perturbations by various compounds. To determine the target biological pathways for each small molecule, we developed a novel framework for encoding small molecule effects on a spectrum of biological processes or GO functions that are enriched in the differentially expressed genes of a given small molecule perturbation. We find that small molecules associated with similar transcriptional responses contain similar chemical features, and/or have a shared MOA. The approach also revealed complex relationships between drugs and biological pathways that are missed by most existing approaches. For example, the approach was able to partition small molecule responses into drug-specific effects versus non-specific effects. Our work provides a new framework for linking transcriptional responses to drug MOA in P. falciparum and can be generalized for the same purpose in other organisms.

A Generative Machine Learning Approach to RFI Mitigation for Radio Astronomy

Etienne Vos, P. S. Francois Luus, Chris Finlay, Bruce A. Bassett

2019 · 19 citations · doi:10.1109/mlsp.2019.8918820

Radio astronomy is a vital tool for astronomers to study the Universe and has seen a wave of renewed interest and advancement over recent years. Next-generation radio telescope arrays like the SKA, ALMA and VLA are developed to be significantly more sensitive compared to older telescopes, which as a result also makes them more susceptible to radio frequency interference (RFI). This highlights the need for effective RFI mitigation techniques in radio astronomy. We present a machine learning-based RFI mitigation approach that aims to separate RFI-corrupted spectrogram observations into signal of interest and RFI components in an unsupervised manner using a modified generative adversarial network (GAN) framework. We show that this unsupervised source separation approach is able to achieve performance comparable to a fully supervised approach.

Modelling Representative Population Mobility for COVID-19 Spatial Transmission in South Africa

Arminn Potgieter, Inger Fabris‐Rotelli, Zaid Kimmie, Nontembeko Dudeni-Tlhone +4 more

2021 · Frontiers in Big Data · 19 citations · doi:10.3389/fdata.2021.718351

The COVID-19 pandemic starting in the first half of 2020 has changed the lives of everyone across the world. Reduced mobility was essential due to it being the largest impact possible against the spread of the little understood SARS-CoV-2 virus. To understand the spread, a comprehension of human mobility patterns is needed. The use of mobility data in modelling is thus essential to capture the intrinsic spread through the population. It is necessary to determine to what extent mobility data sources convey the same message of mobility within a region. This paper compares different mobility data sources by constructing spatial weight matrices at a variety of spatial resolutions and further compares the results through hierarchical clustering. We consider four methods for constructing spatial weight matrices representing mobility between spatial units, taking into account distance between spatial units as well as spatial covariates. This provides insight for the user into which data provides what type of information and in what situations a particular data source is most useful.

An Empirical Evaluation of Web-Based Fingerprinting

Amin Khademi, Mohammad Zulkernine, Komminist Weldemariam

2015 · IEEE Software · 19 citations · doi:10.1109/ms.2015.77

Adversaries employ sophisticated fingerprinting techniques to identify Web users and record their browsing history and Web interactions. Fingerprinting leaves no footprint on the browser and is invisible to general Web users, who often lack basic knowledge of it. An analysis of fingerprinting techniques and tools revealed the fingerprinting workflow. This helped define fine-grained properties that precisely model the workflow, allowing development of a client-side fingerprinting-detection tool. This article is part of a special issue on Security and Privacy on the Web.

Cancer–malaria: hidden connections

Akpéli Nordor, Dominique Bellet, Geoffrey Siwo

2018 · Open Biology · 17 citations · doi:10.1098/rsob.180127

Cancer and malaria exemplify two maladies historically assigned to separated research spaces. Cancer, on the one hand, ranks among the top priorities in the research agenda of developed countries. Its rise is mostly explained by the ageing of these populations and linked to environment and lifestyle. Malaria, on the other hand, represents a major health burden for developing countries in the Southern Hemisphere. These two diseases also belong to separate fields of medicine: non-communicable diseases for cancer and communicable diseases for malaria.

Body shape: Implications in the study of obesity and related traits

Pablo Navarro, Virgínia Ramallo, Celia Cintas, Anahí Ruderman +4 more

2019 · American Journal of Human Biology · 15 citations · doi:10.1002/ajhb.23323

OBJECTIVES: The diagnosis and treatment of obesity are usually based on traditional anthropometric variables including weight, height, and several body perimeters. Here we present a three-dimensional (3D) image-based computational approach aimed to capture the distribution of abdominal adipose tissue as an aspect of shape rather than a relationship among classical anthropometric measures. METHODS: A morphometric approach based on landmarks and semilandmarks placed upon the 3D torso surface was performed in order to quantify abdominal adiposity shape variation and its relation to classical indices. Specifically, we analyzed sets of body cross-sectional circumferences, collectively defining each, along with anthropometric data taken on 112 volunteers. Principal Component Analysis (PCA) was performed on 250 circumferences located along the abdominal region of each volunteer. An analysis of covariance model was used to compare shape variables (PCs) against anthropometric data (weight, height, and waist and hip circumferences). RESULTS: The observed shape patterns were mainly related to nutritional status, followed by sexual dimorphism. PC1 (12.5%) and PC2 (7.5%) represented 20% of the total variation. In PCAs calculated independently by sex, linear regression analyses provide statistically significant associations between PC1 and the three classical indexes: body mass index, waist-to-height ratio, and waist-hip ratio. CONCLUSION: Shape indicators predict well the behavior of classical markers, but also evaluate 3D and geometric features with more accuracy as related to the body shape under study. This approach also facilitates diagnosis and follow-up of therapies by using accessible 3D technology.

Application of Machine Learning Techniques In Forecasting Groundwater Levels in the Grootfontein Aquifer

Yolanda Kanyama, Ritesh Ajoodha, Helen Seyler, Ndivhuwo Makondo +1 more

2020 · 15 citations · doi:10.1109/imitec50163.2020.9334142

In this paper, we attempt to provide a data-driven solution to model groundwater levels in the Grootfontein Aquifer in the North West Province of South Africa by testing several predictive models. Groundwater plays a crucial role in supplying water to a significant part of the population for agricultural, industrial, environmental and/or domestic use. Recent advancements in data analytics and the analysis of large data sets have allowed the production of powerful predictive models. Five different data-driven techniques, namely support vector regression, gradient boosting trees, decision trees, random forest regression and multilayer feed-forward neural networks, were applied to predict groundwater levels. Modelling was carried out for four boreholes located in the Grootfontein dolomite aquifer considering discharge, rainfall and temperature as model inputs. Five site-specific models were developed for each borehole. Model performance was evaluated using coefficient of determination and root mean squared error. Comparison of goodness of fit revealed that data-driven methods can indeed capture the trend of water level fluctuations in the aquifer sufficiently, with the GB algorithm performing better than other algorithms in both the training and verification stages. Whilst the models performed adequately when predicting groundwater level on a monthly basis for 36 months, further investigation is needed towards determining their efficacy in longer term projections to assist in the decision making process of sustainable groundwater use. This paper provides the following contributions: (a) a ranking of the attributes according to their mutual information (MI); (b) a reference for model selection; and (c) a predictive model to forecast groundwater levels in the Grootfontein aquifer.
