Center for Scalable Data Analytics and Artificial Intelligence
University · Dresden, Saxony, Germany
Research output, citation impact, and the most-cited recent papers from Center for Scalable Data Analytics and Artificial Intelligence (Germany). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Center for Scalable Data Analytics and Artificial Intelligence
In recent years, artificial intelligence (AI) has deeply impacted various fields, including Earth system sciences, by improving weather forecasting, model emulation, parameter estimation, and the prediction of extreme events. The latter comes with specific challenges, such as developing accurate predictors from noisy, heterogeneous, small sample sizes and data with limited annotations. This paper reviews how AI is being used to analyze extreme climate events (like floods, droughts, wildfires, and heatwaves), highlighting the importance of creating accurate, transparent, and reliable AI models. We discuss the hurdles of dealing with limited data, integrating real-time information, and deploying understandable models, all crucial steps for gaining stakeholder trust and meeting regulatory needs. We provide an overview of how AI can help identify and explain extreme events more effectively, improving disaster response and communication. We emphasize the need for collaboration across different fields to create AI solutions that are practical, understandable, and trustworthy to enhance disaster readiness and risk reduction. Artificial Intelligence is transforming the study of extreme climate events like floods, droughts, and wildfires, helping to overcome challenges such as limited data and real-time integration. This review article highlights the need for transparent, reliable AI models to improve disaster response, risk communication and stakeholder trust.
Cell size and cell count are adaptively regulated and intimately linked to growth and function. Yet, despite their widespread relevance, the relation between cell size and count has never been formally examined over the whole human body. Here, we compile a comprehensive dataset of cell size and count over all major cell types, with data drawn from >1,500 published sources. We consider the body of a representative male (70 kg), which allows further estimates of a female (60 kg) and 10-y-old child (32 kg). We build a hierarchical interface for the cellular organization of the body, giving easy access to data, methods, and sources (https://humancelltreemap.mis.mpg.de/). In total, we estimate total body counts of ≈36 trillion cells in the male, ≈28 trillion in the female, and ≈17 trillion in the child. These data reveal a surprising inverse relation between cell size and count, implying a trade-off between these variables, such that all cells within a given logarithmic size class contribute an equal fraction to the body's total cellular biomass. We also find that the coefficient of variation is approximately independent of mean cell size, implying the existence of cell-size regulation across cell types. Our data serve to establish a holistic quantitative framework for the cells of the human body, and highlight large-scale patterns in cell biology.
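The stated trade-off can be written out as a short worked relation. This is an idealized sketch, not the paper's own derivation: the symbols $N_k$ (cell count in logarithmic size class $k$), $m_k$ (typical cell mass in that class), and $B_k$ (biomass of that class) are introduced here for illustration only.

```latex
% Assumption: every logarithmic size class k contributes the same biomass B.
\[
  B_k = N_k \, m_k = B \quad \text{for all } k
  \qquad\Longrightarrow\qquad
  N_k = \frac{B}{m_k} \;\propto\; m_k^{-1}
\]
% Taking logarithms gives a line of slope -1 in a log-log plot of count versus size:
\[
  \log N_k = \log B - \log m_k
\]
```

Under this idealization, a tenfold increase in typical cell mass corresponds to a tenfold decrease in cell count, which is the inverse relation the abstract describes.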
Interpretable Machine Learning (IML) has rapidly advanced in recent years, offering new opportunities to improve our understanding of the complex Earth system. IML goes beyond conventional machine learning by not only making predictions but also seeking to elucidate the reasoning behind those predictions. The combination of predictive power and enhanced transparency makes IML a promising approach for uncovering relationships in data that may be overlooked by traditional analysis. Despite its potential, the broader implications for the field have yet to be fully appreciated. Meanwhile, the rapid proliferation of IML, still in its early stages, has been accompanied by instances of careless application. In response to these challenges, this paper focuses on how IML can effectively and appropriately aid geoscientists in advancing process understanding, an area that is often underexplored in more technical discussions of IML. Specifically, we identify pragmatic application scenarios for IML in typical geoscientific studies, such as quantifying relationships in specific contexts, generating hypotheses about potential mechanisms, and evaluating process-based models. Moreover, we present a general and practical workflow for using IML to address specific research questions. In particular, we identify several critical and common pitfalls in the use of IML that can lead to misleading conclusions, and propose corresponding good practices. Our goal is to facilitate a broader, yet more careful and thoughtful integration of IML into Earth science research, positioning it as a valuable data science tool capable of enhancing our current understanding of the Earth system.
Streamflow forecasting is crucial in water planning and management. Physically-based hydrological models have been used for a long time in these fields, but improving forecast quality is still an active area of research. Recently, some artificial neural networks have been found to be effective in simulating and predicting short-term streamflow. In this study, we examine the reliability of a Long Short-Term Memory (LSTM) deep learning model in predicting streamflow for lead times of up to ten days over a Canadian catchment. The performance of the LSTM model is compared to that of a process-based distributed hydrological model, with both models using the same weather ensemble forecasts. Furthermore, the LSTM's ability to integrate observed streamflow on the forecast issue date is compared to the data assimilation process required for the hydrological model to reduce initial state biases. Results indicate that the streamflows forecasted by the LSTM model are more reliable and accurate for lead times up to 7 and 9 days, respectively. Additionally, it is shown that the LSTM model using recent observed flows as a predictor can forecast flows with smaller errors in the first forecasting days without requiring an explicit data assimilation step: across all forecast issue dates, the median mean absolute error (MAE) for the first day of lead time is 25 m³/s for the LSTM model, compared to 115 m³/s for the assimilated hydrological model.
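The reported metric, a median of per-issue-date MAE values, can be sketched as follows. This is a minimal illustration, not the study's code: it assumes that the MAE for one issue date is taken over the ensemble members against a single observed flow, and the function names and sample numbers are hypothetical.

```python
from statistics import median


def ensemble_mae(members, observed_flow):
    """Mean absolute error of ensemble members against one observed flow (m^3/s).

    Assumption: the per-issue-date MAE averages over ensemble members.
    """
    if not members:
        raise ValueError("at least one ensemble member is required")
    return sum(abs(m - observed_flow) for m in members) / len(members)


def median_mae_across_issue_dates(forecasts):
    """Median of the per-issue-date MAE values.

    `forecasts` is a list of (members, observed_flow) pairs, one per issue date.
    """
    return median(ensemble_mae(members, obs) for members, obs in forecasts)


# Hypothetical day-1 forecasts for three issue dates (values in m^3/s).
day1_forecasts = [
    ([10.0, 20.0], 15.0),   # MAE = 5.0
    ([101.0, 99.0], 100.0), # MAE = 1.0
    ([47.0, 53.0], 50.0),   # MAE = 3.0
]
print(median_mae_across_issue_dates(day1_forecasts))
```

Taking the median rather than the mean across issue dates keeps a few badly missed flood peaks from dominating the summary value.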
Drought and heat events in Europe are becoming increasingly frequent due to human-induced climate change, impacting both human well-being and ecosystem functioning. The intensity and effects of these events vary across the continent, making it crucial for decision-makers to understand spatial variability in drought impacts. Data on drought-related damage are currently dispersed across scientific publications, government reports, and media outlets. This study consolidates data on drought and heat damage in European forests from 2018 to 2022, using Europe-wide datasets including those related to crown defoliation, insect damage, burnt forest areas, and tree cover loss. The data, covering 16 European countries, were analysed across four regions, northern, central, Alpine, and southern, and compared with a reference period from 2010 to 2014. Findings reveal that forests in all zones experienced reduced vitality due to drought and elevated temperatures, with varying severity. Central Europe showed the highest vulnerability, impacting both coniferous and deciduous trees. The southern zone, while affected by tree cover loss, demonstrated greater resilience, likely due to historical drought exposure. The northern zone is experiencing emerging impacts less severely, possibly due to site-adapted boreal species, while the Alpine zone showed minimal impact, suggesting a protective effect of altitude. Key trends include (1) significant tree cover loss in the northern, central, and southern zones; (2) high damage levels despite 2021 being an average year, indicating lasting effects from previous years; (3) notable challenges in the central zone and in Sweden due to bark beetle infestations; and (4) no increase in wildfire severity in southern Europe despite ongoing challenges.
Based on this assessment, we conclude that (i) European forests are highly vulnerable to drought and heat, with even resilient ecosystems at risk of severe damage; (ii) tailored strategies are essential to mitigate climate change impacts on European forests, incorporating regional differences in forest damage and resilience; and (iii) effective management requires harmonised data collection and enhanced monitoring to address future challenges comprehensively.
Until recently the application of artificial intelligence (AI) in precision oncology was confined to activities in drug development and had limited impact on the personalisation of therapy. Now, a number of approaches have been proposed for the personalisation of drug and cell therapies with AI applied to therapy design, planning and delivery at the patient's bedside. Some drug and cell-based therapies are already tuneable to the individual to optimise efficacy, to reduce toxicity, to adapt the dosing regime, to design combination therapy approaches and, preclinically, even to personalise the receptor design of cell therapies. Developments in AI-based healthcare are accelerating through the adoption of foundation models, and generalist medical AI models have been proposed. The application of these approaches in therapy design is already being explored and realistic short-term advances include the application to the personalised design and delivery of drugs and cell therapies. With this pace of development, the limiting step to adoption will likely be the capacity and appropriateness of regulatory frameworks. This article explores emerging concepts and new ideas for the regulation of AI-enabled personalised cancer therapies in the context of existing and in development governance frameworks.
Model calibration is the procedure of finding model settings such that simulated model outputs best match the observed data. Model calibration is necessary when the model parameters cannot be measured directly, as is the case for a wide range of environmental models whose parameters conceptually describe upscaled and effective physical processes. Model calibration is therefore an important step of environmental modeling, as the model might otherwise provide random outputs if never compared to a ground truth. Model calibration itself is often referred to as an art because of the many intertwined steps and decisions required before a calibration can be carried out or regarded as successful. This work provides a general guide specifying which steps a modeler needs to undertake, how to diagnose the success of each step, and how to identify the right action to revise steps that were not successful. The procedure is formalized into ten iterative steps generally appearing in calibration experiments. Each step of this “calibration life cycle” is either illustrated with an exemplary calibration experiment or accompanied by an explicit checklist the modeler can follow. These ten strategies are: (1) using sensitivity information to guide the calibration, (2) handling of parameters with constraints, (3) handling of data ranging orders of magnitude, (4) choosing the data to base the calibration on, (5) presenting various methods to sample model parameters, (6) finding appropriate parameter ranges, (7) choosing objective functions, (8) selecting a calibration algorithm, (9) determining the success and quality of a multi-objective calibration, and (10) providing a checklist to diagnose calibration performance using ideas introduced in the previous steps.
The formal definition of strategies throughout the calibration process provides an overview while shedding light on the connections between these main ingredients of calibrating an environmental model, and will therefore help novice modelers in particular to succeed.
Estimating river flood risks under climate change is challenging, largely due to the interacting and combined influences of various flood-generating drivers. However, a more detailed quantitative analysis of such compounding effects and the implications of their interplay remains underexplored on a large scale. Here, we use explainable machine learning to disentangle compounding effects between drivers and quantify their importance for different flood magnitudes across thousands of catchments worldwide. Our findings demonstrate the ubiquity of compounding effects in many floods. Their importance often increases with flood magnitude, but the strength of this increase varies on the basis of catchment conditions. Traditional flood analysis might underestimate extreme flood hazards in catchments where the contribution of compounding effects strongly varies with flood magnitude. Overall, our study highlights the need to carefully incorporate compounding effects in flood risk assessment to improve estimates of extreme floods.
Digital twins represent a key technology for precision health. Medical digital twins consist of computational models that represent the health state of individual patients over time, enabling optimal therapeutics and forecasting patient prognosis. Many health conditions involve the immune system, so it is crucial to include its key features when designing medical digital twins. The immune response is complex and varies across diseases and patients, and its modelling requires the collective expertise of the clinical, immunology, and computational modelling communities. This review outlines the initial progress on immune digital twins and the various initiatives to facilitate communication between interdisciplinary communities. We also outline the crucial aspects of an immune digital twin design and the prerequisites for its implementation in the clinic. We propose some initial use cases that could serve as "proof of concept" regarding the utility of immune digital technology, focusing on diseases with a very different immune response across spatial and temporal scales (minutes, days, months, years). Lastly, we discuss the use of digital twins in drug discovery and point out emerging challenges that the scientific community needs to collectively overcome to make immune digital twins a reality.
Nearly five billion people use and receive news through social media and there is widespread concern about the negative consequences of misinformation on social media (e.g., election interference, vaccine hesitancy). Despite a burgeoning body of research on misinformation, it remains largely unclear who is susceptible to misinformation and why. To address this, we conducted a systematic individual participant data meta-analysis covering 256,337 unique choices made by 11,561 US-based participants across 31 experiments. Our meta-analysis reveals the impact of key demographic and psychological factors on online misinformation veracity judgments. We also disentangle the ability to discern between true and false news (discrimination ability) from response bias, that is, the tendency to label news as either true (true-news bias) or false (false-news bias). Across all studies, participants were well above-chance accurate for both true (68.51%) and false (67.24%) news headlines. We find that older age, higher analytical thinking skills, and identifying as a Democrat are associated with higher discrimination ability. Additionally, older age and higher analytical thinking skills are associated with a false-news bias (caution). In contrast, ideological congruency (alignment of participants' ideology with news), motivated reflection (higher analytical thinking skills being associated with a greater congruency effect), and self-reported familiarity with news are associated with a true-news bias (naïvety). We also find that experiments on MTurk show higher discrimination ability than those on Lucid. Displaying sources alongside news headlines is associated with improved discrimination ability, with Republicans benefiting more from source display. Our results provide critical insights that can help inform the design of targeted interventions.
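The distinction between discrimination ability and response bias can be illustrated with the classical signal detection measures d′ and criterion c. This is a sketch under a stated assumption: the abstract does not say which model the authors fitted, and applying the formulas to the pooled accuracy percentages is only a rough illustration, not a reproduction of the participant-level meta-analysis. The function name is hypothetical.

```python
from statistics import NormalDist


def discrimination_and_bias(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) from classical signal detection theory.

    hit_rate: share of true headlines judged true.
    false_alarm_rate: share of false headlines judged true.
    Both rates must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(false_alarm_rate)
    criterion = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, criterion


# Pooled accuracies quoted in the abstract: 68.51% (true), 67.24% (false).
hit_rate = 0.6851
false_alarm_rate = 1 - 0.6724  # false headlines wrongly judged true

d_prime, criterion = discrimination_and_bias(hit_rate, false_alarm_rate)
print(f"d' = {d_prime:.3f}, c = {criterion:.3f}")
```

Under this convention, a larger d′ means better separation of true from false headlines, a negative c indicates a tendency to answer "true" (true-news bias), and a positive c indicates a tendency to answer "false" (false-news bias).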
With Knowledge Graphs (KGs) at the center of numerous applications such as recommender systems and question-answering, the need for generalized pipelines to construct and continuously update such KGs is increasing. While the individual steps that are necessary to create KGs from unstructured sources (e.g., text) and structured data sources (e.g., databases) are mostly well researched for their one-shot execution, their adoption for incremental KG updates and the interplay of the individual steps have hardly been investigated in a systematic manner so far. In this work, we first discuss the main graph models for KGs and introduce the major requirements for future KG construction pipelines. Next, we provide an overview of the necessary steps to build high-quality KGs, including cross-cutting topics such as metadata management, ontology development, and quality assurance. We then evaluate the state of the art of KG construction with respect to the introduced requirements for specific popular KGs, as well as some recent tools and strategies for KG construction. Finally, we identify areas in need of further research and improvement.
Clinical research relies on high-quality patient data; however, obtaining big data sets is costly, and access to existing data is often hindered by privacy and regulatory concerns. Synthetic data generation holds the promise of effectively bypassing these boundaries, allowing for simplified data accessibility and the prospect of synthetic control cohorts. We employed two different methodologies of generative artificial intelligence, CTAB-GAN+ and normalizing flows (NFlow), to synthesize patient data derived from 1606 patients with acute myeloid leukemia, a heterogeneous hematological malignancy, who were treated within four multicenter clinical trials. Both generative models accurately captured distributions of demographic, laboratory, molecular and cytogenetic variables, as well as patient outcomes, yielding high performance scores regarding fidelity and usability of both synthetic cohorts (n = 1606 each). Survival analysis demonstrated close resemblance of survival curves between original and synthetic cohorts. Inter-variable relationships were preserved in univariable outcome analysis, enabling explorative analysis in our synthetic data. Additionally, training sample privacy is safeguarded, mitigating possible patient re-identification, which we quantified using Hamming distances. We provide not only a proof-of-concept for synthetic data generation in multimodal clinical data for rare diseases, but also full public access to synthetic data sets to foster further research.
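A Hamming-distance privacy check of the kind mentioned above might look like the following sketch. It assumes that each record is a tuple of discrete (categorical or binned) values and that privacy is assessed by the distance from each synthetic record to its closest training record. The abstract does not describe the exact procedure, so the function names and toy records are hypothetical.

```python
def hamming_distance(record_a, record_b):
    """Number of fields in which two equally long records differ."""
    if len(record_a) != len(record_b):
        raise ValueError("records must have the same number of fields")
    return sum(a != b for a, b in zip(record_a, record_b))


def nearest_training_distances(synthetic, training):
    """For each synthetic record, the Hamming distance to its closest training record.

    A distance of 0 means the synthetic record reproduces a training record exactly.
    """
    return [min(hamming_distance(s, t) for t in training) for s in synthetic]


# Toy records: (sex, age group, mutation status), all values hypothetical.
training = [("F", "60-69", "mutated"), ("M", "40-49", "wild-type")]
synthetic = [("F", "60-69", "wild-type"), ("M", "40-49", "wild-type")]

print(nearest_training_distances(synthetic, training))
```

In this toy example the second synthetic record has distance 0 and would be flagged as an exact copy of a training record.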
Models play a pivotal role in advancing our understanding of Earth's physical nature and environmental systems, aiding in their efficient planning and management. The accuracy and reliability of these models heavily rely on data, which are generally partitioned into subsets for model development and evaluation. Surprisingly, how this partitioning is done is often not justified, even though it determines what model we end up with, how we assess its performance and what decisions we make based on these model outputs. In this study, we shed light on the paramount importance of meticulously considering data partitioning in the model development and evaluation process, and its significant impact on model generalization. We identify flaws in existing data-splitting approaches and propose a forward-looking strategy to effectively confront the “elephant in the room”, leading to improved model generalization capabilities.
In this paper, we present a novel methodology for automatic adaptive weighting of Bayesian Physics-Informed Neural Networks (BPINNs), and we demonstrate that this makes it possible to robustly address multi-objective and multiscale problems. BPINNs are a popular framework for data assimilation, combining the constraints of Uncertainty Quantification (UQ) and Partial Differential Equations (PDEs). The relative weights of the BPINN target distribution terms are directly related to the inherent uncertainty in the respective learning tasks. Yet, they are usually set manually a priori, which can lead to pathological behavior, stability concerns, and conflicts between tasks. These obstacles have deterred the use of BPINNs for inverse problems with multiscale dynamics. The present weighting strategy automatically tunes the weights by considering the multitask nature of the target posterior distribution. We show that this remedies the failure modes of BPINNs and provides efficient exploration of the optimal Pareto front. This leads to better convergence and stability of BPINN training while reducing sampling bias. The determined weights moreover carry information about task uncertainties, reflecting noise levels in the data and adequacy of the PDE model. We demonstrate this in numerical experiments in Sobolev training, comparing the results to an analytically ε-optimal baseline, and in a multiscale Lotka-Volterra inverse problem. Finally, we apply this framework to an inpainting task and an inverse problem involving latent field recovery for incompressible flow in complex geometries.
Graphs are an intuitive way to model complex relationships between real-world data objects. Thus, graph analytics plays an important role in research and industry. As graphs often reflect heterogeneous domain data, their representation requires an expressive data model including the abstraction of graph collections, for example, to analyze communities inside a social network. Further on, answering complex analytical questions about such graphs entails combining multiple analytical operations. To satisfy these requirements, we propose the Extended Property Graph Model, which is semantically rich, schema-free and supports multiple distinct graphs. Based on this representation, it provides declarative and combinable operators to analyze both single graphs and graph collections. Our current implementation is based on the distributed dataflow framework Apache Flink. We present the results of a first experimental study showing the scalability of our implementation on social network data with up to 11 billion edges.
Will protein structure search tools like AlphaFold replace protein sequence search with BLAST? We discuss the promises, using structure search for remote homology detection, and why protein BLAST, as the leading sequence search tool, should strive to incorporate structural information.
T cell responses against the transgene product for different administration routes and against the capsid following intramuscular administration. Moreover, humoral responses against the capsid were mitigated as indicated by delayed IgG2a antibody formation and an increased NAb50. To conclude, insertion of the MyD88-derived peptide into the AAV2 capsid improved early steps of host-vector interaction and reduced innate and adaptive immune responses.
One of the biggest challenges in the development of learning-driven automated driving technologies remains the handling of uncommon, rare events that may not have been encountered in training. Especially when training a model with real driving data, unusual situations, such as emergency braking, may be underrepresented, resulting in a model that lacks robustness in rare events. This study focuses on car-following based on reinforcement learning and demonstrates that existing approaches, trained with real driving data, fail to handle safety-critical situations. Since collecting data representing all kinds of possible car-following events, including safety-critical situations, is challenging, we propose a training environment that harnesses stochastic processes to generate diverse and challenging scenarios. Our experiments show that training with real data can lead to models that collide in safety-critical situations, whereas the proposed model exhibits excellent performance and remains accident-free, comfortable, and string-stable even in extreme scenarios, such as full braking by the leading vehicle. Its robustness is demonstrated by simulating car-following scenarios for various reward function parametrizations and a diverse range of artificial and real leader data that were not included in training and were qualitatively different from the learning data. We further show that conventional reward designs can encourage aggressive behavior when approaching other vehicles. Additionally, we compared the proposed model with classical car-following models and found it to achieve equal or superior results.
Drought and heat events are becoming more frequent in Europe due to human-induced climate change, affecting many aspects of human well-being and ecosystem functioning. However, the intensity of these drought and heat events is not spatially and temporally uniform. Understanding the spatial variability of drought impacts is important for decision makers, supporting both planning and preparations to cope with the changing climatic conditions. Currently, data relating to the damage caused by extended drought episodes are scattered across languages and sources such as scientific publications, governmental reports and the media. In this review paper, we compiled data on damage caused by the drought and heat of 2018 to 2022 in forest ecosystems and related it to large European data sets, providing support for decision making on both the regional and European levels. We partitioned data from 16 European countries into the following regions: Northern, Central, Alpine, and Southern. We focused on drought and heat damage to forests and categorized it as (1) physiological, (2) pest, and (3) fire damage. We identified the following key trends: (1) relative defoliation rates of broadleaves are higher than those of conifers in every country except the Czech Republic; (2) the incidence of wood destroyed by insects is extremely high in Central Europe and Sweden; (3) although forest fires can be related to heat and drought, they are superimposed by other anthropogenic influences; (4) in this period (2018–2022), forests in Central Europe were particularly affected, while forests in the Northern and Alpine zones were less affected, and adaptations to heat and drought can still be observed in the Southern zone; and (5) although 2021 was an average year in several regions, high levels of damage were still observed, indicating strong legacy effects of 2018–2020. We note that the inventory should be continuously updated as new data appear.
Conditional generative models such as DALL-E and Stable Diffusion generate images based on a user-defined text, the prompt. Finding and refining prompts that produce a desired image has become the art of prompt engineering. Generative models do not provide a built-in retrieval model for a user’s information need expressed through prompts. In light of an extensive literature review, we reframe prompt engineering for generative models as interactive text-based retrieval on a novel kind of “infinite index”. We apply these insights for the first time in a case study on image generation for game design with an expert. Finally, we envision how active learning may help to guide the retrieval of generated images.