Laboratoire Interdisciplinaire des Sciences du Numérique
Facility: Gif-sur-Yvette, Île-de-France, France
Research output, citation impact, and the most-cited recent papers from Laboratoire Interdisciplinaire des Sciences du Numérique (France). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Laboratoire Interdisciplinaire des Sciences du Numérique
International audience
In the past ten years, artificial intelligence has made such dramatic progress that it is now seen as a tool of choice to solve environmental issues and, in the first place, greenhouse gas (GHG) emissions. At the same time, the deep learning community began to realize that training models with more and more parameters requires a lot of energy and, as a consequence, generates GHG emissions. To our knowledge, questioning the complete net environmental impacts of AI solutions for the environment (AI for Green), and not only GHG, has never been addressed directly. In this article, we propose to study the possible negative impacts of AI for Green. First, we review the different types of AI impacts; then, we present the different methodologies used to assess those impacts and show how to apply life cycle assessment to AI services. Finally, we discuss how to assess the environmental usefulness of a general AI service and point out the limitations of existing work in AI for Green.
BACKGROUND: Prompt engineering, which focuses on crafting effective prompts for large language models (LLMs), has garnered attention for its ability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and technical language. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. METHODS: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Several data items were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). RESULTS: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set.
Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research. CONCLUSIONS: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.
CodaLab Competitions is an open source web platform designed to help data scientists and research teams to crowd-source the resolution of machine learning problems through the organization of competitions, also called challenges or contests. CodaLab Competitions provides useful features such as multiple phases, results and code submissions, multi-score leaderboards, and jobs running inside Docker containers. The platform is very flexible and can handle large scale experiments, by allowing organizers to upload large datasets and provide their own CPU or GPU compute workers.
Facing the abundance and diversity of phages, bacteria have developed multiple anti-phage mechanisms. In the past three years, the number of known anti-phage mechanisms has expanded at least 5-fold, rendering our view of prokaryotic immunity obsolete. Most anti-phage systems have been studied as standalone mechanisms; however, many examples demonstrate that strains encode not one but several anti-viral mechanisms. How these different systems integrate into an anti-viral arsenal at the strain level remains to be elucidated. Much could be learned from establishing a fundamental description of features such as the number and diversity of anti-phage systems encoded in a given genome. To address this question, we developed DefenseFinder, a tool that automatically detects known anti-phage systems in prokaryotic genomes. We applied DefenseFinder to >20 000 fully sequenced genomes, generating a systematic and quantitative view of the anti-viral arsenal of prokaryotes. We show that prokaryotic genomes encode on average five anti-phage systems from three different families of systems. This number varies drastically from one strain to another and is influenced by the genome size and the number of prophages encoded. Distributions of different systems are also very heterogeneous, with some systems being enriched in prophages and in specific clades. Finally, we provide a detailed comparison of the anti-viral arsenal of 15 common bacterial species, revealing drastic differences in anti-viral strategies. Overall, our work provides free and open-source software, available as a command-line tool or on a webserver. It allows the rapid detection of anti-phage systems, enables a comprehensive description of the anti-viral arsenal of prokaryotes and paves the way for large-scale genomic studies in the field of anti-phage defense.
The scrutiny over the carbon footprint of research and higher education has increased rapidly in the last few years. This has resulted in a series of publications providing various estimates of the carbon footprint of one or several research activities, principally at the scale of a university or a research center or, more recently, a field of research. The variety of tools or methodologies on which these estimates rely unfortunately prevents any aggregation or direct comparison. This is because carbon footprint assessments are very sensitive to key parameters (e.g., emission factors) or hypotheses (e.g., scopes). Hence, it is impossible to address fundamental questions such as: is the carbon footprint of research structurally different between disciplines? Are plane trips a major source of carbon emissions in academic research? Massive collection and curation of carbon footprint data, across a large array of research situations and disciplines, is hence an important, timely and necessary challenge to answer these questions. This paper presents a framework to collect and analyse large amounts of homogeneous research carbon emission data in a network of research entities at the national scale. It relies on an open-source web application, GES 1point5, designed to estimate the carbon footprint of a department, research lab or team in any country of the world. Importantly, GES 1point5 is also designed to aggregate all input data and corresponding GHG emissions estimates into a comprehensive database. GES 1point5 therefore enables (i) the identification of robust local or national determinants of the carbon footprint of research and (ii) the estimation of the carbon footprint of the entire research sector at national scale. A preliminary analysis of the carbon footprint of more than one hundred laboratories in France is presented to illustrate the potential of the framework.
It shows that the average emissions are 479 t CO2e for a research lab and 3.6 t CO2e for an average lab member (respectively 404 and 3.1 t CO2e without accounting for the indirect radiative effects of aviation), with the current scope of GES 1point5. Availability and implementation: GES 1point5 is available online at http://labos1point5.org/ges-1point5 and its source code can be downloaded from the GitLab platform at https://framagit.org/labos1point5/l1p5-vuejs.
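The sensitivity to emission factors mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common "activity data times emission factor" form of a footprint estimate; the function name, activity quantities, factor values and headcount below are all hypothetical placeholders chosen for readability, and none of them come from GES 1point5 or from the study.

```python
# Illustrative sketch only: every quantity and emission factor below is a
# made-up placeholder, not a value from GES 1point5 or the study.

def footprint_t_co2e(activities, factors_kg_per_unit):
    """Sum activity x emission factor over all sources, in tonnes CO2e."""
    total_kg = sum(qty * factors_kg_per_unit[source]
                   for source, qty in activities.items())
    return total_kg / 1000.0

# Hypothetical yearly activity data for one lab.
activities = {"electricity_kwh": 100_000, "flights_km": 200_000}

# Two hypothetical factor sets (kg CO2e per unit) that differ only in the
# aviation factor, mimicking a choice to include or exclude indirect
# radiative effects of aviation.
factors_without = {"electricity_kwh": 0.06, "flights_km": 0.10}
factors_with = {"electricity_kwh": 0.06, "flights_km": 0.20}

lab_members = 10  # hypothetical headcount

for label, factors in [("without", factors_without), ("with", factors_with)]:
    total = footprint_t_co2e(activities, factors)
    print(f"{label} indirect aviation effects: "
          f"{total:.1f} t CO2e per lab, {total / lab_members:.2f} per member")
```

With identical activity data, changing a single factor shifts both the per-lab and the per-member totals, which is why estimates built on different factor sets cannot be compared directly.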
Malware detection has become a major concern due to the increasing number and complexity of malware. Traditional detection methods based on signatures and heuristics are used for malware detection, but unfortunately, they suffer from poor generalization to unknown attacks and can be easily circumvented using obfuscation techniques. In recent years, Machine Learning (ML) and notably Deep Learning (DL) achieved impressive results in malware detection by learning useful representations from data and have become a solution preferred over traditional methods. Recently, the application of Graph Representation Learning (GRL) techniques on graph-structured data has demonstrated impressive capabilities in malware detection. This success benefits notably from the robust structure of graphs, which are challenging for attackers to alter, and their intrinsic explainability capabilities. In this survey, we provide an in-depth literature review to summarize and unify existing works under the common approaches and architectures. We notably demonstrate that Graph Neural Networks (GNNs) reach competitive results in learning robust embeddings from malware represented as expressive graph structures such as Function Call Graphs (FCGs) and Control Flow Graphs (CFGs). This study also discusses the robustness of GRL-based methods to adversarial attacks, contrasts their effectiveness with other ML/DL approaches, and outlines future research for practical deployment.
Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, Shanya Sharma, Arjun Subramonian, Jaesung Tae, Samson Tan, Deepak Tunuguntla, Oskar Van Der Wal. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022.
Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.
Integrons are mobile genetic elements that contain multiple cassettes encoding accessory genes whose order is shuffled by a specific integrase. Integrons within mobile genetic elements often contain multiple antibiotic resistance genes that they spread among nosocomial pathogens, contributing to the current antibiotic resistance crisis. However, most integrons are presumably sedentary and encode a much broader diversity of functions. IntegronFinder is a widely used software to identify novel integrons in bacterial genomes, but it has aged and lacks some useful functionalities to handle very large datasets of draft genomes or metagenomes. Here, we present IntegronFinder version 2. We have updated the code, improved its efficiency and usability, adapted the output to incomplete genome data, and added a few novel functions. We describe these changes and illustrate the relevance of the program by analyzing the distribution of integrons across more than 20,000 fully sequenced genomes. We also take full advantage of its novel capabilities to analyze close to 4 thousand Klebsiella pneumoniae genomes for the presence of integrons and antibiotic resistance genes within them. Our data show that K. pneumoniae has a large diversity of integrons and the largest mobile integron in our database of plasmids. The pangenome of these integrons contains a total of 165 different gene families, with most of the largest families being related to resistance to numerous types of antibiotics. IntegronFinder is a free and open-source software available at https://github.com/gem-pasteur/Integron_Finder.
Genomic offset statistics predict the maladaptation of populations to rapid habitat alteration based on association of genotypes with environmental variation. Despite substantial evidence for empirical validity, genomic offset statistics have well-identified limitations, and lack a theory that would facilitate interpretations of predicted values. Here, we clarified the theoretical relationships between genomic offset statistics and unobserved fitness traits controlled by environmentally selected loci and proposed a geometric measure to predict fitness after rapid change in local environment. The predictions of our theory were verified in computer simulations and in empirical data on African pearl millet (Cenchrus americanus) obtained from a common garden experiment. Our results proposed a unified perspective on genomic offset statistics and provided a theoretical foundation necessary when considering their potential application in conservation management in the face of environmental change.
New language technologies are coming, thanks to the huge and competing private investment fuelling rapid progress; we can either understand and foresee their effects, or be taken by surprise and spend our time trying to catch up. This report sketches out some transformative new technologies that are likely to fundamentally change our use of language. Some of these may feel unrealistically futuristic or far-fetched, but a central purpose of this report - and the wider LITHME network - is to illustrate that these are mostly just the logical development and maturation of technologies currently in prototype. But will everyone benefit from all these shiny new gadgets? Throughout this report we emphasise a range of groups who will be disadvantaged and issues of inequality. Important issues of security and privacy will accompany new language technologies. A further caution is to re-emphasise the current limitations of AI. Looking ahead, we see many intriguing opportunities and new capabilities, but a range of other uncertainties and inequalities. New devices will enable new ways to talk, to translate, to remember, and to learn. But advances in technology will reproduce existing inequalities among those who cannot afford these devices, among the world’s smaller languages, and especially for sign language. Debates over privacy and security will flare and crackle with every new immersive gadget. We will move together into this curious new world with a mix of excitement and apprehension - reacting, debating, sharing and disagreeing as we always do. Plug in, as the human-machine era dawns.
As liquids approach the glass transition temperature, dynamical heterogeneity emerges as a crucial universal feature of their behavior. Dynamic facilitation, where local motion triggers further motion nearby, plays a major role in this phenomenon. Here we show that long-ranged, elastically mediated facilitation appears below the mode coupling temperature, adding to the short-range component present at all temperatures. Our results suggest deep connections between the supercooled liquid and glass states, and pave the way for a deeper understanding of dynamical heterogeneity in glassy systems.
This article extends the economic-sociological concept of embeddedness to encompass not only social networks of, for example, friendship or kinship ties, but also economic networks of ownership and control relationships. Applying these ideas to the case of digital platform labour pinpoints two possible scenarios. When platforms take the role of market intermediaries, economic ties are thin and workers are left to their own devices, in a form of ‘disembeddedness’. When platforms partake in intricate inter-firm outsourcing structures, economic ties envelop workers in a ‘deep embeddedness’ which involves both stronger constraints and higher rewards. With this added dimension, the notion of embeddedness becomes a compelling tool to describe the social structures that frame economic action, including the power imbalances that characterize digital labour in the global economy.
Electrovortex flows occur whenever thin electrodes are put in contact with wider liquid metal regions. An investigation shows that in liquid metal batteries, electrovortex flows can become so intense they can compromise the layered structure of the battery and cause a short circuit.
Recognizing a speaker’s emotion from their speech can be a key element in emergency call centers. End-to-end deep learning systems for speech emotion recognition now achieve equivalent or even better results than conventional machine learning approaches. In this paper, in order to validate the performance of our neural network architecture for emotion recognition from speech, we first trained and tested it on the widely used corpus accessible by the community, IEMOCAP. We then used the same architecture with the real-life corpus, CEMO, comprising 440 dialogs (2h16m) from 485 speakers. The most frequent emotions expressed by callers in these real-life emergency dialogues are fear, anger and positive emotions such as relief. In the IEMOCAP general topic conversations, the most frequent emotions are sadness, anger and happiness. Using the same end-to-end deep learning architecture, an Unweighted Accuracy Recall (UA) of 63% is obtained on IEMOCAP and a UA of 45.6% on CEMO, each with 4 classes. Using only 2 classes (Anger, Neutral), the results for CEMO are 76.9% UA compared to 81.1% UA for IEMOCAP. We expect that these encouraging results with CEMO can be improved by combining the audio channel with the linguistic channel. Real-life emotions are clearly more complex than acted ones, mainly due to the large diversity of emotional expressions of speakers.
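The UA figures above can be read more easily with a small sketch of the metric. It assumes that UA denotes the usual unweighted average recall, i.e. the mean of per-class recalls, which the abstract does not define explicitly. The labels and predictions below are made-up toy data, not drawn from IEMOCAP or CEMO.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally, whatever its size."""
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Plain accuracy, dominated by the most frequent class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy, imbalanced two-class example (hypothetical data).
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "anger"]

print("accuracy:", accuracy(y_true, y_pred))
print("UA:", unweighted_average_recall(y_true, y_pred))
```

On imbalanced data such as this toy example, plain accuracy is inflated by the majority class, whereas UA penalizes the missed minority-class sample; this is one reason such a metric suits corpora where emotions are unevenly distributed.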
Existing studies in black-box optimization for machine learning suffer from low generalizability, caused by a typically selective choice of problem instances used for training and testing different optimization algorithms. Among other issues, this practice promotes overfitting and poor-performing user guidelines. To address this shortcoming, we propose in this work a benchmark suite, OptimSuite, which covers a broad range of black-box optimization problems, ranging from academic benchmarks to real-world applications, from discrete over numerical to mixed-integer problems, from small to very large-scale problems, from noisy over dynamic to static problems, etc. We demonstrate the advantages of such a broad collection by deriving from it Automated Black Box Optimizer (ABBO), a general-purpose algorithm selection wizard. Using three different types of algorithm selection techniques, ABBO achieves competitive performance on all benchmark suites. It significantly outperforms previous state of the art on some of them, including YABBOB and LSGO. ABBO relies on many high-quality base components. Its excellent performance is obtained without any task-specific parametrization. The OptimSuite benchmark collection, the ABBO wizard and its base solvers have all been merged into the open-source Nevergrad platform, where they are available for reproducible research.
We present a performance-led inquiry that involved a live coder programming movement-based interactive sound and two dance improvisers. During two years of collaboration, we developed a joint improvisation practice where the interactions between the dancers’ movement and the sound feedback are programmed on the fly through live coding and movement sensing. To that end, we designed a new live coding environment called CO/DA that facilitates the real-time manipulation of continuous streams of the dancers’ motion data for interactive sound synthesis. Through an autoethnographic inquiry, we describe our practice of sound and movement improvisation where live coding dynamically changes how the dancers’ movements generate sound, which in turn influences the dancers’ improvisation. We then discuss the value, potential and challenges of our dance/code improvisation practice, along with its implications as a design method.