NobleBlocks

Laboratoire Interdisciplinaire des Sciences du Numérique

Facility · Gif-sur-Yvette, Île-de-France, France

Research output, citation impact, and the most-cited recent papers from Laboratoire Interdisciplinaire des Sciences du Numérique (France). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works: 1.3K
Citations: 8.1K
h-index: 37
i10-index: 177
Also known as: Laboratoire Interdisciplinaire des Sciences du Numérique, UMR 9015, UMR9015
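The h-index and i10-index figures above follow standard bibliometric definitions: the h-index is the largest h such that h works each have at least h citations, and the i10-index counts works with at least 10 citations. A minimal Python sketch of both, using made-up citation counts rather than the laboratory's data (how NobleBlocks computes its figures is not described on this page, so this is an assumption):

```python
def h_index(citations):
    """Largest h such that at least h works have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of works with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts, for illustration only.
example = [154, 96, 79, 12, 10, 9, 3, 0]
print(h_index(example), i10_index(example))
```

Because both metrics depend only on the per-work citation counts, they can shift whenever the underlying index adds works or citations.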

Top-cited papers from Laboratoire Interdisciplinaire des Sciences du Numérique

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Joseph Mariani
2018 · HAL (Le Centre pour la Communication Scientifique Directe) · 1.2K citations

Unraveling the Hidden Environmental Impacts of AI Solutions for Environment Life Cycle Assessment of AI Solutions
Anne‐Laure Ligozat, Julien Lefèvre, Aurélie Bugeau, Jacques Combaz
2022 · Sustainability · 154 citations · doi:10.3390/su14095172

In the past ten years, artificial intelligence has made such dramatic progress that it is now seen as a tool of choice to solve environmental issues and, in the first place, greenhouse gas (GHG) emissions. At the same time, the deep learning community began to realize that training models with more and more parameters requires a lot of energy and, as a consequence, generates GHG emissions. To our knowledge, questioning the complete net environmental impacts of AI solutions for the environment (AI for Green), and not only GHG, has never been addressed directly. In this article, we propose to study the possible negative impacts of AI for Green. First, we review the different types of AI impacts; then, we present the different methodologies used to assess those impacts and show how to apply life cycle assessment to AI services. Finally, we discuss how to assess the environmental usefulness of a general AI service and point out the limitations of existing work in AI for Green.

Prompt Engineering Paradigms for Medical Applications: Scoping Review
Jamil Zaghir, Marco Naguib, Mina Bjelogrlic, Aurélie Névéol +2 more
2024 · Journal of Medical Internet Research · 96 citations · doi:10.2196/60501

BACKGROUND: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored.
OBJECTIVE: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice.
METHODS: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD).
RESULTS: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention them, posing a challenge for advancing prompt engineering research.
CONCLUSIONS: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.

CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges
Adrien Pavão, Isabelle Guyon, Anne-Catherine Letournel, Dinh Tuan Tran +4 more
2022 · HAL (Le Centre pour la Communication Scientifique Directe) · 79 citations

CodaLab Competitions is an open source web platform designed to help data scientists and research teams to crowd-source the resolution of machine learning problems through the organization of competitions, also called challenges or contests. CodaLab Competitions provides useful features such as multiple phases, results and code submissions, multi-score leaderboards, and jobs running inside Docker containers. The platform is very flexible and can handle large scale experiments, by allowing organizers to upload large datasets and provide their own CPU or GPU compute workers.

Systematic and quantitative view of the antiviral arsenal of prokaryotes
Florian Tesson, Hervé Alexandre, Marie Touchon, Camille d’Humières +2 more
2021 · bioRxiv (Cold Spring Harbor Laboratory) · 69 citations · doi:10.1101/2021.09.02.458658

Facing the abundance and diversity of phages, bacteria have developed multiple anti-phage mechanisms. In the past three years, the number of known anti-phage mechanisms has been expanded by at least 5-fold, rendering our view of prokaryotic immunity obsolete. Most anti-phage systems have been studied as standalone mechanisms; however, many examples demonstrate that strains encode not one but several anti-viral mechanisms. How these different systems integrate into an anti-viral arsenal at the strain level remains to be elucidated. Much could be learned from establishing a fundamental description of features such as the number and diversity of anti-phage systems encoded in a given genome. To address this question, we developed DefenseFinder, a tool that automatically detects known anti-phage systems in prokaryotic genomes. We applied DefenseFinder to more than 20,000 fully sequenced genomes, generating a systematic and quantitative view of the anti-viral arsenal of prokaryotes. We show prokaryotic genomes encode on average five anti-phage systems from three different families of systems. This number varies drastically from one strain to another and is influenced by the genome size and the number of prophages encoded. Distributions of different systems are also very heterogeneous, with some systems being enriched in prophages and in specific clades. Finally, we provide a detailed comparison of the anti-viral arsenal of 15 common bacterial species, revealing drastic differences in anti-viral strategies. Overall, our work provides free and open-source software, available as a command-line tool or on a web server. It allows the rapid detection of anti-phage systems, enables a comprehensive description of the anti-viral arsenal of prokaryotes and paves the way for large-scale genomic studies in the field of anti-phage defense.

An open-source tool to assess the carbon footprint of research
Jérôme Mariette, Odile Blanchard, Olivier Berné, Olivier Aumont +4 more
2022 · Environmental Research Infrastructure and Sustainability · 69 citations · doi:10.1088/2634-4505/ac84a4

The scrutiny over the carbon footprint of research and higher education has increased rapidly in the last few years. This has resulted in a series of publications providing various estimates of the carbon footprint of one or several research activities, principally at the scale of a university or a research center or, more recently, a field of research. The variety of tools or methodologies on which these estimates rely unfortunately prevents any aggregation or direct comparison. This is because carbon footprint assessments are very sensitive to key parameters (e.g., emission factors) or hypotheses (e.g., scopes). Hence, it is impossible to address fundamental questions such as: is the carbon footprint of research structurally different between disciplines? Are plane trips a major source of carbon emissions in academic research? Massive collection and curation of carbon footprint data, across a large array of research situations and disciplines, is hence an important, timely and necessary challenge to answer these questions. This paper presents a framework to collect and analyse large amounts of homogeneous research carbon emission data in a network of research entities at the national scale. It relies on an open-source web application, GES 1point5, designed to estimate the carbon footprint of a department, research lab or team in any country of the world. Importantly, GES 1point5 is also designed to aggregate all input data and corresponding GHG emissions estimates into a comprehensive database. GES 1point5 therefore enables (i) the identification of robust local or national determinants of the carbon footprint of research and (ii) the estimation of the carbon footprint of the entire research sector at national scale. A preliminary analysis of the carbon footprint of more than one hundred laboratories in France is presented to illustrate the potential of the framework. It shows that the average emissions are 479 t CO2e for a research lab and 3.6 t CO2e for an average lab member (respectively 404 and 3.1 t CO2e without accounting for the indirect radiative effects of aviation), with the current scope of GES 1point5. Availability and implementation: GES 1point5 is available online at http://labos1point5.org/ges-1point5 and its source code can be downloaded from the GitLab platform at https://framagit.org/labos1point5/l1p5-vuejs.

A Survey on Malware Detection with Graph Representation Learning
Tristan Bilot, Nour El Madhoun, Khaldoun Al Agha, Anis Zouaoui
2024 · ACM Computing Surveys · 65 citations · doi:10.1145/3664649

Malware detection has become a major concern due to the increasing number and complexity of malware. Traditional detection methods based on signatures and heuristics are used for malware detection, but unfortunately, they suffer from poor generalization to unknown attacks and can be easily circumvented using obfuscation techniques. In recent years, Machine Learning (ML) and notably Deep Learning (DL) achieved impressive results in malware detection by learning useful representations from data and have become a solution preferred over traditional methods. Recently, the application of Graph Representation Learning (GRL) techniques on graph-structured data has demonstrated impressive capabilities in malware detection. This success benefits notably from the robust structure of graphs, which are challenging for attackers to alter, and their intrinsic explainability capabilities. In this survey, we provide an in-depth literature review to summarize and unify existing works under the common approaches and architectures. We notably demonstrate that Graph Neural Networks (GNNs) reach competitive results in learning robust embeddings from malware represented as expressive graph structures such as Function Call Graphs (FCGs) and Control Flow Graphs (CFGs). This study also discusses the robustness of GRL-based methods to adversarial attacks, contrasts their effectiveness with other ML/DL approaches, and outlines future research for practical deployment.

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu +4 more
2022 · 62 citations · doi:10.18653/v1/2022.bigscience-1.3

Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, Shanya Sharma, Arjun Subramonian, Jaesung Tae, Samson Tan, Deepak Tunuguntla, Oskar Van Der Wal. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022.

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations
M. Elise Lauterbur, Maria Izabel A. Cavassim, Ariella Gladstein, Graham Gower +4 more
2023 · eLife · 53 citations · doi:10.7554/elife.84874

Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.

Dynamics of a surfactant-laden bubble bursting through an interface
C. Ricardo Constante-Amores, Lyes Kahouadji, Assen Batchvarov, Seungwon Shin +3 more
2021 · Journal of Fluid Mechanics · 51 citations · doi:10.1017/jfm.2020.1099

IntegronFinder 2.0: identification and analysis of integrons across Bacteria, with a focus on antibiotic resistance in Klebsiella
Bertrand Néron, Eloi Littner, Matthieu Haudiquet, Amandine Perrin +2 more
2022 · bioRxiv (Cold Spring Harbor Laboratory) · 51 citations · doi:10.1101/2022.02.28.482270

Integrons are mobile genetic elements that contain multiple cassettes encoding accessory genes whose order is shuffled by a specific integrase. Integrons within mobile genetic elements often contain multiple antibiotic resistance genes that they spread among nosocomial pathogens, contributing to the current antibiotic resistance crisis. However, most integrons are presumably sedentary and encode a much broader diversity of functions. IntegronFinder is widely used software for identifying novel integrons in bacterial genomes, but it has aged and lacks some useful functionalities to handle very large datasets of draft genomes or metagenomes. Here, we present IntegronFinder version 2. We have updated the code, improved its efficiency and usability, adapted the output to incomplete genome data, and added a few novel functions. We describe these changes and illustrate the relevance of the program by analyzing the distribution of integrons across more than 20,000 fully sequenced genomes. We also take full advantage of its novel capabilities to analyze close to 4 thousand Klebsiella pneumoniae genomes for the presence of integrons and antibiotic resistance genes within them. Our data show that K. pneumoniae has a large diversity of integrons and the largest mobile integron in our database of plasmids. The pangenome of these integrons contains a total of 165 different gene families, with most of the largest families being related to resistance to numerous types of antibiotics. IntegronFinder is free and open-source software available at https://github.com/gem-pasteur/Integron_Finder.

A Quantitative Theory for Genomic Offset Statistics
Clément Gain, Bénédicte Rhoné, Philippe Cubry, Israfel Salazar +4 more
2023 · Molecular Biology and Evolution · 51 citations · doi:10.1093/molbev/msad140

Genomic offset statistics predict the maladaptation of populations to rapid habitat alteration based on association of genotypes with environmental variation. Despite substantial evidence for empirical validity, genomic offset statistics have well-identified limitations, and lack a theory that would facilitate interpretations of predicted values. Here, we clarified the theoretical relationships between genomic offset statistics and unobserved fitness traits controlled by environmentally selected loci and proposed a geometric measure to predict fitness after rapid change in local environment. The predictions of our theory were verified in computer simulations and in empirical data on African pearl millet (Cenchrus americanus) obtained from a common garden experiment. Our results proposed a unified perspective on genomic offset statistics and provided a theoretical foundation necessary when considering their potential application in conservation management in the face of environmental change.

The Dawn of the Human-Machine Era: A forecast of new and emerging language technologies
Dave Sayers, Rui Sousa‐Silva, Sviatlana Höhn, Lule Ahmedi +4 more
2021 · 49 citations · doi:10.17011/jyx/reports/20210518/1

New language technologies are coming, thanks to the huge and competing private investment fuelling rapid progress; we can either understand and foresee their effects, or be taken by surprise and spend our time trying to catch up. This report sketches out some transformative new technologies that are likely to fundamentally change our use of language. Some of these may feel unrealistically futuristic or far-fetched, but a central purpose of this report - and the wider LITHME network - is to illustrate that these are mostly just the logical development and maturation of technologies currently in prototype. But will everyone benefit from all these shiny new gadgets? Throughout this report we emphasise a range of groups who will be disadvantaged and issues of inequality. Important issues of security and privacy will accompany new language technologies. A further caution is to re-emphasise the current limitations of AI. Looking ahead, we see many intriguing opportunities and new capabilities, but a range of other uncertainties and inequalities. New devices will enable new ways to talk, to translate, to remember, and to learn. But advances in technology will reproduce existing inequalities among those who cannot afford these devices, among the world’s smaller languages, and especially for sign language. Debates over privacy and security will flare and crackle with every new immersive gadget. We will move together into this curious new world with a mix of excitement and apprehension - reacting, debating, sharing and disagreeing as we always do. Plug in, as the human-machine era dawns.

Elastoplasticity Mediates Dynamical Heterogeneity Below the Mode Coupling Temperature
Rahul Chacko, François P. Landes, Giulio Biroli, Olivier Dauchot +2 more
2021 · Physical Review Letters · 48 citations · doi:10.1103/physrevlett.127.048002

As liquids approach the glass transition temperature, dynamical heterogeneity emerges as a crucial universal feature of their behavior. Dynamic facilitation, where local motion triggers further motion nearby, plays a major role in this phenomenon. Here we show that long-ranged, elastically mediated facilitation appears below the mode coupling temperature, adding to the short-range component present at all temperatures. Our results suggest deep connections between the supercooled liquid and glass states, and pave the way for a deeper understanding of dynamical heterogeneity in glassy systems.

Disembedded or Deeply Embedded? A Multi-Level Network Analysis of Online Labour Platforms
Paola Tubaro
2021 · Sociology · 48 citations · doi:10.1177/0038038520986082

This article extends the economic-sociological concept of embeddedness to encompass not only social networks of, for example, friendship or kinship ties, but also economic networks of ownership and control relationships. Applying these ideas to the case of digital platform labour pinpoints two possible scenarios. When platforms take the role of market intermediaries, economic ties are thin and workers are left to their own devices, in a form of ‘disembeddedness’. When platforms partake in intricate inter-firm outsourcing structures, economic ties envelop workers in a ‘deep embeddedness’ which involves both stronger constraints and higher rewards. With this added dimension, the notion of embeddedness becomes a compelling tool to describe the social structures that frame economic action, including the power imbalances that characterize digital labour in the global economy.

Stabilization of the fluidic pinball with gradient-enriched machine learning control
Guy Y. Cornejo Maceda, Yiqing Li, François Lusseyran, Marek Morzyński +1 more
2020 · HAL (Le Centre pour la Communication Scientifique Directe) · 47 citations

Numerical simulation of electrovortex flows in cylindrical fluid layers and liquid metal batteries
Wietze Herreman, Caroline Nore, P. Ziebell Ramos, Loïc Cappanera +2 more
2019 · Physical Review Fluids · 43 citations · doi:10.1103/physrevfluids.4.113702

Electrovortex flows occur whenever thin electrodes are put in contact with wider liquid metal regions. An investigation shows that in liquid metal batteries, electrovortex flows can become so intense they can compromise the layered structure of the battery and cause a short circuit.

End-to-End Speech Emotion Recognition: Challenges of Real-Life Emergency Call Centers Data Recordings
Théo Deschamps-Berger, Lori Lamel, Laurence Devillers
2021 · 42 citations · doi:10.1109/acii52823.2021.9597419

Recognizing a speaker’s emotion from their speech can be a key element in emergency call centers. End-to-end deep learning systems for speech emotion recognition now achieve equivalent or even better results than conventional machine learning approaches. In this paper, in order to validate the performance of our neural network architecture for emotion recognition from speech, we first trained and tested it on the widely used corpus accessible by the community, IEMOCAP. We then used the same architecture with the real life corpus, CEMO, comprised of 440 dialogs (2h16m) from 485 speakers. The most frequent emotions expressed by callers in these real-life emergency dialogues are fear, anger and positive emotions such as relief. In the IEMOCAP general topic conversations, the most frequent emotions are sadness, anger and happiness. Using the same end-to-end deep learning architecture, an Unweighted Accuracy Recall (UA) of 63% is obtained on IEMOCAP and a UA of 45.6% on CEMO, each with 4 classes. Using only 2 classes (Anger, Neutral), the results for CEMO are 76.9% UA compared to 81.1% UA for IEMOCAP. We expect that these encouraging results with CEMO can be improved by combining the audio channel with the linguistic channel. Real-life emotions are clearly more complex than acted ones, mainly due to the large diversity of emotional expressions of speakers.
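The UA scores reported above are commonly computed as the mean of per-class recalls, so that rare emotion classes weigh as much as frequent ones. A minimal Python sketch under that common definition (an assumption, since the paper's evaluation code is not shown here), using hypothetical labels rather than CEMO or IEMOCAP data:

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, with every class weighted equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels for illustration; not drawn from CEMO or IEMOCAP.
y_true = ["anger", "anger", "anger", "anger", "neutral", "neutral"]
y_pred = ["anger", "anger", "anger", "neutral", "neutral", "anger"]
print(unweighted_average_recall(y_true, y_pred))
```

In this toy example the recall is 0.75 for "anger" and 0.5 for "neutral", so the unweighted average is 0.625, whereas plain accuracy (4 of 6 correct) would be higher because the majority class dominates it.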

Black-Box Optimization Revisited: Improving Algorithm Selection Wizards through Massive Benchmarking
Laurent Meunier, Herilalaina Rakotoarison, Pak Kan Wong, Baptiste Rozière +4 more
2020 · arXiv (Cornell University) · 40 citations · doi:10.48550/arxiv.2010.04542

Existing studies in black-box optimization for machine learning suffer from low generalizability, caused by a typically selective choice of problem instances used for training and testing different optimization algorithms. Among other issues, this practice promotes overfitting and poor-performing user guidelines. To address this shortcoming, we propose in this work a benchmark suite, OptimSuite, which covers a broad range of black-box optimization problems, ranging from academic benchmarks to real-world applications, from discrete over numerical to mixed-integer problems, from small to very large-scale problems, from noisy over dynamic to static problems, etc. We demonstrate the advantages of such a broad collection by deriving from it Automated Black Box Optimizer (ABBO), a general-purpose algorithm selection wizard. Using three different types of algorithm selection techniques, ABBO achieves competitive performance on all benchmark suites. It significantly outperforms previous state of the art on some of them, including YABBOB and LSGO. ABBO relies on many high-quality base components. Its excellent performance is obtained without any task-specific parametrization. The OptimSuite benchmark collection, the ABBO wizard and its base solvers have all been merged into the open-source Nevergrad platform, where they are available for reproducible research.

CO/DA: Live-Coding Movement-Sound Interactions for Dance Improvisation
Jules Françoise, Sarah Fdili Alaoui, Yves Candau
2022 · CHI Conference on Human Factors in Computing Systems · 38 citations · doi:10.1145/3491102.3501916

We present a performance-led inquiry that involved a live coder programming movement-based interactive sound and two dance improvisers. During two years of collaboration, we developed a joint improvisation practice where the interactions between the dancers’ movement and the sound feedback are programmed on the fly through live coding and movement sensing. To that end, we designed a new live coding environment called CO/DA that facilitates the real-time manipulation of continuous streams of the dancers’ motion data for interactive sound synthesis. Through an autoethnographic inquiry, we describe our practice of sound and movement improvisation where live coding dynamically changes how the dancers’ movements generate sound, which in turn influences the dancers’ improvisation. We then discuss the value, potential and challenges of our dance/code improvisation practice, along with its implications as a design method.