Entrepôts, Représentation et Ingénierie des Connaissances
Facility · Saint-Priest, France
Research output, citation impact, and the most-cited recent papers from Entrepôts, Représentation et Ingénierie des Connaissances (France). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Entrepôts, Représentation et Ingénierie des Connaissances
Bike sharing systems (BSSs) have become a means of sustainable intermodal transport and are now offered in many cities worldwide. Most BSSs also provide open access to their data, particularly to real-time status reports on their bike stations. The analysis of the mass of data generated by such systems is of particular interest to BSS providers for updating system structures and policies. This work was motivated by interest in analyzing and comparing several European BSSs to identify common operating patterns and to propose practical solutions to avoid potential issues. Our approach relies on the identification of common patterns between and within systems. To this end, a model-based clustering method for time series (or, more generally, functional data), called FunFEM, is developed. It is based on a functional mixture model that allows the clustering of the data in a discriminative functional subspace. In this context, the model has the advantage of being parsimonious and of allowing the visualization of the clustered systems. Numerical experiments confirm the good behavior of FunFEM, particularly compared to state-of-the-art methods. The application of FunFEM to BSS data from JCDecaux and the Transport for London Initiative allows us to identify 10 general patterns, including pathological ones, and to propose practical improvement strategies based on the system comparison. The visualization of the clustered data within the discriminative subspace turns out to be particularly informative regarding system efficiency. The proposed methodology is implemented in a package for the R software, named funFEM, which is available on CRAN. The package also provides a subset of the data analyzed in this work.
The context of the work presented in this thesis is digital geometry. This research area is devoted to the automatic analysis of objects in digital images in two and three dimensions. All acquisition devices provide data organized on regular grids, called digital data. The algorithms that are explored and extended preserve the discrete nature of the data, as opposed to techniques based on approximating a continuous model. More precisely, we are interested in the study of digital curves and surfaces. First, we consider basic digital objects such as digital straight lines, planes and circles. We present algorithms that characterize such objects and we propose some extensions of these methods. Then, we study some metrics on digital objects, such as the Euclidean distance transform and the notion of digital geodesic. An approach based on the visibility property in digital domains is presented. In the third part, we define and evaluate estimators of Euclidean measurements such as length, curvature and area. Some results on the convergence of these estimators are presented. Finally, we illustrate some applications in which this research has been used: automatic classification of archaeological objects and micro-structure analysis of snow samples.
"This book provides an overall view of the emerging field of complex data processing, highlighting the similarities between the different data, issues and approaches"--Provided by publisher
The forecasting of future values is a very challenging task. In almost all scientific disciplines, the analysis of time series provides useful information and even economic benefits. In this context, this paper proposes a novel hybrid algorithm to forecast functional time series with arbitrary prediction horizons. It integrates a well-known functional data clustering algorithm into a forecasting strategy based on pattern sequence similarity, which was originally developed for discrete time series. The new approach assumes that some patterns are repeated over time, and it attempts to discover them and evaluate their immediate future. Hence, the algorithm first applies a functional time series clustering algorithm, i.e., it assigns a label to every data unit (which may represent one hour, one day, or any arbitrary length). As a result, the time series is transformed into a sequence of labels. Next, it retrieves the sequence of labels occurring just before the sample to be forecast. This sequence is searched for within the historical data, and every time it is found, the sample immediately after it is stored. Once the search is complete, the output is generated by weighting all stored data. The performance of the approach has been tested on real-world datasets related to electricity demand and compared to other existing methods, reporting very promising results. Finally, a statistical significance test has been carried out to confirm the suitability of the choice of the compared methods. In conclusion, a novel algorithm to forecast functional time series is proposed, with very satisfactory results when assessed in the context of electricity demand.
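The search-and-average step described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the clustering step is assumed to have been run already (the `labels` argument holds its output), the function name `psf_forecast` and its parameters are hypothetical, and the stored samples are combined with a plain equal-weight average, whereas the abstract only says they are "weighted".

```python
from typing import List, Optional, Sequence


def psf_forecast(
    labels: Sequence[int],
    samples: Sequence[Sequence[float]],
    window: int,
) -> Optional[List[float]]:
    """Forecast the next data unit from repeated label patterns.

    labels[i] is the cluster label of samples[i] (one curve per data unit).
    Assumption: equal-weight averaging of the matched successors.
    """
    if len(labels) != len(samples):
        raise ValueError("labels and samples must have the same length")
    if window < 1 or window >= len(labels):
        raise ValueError("window must satisfy 1 <= window < len(labels)")

    # Label sequence that immediately precedes the unit to forecast.
    pattern = list(labels[-window:])

    # Scan the history; i + window indexes the sample right after a match.
    # The range stops before the final window, which has no successor yet.
    successors = []
    for i in range(len(labels) - window):
        if list(labels[i:i + window]) == pattern:
            successors.append(samples[i + window])

    if not successors:
        return None  # pattern never seen before in the history

    # Pointwise mean of every stored successor curve.
    horizon = len(successors[0])
    return [
        sum(curve[t] for curve in successors) / len(successors)
        for t in range(horizon)
    ]
```

For example, with labels `[0, 1, 0, 1, 2, 0, 1]` and a window of 2, the pattern `[0, 1]` is found twice in the history, and the forecast is the pointwise mean of the two curves that followed those occurrences. Returning `None` when no match exists is a design choice made here for simplicity; a caller could instead retry with a shorter window.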
International audience
Air pollution is now a major threat to public health, with clear links to many diseases, especially cardiovascular ones. The spatiotemporal study of pollution is of great interest to governments and local authorities when deciding on public alerts or new city policies against rising pollution. The aim of this work is to study spatiotemporal profiles of environmental data collected in the south of France (Région Sud) by the public agency AtmoSud. The idea is to better understand the exposure of inhabitants to pollutants over a large territory with important differences in terms of geography and urbanism. The data consist of daily measurements of five environmental variables, namely three pollutants (PM10, NO2, O3) and two meteorological factors (pressure and temperature), recorded over six years. These data can be seen as multivariate functional data: quantitative entities evolving over time, for which there is a growing need for methods to summarize and understand them. For this purpose, a novel co-clustering model for multivariate functional data is defined. The model is based on a functional latent block model which assumes, for each co-cluster, a probability distribution for the multivariate functional principal component scores. A stochastic EM algorithm embedding a Gibbs sampler is proposed for model inference, as well as a model selection criterion for choosing the number of co-clusters. Applying the proposed co-clustering algorithm to environmental data of the Région Sud made it possible to divide the region, composed of 357 zones, into six macroareas with common exposure to pollution. We showed that pollution profiles vary according to the seasons, and that the patterns are similar across the six years studied. These results can be used by local authorities to develop specific programs to reduce pollution at the macroarea level and to identify specific periods of the year with high pollution peaks in order to set up specific health prevention programs. Overall, the proposed co-clustering approach is a powerful resource for analysing multivariate functional data in order to identify intrinsic data structure and to summarize variable profiles over long periods of time.
The rise of big data has revolutionized data exploitation practices and led to the emergence of new concepts. Among them, data lakes have emerged as large heterogeneous data repositories that can be analyzed by various methods. An efficient data lake requires a metadata system that addresses the many problems arising when dealing with big data. In consequence, the study of data lake metadata models is currently an active research topic and many proposals have been made in this regard. However, existing metadata models are either tailored for a specific use case or insufficiently generic to manage different types of data lakes, including our previous model MEDAL. In this paper, we generalize MEDAL's concepts in a new metadata model called goldMEDAL. Moreover, we compare goldMEDAL with the most recent state-of-the-art metadata models aiming at genericity and show that we can reproduce these metadata models with goldMEDAL's concepts. As a proof of concept, we also illustrate that goldMEDAL allows the design of various data lakes by presenting three different use cases.
BACKGROUND: Non-negative matrix factorization has become an essential tool for feature extraction in a wide spectrum of applications. In the present work, our objective is to extend the applicability of the method to the case of missing and/or corrupted data due to outliers. RESULTS: An essential property for missing data imputation and detection of outliers is that the uncorrupted data matrix is low rank, i.e. has only a small number of degrees of freedom. We devise a new version of the Bregman proximal idea which preserves nonnegativity and combine it with the Augmented Lagrangian approach for simultaneous reconstruction of the features of interest and detection of the outliers using a sparsity-promoting ℓ1 penalty. CONCLUSIONS: Finally, an application to the analysis of gene expression data from patients with bladder cancer is presented.
National audience
Qualitative or equivalently fuzzy integrals are used as qualitative aggregation functions or as L-fuzzy quantifiers. In both cases they are generalisations of Sugeno integrals. The definitions of these fuzzy integrals are quite similar and coincide in particular cases, but surprisingly there is no deeper analysis of their relationship. The paper attempts to fill this gap and provides unified definitions of fuzzy quantifiers on the basis of which various links between these fuzzy integrals are studied. In order to make these links more visible and to emphasise their logical structure, we present them using the graded square and modern square of opposition.
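For readers unfamiliar with the Sugeno integral that these fuzzy integrals generalise, the following Python sketch computes its standard discrete form: sort the elements by increasing value and take the maximum, over each element, of the minimum between its value and the capacity of the set of elements ranked at or above it. The function name `sugeno_integral` and the dictionary-based encoding of the capacity are illustrative assumptions, not notation from the paper, and the sketch does not cover the paper's more general L-fuzzy quantifiers.

```python
from typing import FrozenSet, Mapping


def sugeno_integral(
    f: Mapping[str, float],
    capacity: Mapping[FrozenSet[str], float],
) -> float:
    """Discrete Sugeno integral of f with respect to a capacity.

    f maps each element to a value in [0, 1]; capacity maps subsets
    (as frozensets) to [0, 1] and is assumed monotone, with the full
    set mapped to 1. Only the "upper" sets met below are looked up.
    """
    # Elements ordered by increasing value of f.
    order = sorted(f, key=f.get)

    best = 0.0
    for i, x in enumerate(order):
        # Set of elements whose value is at least f[x].
        upper = frozenset(order[i:])
        best = max(best, min(f[x], capacity[upper]))
    return best
```

As a small worked case, take values a = 0.2, b = 0.6, c = 0.9 and capacities 1.0 for {a, b, c}, 0.5 for {b, c} and 0.3 for {c}. The three candidate terms are min(0.2, 1.0) = 0.2, min(0.6, 0.5) = 0.5 and min(0.9, 0.3) = 0.3, so the integral is 0.5.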
This article presents the first results of an interdisciplinary research project whose objective is to identify the social logics behind the production of political messages on Twitter. The research aims specifically to demonstrate the value of an interdisciplinary approach to this subject. The goal is, on the one hand, to develop algorithms for the supervised and unsupervised analysis of a very large number of political messages in order to identify their polarity and target and, on the other hand, to compare this information with opinion poll data in order to better understand the relationships (or lack of relationships) between online and offline opinion dynamics.