Helsinki Institute for Information Technology

Facility · Espoo, Uusimaa, Finland

Research output, citation impact, and the most-cited recent papers from Helsinki Institute for Information Technology (Finland). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works

4.1K

Citations

285.7K

h-index

201

i10-index

3.7K

Also known as

Forskningsinstitutet för informationsteknologi · Helsinki Institute for Information Technology · Tietotekniikan tutkimuslaitos

Top-cited papers from Helsinki Institute for Information Technology

A Fast Fixed-Point Algorithm for Independent Component Analysis

Aapo Hyvärinen, Erkki Oja

1997 · Neural Computation · 3.4K citations · doi:10.1162/neco.1997.9.7.1483

We introduce a novel fast algorithm for independent component analysis, which can be used for blind source separation and feature extraction. We show how a neural network learning rule can be transformed into a fixed-point iteration, which provides an algorithm that is very simple, does not depend on any user-defined parameters, and is fast to converge to the most accurate solution allowed by the data. The algorithm finds, one at a time, all nongaussian independent components, regardless of their probability distributions. The computations can be performed in either batch mode or a semiadaptive manner. The convergence of the algorithm is rigorously proved, and the convergence speed is shown to be cubic. Some comparisons to gradient-based algorithms are made, showing that the new algorithm is usually 10 to 100 times faster, sometimes giving the solution in just a few iterations.
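As a rough illustration of the fixed-point iteration described in this abstract, the sketch below applies the kurtosis-based one-unit update w ← E[x (wᵀx)³] − 3w followed by renormalization. It assumes the data are already whitened (zero mean, identity covariance). The function name `fastica_one_unit`, its parameters, and the toy data are hypothetical and are not taken from the authors' code.

```python
import math
import random


def fastica_one_unit(samples, w, max_iter=100, tol=1e-10):
    """Estimate one independent-component direction from whitened samples.

    samples: list of equal-length tuples (assumed zero mean, unit covariance).
    w: initial unit-norm weight vector (list of floats).
    """
    n, dim = len(samples), len(w)
    for _ in range(max_iter):
        # Fixed-point step: w_new = E[x * (w.x)^3] - 3 * w
        acc = [0.0] * dim
        for x in samples:
            y3 = sum(wi * xi for wi, xi in zip(w, x)) ** 3
            for i in range(dim):
                acc[i] += x[i] * y3
        w_new = [acc[i] / n - 3.0 * w[i] for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in w_new))
        w_new = [v / norm for v in w_new]
        # The direction is defined only up to sign, so compare |w_new . w| to 1.
        converged = abs(abs(sum(a * b for a, b in zip(w_new, w))) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w


# Toy data (an assumption for this sketch): two independent unit-variance
# uniform sources mixed by a rotation, so the mixtures stay roughly white.
rng = random.Random(0)
theta = 0.6
c, s = math.cos(theta), math.sin(theta)
half_width = math.sqrt(3.0)
mixed = []
for _ in range(5000):
    s1 = rng.uniform(-half_width, half_width)
    s2 = rng.uniform(-half_width, half_width)
    mixed.append((c * s1 - s * s2, s * s1 + c * s2))

w_est = fastica_one_unit(mixed, [1.0, 0.0])
```

With this starting vector the estimate should align, up to sign, with the first column of the rotation matrix, `(cos θ, sin θ)`. Further components would need a decorrelation step between units, which is omitted here.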

A promoter-level mammalian expression atlas

Bogumił Kaczkowski, Mutsumi Kanamori-Katayama, Charles Plessy, Michiel J. L. de Hoon +4 more

2014 · Nature · 2.2K citations · doi:10.1038/nature13182

Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly ‘housekeeping’, whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.

A study from the FANTOM consortium using single-molecule cDNA sequencing of transcription start sites and their usage in human and mouse primary cells, cell lines and tissues reveals insights into the specificity and diversity of transcription patterns across different mammalian cell types.

FANTOM5 (standing for functional annotation of the mammalian genome 5) is the fifth major stage of a major international collaboration that aims to dissect the transcriptional regulatory networks that define every human cell type. Two Articles in this issue of Nature present some of the project's latest results. The first paper uses the FANTOM5 panel of tissue and primary cell samples to define an atlas of active, in vivo bidirectionally transcribed enhancers across the human body. These authors show that bidirectional capped RNAs are a signature feature of active enhancers and identify more than 40,000 enhancer candidates from over 800 human cell and tissue samples. The enhancer atlas is used to compare regulatory programs between different cell types and identify disease-associated regulatory SNPs, and will be a resource for studies on cell-type-specific enhancers. In the second paper, single-molecule sequencing is used to map human and mouse transcription start sites and their usage in a panel of distinct human and mouse primary cells, cell lines and tissues to produce the most comprehensive mammalian gene expression atlas to date. The data provide a plethora of insights into open reading frames and promoters across different cell types in addition to valuable annotation of mammalian cell-type-specific transcriptomes.

SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information

Kai Dührkop, Markus Fleischauer, Marcus Ludwig, Alexander A. Aksenov +4 more

2019 · Nature Methods · 2.0K citations · doi:10.1038/s41592-019-0344-8

Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 ( https://bio.informatik.uni-jena.de/sirius/ ), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets. SIRIUS 4 is a fast and highly accurate tool for molecular structure interpretation from mass-spectrometry-based metabolomics data.

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

Michael U. Gutmann, Aapo Hyvärinen

2010 · Edinburgh Research Explorer · 1.4K citations

We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
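The logistic-regression idea in this abstract can be sketched for a one-dimensional toy case. The example below evaluates the noise-contrastive objective for an unnormalized Gaussian model whose log-normalizer `c` is treated as an ordinary parameter, using equal numbers of data and noise samples. The model, the Gaussian noise distribution, and all function names are assumptions made for illustration. No optimizer is included, only the objective that one would maximize.

```python
import math
import random


def log_sigmoid(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def log_model(u, sigma, c):
    """Unnormalized Gaussian log-density; c stands in for the log-normalizer."""
    return -u * u / (2.0 * sigma * sigma) + c


def log_noise(u, noise_sigma):
    """Log-density of the zero-mean Gaussian noise distribution."""
    return -u * u / (2.0 * noise_sigma ** 2) - math.log(noise_sigma * math.sqrt(2.0 * math.pi))


def nce_objective(data, noise, sigma, c, noise_sigma):
    """Average log-likelihood of the logistic 'data vs. noise' classifier."""
    def g(u):
        # Log-ratio of model to noise density: the regression nonlinearity.
        return log_model(u, sigma, c) - log_noise(u, noise_sigma)

    data_term = sum(log_sigmoid(g(x)) for x in data) / len(data)
    noise_term = sum(log_sigmoid(-g(y)) for y in noise) / len(noise)
    return data_term + noise_term


rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # observed data, N(0, 1)
noise = [rng.gauss(0.0, 2.0) for _ in range(2000)]  # artificial noise, N(0, 4)

true_c = -0.5 * math.log(2.0 * math.pi)  # exact log-normalizer for sigma = 1
j_true = nce_objective(data, noise, 1.0, true_c, 2.0)
j_wrong_c = nce_objective(data, noise, 1.0, 0.0, 2.0)
# A model identical to the noise cannot discriminate: objective is 2 * log(0.5).
j_equal = nce_objective(data, noise, 2.0, -math.log(2.0 * math.sqrt(2.0 * math.pi)), 2.0)
```

In this toy setting the objective is higher at the true parameters than at a wrong normalizer or at a model equal to the noise, which is the behaviour the consistency result would lead one to expect.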

Producing polished prokaryotic pangenomes with the Panaroo pipeline

Gerry Tonkin‐Hill, Neil MacAlasdair, Christopher Ruis, Aaron Weimann +4 more

2020 · Genome Biology · 1.3K citations · doi:10.1186/s13059-020-02090-4

Population-level comparisons of prokaryotic genomes must take into account the substantial differences in gene content resulting from horizontal gene transfer, gene duplication and gene loss. However, the automated annotation of prokaryotic genomes is imperfect, and errors due to fragmented assemblies, contamination, diverse gene families and mis-assemblies accumulate over the population, leading to profound consequences when analysing the set of all genes found in a species. Here, we introduce Panaroo, a graph-based pangenome clustering tool that is able to account for many of the sources of error introduced during the annotation of prokaryotic genome assemblies. Panaroo is available at https://github.com/gtonkinhill/panaroo .

Searching molecular structure databases with tandem mass spectra using CSI:FingerID

Kai Dührkop, Huibin Shen, Marvin Meusel, Juho Rousu +1 more

2015 · Proceedings of the National Academy of Sciences · 1.2K citations · doi:10.1073/pnas.1509788112

Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem MS to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. We present a method for searching molecular structure databases using tandem MS data of small molecules. Our method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. We use the fragmentation tree to predict the molecular structure fingerprint of the unknown compound using machine learning. This fingerprint is then used to search a molecular structure database such as PubChem. Our method is shown to improve on the competing methods for computational metabolite identification by a considerable margin.

Sampling Large Databases for Association Rules

Hannu Toivonen

1996 · 1.1K citations

Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
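The sample-then-verify idea from this abstract can be sketched as follows. This is a deliberately simplified illustration: it counts itemsets by brute force up to size two, verifies candidates against the whole database instead of only the remainder, and leaves out the check for missed itemsets and the optional second pass. The function names and the `slack` parameter (how far the threshold is lowered on the sample) are hypothetical.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force supports of all itemsets up to max_size items."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {i: cnt / n for i, cnt in counts.items() if cnt / n >= min_support}


def sample_then_verify(db, min_support, sample_size, slack, seed=0):
    """Find candidates on a random sample, then count them exactly in one pass."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)
    # A lowered threshold on the sample makes missing a truly frequent set unlikely.
    candidates = frequent_itemsets(sample, min_support - slack)
    counts = dict.fromkeys(candidates, 0)
    for t in db:  # single pass over the full database
        for itemset in candidates:
            if t.issuperset(itemset):
                counts[itemset] += 1
    n = len(db)
    return {i: cnt / n for i, cnt in counts.items() if cnt / n >= min_support}


# Toy database: 'a' in every 2nd transaction, 'b' in every 4th, 'c' in every 50th.
db = []
for i in range(1000):
    t = set()
    if i % 2 == 0:
        t.add("a")
    if i % 4 == 0:
        t.add("b")
    if i % 50 == 0:
        t.add("c")
    db.append(frozenset(t))

verified = sample_then_verify(db, min_support=0.2, sample_size=300, slack=0.05)
```

Because the final supports come from counting over the full database, the reported values are exact. Only the choice of candidates depends on the sample.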

Functional Materials Based on Self-Assembly of Polymeric Supramolecules

Olli Ikkala, Gerrit ten Brinke

2002 · Science · 1.0K citations · doi:10.1126/science.1067794

Self-assembly of polymeric supramolecules is a powerful tool for producing functional materials that combine several properties and may respond to external conditions. We illustrate the concept using a comb-shaped architecture. Examples include the hexagonal self-organization of conjugated conducting polymers and the polarized luminance in solid-state films of rodlike polymers obtained by removing the hydrogen-bonded side chains from the aligned thermotropic smectic phase. Hierarchically structured materials obtained by applying different self-organization and recognition principles and directed assembly form a basis for tunable nanoporous materials, smart membranes, preparation of nano-objects, and anisotropic properties, such as proton conductivity.

Noise2Noise: Learning Image Restoration without Clean Data

Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine +3 more

2018 · arXiv (Cornell University) · 986 citations · doi:10.48550/arxiv.1803.04189

We apply basic statistical reasoning to signal reconstruction by machine learning -- learning to map corrupted observations to clean signals -- with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance at and sometimes exceeding training using clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans -- all corrupted by different processes -- based on noisy data only.
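One way to read the statistical reasoning mentioned in this abstract: under a squared-error loss the best constant prediction is the mean of the targets, so targets corrupted by zero-mean noise pull the estimate toward the same value as clean targets would. The scalar sketch below is only an illustration of that point under those two assumptions (squared-error loss, zero-mean noise). It is not the paper's network or training setup, and `fit_constant` is a hypothetical name.

```python
import random


def fit_constant(targets, steps=200, lr=0.1):
    """Minimize the mean squared error to the targets by gradient descent."""
    z = 0.0
    mean_target = sum(targets) / len(targets)
    for _ in range(steps):
        # d/dz of mean((z - t)^2) equals 2 * (z - mean(t)).
        grad = 2.0 * (z - mean_target)
        z -= lr * grad
    return z


rng = random.Random(0)
clean_value = 3.0
# Only corrupted observations are available; the clean value is never used in fitting.
noisy_targets = [clean_value + rng.gauss(0.0, 1.0) for _ in range(10000)]
estimate = fit_constant(noisy_targets)
```

The estimate should land close to the clean value even though no clean target was ever seen during fitting.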

LoRDEC: accurate and efficient long read error correction

Leena Salmela, Éric Rivals

2014 · Bioinformatics · 967 citations · doi:10.1093/bioinformatics/btu538

MOTIVATION: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space. RESULTS: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.
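A first step toward the hybrid correction described in this abstract can be sketched as follows: collect the k-mers that occur often enough in the short reads (the nodes of a de Bruijn graph), then flag the positions in a long read whose k-mers are absent from that set. This is a simplified illustration only. It uses a plain Python set where the paper uses a succinct graph, ignores reverse complements, and does not perform the path search that produces the corrected sequence. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def solid_kmers(short_reads, k, min_count):
    """Return the k-mers seen at least min_count times in the short reads."""
    counts = Counter(
        read[i:i + k]
        for read in short_reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, cnt in counts.items() if cnt >= min_count}


def weak_positions(long_read, solid, k):
    """Start positions of long-read k-mers that are not in the solid set."""
    return [
        i for i in range(len(long_read) - k + 1)
        if long_read[i:i + k] not in solid
    ]


short_reads = ["ACGTAC", "CGTACG", "ACGTACG"]
solid = solid_kmers(short_reads, k=3, min_count=2)

# The long read differs from the short-read sequence at index 4 (A replaced by T).
weak = weak_positions("ACGTTCG", solid, k=3)
```

Here every k-mer overlapping the substituted base is flagged, which marks the region that a graph traversal would then try to replace.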

SynergyFinder 2.0: visual analytics of multi-drug combination synergies

Aleksandr Ianevski, Anil K. Giri, Tero Aittokallio

2020 · Nucleic Acids Research · 940 citations · doi:10.1093/nar/gkaa216

SynergyFinder (https://synergyfinder.fimm.fi) is a stand-alone web-application for interactive analysis and visualization of drug combination screening data. Since its first release in 2017, SynergyFinder has become a widely used web-tool both for the discovery of novel synergistic drug combinations in pre-clinical model systems (e.g. cell lines or primary patient-derived cells), and for better understanding of mechanisms of combination treatment efficacy or resistance. Here, we describe the latest version of SynergyFinder (release 2.0), which has extensively been upgraded through the addition of novel features supporting especially higher-order combination data analytics and exploratory visualization of multi-drug synergy patterns, along with automated outlier detection procedure, extended curve-fitting functionality and statistical analysis of replicate measurements. A number of additional improvements were also implemented based on the user requests, including new visualization and export options, updated user interface, as well as enhanced stability and performance of the web-tool. With these improvements, SynergyFinder 2.0 is expected to greatly extend its potential applications in various areas of multi-drug combinatorial screening and precision medicine.

Independent component approach to the analysis of EEG and MEG recordings

Ricardo Vigário, Jaakko Särelä, V. Jousmiki, Matti Hämäläinen +1 more

2000 · IEEE Transactions on Biomedical Engineering · 807 citations · doi:10.1109/10.841330

Multichannel recordings of the electromagnetic fields emerging from neural currents in the brain generate large amounts of data. Suitable feature extraction methods are, therefore, useful to facilitate the representation and interpretation of the data. Recently developed independent component analysis (ICA) has been shown to be an efficient tool for artifact identification and extraction from electroencephalographic (EEG) and magnetoencephalographic (MEG) recordings. In addition, ICA has been applied to the analysis of brain signals evoked by sensory stimuli. This paper reviews our recent results in this field.

Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data

Aleksandr Ianevski, Anil K. Giri, Tero Aittokallio

2022 · Nature Communications · 772 citations · doi:10.1038/s41467-022-28803-w

Identification of cell populations often relies on manual annotation of cell clusters using established marker genes. However, the selection of marker genes is a time-consuming process that may lead to sub-optimal annotations as the markers must be informative of both the individual cell clusters and various cell types present in the sample. Here, we developed a computational platform, ScType, which enables a fully-automated and ultra-fast cell-type identification based solely on a given scRNA-seq data, along with a comprehensive cell marker database as background information. Using six scRNA-seq datasets from various human and mouse tissues, we show how ScType provides unbiased and accurate cell type annotations by guaranteeing the specificity of positive and negative marker genes across cell clusters and cell types. We also demonstrate how ScType distinguishes between healthy and malignant cell populations, based on single-cell calling of single-nucleotide variants, making it a versatile tool for anticancer applications. The widely applicable method is deployed both as an interactive web-tool ( https://sctype.app ), and as an open-source R-package.

OP-ELM: Optimally Pruned Extreme Learning Machine

Yoan Miché, A. Sorjamaa, Patrick Bas, Olli Simula +2 more

2009 · IEEE Transactions on Neural Networks · 765 citations · doi:10.1109/tnn.2009.2036259

In this brief, the optimally pruned extreme learning machine (OP-ELM) methodology is presented. It is based on the original extreme learning machine (ELM) algorithm with additional steps to make it more robust and generic. The whole methodology is presented in detail and then applied to several regression and classification problems. Results for both computational time and accuracy (mean square error) are compared to the original ELM and to three other widely used methodologies: multilayer perceptron (MLP), support vector machine (SVM), and Gaussian process (GP). As the experiments for both regression and classification illustrate, the proposed OP-ELM methodology performs several orders of magnitude faster than the other algorithms used in this brief, except the original ELM. Despite the simplicity and fast performance, the OP-ELM is still able to maintain an accuracy that is comparable to the performance of the SVM. A toolbox for the OP-ELM is publicly available online.

The transcriptional landscape of age in human peripheral blood

Marjolein J. Peters, Roby Joehanes, Luke C. Pilling, Claudia Schurmann +4 more

2015 · Nature Communications · 752 citations · doi:10.1038/ncomms9570

Disease incidences increase with age, but the molecular characteristics of ageing that lead to increased disease susceptibility remain inadequately understood. Here we perform a whole-blood gene expression meta-analysis in 14,983 individuals of European ancestry (including replication) and identify 1,497 genes that are differentially expressed with chronological age. The age-associated genes do not harbor more age-associated CpG-methylation sites than other genes, but are instead enriched for the presence of potentially functional CpG-methylation sites in enhancer and insulator regions that associate with both chronological age and gene expression levels. We further used the gene expression profiles to calculate the 'transcriptomic age' of an individual, and show that differences between transcriptomic age and chronological age are associated with biological features linked to ageing, such as blood pressure, cholesterol levels, fasting glucose, and body mass index. The transcriptomic prediction model adds biological relevance and complements existing epigenetic prediction models, and can be used by others to calculate transcriptomic age in external cohorts.

SynergyFinder 3.0: an interactive analysis and consensus interpretation of multi-drug synergies across multiple samples

Aleksandr Ianevski, Anil K. Giri, Tero Aittokallio

2022 · Nucleic Acids Research · 716 citations · doi:10.1093/nar/gkac382

SynergyFinder (https://synergyfinder.fimm.fi) is a free web-application for interactive analysis and visualization of multi-drug combination response data. Since its first release in 2017, SynergyFinder has become a popular tool for multi-dose combination data analytics, partly because the development of its functionality and graphical interface has been driven by a diverse user community, including both chemical biologists and computational scientists. Here, we describe the latest upgrade of this community-effort, SynergyFinder release 3.0, introducing a number of novel features that support interactive multi-sample analysis of combination synergy, a novel consensus synergy score that combines multiple synergy scoring models, and an improved outlier detection functionality that eliminates false positive results, along with many other post-analysis options such as weighting of synergy by drug concentrations and distinguishing between different modes of synergy (potency and efficacy). Based on user requests, several additional improvements were also implemented, including new data visualizations and export options for multi-drug combinations. With these improvements, SynergyFinder 3.0 supports robust identification of consistent combinatorial synergies for multi-drug combinatorial discovery and clinical translation.

Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics

Michael U. Gutmann, Aapo Hyvärinen

2012 · Edinburgh Research Explorer · 633 citations · doi:10.5555/2503308.2188396

We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: unnormalized models, partition function, computation, estimation, natural image statistics

Climate Change and Weather Extremes in the Eastern Mediterranean and Middle East

George Zittis, M. Almazroui, Pinhas Alpert, Philippe Ciais +4 more

2022 · Reviews of Geophysics · 617 citations · doi:10.1029/2021rg000762

Observation‐based and modeling studies have identified the Eastern Mediterranean and Middle East (EMME) region as a prominent climate change hotspot. While several initiatives have addressed the impacts of climate change in parts of the EMME, here we present an updated assessment, covering a wide range of timescales, phenomena and future pathways. Our assessment is based on a revised analysis of recent observations and projections and an extensive overview of the recent scientific literature on the causes and effects of regional climate change. Greenhouse gas emissions in the EMME are growing rapidly, surpassing those of the European Union, hence contributing significantly to climate change. Over the past half‐century and especially during recent decades, the EMME has warmed significantly faster than other inhabited regions. At the same time, changes in the hydrological cycle have become evident. The observed recent temperature increase of about 0.45°C per decade is projected to continue, although strong global greenhouse gas emission reductions could moderate this trend. In addition to projected changes in mean climate conditions, we call attention to extreme weather events with potentially disruptive societal impacts. These include the strongly increasing severity and duration of heatwaves, droughts and dust storms, as well as torrential rain events that can trigger flash floods. Our review is complemented by a discussion of atmospheric pollution and land‐use change in the region, including urbanization, desertification and forest fires. Finally, we identify sectors that may be critically affected and formulate adaptation and research recommendations toward greater resilience of the EMME region to climate change.

Permutation Tests for Studying Classifier Performance

Markus Ojala, Gemma C. Garriga

2009 · 577 citations · doi:10.1109/icdm.2009.108

We explore the framework of permutation-based p-values for assessing the behavior of the classification error. In this paper we study two simple permutation tests. The first test estimates the null distribution by permuting the labels in the data; this has been used extensively in classification problems in computational biology. The second test produces permutations of the features within classes, inspired by restricted randomization techniques traditionally used in statistics. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classification error via permutation tests is effective; in particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.
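The first test described in this abstract, permuting the class labels to estimate a null distribution for the classification error, can be sketched as below. To keep the example short it uses the training error of a one-dimensional nearest-class-mean classifier, which is a simplification chosen for this sketch; a cross-validated error would be the more usual choice. The function names and the toy data are hypothetical.

```python
import random


def nearest_mean_error(values, labels):
    """Training error of a 1-D nearest-class-mean classifier (labels are 0 or 1)."""
    means = {}
    for cls in (0, 1):
        members = [v for v, lab in zip(values, labels) if lab == cls]
        means[cls] = sum(members) / len(members)
    wrong = 0
    for v, lab in zip(values, labels):
        pred = 0 if abs(v - means[0]) <= abs(v - means[1]) else 1
        wrong += pred != lab
    return wrong / len(values)


def label_permutation_p_value(values, labels, n_perm=199, seed=0):
    """Empirical p-value: share of label permutations with error <= observed."""
    rng = random.Random(seed)
    observed = nearest_mean_error(values, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between values and labels
        if nearest_mean_error(values, shuffled) <= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (at_least_as_good + 1) / (n_perm + 1)


# Two well-separated classes, so the observed error should be hard to match by chance.
values = [0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
labels = [0] * 10 + [1] * 10
p_value = label_permutation_p_value(values, labels)
```

A small p-value indicates that the classifier's error on the real labels is unlikely under random labelling. The second test in the abstract, which permutes features within classes, would need multi-dimensional data and is not shown.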

Interaction in 4-second bursts

Antti Oulasvirta, Sakari Tamminen, Virpi Roto, Jaana Kuorelahti

2005 · 568 citations · doi:10.1145/1054972.1055101

When on the move, cognitive resources are reserved partly for passively monitoring and reacting to contexts and events, and partly for actively constructing them. The Resource Competition Framework (RCF), building on the Multiple Resources Theory, explains how psychosocial tasks typical of mobile situations compete for cognitive resources and then suggests that this leads to the depletion of resources for task interaction and eventually results in the breakdown of fluent interaction. RCF predictions were tested in a semi-naturalistic field study measuring attention during the performance of assigned Web search tasks on mobile phone while moving through nine varied but typical urban situations. Notably, we discovered up to eight-fold differentials between micro-level measurements of attentional resource fragmentation, for example from spans of over 16 seconds in a laboratory condition dropping to bursts of just a few seconds in difficult mobile situations. By calibrating perceptual sampling, reducing resources from tasks of secondary importance, and resisting the impulse to switch tasks before finalization, participants compensated for the resource depletion. The findings are compared to previous studies in office contexts. The work is valuable in many areas of HCI dealing with mobility.
