Samsung (Poland)
Company, Warsaw, Poland
Research output, citation impact, and the most-cited recent papers from Samsung (Poland). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Samsung (Poland)
This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than model-generated summaries of news, in contrast with human evaluators' judgement. This suggests that the challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogue corpus, manually annotated with abstractive summaries, which can be used by the research community for further studies.
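The ROUGE comparison mentioned above rests on word overlap between a generated summary and a reference. As a rough illustration only, here is a minimal sketch of a ROUGE-1 F1 score in Python, assuming plain whitespace tokenization and lowercasing; the function name `rouge1_f1` and the example strings are made up for this sketch, and published evaluations typically use a full ROUGE toolkit with stemming and further variants.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.

    Simplifying assumptions (not from the paper): whitespace tokens,
    lowercasing, no stemming, a single reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 of 5 candidate words match all 3 reference words.
score = rouge1_f1("amanda baked cookies for jerry", "amanda baked cookies")
```

A score like this rewards shared vocabulary but says nothing about whether the summary identifies the right speakers or facts, which is one plausible reason overlap metrics and human judgement can diverge on dialogues.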
Predictive modeling is increasingly dominated by flexible yet complex methods such as neural networks or ensembles (model stacking, boosting, or bagging). Such methods are usually described by a large number of parameters or hyperparameters, the price one pays for flexibility. The sheer number of parameters makes models hard to understand. This paper describes a consistent collection of explainers for predictive models, a.k.a. black boxes. Each explainer is a technique for exploration of a black box model. The presented approaches are model-agnostic, which means that they extract useful information from any predictive method regardless of its internal structure. Each explainer is linked with a specific aspect of a model. Some are useful in decomposing predictions, some serve better in understanding performance, while others are useful in understanding the importance and conditional responses of a particular variable. Every explainer presented in this paper works for a single model or for a collection of models. In the latter case, models can be compared against each other. Such comparison helps to find strengths and weaknesses of different approaches and gives additional possibilities for model validation. The presented explainers are implemented in the DALEX package for R. They are based on a uniform standardized grammar of model exploration which may be easily extended. The current implementation supports the most popular frameworks for classification and regression.
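To make "model-agnostic" concrete, the sketch below shows one widely used explainer of variable importance, permutation importance, in plain Python. It is an illustration of the general idea under stated assumptions, not the DALEX implementation or its API: the names `permutation_importance` and `mse` are hypothetical, and the model is treated purely as a `predict` function.

```python
import random
from typing import Callable, List, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error, used here as the loss function."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[List[float]], float],
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    column: int,
    loss: Callable[[Sequence[float], Sequence[float]], float] = mse,
    seed: int = 0,
) -> float:
    """Increase in loss after shuffling one column of X.

    Only `predict` is called, so the model's internals never matter.
    """
    rng = random.Random(seed)
    baseline = loss(y, [predict(list(row)) for row in X])
    # Shuffle the chosen column to break its link with the target.
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [list(row) for row in X]
    for row, value in zip(X_perm, shuffled):
        row[column] = value
    permuted = loss(y, [predict(row) for row in X_perm])
    return permuted - baseline


# Toy black box that only looks at the first variable.
X = [[float(i), float((3 * i) % 8)] for i in range(8)]
y = [2.0 * row[0] for row in X]
black_box = lambda row: 2.0 * row[0]
imp_used = permutation_importance(black_box, X, y, column=0)
imp_unused = permutation_importance(black_box, X, y, column=1)
```

Because the toy model ignores the second variable, shuffling it leaves the loss unchanged, while shuffling the first variable degrades the predictions. Running the same function on several models gives the kind of side-by-side comparison the abstract describes.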
Container technology has revolutionized the way software is being packaged and run. The telecommunications industry, now challenged with the 5G transformation, views containers as the best way to achieve agile infrastructure that can serve as a stable base for high throughput and low latency for 5G edge applications. These challenges make optimal scheduling of performance-sensitive containerized workflows a matter of emerging importance. Meanwhile, the wide adoption of Kubernetes across industries has placed it as a de facto standard for container orchestration. Several attempts have been made to improve Kubernetes scheduling, but the existing solutions either do not respect current scheduling rules or only consider a static infrastructure viewpoint. To address this, we propose NetMARKS, a novel approach to Kubernetes pod scheduling that uses dynamic network metrics collected with Istio Service Mesh. This solution improves Kubernetes scheduling while being fully backward compatible. We validated our solution using different workloads and processing layouts. Based on our analysis, NetMARKS can reduce application response time by up to 37 percent and save up to 50 percent of inter-node bandwidth in a fully automated manner. This significant improvement is crucial to Kubernetes adoption in 5G use cases, especially for multi-access edge computing and machine-to-machine communication.
In this paper, we describe our method for DCASE 2019 Task 3: Sound Event Localization and Detection (SELD). We use four CRNN SELDnet-like single-output models which run in a consecutive manner to recover all possible information about occurring events. We decompose the SELD task into estimating the number of active sources, estimating the direction of arrival of a single source, estimating the direction of arrival of the second source when the direction of the first one is known, and a multi-label classification task. We use a custom consecutive ensemble to predict events' onset, offset, direction of arrival and class. The proposed approach is evaluated on the TAU Spatial Sound Events 2019 - Ambisonic dataset and compared with other participants' submissions.
The Key Information Extraction (KIE) task is increasingly important in natural language processing, but there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved an 81.77% and an 83.57% F1-score on the Kleister NDA and the Kleister Charity datasets, respectively. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
This paper describes our proposed solutions designed for the STS core track within the SemEval 2016 English Semantic Textual Similarity (STS) task. Our method of similarity detection combines recursive autoencoders with a WordNet award-penalty system that accounts for semantic relatedness, and an SVM classifier, which produces the final score from similarity matrices. This solution is further supported by an ensemble classifier, combining an aligner with a bi-directional Gated Recurrent Neural Network and additional features, which then performs Linear Support Vector Regression to determine another set of scores.
This paper describes the submission to IWSLT 2020. We took part in the offline end-to-end English-to-German TED lectures translation task. We based our solution on our last year's submission. We used a slightly altered Transformer (Vaswani et al., 2017) architecture with a ResNet-like (He et al., 2016) convolutional layer preparing the audio input to the Transformer encoder. To improve the model's quality of translation we introduced two regularization techniques and trained on a machine-translated LibriSpeech (Panayotov et al., 2015) corpus in addition to the iwslt-corpus, TEDLIUM2 (Rousseau et al., 2014) and MuST-C (Di Gangi et al., 2019) corpora. Our best model scored almost 3 BLEU higher than last year's model. To segment the 2020 test set we used exactly the same procedure as last year.
We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword x in another subword y of a given text, assuming that |y| = O(|x|), which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding δ-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed δ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of efficient solutions for subword suffix rank & selection, as well as substring compression using the Burrows-Wheeler transform composed with run-length encoding. The model of internal queries in texts is connected to the well-studied problem of text indexing. Both models have their origins in the introduction of suffix trees. However, there is an important difference: in our model the size of the representation of a query is constant, which enables faster query time. Our results can be viewed as efficient solutions to “internal” equivalents of several basic problems of regular pattern matching and improve on a majority of already published results related to internal queries.
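For readers unfamiliar with the queried notions, the following Python sketch spells out what "period" and "primitivity" of a subword mean. It is a naive baseline, not the paper's data structure: each query copies the subword and runs the classical linear-time prefix function on it, whereas the abstract is about answering such queries far faster after preprocessing. The function names and the half-open index convention `text[i:j]` are assumptions made for this sketch.

```python
from typing import List


def prefix_function(s: str) -> List[int]:
    """pi[i] = length of the longest proper border of s[:i + 1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def smallest_period(s: str) -> int:
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]


def is_primitive(s: str) -> bool:
    """A word is primitive if it is not a power u^k with k >= 2."""
    n = len(s)
    if n == 0:
        return False
    p = smallest_period(s)
    return not (p < n and n % p == 0)


def subword_period(text: str, i: int, j: int) -> int:
    """Naive internal period query on the subword text[i:j]."""
    return smallest_period(text[i:j])


text = "abaababaab"
period = subword_period(text, 0, 5)       # subword "abaab"
primitive = is_primitive(text[3:7])       # subword "abab"
```

Each naive query here costs time proportional to the subword length, which is the cost that a preprocessed internal-query structure aims to avoid.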
In this paper we propose a novel reservation plan adaptation system based on machine learning. In the context of cloud auto-scaling, an important issue is the ability to define and use a resource reservation plan, which enables efficient resource scheduling. If necessary, the plan may allocate new resources upon reservation where a sufficient amount of resources is available. Our solution allows the updating of a reservation plan initially prepared by an administrator. It makes it possible to adapt reservation plans one or more weeks ahead. Hence, it allows time for the administrator to analyze the plan and discover potential problems with resource under-provisioning or over-provisioning, which may prevent server overload in the former case and unnecessary expenses in the latter. It also makes it possible to extract and analyze the knowledge learned, which may provide useful information about resource usage characteristics. The proposed solution is tested on OpenStack using real Wikipedia server traffic data. Experimental results demonstrate that machine learning enables an improvement in resource usage.
A radio map, which consists of a number of Wi-Fi fingerprints, has a tendency to become outdated over time. This results from both the physical nature of the Wi-Fi signal and variable indoor environment conditions, which have an impact on received signal strength. Therefore, maintenance of the collected Wi-Fi signal data is essential for preserving the quality of Wi-Fi based positioning systems. However, it is usually a time- and money-consuming process. In recent years a number of solutions have been developed to address the problem. Unfortunately, most of them require human intervention. Our research concentrates on improving the quality of an existing radio map by utilizing crowdsourced Wi-Fi fingerprints that are post-processed offline along with the logged pedestrian trajectory. The solution proposed in this paper focuses on a method which does not require human intervention, so the end users do not have to report their locations. The results observed during the conducted research show a significant improvement in positioning accuracy, in both surveyed and non-surveyed locations.
How language users become able to process forms they have never encountered in input is central to our understanding of language cognition. A range of models, including rule-based models, stochastic models, and analogy-based models have been proposed to account for this ability. Despite the fact that all three models are reasonably successful, we argue that productivity in language is more insightfully captured through learnability than by rules or probabilities. Using a combination of computational modelling and behavioural experimentation we show that the basic principle of error-driven learning allows language users to detect relevant patterns of any degree of systematicity. In case of allomorphy, these patterns are found at a level that cuts across phonology and morphology and is not considered by mainstream approaches to language. Our findings thus highlight how a learning-based approach applies to phenomena on the continuum from rule-based over probabilistic to “unruly” and constrains our inferences about the types of structures that should be targeted on a cognitively realistic account of allomorphic representation.
A laboratory-scale sequencing batch reactor (SBR), fed with synthetic wastewater containing a mixture of organic compounds, was operated for nearly six months. Despite maintaining the same operational conditions, a deterioration of enhanced biological phosphorus removal (EBPR) occurred after 40 days of SBR operation. The Prel/Cupt ratio decreased from 0.28 to 0.06 P-mol C-mol−1, and C requirements increased from 11 to 32 mg C h−1 g−1 of mixed liquor suspended solids. A FISH analysis showed that the percentage of Accumulibacter in an overall community of polyphosphate accumulating organisms (PAOs) and glycogen accumulating organisms (GAOs) dropped from 93% to 13%. An increase in abundance of Gammaproteobacteria (from 2.6% to 22%) and Alphaproteobacteria (from 1.8% to 10%) was observed. The number of Competibacter increased from 0.5% to nearly 9%. Clusters 1 and 2 of Defluviicoccus-related GAOs, not detected before deterioration, constituted 35% and 27% of Alphaproteobacteria, respectively. We concluded that lab-scale experiments should not be extended implicitly to full-scale EBPR systems because some bacterial groups are detected mainly in lab-scale reactors. Well-defined, lab-scale operational conditions reduce the number of ecological niches available to bacteria.
We present our Generative Enhanced Model (GEM), which we used to create samples awarded the first prize in the FEVER 2.0 Breakers Task. GEM is an extended language model built on the GPT-2 architecture. The addition of a novel target vocabulary input to the already existing context input enabled controlled text generation. The training procedure produced a model that inherited the knowledge of the pretrained GPT-2 and was therefore ready to generate natural-like English sentences in the task domain with some additional control. As a result, GEM generated malicious claims that mixed facts from various articles, so it became difficult to classify their truthfulness.
The detailed architectural examination of the neuronal nuclei in any brain region, using confocal microscopy, requires quantification of fluorescent signals in three-dimensional stacks of confocal images. An essential prerequisite to any quantification is the segmentation of the nuclei, which are typically tightly packed in the tissue, the extreme being the hippocampal dentate gyrus (DG), in which nuclei frequently appear to overlap due to limitations in microscope resolution. Segmentation in DG is a challenging task due to the presence of a significant amount of image artifacts and densely packed nuclei. Accordingly, we established an algorithm based on a continuous boundary tracing criterion, aiming to reconstruct the nucleus surface and to separate the adjacent nuclei. The presented algorithm neither uses a pre-built nucleus model nor performs image thresholding, which makes it robust against variations in image intensity and poor contrast. Further, the reconstructed surface is used to study the morphology and spatial arrangement of the nuclear interior. The presented method is generally dedicated to segmentation of crowded, overlapping objects in 3D space. In particular, it allows us to study quantitatively the architecture of the neuronal nucleus using a confocal-microscopic approach.
High deployment cost with respect to expected revenue is the main barrier to fiber-to-the-home (FTTH) roll-out in rural areas. This problem, as shown in this paper, is exacerbated by the uncertainty associated with the end-user take-up rate. The randomness associated with subscribers’ service take-up yields considerable fluctuation and escalation in the total cost of deployment. This adverse and varying environment makes it difficult to produce firm business cases and can increase the reluctance of potential investors and incumbent operators to deploy FTTH access networks. In this paper, we develop a holistic framework for examining real-world FTTH deployment scenarios, taking as a case study one of the most rural counties in Ireland. Further, we carry out an in-depth techno-economic analysis identifying the methods more applicable in the rural scenario. We analyze the cost effectiveness of FTTH deployment, also proposing solutions that provide different levels of upfront investment risk, relating it to uncertainty in customers’ take-up rates. For example, we show how a lower take-up rate can be made profitable by adopting a strategy that favors lower upfront costs at the expense of higher connectivity costs.
Maps used in car navigation systems differ from indoor maps because they were designed to address different needs. Therefore, map matching techniques used for outdoor navigation cannot be applied indoors without significant modifications. The main aim of this paper is to present a modified map matching algorithm for special uses such as indoor navigation in shopping malls, hotels and airports. The research concentrates on improving the accuracy of the positioning engine and the natural presentation of the pedestrian trajectory. The article includes the indoor map matching algorithm proposals, highlighting their strengths and weaknesses.
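As background for what map matching does at its simplest, the sketch below snaps an estimated position onto the nearest point of a set of walkable segments (for example, corridor centre lines). This is a generic geometric baseline written for illustration, not the modified algorithm the paper proposes; the function names, the 2D coordinates and the segment representation are all assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def project_to_segment(p: Point, a: Point, b: Point) -> Point:
    """Closest point to p on the segment from a to b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return a  # Degenerate segment: both endpoints coincide.
    # Parameter of the orthogonal projection, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def match_to_map(p: Point, segments: List[Segment]) -> Point:
    """Snap p to the nearest point on any walkable segment."""
    candidates = (project_to_segment(p, a, b) for a, b in segments)
    return min(candidates, key=lambda q: math.dist(p, q))


# Hypothetical L-shaped corridor and a noisy position estimate.
corridor = [((0.0, 0.0), (10.0, 0.0)), ((10.0, 0.0), (10.0, 10.0))]
snapped = match_to_map((4.0, 1.0), corridor)
```

Nearest-segment snapping treats every position independently, so it can jump between adjacent corridors; indoor approaches typically add constraints from walls, the previous matched position, or the trajectory history.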
Alessandro Seganti, Klaudia Firląg, Helena Skowronska, Michał Satława, Piotr Andruszkiewicz. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
This paper describes our submission to the TL;DR challenge. Neural abstractive summarization models have been successful in generating fluent and consistent summaries with advancements like the copy (Pointer-generator) and coverage mechanisms. However, these models suffer from their extractive nature as they learn to copy words from the source text. In this paper, we propose a novel abstractive model based on Variational Autoencoder (VAE) to address this issue. We also propose a Unified Summarization Framework for the generation of summaries. Our model eliminates non-critical information at a sentence level with an extractive summarization module and generates the summary word by word using an abstractive summarization module. To implement our framework, we combine submodules with state-of-the-art techniques including Pointer-Generator Network (PGN) and BERT while also using our new VAE-PGN abstractive model. We evaluate our model on the benchmark Reddit corpus as part of the TL;DR challenge and show that our model outperforms the baseline in ROUGE score while generating diverse summaries.
Computation offloading is one of the approaches used for increasing application efficiency and decreasing energy consumption on consumer devices, an issue especially important for mobile appliances. While some such systems have been previously designed, very little research has been directed towards offloading code from web applications, an alternative to native solutions that has recently been gaining in popularity. In this paper we attempt to narrow this gap by presenting the first practical system for offloading HTML5 web workers from mobile web applications. The system is transparent to the programmer, i.e., it does not require any additional modifications to the original application to indicate which code parts should be offloaded. The results of the experiments with various sample applications have shown that for sufficiently complicated computations the offloading system can decrease both the processing time and energy consumption by as much as several hundred percent.
This work evaluates the possibility of creating a high-frequency, SSVEP-based brain-computer interface (BCI) using low-cost EEG recording hardware, an Emotiv EEG Neuro-headset. Both of these aspects are crucial to deploying BCI technology in the consumer market. High frequencies can be used to create a non-tiring and more pleasant interface. Commercial EEG systems such as the Emotiv EEG, although they underperform considerably, are much more affordable than standard, clinical-grade EEG amplifiers. A system classifying between two stimuli and rest is designed and tested in two experiments, on five and ten subjects respectively. First, the accuracy of the system is compared for frequencies in a lower range (17Hz, 19Hz, 23Hz, 25Hz) and a higher range (31Hz, 33Hz, 37Hz, 40Hz). The mean online accuracy is 80%±15% for the former and 67%±12% for the latter. Second, a more thorough investigation is done by evaluating the system for frequencies within the range 35Hz-40Hz. Although the mean accuracy, 64% ± 22%, is relatively low, most of the users were able to achieve satisfying accuracy, with the mean reaching 82%±5%, which would allow for an efficient, and yet pleasant, usage of the BCI system. In each case a user-dependent approach is applied, with a calibration session lasting about five minutes. EEG feature extraction is done using common spatial pattern (CSP) filtering, canonical correlation analysis (CCA), and linear discriminant analysis (LDA).