National Library of Finland

Archive · Helsinki, Finland

Research output, citation impact, and the most-cited recent papers from National Library of Finland. Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works

205

Citations

1.0K

h-index

i10-index

Also known as

Kansalliskirjasto, National Library of Finland, Nationalbiblioteket

Top-cited papers from National Library of Finland

Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?

Kimmo Kettunen

2014 · Journal of Quantitative Linguistics · 107 citations · doi:10.1080/09296174.2014.911506

Type-token ratio (TTR), also known as vocabulary size divided by text length (V/N), is a simple measure of lexical diversity. It has been used in literary studies, studies of child language and even psychiatry. The basic problem of TTR is that it is affected by the length of the text sample. Several suggestions for remedying this fault have been given, including standardizing the length of text samples, using logarithms in the basic formula, etc. We show in this paper that simple TTR and its more elaborate calculation MATTR can be used to approximate the morphological complexity of languages. This use of TTR was noted by Juola in an analysis of six languages. We analyse text material with TTR and MATTR from two differing sources: firstly, the text of the EU constitution in 21 languages, and secondly, 16 of the same languages with available non-parallel random data from the Leipzig corpus. We compare the automatic analysis results to two independent linguistic measures of morphological complexity. Firstly, we use the number of non-homographic noun forms in a language's inflectional paradigms, the paradigm size. Secondly, we use available inflectional synthesis figures of verbs produced by the AUTOTYP project. We enrich our corpus findings with data from information retrieval (IR) results. It has been suggested that the improvements in IR effectiveness achieved by managing word form variation depend on the morphological complexity of the languages. Thus this IR gain data can be used as independent evidence in evaluating morphological complexity. Our results show that earlier Juola complexity figures and the TTR and MATTR calculations correlate moderately in the EU constitution data. Figures given by TTR and MATTR correlate highly with each other in both corpora, and they also correlate highly with the number of non-homographic noun forms in a language. Correlation with inflectional synthesis of the verbs was weakly positive in most cases, but the data was scarce. All three computed measures are able to order the languages quite meaningfully by morphological complexity: the ordering at least groups most of the languages with languages of the same kind, and the most and least complex languages are clearly separated. It also seems that TTR and MATTR order the languages quite consistently in both corpora. In the conclusion we discuss how the complexity figures can be utilized.

Acknowledgements: Most of this paper was finished while the author was visiting UFAM, Universidade Federal do Amazonas, Institute of Computing, and funded by FAPEAM, Fundação de Amparo à Pesquisa do Estado do Amazonas (http://www.fapeam.am.gov.br/) with grant number 159/2012. The author wishes to thank the anonymous referee of the Journal of Quantitative Linguistics for useful comments when preparing the final version of the paper. Prof. Johanna Nichols kindly provided the AUTOTYP verbal inflection synthesis database for use.

Notes: * Address correspondence to: Kimmo Kettunen, Kiviteltankatu 6 A 11, FIN-00710 Helsinki. Tel: +358 50 5710859. E-mail: kkettun4@welho.com. 1. Strictly speaking, Ehret and Szmrecsanyi use a variant of the Juola method, where 10% of the characters in the samples are removed randomly and the resulting file is compressed. This approach is introduced in Juola (2008) and is evidently comparable to the distortion method of Juola (1998).
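The two measures named in the abstract above can be illustrated with a short Python sketch. TTR is the vocabulary size V divided by the text length N, as the abstract states; MATTR is taken here to be the mean TTR over every fixed-length window that slides across the text one token at a time. The function names, the default window of 500 tokens, and the assumption that the text is already tokenized are illustrative choices, not details taken from the paper.

```python
from typing import Dict, List


def ttr(tokens: List[str]) -> float:
    """Type-token ratio: number of distinct tokens (V) over text length (N)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def mattr(tokens: List[str], window: int = 500) -> float:
    """Moving-average TTR: mean TTR over all sliding windows of `window` tokens.

    The window size is an assumed default; texts no longer than one window
    fall back to plain TTR.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) <= window:
        return ttr(tokens)

    # Count token frequencies in the first window.
    counts: Dict[str, int] = {}
    for tok in tokens[:window]:
        counts[tok] = counts.get(tok, 0) + 1
    total = len(counts) / window

    # Slide the window one token at a time, updating the counts incrementally.
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        entering = tokens[i]
        counts[entering] = counts.get(entering, 0) + 1
        total += len(counts) / window

    n_windows = len(tokens) - window + 1
    return total / n_windows
```

Because every window has the same length, MATTR avoids the dependence on sample length that the abstract identifies as the basic problem of plain TTR.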

Annif: DIY automated subject indexing using multiple algorithms

Osma Suominen

2019 · LIBER Quarterly: The Journal of the Association of European Research Libraries · 59 citations · doi:10.18352/lq.10285

Manually indexing documents for subject-based access is a labour-intensive process. We propose using metadata gathered from bibliographic databases to train algorithms that assist librarians in that work. We have developed Annif, an open source tool and microservice for automated subject indexing. After training it with a subject vocabulary and existing metadata, Annif can be used to assign subject headings for new documents. We have tested Annif with different document collections including scientific papers, old scanned books and contemporary e-books, Q&A pairs from an “ask a librarian” service, Finnish Wikipedia, and the archives of a local newspaper. The results of analysing scientific papers and current books have been reassuring, while other types of documents have proved to be more challenging. The current version is based on a combination of existing natural language processing and machine learning tools. By combining multiple approaches and existing open source algorithms, Annif can build on the strengths of individual algorithms and adapt to different settings. With Annif, we expect to improve subject indexing and classification processes especially for electronic documents as well as collections that otherwise would not be indexed at all.

Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians

Sarah Oberbichler, Emanuela Boroş, Antoine Doucet, Jani Marjanen +4 more

2021 · Journal of the Association for Information Science and Technology · 53 citations · doi:10.1002/asi.24565

This article considers the interdisciplinary opportunities and challenges of working with digital cultural heritage, such as digitized historical newspapers, and proposes an integrated digital hermeneutics workflow to combine purely disciplinary research approaches from computer science, humanities, and library work. Common interests and motivations of the above-mentioned disciplines have resulted in interdisciplinary projects and collaborations such as the NewsEye project, which is working on novel solutions on how digital heritage data is (re)searched, accessed, used, and analyzed. We argue that collaborations of different disciplines can benefit from a good understanding of the workflows and traditions of each of the disciplines involved but must find integrated approaches to successfully exploit the full potential of digitized sources. The paper is furthermore providing an insight into digital tools, methods, and hermeneutics in action, showing that integrated interdisciplinary research needs to build something in between the disciplines while respecting and understanding each other's expertise and expectations.

Etiology, syndrome diagnosis, and cognition in childhood‐onset epilepsy: A population‐based study

Arja Sokka, Päivi Olsén, Jarkko Kirjavainen, Maijakaisa Harju +4 more

2016 · Epilepsia Open · 40 citations · doi:10.1002/epi4.12036

OBJECTIVE: To evaluate the prevalence of various etiologies of epilepsies and epilepsy syndromes and to estimate cognitive function in cases of childhood-onset epilepsy. METHODS: A population-based retrospective registry study. We identified all medically treated children with epilepsy born in 1989-2007 in Finland's Kuopio University Hospital catchment area, combining data from the birth registry and the national registry of special-reimbursement medicines. We reevaluated the epilepsy diagnoses and syndromes and gathered data on etiologies and cognitive impairment. RESULTS: We identified 289 children with epilepsy. The annual incidence rate of epilepsies and epilepsy syndromes was 38 in 100,000, and the misdiagnosis rate was 3%. A specific etiology was identified in 65% of the cases, with a structural etiology accounting for 29% and a genetic or presumed genetic etiology for 32%. Most patients with unknown-etiology epilepsy had focal epilepsy and were of normal intelligence. Intellectual disability was detected in 35% of cases, and only 17% in this group had an unknown etiology for the epilepsy. Electroclinical syndromes (mainly West syndrome) were recognized in 35% of the patients. SIGNIFICANCE: Epilepsy is a complex disease that encompasses many etiologies and rare syndromes. The etiology and specific epilepsy syndrome are important determinants of the outcome and key factors in treatment selection. Etiological diagnosis can be achieved for the majority of children and syndromic diagnosis for only a third.

Fetal head size and effect of manual perineal protection

Magdalena Jansová, Vladimír Kališ, Zdeněk Rušavý, Sari Räisänen +2 more

2017 · PLoS ONE · 33 citations · doi:10.1371/journal.pone.0189842

OBJECTIVE: The aim of this study was to evaluate whether a previously identified modification of the Viennese method of perineal protection remains most effective for reduction of perineal tension in cases with substantially smaller or larger fetal heads. METHODS: A previously designed finite element model was used to compare perineal tension of different modifications of the Viennese method of perineal protection to the "hands-off" technique for three different sizes of the fetal head. The quantity and extent of tension throughout the perineal body during vaginal delivery, at the time when the suboccipito-bregmatic circumference passes between the fourchette and the lower margin of the pubis, were determined. RESULTS: The order of effectiveness of different modifications of manual perineal protection was similar for all three sizes of fetal head. The reduction of perineal tension was most significant in delivery simulations with larger heads. The final position of fingers 2 cm anteriorly from the fourchette (y = +2) consistently remains most effective in reducing the tension. The extent of finger movement along the anterior-posterior axis (y-axis) contributes to the effectiveness of manual perineal protection. CONCLUSION: Appropriately performed Viennese manual perineal protection seems to reduce the perineal tension regardless of the fetal head size, and thus the method seems to be applicable to reduce risk of perineal trauma for all parturients.

Net Promoter Score as Indicator of Library Customers' Perception

Markku Laitinen

2018 · Journal of Library Administration · 31 citations · doi:10.1080/01930826.2018.1448655

The Net Promoter Score (NPS) is used in business to measure customers’ willingness to recommend a product, service, or enterprise as a whole to their friends or colleagues. Introduced by Fred Reichheld in 2003, NPS may answer the need of libraries to find easy and non-laborious methods of assessing the customers’ experience. It may target either the library as a whole or critical services that are the most relevant to the library's main goals. However, literature about the use of NPS in public sector organizations is sparse. This article examines the applicability of the NPS to data retrieved from the 2014–2016 user surveys of the Finna service of the National Library of Finland.
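The abstract above does not spell out how the score is computed. As a minimal sketch, the commonly used definition asks respondents to rate their likelihood to recommend on a 0 to 10 scale, counts ratings of 9 or 10 as promoters and ratings of 0 to 6 as detractors, and reports the percentage of promoters minus the percentage of detractors. Whether the Finna surveys used exactly this scale is an assumption here, and the function name is hypothetical.

```python
from typing import Iterable


def net_promoter_score(ratings: Iterable[int]) -> float:
    """Standard NPS: % promoters (9-10) minus % detractors (0-6).

    Assumes integer ratings on the usual 0-10 scale; ratings of 7-8
    (passives) count toward the total but toward neither group.
    """
    values = list(ratings)
    if not values:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 10 for r in values):
        raise ValueError("ratings must lie between 0 and 10")

    promoters = sum(1 for r in values if r >= 9)
    detractors = sum(1 for r in values if r <= 6)
    return 100.0 * (promoters - detractors) / len(values)
```

The result ranges from -100 (all detractors) to +100 (all promoters), which is part of what makes the measure easy to report and compare across survey years.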

Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach

Anni Järvelin, Heikki Keskustalo, Eero Sormunen, Miamaria Saastamoinen +1 more

2015 · Journal of the Association for Information Science and Technology · 24 citations · doi:10.1002/asi.23379

The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations was measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of a historical newspaper collection, the occurrences of a word are typically spread over many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
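The expansion step described above, finding the index words most similar to a query term, can be sketched with Python's standard library. This example substitutes `difflib.get_close_matches`, which ranks candidates by a sequence-matching similarity ratio, for whatever approximate string matching methods the study actually used. The function name, the 0.7 similarity cutoff, and the sample index words are illustrative assumptions; the default of 30 variants echoes the figure reported in the abstract.

```python
import difflib
from typing import Iterable, List


def expand_query(
    term: str,
    index_words: Iterable[str],
    max_variants: int = 30,
    cutoff: float = 0.7,
) -> List[str]:
    """Return up to `max_variants` index words most similar to `term`.

    Similarity is difflib's ratio (a stand-in for the study's methods);
    words scoring below `cutoff` are dropped. Results are ordered from
    most to least similar.
    """
    vocabulary = sorted({w.lower() for w in index_words})
    return difflib.get_close_matches(
        term.lower(), vocabulary, n=max_variants, cutoff=cutoff
    )


# Hypothetical index containing an inflected form and an OCR-style error.
sample_index = ["sanomalehti", "sanomalehden", "sanomalehtl", "helsinki"]
variants = expand_query("sanomalehti", sample_index)
```

The returned variants would then be added to the original query as alternatives, so that documents containing an inflected or misrecognized form of the word can still be retrieved.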

Exporting Finnish Digitized Historical Newspaper Contents for Offline Use

Tuula Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Kettunen +1 more

2016 · D-Lib Magazine · 19 citations · doi:10.1045/july2016-paakkonen

The digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the earliest Finnish newspapers of 1771 to the present day. The material up to 1910 can be viewed in the public web service, whereas anything later is available at the six legal deposit libraries in Finland. A recent user study found that research use of various kinds is one of the key uses of the collection. The National Library of Finland has received several requests to provide the content of the digital collections as one offline bundle that includes all the needed content. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats the most useful ones to provide as raw data for researchers. In this paper we describe how the export format was created, how other parties have packaged the same data and what the benefits of the current approach are. We also briefly discuss the word-level quality of the content and show a real research scenario for the data.

Automated Dewey Decimal Classification of Swedish library metadata using Annif software

Koraljka Golub, Osma Suominen, Ahmed Taiye Mohammed, Harriet Aagaard +1 more

2024 · Journal of Documentation · 19 citations · doi:10.1108/jd-01-2022-0026

PURPOSE: In order to estimate the value of semi-automated subject indexing in operative library catalogues, the study aimed to investigate five different automated implementations of an open source software package on a large set of Swedish union catalogue metadata records, with Dewey Decimal Classification (DDC) as the target classification system. It also aimed to contribute to the body of research on aboutness and related challenges in automated subject indexing and evaluation. DESIGN/METHODOLOGY/APPROACH: On a sample of over 230,000 records with close to 12,000 distinct DDC classes, an open source tool Annif, developed by the National Library of Finland, was applied in the following implementations: lexical algorithm, support vector classifier, fastText, Omikuji Bonsai and an ensemble approach combining the former four. A qualitative study involving two senior catalogue librarians and three students of library and information studies was also conducted to investigate the value and inter-rater agreement of automatically assigned classes, on a sample of 60 records. FINDINGS: The best results were achieved using the ensemble approach, which reached 66.82% accuracy on the three-digit DDC classification task. The qualitative study confirmed earlier studies reporting low inter-rater agreement but also pointed to the potential value of automatically assigned classes as additional access points in information retrieval. ORIGINALITY/VALUE: The paper presents an extensive study of automated classification in an operative library catalogue, accompanied by a qualitative study of automated classes. It demonstrates the value of applying semi-automated indexing in operative information retrieval systems.
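To make the "three-digit DDC classification task" mentioned above more concrete, the following sketch shows one plausible way to score predictions at that level: each DDC notation is reduced to the three digits before the decimal point, and a prediction counts as correct when those digits match the reference class. This is an assumption about the evaluation, not the procedure documented in the paper; the function names and sample notations are hypothetical.

```python
from typing import Sequence


def ddc_three_digit(notation: str) -> str:
    """Reduce a DDC notation such as '839.7' to its three-digit class '839'."""
    main_part = notation.strip().split(".")[0]
    if len(main_part) < 3 or not main_part[:3].isdigit():
        raise ValueError(f"not a valid DDC notation: {notation!r}")
    return main_part[:3]


def three_digit_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of records whose predicted three-digit class matches the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one record is required")
    hits = sum(
        1 for g, p in zip(gold, predicted)
        if ddc_three_digit(g) == ddc_three_digit(p)
    )
    return hits / len(gold)
```

Scoring at three digits is more forgiving than exact-match scoring, since a prediction such as 839.5 still counts as correct for a record classed at 839.7.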

Conceptualising Benefits of User-Centred Design for Digital Library Services

Heli Kautonen, Marko Nieminen

2018 · LIBER Quarterly: The Journal of the Association of European Research Libraries · 17 citations · doi:10.18352/lq.10231

Libraries are increasingly adopting user-centred design (UCD) approaches to the development of their services for the benefit of customers. Less attention is paid to evaluating the activity of designing. To address this managerial question, we present a study that examines UCD performance in the context of digital library services' development. The study builds on the existing knowledge on library and design evaluation and examines the literature from two theoretical perspectives: performance management and temporalities. As the main contribution of this paper, we introduce the conceptual 360-Degree Temporal Benefits Model, which captures the situation where many stakeholders are involved in a design activity of a digital library service. Application of the model on two cases demonstrates that the stakeholders can assess the benefits of UCD very differently. We argue that the new model helps in framing the change from the measurable design benefits towards more ambitious and ambiguous public values.

Looking for commitment: Finnish open access journals, infrastructure and funding

Jyrki Ilva

2018 · Insights: the UKSG journal · 17 citations · doi:10.1629/uksg.414

Most of the 100+ Finnish scholarly journals are published by small learned societies. Since 2015, the National Library of Finland and the Federation of Finnish Learned Societies have been working on a joint project which aims to provide the journals with the support they need for making a transition to open access. The project has launched an OJS-based shared publication platform (Journal.fi), which is already used by 50 journals. It has also been developing a new funding model for the journals. Since the subscription and licensing costs paid by the research libraries for these journals have been very small, it is not possible to simply use these funds to pay for open access. Instead, the project has been working on a consortium-based model, under which the Finnish research organizations and funders would commit themselves to providing long-term funding to the journals. In return, the journals would pledge to follow strict standards in openness, licensing, peer review and infrastructure.

Collaboration at International, National and Institutional Level – Vital in Fostering Open Science

Kristiina Hormia‐Poutanen, Pirjo-Leena Forsström

2016 · LIBER Quarterly: The Journal of the Association of European Research Libraries · 17 citations · doi:10.18352/lq.10157

Open science and open research provide potential for new discoveries and solutions to global problems, and thus automatically extend beyond the boundaries of an individual research laboratory. By nature they imply and lead to collaboration among researchers. This collaboration should be established on all possible levels: institutional, national and international. The present paper looks at the situation in Finland and shows how these collaborations are organized at the various levels. It demonstrates the special role played by LIBER and highlights the advantages of these collaborations.

NewsEye: A digital investigator for historical newspapers

Antoine Doucet, Martin Gasteiner, Mark Granroth-Wilding, Max Kaiser +4 more

2020 · HAL (Le Centre pour la Communication Scientifique Directe) · 15 citations · doi:10.5281/zenodo.3895268

This short paper introduces the NewsEye project.

The future of metadata: open, linked, and multilingual – the YSO case

Satu Niininen, Susanna Nykyri, Osma Suominen

2017 · Journal of Documentation · 13 citations · doi:10.1108/jd-06-2016-0084

PURPOSE: The purpose of this paper is threefold: to focus on the process of multilingual concept scheme construction and the challenges involved; to address concrete challenges faced in the construction process, especially those related to equivalence between terms and concepts; and to briefly outline the translation strategies developed during the process of concept scheme construction. DESIGN/METHODOLOGY/APPROACH: The analysis is based on experience acquired during the establishment of the Finnish thesaurus and ontology service Finto as well as the trilingual General Finnish Ontology YSO, both of which are being maintained and further developed at the National Library of Finland. FINDINGS: Although uniform resource identifiers can be considered language-independent, they do not render concept schemes and their construction free of language-related challenges. The fundamental issue with all the challenges faced is how to maintain consistency and predictability when the nature of language requires each concept to be treated individually. The key to such challenges is to recognise the function of the vocabulary and the needs of its intended users. SOCIAL IMPLICATIONS: Open science increases the transparency of not only research products, but also metadata tools. Gaining a deeper understanding of the challenges involved in their construction is important for a great variety of users, e.g. indexers, vocabulary builders and information seekers. Today, multilingualism is an essential aspect at both the national and international information society level. ORIGINALITY/VALUE: This paper draws on the practical challenges faced in concept scheme construction in a trilingual environment, with a focus on “concept scheme” as a translation and mapping unit.

From MARC silos to Linked Data silos?

Osma Suominen, Nina Hyvönen

2017 · Työväentutkimus Vuosikirja · 11 citations · doi:10.5282/o-bib/2017h2s1-13

Libraries are opening up their bibliographic metadata as Linked Data. However, they have all used different data models for structuring their bibliographic data. Some are using a FRBR-based model with several layers of entities while others use flat, record-oriented data models. The proliferation of data models limits the reusability of bibliographic data. In effect, libraries have moved from MARC silos to Linked Data silos of incompatible data models. Data sets can be difficult to combine and reuse. Small modelling differences may be overcome by schema mappings, but it is not clear that interoperability has improved overall. We present a survey of published bibliographic Linked Data, the data models proposed for representing bibliographic data as RDF, and tools used for conversion from MARC. Also, the approach of the National Library of Finland is discussed.

Names, Right or Wrong

Kimmo Kettunen, Teemu Ruokolainen

2017 · 10 citations · doi:10.1145/3078081.3078084

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent, and the entity categories used also vary [16]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report evaluation results of NER with data from Digi, a digitized collection of historical Finnish newspapers. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers.

FinGPT: Large Generative Models for a Small Language

Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen +4 more

2023 · 9 citations · doi:10.18653/v1/2023.emnlp-main.164

Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

Kimmo Kettunen, Laura Löfberg

2017 · DSpace repository (University of Tartu) · 9 citations

Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent. The entity categories used also vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a three-part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data sets: the digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771-1910, both in Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word-level correctness is about 70-75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday's annotated collection consists of 240 articles in six different sections of the newspaper. Our newly evaluated tool for NER tagging is non-conventional: it is a rule-based Finnish Semantic Tagger, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER.

Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process

Kimmo Kettunen, Mika Koistinen, Jukka Kervinen

2020 · LIBER Quarterly: The Journal of the Association of European Research Libraries · 8 citations · doi:10.18352/lq.10322

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 16.51 million pages mainly in Finnish and Swedish. Out of these about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu. The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The last nine years, 1921–1929, were opened in January 2018. This paper presents briefly the ground truth Optical Character Recognition data of about 500 000 words that has been compiled at the NLF for development of an improved OCR process for the Finnish collection. We discuss compilation of the data generally and show results of the new OCR process in comparison to current OCR, using the ground truth data as an evaluation benchmark. We also show with real newspaper data of 30 years and 109 million words that the re-OCRing process is improving the quality of the OCRed data.

Monitoring agreements with open access elements: why article-level metadata are important

Mafalda Marques, Saskia Woutersen-Windhouwer, Arja Tuuliniemi

2019 · Insights: the UKSG journal · 8 citations · doi:10.1629/uksg.489

Agreements with open access (OA) elements (e.g. agreements with APC discounts, offsetting agreements, read and publish agreements) have been increasing in number in the last few years. With more agreements including some form of OA, consortia and academic institutions need to monitor the number of OA publications, the costs and the value of these agreements. Publishers are therefore required to account for the articles published OA to consortia, academic institutions and research funders. One way publishers can do so is by providing regular reports with article-level metadata. This article uses the Knowledge Exchange (KE) and the Efficiency and Standards for Article Charges (ESAC) initiative recommendations as a check-list to assess what article-level metadata consortia request from publishers and what metadata publishers deliver to consortia. KE countries’ agreements with major publishers were analysed to assess how far consortia and publishers are from requesting and providing article-level metadata. The results from this research can be used as a benchmark to determine how major publishers were performing until early 2019 and prior to Plan S coming into effect in 2021. A recommendation is made that publishers use the article-level metadata check-list as a template to provide the metadata recommended by KE and ESAC.
