The project has a strong focus on releasing open-access linguistic resources for the Norwegian language, mainly through the Norwegian Language Bank (Språkbanken). This page collects links to the main corpora developed during the project.
- The Norwegian Parliamentary Speech Corpus (NPSC) consists of 140 hours of audio files and corresponding manual transcriptions of meetings at Stortinget (the Norwegian parliament) from 2017 and 2018. The NPSC is primarily intended as an open-source dataset for ASR development.
- ONOMASTICA is a pronunciation lexicon for Norwegian names created by Telenor in the 1990s. This version has been converted to CSV with UTF-8 encoding.
- N-grams from NB Digital contains unigrams, bigrams and trigrams extracted from the digitized texts at the National Library of Norway. This resource can be used for language modelling.
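
As a rough illustration of how n-gram counts can feed a language model, the sketch below computes a maximum-likelihood bigram probability from unigram and bigram counts. The counts are made-up toy values held in memory, and the function name `bigram_probability` is hypothetical; the actual file layout of the N-grams resource is not described on this page, so loading the real data is left out.

```python
from collections import Counter

# Toy counts invented for illustration only; they are NOT taken from the
# N-grams from NB Digital resource, whose file format is not assumed here.
unigram_counts = Counter({"jeg": 4, "er": 3, "her": 1})
bigram_counts = Counter({("jeg", "er"): 2, ("er", "her"): 1})


def bigram_probability(prev_word, word, unigrams, bigrams):
    """Maximum-likelihood estimate P(word | prev_word) = c(prev, word) / c(prev)."""
    prev_count = unigrams[prev_word]  # Counter returns 0 for unseen words
    if prev_count == 0:
        return 0.0  # no evidence for this context; unsmoothed estimate
    return bigrams[(prev_word, word)] / prev_count


if __name__ == "__main__":
    print(bigram_probability("jeg", "er", unigram_counts, bigram_counts))
```

This estimate is unsmoothed, so any bigram missing from the counts gets probability zero. A practical model would typically add smoothing or back off to the unigram counts.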