Resources

> Academic Article

> Corpus

> Dataset

> Document

> Experimental research data

> Software publication

> Working Paper


0
prog_list: C02
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n8977
type_label: Academic Article
rlabel: Register effects of imagined addressees on f0 across generations
1
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n10352
resource_origin_list: created
rlabel: A Grammatically Annotated Corpus of the Old Latvian Postil of Georg Mancelius
link: https://zenodo.org/record/7890894
prog_list: B02
abstract:

This grammatically annotated corpus aims at facilitating linguistic research on Old Latvian based on the Postil of Georg Mancelius from the year 1654. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.

The pericopes were annotated using SIL Toolbox and converted for use in the search tool ANNIS with the conversion tool PEPPER.

Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.

Created in the project B02, Emergence and change of registers: The case of Lithuanian and Latvian of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).

type_label: Corpus
doi: 10.5281/zenodo.7890894
2
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7448
resource_origin_list: re-used
rlabel: BeDiaCo - Berlin Dialogue Corpus
link: https://rs.cms.hu-berlin.de/phon
prog_list: C06
abstract:

The corpus consists of acoustic recordings of spontaneous dialogues of German native speakers with both task-free and task-based parts and additional read word lists.
Malte Belz, Christine Mooshammer, Alina Zöllner, and Lea-Sophie Adam. Berlin Dialogue Corpus (BeDiaCo): Version 2, 2021.

type_label: Corpus
3
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7879
resource_origin_list: re-used
rlabel: BeMeCo v1
link: https://rs.cms.hu-berlin.de/phon/
prog_list: C06
abstract: Alina Zöllner, Christine Mooshammer, and Silke Hamann. Berlin Menutask Corpus (BeMeCo): Version 1, 2021. URL https://rs.cms.hu-berlin.de/phon
type_label: Corpus
4
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n15388
resource_origin_list: created
rlabel: BiNoKo V. 1.0 Birgitta-Notker-Korpus
link: https://zenodo.org/record/7764078
prog_list: B04
abstract:

The Birgitta-Notker-Korpus (BiNoKo) is a resource dedicated to comparative research on historical registers. The corpus comprises two sources: The Old High German Book of Psalms by Notker III of Saint Gall and the Old Swedish Revelations of Birgitta of Sweden. The subcorpus of Birgitta's Revelations and the subcorpus of Notker's Psalms are available as separate zip files. The corpus format is ANNIS. For local installation, use ANNIS Desktop. The documentation for ANNIS can be found here:
https://corpus-tools.org/annis/
https://corpus-tools.org/annis/download.html

The guidelines (see 'related identifiers') are published in REALIS 2/3 and include information about the corpus design, annotation layers, metadata, and annotation principles.

type_label: Corpus
doi: 10.5281/zenodo.7764078
5
prog_list: A02
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n15248
resource_origin_list: created
type_label: Corpus
rlabel: Bislama Spoken Corpus
6
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n174
resource_origin_list: used
rlabel: Bonner Totenbuchprojekt
link: http://totenbuch.awk.nrw.de/
prog_list: B03
abstract: In ancient Egypt, the Book of the Dead served for more than 1,500 years as a body of knowledge for the deceased, placed in the tomb with them in written form.
type_label: Corpus
7
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n9271
resource_origin_list: used
rlabel: British National Corpus (BNC)
link: https://www.english-corpora.org/bnc/
prog_list: A05
abstract: The British National Corpus (BNC) was originally created by Oxford University Press in the 1980s and early 1990s, and it contains 100 million words of text from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic).
type_label: Corpus
8
prog_list: C06
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7330
type_label: Corpus
rlabel: C02-Corpus
9
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n976
resource_origin_list: used
rlabel: CaeMmCom
link: https://www.archaeologie.hu-berlin.de/de/aknoa/forschung-und-projekte/projekte/caemmcom
prog_list: B03
abstract:

Corpus of Ancient Egyptian Multimodal Communication. 

CAEMmCom – Corpus of Ancient Egyptian Multimodal Communication: Getting Started [pdf]

2020 Multimodale graphische Kommunikation im pharaonischen Ägypten: Entwurf einer Analysemethode, Lingua Aegyptia 28: 81-116.
2020 (with Rebecca Döhl and Jens-Martin Loebel) CaeMmCom – Corpus altaegyptischer multimodaler Communication. Der Aufbau einer multimodalen Datensammlung altägyptischer Kommunikate, Zeitschrift für digitale Geisteswissenschaft, [open access].

type_label: Corpus
10
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n19027
rlabel: CoACan: Corpus del español académico de Canarias (1.0)
link: https://doi.org/10.5281/zenodo.17413533
prog_list: A09
abstract:


COACAN (Corpus of Academic Canarian Spanish) is a linguistic corpus that documents and analyzes varieties of Spanish spoken in the Canary Islands, specifically within the university academic context. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

This corpus was developed at the University of La Laguna with students from three specific degree programs in October-November 2024:


• Physical Activity and Sports Sciences (CAFYD)
• Hispanic Linguistics and Literature
• Education (Pedagogy)

The corpus focuses on spontaneous speech from Canarian university students, capturing both dialectal features of Canarian Spanish and academic/colloquial registers used in higher education settings, as well as their relationship with insular territory.

 

type_label: Corpus
11
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n234
resource_origin_list: created
rlabel: CoCoYum
link: https://scm.cms.hu-berlin.de/lehmannx/cocoyum-public
prog_list: A06
abstract: Language: Yucatec Maya
Size: 159.00 tokens
Description: natural language production (spoken) and elicited data
Features: morpheme, glosses, translations, comments
Access: CC-BY-NC-ND

The Collective Corpus of Yucatec Maya (CoCoYum) is a collection of data on the Yucatec Maya language contributed by various researchers. It contains transcriptions of recordings (e.g. storytelling, dialogue, public events), written data, and elicited data. The corpus will grow over time as new data are collected and as further researchers contribute their data.
type_label: Corpus
12
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n14512
rlabel: CoMeCaYo: Corpus Mediático de Canarias en YouTube (1.0)
link: https://doi.org/10.5281/zenodo.17413919
prog_list: A09
abstract: Comecayo (Corpus mediático de Canarias en YouTube) is a comprehensive corpus of Spanish-language YouTube videos from Canarian media outlets containing timestamped transcriptions. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" within the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin. The corpus spans over 15 years of video content from major Canarian television channels, radio stations, and digital media platforms, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional media discourse patterns.
type_label: Corpus
13
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3329
rlabel: CoPaCaYo: Corpus del Parlamento de Canarias en YouTube (1.0)
link: https://doi.org/10.5281/zenodo.17414418
prog_list: A09
abstract: COPACAYO (Corpus del Parlamento de Canarias en YouTube) is a comprehensive corpus of Spanish-language YouTube videos from Canarian parliamentary and governmental institutions containing timestamped transcriptions. This dataset was developed as part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" within the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.

The corpus spans over 14 years of video content from major Canarian governmental institutions, parliamentary sessions, and official channels, providing a valuable resource for linguistic research, natural language processing, and computational linguistics studies focusing on Spanish language evolution and regional institutional discourse patterns.

type_label: Corpus
14
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n20177
rlabel: CoParCan: Corpus del Parlamento de Canarias (1.0)
link: https://doi.org/10.5281/zenodo.17413688
prog_list: A09
abstract: Corpus of institutional declarations from the Parliament of the Canary Islands that documents contemporary Canarian political-institutional discourse. This corpus is part of project A09 "On the interplay between register and socio-geographic variation in Canarian Spanish" of the Collaborative Research Centre 1412 "REGISTER" (Register: Language Users' Knowledge of Situational-Functional Variation), led by Prof. Dr. Miriam Bouzouita at Humboldt-Universität zu Berlin.
type_label: Corpus
15
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3651
resource_origin_list: created
rlabel: Corpus of Non-Native Addressee Register (CoNNAR). Version 1
link: https://rs.cms.hu-berlin.de/phon/pages/home.php
prog_list: C06
type_label: Corpus
16
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n6464
resource_origin_list: re-used
rlabel: Czech corpus Koditex
link: http://www.korpus.cz
prog_list: A03
abstract: A synchronic, representative and reference 9‑million‑word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech.

Zasina, A. J. – Lukeš, D. – Komrsková, Z. – Poukarová, P. – Řehořková, A.: Koditex: A corpus of diversified texts. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague 2018. Available at WWW: www.korpus.cz
type_label: Corpus
17
prog_list: C06
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5606
type_label: Corpus
rlabel: DNam
18
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3943
resource_origin_list: used
rlabel: DNam corpus + DNam Wenker corpus
link: https://www.linguistik.hu-berlin.de/en/institut-en/professuren-en/german-in-multilingual-contexts/corpora/deutsch-in-namibia
prog_list: C07
abstract: The corpus "German in Namibia" („Deutsch in Namibia“, DNam) was created in the period 2016-2021, in the DFG project „NamDeutsch: Die Dynamik des Deutschen im mehrsprachigen Kontext Namibias“ ("NamDeutsch: The Dynamics of German in Namibia's Multilingual Context", WI 2155/9-1 and SI 750/4-1, directed by Heike Wiese and Horst Simon in cooperation with Marianne Zappen-Thomson) at the University of Potsdam (until 2019) and at HU Berlin (since 2019), at the FU Berlin and at UNAM Windhoek.

type_label: Corpus
doi: 10.37307/j.1868-775X.2020.03.03
19
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3429
resource_origin_list: re-used
rlabel: ENCOW16
link: https://webcorpora.org
prog_list: B01; A04
abstract: Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server
type_label: Corpus
20
prog_list: B01
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n31615
type_label: Corpus
rlabel: ENCOW16A-NANO
21
prog_list: C03
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n1236
resource_origin_list: created
type_label: Corpus
rlabel: Eye-Tracking Corpus
22
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4631
resource_origin_list: used
rlabel: FOLK excerpt
prog_list: A06
abstract: Language: German
Size: 194,716 tokens
Description: conversations in various situations
Features: rich metadata, lemma, POS, speech unit segmentation, some dependencies
Access: internal
type_label: Corpus
23
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n9758
resource_origin_list: re-used
rlabel: Falko Corpus
link: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko/standardseite
prog_list: C04; C05
abstract: L1- and L2-authored argumentative essays collected in a controlled setting.
Further information about the Falko project: https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko
type_label: Corpus
24
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n1690
resource_origin_list: used
rlabel: GeWISS excerpt
link: https://gewiss.uni-leipzig.de/index.php?id=home&L=1
prog_list: A06
abstract:

GeWiss is a research project in spoken academic language. It provides a multilingual (German/English/Polish/Italian) corpus of audio recordings and transcriptions of academic communications, as an empirical foundation for comparative research.

To this end, the GeWiss corpus focusses on two main genres of spoken academic language:

  • talks including discussions, and
  • oral exams,

and it explicitly distinguishes between L1 and L2 subcorpora. The corpus is enlarged and developed continuously.

type_label: Corpus
25
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n2186
resource_origin_list: used
rlabel: GermaNet
link: https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/lexica/germanet-1/
prog_list: A01
abstract: GermaNet is a lexical-semantic wordnet that relates German nouns, verbs, and adjectives semantically to one another by grouping lexical units that express the same concept into synsets and by defining semantic relations between these synsets. GermaNet has much in common with the English WordNet® and can be viewed as an online thesaurus or as a lightweight ontology.
type_label: Corpus
26
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7118
resource_origin_list: used
rlabel: GermaParl Corpus of Plenary Protocols
link: https://zenodo.org/record/3742113#.Y1pbJr7P2-Y
prog_list: A01
abstract: The GermaParl Corpus has been prepared in the PolMine Project (http://polmine.github.io) and comprises all protocols of plenary sessions in the German Bundestag (1996 - 2016). This version of the corpus is based on plain text documents issued by the German Bundestag. For a period between 2008 and 2010, txt files are not available. To fill the gap, pdf documents were processed. As part of the corpus preparation pipeline, the data has been linguistically annotated (using the TreeTagger) and imported into the Corpus Workbench (CWB). See the GermaParl documentation website (http://polmine.github.io/GermaParl) for further information.
type_label: Corpus
doi: 10.5281/zenodo.3742113
27
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4983
resource_origin_list: used
rlabel: Icelandic Parsed Historical Corpus (IcePaHC)
link: https://clarin.is/en/resources/icepahc/
prog_list: A05
abstract: The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts spelling is modernized for phonological change.
type_label: Corpus
28
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5848
resource_origin_list: created
rlabel: Kobalt_RST: Die Annotation von rhetorischen Strukturen im Kobalt-DaF-Korpus
prog_list: C04
abstract:

The Kobalt-DaF corpus is a systematically collected and deeply annotated corpus of learner German, containing 80 German-language argumentative texts by German L1 speakers and by learners of German with various L1s. This repository makes an additional annotation of the Kobalt-DaF corpus for rhetorical structures freely available. The following information can be found here: (1) A description of the annotation process (annotation framework, guidelines, and procedure). (2) The annotated rs3 files.

*Version notes: So far, only the texts by the Chinese learners of German and the German L1 speakers (40 texts in total) are available. Annotation of the remaining texts will follow soon.

*The annotation work was funded by the Chinese Scholarship Council and the Deutsche Forschungsgemeinschaft (DFG) – SFB 1412, 416591334.

type_label: Corpus
doi: https://doi.org/10.5281/zenodo.5731458
29
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n870
resource_origin_list: created
rlabel: Lang*Reg: A multi-lingual corpus of intra-individual variation across situations
prog_list: C06; A06
abstract: Language: German, Persian, Yucatec Maya, Kurdish, Javanese
Size: 36 hours
Description: same speakers varied by mode, acquaintance, professionalism, and expertise
Features: transcription, syntactic segmentation, normalization, token, glossing or POS-tags, some syntax
Access: transcription or annotation in progress; CC-BY-NC-ND
type_label: Corpus
doi: https://doi.org/10.5281/zenodo.7646320
30
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n402
resource_origin_list: re-used
rlabel: Lithuanian Corpus
prog_list: B02
abstract: The Old Lithuanian corpus is the postil of Jonas Bretkūnas published as a facsimile edition by Ona Aleknavičienė (Jono Bretkūno Postilė, parengė Ona Aleknavičienė. Vilnius: Lietuvių kalbos institutas, 2005. ISBN 9986-668-96-4).
The text files used in the research were generated from the facsimiles.
type_label: Corpus
31
prog_list: B04
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4832
type_label: Corpus
rlabel: Luther-Bretke-Korpus
32
prog_list: A02
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n2934
resource_origin_list: created
type_label: Corpus
rlabel: Morisien Spoken Corpus
33
prog_list: A05
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n9399
resource_origin_list: created
type_label: Corpus
rlabel: Online production experiment on Imprecision
34
prog_list: B01
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5585
type_label: Corpus
rlabel: Penn-Helsinki Corpus of Early Modern English
35
prog_list: B01
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n443
type_label: Corpus
rlabel: Penn-Helsinki Corpus of Middle English
36
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n11270
resource_origin_list: used
rlabel: Potsdam Commentary Corpus
link: http://angcl.ling.uni-potsdam.de/resources/pcc.html
prog_list: A01
abstract: The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus) and is annotated with a range of different types of linguistic information.

[Bourgonje & Stede 2020] Bourgonje, Peter and Stede, Manfred (2020). The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. Proc. of the Language Resources and Evaluation Conference (LREC), Marseille.
type_label: Corpus
37
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3174
resource_origin_list: re-used
rlabel: PreCOXX25: Register-annotated German webcorpus
prog_list: A04
abstract: Access: webcorpora.org (free access for academic use)
Engines: NoSkE and RStudio Server
type_label: Corpus
38
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n1004
resource_origin_list: re-used
rlabel: Prestudy "situational context" in Czech
prog_list: A03
abstract: Ibex farm project (Zehr and Schwarz, 2018)
type_label: Corpus
39
prog_list: C06
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3937
type_label: Corpus
rlabel: RUEG-GER
40
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n21689
resource_origin_list: used
rlabel: Ramsès Project
link: http://ramses.ulg.ac.be
prog_list: B03
abstract: Morphologically annotated, lemmatized text corpus of Late Egyptian texts (c. 1550 – 1000 BCE) by the University of Liège (http://ramses.ulg.ac.be)
type_label: Corpus
41
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3519
resource_origin_list: created
rlabel: ReFlexAE
prog_list: C05
abstract: The ReFlexAE corpus (Register Flexibility in Academic Education) is a longitudinal corpus of written grammatical explanations built to investigate late register development in the context of higher education. The data are collected through a longitudinal written elicitation study with German L1 students enrolled in programs for primary school teachers. According to a repeated measures design, the longitudinal written study elicits data at three time points: before and after linguistic courses and before graduation. Each participant completes the same test battery comprising four written elicitation tasks, a grammar test, a demographic questionnaire and standard psychological questionnaires assessing personal traits and motivation for learning.
type_label: Corpus
42
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n1563
resource_origin_list: used
rlabel: Russian National Corpus
link: https://ruscorpora.ru/
prog_list: A03
abstract:

The Russian National Corpus is a representative collection of texts in Russian, comprising about 1.5 billion tokens and equipped with linguistic annotation and search tools.

type_label: Corpus
43
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n9270
resource_origin_list: re-used
rlabel: SENIE Corpus
link: http://senie.korpuss.lv/toc.jsp
prog_list: B02
abstract: Latvian texts provided by the SENIE project of the University of Latvia. http://senie.korpuss.lv/toc.jsp

Andronova, Everita (2007). The Corpus of Early Written Latvian: current state and future tasks. Proceedings of the Corpus Linguistics Conference. CL2007. University of Birmingham, UK. 27-30 July 2007. Edited by Matthew Davies, Paul Rayson, Susan Hunston, Pernilla Danielsson. ISSN 1747-9398. (http://ucrel.lancs.ac.uk/publications/CL2007/paper/245_Paper.pdf)
type_label: Corpus
44
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4891
resource_origin_list: created
rlabel: Simulated Zoom-Corpus
prog_list: C02
abstract: Simulated Zoom interaction with choreographed videos (variation of interlocutor persona [formality] and variation of topic / at-stakeness). Simultaneous laboratory recordings of audio and video.
type_label: Corpus
45
prog_list: C05
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n1890
type_label: Corpus
rlabel: The GeWiss corpus (Gesprochene Wissenschaftssprache)
46
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n13221
resource_origin_list: created
rlabel: The grammatically annotated corpus of the pericopes of the Old Lithuanian Postil of Jonas Bretkūnas
link: https://zenodo.org/record/7890990
prog_list: B02
abstract:

This grammatically annotated corpus aims at facilitating linguistic research on Old Lithuanian based on the Postil of Jonas Bretkūnas from the year 1591. The corpus is divided into two subcorpora, "pericopes" and "homilies", to make register-related research easier.

The pericopes were annotated using SIL Toolbox and converted for use in the search tool ANNIS with the conversion tool PEPPER.

Three formats are provided in this release: 1. the Toolbox files, 2. the transitional Excel files and 3. a zipped folder to be imported into ANNIS.

Created in the project B02, Emergence and change of registers: The case of Lithuanian and Latvian of the CRC 1412 "Register" (funded by the Deutsche Forschungsgemeinschaft: DFG, German Research Foundation: 416591334).

type_label: Corpus
doi: 10.5281/zenodo.7890990
47
prog_list: C06; C05
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n2392
type_label: Corpus
rlabel: The oral ReFlexAE corpus
48
prog_list: C05
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3414
type_label: Corpus
rlabel: The written ReFlexAE corpus (Register Flexibility in Academic Education)
49
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3389
resource_origin_list: used
rlabel: Thesaurus Linguae Aegyptiae (TLA)
link: http://aaew2.bbaw.de/tla
prog_list: B03
abstract: Digital text corpus of the ancient Egyptian and Demotic languages, morphosyntactically annotated and lemmatized. Largest corpus of Egyptian texts of different types and times (c. 2500 BCE – 450 CE)
type_label: Corpus
50
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n8100
resource_origin_list: re-used
rlabel: WroDiaCo v2
link: https://rs.cms.hu-berlin.de/phon/
prog_list: C06
abstract: Sarah Wesolek, Malte Belz, and Christine Mooshammer. Wroclaw Dialogue Corpus (WroDiaCo): Version 2, 2020. URL https://rs.cms.hu-berlin.de/phon.
type_label: Corpus
51
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n25141
resource_origin_list: re-used
rlabel: sgs corpus
link: http://sgscorpus.com/
prog_list: A06
abstract: Language: Persian
Size: 26 h
Description: free spoken dialogues with interviewer on fictive crime scenario
Features: social metadata, syntax
Access: internal
type_label: Corpus
52
prog_list: A09
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5245
type_label: Dataset
rlabel: Conversations
53
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n13260
rlabel: Experimental Data: Bias and modality in conditionals: experimental evidence and theoretical implications
prog_list: A07
abstract: The repository contains all related files used for the experimental work reported in the paper "Bias and modality in conditionals: experimental evidence and theoretical implications". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.
type_label: Dataset
doi: 10.17605/OSF.IO/QKVTF
54
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n1272
rlabel: Experimental Data: Comparing comparatives: Appropriateness ratings of synthetic, analytic and double comparatives in American and British English
prog_list: A07
abstract: The repository contains all related files used for the experimental work reported in the paper "Comparing comparatives: Appropriateness ratings of synthetic, analytic and double comparatives in American and British English". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.
type_label: Dataset
doi: 10.17605/OSF.IO/VFZWM
55
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4216
rlabel: Experimental Data: Interlocutor relation predicts the formality of the conversation: an experiment in American and British English
prog_list: A07
abstract:

The repository contains all related files used for the experimental work reported in the paper "Interlocutor relation predicts the formality of the conversation: an experiment in American and British English". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.

type_label: Dataset
doi: 10.17605/OSF.IO/7BUZR
56
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n57779
rlabel: Experimental Data: Modal concord in American and British English: A register-based experimental study
prog_list: A07
abstract:

The repository contains all related files used for the experimental work reported in the paper "Modal concord in American and British English: A register-based experimental study". It contains files with the experimental stimuli, rating data, demographics information of participants, R scripts, and plots.

type_label: Dataset
doi: 10.17605/OSF.IO/V8KTB
57
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5905
rlabel: Experimental Data:
prog_list: A07
abstract: The repository contains all related files used for the experimental work reported in the paper "Counterfactual language, emotion, and perspective: a sentence completion study during the COVID-19 pandemic". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.
type_label: Dataset
doi: 10.17605/OSF.IO/TYH8P
58
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n16172
rlabel: Experimental Data:
prog_list: A07
abstract: The repository contains all related files used for the experimental work reported in the paper "Less formal and more rebellious --- An experiment on the social meaning of negative concord in American English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.


type_label: Dataset
doi: 10.17605/OSF.IO/ZMW46
59
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n9964
rlabel: Experimental data:
prog_list: A07
abstract:

The repository contains all related files used for the experimental work reported in the paper "A register approach to negative concord vs. negative polarity items in English". It contains files with the experimental stimuli, rating data, completion tasks, demographics information of participants, R scripts, and plots.

type_label: Dataset
doi: 10.17605/OSF.IO/2PXE4
60
prog_list: A07; C03
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7251
type_label: Dataset
rlabel: Experimental data: demographic and language background
61
prog_list: C07
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n8845
type_label: Dataset
rlabel: Interview data
62
prog_list: A06
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4523
type_label: Dataset
rlabel: Lang*Reg Social Survey
63
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n14702
rlabel: Lang*Reg: A multi-lingual corpus of intra-speaker variation across situations
prog_list: A06
abstract:

The Lang*Reg corpus records intra-speaker variation across languages and different situational-functional contexts, presumed to result in different registers. It has been prepared in the SFB1412 Register with data collections taking place in 2021-2022 for the following languages included in this version: German, Persian, Kurdish, Javanese. The data sets for each language comprise the speech of the same language users in a variety of spoken conversations and one written interaction. A minimum of 12 participants per language traversed a course of 6 situations in which they were asked to produce language in three types of activities: telling a story to a friend, talking freely with various interlocutors (friend, stranger, taxi driver) and engaging in an interview with a (university) professor. Moreover, our design included the storytelling in two modes, which allows for the comparison between spoken and written modes of the same language user. 

Lang*Reg has a basic syntactic segmentation (one matrix clause and all its dependent clauses per segment). v0.2.0 includes the data sets with transcriptions, normalizations and tokens for each language as well as additional language-specific annotations such as glosses and syntactic annotations. We prepared each data set also for use with the browser-based search and visualization architecture ANNIS. For further language-specific morpho-syntactic and sociolinguistic annotations, refer to the respective data set description. For an overview of all data set characteristics, please see the corpus documentation in each data set.

type_label: Dataset
doi: https://doi.org/10.5281/zenodo.13889198
64
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4500
resource_origin_list: used
rlabel: Big Five Inventory (BFI-10)
link: https://zis.gesis.org/skala/Rammstedt-Kemper-Klein-Beierlein-Kovaleva-Big-Five-Inventory-(BFI-10)#
prog_list: A07; C05
abstract: The BFI-10 is a highly economical scale for measuring personality according to the five-factor model. The scale is easy to administer in different survey modes. Evidence from the validation studies suggests that the BFI-10 permits not only an economical but also a reliable and valid measurement of the Big Five. The BFI-10 provides a rough measure of the individual personality structure of adult interviewees from the German-speaking general population.

Rammstedt, B., Kemper, C. J., Klein, M. C., Beierlein, C., & Kovaleva, A. (2014). Big Five Inventory (BFI-10).
Zusammenstellung sozialwissenschaftlicher Items und Skalen (ZIS).
https://doi.org/10.6102/zis76


type_label: Document
doi: https://doi.org/10.6102/zis76
65
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n6714
resource_origin_list: used
rlabel: Depressions-Angst-Stress Skalen (DASS 21)
link: https://www.embloom.de/inhalte/dass-21-en/
prog_list: A07
abstract: The DASS are suited to assessing the burden of depression, anxiety and stress without confounding somatic factors such as chronic pain problems. The scales can, however, also be used with clients who have no somatic complaints. The DASS are available in a short version with 21 items and a long version with 42 items, each in German and English.
© Lovibond, P.F., Lovibond, S.H., Nilges, P. & Essau, C.
type_label: Document
doi: https://doi.org/10.23668/psycharchives.4579
66
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5228
resource_origin_list: used
rlabel: Epidemic - Pandemic Impacts Inventory (EPII)
link: https://health.uconn.edu/psychiatry/research/family-adversity-and-resilience-research-program/epii/
prog_list: A07
abstract: The EPII is a tool designed to assess tangible impacts of epidemics and pandemics across personal and social life domains.
type_label: Document
67
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4209
resource_origin_list: used
rlabel: Interpersonal Reactivity Index (IRI)
prog_list: C05
abstract: Davis, M. (1983). Measuring individual differences in empathy: Evidence for a multidimensional approach. Journal of Personality and Social Psychology, 44(1), 113–126.
type_label: Document
doi: https://doi.org/10.1037/0022-3514.44.1.113
68
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7993
resource_origin_list: used
rlabel: Skalen zur motivationalen Regulation beim Lernen im Studium (SMR-LS)
link: https://ius.aau.at/projekte/26/
prog_list: C05
type_label: Document
doi: https://doi.org/10.1026/0012-1924/a000201
69
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5760
resource_origin_list: created
rlabel: Experimental data: Addressee identification study
prog_list: C07
abstract:

Rating data (on a 9-point scale) on the probable addressee of spoken texts in three conditions: a) entirely Standard German, b) with Namibian-specific lexical features, and c) with non-standard grammatical features. The open-guise method was used to collect the data.

Participants: adults and adolescents in Namibia
type_label: Experimental research data
70
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5252
resource_origin_list: created
rlabel: Experimental data: Newspaper correction study
prog_list: C07
abstract:

Data containing corrections of Namibian-German vs. Standard German features (lexical, morpho-syntactic, and grammatical) presented in a written mock newspaper article.
Participants: Adults and adolescents in Namibia and Germany

type_label: Experimental research data
71
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7281
resource_origin_list: created
rlabel: Experimental data: speaker evaluation study
prog_list: C07
abstract:

Ratings (on a 9-point scale) of social meaning (competence and solidarity assessments) and inferences (origin, place of residence) regarding speakers of spoken texts in three conditions: a) entirely Standard German, b) with Namibian-specific lexical features, and c) with non-standard grammatical features, collected using the open-guise method.
Participants: adults and adolescents in Namibia

type_label: Experimental research data
72
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3850
resource_origin_list: used
rlabel: Stanford Log-linear Part-Of-Speech Tagger
link: https://nlp.stanford.edu/software/tagger.html
prog_list: A01; INF
abstract:

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token), although computational applications generally use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
type_label: Software publication
73
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n27441
resource_origin_list: used
rlabel: ANNIS3
link: https://corpus-tools.org/annis/
prog_list: A03; C04; C06; B04; INF; A06
abstract:

A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.

Available from: https://corpus-tools.org/annis/
Documentation: https://corpus-tools.org/annis/documentation.html
Cite as: Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. In: Digital Scholarship in the Humanities 2016 (31). http://dsh.oxfordjournals.org/content/31/1/118

type_label: Software publication
doi: https://doi.org/10.1093/llc/fqu057
74
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3846
resource_origin_list: used
rlabel: AntConc
link: https://www.laurenceanthony.net/software/antconc/
prog_list: B02
abstract: A freeware corpus analysis toolkit for concordancing and text analysis.

type_label: Software publication
75
prog_list: B01
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n4726
type_label: Software publication
rlabel: CorpusSearch
76
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n14595
resource_origin_list: used
rlabel: ELAN
link: https://archive.mpi.nl/tla/elan
prog_list: A06
abstract: ELAN is computer software, a professional tool to manually and semi-automatically annotate and transcribe audio or video recordings. It has a tier-based data model that supports multi-level, multi-participant annotation of time-based media.

Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands

Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41(3), 841-849. doi:10.3758/BRM.41.3.841.
type_label: Software publication
doi: https://doi.org/10.3758/BRM.41.3.841
77
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7355
resource_origin_list: used
rlabel: EXMARaLDA
link: https://exmaralda.org/
prog_list: C05; A06
abstract: EXMARaLDA is a system for working with oral corpora on a computer. It consists of a transcription and annotation tool (Partitur-Editor), a tool for managing corpora (Corpus-Manager) and a query and analysis tool (EXAKT). Further parts of EXMARaLDA are FOLKER and OrthoNormal, which were both developed in and for the FOLK project.
Schmidt, T. and Wörner, K. (2014). "EXMARaLDA". In Handbook on Corpus Phonology, pp. 402-419. Oxford University Press.
type_label: Software publication
78
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n2310
resource_origin_list: used
rlabel: Field Linguist's Toolbox
link: https://software.sil.org/toolbox/
prog_list: B02; A06
abstract: Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. Toolbox is free to download and use.
type_label: Software publication
79
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n13705
resource_origin_list: used
rlabel: HAL-Inria
link: https://hal.inria.fr/inria-00527799
prog_list: A03
abstract: In this paper, we present SALT, a framework for mapping heterogeneous linguistic formats to one another using a model-based approach, i.e. independently of the actual formats in which the corresponding linguistic data is expressed. While describing the underlying concept of this framework, we identify how it echoes past and ongoing standardisation activities within ISO committee TC 37/SC 4, in particular the possible conceptual equivalences with ISO CD 24612 (LAF) combined with ISO 24610-1 (FSR), as well as the possible role of the central data category registry (ISOCat), currently under deployment. We thus show the adequacy of our methodology and its capacity to integrate a wide range of possible linguistic annotation models.
type_label: Software publication
80
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n11441
resource_origin_list: used
rlabel: INCEpTION
link: https://inception-project.github.io
prog_list: A01; B04; INF; A06
abstract: We introduce INCEpTION, a new annotation platform for tasks including interactive and semantic annotation (e.g., concept linking, fact linking, knowledge base population, semantic frame annotation). These tasks are very time consuming and demanding for annotators, especially when knowledge bases are used. We address these issues by developing an annotation platform that incorporates machine learning capabilities which actively assist and guide annotators. The platform is both generic and modular. It targets a range of research domains in need of semantic annotation, such as digital humanities, bioinformatics, or linguistics. INCEpTION is publicly available as open-source software.

INF is hosting INCEpTION at https://inception.sfb1412.hu-berlin.de (Intranet only)
type_label: Software publication
81
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5347
resource_origin_list: used
rlabel: PennController for Internet Based Experiments (IBEX)
link: https://doc.pcibex.net/
prog_list: A07; A03; C03; A06
abstract: PennController for Internet Based Experiments (“PennController” or “PCIbex” for short) provides the tools to build and run online experiments, from familiar paradigms like self-paced reading to completely custom-designed paradigms.
type_label: Software publication
82
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n2574
resource_origin_list: used
rlabel: Pepper
link: https://corpus-tools.org/pepper/
prog_list: A01; INF; A06
abstract:

A highly extensible platform for conversion and manipulation of linguistic data between an unbound set of formats. Pepper can be used stand-alone as a command line interface, or be integrated as an API into other software products.

Available from: https://corpus-tools.org/pepper/
Documentation: https://corpus-tools.org/pepper/userGuide.html
Cite as: F. Zipser & L. Romary (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/

type_label: Software publication
83
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n6847
resource_origin_list: used
rlabel: RStudio Server
link: https://www.rstudio.com/products/rstudio/#rstudio-server
prog_list: A07; C03; INF; A04
abstract:

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

type_label: Software publication
84
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n5658
resource_origin_list: used
rlabel: TextSTat
link: https://github.com/textstat/textstat
prog_list: B02
abstract: Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level.
type_label: Software publication
85
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n17681
resource_origin_list: created
rlabel: ToolboxTools
link: https://github.com/suetzjoh/ToolboxTools
prog_list: B02
abstract:

This project provides tools to read and write files in the Toolbox format.

type_label: Software publication
86
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n9524
resource_origin_list: used
rlabel: Tools rPraat and mPraat - Interfacing Phonetic Analyses with Signal Processing
prog_list: A03; C06; A06
abstract: The paper presents the rPraat package for R and the mPraat toolbox for Matlab, which constitute an interface between the most popular software for phonetic analyses, Praat, and the two more general programmes. The package adds to the functionality of Praat and is shown to be superior in processing speed to other tools, while maintaining the interconnection with the data structures of R and Matlab, which provides a wide range of subsequent processing possibilities. The use of the proposed tool is demonstrated on a comparison of real speech data with synthetic speech generated by means of dynamic unit selection.
type_label: Software publication
doi: https://doi.org/10.1007/978-3-319-45510-5_42
87
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n3506
resource_origin_list: used
rlabel: emuR
link: https://cran.r-project.org/web/packages/emuR/index.html
prog_list: C06; INF
abstract: Raphael Winkelmann, Klaus Jaensch, Steve Cassidy, and Jonathan Harrington. emuR: Main Package of the EMU Speech Database Management System, 2018.
type_label: Software publication
88
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n20535
resource_origin_list: used
rlabel: flairNLP / flair
link: https://github.com/flairNLP/flair
prog_list: MGK; Z; INF
abstract: A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
type_label: Software publication
89
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n7430
resource_origin_list: used
rlabel: spaCy
link: https://spacy.io/
prog_list: INF
abstract: spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
type_label: Software publication
90
resource_url: https://vivo.sfb1412.hu-berlin.de/individual/n19102
rlabel: Register: Language Users’ Knowledge of Situational-Functional Variation. Frame text of the Second Phase Proposal for the CRC 1412
link: https://edoc.hu-berlin.de/items/77f96e06-bd36-4f82-9b55-85efd7f8cbd5
prog_list: C05
type_label: Working Paper
doi: https://doi.org/10.18452/29131