Tag your Middle Dutch text

You can find more information on this specific model here: https://github.com/hipster-philology/middle-dutch-model

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research: this includes the original research by E. Manjavacas, M. Kestemont and Á. Kádár as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography is given, along with other citable works such as models and datasets where available.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}
@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",}
@article{KestemontEA17,
  author    = {Mike Kestemont and
               Guy De Pauw and
               Renske van Nie and
               Walter Daelemans},
  title     = {Lemmatization for variation-rich languages using deep learning},
  journal   = {Digital Scholarship in the Humanities},
  volume    = {32},
  number    = {4},
  pages     = {797--815},
  year      = {2017},
  url       = {https://doi.org/10.1093/llc/fqw034},
  doi       = {10.1093/llc/fqw034},
}

Information about the model

This model supports lemmatization and part-of-speech tagging of Middle Dutch texts (ca. 1150-1450 AD). It was trained on the union of the four main corpora available for Middle Dutch dialects. An attempt has been made to harmonize the tagging conventions across the subcorpora, but the model generally aims for maximal lexical coverage at the expense of some consistency in tagging. The model also assumes that your input has already been tokenized: although it can deal with clitic forms (e.g. tpaert), it makes no attempt to merge tokens when a word has been split in a non-standard way (e.g. ghe seit). The model was trained on the following subcorpora:

  • Dutch Language Institute, Corpus-Gysseling, [link]
  • Dutch Language Institute, Corpus of Middle Dutch, [link]
  • Royal Academy of Dutch Language and Literature, Corpus van veertiende-eeuwse niet-literaire Nederlandse teksten (C14), [link]
  • A smaller sample of religious texts, described in the paper by Kestemont et al. (2017) cited above

Much credit should go to the Dutch Language Institute, which is the primary curator of these materials.
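
Because the model expects tokenized input and does not merge split words, it can help to normalize tokens before tagging. The following is a minimal sketch of such a pre-processing step, not part of the model or of Pie Extended: the function names, the `KNOWN_SPLITS` table and the joined form `gheseit` are assumptions made for illustration, and the table should be filled with the splittings that actually occur in your own transcriptions.

```python
import re

# Hypothetical table of split forms to re-join before tagging.
# Keys are adjacent token pairs, values are the merged token (assumed forms).
KNOWN_SPLITS = {("ghe", "seit"): "gheseit"}


def pretokenize(text):
    """Split on whitespace and detach punctuation; clitic forms stay whole."""
    return re.findall(r"\w+|[^\w\s]", text)


def merge_known_splits(tokens, known_splits=KNOWN_SPLITS):
    """Re-join adjacent token pairs that are listed in known_splits."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in known_splits:
            merged.append(known_splits[pair])
            i += 2  # both halves of the split word are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = merge_known_splits(pretokenize("tpaert ghe seit."))
print(" ".join(tokens))
```

The clitic form tpaert is left untouched, since the model is described as handling such forms itself; only the pairs listed in the table are merged, so unlisted sequences are never joined by accident.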

Bibliography

This lemmatizer is provided to you thanks to the data of the corpora listed below, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them:

  • E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493--1503.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537 Check Zenodo for the latest version.
  • Dutch Language Institute, Corpus-Gysseling, [link]
  • Dutch Language Institute, Corpus of Middle Dutch, [link]
  • Royal Academy of Dutch Language and Literature, Corpus van veertiende-eeuwse niet-literaire Nederlandse teksten (C14), [link]