STATISTICAL SOFTWARE R IN CORPUS-DRIVEN RESEARCH AND MACHINE LEARNING
Keywords

corpus linguistics
machine learning model
linguistic classifier
statistical software R
RStudio
grammatical construction
linguistic parameter
univariate analysis of variance (ANOVA)
multivariate analysis of variance (MANOVA)
the Tukey test
linear discriminant analysis
methodological aspects of interdisciplinary studies

How to Cite

[1]
V. V. Zhukovska and O. O. Mosiiuk, “STATISTICAL SOFTWARE R IN CORPUS-DRIVEN RESEARCH AND MACHINE LEARNING”, ITLT, vol. 86, no. 6, pp. 1–18, Dec. 2021, doi: 10.33407/itlt.v86i6.4627.

Abstract

The rapid development of computer software and network technologies has encouraged the intensive use of specialized statistical software not only in the traditional information technology fields (e.g., statistics, engineering, artificial intelligence) but also in linguistics. The statistical software R is one of the most popular analytical tools for the statistical processing of large volumes of digitized language data, especially in quantitative corpus linguistic studies in Western Europe and North America. This article discusses the functionality of R, focusing on its advantages for performing complex statistical analyses of linguistic data in corpus-driven studies and for creating linguistic classifiers in machine learning. To this end, a three-stage strategy for the computer-statistical analysis of linguistic corpus data is elaborated: 1) processing the data and preparing them for statistical procedures, 2) applying statistical hypothesis testing methods (MANOVA, ANOVA) and the Tukey post-hoc test, and 3) developing a linguistic classifier model and analyzing its effectiveness. The strategy is applied to 11,000 tokens of English detached nonfinite constructions with an explicit subject extracted from the BNC-BYU corpus. The statistical analysis indicates significant differences in the realization of the factors of the parameter “Part of speech of the subject”. The analyzed linguistic data are then used to build a machine learning model for classifying the given constructions. Particular attention is devoted to the methodological perspectives of interdisciplinary research in linguistics and computer studies. The potential application of the elaborated case study in training undergraduate, master's, and postgraduate students of Applied Linguistics is indicated. The article provides all the statistical data and the code, written as R scripts, with comprehensive descriptions and explanations.
The concluding part of the article summarizes the results obtained and highlights issues for further research related to popularizing the R statistical software environment and raising specialists' awareness of this statistical analysis system.
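
The three-stage strategy described in the abstract can be illustrated with a minimal R sketch. This is not the authors' script and does not use their BNC-BYU data: it assumes the built-in `iris` data set as a stand-in, with `Species` playing the role of the grouping factor and the numeric columns playing the role of linguistic parameters. The 80/20 train/test split and the random seed are likewise illustrative assumptions.

```r
# Illustrative sketch only: iris stands in for the corpus data (assumption).
library(MASS)  # provides lda()

# Stage 1: data preparation.
data(iris)
dat <- na.omit(iris)                 # drop incomplete rows, if any
dat$Species <- factor(dat$Species)   # make sure the grouping variable is a factor

# Stage 2: hypothesis testing.
# MANOVA: do the groups differ on several numeric parameters jointly?
fit_manova <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = dat)
print(summary(fit_manova))

# ANOVA on a single parameter, followed by the Tukey post-hoc test
# to see which pairs of groups differ.
fit_aov <- aov(Sepal.Length ~ Species, data = dat)
print(summary(fit_aov))
print(TukeyHSD(fit_aov))

# Stage 3: build a classifier with linear discriminant analysis
# and evaluate it on held-out data.
set.seed(42)                                          # assumed seed, for reproducibility
train_idx <- sample(nrow(dat), size = round(0.8 * nrow(dat)))
train_set <- dat[train_idx, ]
test_set  <- dat[-train_idx, ]

fit_lda <- lda(Species ~ ., data = train_set)
pred    <- predict(fit_lda, newdata = test_set)$class

# Confusion matrix and overall accuracy on the test set.
conf_mat <- table(predicted = pred, actual = test_set$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

With real corpus data, the grouping factor would presumably be a categorical linguistic parameter such as the part of speech of the subject, and the assumptions behind MANOVA, ANOVA, and linear discriminant analysis would need to be checked before interpreting the output.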

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Copyright (c) 2021 Олександр Олександрович Мосіюк, Вікторія Вікторівна Жуковська
