VP – Introduction to R Scripting Language

Code: B83128
Semester: summer
Completion: credit
Credits: 3
Range: 0+30

Course description

The R language and the R computational environment are dedicated to statistical computing and the accompanying graphical visualization (The R Project for Statistical Computing, https://www.r-project.org/). Besides statisticians, biologists and physicians at the vast majority of universities, at home and abroad, commonly use R to analyze data and test hypotheses. The R environment is open-source, free of charge (both free-as-in-beer and free-as-in-speech), and relatively user-friendly. It is becoming a dominant analytical tool in many fields, which makes it a full-fledged alternative to point-and-click software. Furthermore, in some biomedical areas, such as gene-level analyses of molecular data (the Bioconductor platform) or the interpretation of cross-over designs in pharmacological trials, there are as yet no alternatives to R.

The subject is recommended for all undergraduate students who are considering a career in science, i.e. those thinking about Ph.D. study, as well as for all interested graduate students (Ph.D. candidates). It is designed as a brief introduction to the R language, so no previous knowledge of R or any other programming language is required.

A student will be introduced to installing and running the R environment, to importing data into and exporting data from it, and eventually to data manipulation. Working through hands-on examples based on (bio)medical data, she will become familiar with R libraries, basic statistical analyses, and data visualization, so that she can see how R might be applied in her own future research. In addition, a student will be introduced to R programming and to defining her own R functions. Throughout the course, and particularly in the final course project, a student will need to keep her data analyses and project results reproducible and transparent.
According to some evidence, publications whose statistical analysis was carried out in R have a statistically significantly higher chance of being cited more often.

The course is aimed at students interested in the R programming language and environment, and in data science as well, since R is widely used for data science applications. R is not only a programming language designed for statistical computing and graphics but also a Turing-complete, general-purpose programming language suitable for solving complex tasks. The advantages of R over commercial systems such as MATLAB include (i) open-source distribution: R is free both in the sense of costing no money (“free-as-in-beer”) and in the sense of its license, the GNU General Public License, which permits editing the source code and commercial use (“free-as-in-speech”); (ii) a large online community around R that is ready to help and answer users’ questions; (iii) easy development of R web applications; and (iv) user-friendly typesetting of TeX documents directly from R code. The syntax of R is simple, intuitive, and quite similar to that of MATLAB. According to recent worldwide statistics from kaggle.com, R has become the most popular programming language chosen for data analysis, data science, and machine learning. In that sense, R can be called the lingua franca of data science.

The class is practice-based and focused on problem solving, number-crunching exercises, and real-data analyses carried out through hands-on R programming and scripting; assigned tasks progress from easy to difficult.
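
As a minimal sketch of what this syntax looks like in practice, the lines below build a numeric vector and summarize it using base R only. The blood-pressure readings and the variable names are made up for illustration and are not course data.

```r
# Hypothetical systolic blood pressure readings (mmHg); illustrative values only
sbp <- c(118, 125, 132, 141, 109, 127)

# Vectorized arithmetic: the mean is subtracted from every element at once
sbp_centered <- sbp - mean(sbp)

# Quick one-dimensional summaries
sbp_mean <- mean(sbp)
sbp_sd <- sd(sbp)
n_high <- sum(sbp >= 130)  # number of readings at or above 130 mmHg

print(summary(sbp))
```

Assignment with `<-`, vectors created with `c()`, and functions applied to whole vectors are the building blocks that the rest of the course relies on.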

Course syllabus

1st Introduction, installation, and settings of the R environment. Overview of R data types and structures. Basic operations, numbers, vectors, and simple manipulation of them. Quick statistics and summaries for one-dimensional data.

2nd More on data types in R, data structures, and their manipulation. Arrays, data frames, lists. Working effectively with data in common formats.

3rd Loading external data into R. Saving data from R to a file. Data (pre)processing. How to automate routine data cleaning using R.

4th Functions in R. Useful built-in functions. User-defined functions in R. Functionalities far beyond MS Excel’s “data processing”.

5th R as a programming language. Loops, conditions, warnings, errors. Automation of data tasks.

6th Elements of statistics and data analysis in R. Gentle introduction to probability distributions. Measures of central tendency and variability. Introduction to hypothesis testing in R. Basic graphical visualization. Must-know for data analysis.

7th Advanced statistics and data analysis in R. Linear models, including generalized ones (GLM). Linear regression. Analysis of variance (ANOVA), analysis of covariance (ANCOVA), and their multivariate counterparts (MANOVA, MANCOVA). Appropriate graphical visualization. The gold standard for scientific articles in (bio)medicine.

8th Logistic regression, its interpretation and visualization. A tool for classifying patients into disease classes.

9th Time series. Survival analysis. Appropriate graphical visualization. Introducing time as a variable that matters in an analysis.

10th Selected advanced statistical methods in R. Cluster analysis. Discriminant analysis. Factor analysis (FA), including an explanation of rotating the solution. Principal Component Analysis (PCA). Appropriate graphical visualization.

11th More on graphical outputs in R. Low-level and high-level graphical commands. Displaying multivariate data. Parameters of plots and diagrams. Overview of plots and diagrams in R and how to save a plot to a file. Choosing the most appropriate type of plot for a given analysis. How to improve a plot enough to use it in a publication.

*12th Selected methods of machine learning in R. Naïve Bayes classifier. Support Vector Machine (SVM). Cross-Validation (CV). Decision trees. Random forests. Neural networks. Association rules. Jackknife. Bootstrap.

*13th Text processing in R. Handling and processing strings in R. Regular expressions in R. Tokenization, n-grams. Including TeX code within R code. How to add R code, results of data analysis, and plots produced by R to TeX code and typeset a PDF.

*14th Building web applications with R and the Shiny package. Components of a web application built with R. Using HTML, CSS, and JavaScript to build an R web application.

15th Consultations and individual help with the seminar project.

The topics marked with an asterisk (*) are advanced and, therefore, optional.
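
To give a flavour of how several of the topics above fit together, here is a small sketch combining a user-defined function (topic 4), a loop (topic 5), a linear model (topic 7), and a logistic regression (topic 8). It is only an illustration under stated assumptions: it uses the `mtcars` data set that ships with R rather than (bio)medical data, the helper name `coef_of_variation` is hypothetical, and the actual course material may differ.

```r
# User-defined function: coefficient of variation in percent (hypothetical helper)
coef_of_variation <- function(x) {
  100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Loop over selected columns of the built-in mtcars data frame
vars <- c("mpg", "hp", "wt")
cv <- numeric(length(vars))
names(cv) <- vars
for (v in vars) {
  cv[v] <- coef_of_variation(mtcars[[v]])
}
print(round(cv, 1))

# Simple linear regression: fuel economy (mpg) explained by car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Logistic regression: transmission type (am, coded 0/1) explained by weight
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(fit_logit))
```

The same loop could be written more compactly with `sapply(mtcars[vars], coef_of_variation)`; the explicit `for` version is shown because loops are a syllabus topic in their own right.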
