E Installation

E.1 Installing CUDA (optional)

In 2007, NVIDIA released CUDA® (Compute Unified Device Architecture), a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the GPU. You can download the platform from here, but before you install it, check that your system is compatible and read the disclaimers provided. For Windows operating systems, you can find additional information here. Note that this step is optional: you can run all the material provided in this course without GPU computation. If you plan to install CUDA anyway, first read the detailed installation steps that worked for the author (section E.6).

E.2 The R language

R is a Turing-complete programming language that focuses specifically on data analysis. It is the most popular language among statisticians and data scientists, and in comparison with Python it is often said to be the brains whereas Python is said to be the muscles (see Upadhyay 2016):


Figure E.1: In an article on his channel called YOU CANalytics, Roopam Upadhyay compared R with Batman and Python with Superman. Such a comparison is bound to be opinionated, and I believe that some of the differences explained in the article are not unconditionally valid, but at least it debunks some of the long-held beliefs about the R language.

For Windows systems, you can download the latest stable R version from here. It comes with a number of very useful functions already available in the base namespace. Again, first read the installation steps that worked for the author below.

E.3 Python

Python is a general-purpose programming language developed by Guido van Rossum. The previous section already touched on the differences between R and Python. In addition, some resources or platforms rely on only one of these two languages. For these reasons, it pays off for a prospective data scientist to master both.

Personality E.1 (Guido van Rossum)

Guido van Rossum is the father of Python. He started developing the language in 1989 as a Dutch IT specialist at the Centrum Wiskunde & Informatica (CWI, the former Mathematisch Centrum) in Amsterdam, the Netherlands. Originally meant as a low-threshold language for beginning programmers, it now trumps all others in popularity. See for yourself on Google Trends.

E.4 RStudio

RStudio was found to be by far the most valuable and student-friendly IDE (integrated development environment) for working professionally with data, not just in R but also in Python, Julia and other (scripting) languages such as SQL and Stan. In addition, it is the easiest IDE for people with limited computer experience. RStudio is free for academic use. See https://rstudio.com for more information; installation generally takes only a few minutes. Once it is installed, get acquainted with the IDE; see Campbell (2019) for details.

E.5 Installing TensorFlow

In 2015, Google released TensorFlow®. TensorFlow is an open-source Python library that aids the development, training, building and releasing of machine learning models, particularly deep learning models. To install TensorFlow for RStudio, visit https://tensorflow.rstudio.com/ and read the instructions there. Before you start, check the installation steps used by the author below.

E.6 Installation steps that worked for the author

Every system is different, so the steps below might not work for you exactly as written. Even so, together with the links provided above, they should get you up and running quickly. Here is the set of instructions that worked for me, starting from scratch on a Dell XPS 15 7590 with a Windows 10 Pro operating system. Windows users should remember to check the ‘set Path’ option during installation, if available, or set the path manually using the command line (https://www.windows-commandline.com/set-path-command-line/). Also, test each step before you go on to the next.
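
As a quick way to see which of the command-line tools used in the steps below are already reachable through PATH, a small Python sketch like the following may help. The tool names (nvcc, R, python, conda) are taken from the steps in this section; the helper name find_tools is made up for this illustration, and the script only reports what it finds without changing anything.

```python
import shutil

# Command-line tools that are tested in the installation steps of this section.
TOOLS = ["nvcc", "R", "python", "conda"]


def find_tools(names):
    """Map each tool name to its full path, or to None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools(TOOLS).items():
    status = path if path is not None else "not found on PATH"
    print(f"{tool:<8} {status}")
```

A tool reported as not found is either not installed yet or installed in a folder that still has to be added to PATH.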

  1. (optional) Install CUDA version 10.2.89_441.22 for Windows and test with nvcc -V anywhere at the command line prompt
  2. Install R and test by running the command R --version anywhere at the command line prompt
  3. Install Python 3 and test by running the command python --version anywhere at the command line prompt
  4. Install RStudio and check that the correct version of R is being linked to
  5. Install Miniconda in a directory that contains no spaces (as advised, do not let the installer add Miniconda to the Windows PATH variable automatically; add it manually instead) and test by running the command conda --version anywhere at the command line prompt
  6. Add a Python environment, here called tf (run the command from the Scripts sub-folder of the Miniconda installation folder), read and follow the advice given (if any), and test afterwards with the command conda info --envs or conda env list:
conda create -n tf
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda

## Package Plan ##
  environment location: […]\envs\tf

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
 
  To activate this environment, use
 
      $ conda activate tf
 
  To deactivate an active environment, use
 
      $ conda deactivate
  7. Open RStudio and, using the console, install the helper package tensorflow that guides the installation:
install.packages("tensorflow")
Installing package into […]
(as ‘lib’ is unspecified)
also installing the dependencies ‘rappdirs’, ‘config’, ‘reticulate’, ‘tfruns’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/rappdirs_0.3.1.zip'
Content type 'application/zip' length 87285 bytes (85 KB)
downloaded 85 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/config_0.3.zip'
Content type 'application/zip' length 27334 bytes (26 KB)
downloaded 26 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/reticulate_1.16.zip'
Content type 'application/zip' length 1742735 bytes (1.7 MB)
downloaded 1.7 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/tfruns_1.4.zip'
Content type 'application/zip' length 1479931 bytes (1.4 MB)
downloaded 1.4 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/tensorflow_2.2.0.zip'
Content type 'application/zip' length 145036 bytes (141 KB)
downloaded 141 KB

package ‘rappdirs’ successfully unpacked and MD5 sums checked
package ‘config’ successfully unpacked and MD5 sums checked
package ‘reticulate’ successfully unpacked and MD5 sums checked
package ‘tfruns’ successfully unpacked and MD5 sums checked
package ‘tensorflow’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in […]
  8. Still in RStudio's console window, load the tensorflow package into memory and run its installer (change tf if you use another environment name):
library(tensorflow)
install_tensorflow(version = "nightly-gpu", envname = "tf")
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: […]\envs\tf

  added / updated specs:
    - python=3.6

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.6.20          |   py36h9f0ad1d_0         152 KB  conda-forge
    pip-20.2.2                 |             py_0         1.1 MB  conda-forge
    python-3.6.11              |h6f26aa1_2_cpython       18.9 MB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    setuptools-49.6.0          |   py36h9f0ad1d_0         919 KB  conda-forge
    vc-14.1                    |       h869be7e_1           6 KB  conda-forge
    vs2015_runtime-14.16.27012 |       h30e32a0_2         2.2 MB  conda-forge
    wheel-0.35.1               |     pyh9f0ad1d_0          29 KB  conda-forge
    wincertstore-0.2           |        py36_1003          13 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        23.3 MB

The following NEW packages will be INSTALLED:

  certifi            conda-forge/win-64::certifi-2020.6.20-py36h9f0ad1d_0
  pip                conda-forge/noarch::pip-20.2.2-py_0
  python             conda-forge/win-64::python-3.6.11-h6f26aa1_2_cpython
  python_abi         conda-forge/win-64::python_abi-3.6-1_cp36m
  setuptools         conda-forge/win-64::setuptools-49.6.0-py36h9f0ad1d_0
  vc                 conda-forge/win-64::vc-14.1-h869be7e_1
  vs2015_runtime     conda-forge/win-64::vs2015_runtime-14.16.27012-h30e32a0_2
  wheel              conda-forge/noarch::wheel-0.35.1-pyh9f0ad1d_0
  wincertstore       conda-forge/win-64::wincertstore-0.2-py36_1003

[…]
Collecting tf-nightly-gpu
  Downloading tf_nightly_gpu-2.4.0.dev20200822-cp36-cp36m-win_amd64.whl (282.9 MB)
[…]

Installation complete.

Warning messages:
1: In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="[…]\Miniconda\envs\tf/python.exe": Het systeem kan het opgegeven bestand niet vinden
2: In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="[…]\Miniconda\envs\tf/python.exe": Het systeem kan het opgegeven bestand niet vinden

Restarting R session...
  9. Test TensorFlow by running the Python code below:
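import numpy as np
import tensorflow as tf

# Fix the global seed so that the random tensor below is reproducible
tf.random.set_seed(1)

# Build a minimal one-unit model to confirm that Keras loads
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1])
])

# Sum a 1000 x 1000 tensor of random normal values
print(tf.reduce_sum(tf.random.normal([1000, 1000])))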
2020-09-17 15:06:29.340924: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-09-17 15:06:29.365342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.56GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-09-17 15:06:29.365627: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-09-17 15:06:29.368816: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-09-17 15:06:29.371298: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2020-09-17 15:06:29.372145: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2020-09-17 15:06:29.375605: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2020-09-17 15:06:29.377715: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2020-09-17 15:06:29.385621: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-09-17 15:06:29.385907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-17 15:06:29.386607: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-17 15:06:29.489348: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20dd8509930 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-17 15:06:29.489640: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-09-17 15:06:29.490043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.56GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-09-17 15:06:29.490402: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-09-17 15:06:29.490567: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-09-17 15:06:29.490775: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2020-09-17 15:06:29.491080: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2020-09-17 15:06:29.491298: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2020-09-17 15:06:29.491542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2020-09-17 15:06:29.491784: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-09-17 15:06:29.492628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-17 15:06:30.261351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-17 15:06:30.261492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-09-17 15:06:30.261571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-09-17 15:06:30.261943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2907 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-09-17 15:06:30.289124: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20e0e74be40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-17 15:06:30.289298: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1650, Compute Capability 7.5

tf.Tensor(22.580101, shape=(), dtype=float32)
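
To confirm afterwards that TensorFlow can see the GPU without reading through the full log above, a short check along the following lines could be used. It is a sketch that assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available; the function name describe_tensorflow is hypothetical.

```python
import importlib.util


def describe_tensorflow():
    """Return a one-line summary of the TensorFlow installation, if any."""
    if importlib.util.find_spec("tensorflow") is None:
        return "TensorFlow is not installed in this Python environment"
    # Imported only when the package is present in the active environment
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    return f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s)"


print(describe_tensorflow())
```

A count of zero GPUs on a machine with CUDA installed usually points to a mismatch between the CUDA or cuDNN version and the TensorFlow build.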

E.7 Where to find help

If you want to perform a certain action in R but do not know the name of the package, R-bloggers and Stack Overflow are probably the best places to search. If your question concerns the RStudio IDE, the RStudio Community may be worth a look. If you search on Google anyway, add the term package to your search terms to get better results. If you already have the name of an R package or function and you work with RStudio, you usually do not need to leave the IDE. To learn more about a particular package, there are vignettes (a kind of manual) and documentation. You call up vignettes with the vignette() function and documentation with ? or help(). R also has demos, and you can run the examples from the technical documentation directly without having to copy the code.

# Start the HTML help system
help.start()

# Documentation
?magrittr
?magrittr::extract

# Search the help pages for a term
help.search("pipes")

# Information/manual for a specific package
vignette('magrittr')

# List the vignettes of attached packages
vignette(all = FALSE)

# List the vignettes of all installed packages
vignette(all = TRUE)

# Browse all vignettes in a web browser
browseVignettes()

# Demonstration
demo(image)

# Example code from the documentation
example("stl")

# Package info (author, version, ...)
library(help = "magrittr")

# Search the R site
RSiteSearch("magrittr")

# CRAN repository, list of packages
browseURL("https://cran.r-project.org/web/packages/available_packages_by_name.html")

# CRAN repository, specific package
browseURL("https://cran.r-project.org/web/packages/magrittr/index.html")

# CRAN task views
browseURL("https://cran.r-project.org/web/views/")

# CRAN specific task view
browseURL("https://cran.r-project.org/web/views/Hydrology.html")

For Python and other languages, I refer to Stack Overflow. See also the next section for some more information about Python.
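
For Python, a rough counterpart to some of the R help commands in this section is available in the standard library. The sketch below is an addition to the original text: it uses json.dumps purely as an arbitrary example object, and the helper name first_doc_line is made up for this illustration.

```python
import inspect
import json
import pydoc


def first_doc_line(obj):
    """Return the first line of an object's docstring, or '' if it has none."""
    doc = inspect.getdoc(obj)
    return doc.splitlines()[0] if doc else ""


# Call signature, comparable to the Usage section of an R help page
print(inspect.signature(json.dumps))

# One-line summary taken from the docstring
print(first_doc_line(json.dumps))

# Full help text as a plain string; help(json.dumps) shows this documentation interactively
help_text = pydoc.render_doc(json.dumps, renderer=pydoc.plaintext)
print(help_text.splitlines()[0])
```

In an interactive session, help(json.dumps) is the closest equivalent of ?name in R.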

E.8 Where to find learning algorithms

Let us start with where you can find the various learning algorithms. In some cases, such as for a single perceptron, you can simply code the algorithm yourself, but fortunately most algorithms are made available. Because mathematicians tend to use R, newer learning algorithms usually appear first in R and only later in Python. When new methods are published, the authors are often expected to register the algorithm as a peer-reviewed R package and make it publicly available. The Comprehensive R Archive Network (CRAN), which manages the R packages and ensures that they meet certain standards both mathematically and programmatically, compiles so-called task views, each giving an overview of the packages that can help with a particular task. Searching for “machine learning” yields hundreds of packages, from ahaz to xgboost, and many of these packages give access to dozens of different types of learning algorithms. But that is not all. On the web page listing all task views we find, for example, the topic natural language processing (NLP), with dozens more packages specifically on this theme.
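
To illustrate the remark that a single perceptron can simply be coded by hand, here is a minimal sketch in plain Python. It is not taken from the course material: the function names, the learning rate, the epoch limit and the logical-AND training data are all choices made for this illustration.

```python
def predict(weights, bias, x):
    """Step activation applied to the weighted sum of the inputs."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Classic perceptron rule: adjust the weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            if error != 0:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
                errors += 1
        if errors == 0:  # every sample is classified correctly, so stop early
            break
    return weights, bias


X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]  # logical AND, which is linearly separable

w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # → [0, 0, 0, 1]
```

The same loop would not converge on data that is not linearly separable (XOR, for example), which is one reason to reach for the packaged algorithms discussed in this section.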

For Python development, we refer first and foremost to scikit-learn for learning algorithms. This library offers a broad toolbox for beginning data scientists. The name, incidentally, derives from SciPy Toolkit \(\rightarrow\) scikit \(\rightarrow\) sk. Then there is TensorFlow, which the course covers in more depth. Here, too, you will find a wealth of information. Python offers a number of advantages for ML, and the most important is perhaps that Python is a general-purpose programming language in which it is easier to build full-fledged applications or to interface with hardware. I take the claim that Python is ‘more powerful’ or ‘more performant’ than R with a grain of salt. On the few occasions when both platforms are compared fairly, the difference is insignificant; moreover, what mainly matters is development time, and there Python certainly offers no advantage. Finally, I briefly mention Julia, an emerging language that is intended to make ML easier.

References

Campbell, M., 2019. RStudio views, in: Learn RStudio IDE. Springer, pp. 29–38.

Upadhyay, R., 2016. R vs Python – a comparison [WWW Document]. Accessed: 2020-08-28.