Installation
Installing CUDA (optional)
In 2007 NVIDIA released CUDA® (Compute Unified Device Architecture). CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the GPU. You can download the platform from here, but before you install it, check that your system is compatible and read the disclaimers provided. For Windows operating systems you can find additional information here. Note that this step is optional: you can run all the material provided in this course without GPU computation. If you plan to install CUDA anyway, first check the detailed installation steps that worked for the author.
The R language
R is a Turing-complete programming language focusing specifically on data analysis. It is among the most popular languages with statisticians and data scientists, and in comparison with Python it is often said to be the brains, whereas Python is said to be the muscles (see Upadhyay 2016).
For Windows systems, you can download the latest stable R version from here. It comes with a number of very useful functions already in the base namespace. Again, make sure to continue reading first to check the installation steps that worked for the author below.
Python
Python is a general-purpose programming language developed by Guido van Rossum. The previous section already touched on the differences between R and Python. In addition, some resources or platforms rely on only one of these two languages. For these reasons, it pays off for a prospective data scientist to master both.
Personality E.1 (Guido van Rossum)
Guido van Rossum is the father of Python. He developed the language in 1989 as a Dutch IT specialist associated with the Mathematisch Centrum in Amsterdam, Netherlands. Originally meant as a low-threshold language for beginning programmers, it now outranks all others in popularity. See for yourself on Google Trends.
RStudio
RStudio was found by far to be the most valuable and student-friendly IDE (integrated development environment) for working professionally with data, not just in R, but also in Python, Julia and other (scripting) languages such as SQL and Stan. In addition, it is the easiest IDE for people who have limited computer experience. RStudio is free for academic use. See https://rstudio.com for more information; installation generally takes only a few minutes. Once installed, get acquainted with the IDE. See the book by Campbell for details on the IDE (Campbell 2019).
Installing TensorFlow
In 2015, Google released TensorFlow®. TensorFlow is an open-source library, used mainly from Python, that aids the development, training, building and release of machine learning models, particularly deep learning models. To install TensorFlow for RStudio, visit https://tensorflow.rstudio.com/ and read the instructions there. Check the installation steps used by the author below before you start.
Installation steps that worked for the author
Every system is different, so the steps below might not work for you. Still, with these steps and the links provided above, you should be able to get up and running quickly. Here is the set of instructions that worked for me, starting from scratch on a Dell XPS 15 7590 running Windows 10 Pro. Windows users, remember to check the ‘set Path’ option during installation, if available, or set the path manually using the [command line](https://www.windows-commandline.com/set-path-command-line/). Also, test each step before you move on to the next.
- (optional) Install CUDA version 10.2.89_441.22 for Windows and test with
nvcc -V
anywhere at the command line prompt
- Install R and test by running the command
R --version
anywhere at the command line prompt
- Install Python 3 and test by running the command
python --version
anywhere at the command line prompt
- Install RStudio and check that it links to the correct version of R
- Install Miniconda in a directory whose path contains no spaces (as the installer advises, do not let it add Miniconda to the Windows PATH variable automatically, but add it manually) and test by running the command
conda --version
anywhere at the command line prompt
- Add a Python environment, here called tf (in the Scripts sub-folder of the Miniconda installation folder), read and follow the advice given (if any) and test afterwards with the command conda info --envs or conda env list:
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.8.4
Please update conda by running
$ conda update -n base -c defaults conda
Package Plan ##
environment location: […]\envs\tf
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
To activate this environment, use
$ conda activate tf
To deactivate an active environment, use
$ conda deactivate
- Open RStudio and, using the console, install the helper package tensorflow that guides the installation:
install.packages("tensorflow")
Installing package into […]
(as ‘lib’ is unspecified)
also installing the dependencies ‘rappdirs’, ‘config’, ‘reticulate’, ‘tfruns’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/rappdirs_0.3.1.zip'
Content type 'application/zip' length 87285 bytes (85 KB)
downloaded 85 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/config_0.3.zip'
Content type 'application/zip' length 27334 bytes (26 KB)
downloaded 26 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/reticulate_1.16.zip'
Content type 'application/zip' length 1742735 bytes (1.7 MB)
downloaded 1.7 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/tfruns_1.4.zip'
Content type 'application/zip' length 1479931 bytes (1.4 MB)
downloaded 1.4 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/tensorflow_2.2.0.zip'
Content type 'application/zip' length 145036 bytes (141 KB)
downloaded 141 KB
package ‘rappdirs’ successfully unpacked and MD5 sums checked
package ‘config’ successfully unpacked and MD5 sums checked
package ‘reticulate’ successfully unpacked and MD5 sums checked
package ‘tfruns’ successfully unpacked and MD5 sums checked
package ‘tensorflow’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in […]
- Still in RStudio's console window, load the tensorflow package in memory and use it to install TensorFlow (change tf if you use another environment name):
library(tensorflow)
install_tensorflow(version = "nightly-gpu", envname = "tf")
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
Package Plan ##
environment location: […]\envs\tf
added / updated specs:
- python=3.6
The following packages will be downloaded:
package | build
---------------------------|-----------------
certifi-2020.6.20 | py36h9f0ad1d_0 152 KB conda-forge
pip-20.2.2 | py_0 1.1 MB conda-forge
python-3.6.11 |h6f26aa1_2_cpython 18.9 MB conda-forge
python_abi-3.6 | 1_cp36m 4 KB conda-forge
setuptools-49.6.0 | py36h9f0ad1d_0 919 KB conda-forge
vc-14.1 | h869be7e_1 6 KB conda-forge
vs2015_runtime-14.16.27012 | h30e32a0_2 2.2 MB conda-forge
wheel-0.35.1 | pyh9f0ad1d_0 29 KB conda-forge
wincertstore-0.2 | py36_1003 13 KB conda-forge
------------------------------------------------------------
Total: 23.3 MB
The following NEW packages will be INSTALLED:
certifi conda-forge/win-64::certifi-2020.6.20-py36h9f0ad1d_0
pip conda-forge/noarch::pip-20.2.2-py_0
python conda-forge/win-64::python-3.6.11-h6f26aa1_2_cpython
python_abi conda-forge/win-64::python_abi-3.6-1_cp36m
setuptools conda-forge/win-64::setuptools-49.6.0-py36h9f0ad1d_0
vc conda-forge/win-64::vc-14.1-h869be7e_1
vs2015_runtime conda-forge/win-64::vs2015_runtime-14.16.27012-h30e32a0_2
wheel conda-forge/noarch::wheel-0.35.1-pyh9f0ad1d_0
wincertstore conda-forge/win-64::wincertstore-0.2-py36_1003
[…]
Collecting tf-nightly-gpu
Downloading tf_nightly_gpu-2.4.0.dev20200822-cp36-cp36m-win_amd64.whl (282.9 MB)
[…]
Installation complete.
Warning messages:
1: In normalizePath(path.expand(path), winslash, mustWork) :
path[1]="[…]\Miniconda\envs\tf/python.exe": Het systeem kan het opgegeven bestand niet vinden
2: In normalizePath(path.expand(path), winslash, mustWork) :
path[1]="[…]\Miniconda\envs\tf/python.exe": Het systeem kan het opgegeven bestand niet vinden
Restarting R session...
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.
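- Test TensorFlow by running the Python code below
import numpy as np
import tensorflow as tf
# Fix the random seed so that the result is reproducible
tf.random.set_seed(1)
# Build a minimal one-neuron model to confirm that Keras loads
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
# Sum one million standard-normal draws; the device log shows whether the GPU was found
print(tf.reduce_sum(tf.random.normal([1000, 1000])))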
2020-09-17 15:06:29.340924: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-09-17 15:06:29.365342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.56GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-09-17 15:06:29.365627: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-09-17 15:06:29.368816: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-09-17 15:06:29.371298: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2020-09-17 15:06:29.372145: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2020-09-17 15:06:29.375605: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2020-09-17 15:06:29.377715: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2020-09-17 15:06:29.385621: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-09-17 15:06:29.385907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-17 15:06:29.386607: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-17 15:06:29.489348: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20dd8509930 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-17 15:06:29.489640: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-17 15:06:29.490043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.56GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-09-17 15:06:29.490402: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-09-17 15:06:29.490567: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-09-17 15:06:29.490775: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2020-09-17 15:06:29.491080: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2020-09-17 15:06:29.491298: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2020-09-17 15:06:29.491542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2020-09-17 15:06:29.491784: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-09-17 15:06:29.492628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-17 15:06:30.261351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-17 15:06:30.261492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-09-17 15:06:30.261571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-09-17 15:06:30.261943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2907 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-09-17 15:06:30.289124: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20e0e74be40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-17 15:06:30.289298: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1650, Compute Capability 7.5
tf.Tensor(22.580101, shape=(), dtype=float32)
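The version checks in the steps above can also be scripted. The following is a minimal sketch in Python, assuming that the four commands used above (nvcc, R, python and conda) are the ones you want to verify. The names CHECKS, check_tool and check_all are hypothetical helpers made up for this illustration and are not part of any of these tools.

```python
import shutil
import subprocess

# Commands taken from the installation steps; each prints a version banner.
CHECKS = {
    "nvcc": ["nvcc", "-V"],
    "R": ["R", "--version"],
    "python": ["python", "--version"],
    "conda": ["conda", "--version"],
}


def check_tool(command):
    """Return the first line of the tool's version output, or None if unavailable."""
    executable = shutil.which(command[0])
    if executable is None:
        return None  # the tool is not on the PATH
    try:
        result = subprocess.run(
            [executable, *command[1:]],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # the tool was found but could not be run
    # Some tools print their version banner on stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def check_all():
    """Run every check and map each tool name to its version line (or None)."""
    return {name: check_tool(command) for name, command in CHECKS.items()}


if __name__ == "__main__":
    for name, version in check_all().items():
        status = version if version is not None else "NOT FOUND on PATH"
        print(f"{name:>6}: {status}")
```

A tool reported as not found usually means that its folder is missing from the PATH variable, which is the most common stumbling block on Windows. Since CUDA is optional, a missing nvcc is not a problem for the course material.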
Where to find help
If you want to perform a certain action in R but do not know the name of the package, R Bloggers and Stack Overflow are probably the best places to search. If your question is about the RStudio IDE, the RStudio Community may be the place to go. If you search on Google anyway, add the term package to your search terms to get better results. If you have the name of an R package or function and you work with RStudio, you usually do not need to leave the IDE. If you want to know more about a particular package, there are vignettes (a kind of manual) and documentation. You call up vignettes with the vignette() function, and documentation with ? or help(). R also has demos, and you can run the examples from the technical documentation directly without having to copy the code.
# Start the help system
help.start()
# Documentation
?magrittr
?magrittr::extract
# Search the help by term
help.search("pipes")
# Information/manual for a specific package
vignette('magrittr')
# List vignettes of the attached packages
vignette(all = FALSE)
# List vignettes of all installed packages
vignette(all = TRUE)
# Browse all vignettes
browseVignettes()
# Demonstration
demo(image)
# Example code from the documentation
example("stl")
# Package info (author, version, ...)
library(help = "magrittr")
# Search the R site
RSiteSearch("magrittr")
# CRAN package repository
browseURL("https://cran.r-project.org/web/packages/available_packages_by_name.html")
# CRAN page of a specific package
browseURL("https://cran.r-project.org/web/packages/magrittr/index.html")
# CRAN task views
browseURL("https://cran.r-project.org/web/views/")
# A specific CRAN task view
browseURL("https://cran.r-project.org/web/views/Hydrology.html")
For Python and other languages I refer you to Stack Overflow. See also the next section for some more information on Python.
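Python also ships with built-in help facilities that play a role comparable to ?, help() and the package listings in R. The sketch below shows a few of them using only the standard library; the json module is an arbitrary example chosen because it comes with every Python installation, and the variable names are illustrative.

```python
import inspect
import json  # arbitrary example module that ships with Python
import pydoc

# Interactive help, printed to the console (uncomment in an interactive session):
# help(json.dumps)

# The same documentation as a plain string, handy in scripts.
doc_text = pydoc.render_doc(json.dumps, renderer=pydoc.plaintext)
print(doc_text.splitlines()[0])

# Only the call signature and the first line of the docstring.
print(inspect.signature(json.dumps))
print(inspect.getdoc(json.dumps).splitlines()[0])

# The public names a module offers.
public_names = [name for name in dir(json) if not name.startswith("_")]
print(public_names)
```

From the command line, python -m pydoc json prints the same documentation without starting an interactive session.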
Where to find learning algorithms
Let us begin with where you can find the various learning algorithms. In some cases, such as a single perceptron, you can simply code it yourself, but fortunately most algorithms are made available. Because mathematicians tend to use R, newer learning algorithms usually appear first in R and only later in Python. After all, when publishing new methods, authors are often expected to register the algorithm as a peer-reviewed R package and make it publicly available. The Comprehensive R Archive Network (CRAN), which manages the R packages and ensures that they meet certain standards both mathematically and programmatically, compiles so-called task views, each giving an overview of the packages that can help with a particular task. If we search for “machine learning”, we arrive at hundreds of packages, from ahaz to xgboost, and many of these packages give access to dozens of different types of learning algorithms. But that is not all. On the web page with the overview of all task views we find, for example, the topic natural language processing (NLP), with dozens more packages devoted to this theme.
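As mentioned above, a single perceptron is simple enough to code yourself. The sketch below is one minimal way to do so in plain Python, using the classic perceptron learning rule on a toy dataset (the logical AND). The function names, the toy data and the default parameters are illustrative choices and are not taken from any package.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Train a single perceptron with the perceptron learning rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            # The error is 0 when the prediction is right, otherwise +1 or -1.
            error = target - predict(weights, bias, x)
            # Nudge the weights and bias towards the correct answer.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Toy data: the logical AND of two binary inputs (linearly separable).
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```

A single perceptron can only separate classes with a straight line (or hyperplane), so this works for AND but would not work for XOR. For anything beyond such toy cases, the packages discussed here are the more practical route.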
For Python development, we refer first and foremost to scikit-learn for learning algorithms. This library offers an ample toolbox for beginning data scientists. The name, incidentally, derives from SciPy Toolkit \(\rightarrow\) scikit \(\rightarrow\) sk. Then there is TensorFlow, which the course covers in more depth. Here too you will find a wealth of information. Python offers a number of advantages for ML, and the most important one may well be that Python is a general-purpose programming language with which you can more easily build full-fledged applications or interface with hardware. I take the claim that Python is more ‘powerful’ or ‘performant’ than R with a grain of salt. On the few occasions when both platforms are compared fairly, the difference is insignificant, and moreover you mainly have to consider development time, where Python certainly offers no advantage. Finally, I briefly mention Julia, an emerging language in which ML is meant to become simpler.