6323 Labs
1 6323 Knowledge Mining
Welcome to the homepage for the 6323 Labs! This course focuses on data mining and machine learning methods, core aspects of data science. Throughout the labs, you will explore a range of statistical modeling techniques and data analysis methods, including regression analysis, classification, decision trees, neural networks, and support vector machines. These labs will also introduce you to new developments in learning, such as deep learning and interpretable machine learning.
Below, you will find links to each lab assignment, along with a brief description of what each lab covers.
2 Lab02: Basic Commands and Matrix Operations in R
In this lab, you will practice basic commands and matrix operations in R, including how to load data, index matrices, and perform basic descriptive statistics. The lab also covers graphical summaries and introduces linear regression models in R.
2.1 Key Learning Objectives:
- Indexing matrices and performing matrix operations in R.
- Loading data from external sources (GitHub, websites) into R.
- Creating graphical summaries such as scatterplots, histograms, and pairwise plots.
- Fitting simple linear regression models and interpreting the results.
- Understanding how to work with multiple regression models, non-linear transformations, and qualitative predictors.
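As a rough sketch of these objectives, the snippet below indexes a small matrix, summarizes a variable, and fits a simple linear regression. The built-in mtcars data and the variables mpg, wt, and hp are stand-ins chosen for illustration. The lab itself may use different data.

```r
# Build a 3 x 4 matrix; R fills matrices column by column
A <- matrix(1:12, nrow = 3, ncol = 4)

A[2, 3]            # single element in row 2, column 3
A[, 2]             # entire second column
A[1:2, c(1, 4)]    # rows 1 to 2, columns 1 and 4
AtA <- t(A) %*% A  # matrix product, a 4 x 4 result

# Descriptive statistics and graphical summaries
# (mtcars is an assumed stand-in dataset, not necessarily the lab's data)
summary(mtcars$mpg)
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "mpg")
pairs(mtcars[, c("mpg", "wt", "hp")])

# Simple linear regression of fuel economy on weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)
```

Data hosted on GitHub or another website can be loaded in the same spirit by passing the file's URL to read.csv().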
3 Lab03: Exploratory Data Analysis (EDA)
In this lab, you will practice Exploratory Data Analysis (EDA) in R using the iris dataset. The lab focuses on creating interactive 3D visualizations using the Plotly package and performing multiple linear regression to understand relationships between variables.
3.1 Key Learning Objectives:
- Visualizing relationships between quantitative variables using 3D scatterplots with Plotly.
- Fitting multiple linear regression models to predict Petal.Length using Sepal.Length and Sepal.Width.
- Understanding how to generate regression surfaces and enhance visualizations with interactive elements.
- Using the reshape2 package to prepare data for 3D plotting.
3.2 Tools:
- R packages: plotly, reshape2, datasets.
- Data Source: The iris dataset, which contains measurements of sepal and petal dimensions for three species of iris flowers.
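One possible way to combine these pieces is sketched below: fit the regression, predict over a grid, reshape the predictions into a matrix, and overlay the surface on a 3D scatterplot. The grid resolution and object names (axis_x, axis_y, surface) are illustrative assumptions, and the plotly and reshape2 packages are assumed to be installed.

```r
library(plotly)
library(reshape2)

# Multiple linear regression: Petal.Length on the two sepal measurements
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)

# Evaluate the fitted plane on a regular grid (30 points per axis is arbitrary)
axis_x <- seq(min(iris$Sepal.Length), max(iris$Sepal.Length), length.out = 30)
axis_y <- seq(min(iris$Sepal.Width), max(iris$Sepal.Width), length.out = 30)
grid <- expand.grid(Sepal.Length = axis_x, Sepal.Width = axis_y)
grid$Petal.Length <- predict(fit, newdata = grid)

# plotly surfaces expect a matrix with rows indexed by y and columns by x
surface <- acast(grid, Sepal.Width ~ Sepal.Length, value.var = "Petal.Length")

# Interactive 3D scatterplot with the regression surface on top
p <- plot_ly(iris, x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Length,
             color = ~Species, type = "scatter3d", mode = "markers")
p <- add_surface(p, x = axis_x, y = axis_y, z = surface,
                 opacity = 0.6, showscale = FALSE, inherit = FALSE)
p
```

Setting inherit = FALSE keeps the surface trace from picking up the Species color mapping defined for the points.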
4 Lab04: Linear Discriminant Analysis Lab
This lab introduces Linear Discriminant Analysis (LDA), a classification method applied to predict stock market directions using the Smarket dataset from the ISLR package. You will learn how to implement LDA in R and evaluate model performance based on prediction accuracy.
4.1 Key Learning Objectives:
- Understanding the theory behind LDA and its application in financial market prediction.
- Implementing LDA in R using the MASS package.
- Visualizing LDA decision boundaries and interpreting confusion matrices.
- Evaluating classification accuracy by comparing predicted vs. actual market directions.
4.2 Tools:
- R packages: MASS, ISLR, descr.
- Data Source: The Smarket dataset, which includes stock market data from 2001 to 2005, will be used to predict market direction (Up/Down).
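A minimal sketch of this workflow might look as follows. The choice of Lag1 and Lag2 as predictors and the split that holds out 2005 for testing are assumptions for illustration. The lab may set things up differently.

```r
library(MASS)   # lda()
library(ISLR)   # Smarket data

# Assumed split: train on 2001-2004, test on 2005
train <- Smarket$Year < 2005
test_data <- Smarket[!train, ]

# Fit LDA using two lagged returns as predictors (an illustrative choice)
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda_fit

# Predict market direction for the held-out year
pred <- predict(lda_fit, newdata = test_data)

# Confusion matrix and overall classification accuracy
conf_mat <- table(Predicted = pred$class, Actual = test_data$Direction)
conf_mat
accuracy <- mean(pred$class == test_data$Direction)
accuracy
```

The diagonal of the confusion matrix counts correct predictions, so the accuracy equals the diagonal sum divided by the number of test days.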
5 Lab05: Linear Regression Lab
This lab explores various Linear Regression techniques in R, using the Boston and Carseats datasets. You will learn how to implement simple and multiple linear regression, and how to interpret the results. The lab also covers interaction terms, nonlinear terms, and working with qualitative predictors.
5.1 Key Learning Objectives:
- Fitting simple and multiple linear regression models in R.
- Interpreting regression coefficients, R-squared values, and prediction intervals.
- Exploring interaction terms, nonlinear relationships, and qualitative predictors.
- Visualizing regression results using custom plotting functions.
5.2 Tools:
- R packages: MASS, ISLR, arm.
- Data Sources: The Boston dataset for housing price predictions and the Carseats dataset for sales analysis.
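The techniques above can be sketched roughly as follows. The specific variables (medv, lstat, age, Sales, Price, ShelveLoc) are assumed for illustration and may not match the models fitted in the lab.

```r
library(MASS)   # Boston data
library(ISLR)   # Carseats data

# Simple linear regression: median home value on percent lower-status population
fit_simple <- lm(medv ~ lstat, data = Boston)
summary(fit_simple)$r.squared

# Prediction intervals for three hypothetical lstat values
pred_int <- predict(fit_simple, newdata = data.frame(lstat = c(5, 10, 15)),
                    interval = "prediction")
pred_int

# Interaction term: lstat * age expands to lstat + age + lstat:age
fit_inter <- lm(medv ~ lstat * age, data = Boston)

# Nonlinear term: I() protects the square inside the formula
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)
anova(fit_simple, fit_quad)   # does the quadratic term improve the fit?

# Qualitative predictor: R creates dummy variables for the factor ShelveLoc
fit_cars <- lm(Sales ~ Price + ShelveLoc, data = Carseats)
contrasts(Carseats$ShelveLoc)  # shows the dummy coding R uses
```

Each row of pred_int holds the point prediction (fit) together with the lower and upper bounds of the prediction interval.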
6 Lab06: Logistic Regression Lab
In this lab, you will explore Logistic Regression by predicting stock market directions using the Smarket dataset from the ISLR package. You will fit logistic regression models to binary outcomes and evaluate the model’s performance through prediction accuracy.
6.1 Key Learning Objectives:
- Fitting logistic regression models in R using the Smarket dataset.
- Interpreting coefficients and odds ratios from logistic regression models.
- Evaluating model performance using confusion matrices and calculating classification accuracy.
6.2 Tools:
- R packages: ISLR, MASS.
- Data Source: The Smarket dataset, which contains stock market data from 2001 to 2005, including variables such as lag values and volume.
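As an illustrative sketch, a logistic regression on Smarket could be fitted and evaluated like this. The predictor set and the 0.5 probability cutoff are assumptions, and accuracy is computed on the training data for brevity.

```r
library(ISLR)   # Smarket data

# Logistic regression of market direction on lagged returns and volume
logit_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Smarket, family = binomial)
summary(logit_fit)

# Exponentiated coefficients are odds ratios
odds_ratios <- exp(coef(logit_fit))
odds_ratios

# Predicted probability that the market goes Up (the second factor level)
prob_up <- predict(logit_fit, type = "response")

# Classify with an assumed 0.5 cutoff
pred <- ifelse(prob_up > 0.5, "Up", "Down")

# Confusion matrix and accuracy on the training data
conf_mat <- table(Predicted = pred, Actual = Smarket$Direction)
conf_mat
accuracy <- mean(pred == Smarket$Direction)
accuracy
```

Training accuracy tends to be optimistic, so holding out part of the data for testing gives a more realistic picture of performance.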
7 Lab07: Model Selection Lab
In this lab, you will explore Model Selection techniques, including Best Subset Selection, Forward Selection, and Backward Selection. These methods will help you determine the best model based on different criteria, such as Cp, BIC, and Adjusted R².
7.1 Key Learning Objectives:
- Implementing Best Subset, Forward, and Backward Selection in R.
- Comparing models using Cp, BIC, and Adjusted R².
- Visualizing model performance using selection criteria plots.
- Extracting model coefficients and interpreting model fit.
7.2 Tools:
- R packages: leaps, datasets.
- Data Source: Synthetic data generated by a cubic polynomial of X with added noise, used to test model selection techniques.
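The selection procedures can be sketched as below, assuming the leaps package is installed. The sample size, seed, and polynomial coefficients are hypothetical values chosen for illustration, not the ones used in the lab.

```r
library(leaps)

# Synthetic data: a cubic polynomial in x plus noise (coefficients are made up)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + rnorm(n)
dat <- data.frame(x = x, y = y)

# Candidate predictors: x, x^2, ..., x^10
form <- y ~ poly(x, 10, raw = TRUE)

# Best subset, forward, and backward selection
best <- regsubsets(form, data = dat, nvmax = 10)
fwd  <- regsubsets(form, data = dat, nvmax = 10, method = "forward")
bwd  <- regsubsets(form, data = dat, nvmax = 10, method = "backward")

# Model size preferred by each criterion
best_sum <- summary(best)
size_cp    <- which.min(best_sum$cp)
size_bic   <- which.min(best_sum$bic)
size_adjr2 <- which.max(best_sum$adjr2)
c(Cp = size_cp, BIC = size_bic, AdjR2 = size_adjr2)

# Selection criteria plot and coefficients of the BIC-preferred model
plot(best_sum$bic, type = "b", xlab = "Number of predictors", ylab = "BIC")
coef(best, id = size_bic)
```

Cp and BIC are minimized while Adjusted R² is maximized, so the three criteria can point to different model sizes.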
8 Lab08: caret - Random Forest, KNN, GLMNet
This lab explores Random Forest, KNN, and GLMNet using the caret package in R. You will learn how to build Random Forest models for both classification and regression tasks, and explore alternative models like K-Nearest Neighbors (KNN) and GLMNet for different types of data.
8.1 Key Learning Objectives:
- Implementing Random Forest classification and regression models using the caret package.
- Training and evaluating models with cross-validation techniques.
- Exploring alternative models like KNN and GLMNet for various predictive tasks.
- Interpreting model performance using accuracy for classification and RMSE for regression tasks.
8.2 Tools:
- R packages: caret, randomForest, dplyr, ggplot2, tidyr.
- Data Sources: The iris dataset for classification tasks and the mtcars dataset for regression tasks.
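A minimal sketch of this workflow is shown below. It assumes the caret, randomForest, and glmnet packages are installed, and the 5-fold cross-validation setup and seed are illustrative choices rather than the lab's settings.

```r
library(caret)

set.seed(123)

# Shared resampling scheme: 5-fold cross-validation (an assumed setting)
ctrl <- trainControl(method = "cv", number = 5)

# Classification on iris: Random Forest and KNN
rf_cls <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
knn_cls <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl,
                 preProcess = c("center", "scale"))  # KNN is distance based

# Regression on mtcars: Random Forest and GLMNet
rf_reg <- train(mpg ~ ., data = mtcars, method = "rf", trControl = ctrl)
glmnet_reg <- train(mpg ~ ., data = mtcars, method = "glmnet", trControl = ctrl)

# Cross-validated performance: accuracy for classification, RMSE for regression
best_rf_acc  <- max(rf_cls$results$Accuracy)
best_knn_acc <- max(knn_cls$results$Accuracy)
best_rf_rmse <- min(rf_reg$results$RMSE)
best_glmnet_rmse <- min(glmnet_reg$results$RMSE)
c(rf = best_rf_acc, knn = best_knn_acc)
c(rf = best_rf_rmse, glmnet = best_glmnet_rmse)
```

Because every model goes through the same train() interface, swapping algorithms only requires changing the method argument.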
9 Lab09: Interpretable Machine Learning Lab
This lab focuses on Interpretable Machine Learning techniques using R: LIME, Partial Dependence Plots (PDP), and SHAP values. You will work with the iris dataset and train models using different algorithms to demonstrate how these techniques help in interpreting machine learning models.
9.1 Key Learning Objectives:
- Understanding and applying LIME for local explanations of model predictions.
- Generating Partial Dependence Plots (PDP) to explore feature effects on predictions.
- Calculating SHAP values to understand the contribution of individual features to model predictions.
9.2 Tools:
- R packages: caret, randomForest, lime, pdp, xgboost.
- Data Source: The iris dataset, which contains measurements of sepal and petal dimensions for three species of iris flowers.
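The three techniques might be applied along the following lines. This is a sketch under several assumptions: the listed packages are installed, the models and tuning settings are illustrative, and the SHAP step recasts iris as a binary problem (virginica versus the rest) to keep the output a single matrix, which may differ from the lab's setup.

```r
library(caret)
library(lime)
library(pdp)
library(xgboost)

set.seed(42)
features <- iris[, 1:4]

# Random forest trained through caret (requires the randomForest package)
rf_fit <- train(x = features, y = iris$Species, method = "rf",
                trControl = trainControl(method = "cv", number = 5))

# LIME: local explanations for one flower from each species
explainer <- lime(features, rf_fit)
explanation <- explain(features[c(1, 51, 101), ], explainer,
                       n_labels = 1, n_features = 2)
plot_features(explanation)

# PDP: effect of Petal.Length on the predicted probability of class 3 (virginica)
pd <- partial(rf_fit, pred.var = "Petal.Length", which.class = 3,
              prob = TRUE, train = features)
plotPartial(pd)

# SHAP: per-feature contributions from an xgboost model
# (binary target is an assumption made to simplify the output)
X <- as.matrix(features)
y_bin <- as.numeric(iris$Species == "virginica")
dmat <- xgb.DMatrix(data = X, label = y_bin)
bst <- xgb.train(params = list(objective = "binary:logistic", max_depth = 2),
                 data = dmat, nrounds = 20)
shap <- predict(bst, dmat, predcontrib = TRUE)  # one column per feature plus a bias term
head(shap)
```

For each flower, the SHAP contributions and the bias term sum to the model's prediction on the log-odds scale, which is what makes them readable as per-feature contributions.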
Each of these labs will help you develop essential skills in statistical modeling and data analysis. Click the links to access detailed instructions and resources for each lab.