6323 Labs

Author

Jim

Published

September 2, 2024

1 6323 Knowledge Mining

Welcome to the homepage for the 6323 Labs! This course focuses on data mining and machine learning methods, which are core aspects of data science. Throughout the labs, you will explore a range of statistical modeling and data analysis techniques, including regression analysis, classification, decision trees, neural networks, and support vector machines. The labs also introduce newer developments in machine learning, such as deep learning and interpretable machine learning.

Below, you will find links to each lab assignment, along with a brief description of what each lab covers.

2 Lab02: Basic Commands and Matrix Operations in R

In this lab, you will practice basic commands and matrix operations in R, including how to load data, index matrices, and perform basic descriptive statistics. The lab also covers graphical summaries and introduces linear regression models in R.

2.1 Key Learning Objectives:

Indexing matrices and performing matrix operations in R.
Loading data from external sources (GitHub, websites) into R.
Creating graphical summaries such as scatterplots, histograms, and pairwise plots.
Fitting simple linear regression models and interpreting the results.
Working with multiple regression models, nonlinear transformations, and qualitative predictors.
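As a rough preview of the first objectives, the sketch below indexes a small matrix and fits a simple linear regression in base R. The matrix `A` and the simulated `x` and `y` are illustrative assumptions, not the data used in the lab.

```r
# Build a 4 x 4 matrix; R fills matrices column by column.
A <- matrix(1:16, nrow = 4, ncol = 4)

A[2, 3]               # single element: row 2, column 3
A[c(1, 3), c(2, 4)]   # rows 1 and 3, columns 2 and 4
A[-c(1, 3), ]         # every row except rows 1 and 3
dim(A)                # number of rows and columns

A * A                 # element-wise product
A %*% A               # matrix product

# Simulated data (an assumption for illustration only).
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

summary(x)            # basic descriptive statistics
plot(x, y)            # scatterplot
fit <- lm(y ~ x)      # simple linear regression
coef(fit)             # estimated intercept and slope
```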

Click here to view Lab02

3 Lab03: Exploratory Data Analysis (EDA)

In this lab, you will practice Exploratory Data Analysis (EDA) in R using the iris dataset. The lab focuses on creating interactive 3D visualizations with the Plotly package and on fitting a multiple linear regression to understand relationships between variables.

3.1 Key Learning Objectives:

Visualizing relationships between quantitative variables using 3D scatterplots with Plotly.
Fitting multiple linear regression models to predict Petal.Length using Sepal.Length and Sepal.Width.
Understanding how to generate regression surfaces and enhance visualizations with interactive elements.
Using the reshape2 package to prepare data for 3D plotting.

3.2 Tools:

R packages: Plotly, reshape2, datasets.
Data Source: The iris dataset, which contains measurements of sepal and petal dimensions for three species of iris flowers.
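The regression surface mentioned in the objectives can be sketched in base R as follows. This is a minimal illustration, not the lab's code: it uses `matrix()` where the lab uses reshape2, and a static `persp()` plot where the lab uses Plotly. The grid size of 30 is an arbitrary assumption.

```r
# Multiple linear regression on the built-in iris data.
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
summary(fit)$coefficients

# Evaluate the fitted plane on a regular grid of predictor values.
sl <- seq(min(iris$Sepal.Length), max(iris$Sepal.Length), length.out = 30)
sw <- seq(min(iris$Sepal.Width),  max(iris$Sepal.Width),  length.out = 30)
grid <- expand.grid(Sepal.Length = sl, Sepal.Width = sw)
grid$Petal.Length <- predict(fit, newdata = grid)

# expand.grid() varies Sepal.Length fastest, so filling a matrix column by
# column gives rows indexed by Sepal.Length and columns by Sepal.Width.
surface <- matrix(grid$Petal.Length, nrow = length(sl), ncol = length(sw))

# Static stand-in for the interactive 3D surface.
persp(sl, sw, surface,
      xlab = "Sepal.Length", ylab = "Sepal.Width", zlab = "Petal.Length",
      theta = 30, phi = 20)
```

The `surface` matrix is the piece a 3D plotting library needs: one predicted value per grid cell.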

Click here to view Lab03

4 Lab04: Linear Discriminant Analysis Lab

This lab introduces Linear Discriminant Analysis (LDA), a classification method that you will apply to predict stock market direction using the Smarket dataset from the ISLR package. You will learn how to implement LDA in R and evaluate model performance based on prediction accuracy.

4.1 Key Learning Objectives:

Understanding the theory behind LDA and its application in financial market prediction.
Implementing LDA in R using the MASS package.
Visualizing LDA decision boundaries and interpreting confusion matrices.
Evaluating classification accuracy by comparing predicted vs. actual market directions.

4.2 Tools:

R packages: MASS, ISLR, descr.
Data Source: The Smarket dataset, which contains daily percentage returns for the S&P 500 index from 2001 to 2005 and is used to predict market direction (Up/Down).
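A minimal sketch of the workflow is shown below. The choice of predictors (`Lag1`, `Lag2`) and the split that holds out 2005 as test data are assumptions made for illustration; the lab itself may use a different specification.

```r
library(MASS)   # provides lda()
library(ISLR)   # provides the Smarket data

# Assumed split: train on 2001-2004, test on 2005.
train <- Smarket$Year < 2005
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

test_set <- Smarket[!train, ]
pred <- predict(lda_fit, newdata = test_set)

# Confusion matrix: predicted versus actual market direction.
conf_mat <- table(Predicted = pred$class, Actual = test_set$Direction)
conf_mat

# Share of test days classified correctly.
accuracy <- mean(pred$class == test_set$Direction)
accuracy
```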

Click here to view Lab04

5 Lab05: Linear Regression Lab

This lab explores various Linear Regression techniques in R, using the Boston and Carseats datasets. You will learn how to implement simple and multiple linear regression, and how to interpret the results. The lab also covers interaction terms, nonlinear terms, and working with qualitative predictors.

5.1 Key Learning Objectives:

Fitting simple and multiple linear regression models in R.
Interpreting regression coefficients, R-squared values, and prediction intervals.
Exploring interaction terms, nonlinear relationships, and qualitative predictors.
Visualizing regression results using custom plotting functions.

5.2 Tools:

R packages: MASS, ISLR, arm.
Data Sources: The Boston dataset for housing price predictions and the Carseats dataset for sales analysis.
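The sketch below illustrates several of these ideas on the Boston data from MASS. The response `medv` and the predictors `lstat` and `age` are assumptions chosen for illustration, not necessarily the models fitted in the lab.

```r
library(MASS)   # provides the Boston data

# Simple linear regression: median home value on % lower-status population.
fit_simple <- lm(medv ~ lstat, data = Boston)
summary(fit_simple)$r.squared

# Prediction intervals for three hypothetical lstat values.
new_lstat <- data.frame(lstat = c(5, 10, 15))
pred_int <- predict(fit_simple, newdata = new_lstat, interval = "prediction")
pred_int

# Interaction term: lstat * age expands to lstat + age + lstat:age.
fit_inter <- lm(medv ~ lstat * age, data = Boston)
coef(fit_inter)

# Nonlinear term: I() protects the square inside the formula.
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)
anova(fit_simple, fit_quad)   # does the quadratic term improve the fit?
```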

Click here to view Lab05

6 Lab06: Logistic Regression Lab

In this lab, you will explore Logistic Regression by predicting stock market directions using the Smarket dataset from the ISLR package. You will fit logistic regression models to binary outcomes and evaluate the model’s performance through prediction accuracy.

6.1 Key Learning Objectives:

Fitting logistic regression models in R using the Smarket dataset.
Interpreting coefficients and odds ratios from logistic regression models.
Evaluating model performance using confusion matrices and calculating classification accuracy.

6.2 Tools:

R packages: ISLR, MASS.
Data Source: The Smarket dataset, which contains daily S&P 500 stock market data from 2001 to 2005, including lagged returns (Lag1 to Lag5) and trading volume.
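The following sketch shows one way these steps might look. The predictors, the 0.5 probability cutoff, and the 2005 hold-out split are illustrative assumptions rather than the lab's exact setup.

```r
library(ISLR)   # provides the Smarket data

# Assumed split: train on 2001-2004, test on 2005.
train <- Smarket$Year < 2005
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Volume,
               data = Smarket, family = binomial, subset = train)

# Exponentiated coefficients are odds ratios for a one-unit increase.
exp(coef(glm_fit))

# Predicted probability that the market goes Up on each test day.
test_set <- Smarket[!train, ]
prob_up <- predict(glm_fit, newdata = test_set, type = "response")
pred <- ifelse(prob_up > 0.5, "Up", "Down")

# Confusion matrix and classification accuracy.
table(Predicted = pred, Actual = test_set$Direction)
mean(pred == test_set$Direction)
```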

Click here to view Lab06

7 Lab07: Model Selection Lab

In this lab, you will explore Model Selection techniques, including Best Subset Selection, Forward Selection, and Backward Selection. These methods will help you determine the best model based on different criteria, such as Cp, BIC, and Adjusted R².

7.1 Key Learning Objectives:

Implementing Best Subset, Forward, and Backward Selection in R.
Comparing models using Cp, BIC, and Adjusted R².
Visualizing model performance using selection criteria plots.
Extracting model coefficients and interpreting model fit.

7.2 Tools:

R packages: leaps, datasets.
Data Source: Synthetic data generated from a cubic polynomial in X with added noise, used to test the model selection techniques.
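A sketch of this setup, assuming the leaps package is installed, might look as follows. The true coefficients, sample size, and maximum polynomial degree of 10 are invented for illustration.

```r
library(leaps)   # provides regsubsets()

# Synthetic data: a cubic polynomial in x plus noise (coefficients assumed).
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + rnorm(100)
dat <- data.frame(y = y, x = x)

# Best subset selection over the candidate terms x, x^2, ..., x^10.
best <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, nvmax = 10)
best_sum <- summary(best)

# Model size preferred by each criterion.
which.min(best_sum$cp)
which.min(best_sum$bic)
which.max(best_sum$adjr2)

# Coefficients of the model chosen by BIC.
coef(best, id = which.min(best_sum$bic))

# Forward and backward stepwise selection use the same interface.
fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat,
                  nvmax = 10, method = "forward")
bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat,
                  nvmax = 10, method = "backward")
```

Because the data come from a known cubic, you can check whether each criterion recovers a model close to the true one.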

Click here to view Lab07

8 Lab08: caret - Random Forest, KNN, GLMNet

This lab uses the caret package in R to train Random Forest, K-Nearest Neighbors (KNN), and GLMNet models. You will build Random Forest models for both classification and regression tasks, then compare them with KNN and GLMNet (regularized generalized linear models).

8.1 Key Learning Objectives:

Implementing Random Forest classification and regression models using the caret package.
Training and evaluating models with cross-validation techniques.
Exploring alternative models like KNN and GLMNet for various predictive tasks.
Interpreting model performance using accuracy for classification and RMSE for regression tasks.

8.2 Tools:

R packages: caret, randomForest, dplyr, ggplot2, tidyr.
Data Sources: The iris dataset for classification tasks and the mtcars dataset for regression tasks.
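The sketch below shows the general caret pattern, assuming caret and randomForest are installed. The 5-fold cross-validation, the seed, and the pairing of KNN with mtcars are illustrative assumptions; replacing `method = "knn"` with `method = "glmnet"` would fit a GLMNet model, provided the glmnet package is available.

```r
library(caret)

set.seed(42)
# 5-fold cross-validation, shared by both models.
ctrl <- trainControl(method = "cv", number = 5)

# Classification: Random Forest on iris, tuned over mtry.
rf_cls <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
rf_cls$results    # cross-validated Accuracy for each mtry value

# Regression: KNN on mtcars, with predictors centered and scaled first.
knn_reg <- train(mpg ~ ., data = mtcars, method = "knn",
                 trControl = ctrl, preProcess = c("center", "scale"))
knn_reg$results   # cross-validated RMSE for each k
```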

Click here to view Lab08

9 Lab09: Interpretable Machine Learning Lab

This lab focuses on Interpretable Machine Learning techniques using R: LIME, Partial Dependence Plots (PDP), and SHAP values. You will work with the iris dataset and train models using different algorithms to demonstrate how these techniques help in interpreting machine learning models.

9.1 Key Learning Objectives:

Understanding and applying LIME for local explanations of model predictions.
Generating Partial Dependence Plots (PDP) to explore feature effects on predictions.
Calculating SHAP values to understand the contribution of individual features to model predictions.

9.2 Tools:

R packages: caret, randomForest, lime, pdp, xgboost.
Data Source: The iris dataset, which contains measurements of sepal and petal dimensions for three species of iris flowers.
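To make the idea behind a partial dependence plot concrete, the sketch below computes one by hand for a Random Forest. It is a simplified illustration and does not use the pdp package's own functions; the choice of `Petal.Width`, the `virginica` class, and the 20-point grid are assumptions.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

# Grid of values for the feature of interest.
grid <- seq(min(iris$Petal.Width), max(iris$Petal.Width), length.out = 20)

# For each grid value, set Petal.Width to that value in every row, then
# average the predicted probability of "virginica" over the whole dataset.
pd <- sapply(grid, function(v) {
  tmp <- iris
  tmp$Petal.Width <- v
  mean(predict(rf, newdata = tmp, type = "prob")[, "virginica"])
})

plot(grid, pd, type = "l",
     xlab = "Petal.Width", ylab = "Average predicted P(virginica)")
```

Each point on the curve is the model's average prediction when one feature is fixed and the others keep their observed values.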

Click here to view Lab09

Each of these labs will help you develop essential skills in statistical modeling and data analysis. Click the links to access detailed instructions and resources for each lab.