Course Project Topics
This course offers 11 curated project topics for group work. Students form groups of 3-4 members and select one topic for their final project. Each group works on its chosen topic throughout the semester, applying the skills covered in lectures, from data import and cleaning to statistical analysis and professional reporting.
Each topic includes a curated dataset with detailed instructions. Groups report on their project progress during each lab week, building toward a comprehensive final presentation in Week 15.
Topic 1: Superstore Sales (Difficulty: Easy)
Problem:
A retail company needs to analyze its sales performance to identify opportunities for growth, understand customer behavior, and optimize product strategies. The company has collected comprehensive sales data but lacks insights to make data-driven business decisions.
Task:
Using the Superstore Sales dataset, conduct a comprehensive business analysis to:
- Identify top-performing products, categories, and regions
- Analyze sales trends and patterns over time
- Calculate key business metrics (profit margins, customer segment performance, discount effectiveness)
- Create an interactive dashboard that provides actionable insights for business decision-making
- Provide recommendations for improving sales and profitability
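As a rough illustration of the profit-margin metric listed above (the course itself works in Excel, where a PivotTable does the same job), the calculation can be sketched in Python. The column names `Category`, `Sales`, and `Profit` and the sample rows are assumptions made for this example, not taken from the dataset instructions.

```python
from collections import defaultdict

def margin_by_category(rows):
    """Profit margin per category: total profit divided by total sales."""
    sales = defaultdict(float)
    profit = defaultdict(float)
    for row in rows:
        sales[row["Category"]] += row["Sales"]
        profit[row["Category"]] += row["Profit"]
    # Skip categories with zero sales to avoid division by zero.
    return {cat: profit[cat] / sales[cat] for cat in sales if sales[cat]}

# Made-up sample rows; the column names are assumed for illustration.
orders = [
    {"Category": "Furniture", "Sales": 200.0, "Profit": 20.0},
    {"Category": "Furniture", "Sales": 300.0, "Profit": 30.0},
    {"Category": "Technology", "Sales": 400.0, "Profit": 100.0},
]

for category, margin in margin_by_category(orders).items():
    print(f"{category}: {margin:.1%}")
```

Aggregating before dividing (total profit over total sales) gives the category-level margin; averaging per-order margins instead would weight small and large orders equally.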
Dataset Description:
Sales data for a fictional retail company, including product information, orders, customers, and financial metrics. Ideal for practicing Excel fundamentals in a realistic business context.
Dataset Files:
Topic 2: Titanic (Difficulty: Easy)
Problem:
Historical passenger data from the Titanic contains inconsistencies, missing values, and data quality issues that prevent accurate analysis. The data needs to be cleaned, standardized, and validated before any meaningful insights can be extracted about passenger demographics, survival patterns, and factors affecting survival rates.
Task:
Using the Titanic dataset, perform comprehensive data cleaning and preparation to:
- Import and validate the data structure
- Identify and handle missing values using appropriate strategies
- Standardize data formats and fix inconsistencies
- Create clean, analysis-ready datasets using Power Query
- Document all data cleaning steps and decisions
- Perform initial exploratory analysis on the cleaned data to identify key patterns in passenger demographics and survival
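The project itself uses Power Query for these steps. Purely to illustrate the logic behind two of them (counting missing values and one possible fill strategy, median imputation), here is a small Python sketch. The column names and records are made up for the example, and median imputation is only one of several reasonable strategies.

```python
import statistics

BLANKS = (None, "")

def missing_counts(rows):
    """Count blank (None or empty-string) values per column."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            counts[col] = counts.get(col, 0) + (val in BLANKS)
    return counts

def fill_with_median(rows, column):
    """Return copies of rows with blanks in `column` replaced by its median."""
    present = [r[column] for r in rows if r[column] not in BLANKS]
    median = statistics.median(present)
    return [
        {**r, column: median if r[column] in BLANKS else r[column]}
        for r in rows
    ]

# Made-up passenger records; the column names are assumed for illustration.
passengers = [
    {"Name": "A", "Age": 22},
    {"Name": "B", "Age": None},
    {"Name": "C", "Age": 38},
    {"Name": "D", "Age": ""},
    {"Name": "E", "Age": 26},
]

print(missing_counts(passengers))
print([p["Age"] for p in fill_with_median(passengers, "Age")])
```

Whichever strategy a group chooses, the count of affected rows and the fill value belong in the cleaning documentation.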
Dataset Description:
Passenger information from the Titanic, including demographics, ticket class, and survival status. Contains missing values and various data types, perfect for practicing data cleaning techniques.
Dataset Files:
Topic 3: Adult Census Income (Difficulty: Easy)
Problem:
Understanding the relationship between demographic characteristics and income levels is crucial for policy makers, researchers, and social programs. The census data contains rich demographic information that can reveal patterns and relationships, but requires thorough exploration to uncover meaningful insights about income distribution across different population segments.
Task:
Using the Adult Census Income dataset, conduct comprehensive exploratory data analysis to:
- Summarize and describe the distribution of key demographic and economic variables
- Identify relationships between demographic factors (age, education, occupation, marital status) and income levels
- Segment the population into meaningful groups and compare income patterns across segments
- Create summary tables and visualizations that highlight key findings
- Provide insights about factors that may influence income levels based on the exploratory analysis
Dataset Description:
Demographic, social, and economic attributes from the 1994 U.S. Census database. Rich in categorical and numerical variables, ideal for exploring relationships and patterns across demographic groups.
Dataset Files:
Topic 4: World Happiness Report (Difficulty: Medium, +0.5)
Problem:
The World Happiness Report tracks happiness levels across countries over multiple years, but the data needs to be analyzed and visualized to understand trends, regional patterns, and factors contributing to national well-being. Stakeholders need clear, comparative visualizations to identify which countries have improved or declined in happiness and what factors drive these changes.
Task:
Using the World Happiness Report dataset (2015-2019), create a comprehensive visualization and analysis to:
- Compare happiness levels across countries and regions over time
- Identify trends and changes in happiness rankings
- Analyze relationships between happiness scores and contributing factors (GDP, social support, life expectancy, etc.)
- Build an interactive dashboard with multiple visualizations that allow exploration of the data
- Create comparative visualizations showing changes over the five-year period
- Provide insights about what factors are most strongly associated with high happiness levels
Dataset Description:
Happiness levels and related factors (economic, social, health) for over 150 countries across multiple years (2015-2019). Excellent for creating comparative visualizations and time-series analyses.
Dataset Files:
Topic 5: Bank Marketing (Difficulty: Medium, +0.5)
Problem:
A bank needs to optimize its marketing campaigns by understanding which customer segments are most likely to subscribe to term deposits. The bank has collected data from previous marketing campaigns but needs analysis to identify patterns, segment customers effectively, and improve future campaign targeting and effectiveness.
Task:
Using the Bank Marketing dataset, perform comprehensive marketing analytics to:
- Analyze campaign effectiveness and subscription rates
- Segment customers based on demographic and behavioral characteristics
- Identify factors that influence subscription decisions
- Calculate key marketing metrics (conversion rates, segment performance, campaign ROI indicators)
- Create visualizations that highlight customer segments and campaign performance
- Provide actionable recommendations for improving marketing campaign targeting and effectiveness
Dataset Description:
Marketing campaign data from a Portuguese banking institution, including customer demographics, campaign details, and subscription outcomes. Perfect for analyzing marketing effectiveness and customer behavior patterns.
Dataset Files:
Topic 6: Boston Housing (Difficulty: Hard, +1.5)
Problem:
Understanding the factors that influence housing prices is essential for homebuyers, real estate professionals, and policy makers. The Boston Housing dataset contains information about various neighborhood characteristics, but requires statistical analysis to determine which factors significantly affect housing prices and to what extent.
Task:
Using the Boston Housing dataset, conduct comprehensive statistical analysis to:
- Explore relationships between housing prices and neighborhood characteristics
- Perform correlation analysis to identify strong associations
- Build regression models to predict housing prices based on key factors
- Test statistical hypotheses about relationships between variables
- Interpret statistical results and assess model performance
- Provide insights about which neighborhood characteristics are most important in determining housing prices and their relative impact
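To make the correlation and regression steps above concrete, here is a minimal Python sketch of Pearson's r and a one-variable least-squares fit. In Excel, `CORREL`, `SLOPE`, and `INTERCEPT` perform the equivalent calculations. The variable names and numbers are invented for illustration and do not come from the Boston Housing data.

```python
def _centered_sums(x, y):
    """Sums of squares and cross-products around the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return mx, my, sxy, sxx, syy

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    _, _, sxy, sxx, syy = _centered_sums(x, y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my, sxy, sxx, _ = _centered_sums(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up values: a neighborhood characteristic and a price.
rooms = [4, 5, 6, 7, 8]
price = [15, 20, 22, 30, 33]

slope, intercept = fit_line(rooms, price)
print(f"r = {pearson_r(rooms, price):.3f}")
print(f"price = {intercept:.2f} + {slope:.2f} * rooms")
```

A strong correlation here says nothing about causation, and a single-predictor fit is only a starting point; a full project model would normally include several predictors and a check of the residuals.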
Dataset Description:
Housing data from Boston, Massachusetts, including prices and various neighborhood characteristics. Designed for analyzing relationships between house prices and features, perfect for regression analysis.
Dataset Files:
Topic 7: Stock Price (Difficulty: Hard, +1.5)
Problem:
Investors and financial analysts need to understand stock price trends, volatility patterns, and relationships between different technology stocks to make informed investment decisions. The stock price data for major technology companies contains temporal patterns that require time series analysis to identify trends, compare performance, and understand market dynamics.
Task:
Using the Stock Price dataset (Amazon, Apple, Facebook, Google, Netflix), perform comprehensive time series analysis to:
- Analyze price trends and patterns over time for each company
- Compare performance across different technology stocks
- Calculate and visualize key metrics (returns, volatility, trading volume patterns)
- Identify seasonal patterns, trends, and anomalies in the data
- Create time series visualizations that effectively communicate stock performance
- Provide insights about stock performance patterns, volatility characteristics, and comparative analysis of the technology sector
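The returns and volatility metrics mentioned above can be sketched as follows. This is an illustrative Python version of calculations that can equally be done with Excel formulas. The closing prices are made up, and the 252-trading-day annualization factor is a common convention assumed here, not something the dataset instructions specify.

```python
import math
import statistics

def daily_returns(closes):
    """Simple returns: (P_t - P_{t-1}) / P_{t-1} for consecutive closes."""
    return [(cur - prev) / prev for prev, cur in zip(closes, closes[1:])]

def annualized_volatility(returns, trading_days=252):
    """Sample standard deviation of daily returns, scaled to a year."""
    return statistics.stdev(returns) * math.sqrt(trading_days)

# Made-up closing prices for a single stock.
closes = [100.0, 102.0, 101.0, 105.0]

rets = daily_returns(closes)
print([round(r, 4) for r in rets])
print(round(annualized_volatility(rets), 4))
```

Comparing returns rather than raw prices puts stocks with very different price levels on the same scale, which is what makes cross-company comparison meaningful.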
Dataset Description:
Daily stock prices for major technology companies (Amazon, Apple, Facebook, Google, Netflix) including open, high, low, close prices and trading volume. Ideal for practicing time series analysis and trend identification.
Dataset Files:
Topic 8: Loan Default Survival Analysis (Difficulty: Hard, +1.5)
Problem:
Financial institutions need to understand not just whether loans will default, but when defaults occur. Analyzing the time until loan default helps banks manage risk, set appropriate interest rates, and develop early warning systems. The loan portfolio contains time-to-event data that requires survival analysis techniques to identify risk factors and predict default timing.
Task:
Using the Loan Default Survival dataset, perform comprehensive survival analysis to:
- Calculate survival probabilities (non-default rates) over time
- Identify factors that influence time to default (loan amount, interest rate, borrower characteristics)
- Create Kaplan-Meier survival curves using Excel
- Analyze censored data (loans still active or completed without default)
- Compare default risk across different borrower segments (credit score, loan grade, purpose)
- Build visualizations showing survival curves and risk comparisons
- Provide insights for credit risk management and loan portfolio optimization
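The Kaplan-Meier estimate used in the steps above multiplies, at each time a default occurs, the running survival probability by (1 - defaults / loans still at risk). The project builds this in Excel; the Python sketch below only illustrates the logic, using made-up durations and event flags rather than values from the dataset.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time to default or censoring for each loan
    events:    1 if the loan defaulted at that time, 0 if censored
    """
    defaults = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # all loans leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = defaults.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Defaulted and censored loans both leave the risk set after time t.
        at_risk -= exits[t]
    return curve

# Made-up toy data: months on book and default flag (1 = default, 0 = censored).
months = [3, 5, 5, 8, 12, 12]
defaulted = [1, 0, 1, 1, 0, 0]

for t, s in kaplan_meier(months, defaulted):
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored loans never lower the survival probability directly, but they shrink the risk set for later times. Simply dropping them from the analysis would bias the curve.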
Dataset Description:
Loan data with time-to-event information for 5,000 loans, including borrower characteristics, loan details, and survival outcomes (default or censored). Contains time-to-default, event indicators, and various risk factors. Ideal for practicing survival analysis techniques in Excel.
Dataset Files:
Topic 9: Patient Survival Analysis (Difficulty: Hard, +1.5)
Problem:
Healthcare providers need to evaluate treatment effectiveness and patient prognosis by analyzing survival times. Understanding factors that influence patient survival helps improve treatment strategies and resource allocation. The patient data contains survival times and prognostic factors that require survival analysis to assess treatment outcomes and identify critical risk factors.
Task:
Using the Patient Survival dataset (GBSG2), perform comprehensive survival analysis to:
- Calculate survival probabilities over time
- Compare survival across different treatment groups (hormonal therapy)
- Identify prognostic factors (age, disease stage, treatment type, tumor characteristics)
- Create survival curves and compare groups visually
- Analyze censored observations (patients still alive or lost to follow-up)
- Build comparative visualizations showing survival differences across patient groups
- Provide insights for treatment planning, patient care, and clinical decision-making
Dataset Description:
Patient survival data from the German Breast Cancer Study Group 2 (GBSG2), containing information on 686 patients with primary node-positive breast cancer. Includes survival times, event indicators, treatment information, and clinical variables. Ideal for practicing survival analysis techniques in healthcare contexts.
Dataset Files:
Topic 10: Healthcare Costs & Utilization Analysis (Difficulty: Medium, +0.5)
Problem:
Healthcare administrators need to understand cost patterns, utilization rates, and resource allocation across different patient demographics, medical conditions, and treatment types. The healthcare cost data contains multiple dimensions that require comprehensive analysis to identify cost drivers, utilization patterns, and opportunities for cost optimization.
Task:
Using the Healthcare Costs dataset, perform comprehensive analysis to:
- Analyze healthcare costs across different patient demographics (age, gender, region)
- Compare utilization rates by medical condition, treatment type, and provider
- Identify cost drivers and high-cost patient segments
- Create comparative visualizations showing cost patterns across groups
- Build an interactive dashboard with multiple visualizations for cost exploration
- Segment patients by cost categories and analyze utilization patterns
- Provide insights for healthcare resource allocation and cost management strategies
Dataset Description:
Healthcare cost and utilization data including patient demographics, medical conditions, treatment types, costs, length of stay, and provider information. Rich in categorical and numerical variables, ideal for segmentation analysis and comparative visualizations.
Dataset Files:
Topic 11: Disease Progression & Treatment Outcomes (Difficulty: Hard, +1.5)
Problem:
Medical researchers and clinicians need to understand how patient characteristics, treatment protocols, and clinical variables influence disease progression and treatment outcomes in Rheumatoid Arthritis (RA). RA is a chronic autoimmune disease that causes joint inflammation, pain, and progressive joint damage. The clinical data contains multiple prognostic factors (demographics, comorbidities, inflammatory biomarkers, treatment types) that require statistical analysis to identify significant predictors and build predictive models for treatment response and disease outcomes.
Task:
Using the Disease Progression dataset, conduct comprehensive statistical analysis to:
- Explore relationships between patient characteristics and disease outcomes
- Perform correlation analysis to identify strong associations with disease progression
- Build regression models to predict treatment outcomes based on clinical variables
- Compare treatment effectiveness across different patient groups
- Test statistical hypotheses about relationships between variables
- Analyze time-to-event outcomes (disease progression, treatment response)
- Interpret statistical results and assess model performance
- Provide insights about which factors are most important in predicting patient outcomes
Dataset Description:
Clinical data from patients with Rheumatoid Arthritis (RA), including baseline characteristics, treatment information, clinical measurements over time, and disease progression outcomes. The dataset includes inflammatory biomarkers (C-Reactive Protein, Erythrocyte Sedimentation Rate) and disease activity scores measured at multiple time points (baseline, 3, 6, and 12 months), making it suitable for longitudinal analysis and statistical modeling of treatment response patterns.
Dataset Files:
Topic Selection Guidelines
Group Formation:
- Form groups of 3-4 members by Week 2
- Consider diverse skills and interests when forming groups
Selection Process:
- Review all 11 topics and their descriptions
- Select a topic that aligns with your group’s interests and career goals
- Consider the complexity and scope appropriate for your skill level
- Finalize topic selection during Week 2 lab session
- Each group selects ONE topic to work on for the entire semester
Project Workflow:
- Groups work on their chosen topic throughout the semester
- Apply concepts and techniques learned in lectures to your project
- Report progress during each lab week (Weeks 2, 4, 6, 8, 11, 13, 14)
- Build incrementally: data import → cleaning → exploration → analysis → visualization → reporting
Project Expectations:
- Apply all course skills learned throughout the semester to your chosen topic
- Deliver a complete analysis from data cleaning to final reporting
- Create professional visualizations and dashboards
- Write a comprehensive report (2,000-2,500 words)
- Present findings effectively (15-minute group presentation in Week 15)
Support Resources:
- Dataset files and instructions available in each topic folder (see Student Desk)
- Instructor guidance during lab sessions and office hours
- Peer review and feedback opportunities during lab weeks
- Access to course materials and reference texts