Lab 3: Medical EDA with Confidence Intervals and p-values

Load and clean medical data, run descriptive analysis, segment with PivotTables, and practice CI/p-value interpretation in Excel.

Important: Excel vs Google Sheets

This lab is designed for Microsoft Excel. You may use formulas in both tools, but menu paths for Power Query, PivotTables, and some statistical options are Excel-specific.

Dataset Overview

Introduction

This lab uses a synthetic Medical EDA dataset (patient-level records) created for Week 5 content:

Descriptive statistics
Summary tables
Data segmentation
PivotTables
Introductory confidence intervals (95% CI) and p-values

Dataset Files

Google Drive (recommended): Lab 2/3 Dataset Folder
Course website (alternative):
Medical-EDA.csv
Medical-EDA.xlsx
Student package: student desk/Lab-3-Dataset/ (if provided)

Variables

Column	Type	Description
PatientID	Text	Unique patient ID
VisitDate	Date/Text	Visit date (some rows need type fixes)
Age	Number	Age in years
Sex	Text	Female/Male
Region	Text	North/South/East/West (inconsistent spacing/case in some rows)
HospitalUnit	Text	Clinical unit
DiagnosisGroup	Text	Hypertension/Diabetes/Respiratory (inconsistent case in some rows)
TreatmentGroup	Text	A/B (some lowercase values)
BaselineSBP	Number	Baseline systolic blood pressure
FollowupSBP	Number	Follow-up systolic blood pressure
LDL	Number/Text	LDL cholesterol (mixed numeric/text in some rows)
HbA1c	Number	Glycemic marker (some missing values)
BMI	Number	Body Mass Index
AdherenceScore	Number/Text	Medication adherence score (mixed numeric/text in some rows)
LengthOfStay	Number	Days in hospital
Readmission30d	Binary (0/1)	30-day readmission indicator
Notes	Text	Optional notes

Data Quality Notes

Duplicate records
Blank rows
Mixed text/number types (LDL, AdherenceScore)
Inconsistent category formatting (Region, DiagnosisGroup, TreatmentGroup)
Missing values (Region, HbA1c)
Outliers (BMI, LengthOfStay)

Learning Objectives

By the end of this lab, you can:

Import and clean medical data in Excel
Produce descriptive statistics and segmented summaries
Build PivotTables for subgroup analysis
Compute and interpret 95% CIs
Compute and interpret p-values for simple comparisons

Quick Reference (Week 5 Skills)

Import: Data -> Get Data -> From File -> From Text/CSV or From Excel Workbook
Open query again: Data -> Queries & Connections -> Right-click query -> Edit
PivotTable: Insert -> PivotTable
Pivot grouping: right-click date/number field -> Group
Pivot filters: Report Filter and optional Slicer
Pivot design: Subtotals, Grand Totals, Report Layout
Pivot calculated field: PivotTable Analyze -> Fields, Items, & Sets -> Calculated Field
Refresh: Data -> Refresh All
95% CI formula: mean +/- t* * (SD / sqrt(n))
Two-sample p-value: =T.TEST(rangeA, rangeB, 2, 3)
Categorical p-value (optional): =CHISQ.TEST(actual_range, expected_range)

Lab Exercises

Part A: Load and Clean

Exercise A1: Import and Inspect

Question: Import Medical-EDA.csv (or .xlsx) and identify at least 4 quality issues.

Instructions:
1. Import with Transform Data.
2. Inspect data types and scan columns for missing values and inconsistent categories.
3. Record at least 4 issues found.

Exercise A2: Clean Core Columns

Question: Clean data so all key columns are analysis-ready.

Instructions:
1. Remove fully blank rows.
2. Remove duplicates (all columns).
3. Trim text columns: Region, DiagnosisGroup, TreatmentGroup.
4. Replace category variants (e.g., a -> A, b -> B, uppercase/lowercase diagnosis labels -> standard labels).
5. Set data types:
- VisitDate -> Date
- Age, BaselineSBP, FollowupSBP, LDL, HbA1c, BMI, AdherenceScore, LengthOfStay -> Number
- Readmission30d -> Whole Number

Part B: Descriptive Statistics

Exercise B1: Overall Summary

Question: Build a summary table for BaselineSBP, FollowupSBP, LDL, and BMI.

Instructions:
1. On a new sheet, create a table with metrics: Mean, Median, SD, Min, Max, N.
2. Use formulas such as:
- AVERAGE(range)
- MEDIAN(range)
- STDEV.S(range)
- MIN(range), MAX(range)
- COUNT(range)

Exercise B2: Segment by Treatment

Question: Compare FollowupSBP and LengthOfStay between Treatment A and B.

Instructions:
1. Use AVERAGEIFS (or PivotTable) to compute mean by treatment.
2. Report group means side-by-side.

Part C: Summary Tables and Segmentation

Exercise C1: COUNTIFS and AVERAGEIFS

Question: Create a segmented summary by Region and DiagnosisGroup.

Instructions:
1. Build a small matrix of counts: COUNTIFS(Region, r, DiagnosisGroup, d).
2. Build another matrix of means: AVERAGEIFS(FollowupSBP, Region, r, DiagnosisGroup, d).

Exercise C2: Readmission Segmentation

Question: Compare readmission rates by TreatmentGroup and Sex.

Instructions:
1. Compute subgroup readmission rate: AVERAGEIFS(Readmission30d, TreatmentGroup, g, Sex, s).
2. Convert to percent format.

Part D: PivotTables

Exercise D1: PivotTable 1 (Clinical Outcome by Treatment)

Question: Summarize FollowupSBP and LengthOfStay by TreatmentGroup.

Instructions:
1. Insert PivotTable from clean table.
2. Rows: TreatmentGroup
3. Values: Average of FollowupSBP, Average of LengthOfStay

Exercise D2: PivotTable 2 (Readmission by Region and Diagnosis)

Question: Build a segmented PivotTable for readmission.

Instructions:
1. Rows: Region
2. Columns: DiagnosisGroup
3. Values: Average of Readmission30d (format as %)
4. Optional Filter: Sex

Exercise D3: PivotTable 3 (Time Grouping by Month/Quarter)

Question: Group patient visits by month/quarter and summarize average follow-up SBP.

Instructions:
1. Create a new PivotTable from the clean table.
2. Rows: VisitDate
3. Values: Average of FollowupSBP
4. Right-click any date in the PivotTable -> Group -> select Months (and optionally Quarters).
5. Add TreatmentGroup to Filters to compare A vs B after grouping.

Exercise D4: Pivot Layout, Filters, and Slicers

Question: Improve readability and interactivity of a PivotTable.

Instructions:
1. Start from PivotTable 1 or 2.
2. Turn on/off Subtotals and Grand Totals (PivotTable Design tab).
3. Change Report Layout (Compact/Outline/Tabular) and choose the clearest format.
4. Add at least one report filter (e.g., Sex or HospitalUnit).
5. Optional: Insert a Slicer for TreatmentGroup and test interactive filtering.

Exercise D5 (optional): Calculated Field in PivotTable

Question: Add a calculated field for SBP improvement (BaselineSBP - FollowupSBP) in a PivotTable.

Instructions:
1. Open a PivotTable built from the cleaned data.
2. PivotTable Analyze -> Fields, Items, & Sets -> Calculated Field.
3. Name: SBP_Improvement.
4. Formula: =BaselineSBP - FollowupSBP
5. Add TreatmentGroup as Rows and summarize average SBP_Improvement.

Part E: Confidence Intervals and p-values

Exercise E1: 95% CI for Mean FollowupSBP (Overall)

Question: Calculate the 95% CI for mean FollowupSBP.

Instructions:
1. Compute:
- mean = AVERAGE(range)
- sd = STDEV.S(range)
- n = COUNT(range)
2. Compute t critical:
- t_star = T.INV.2T(0.05, n-1)
3. Margin of error:
- ME = t_star * sd / SQRT(n)
4. CI bounds:
- Lower = mean - ME
- Upper = mean + ME

Exercise E2: 95% CI by Treatment Group

Question: Compute separate 95% CIs for Treatment A and B (FollowupSBP).

Instructions:
1. Filter or reference each treatment group separately.
2. Repeat E1 formulas for each group.
3. Compare intervals qualitatively (overlap vs separation).

Exercise E3: p-value for Difference in Means (Treatment A vs B)

Question: Is FollowupSBP different between Treatment A and B?

Instructions:
1. Put group ranges in two columns (A and B).
2. Use:
- =T.TEST(rangeA, rangeB, 2, 3)
- 2 = two-tailed, 3 = unequal variances (Welch)
3. Decision rule:
- If p < 0.05, reject equal-means null hypothesis.

Exercise E4 (optional): p-value for Categorical Association

Question: Is 30-day readmission associated with TreatmentGroup?

Instructions:
1. Build a 2x2 observed table (TreatmentGroup x Readmission30d) using COUNTIFS or PivotTable.
2. Build expected counts table.
3. Use CHISQ.TEST(observed_range, expected_range).

Part F: Short Interpretation

Exercise F1: Mini report paragraph

Question: Write 5-8 sentences summarizing findings.

Instructions:
Include:
1. One descriptive-statistics insight
2. One segmentation/PivotTable insight
3. One CI statement
4. One p-value statement
5. One practical implication for healthcare decision-making

Verification Checklist

[ ] Cleaned dataset loaded in Excel
[ ] Descriptive statistics table completed
[ ] At least 2 segmented summary tables completed
[ ] At least 3 PivotTables completed (including one grouped by month/quarter)
[ ] At least one Pivot filter (or slicer) used
[ ] Optional calculated field tested (SBP_Improvement)
[ ] 95% CI computed correctly (overall and by treatment)
[ ] p-value test completed and interpreted correctly
[ ] Mini report paragraph completed

Common Errors and Fixes

Numbers treated as text
Fix with data type conversion in Power Query or VALUE() in Excel.
Date not recognized
Check locale and source format; re-parse date column.
T.TEST returns error
Ensure both ranges are numeric and contain no text blanks.
CHISQ.TEST error
Observed and expected ranges must have same shape and non-zero expected counts.

Conclusion

You practiced the full Week 5 EDA workflow on medical data:

Load and clean
Descriptive and segmented summaries
PivotTables
Confidence intervals and p-values

These outputs feed directly into Group Progress Report 3 and prepare you for later visualization and statistical-analysis weeks.