Lab 3: Medical EDA with Confidence Intervals and p-values
Load and clean medical data, run descriptive analysis, segment with PivotTables, and practice CI/p-value interpretation in Excel.
Important: Excel vs Google Sheets
This lab is designed for Microsoft Excel. You may use formulas in both tools, but menu paths for Power Query, PivotTables, and some statistical options are Excel-specific.
Dataset Overview
Introduction
This lab uses a synthetic Medical EDA dataset (patient-level records) created for Week 5 content:
- Descriptive statistics
- Summary tables
- Data segmentation
- PivotTables
- Introductory confidence intervals (95% CI) and p-values
Dataset Files
- Google Drive (recommended): Lab 2/3 Dataset Folder
- Course website (alternative):
- Student package:
student desk/Lab-3-Dataset/(if provided)
Variables
| Column | Type | Description |
|---|---|---|
| PatientID | Text | Unique patient ID |
| VisitDate | Date/Text | Visit date (some rows need type fixes) |
| Age | Number | Age in years |
| Sex | Text | Female/Male |
| Region | Text | North/South/East/West (inconsistent spacing/case in some rows) |
| HospitalUnit | Text | Clinical unit |
| DiagnosisGroup | Text | Hypertension/Diabetes/Respiratory (inconsistent case in some rows) |
| TreatmentGroup | Text | A/B (some lowercase values) |
| BaselineSBP | Number | Baseline systolic blood pressure |
| FollowupSBP | Number | Follow-up systolic blood pressure |
| LDL | Number/Text | LDL cholesterol (mixed numeric/text in some rows) |
| HbA1c | Number | Glycemic marker (some missing values) |
| BMI | Number | Body Mass Index |
| AdherenceScore | Number/Text | Medication adherence score (mixed numeric/text in some rows) |
| LengthOfStay | Number | Days in hospital |
| Readmission30d | Binary (0/1) | 30-day readmission indicator |
| Notes | Text | Optional notes |
Data Quality Notes
- Duplicate records
- Blank rows
- Mixed text/number types (LDL, AdherenceScore)
- Inconsistent category formatting (Region, DiagnosisGroup, TreatmentGroup)
- Missing values (Region, HbA1c)
- Outliers (BMI, LengthOfStay)
Learning Objectives
By the end of this lab, you can:
- Import and clean medical data in Excel
- Produce descriptive statistics and segmented summaries
- Build PivotTables for subgroup analysis
- Compute and interpret 95% CIs
- Compute and interpret p-values for simple comparisons
Quick Reference (Week 5 Skills)
- Import:
Data -> Get Data -> From File -> From Text/CSVorFrom Excel Workbook - Open query again:
Data -> Queries & Connections -> Right-click query -> Edit - PivotTable:
Insert -> PivotTable - Pivot grouping: right-click date/number field -> Group
- Pivot filters: Report Filter and optional Slicer
- Pivot design: Subtotals, Grand Totals, Report Layout
- Pivot calculated field: PivotTable Analyze -> Fields, Items, & Sets -> Calculated Field
- Refresh:
Data -> Refresh All - 95% CI formula:
mean +/- t* * (SD / sqrt(n)) - Two-sample p-value:
=T.TEST(rangeA, rangeB, 2, 3) - Categorical p-value (optional):
=CHISQ.TEST(actual_range, expected_range)
Lab Exercises
Part A: Load and Clean
Exercise A1: Import and Inspect
Question: Import Medical-EDA.csv (or .xlsx) and identify at least 4 quality issues.
Instructions:
- Import with Transform Data.
- Inspect data types and scan columns for missing values and inconsistent categories.
- Record at least 4 issues found.
Answer: Typical issues: duplicate rows, blank rows, mixed text/number in LDL and AdherenceScore, inconsistent Region/DiagnosisGroup/TreatmentGroup, missing Region/HbA1c, and outliers in BMI/LengthOfStay.
Exercise A2: Clean Core Columns
Question: Clean data so all key columns are analysis-ready.
Instructions:
- Remove fully blank rows.
- Remove duplicates (all columns).
- Trim text columns:
Region,DiagnosisGroup,TreatmentGroup. - Replace category variants (e.g.,
a->A,b->B, uppercase/lowercase diagnosis labels -> standard labels). - Set data types:
VisitDate-> DateAge,BaselineSBP,FollowupSBP,LDL,HbA1c,BMI,AdherenceScore,LengthOfStay-> NumberReadmission30d-> Whole Number
Answer: A clean table should have standardized categories, valid types, and no blank/duplicate rows.
Part B: Descriptive Statistics
Exercise B1: Overall Summary
Question: Build a summary table for BaselineSBP, FollowupSBP, LDL, and BMI.
Instructions:
- On a new sheet, create a table with metrics: Mean, Median, SD, Min, Max, N.
- Use formulas such as:
AVERAGE(range)MEDIAN(range)STDEV.S(range)MIN(range),MAX(range)COUNT(range)
Answer: Your summary should show lower average FollowupSBP than BaselineSBP and plausible central values for LDL/BMI after cleaning.
Exercise B2: Segment by Treatment
Question: Compare FollowupSBP and LengthOfStay between Treatment A and B.
Instructions:
- Use
AVERAGEIFS(or PivotTable) to compute mean by treatment. - Report group means side-by-side.
Answer: In this synthetic dataset, Treatment A typically has lower follow-up SBP and shorter stay on average than Treatment B.
Part C: Summary Tables and Segmentation
Exercise C1: COUNTIFS and AVERAGEIFS
Question: Create a segmented summary by Region and DiagnosisGroup.
Instructions:
- Build a small matrix of counts:
COUNTIFS(Region, r, DiagnosisGroup, d). - Build another matrix of means:
AVERAGEIFS(FollowupSBP, Region, r, DiagnosisGroup, d).
Answer: You should obtain subgroup-level counts and mean follow-up SBP values suitable for EDA interpretation.
Exercise C2: Readmission Segmentation
Question: Compare readmission rates by TreatmentGroup and Sex.
Instructions:
- Compute subgroup readmission rate:
AVERAGEIFS(Readmission30d, TreatmentGroup, g, Sex, s). - Convert to percent format.
Answer: Readmission percentages are expected to differ by treatment group; record subgroup differences.
Part D: PivotTables
Exercise D1: PivotTable 1 (Clinical Outcome by Treatment)
Question: Summarize FollowupSBP and LengthOfStay by TreatmentGroup.
Instructions:
- Insert PivotTable from clean table.
- Rows:
TreatmentGroup - Values: Average of
FollowupSBP, Average ofLengthOfStay
Answer: PivotTable should clearly compare treatment outcomes.
Exercise D2: PivotTable 2 (Readmission by Region and Diagnosis)
Question: Build a segmented PivotTable for readmission.
Instructions:
- Rows:
Region - Columns:
DiagnosisGroup - Values: Average of
Readmission30d(format as %) - Optional Filter:
Sex
Answer: You should identify combinations with relatively higher readmission rates.
Exercise D3: PivotTable 3 (Time Grouping by Month/Quarter)
Question: Group patient visits by month/quarter and summarize average follow-up SBP.
Instructions:
- Create a new PivotTable from the clean table.
- Rows:
VisitDate - Values: Average of
FollowupSBP - Right-click any date in the PivotTable -> Group -> select
Months(and optionallyQuarters). - Add
TreatmentGroupto Filters to compare A vs B after grouping.
Answer: You should see monthly/quarterly trends in follow-up SBP and be able to compare trends by treatment using filters.
Exercise D4: Pivot Layout, Filters, and Slicers
Question: Improve readability and interactivity of a PivotTable.
Instructions:
- Start from PivotTable 1 or 2.
- Turn on/off Subtotals and Grand Totals (PivotTable Design tab).
- Change Report Layout (Compact/Outline/Tabular) and choose the clearest format.
- Add at least one report filter (e.g.,
SexorHospitalUnit). - Optional: Insert a Slicer for
TreatmentGroupand test interactive filtering.
Answer: A good Pivot report is readable, consistently formatted, and easy to explore with filters/slicers.
Exercise D5 (optional): Calculated Field in PivotTable
Question: Add a calculated field for SBP improvement (BaselineSBP - FollowupSBP) in a PivotTable.
Instructions:
- Open a PivotTable built from the cleaned data.
- PivotTable Analyze -> Fields, Items, & Sets -> Calculated Field.
- Name:
SBP_Improvement. - Formula:
=BaselineSBP - FollowupSBP - Add
TreatmentGroupas Rows and summarize averageSBP_Improvement.
Answer: Positive values indicate reduction from baseline to follow-up. Compare mean improvement by treatment group.
Part E: Confidence Intervals and p-values
Exercise E1: 95% CI for Mean FollowupSBP (Overall)
Question: Calculate the 95% CI for mean FollowupSBP.
Instructions:
- Compute:
mean = AVERAGE(range)sd = STDEV.S(range)n = COUNT(range)
- Compute t critical:
t_star = T.INV.2T(0.05, n-1)
- Margin of error:
ME = t_star * sd / SQRT(n)
- CI bounds:
Lower = mean - MEUpper = mean + ME
Answer: Report CI as [Lower, Upper]. Interpretation: we are 95% confident the population mean follow-up SBP lies in this interval.
Exercise E2: 95% CI by Treatment Group
Question: Compute separate 95% CIs for Treatment A and B (FollowupSBP).
Instructions:
- Filter or reference each treatment group separately.
- Repeat E1 formulas for each group.
- Compare intervals qualitatively (overlap vs separation).
Answer: Treatment A and B should show different interval centers; discuss whether separation appears meaningful.
Exercise E3: p-value for Difference in Means (Treatment A vs B)
Question: Is FollowupSBP different between Treatment A and B?
Instructions:
- Put group ranges in two columns (A and B).
- Use:
=T.TEST(rangeA, rangeB, 2, 3)2= two-tailed,3= unequal variances (Welch)
- Decision rule:
- If
p < 0.05, reject equal-means null hypothesis.
- If
Answer: Report p-value and decision. Example interpretation: “p < 0.05 suggests a statistically significant difference in mean follow-up SBP between treatment groups.”
Exercise E4 (optional): p-value for Categorical Association
Question: Is 30-day readmission associated with TreatmentGroup?
Instructions:
- Build a 2x2 observed table (
TreatmentGroupxReadmission30d) usingCOUNTIFSor PivotTable. - Build expected counts table.
- Use
CHISQ.TEST(observed_range, expected_range).
Answer: Report p-value and interpretation at alpha = 0.05.
Part F: Short Interpretation
Exercise F1: Mini report paragraph
Question: Write 5-8 sentences summarizing findings.
Instructions: Include:
- One descriptive-statistics insight
- One segmentation/PivotTable insight
- One CI statement
- One p-value statement
- One practical implication for healthcare decision-making
Answer: A strong response clearly distinguishes descriptive findings from inferential findings and states both statistical and practical meaning.
Verification Checklist
- Cleaned dataset loaded in Excel
- Descriptive statistics table completed
- At least 2 segmented summary tables completed
- At least 3 PivotTables completed (including one grouped by month/quarter)
- At least one Pivot filter (or slicer) used
- Optional calculated field tested (
SBP_Improvement) - 95% CI computed correctly (overall and by treatment)
- p-value test completed and interpreted correctly
- Mini report paragraph completed
Common Errors and Fixes
- Numbers treated as text
- Fix with data type conversion in Power Query or
VALUE()in Excel.
- Fix with data type conversion in Power Query or
- Date not recognized
- Check locale and source format; re-parse date column.
- T.TEST returns error
- Ensure both ranges are numeric and contain no text blanks.
- CHISQ.TEST error
- Observed and expected ranges must have same shape and non-zero expected counts.
Conclusion
You practiced the full Week 5 EDA workflow on medical data:
- Load and clean
- Descriptive and segmented summaries
- PivotTables
- Confidence intervals and p-values
These outputs feed directly into Group Progress Report 3 and prepare you for later visualization and statistical-analysis weeks.