Lab 2: Data Import & Cleaning with Store Sales Orders (Student Version — No Answer Key)
Import from CSV/Excel, clean data with Power Query, and handle missing values.
Important: Excel vs Google Sheets
This lab is designed for Microsoft Excel and Power Query. Get Data and the Power Query Editor are Excel-specific. Google Sheets uses different tools (e.g. Connect to data, Import). Students should use Excel for these exercises.
Dataset Overview
Introduction
This lab uses a Store Sales Orders dataset — a synthetic table with deliberate data quality issues so you can practice importing, cleaning, and handling missing values with Power Query. The dataset has 200+ rows (including duplicates and blank rows), making it suitable for practicing transformations at a realistic scale.
What is Power Query? Power Query is Excel’s data connection and transformation engine. It lets you import data from files, databases, and the web, transform it (clean, reshape, merge), and load it into a worksheet or the Data Model. All steps are recorded and can be refreshed when the source changes.
Typical workflow: Import (Get Data) → Preview & detect types → Transform (clean, fix types, remove duplicates, etc.) → Load (worksheet or Data Model) → Refresh when needed.
Dataset Files
Download the Lab 2 dataset (Lab2-StoreOrders.csv, Lab2-StoreOrders.xlsx) from:
Use the CSV for Import from CSV exercises and the Excel file for Import from Excel exercises. Save both to a folder on your computer before starting.
Variables
| Column | Intended type | Description |
|---|---|---|
| OrderID | Text | Unique order identifier (e.g. ORD-1001) |
| OrderDate | Date | Order date (stored as text in the source) |
| Customer | Text | Customer name |
| Category | Text | Product category (Electronics, Clothing, Food, Books, Sports) |
| Amount | Number | Order amount (some rows stored as text) |
| Quantity | Number | Quantity (some rows stored as text) |
| Region | Text | Sales region (North, South, East, West) |
| Notes | Text | Optional notes (many blanks) |
Data Quality Notes
The dataset contains intentional issues for practice:
- Duplicates: Several exact duplicate rows.
- Wrong types: Order dates stored as text; Amount and Quantity stored as text in many rows.
- Extra spaces: Leading and trailing spaces in Customer, Category, and Region (fix with Trim/Clean).
- Missing values: Blanks in Region (several rows) and Notes (many rows).
- Blank rows: Several fully blank rows.
- Inconsistent categories: Mixed case and spacing in Category (e.g. "Electronics", " electronics ", "ELECTRONICS"); fix with Replace Values.
Learning Objectives
By completing this lab, you will:
- Import data from CSV and Excel using Get Data (Power Query).
- Use the Power Query Editor to change data types, remove duplicates, filter rows, and replace values.
- Apply Trim and Clean, remove blank rows, and add a conditional column to flag missing values.
- Load cleaned data to a worksheet (and optionally the Data Model) and refresh queries.
- Document your data-cleaning decisions and handle missing values in a principled way.
Power Query & Excel Quick Reference
Import Data
Import CSV: Data → Get Data → From File → From Text/CSV → Load / Transform Data
Import Excel: Data → Get Data → From File → From Excel Workbook → Select Sheet/Table → Load / Transform Data
Import Web: Data → Get Data → From Other Sources → From Web → Enter URL → Transform Data
Open Power Query Editor
Data → Get Data → Transform Data
or
Data → Queries & Connections → Right-click Query → Edit
Data Type
Power Query Editor → Transform → Data Type
Remove Duplicates
Power Query Editor → Home → Remove Rows → Remove Duplicates
Filter Rows
Column Header → Filter Dropdown → Text / Number / Date Filters
Replace Values
Power Query Editor → Transform → Replace Values
Trim / Clean Text
Power Query Editor → Transform → Format → Trim
Power Query Editor → Transform → Format → Clean
Remove Blank Rows
Power Query Editor → Home → Remove Rows → Remove Blank Rows
Conditional Column (Flag Missing)
Power Query Editor → Add Column → Conditional Column
Load Data
Load to Worksheet: Power Query Editor → Close & Load
Load to Data Model: Power Query Editor → Close & Load To… → Only Create Connection → Add this data to the Data Model
Refresh Data
Data → Refresh All
or
Data → Queries & Connections → Right-click Query → Refresh
Lab Exercises
The following exercises guide you through importing, cleaning, and loading the Store Sales Orders data with Power Query. Each exercise includes a question and step-by-step instructions. Complete the exercises in order and document your results; later steps assume earlier transformations.
Part A: Import Data
Exercise A1: Import from CSV
Question: Import the Store Sales Orders data from CSV and open the Power Query Editor. What do you see in the preview (column types, obvious issues)?
Instructions:
- Excel: Data → Get Data → From File → From Text/CSV.
- Select `Lab2-StoreOrders.csv` (from this folder or where you saved it). Click Transform Data (not Load) so you open the Power Query Editor before loading.
- In the Power Query Editor, check the data types shown in the column headers (ABC, 123, calendar icons). Scroll through the table and look for duplicates, blank rows, extra spaces in text, and missing values.
Explanation: Using Transform Data instead of Load lets you clean the data first. The preview shows how Excel interprets each column; types may be wrong (e.g. dates or numbers as Text) and need fixing.
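Every action you take in the editor is recorded as M code, which you can inspect via Home → Advanced Editor. For a CSV import, the recorded source steps look roughly like the sketch below; the file path and the Encoding/QuoteStyle options are placeholders and will reflect your own file and Excel's detection:

```m
// Fragment from the Advanced Editor; the path is a placeholder.
Source = Csv.Document(
    File.Contents("C:\Data\Lab2-StoreOrders.csv"),
    [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]),
// Use the first row of the file as column headers.
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
```

You never have to write this by hand; reading it is a good way to verify what each Applied Step actually does.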
Exercise A2: Import from Excel
Question: Import the same dataset from the Excel workbook and open the Power Query Editor.
Instructions:
- Excel: Data → Get Data → From File → From Excel Workbook.
- Select `Lab2-StoreOrders.xlsx`, then choose the Orders sheet (or table). Click Transform Data.
- Confirm the structure (columns and row count) matches the CSV import.
Explanation: Importing from Excel often preserves more type information than CSV, but you still need to verify. Use one source (CSV or Excel) consistently for Parts B–D.
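For comparison, an Excel-workbook import records a navigation step that selects the sheet or table by name. A sketch of what the Advanced Editor might show (path and sheet name are placeholders):

```m
Source = Excel.Workbook(File.Contents("C:\Data\Lab2-StoreOrders.xlsx"), null, true),
// Navigate to the Orders sheet; Kind would be "Table" for an Excel table.
Orders_Sheet = Source{[Item = "Orders", Kind = "Sheet"]}[Data],
#"Promoted Headers" = Table.PromoteHeaders(Orders_Sheet, [PromoteAllScalars = true])
```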
Exercise A3 (optional): Import from Web
Question: Import a small table from a web page using Get Data → From Web.
Instructions:
- Excel: Data → Get Data → From Other Sources → From Web.
- Enter a URL that contains an HTML table (e.g. a simple reference or demo table). Use Transform Data to preview and clean if needed.
- Document the URL and the steps you used.
Exercise A4: Compare CSV vs Excel Import
Question: If you imported both CSV and Excel, compare the two queries. Do the initial detected data types or row counts differ? Why might they?
Instructions:
- Create two queries: one from `Lab2-StoreOrders.csv`, one from `Lab2-StoreOrders.xlsx` (both with Transform Data).
- In each, note the detected types for OrderDate, Amount, and Quantity, and the row count at the bottom.
- Briefly note any differences.
Explanation: CSV has no type information; Excel infers from cell formats. Excel workbooks can preserve Number/Date formats, so Power Query may detect them more accurately. Row counts should match if the sources are the same.
Part B: Power Query Transformations
Exercise B1: Change Data Types
Question: Set correct data types for OrderDate, Amount, and Quantity.
Instructions:
- Select the OrderDate column. Transform → Data Type → Date (or Date/Time). Choose the correct locale if prompted (e.g. day/month/year vs month/day/year).
- Select Amount → Transform → Data Type → Decimal Number (or Currency if you prefer).
- Select Quantity → Transform → Data Type → Whole Number.
- Check for any columns still detected as Text that should be numeric or date; fix them.
Explanation: Correct types ensure filtering, sorting, and calculations work properly. Wrong types (e.g. numbers as text) cause incorrect sums or sorts.
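In M, the ribbon's Data Type command becomes a single `Table.TransformColumnTypes` step. A sketch, where `#"Previous Step"` stands for whatever step precedes it in your query and "en-US" is a placeholder locale:

```m
#"Changed Type" = Table.TransformColumnTypes(
    #"Previous Step",
    {{"OrderDate", type date},      // text -> Date
     {"Amount",    type number},    // Decimal Number
     {"Quantity",  Int64.Type}},    // Whole Number
    "en-US")                        // locale used to parse the text values
```

The optional locale argument is what the "Using Locale…" dialog fills in; it matters when the source writes dates as day/month/year.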
Exercise B2: Remove Duplicates
Question: Remove duplicate rows and note how many rows were removed.
Instructions:
- Select all columns (or the key columns you use for duplicates, e.g. OrderID).
- Home → Remove Rows → Remove Duplicates.
- Check the row count before and after; record the number of rows removed.
Explanation: Duplicates inflate counts and can bias analysis. Remove them based on a meaningful key (e.g. OrderID) or on all columns, depending on your definition of a duplicate.
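The recorded step is `Table.Distinct`. With no column list it compares entire rows, so only exact duplicates are removed (sketch; `#"Previous Step"` is a placeholder):

```m
// Keeps one copy of each fully identical row; compare the row count
// shown at the bottom of the editor before and after this step.
#"Removed Duplicates" = Table.Distinct(#"Previous Step")
```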
Exercise B3: Filter Rows
Question: Use the column filter dropdowns to inspect or exclude certain values (e.g. blanks or a specific Category/Region).
Instructions:
- Click the filter icon on a column header (e.g. Category or Region).
- Use Text Filters or "Remove Empty" (or equivalent) to filter out blanks if needed, or filter to a single category for inspection.
- Remove or adjust the filter as needed for later steps (e.g. you may remove blank rows via Home → Remove Rows → Remove Blank Rows instead).
Explanation: Filtering helps you explore subsets and confirm where blanks or odd values appear. Use it before standardizing categories or removing blank rows.
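A filter set through the dropdown is recorded as a `Table.SelectRows` step. Excluding blank Regions, for example, might be recorded roughly as:

```m
// Keep only rows whose Region is neither null nor empty text.
#"Filtered Rows" = Table.SelectRows(#"Previous Step",
    each [Region] <> null and [Region] <> "")
```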
Exercise B4: Replace Values
Question: Standardize the Category column (e.g. replace "ELECTRONICS", " electronics " with "Electronics").
Instructions:
- Select the Category column. Transform → Replace Values.
- Replace each variant (e.g. "ELECTRONICS", " electronics ") with the standard form "Electronics". Repeat for other categories (Clothing, Food, Books, Sports) as needed.
- Alternatively, use a Conditional Column or multiple Replace Values steps.
Explanation: Inconsistent categories split groups in PivotTables and reports. Standardizing them ensures correct aggregation and filters.
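Each Replace Values command becomes one `Table.ReplaceValue` step; note that `Replacer.ReplaceValue` matches the whole cell value, not substrings. A sketch for two of the Electronics variants:

```m
#"Replaced Upper" = Table.ReplaceValue(#"Previous Step",
    "ELECTRONICS", "Electronics", Replacer.ReplaceValue, {"Category"}),
#"Replaced Spaced" = Table.ReplaceValue(#"Replaced Upper",
    " electronics ", "Electronics", Replacer.ReplaceValue, {"Category"})
```

If you apply Trim first, the spaced variant collapses to plain "electronics", which reduces the number of replacement steps you need.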
Exercise B5: Trim and Clean
Question: Apply Trim to Customer, Category, and Region; use Clean if you encounter non-printable characters.
Instructions:
- Select Customer. Transform → Format → Trim. Repeat for Category and Region.
- If any column has non-printable characters (e.g. from imported web or text data), use Transform → Format → Clean on that column.
Explanation: Trim removes leading and trailing spaces; Clean removes non-printable characters (e.g. tab, vertical tab). Both improve matching and grouping.
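Trim applied to several columns is recorded as one `Table.TransformColumns` step; Clean works the same way with `Text.Clean`. A sketch:

```m
#"Trimmed Text" = Table.TransformColumns(#"Previous Step",
    {{"Customer", Text.Trim, type text},
     {"Category", Text.Trim, type text},
     {"Region",   Text.Trim, type text}})
```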
Exercise B6: Duplicates by Key vs All Columns
Question: Try removing duplicates first by OrderID only, then (in a new query or after undoing) by all columns. Compare the number of rows removed in each case. When would you use each approach?
Instructions:
- Remove duplicates based only on OrderID. Note the row count before and after.
- Start again from the source (or use a duplicate of the query) and remove duplicates based on all columns. Note the row count before and after.
- Compare and briefly explain when to use key-only vs all-column duplicate removal.
Explanation: Key-only (e.g. OrderID) treats two rows as duplicates if the key is the same, even if other columns differ. All-column duplicate removal keeps only rows that are identical in every column. Use key-only when the key uniquely identifies a record; use all-column when you want to drop only exact duplicates.
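The two approaches differ only in whether `Table.Distinct` is given a list of key columns (sketch; `#"Previous Step"` is a placeholder):

```m
// (a) Key-only: same OrderID means duplicate, even if other columns differ.
#"By Key" = Table.Distinct(#"Previous Step", {"OrderID"}),
// (b) All columns: only rows identical in every column are duplicates.
#"Exact"  = Table.Distinct(#"Previous Step")
```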
Part C: Missing Values
Exercise C1: Remove Blank Rows
Question: Remove fully blank rows and note the change in row count.
Instructions:
- Home → Remove Rows → Remove Blank Rows.
- Compare the row count before and after.
Explanation: Blank rows can break PivotTables and formulas. Remove them when they add no information.
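The Remove Blank Rows command records a `Table.SelectRows` step that keeps a row only if at least one of its fields is non-empty; the expression Excel generates is roughly:

```m
// Keep rows where removing all "" and null values still leaves something.
#"Removed Blank Rows" = Table.SelectRows(#"Previous Step",
    each not List.IsEmpty(
        List.RemoveMatchingItems(Record.FieldValues(_), {"", null})))
```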
Exercise C2: Flag Missing Notes
Question: Add a conditional column that flags rows where Notes is null or blank, without deleting those rows.
Instructions:
- Add Column → Conditional Column.
- Set the column name (e.g. `Notes_Missing`). If Notes is null or equals `""`, output `Yes` (or `True`); otherwise `No` (or `False`).
- Load the query and confirm the new column appears.
Explanation: Flagging missing values preserves all rows for analysis while making it easy to filter or report on missingness.
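The Conditional Column dialog generates a `Table.AddColumn` step with an if/then/else expression; the `Notes_Missing` name here matches the one suggested in the instructions:

```m
#"Added Flag" = Table.AddColumn(#"Previous Step", "Notes_Missing",
    each if [Notes] = null or [Notes] = "" then "Yes" else "No",
    type text)
```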
Exercise C3: Document Your Strategy
Question: In one or two sentences, state how you handled missing values in Region and Notes (e.g. removed blank rows, flagged missing, left as-is).
Exercise C4: When to Impute, Flag, or Delete
Question: Briefly explain when you would (a) delete rows with missing values, (b) flag missing values with a new column, or (c) impute (e.g. fill with mean/median/mode). Give one example each.
Instructions:
- Consider amount of missing data, whether it’s random or systematic, and how the variable is used.
- Write 1–2 sentences for each of (a), (b), and (c) with a concrete example.
Explanation: Deletion is simple but loses data; use when missing is small and random. Flagging keeps all rows and supports analysis of missingness. Imputation preserves sample size but can bias results if done poorly; use when justified and documented.
Part D: Load and Refresh
Exercise D1: Load to Worksheet
Question: Load the cleaned data into an Excel table in a new sheet.
Instructions:
- In the Power Query Editor, Close & Load.
- Ensure the query has a clear name (e.g. `StoreOrders_Cleaned`). Rename it in Queries & Connections if needed.
Exercise D2 (optional): Load to Data Model
Question: Create a connection that loads the data into the Data Model only (no worksheet).
Instructions:
- Duplicate the query or create a new one from the same source. Close & Load To…
- Choose Only Create Connection and check Add this data to the Data Model.
- Confirm in Data → Queries & Connections that the query exists and is connected to the Data Model.
Exercise D3: Refresh Data
Question: Refresh the query after changing the source file and confirm the change appears.
Instructions:
- Make sure the source CSV/Excel file is not open (locked) in another application. Add a new row to the source file (or change an existing value), then save and close it.
- In Excel, Data → Refresh All (or right-click the query → Refresh).
- Verify that the updated data appears in the loaded table.
Exercise D4: Connection and Refresh Settings
Question: Where can you change the source file path or refresh settings for a query? What happens if you move or rename the source file?
Instructions:
- Open Data → Queries & Connections. Right-click the query → Edit or Properties (depending on your Excel version).
- Look for Source or Connection settings. Note how the path is stored.
- Move or rename the source file, then try Refresh. Observe the result.
Explanation: The connection stores the full path to the source. If you move or rename the file, Refresh fails until you update the path (e.g. via Data Source Settings or by editing the query and changing the source step).
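You can see why a moved file breaks refresh by looking at the query's first step in the Advanced Editor: the path is stored as a literal string (placeholder path shown):

```m
// Refresh re-reads exactly this path; edit the string here (or use
// Data Source Settings) after moving or renaming the file.
Source = Csv.Document(File.Contents("C:\Data\Lab2-StoreOrders.csv"),
    [Delimiter = ","])
```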
Verification & Expected Results
Quick Verification Checklist
After completing all exercises, verify:
- [ ] Data imported from CSV and from Excel (and optionally from Web).
- [ ] OrderDate, Amount, and Quantity have correct data types.
- [ ] Duplicates removed; row count change documented.
- [ ] Category standardized; Trim (and Clean if needed) applied to text columns.
- [ ] No fully blank rows; optional Notes_Missing (or similar) column added.
- [ ] Query loaded to worksheet (and optionally to Data Model).
- [ ] Refresh works after updating the source file.
- [ ] Your strategy for missing values (Region, Notes) is documented.
Common Errors & Troubleshooting
Date stays as text after Change Type
- Cause: Format or locale mismatch; column contains invalid dates.
- Fix: Use Transform → Data Type → Date and choose the correct locale. Ensure all values are valid dates; fix or remove problematic rows if needed.
Wrong step order or duplicate steps
- Cause: Steps applied in the wrong order or applied twice.
- Fix: In Applied Steps, delete or reorder steps as needed. Re-run the query to preview results.
Refresh fails
- Cause: Source file moved, renamed, or open elsewhere; connection settings incorrect.
- Fix: Ensure the source path is correct, the file is not locked by another app, and the connection uses the right file path. Re-import or update the source step if necessary.
Duplicate removal removes too many or too few rows
- Cause: Duplicates defined by all columns vs key columns only.
- Fix: Decide whether duplicates are defined by a key (e.g. OrderID) or by all columns; select the appropriate columns before Remove Duplicates.
Conclusion
Congratulations! You have completed Lab 2 covering:
- Import from CSV and Excel (and optionally Web) using Get Data.
- Power Query transformations: Change Type, Remove Duplicates, Filter Rows, Replace Values, Trim, Clean, Remove Blank Rows.
- Handling missing values: Remove Blank Rows, Conditional Column to flag missing, and documenting your strategy.
- Loading to a worksheet (and optionally the Data Model) and refreshing queries.
- When to impute, flag, or delete missing values, and how connection/source settings affect refresh.
Next Steps
- Use the cleaned Store Sales Orders data for EDA and PivotTables in later labs.
- Practice the same workflow on other datasets (e.g. Titanic, project topic data).
- Prepare for Group Progress Report 2: data quality assessment, cleaning methodology, and preliminary data structure documentation (see course schedule).
Lab Completed: __
Date: __
Instructor Signature: _____