Lab 2: Data Import & Cleaning with Store Sales Orders
Import from CSV/Excel, clean data with Power Query, and handle missing values.
Important: Excel vs Google Sheets
This lab is designed for Microsoft Excel and Power Query. Get Data and the Power Query Editor are Excel-specific. Google Sheets uses different tools (e.g. Connect to data, Import). Students should use Excel for these exercises.
Dataset Overview
Introduction
This lab uses a Store Sales Orders dataset — a synthetic table with deliberate data quality issues so you can practice importing, cleaning, and handling missing values with Power Query. The dataset has 200+ rows (including duplicates and blank rows), making it suitable for practicing transformations at a realistic scale.
What is Power Query? Power Query is Excel’s data connection and transformation engine. It lets you import data from files, databases, and the web, transform it (clean, reshape, merge), and load it into a worksheet or the Data Model. All steps are recorded and can be refreshed when the source changes.
Typical workflow: Import (Get Data) → Preview & detect types → Transform (clean, fix types, remove duplicates, etc.) → Load (worksheet or Data Model) → Refresh when needed.
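Behind the scenes, every step you apply in the editor is recorded in Power Query's formula language, M (visible via the Advanced Editor). Below is a sketch of what a finished query might look like; the file path and step names are illustrative, and Excel generates its own when you click through the UI:

```
let
    // Import the CSV (path is illustrative; yours will differ)
    Source = Csv.Document(File.Contents("C:\Data\Lab2-StoreOrders.csv"),
        [Delimiter = ",", Encoding = 65001]),
    // Use the first row as column headers
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Fix column types
    ChangedTypes = Table.TransformColumnTypes(PromotedHeaders,
        {{"OrderDate", type date}, {"Amount", type number}, {"Quantity", Int64.Type}}),
    // Drop exact duplicate rows
    RemovedDuplicates = Table.Distinct(ChangedTypes)
in
    RemovedDuplicates
```

You never have to write M by hand in this lab, but reading the Applied Steps as M helps you understand (and debug) what each click did.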
Dataset Files
Download the Lab 2 dataset (Lab2-StoreOrders.csv, Lab2-StoreOrders.xlsx) from:
Variables
| Column | Intended type | Description |
|---|---|---|
| OrderID | Text | Unique order identifier (e.g. ORD-1001) |
| OrderDate | Date | Order date (stored as text in the source) |
| Customer | Text | Customer name |
| Category | Text | Product category (Electronics, Clothing, Food, Books, Sports) |
| Amount | Number | Order amount (some rows stored as text) |
| Quantity | Number | Quantity (some rows stored as text) |
| Region | Text | Sales region (North, South, East, West) |
| Notes | Text | Optional notes (many blanks) |
Data Quality Notes
The dataset contains intentional issues for practice:
- Duplicates: Several exact duplicate rows.
- Wrong types: Order dates stored as text; Amount and Quantity stored as text in many rows.
- Trim/Clean: Leading and trailing spaces in Customer, Category, and Region.
- Missing values: Blanks in Region (several rows) and Notes (many rows).
- Blank rows: Several fully blank rows.
- Replace Values: Inconsistent Category values (e.g. "Electronics", " electronics ", "ELECTRONICS").
Learning Objectives
By completing this lab, you will:
- Import data from CSV and Excel using Get Data (Power Query).
- Use the Power Query Editor to change data types, remove duplicates, filter rows, and replace values.
- Apply Trim and Clean, remove blank rows, and add a conditional column to flag missing values.
- Load cleaned data to a worksheet (and optionally the Data Model) and refresh queries.
- Document your data-cleaning decisions and handle missing values in a principled way.
Power Query & Excel Quick Reference
Import Data
Import CSV: Data → Get Data → From File → From Text/CSV → Load / Transform Data
Import Excel: Data → Get Data → From File → From Excel Workbook → Select Sheet/Table → Load / Transform Data
Import Web: Data → Get Data → From Other Sources → From Web → Enter URL → Transform Data
Open Power Query Editor
Data → Get Data → Transform Data
or
Data → Queries & Connections → Right-click Query → Edit
Data Type
Power Query Editor → Transform → Data Type
Remove Duplicates
Power Query Editor → Home → Remove Rows → Remove Duplicates
Filter Rows
Column Header → Filter Dropdown → Text / Number / Date Filters
Replace Values
Power Query Editor → Transform → Replace Values
Trim / Clean Text
Power Query Editor → Transform → Format → Trim
Power Query Editor → Transform → Format → Clean
Remove Blank Rows
Power Query Editor → Home → Remove Rows → Remove Blank Rows
Conditional Column (Flag Missing)
Power Query Editor → Add Column → Conditional Column
Load Data
Load to Worksheet: Power Query Editor → Close & Load
Load to Data Model: Power Query Editor → Close & Load To… → Only Create Connection → Add this data to the Data Model
Refresh Data
Data → Refresh All
or
Data → Queries & Connections → Right-click Query → Refresh
Lab Exercises
The following exercises guide you through importing, cleaning, and loading the Store Sales Orders data with Power Query. Each exercise includes a question, step-by-step instructions, and a brief answer. Complete them in order; later steps assume earlier transformations.
Part A: Import Data
Exercise A1: Import from CSV
Question: Import the Store Sales Orders data from CSV and open the Power Query Editor. What do you see in the preview (column types, obvious issues)?
Instructions:
- Excel: Data → Get Data → From File → From Text/CSV.
- Select Lab2-StoreOrders.csv (from this folder or where you saved it). Click Transform Data (not Load) so you open the Power Query Editor before loading.
- In the Power Query Editor, check the data types shown in the column headers (ABC, 123, calendar icons). Scroll through the table and look for duplicates, blank rows, extra spaces in text, and missing values.
Explanation: Using Transform Data instead of Load lets you clean the data first. The preview shows how Excel interprets each column; types may be wrong (e.g. dates or numbers as Text) and need fixing.
Answer: You should see columns OrderID, OrderDate, Customer, Category, Amount, Quantity, Region, Notes. Many rows are present (200+). OrderDate is typically detected as Text; Amount and Quantity may be Text in several rows. You will notice duplicate rows, fully blank rows, leading/trailing spaces in Customer, Category, and Region, and missing values in Region and Notes.
Exercise A2: Import from Excel
Question: Import the same dataset from the Excel workbook and open the Power Query Editor.
Instructions:
- Excel: Data → Get Data → From File → From Excel Workbook.
- Select Lab2-StoreOrders.xlsx, then choose the Orders sheet (or table). Click Transform Data.
- Confirm the structure (columns and row count) matches the CSV import.
Explanation: Importing from Excel often preserves more type information than CSV, but you still need to verify. Use one source (CSV or Excel) consistently for Parts B–D.
Answer: The Orders sheet has the same columns and rows as the CSV. You can use either source for the cleaning exercises; stick to one for the rest of the lab.
Exercise A3 (optional): Import from Web
Question: Import a small table from a web page using Get Data → From Web.
Instructions:
- Excel: Data → Get Data → From Other Sources → From Web.
- Enter a URL that contains an HTML table (e.g. a simple reference or demo table). Use Transform Data to preview and clean if needed.
- Document the URL and the steps you used.
Answer: "Import from Web" is covered in the Week 3 slides. Use any stable URL with a table; the goal is to practice the From Web workflow and see how Power Query lists available tables.
Exercise A4: Compare CSV vs Excel Import
Question: If you imported both CSV and Excel, compare the two queries. Do the initial detected data types or row counts differ? Why might they?
Instructions:
- Create two queries: one from Lab2-StoreOrders.csv, one from Lab2-StoreOrders.xlsx (both with Transform Data).
- In each, note the detected types for OrderDate, Amount, and Quantity, and the row count at the bottom.
- Briefly note any differences.
Explanation: CSV has no type information; Excel infers from cell formats. Excel workbooks can preserve Number/Date formats, so Power Query may detect them more accurately. Row counts should match if the sources are the same.
Answer: Row counts should match. With the CSV, OrderDate, Amount, and Quantity are more likely to be detected as Text; the Excel import may show Date or Number for some columns initially. Both sources require verifying and possibly changing types in Power Query.
Part B: Power Query Transformations
Exercise B1: Change Data Types
Question: Set correct data types for OrderDate, Amount, and Quantity.
Instructions:
- Select the OrderDate column. Transform → Data Type → Date (or Date/Time). Choose the correct locale if prompted (e.g. day/month/year vs month/day/year).
- Select Amount → Transform → Data Type → Decimal Number (or Currency if you prefer).
- Select Quantity → Transform → Data Type → Whole Number.
- Check for any columns still detected as Text that should be numeric or date; fix them.
Explanation: Correct types ensure filtering, sorting, and calculations work properly. Wrong types (e.g. numbers as text) cause incorrect sums or sorts.
Answer: After changing types, OrderDate supports date filters and sorting; Amount and Quantity can be used in calculations and aggregates. The Applied Steps panel will show "Changed Type" steps for each column you modified.
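For reference, the "Changed Type" step Excel records corresponds to M like the following sketch (column names match this dataset; PreviousStep is a placeholder for whatever step came before):

```
// Transform → Data Type on three columns in one step
ChangedTypes = Table.TransformColumnTypes(PreviousStep,
    {{"OrderDate", type date}, {"Amount", type number}, {"Quantity", Int64.Type}})

// If Excel prompts for a locale, it records the locale-aware variant instead,
// e.g. for a day/month/year source:
ChangedTypesUK = Table.TransformColumnTypes(PreviousStep,
    {{"OrderDate", type date}}, "en-GB")
```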
Exercise B2: Remove Duplicates
Question: Remove duplicate rows and note how many rows were removed.
Instructions:
- Select all columns (or the key columns you use for duplicates, e.g. OrderID).
- Home → Remove Rows → Remove Duplicates.
- Check the row count before and after; record the number of rows removed.
Explanation: Duplicates inflate counts and can bias analysis. Remove them based on a meaningful key (e.g. OrderID) or on all columns, depending on your definition of a duplicate.
Answer: The dataset contains duplicate rows. After removing duplicates, the row count decreases. The exact number depends on whether you use all columns or only OrderID (or another key). Record your before/after counts for your report.
Exercise B3: Filter Rows
Question: Use the column filter dropdowns to inspect or exclude certain values (e.g. blanks or a specific Category/Region).
Instructions:
- Click the filter icon on a column header (e.g. Category or Region).
- Use Text Filters or "Remove Empty" (or equivalent) to filter out blanks if needed, or filter to a single category for inspection.
- Remove or adjust the filter as needed for later steps (e.g. you may remove blank rows via Home → Remove Rows → Remove Blank Rows instead).
Explanation: Filtering helps you explore subsets and confirm where blanks or odd values appear. Use it before standardizing categories or removing blank rows.
Answer: Filtering helps you inspect subsets of the data and confirm blanks or specific categories. You can clear the filter and use Remove Blank Rows for a permanent change, or keep filters for ad-hoc exploration.
Exercise B4: Replace Values
Question: Standardize the Category column (e.g. replace "ELECTRONICS", " electronics " with "Electronics").
Instructions:
- Select the Category column. Transform → Replace Values.
- Replace each variant (e.g. "ELECTRONICS", " electronics ") with the standard form "Electronics". Repeat for other categories (Clothing, Food, Books, Sports) as needed.
- Alternatively, use a Conditional Column or multiple Replace Values steps.
Explanation: Inconsistent categories split groups in PivotTables and reports. Standardizing them ensures correct aggregation and filters.
Answer: After replacing values, Category has consistent capitalization and spacing. You may need several Replace Values steps (one per variant) or a Conditional Column that maps variants to standard names.
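Each Replace Values click records a Table.ReplaceValue step in M, and a single normalization step is a possible alternative. Both fragments below are sketches; the variant strings are the examples from the Data Quality Notes:

```
// One step per variant; Replacer.ReplaceValue matches the whole cell value
Replaced1 = Table.ReplaceValue(PreviousStep, "ELECTRONICS", "Electronics",
    Replacer.ReplaceValue, {"Category"}),
Replaced2 = Table.ReplaceValue(Replaced1, " electronics ", "Electronics",
    Replacer.ReplaceValue, {"Category"}),

// Alternative: normalize case and spacing in one custom step
Normalized = Table.TransformColumns(PreviousStep,
    {{"Category", each Text.Proper(Text.Trim(_)), type text}})
```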
Exercise B5: Trim and Clean
Question: Apply Trim to Customer, Category, and Region; use Clean if you encounter non-printable characters.
Instructions:
- Select Customer. Transform → Format → Trim. Repeat for Category and Region.
- If any column has non-printable characters (e.g. from imported web or text data), use Transform → Format → Clean on that column.
Explanation: Trim removes leading and trailing spaces; Clean removes non-printable characters (e.g. tab, vertical tab). Both improve matching and grouping.
Answer: Trim removes leading and trailing spaces; Clean removes non-printable characters. Apply them to text columns that come from external sources to avoid duplicate categories or failed lookups.
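In M, the Trim and Clean buttons record Table.TransformColumns steps. A sketch for this dataset (PreviousStep is a placeholder):

```
// Transform → Format → Trim on three text columns at once
Trimmed = Table.TransformColumns(PreviousStep,
    {{"Customer", Text.Trim, type text},
     {"Category", Text.Trim, type text},
     {"Region", Text.Trim, type text}}),
// Transform → Format → Clean, e.g. on Notes if it holds non-printable characters
Cleaned = Table.TransformColumns(Trimmed, {{"Notes", Text.Clean, type text}})
```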
Exercise B6: Duplicates by Key vs All Columns
Question: Try removing duplicates first by OrderID only, then (in a new query or after undoing) by all columns. Compare the number of rows removed in each case. When would you use each approach?
Instructions:
- Remove duplicates based only on OrderID. Note the row count before and after.
- Start again from the source (or use a duplicate of the query) and remove duplicates based on all columns. Note the row count before and after.
- Compare and briefly explain when to use key-only vs all-column duplicate removal.
Explanation: Key-only (e.g. OrderID) treats two rows as duplicates if the key is the same, even if other columns differ. All-column duplicate removal keeps only rows that are identical in every column. Use key-only when the key uniquely identifies a record; use all-column when you want to drop only exact duplicates.
Answer: Removing duplicates by OrderID only can remove more rows, because two rows with the same OrderID but different values elsewhere (e.g. a typo in one column) still count as duplicates. Removing by all columns drops only rows that are identical in every column. Use OrderID when it is the true unique identifier; use all columns when you want to eliminate only exact duplicate rows.
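In M, the two approaches differ only in the optional second argument to Table.Distinct (a sketch; PreviousStep is a placeholder):

```
// Exact duplicates only: rows must match in every column
ByAllColumns = Table.Distinct(PreviousStep)

// Key-only: keeps the first row for each OrderID, even if other columns differ
ByOrderID = Table.Distinct(PreviousStep, {"OrderID"})
```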
Part C: Missing Values
Exercise C1: Remove Blank Rows
Question: Remove fully blank rows and note the change in row count.
Instructions:
- Home → Remove Rows → Remove Blank Rows.
- Compare the row count before and after.
Explanation: Blank rows can break PivotTables and formulas. Remove them when they add no information.
Answer: Blank rows are removed. The number of rows deleted equals the number of fully blank rows in the table. Record this for your cleaning documentation.
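Remove Blank Rows records a filter that keeps any row with at least one non-blank value. Roughly this M, as generated by recent Excel versions (yours may differ slightly):

```
RemovedBlankRows = Table.SelectRows(PreviousStep,
    each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null})))
```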
Exercise C2: Flag Missing Notes
Question: Add a conditional column that flags rows where Notes is null or blank, without deleting those rows.
Instructions:
- Add Column → Conditional Column.
- Set Column name (e.g. Notes_Missing). If Notes is null or equals "", output Yes (or True); otherwise No (or False).
- Load the query and confirm the new column appears.
Explanation: Flagging missing values preserves all rows for analysis while making it easy to filter or report on missingness.
Answer: The new column identifies rows with missing Notes. You can use it for analysis or reporting while keeping all rows. Similar logic can be applied to Region or other columns if you choose to flag rather than delete.
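The Conditional Column dialog records a step like this sketch (the column name Notes_Missing is the one suggested above):

```
AddedFlag = Table.AddColumn(PreviousStep, "Notes_Missing",
    each if [Notes] = null or [Notes] = "" then "Yes" else "No", type text)
```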
Exercise C3: Document Your Strategy
Question: In one or two sentences, state how you handled missing values in Region and Notes (e.g. removed blank rows, flagged missing, left as-is).
Answer: Example: "We removed fully blank rows, flagged missing Notes with a conditional column, and kept Region as-is (including blanks) for filtering. We did not impute values." Your statement should match what you actually did in the query.
Exercise C4: When to Impute, Flag, or Delete
Question: Briefly explain when you would (a) delete rows with missing values, (b) flag missing values with a new column, or (c) impute (e.g. fill with mean/median/mode). Give one example each.
Instructions:
- Consider amount of missing data, whether it’s random or systematic, and how the variable is used.
- Write 1–2 sentences for each of (a), (b), and (c) with a concrete example.
Explanation: Deletion is simple but loses data; use when missing is small and random. Flagging keeps all rows and supports analysis of missingness. Imputation preserves sample size but can bias results if done poorly; use when justified and documented.
Answer: (a) Delete: When missing is a small proportion and effectively random (e.g. a few scattered blanks in a large survey). (b) Flag: When you need to keep all rows but also analyze or filter by missingness (e.g. flag missing Notes, then filter to "Notes_Missing = Yes" for a subset analysis). (c) Impute: When the variable is important, missing is modest, and you have a defensible method (e.g. fill missing Region with mode, or missing numeric values with column median), and you document the choice.
Part D: Load and Refresh
Exercise D1: Load to Worksheet
Question: Load the cleaned data into an Excel table in a new sheet.
Instructions:
- In the Power Query Editor, Close & Load.
- Ensure the query has a clear name (e.g. StoreOrders_Cleaned). Rename it in Queries & Connections if needed.
Answer: The cleaned data appears as a table in a new worksheet. You can use it for further analysis, PivotTables, or charts.
Exercise D2 (optional): Load to Data Model
Question: Create a connection that loads the data into the Data Model only (no worksheet).
Instructions:
- Duplicate the query or create a new one from the same source. Close & Load To…
- Choose Only Create Connection and check Add this data to the Data Model.
- Confirm in Data → Queries & Connections that the query exists and is connected to the Data Model.
Answer: The data is available for PivotTables and DAX via the Data Model, without a separate table on a sheet. Useful for larger datasets and multi-table models.
Exercise D3: Refresh Data
Question: Refresh the query after changing the source file and confirm the change appears.
Instructions:
- Make sure the source CSV/Excel file is saved and not open in another program (an open file may be locked). Add a new row to the source file (or change an existing value), then save.
- In Excel, Data → Refresh All (or right-click the query → Refresh).
- Verify that the updated data appears in the loaded table.
Answer: Refresh replays the query steps on the current source. Use it when source files are updated regularly (e.g. daily exports). The connection stores the file path; refresh uses that path.
Exercise D4: Connection and Refresh Settings
Question: Where can you change the source file path or refresh settings for a query? What happens if you move or rename the source file?
Instructions:
- Open Data → Queries & Connections. Right-click the query → Edit or Properties (depending on your Excel version).
- Look for Source or Connection settings. Note how the path is stored.
- Move or rename the source file, then try Refresh. Observe the result.
Explanation: The connection stores the full path to the source. If you move or rename the file, Refresh fails until you update the path (e.g. via Data Source Settings or by editing the query and changing the source step).
Answer: You can change the source path via the query’s Source step in Power Query or via Data Source Settings (Data → Get Data → Data Source Settings). If the file is moved or renamed, Refresh fails with an error; update the path and refresh again.
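The path lives in the query's first (Source) step; editing that string, via the step's gear icon or the Advanced Editor, repoints the query. A sketch with a hypothetical path:

```
// First step of the query; change the string to move the source
Source = Csv.Document(File.Contents("C:\Data\Lab2-StoreOrders.csv"),
    [Delimiter = ",", Encoding = 65001])
```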
Verification & Expected Results
Quick Verification Checklist
After completing all exercises, verify:
- [ ] Data imported from CSV and from Excel (and optionally from Web).
- [ ] OrderDate, Amount, and Quantity have correct data types.
- [ ] Duplicates removed; row count change documented.
- [ ] Category standardized; Trim (and Clean if needed) applied to text columns.
- [ ] No fully blank rows; optional Notes_Missing (or similar) column added.
- [ ] Query loaded to worksheet (and optionally to Data Model).
- [ ] Refresh works after updating the source file.
- [ ] Your strategy for missing values (Region, Notes) is documented.
Common Errors & Troubleshooting
Date stays as text after Change Type
- Cause: Format or locale mismatch; column contains invalid dates.
- Fix: Use Transform → Data Type → Date and choose the correct locale. Ensure all values are valid dates; fix or remove problematic rows if needed.
Wrong step order or duplicate steps
- Cause: Steps applied in the wrong order or applied twice.
- Fix: In Applied Steps, delete or reorder steps as needed. Re-run the query to preview results.
Refresh fails
- Cause: Source file moved, renamed, or open elsewhere; connection settings incorrect.
- Fix: Ensure the source path is correct, the file is not locked by another app, and the connection uses the right file path. Re-import or update the source step if necessary.
Duplicate removal removes too many or too few rows
- Cause: Duplicates defined by all columns vs key columns only.
- Fix: Decide whether duplicates are defined by a key (e.g. OrderID) or by all columns; select the appropriate columns before Remove Duplicates.
Conclusion
Congratulations! You have completed Lab 2 covering:
- Import from CSV and Excel (and optionally Web) using Get Data.
- Power Query transformations: Change Type, Remove Duplicates, Filter Rows, Replace Values, Trim, Clean, Remove Blank Rows.
- Handling missing values: Remove Blank Rows, Conditional Column to flag missing, and documenting your strategy.
- Loading to a worksheet (and optionally the Data Model) and refreshing queries.
- When to impute, flag, or delete missing values, and how connection/source settings affect refresh.
Next Steps
- Use the cleaned Store Sales Orders data for EDA and PivotTables in later labs.
- Practice the same workflow on other datasets (e.g. Titanic, project topic data).
- Prepare for Group Progress Report 2: data quality assessment, cleaning methodology, and preliminary data structure documentation (see course schedule).
Lab Completed: __
Date: __
Instructor Signature: _____