Data Cleaning Workflow
VisiData is a powerful data cleaning tool. This lesson provides a systematic, reproducible workflow for cleaning CSV and tabular data — from inspection to final export.
Learning Focus
Follow this workflow on any messy dataset. The key insight is that VisiData operations are non-destructive — your source file is unchanged until you explicitly save.
Step 1: Inspect the Raw Data
vd /var/www/html/exports/raw_orders.csv
# Instant overview:
Shift+I # Describe Sheet — see nulls, distinct counts, min, max
zF # one-line summary for current column
Look for:
- Columns with high null counts
- Columns with only 1 distinct value (useless)
- Numeric columns with type string
- Date columns stored as text
Step 2: Fix Column Types
# For each numeric column that shows as string:
# Move to 'amount' column
% # float
# Move to 'quantity' column
# # integer
# Move to 'order_date' column
@ # date (auto-detects common formats)
Step 3: Fill Null Values
# Move to a column with nulls (e.g., 'region')
# Press f to fill nulls with the value above
f
# Repeat for all columns with propagating values
For columns that should have a fixed default:
# Select all rows where 'region' is null
z|
# Enter: region is None or region == ''
# Set selected rows' 'region' to a default
ge
# Enter: Unknown
Step 4: Remove Duplicates
# Open frequency table on the unique key column (e.g., 'order_id')
Shift+F
# Look for any count > 1 (duplicates)
# Press ] to sort by count descending
# Press Enter on rows with count > 1 → see the duplicate rows
# Back in source: select duplicates and delete
Step 5: Normalize Text Values
# Normalize status values (e.g., 'ACTIVE', 'Active', 'active' → 'active')
gs # select all rows
g*
# Enter: (?i)active<Tab>active
# Normalizes all case variants to lowercase 'active'
# Remove leading/trailing whitespace with expression column
=
# Enter: value.strip()
# Then replace original column with cleaned version
Step 6: Remove Invalid Rows
# Select rows where 'amount' is negative
z|
# Enter: amount < 0
# Delete selected invalid rows
gd
# Verify row count in status bar
Ctrl+G
Step 7: Export Clean Data
# Confirm the cleaned sheet looks correct
Shift+I # Describe Sheet — verify nulls = 0, types correct
# Save to new file (never overwrite the source!)
Ctrl+S
# Enter: /var/www/html/exports/clean_orders.csv
Complete Cleaning Workflow (Diagram)
Save the Workflow as a CommandLog
After cleaning, save the CommandLog for reproducibility:
Shift+D # view CommandLog for this sheet
Ctrl+S
# Enter: /tmp/clean_orders_workflow.vdj
Replay on a new file next time:
vd --play /tmp/clean_orders_workflow.vdj --batch new_raw_orders.csv -o new_clean_orders.csv
Practical Use Cases
Clean WordPress User Export
vd /var/www/html/exports/wp_users.csv
# 1. Inspect: Shift+I
# 2. Cast 'ID' to int: #
# 3. Cast 'user_registered' to date: @
# 4. Normalize 'user_status': gs → g* → (?i)1<Tab>active
# 5. Remove test users: z| → user_login.startswith('test') → gd
# 6. Save: Ctrl+S → /tmp/wp_users_clean.csv
Clean API Response Data
# Download and clean JSON response
curl -s https://api.example.com/orders | vd -f json
# Expand nested columns: ( on any dict/list column
# Type numeric fields: #, %
# Filter invalid status: z| → status in ('completed', 'shipped') → "
# Export: Ctrl+S
Troubleshooting Matrix
| Problem | Cause | Fix |
|---|---|---|
f (fill) fills wrong direction | Column is sorted differently | Sort by a stable key first |
g* replace doesn't match | Regex case sensitivity | Add (?i) prefix for case-insensitive |
gd deleted wrong rows | Wrong rows selected | Check selection before deleting — use {/} |
| Date cast fails | Non-standard date format | Set options.disp_date_fmt in .visidatarc |
Hands-On Practice
cat > /tmp/dirty.csv << 'EOF'
id,name,amount,status,region
1,Alice,100.50,ACTIVE,sg
2,Bob,-50,active,
3,Carol,200,Active,kl
4,Alice,100.50,ACTIVE,sg
5,Dave,abc,inactive,jk
EOF
vd /tmp/dirty.csv
# 1. Shift+I → see nulls in 'region', 'amount' has 'abc'
# 2. Cast 'amount' to float: % → notice #ERR for 'abc' row
# 3. Select row 5 with 'abc' → s → gd → delete
# 4. Select row 2 (region empty): z| → not region → ge → Unknown
# 5. gs → g* → (?i)active<Tab>active → normalize status
# 6. Shift+F on 'id' → find row 1/4 duplicate
# 7. Ctrl+S → save as /tmp/clean.csv