Data Cleaning Workflow

VisiData is a powerful data cleaning tool. This lesson provides a systematic, reproducible workflow for cleaning CSV and tabular data — from inspection to final export.

Learning Focus

Follow this workflow on any messy dataset. The key insight is that VisiData operations are non-destructive — your source file is unchanged until you explicitly save.

Step 1: Inspect the Raw Data

vd /var/www/html/exports/raw_orders.csv

# Instant overview:
Shift+I # Describe Sheet — see nulls, distinct counts, min, max
zF # one-line summary for current column

Look for:

  • Columns with high null counts
  • Columns with only 1 distinct value (useless)
  • Numeric columns with type string
  • Date columns stored as text
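To see what Describe Sheet is reporting, here is a plain-Python sketch of the same per-column profile (the function and sample rows are illustrative, not VisiData's own code):

```python
def profile(rows, columns):
    """Per-column null count and distinct count, like Describe Sheet."""
    stats = {}
    for col in columns:
        values = [r[col] for r in rows]
        nulls = sum(1 for v in values if v in ("", None))
        stats[col] = {"nulls": nulls, "distinct": len(set(values))}
    return stats

rows = [
    {"order_id": "1", "region": "sg"},
    {"order_id": "2", "region": ""},
    {"order_id": "3", "region": "sg"},
]
print(profile(rows, ["order_id", "region"]))
# 'region' shows 1 null; a column whose distinct count is 1 carries no information
```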

Step 2: Fix Column Types

# For each numeric column that shows as string:
# Move to 'amount' column
% # float

# Move to 'quantity' column
# # integer

# Move to 'order_date' column
@ # date (auto-detects common formats)
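Conceptually, each type command re-parses the column's strings and flags values that fail to parse. A minimal sketch (the `%Y-%m-%d` date format is an assumption; VisiData itself tries several common formats):

```python
from datetime import datetime

def cast(value, typ):
    """Try a cast; return None on failure (VisiData displays these as errors)."""
    try:
        if typ == "float":
            return float(value)
        if typ == "int":
            return int(value)
        if typ == "date":
            return datetime.strptime(value, "%Y-%m-%d").date()
    except (ValueError, TypeError):
        return None

print(cast("100.50", "float"))  # 100.5
print(cast("abc", "float"))     # None -> shows up as an error cell
```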

Step 3: Fill Null Values

# Move to a column with nulls (e.g., 'region')
# Press f to fill nulls with the value above
f
# Repeat for every column where copying the value down from the row above makes sense

For columns that should have a fixed default:

# Select all rows where 'region' is null
z|
# Enter: region is None or region == ''

# Set selected rows' 'region' to a default
ge
# Enter: Unknown
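Both fill strategies are simple to state in plain Python; this sketch (illustrative names, not VisiData internals) shows forward-fill followed by a fixed default for anything still empty:

```python
def forward_fill(values):
    """Fill each empty value with the nearest non-empty value above it (like f)."""
    filled, last = [], None
    for v in values:
        if v in ("", None):
            v = last
        else:
            last = v
        filled.append(v)
    return filled

def default_fill(values, default):
    """Replace remaining empties with a fixed default (like ge on selected rows)."""
    return [default if v in ("", None) else v for v in values]

regions = ["", "sg", "", "kl"]
print(default_fill(forward_fill(regions), "Unknown"))
# ['Unknown', 'sg', 'sg', 'kl'] -- the leading empty has nothing above it,
# so only the fixed default can fill it
```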

Step 4: Remove Duplicates

# Open frequency table on the unique key column (e.g., 'order_id')
Shift+F

# Look for any count > 1 (duplicates)
# Press ] to sort by count descending
# Press Enter on rows with count > 1 → see the duplicate rows
# Back in source: select duplicates and delete
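The dedupe logic behind this step is "keep the first row per key, drop the rest"; a plain-Python sketch for comparison (hypothetical rows, not VisiData code):

```python
def dedupe(rows, key):
    """Keep the first row for each key value, drop later duplicates."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

orders = [{"order_id": "1"}, {"order_id": "2"}, {"order_id": "1"}]
print(dedupe(orders, "order_id"))  # only the first order_id '1' survives
```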

Step 5: Normalize Text Values

# Normalize status values (e.g., 'ACTIVE', 'Active', 'active' → 'active')
gs # select all rows
g*
# Enter: (?i)active<Tab>active
# Normalizes all case variants to lowercase 'active'

# Remove leading/trailing whitespace with an expression column
=
# Enter: status.strip()   (reference the source column by its name)
# Then replace the original column with the cleaned version
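Both normalizations reduce to a strip plus a case-insensitive regex substitution; a minimal Python sketch of the same transform (`canonical` and the sample values are illustrative):

```python
import re

def normalize(value, canonical="active"):
    """Strip whitespace and collapse case variants, like g* with a (?i) regex."""
    value = value.strip()
    return re.sub(r"(?i)^active$", canonical, value)

print([normalize(s) for s in ["ACTIVE", " Active ", "inactive"]])
# ['active', 'active', 'inactive'] -- 'inactive' is untouched because
# the anchors ^ and $ require a whole-value match
```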

Step 6: Remove Invalid Rows

# Select rows where 'amount' is negative
z|
# Enter: amount < 0

# Delete selected invalid rows
gd

# Verify row count in status bar
Ctrl+G
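Select-then-delete is equivalent to keeping only the rows that pass a validity predicate; sketched in plain Python (illustrative rows, not VisiData internals):

```python
def drop_invalid(rows):
    """Keep rows with a non-negative amount (like z| amount < 0, then gd)."""
    return [r for r in rows if r["amount"] >= 0]

rows = [{"amount": 100.5}, {"amount": -50.0}, {"amount": 200.0}]
print(len(drop_invalid(rows)))  # 2 rows remain
```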

Step 7: Export Clean Data

# Confirm the cleaned sheet looks correct
Shift+I # Describe Sheet — verify nulls = 0, types correct

# Save to new file (never overwrite the source!)
Ctrl+S
# Enter: /var/www/html/exports/clean_orders.csv

Complete Cleaning Workflow (Diagram)

Inspect (Shift+I) → Fix Types (# / % / @) → Fill Nulls (f / ge) → Remove Duplicates (Shift+F) → Normalize Text (g*) → Remove Invalid Rows (gd) → Export (Ctrl+S)

Save the Workflow as a CommandLog

After cleaning, save the CommandLog for reproducibility:

Shift+D # view CommandLog for this sheet
Ctrl+S
# Enter: /tmp/clean_orders_workflow.vdj

Replay on a new file next time:

vd --play /tmp/clean_orders_workflow.vdj --batch new_raw_orders.csv -o new_clean_orders.csv

Practical Use Cases

Clean WordPress User Export

vd /var/www/html/exports/wp_users.csv

# 1. Inspect: Shift+I
# 2. Cast 'ID' to int: #
# 3. Cast 'user_registered' to date: @
# 4. Normalize 'user_status': gs → g* → (?i)1<Tab>active
# 5. Remove test users: z| → user_login.startswith('test') → gd
# 6. Save: Ctrl+S → /tmp/wp_users_clean.csv

Clean API Response Data

# Download and clean JSON response
curl -s https://api.example.com/orders | vd -f json

# Expand nested columns: ( on any dict/list column
# Type numeric fields: #, %
# Filter invalid status: z| → status in ('completed', 'shipped') → "
# Export: Ctrl+S
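Expanding a nested column with ( turns inner keys into dotted column names; roughly what that does, as a plain-Python sketch (the `order` record is hypothetical):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, like ( in VisiData."""
    flat = {}
    for k, v in record.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten(v, key + "."))
        else:
            flat[key] = v
    return flat

order = {"id": 1, "customer": {"name": "Alice", "region": "sg"}}
print(flatten(order))
# {'id': 1, 'customer.name': 'Alice', 'customer.region': 'sg'}
```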

Troubleshooting Matrix

Problem                          | Cause                        | Fix
f (fill) fills wrong direction   | Column is sorted differently | Sort by a stable key first
g* replace doesn't match         | Regex is case-sensitive      | Add the (?i) prefix for case-insensitive matching
gd deleted wrong rows            | Wrong rows were selected     | Check the selection before deleting; use { / } to jump between selected rows
Date cast fails                  | Non-standard date format     | Set options.disp_date_fmt in .visidatarc

Hands-On Practice

cat > /tmp/dirty.csv << 'EOF'
id,name,amount,status,region
1,Alice,100.50,ACTIVE,sg
2,Bob,-50,active,
3,Carol,200,Active,kl
4,Alice,100.50,ACTIVE,sg
5,Dave,abc,inactive,jk
EOF

vd /tmp/dirty.csv

# 1. Shift+I → see nulls in 'region', 'amount' has 'abc'
# 2. Cast 'amount' to float: % → notice #ERR for 'abc' row
# 3. Select the 'abc' row with s, then delete it with gd
# 4. Select row 2 (region empty): z| → not region → ge → Unknown
# 5. gs → g* → (?i)active<Tab>active → normalize status
# 6. Shift+F on 'name' → Alice appears twice; rows 1 and 4 are duplicates apart from id
# 7. Ctrl+S → save as /tmp/clean.csv
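As a cross-check, the same steps applied to the sample rows in plain Python leave three rows (the 'abc' row fails the float cast and the id-4 row is a content duplicate); a sketch of the exercise, not VisiData's own processing:

```python
import re

rows = [
    {"id": "1", "name": "Alice", "amount": "100.50", "status": "ACTIVE", "region": "sg"},
    {"id": "2", "name": "Bob", "amount": "-50", "status": "active", "region": ""},
    {"id": "3", "name": "Carol", "amount": "200", "status": "Active", "region": "kl"},
    {"id": "4", "name": "Alice", "amount": "100.50", "status": "ACTIVE", "region": "sg"},
    {"id": "5", "name": "Dave", "amount": "abc", "status": "inactive", "region": "jk"},
]

clean, seen = [], set()
for r in rows:
    try:
        r["amount"] = float(r["amount"])                # step 2: cast to float
    except ValueError:
        continue                                        # step 3: drop the 'abc' row
    r["region"] = r["region"] or "Unknown"              # step 4: default empty region
    r["status"] = re.sub(r"(?i)^active$", "active", r["status"])  # step 5: normalize
    key = (r["name"], r["amount"], r["status"], r["region"])      # step 6: dedupe
    if key in seen:
        continue
    seen.add(key)
    clean.append(r)

print(len(clean))  # 3
```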

What's Next