Wednesday, October 24, 2012

Data Analytics with Powershell Part I

Below is some cruft I am using to analyze PDC data for candidates in my local WA elections. This is part of a larger project I am working on to use Powershell 3.0 as a data analytic solution.  I find working with data in Powershell 3.0 somewhat tricky,  sometimes limited, but also no less straightforward than SQL or R.  Although, Powershell can sometimes be a little frustrating to work with , I rather like being able to craft my own 'data analytic'  solution from the console. I found myself close to the data while working with PS 3.0. Certain techniques produced surprisingly rapid and illuminating results. Take the query below. After having imported to a variable ('$NM') a candidates data from CSV, in one line of code I am able to exclude all WA state contributions; then use 'group-object' to list all 'out of state' contributors,state,amounts in a sort.

$NM | Where State -ne WA | group -property Contributor,State,Amount -noelement | sort -desc Count,Name | ft -auto -wrap

Count Name
----- ----
    1 WEBB LISA, MT, 100
    1 VASKAS JANET, PA, 100
    1 VASKAS ALAN, PA, 100
    1 SOTO JLEANA, CA, 100
    1 ATU, DC, 900

Powershell 3.0 is flexible in working with data.  After importing in candidate spreadsheet data, we want to ask some questions:

  • How many donations did each candidate receive greater than or equal to $800?
  • How many donations did each candidate receive less than or equal to $100?

# import data and create variable list

sl C:\users\rferrisx\Downloads
if ($out) {rv -ea 0 out}
$NM= import-csv .\NM.$date.csv
$MDK= import-csv .\MDK.$date.csv
$JEO= import-csv .\JEO.$date.csv
$VKB= import-csv .\VKB.$date.csv
$varlist=foreach ($i in @("NM","MDK","JEO","VKB")) {ls variable:/$i}

rv -ea 0 a
# create one database stored as a variable (e.g. '$out') by merging all candidate donations. 
# Add a 'member' or field (e.g. Candidate Name) to each record
$out+=$ | % {
$a=(ls variable:/$PSItem).value;
$Name=(ls variable:/$PSItem).name;
$a | add-member -force -passthru -NotePropertyName Candidate -NotePropertyValue $PSItem; 

# An $out record now is 


Contributor : ATU
Date        : 06/22/12
Amount      : 900
P/G         : P
City        : WASHINGTON
State       : DC
Zip         : 20016
Employer    :
Occupation  :
Candidate   : NM

Now we want ordered hash tables to answer our questions:

 foreach ($i in $varlist.Name) {$ge800+=[ordered]@{$i=$(($out | ? {$_.Amount -ge "800"} | ? {$_.Candidate -eq $i}).count)};}


Name                           Value
NM                             9
MDK                            13
JEO                            31
VKB                            37

 foreach ($i in $varlist.Name) {$le100+=[ordered]@{$i=$(($out | ? {$_.Amount -le "100"} | ? {$_.Candidate -eq $i}).count)};}


Name                           Value
NM                             136
MDK                            115
JEO                            43
VKB                            34

Using .NET 4.0 charting, our last ordered hashtable becomes a chart:

Chart-HashData RangeColumn 600 600 "Candidate Donations less than or equal to `$100" "Candidate" "Number of Small Donations"

A full script for filtering 'out of state' contributions into two graphs and charts is below. Here are some prerequisites for working with 'dirty data'.

The CSV from WA PDC data sometimes needs to be cleaned up before importing:
  • top summaries removed
  • special characters not imported
  • leading spaces removed for all columns
  • zipcodes standardized to five or nine characters
To remove leading spaces and header lines I use a regex query in Open Office (e.g. '^ ').

CSV import objects should look like this:


Contributor : EADES KASHA
Date        : 05/08/12
Amount      : 900
P/G         : P
City        : BELLINGHAM
State       : WA
Zip         : 98226
Employer    :
Occupation  : HOMEMAKER

In Powershell 3.0 this gives us such that (for sum of all amounts for any candidate):

($JEO.Amount | measure-object -sum).sum

$Candidate_States give us data that excludes all contributions from WA. Like this:

$JEO_States=$JEO | Where State -ne WA |group -property State,Amount -noelement | sort -desc Count,Name

Count Name
----- ----
    2 VA, 900
    2 VA, 800
    2 TX, 250
    1 TX, 900
    1 TX, 800
    1 TX, 500
    1 TX, 400
    1 TX, 350
    1 PA, 400
    1 OR, 900
    1 OR, 500
    1 NY, 900
    1 NJ, 500
    1 NJ, 300
    1 NE, 250
    1 MO, 900
    1 MO, 800
    1 FL, 500
    1 DC, 400
    1 CA, 800
    1 CA, 300
    1 CA, 250
    1 CA, 100
    1 AR, 800

# Produces charts and graphs for Out of State Donations for Select Candidates from PDC WA data. 
# Revise path for your own folders 
# Revise file names for your candidates
# Set date from CSV imports
# Uses Chart-HashData.ps1 (See code repository)

# Assume csv PDC data has been cleaned up, special characters not imported, leading spaces removed,zipcodes normalized to either 5 or 9 digits

sl C:\users\rferrisx\Downloads
[array]$NM= import-csv .\NM.$date.csv
[array]$MDK= import-csv .\MDK.$date.csv
[array]$JEO= import-csv .\JEO.$date.csv
[array]$VKB= import-csv .\VKB.$date.csv
rv -ea 0 HashData
rv -ea 0 Total
$Global:NMStates=$NM | Where State -ne WA |group -property State,Amount -noelement | sort -desc Count,Name
$States=$NM | Where State -ne WA |group -property Amount -noelement | sort -desc Count
foreach ($i in $States) {$total+=$i.count * $}
$HashData+=[ordered]@{(ls variable:/NM).Name=$Total;}

rv -ea 0 Total
$Global:MDKStates=$MDK | Where State -ne WA |group -property State,Amount -noelement | sort -desc Count,Name
$States=$MDK | Where State -ne WA |group -property Amount -noelement | sort -desc Count
foreach ($i in $States) {$total+=$i.count * $}
$HashData+=[ordered]@{(ls variable:/MDK).Name=$Total;}

rv -ea 0 Total
$Global:JEOStates=$JEO | Where State -ne WA |group -property State,Amount -noelement | sort -desc Count,Name
$States=$JEO | Where State -ne WA |group -property Amount -noelement | sort -desc Count
foreach ($i in $States) {$total+=$i.count * $}
$HashData+=[ordered]@{(ls variable:/JEO).Name=$Total;}

rv -ea 0 Total
$Global:VKBStates=$VKB | Where State -ne WA |group -property State,Amount -noelement | sort -desc Count,Name
$States=$VKB | Where State -ne WA |group -property Amount -noelement | sort -desc Count
foreach ($i in $States) {$total+=$i.count * $}
$HashData+=[ordered]@{(ls variable:/VKB).Name=$Total;}
Chart-HashData RangeColumn 500 500 "Total Amount Out of State Donations" "Candidate" "Amount" "SeeHowTheyFundRaise"

$PCT_NM=$HashData[0]/($NM.Amount | measure-object -sum).sum
$PCT_MDK=$HashData[1]/($MDK.Amount | measure-object -sum).sum
$PCT_JEO=$HashData[2]/($JEO.Amount | measure-object -sum).sum
$PCT_VKB=$HashData[3]/($VKB.Amount | measure-object -sum).sum
rv -ea 0 hashdata

$HashData+=[ordered]@{(ls variable:/NM).Name=[float]($PCT_NM * 100);}
$HashData+=[ordered]@{(ls variable:/MDK).Name=[float]($PCT_MDK * 100);}
$HashData+=[ordered]@{(ls variable:/JEO).Name=[float]($PCT_JEO * 100);}
$HashData+=[ordered]@{(ls variable:/VKB).Name=[float]($PCT_VKB * 100);}
Chart-HashData RangeColumn 500 500 "Out of State Percentage" "Candidate" "Percent of All Donations from Out of State" "SeeHowTheyFundRaise"

Useing Chart-HashData.ps1  for the .NET 4.0,charts look like this:

No comments: