Pandashells
Introduction
For decades, system administrators, dev-ops engineers and data analysts have been piping textual data between unix tools such as grep, awk, sed, etc. Chaining these tools together provides an extremely powerful workflow.
The more recent emergence of the "data-scientist" has resulted in the increasing popularity of tools like R, Pandas, IPython, etc. These tools have amazing power for transforming, analyzing and visualizing data-sets in ways that grep, awk, sed, and even the dreaded perl-one-liner could never accomplish.
Pandashells is an attempt to marry the expressive, concise workflow of the shell pipeline with the statistical and visualization tools of the python data-stack.
What is Pandashells?
- A set of command-line tools for working with tabular data
- Easily read/write data in CSV or space-delimited formats
- Quickly aggregate, join, and summarize tabular data
- Compute descriptive statistics
- Perform spectral decomposition and linear regression
- Create data visualizations that can be saved to images or rendered interactively using either a native backend or html.
- Easily integrate with unix tools like curl, awk, grep, sed, etc.
If you work with data using Python, you have almost certainly encountered Pandas, SciPy, Matplotlib, Statsmodels and Seaborn. Pandashells opens up a bash API into the python data stack with command syntax closely mirroring the underlying libraries on which it is built. This should allow those familiar with the python data stack to be immediately productive.
Installation
Pandashells is a pure-python package, but depends heavily on other packages which are not. By far the fastest and most painless way to get started with Pandashells is to install the Miniconda package manager, and then simply run
[~]$ conda install -c https://conda.anaconda.org/robdmc pandashells
Note that this command will also work if you are using the much heavier Anaconda Python Distribution.
If you prefer to manage your own dependencies, you can install Pandashells with pip using the command
[~]$ pip install pandashells # does NOT automatically install dependencies (see below)
or for the development version (could be unstable)
[~]$ pip install --upgrade git+https://github.com/robdmc/pandashells.git
Requirements
Pandashells is compatible with both Python 2 and Python 3 and was developed using the Anaconda Python Distribution. We strongly recommend using Anaconda or Miniconda to run Pandashells, because they take care of installing the required external and system libraries.
There is no requirements file with pandashells because some of the tools only require the standard library, and there's no sense installing unnecessary packages if you only want to use that subset of tools. If a particular tool encounters a missing dependency, it will gracefully fail with an informative message detailing the steps required for installing the missing dependency.
Below is a comprehensive list of packages that pandashells imports.
- gatspy
- matplotlib
- mpld3
- numpy
- pandas
- scipy
- seaborn
- statsmodels
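The graceful-failure behavior described above can be sketched as follows. This is a minimal illustration using the package names from the list, not Pandashells' actual implementation: the helpers `find_missing` and `install_hint` and the wording of the message are hypothetical.

```python
import importlib.util

# Package names taken from the list above
PACKAGES = ["gatspy", "matplotlib", "mpld3", "numpy",
            "pandas", "scipy", "seaborn", "statsmodels"]


def find_missing(module_names):
    """Return the top-level module names that cannot be imported.

    Hypothetical helper for illustration; not part of Pandashells.
    """
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


def install_hint(missing):
    """Build an install hint for missing modules (wording is an assumption)."""
    if not missing:
        return ""
    return (
        "Missing required packages: " + ", ".join(missing)
        + "\nInstall them with: pip install " + " ".join(missing)
    )


missing = find_missing(PACKAGES)
print(install_hint(missing) or "All listed packages are importable.")
```

Checking for a module with `find_spec` avoids importing it, so the check stays cheap even for heavy packages.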
Important: If you want to use pandashells without interactive visualizations (e.g. on a VM without X-forwarding), but would like to retain the ability to create static-image or html-based visualizations, you may need to configure pandashells to use the Agg backend as follows:
p.config --plot_backend Agg
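For comparison, this is how the Agg backend is selected in plain Matplotlib to write a static image on a machine with no display. Whether `p.config` sets the backend in exactly this way internally is an assumption, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib

# Select the non-interactive Agg backend before importing pyplot,
# so that no display (X server) is needed.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], "o-")
ax.set_title("Rendered without a display")

# Hypothetical output path inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "example.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path)
```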
Overview
All Pandashells executables begin with a "p." This is designed to work nicely with the bash-completion feature. If you can't remember the exact name of a command, simply typing p.[tab] will show you a complete list of all Pandashells commands.
Every command can be run with the -h option to view its help. Each help message contains multiple examples of how to use the tool.
Pandashells is equipped with a tool to generate sample csv files. This tool provides standardized inputs for use in the tool help sections as well as this documentation.
[~]$ p.example_data -h
Tool Descriptions
Tool | Purpose |
---|---|
p.cdf | Plot empirical distribution function |
p.config | Set default Pandashells configuration options |
p.crypt | Encrypt/decrypt files using OpenSSL |
p.df | Pandas dataframe manipulation of text files |
p.example_data | Create sample csv files for training/testing |
p.facet_grid | Create faceted plots for data exploration |
p.format | Render python string templates using input data |
p.hist | Plot histograms |
p.linspace | Generate a linearly spaced series of numbers |
p.lomb_scargle | Generate Lomb-Scargle spectrogram of input time series |
p.merge | Merge two data files by specifying join keys |
p.parallel | Read shell commands from stdin and run them in parallel |
p.plot | Create xy plot visualizations |
p.rand | Generate random numbers |
p.regplot | Quickly plot a polynomial regression fit to data |
p.regress | Perform (multi-variate) linear regression with R-like patsy syntax |
p.sig_edit | Remove outliers using iterative sigma-editing |
DataFrame Manipulations
Pandashells allows you to specify multiple dataframe operations in a single command. Each operation assumes data is in a dataframe named df. Operations performed on this dataframe will overwrite the df variable with the results of that operation. Special consideration is taken for assignments such as df['a'] = df.b + 1. These are understood to augment the input dataframe with a new column. By way of example, this command at the bash prompt:
p.df 'df["c"] = 2 * df.b' 'df.groupby(by="a").c.count()' 'df.reset_index()'
is equivalent to the following python snippet:
# this code in a python script
df["c"] = 2 * df.b
df = df.groupby(by="a").c.count()
df = df.reset_index()
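To make the equivalence concrete, the sketch below runs the same three operations on a small dataframe with columns `a` and `b`. The input values are invented for illustration, and the reassignment pattern is a reading of the description above rather than the tool's source code.

```python
import pandas as pd

# Small made-up input; p.df would instead read its data from stdin
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})

# Assignment statement: augments df with a new column
df["c"] = 2 * df.b

# Expression: the result replaces df (here, a Series indexed by "a")
df = df.groupby(by="a").c.count()

# Expression: the index becomes a regular column again
df = df.reset_index()

print(df)
```

After the last step, `df` is again a dataframe, with one row per distinct value of `a` and the row count in column `c`.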
Shown below are several examples of how to use the p.df tool. You are encouraged to copy/paste these commands to your bash prompt to see pandashells in action.
Show a few rows of an example data set.
[~]$ p.example_data -d tips | head
"total_bill","tip","sex","smoker","day","time","size"
16.99,1.01,"Female","No","Sun","Dinner",2
10.34,1.66,"Male","No","Sun","Dinner",3
21.01,3.5,"Male","No","Sun","Dinner",3
23.68,3.31,"Male","No","Sun","Dinner",2
Transform the sample data from csv format to table format
[~]$ p.example_data -d tips | p.df 'df.head()' -o table
total_bill   tip     sex  smoker  day    time  size
     16.99  1.01  Female      No  Sun  Dinner     2
     10.34  1.66    Male      No  Sun  Dinner     3
     21.01  3.50    Male      No  Sun  Dinner     3
     23.68  3.31    Male      No  Sun  Dinner     2
     24.59  3.61  Female      No  Sun  Dinner     4
Compute statistics for numerical fields in the data set.
[~]$ p.example_data -d tips | p.df 'df.describe().T' -o table index
            count       mean       std   min      25%     50%      75%    max
total_bill    244  19.785943  8.902412  3.07  13.3475  17.795  24.1275  50.81
tip           244   2.998279  1.383638  1.00   2.0000   2.900   3.5625  10.00
size          244   2.569672  0.951100  1.00   2.0000   2.000   3.0000   6.00
Find the mean tip broken down by gender and day
[~]$ p.example_data -d tips | p.df 'df.groupby(by=["sex","day"]).tip.mean()' -o table index
                  tip
sex    day
Female Fri   2.781111
       Sat   2.801786
       Sun   3.367222
       Thur  2.575625
Male   Fri   2.693000
       Sat   3.083898
       Sun   3.220345
       Thur  2.980333
Join files on key fields
Pandashells can join files based on a set of key fields. This example uses only one field as a key, but like the pandas merge function on which it is based, multiple key fields can be used for the join.
Show poll results for the 2008 US presidential election
[~]$ p.example_data -d election | p.df -o table | head
days  state  obama  mccain         poll
-305     OH     43      50    SurveyUSA
-303     PA     38      46    Rasmussen
-298     OR     47      47    SurveyUSA
-298     WA     52      43    SurveyUSA
-294     AL     29      63    SurveyUSA
-294     NY     44      42  Siena Coll.
-294     VA     40      52    SurveyUSA
-290     NM     41      50    SurveyUSA
-290     NY     49      43    SurveyUSA
Show population and electoral college numbers for states
[~]$ p.example_data -d electoral_college | p.df -o table | head
state           name  electors  population
   AK         Alaska         3      710000
   AL        Alabama         9     4780000
   AR       Arkansas         6     2916000
   AZ        Arizona        11     6392000
   CA     California        55    37254000
   CO       Colorado         9     5029000
   CT    Connecticut         7     3574000
   DC  Dist. of Col.         3      602000
   DE       Delaware         3      898000
Join poll and electoral-college data (Note the use of bash process substitution to specify files to join.)
[~]$ p.merge <(p.example_data -d election) <(p.example_data -d electoral_college) --how left --on state | p.df -o table | head
days  state  obama  mccain                 poll    name  electors  population
-252     AK     43      48            SurveyUSA  Alaska         3      710000
-213     AK     43      48            Rasmussen  Alaska         3      710000
-176     AK     41      50            Rasmussen  Alaska         3      710000
-143     AK     41      45            Rasmussen  Alaska         3      710000
-112     AK     40      45            Rasmussen  Alaska         3      710000
 -99     AK     39      44            Rasmussen  Alaska         3      710000
 -65     AK     35      54  Ivan Moore Research  Alaska         3      710000
 -58     AK     33      64            Rasmussen  Alaska         3      710000
 -56     AK     39      55                  ARG  Alaska         3      710000
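In pandas terms, the join above corresponds roughly to a left merge on the `state` column. The sketch below uses a few rows copied from the sample outputs shown earlier. That `p.merge` forwards `--how` and `--on` to `pandas.merge` in exactly this way is an assumption.

```python
import pandas as pd

# A few rows taken from the election sample shown above
polls = pd.DataFrame({
    "days": [-305, -303, -294],
    "state": ["OH", "PA", "AL"],
    "obama": [43, 38, 29],
    "mccain": [50, 46, 63],
    "poll": ["SurveyUSA", "Rasmussen", "SurveyUSA"],
})

# A few rows taken from the electoral_college sample shown above
college = pd.DataFrame({
    "state": ["AK", "AL"],
    "name": ["Alaska", "Alabama"],
    "electors": [3, 9],
    "population": [710000, 4780000],
})

# Left join: keep every poll row, attach state data where the key matches
merged = pd.merge(polls, college, how="left", on="state")
print(merged)
```

Poll rows whose state has no match in the right-hand table (OH and PA in this subset) are kept, with missing values in the joined columns. To join on more than one key, pass a list to `on`.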
Visualization Tools
Pandashells provides a number of visualization tools to help you quickly explore your data. All visualizations are automatically configured to show an interactive plot using the configured backend (default is TkAgg, but can be configured with the p.config tool).
The visualizations can also be saved to image files (e.g. .png) or rendered to html. The html generated can either be opened directly in the browser to show an interactive plot (using mpld3), or can be embedded in an existing html file. The examples below show Pandashells-created png images along with the command used to generate them.
Simple xy scatter plots
[~]$ p.example_data -d tips | p.plot -x total_bill -y tip -s 'o' --title 'Tip Vs Bill'
Faceted plots
[~]$ p.example_data -d tips | p.facet_grid --row smoker --col sex --hue day --map pl.scatter --args total_bill tip --kwargs 'alpha=.2' 's=100'
Histogram plots (Note the use of bash process substitution to paste two outputs together.)
[~]$ paste <(p.rand -t normal -n 10000 | p.df --names normal) <(p.rand -t gamma -n 10000 | p.df --names gamma) | p.hist -i table -c normal gamma
Empirical cumulative distribution plots
[~]$ p.rand -t normal -n 500 | p.cdf -c value --names value
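For readers unfamiliar with the term, an empirical cumulative distribution function assigns to each value the fraction of samples less than or equal to it. The standard-library sketch below computes the points of such a curve for 500 normal samples. It illustrates the concept only and is not how `p.cdf` is implemented; the function name `ecdf` is hypothetical.

```python
import random


def ecdf(samples):
    """Return (sorted_values, cumulative_fractions) for the samples."""
    xs = sorted(samples)
    n = len(xs)
    # The curve steps up by 1/n at each sorted sample
    fractions = [(i + 1) / n for i in range(n)]
    return xs, fractions


random.seed(0)  # fixed seed so the run is repeatable
samples = [random.gauss(0, 1) for _ in range(500)]
xs, fractions = ecdf(samples)

# Smallest and largest samples with their cumulative fractions
print(xs[0], fractions[0])
print(xs[-1], fractions[-1])
```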
Spectral Estimation
Plot a time series over which to compute a spectrum
[~]$ p.example_data -d sealevel | p.plot -x year -y sealevel_mm
Plot the spectrum
[~]$ p.example_data -d sealevel | p.lomb_scargle -t year -y sealevel_mm --interp_exp 3 | p.plot -x period -y amp --xlim 0 1.5 --ylim 0 6.5 --xlabel 'Period years' --ylabel 'Amplitude (mm)' --title 'Global Sea Surface Height Spectrum'
Linear Regression
Pandashells leverages the excellent Seaborn and Statsmodels libraries to handle linear regression.
Quick and dirty fit to a line
[~]$ p.linspace 0 10 20 | p.df 'df["y_true"] = .2 * df.x' 'df["noise"] = np.random.randn(20)' 'df["y"] = df.y_true + df.noise' --names x | p.regplot -x x -y y
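As a rough numerical counterpart to the plot, the sketch below fits a straight line by ordinary least squares to the same kind of data: 20 points on [0, 10], true slope 0.2, unit Gaussian noise. It uses only the standard library. The helper `fit_line` is hypothetical, and two things are assumptions rather than facts from this document: that `p.linspace` includes both endpoints, as `numpy.linspace` does, and that the line drawn by `p.regplot` is an ordinary least-squares fit.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


random.seed(0)  # fixed seed so the run is repeatable

# 20 evenly spaced points from 0 to 10 (endpoints included)
xs = [10 * i / 19 for i in range(20)]
y_true = [0.2 * x for x in xs]
ys = [y + random.gauss(0, 1) for y in y_true]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Sanity check on noise-free data: the true line is recovered
clean_slope, clean_intercept = fit_line(xs, y_true)
print(clean_slope, clean_intercept)
```

With noise of unit standard deviation and only 20 points, the fitted slope can differ noticeably from 0.2, which is also visible in the scatter around the plotted line.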
Multi-variable linear regression
[~]$ p.example_data -d sealevel | p.df 'df["sin"]=np.sin(2*np.pi*df.year)' 'df["cos"]=np.cos(2*np.pi*df.year)' | p.regress -m 'sealevel_mm ~ year + sin + cos'
                            OLS Regression Results
==============================================================================
Dep. Variable:            sealevel_mm   R-squared:                       0.961
Model:                            OLS   Adj. R-squared:                  0.961
Method:                 Least Squares   F-statistic:                     6442.
Date:                Mon, 27 Jul 2015   Prob (F-statistic):               0.00
Time:                        23:28:11   Log-Likelihood:                -2234.0
No. Observations:                 780   AIC:                             4476.
Df Residuals:                     776   BIC:                             4495.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept  -6500.1722     47.829   -135.903      0.000     -6594.063 -6406.282
year           3.2577      0.024    136.513      0.000         3.211     3.305
sin           -4.6933      0.217    -21.650      0.000        -5.119    -4.268
cos            1.4061      0.214      6.566      0.000         0.986     1.826
==============================================================================
Omnibus:                        5.332   Durbin-Watson:                   0.709
Prob(Omnibus):                  0.070   Jarque-Bera (JB):                5.401
Skew:                          -0.189   Prob(JB):                       0.0672
Kurtosis:                       2.846   Cond. No.                     6.29e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.29e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Further examples for each tool can be seen by calling it with the -h switch. You are encouraged to explore these examples fully; they show how Pandashells can make everyday data work at the command line more efficient.