OpenWest 2014/Python Pandas

A Brief Tour of the Python Pandas Package
 * by Matt Harrison (@__mharrison__)

"Python is gaining popularity among Data Scientists. One reason is the Pandas package, which provides facilities for data manipulation. It has facilities similar to Excel, SQL, ETL packages, and more."

NumPy - http://www.numpy.org/ - NumPy is the fundamental package for scientific computing with Python.

SciPy - http://www.scipy.org/ - SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

Pandas - http://pandas.pydata.org/ - Python Data Analysis Library - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Matt Harrison - http://hairysun.com
 * co-chair Utah Python.

Impetus - if this were a Perl class it would be about regexes. Pandas is the weapon of choice for dealing with tabular data in Python.

Pandas is "A nosql in-memory db using Python, that has SQL-like constructs" - Matt's view
 * note: adopts many NumPy-isms that may not look like pure Python

Based on the data frame (tabular data) concept borrowed from R. A data frame is similar to a table in SQL.

Pandas is best for small to medium data, not "Big Data".

Not really good from an ETL perspective (e.g. star schemas)
 * Extract Transform Load - take data from one system to another
 * Data warehousing

Data Structures:
 * Series (1D)
 * TimeSeries (1D) - special Series
 * DataFrame (2D)
 * Panel (3D) - like stacked DataFrames

Series:
 # python version
 ser = { 'index':[0,1,2], 'data':[.5,.6,.7], 'name':'growth', }

 # pandas version
 import pandas as pd
 ser = pd.Series([.5,.6,.7], name='growth')

Behaves like a NumPy array:
 ser[1]
 ser.mean()

Boolean Array:
 ser > ser.median()
 0    False
 1    False
 2     True

Filtering:
 ser[ser > ser.median()]
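
The three ideas above (NumPy-like access, boolean arrays, filtering) fit together in a few lines. This is a minimal sketch reusing the `growth` series from the notes; the variable names `mask` and `above_median` are made up for illustration.

```python
import pandas as pd

ser = pd.Series([.5, .6, .7], name='growth')

second = ser[1]        # lookup by label on the default 0..n-1 index
average = ser.mean()   # mean is a method, so it needs the parentheses

# Comparing a Series with a scalar gives a boolean Series on the same index
mask = ser > ser.median()

# Indexing with that boolean Series keeps only the rows where it is True
above_median = ser[mask]

print(mask)
print(above_median)
```

The median here is .6, so only the last value passes the filter.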

DataFrames - Tables with columns as Series
 # python version, but not a true Pandas DataFrame
 df = { 'index':[0,1,2], 'cols':[ { 'name':'growth', 'data':[.5,.6,1.2] }, { 'name':'Name', 'data':["paul","george","ringo"] }, ] }

 # pandas version
 df = pd.DataFrame({ 'growth':[.5,.7,1.2], 'Name':['paul','george','ringo'] })

Create a DataFrame from: rows (a list of dicts), columns (a dict of lists), a CSV file ***, or directly from a NumPy ndarray
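
A rough sketch of the four input routes listed above. The sample data follows the notes; the inline CSV text and the `rank` column are invented for illustration, and a real script would more likely pass a file path to `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# From rows: a list of dicts, one dict per row
rows = [{'growth': .5, 'Name': 'paul'},
        {'growth': .7, 'Name': 'george'},
        {'growth': 1.2, 'Name': 'ringo'}]
df_rows = pd.DataFrame(rows)

# From columns: a dict of lists, one list per column
df_cols = pd.DataFrame({'growth': [.5, .7, 1.2],
                        'Name': ['paul', 'george', 'ringo']})

# From CSV: read_csv accepts a path or any file-like object
csv_text = "growth,Name\n0.5,paul\n0.7,george\n1.2,ringo\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# From a NumPy ndarray: the column names are supplied separately
arr = np.array([[.5, 1.0], [.7, 2.0], [1.2, 3.0]])
df_arr = pd.DataFrame(arr, columns=['growth', 'rank'])
```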

Two Axes:
 * axis 0 - index: df.axes[0] or df.index
 * axis 1 - columns: df.axes[1] or df.columns
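
The axis numbers matter mostly because many DataFrame methods take an `axis` argument. A small sketch, using `drop` as one example of such a method (the choice of `drop` is ours, not from the talk):

```python
import pandas as pd

df = pd.DataFrame({'growth': [.5, .7, 1.2],
                   'Name': ['paul', 'george', 'ringo']})

row_labels = df.axes[0]   # equivalent to df.index
col_labels = df.axes[1]   # equivalent to df.columns

# axis=0 refers to labels in the index, axis=1 to labels in the columns
without_name = df.drop('Name', axis=1)   # drop a column
without_first = df.drop(0, axis=0)       # drop the row labelled 0
```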

Examine:
 df.columns
 df.describe()
 df.to_string()
 df.test1                  # or df['test1']; pandas creates the attribute for you
 df.test1.median()
 df.test1.corr(df.test2)   # correlation: 1 if the data moves in the same direction, 0 if unrelated, -1 if opposite
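
To try the inspection calls above, here is a small self-contained frame. The column names `test1` and `test2` come from the notes; the scores themselves are invented.

```python
import pandas as pd

# Made-up test scores for illustration
df = pd.DataFrame({'fname': ['paul', 'george', 'ringo'],
                   'test1': [70, 80, 90],
                   'test2': [65, 85, 95]})

print(df.columns)       # column labels
print(df.describe())    # count, mean, std, min, quartiles, max per numeric column
print(df.to_string())   # the whole frame rendered as plain text

scores = df.test1                   # attribute access, same data as df['test1']
median1 = df.test1.median()
corr12 = df.test1.corr(df.test2)    # Pearson correlation, between -1 and 1
```

Both score columns rise together here, so the correlation comes out close to 1.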

Tweaking Data
 * note: most pandas operations return a new object rather than modifying the original in place
 # add row
 df = pd.concat(..)
 # add column
 df['test3'] = 0   # note: df.test3 = 3 does not work!
 # add column with function
 def name_grade(val): ..
 df['test4'] = df.fname.apply(name_grade)
 # remove column
 t3 = df.pop('test3')   # or del df['test3']; note: del df.test3 does not work!
 # rename column
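
A runnable sketch of the tweaks listed above. The body of `name_grade` was not captured in the notes, so the version here (length of the name) is purely a placeholder, and using `rename` for the rename step is our assumption about what the talk showed.

```python
import pandas as pd

df = pd.DataFrame({'fname': ['paul', 'george'],
                   'test1': [70, 80]})

# Add a row: concat returns a new frame, so rebind the name
new_row = pd.DataFrame({'fname': ['ringo'], 'test1': [90]})
df = pd.concat([df, new_row], ignore_index=True)

# Add a column; item assignment is required to create a new column
df['test3'] = 0

# Add a column computed by a function (placeholder body, not from the talk)
def name_grade(val):
    return len(val)

df['test4'] = df.fname.apply(name_grade)

# Remove a column and keep the removed Series
t3 = df.pop('test3')

# Rename a column; rename also returns a new frame
df = df.rename(columns={'test4': 'name_len'})
```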

Fill - statistics ignore NaN, so if missing values should count as zero, fill them in first.
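
A short sketch of the difference filling makes, assuming the "fill" mentioned here is the `fillna` method; the sample values are invented.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

mean_skipping_nan = ser.mean()          # NaN is skipped: (1 + 3) / 2
mean_with_zero = ser.fillna(0).mean()   # NaN counted as 0: (1 + 0 + 3) / 3
```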

Install Pandas (what worked for me):
 pip install pandas
 # yum install gcc-c++

Pivoting - Pivot Tables
 print(pd.pivot_table(..rules..))
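
The arguments to `pivot_table` were not captured in the notes. As one possible illustration, with invented data and column names, the call below puts names on the rows, terms on the columns, and the mean score in each cell:

```python
import pandas as pd

# Invented sample data: one score per person per term
df = pd.DataFrame({'fname': ['paul', 'paul', 'ringo', 'ringo'],
                   'term': ['fall', 'spring', 'fall', 'spring'],
                   'score': [70, 80, 90, 100]})

table = pd.pivot_table(df, index='fname', columns='term',
                       values='score', aggfunc='mean')
print(table)
```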

Serialization
 * dump to CSV, etc
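
A minimal CSV round trip, as a sketch. It writes to an in-memory buffer so the example leaves no file behind; passing a filename to `to_csv` and `read_csv` works the same way.

```python
import io

import pandas as pd

df = pd.DataFrame({'growth': [.5, .7, 1.2],
                   'Name': ['paul', 'george', 'ringo']})

buf = io.StringIO()
df.to_csv(buf, index=False)   # index=False leaves the row labels out
csv_text = buf.getvalue()

# Read the same text back into a new frame
restored = pd.read_csv(io.StringIO(csv_text))
```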

Plotting
 * box plot, etc...

Clipping
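
The notes give no detail on clipping. Assuming it refers to the `clip` method, which limits values to a given range, a small sketch with invented values:

```python
import pandas as pd

ser = pd.Series([.5, .6, 1.2, -0.3])

# Values below 0 become 0, values above 1 become 1, the rest are unchanged
clipped = ser.clip(lower=0, upper=1)
```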

GPS example.