In [1]:

```
%pylab inline
```

Consulting Geologist at H&S Consultants

In [2]:

```
import math
```

In [3]:

```
loops = 2500000
a = range(1, loops)
def math_function(x):
    return 3 * math.log(x) + math.cos(x) ** 2
%timeit r = [math_function(x) for x in a]
```

It worked, but it was too slow. Let's try it using NumPy.

In [4]:

```
import numpy as np
```

In [5]:

```
def numpy_function(array):
    return 3 * np.log(array) + np.cos(array) ** 2
```

In [6]:

```
a = np.arange(1, loops)
%timeit r = numpy_function(a)
```

In [7]:

```
import numexpr as ne
ne.set_num_threads(4)  # Use four threads for the evaluation
numexpr_func = '3 * log(a) + cos(a) ** 2'  # numexpr resolves `a` from the calling namespace
%timeit r = ne.evaluate(numexpr_func)
```

In [8]:

```
python_list = [1, 2, 3, 4, 5]
python_list*2
```

Out[8]:

Compared to what happens if we multiply a numpy array by two:

In [9]:

```
numpy_array = np.array([1, 2, 3, 4, 5])
numpy_array*2
```

Out[9]:
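Since the outputs aren't reproduced above, the contrast is worth spelling out: multiplying a Python list repeats it, while multiplying a NumPy array applies the operation elementwise. A minimal sketch:

```python
import numpy as np

python_list = [1, 2, 3, 4, 5]
numpy_array = np.array([1, 2, 3, 4, 5])

# List multiplication repeats the list
print(python_list * 2)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
# Array multiplication is elementwise
print(numpy_array * 2)  # [ 2  4  6  8 10]
```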

First of all, Pandas makes it easy to import data from a variety of file formats, including CSV and MS Excel.

To load a CSV file into memory you'll need to tell Pandas the file path. A quick way to do this is to go to the file in Windows Explorer, Shift + right-click it, and select "Copy as path". Then you can paste the whole file path, as I've done below.

In [10]:

```
file_path = r"C:\Projects\Python Presentation\Data\ExampleData.csv"  # Raw string so the backslashes aren't treated as escapes
```

In [11]:

```
import pandas as pd
```
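For completeness, since the cells below generate synthetic data rather than reading the file: loading a CSV is a one-liner with `pd.read_csv`. This sketch reads from an in-memory string so it is self-contained; the column names are hypothetical.

```python
import io
import pandas as pd

# In practice: df = pd.read_csv(file_path)
# Here we read from an in-memory string so the example runs on its own
csv_text = "hole_id,From,To,Cu_pc\nDH1,0.0,1.5,0.8\nDH1,1.5,3.0,1.2\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 4)
```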

In [12]:

```
num_samples = 2000
num_drill_holes = 20
assay_fields = ['Ag_ppm', 'Au_ppm', 'Co_ppm', 'Cu_pc', 'Fe_pc', 'Mg_pc', 'Pb_ppm', 'Pb_pc', 'S_pc', 'Zn_ppm', 'Zn_pc',
                'Bi_ppm', 'Ni_ppm', 'As_ppm', 'Cr_ppm', 'Zr_ppm', 'Y_ppm', 'W_ppm', 'U_ppm', 'V_ppm', 'Tl_ppm', 'Ti_pc',
                'Th_ppm', 'Sr_ppm', 'Se_ppm', 'Sc_ppm', 'Sb_ppm', 'Rb_ppm', 'P_ppm', 'Nb_ppm', 'Na_pc', 'Mo_ppm',
                'Mn_ppm', 'La_ppm', 'K_pc', 'Cd_ppm', 'Ca_pc', 'Be_ppm', 'Ba_ppm', 'Al_pc']
drill_hole_names = ["DH" + str(i + 1) for i in range(num_drill_holes)] * 100  # Make up the drill hole names (100 records per hole)
drill_hole_names.sort()
means = np.random.random(len(assay_fields))  # Generate random means for each element
stds = np.random.random(len(assay_fields))  # Generate random standard deviations for each element
# Make a DataFrame of all the assays
df_assay = pd.DataFrame(np.random.lognormal(means, stds, (num_samples, len(assay_fields))), columns=assay_fields)
df_assay["hole_id"] = drill_hole_names  # Add the hole_ids to the DataFrame
# Assign a zone code (e.g. area code) and oxidation code to each sample at random
df_assay["Zone Code"] = np.random.choice([1, 2, 3, 4, 99], num_samples)
df_assay["Ox"] = np.random.choice([2, 3], num_samples)
# Randomly select an interval length from the listed numbers
df_assay["Interval"] = np.random.choice([0.5, 0.8, 1, 1.5], num_samples)
# Calculate the To depths from the cumulative sum of the Intervals for each drill hole
df_assay["To"] = df_assay.groupby("hole_id")["Interval"].cumsum()
df_assay["From"] = df_assay["To"].shift(1)  # Get the From depth from the To depth of the record above
# Assign zero depth to the first record of every drill hole (.loc replaces the deprecated .ix)
df_assay.loc[np.arange(0, num_samples, num_samples // num_drill_holes), "From"] = 0
# Define the order of columns when looking at the DataFrame
df_assay = df_assay[["hole_id", "From", "To", "Interval", "Ox", "Zone Code"] + assay_fields]
```

In [13]:

```
df_assay.shape
```

Out[13]:

In [14]:

```
df_assay.columns
```

Out[14]:

In [15]:

```
econ_elems = ['Cu_pc', 'Pb_pc', 'Zn_pc', 'Ag_ppm', 'Au_ppm', 'S_pc']
```

In [16]:

```
df_assay[econ_elems].describe()
```

Out[16]:

Or, if we're going to use it a lot, we can assign it to a shorter name, e.g.

In [17]:

```
df_econ = df_assay[econ_elems].copy()  # .copy() avoids a SettingWithCopyWarning when we add columns later
df_econ.head(10)  # Shows the first 10 rows of the DataFrame
```

Out[17]:

Pandas does have some basic statistical methods built into it. One handy thing about Pandas statistical methods is that they take care of missing data (NaN) for you.
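To see that NaN handling in action, here is a tiny self-contained sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
# NaN is skipped, not treated as zero: (1 + 3) / 2
print(s.mean())  # 2.0
# Pass skipna=False if you want the NaN to propagate instead
print(s.mean(skipna=False))  # nan
```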

This webpage introduces some of the basic functionality that Pandas offers. For example, in order to get the mean for each column:

In [18]:

```
df_econ.mean()
```

Out[18]:

In [19]:

```
grouped = df_assay.groupby("Zone Code")
grouped[econ_elems].mean()
```

Out[19]:

In [20]:

```
grouped = df_assay.groupby(["Ox","Zone Code"])
grouped[econ_elems].mean()
```

Out[20]:

In [21]:

```
def grouped_weighted_average(df, elem_list, weight_fn, groupby_fns):
    grouped_df = df.groupby(groupby_fns)
    df_averages = grouped_df[elem_list].sum()  # This just creates a DataFrame with the correct columns and groups
    for elem in elem_list:
        df_averages[elem] = grouped_df.apply(lambda x: np.average(x[elem], weights=x[weight_fn]))
    df_averages = np.round(df_averages, 3)  # Round the numbers to 3 decimal places
    return df_averages
```

In [22]:

```
grouped_weighted_average(df_assay, econ_elems, "Interval", ['Ox', "Zone Code"])
```

Out[22]:

In [23]:

```
import os
file_dir, file_name = os.path.split(file_path)
output_path = os.path.join(file_dir, file_name[:-4])  # Strip the ".csv" extension
if not os.path.exists(output_path):
    os.makedirs(output_path)
output_path
```

Out[23]:

In [24]:

```
wt_averages = grouped_weighted_average(df_assay, econ_elems, "Interval", ['Ox', "Zone Code"])
wt_averages.to_csv(os.path.join(output_path, "wt_averages.csv"))
```

In [25]:

```
def get_r2(df, elem_list, groupby_fns=None):
    if groupby_fns is not None:
        df = df.groupby(groupby_fns)
    df_corr = df[elem_list].corr() ** 2  # Squared because .corr() gives R but we're used to R squared
    return df_corr
```

In [26]:

```
get_r2(df_assay, econ_elems, groupby_fns = ["Ox", "Zone Code"])
```

Out[26]:

In [27]:

```
Au_equiv = df_econ["Au_ppm"] * 0.6
Ag_equiv = df_econ["Ag_ppm"] * 0.01
Pb_equiv = df_econ["Pb_pc"] * 0.2
Zn_equiv = df_econ["Zn_pc"] * 0.1
df_econ["Cu_Equivalent"] = Au_equiv + Ag_equiv + Pb_equiv + Zn_equiv + df_econ["Cu_pc"]
df_econ.head()
```

Out[27]:

Pandas does have some handy but basic plotting functions. Personally, I tend not to use them, as I prefer to write more customised plots directly in Matplotlib.

- 'bar' or 'barh' for bar plots
- 'hist' for histograms
- 'box' for boxplots
- 'kde' or 'density' for density plots
- 'area' for area plots
- 'scatter' for scatter plots
- 'hexbin' for hexagonal bin plots
- 'pie' for pie plots

This webpage gives a few examples of what can be done using Pandas

Here is a very quick and dirty way to take a good look at correlations between elements:

In [28]:

```
temp = pd.plotting.scatter_matrix(df_econ)  # pd.scatter_matrix in older versions of Pandas
```

Here's an example to make a histogram:

In [29]:

```
df_assay["Cu_pc"].hist(bins=40, alpha = 0.6)
```

Out[29]:

In [30]:

```
fig, ax = plt.subplots()
fig.set_size_inches((10, 6))
np.log10(df_assay["Cu_pc"]).hist(bins=40, ax=ax, alpha=0.3, color="g")
# Relabel the log-scale ticks with the original (untransformed) values
labels = ax.get_xticks()
new_labels = plt.xticks(labels, np.round(np.power(10, labels), 2))
#plt.savefig(os.path.join(output_path, "Cu_Hist1.png"), dpi=600)
```

In [31]:

```
df_assay.boxplot(column="Cu_pc", by=["Zone Code"], rot=90)
```

Out[31]:

In [32]:

```
fig, ax = plt.subplots()
fig.set_size_inches((10, 6))
df_assay.boxplot(column="Cu_pc", by=["Ox", "Zone Code"], ax=ax, rot=90)
ax.set_yscale('log')
#plt.savefig(os.path.join(output_path, "Cu_Boxplot.png"), dpi=600)
```

In [33]:

```
def plot_pandas_bp(df, element_fn, groupby_fns, save_folder=None):
    fig, ax = plt.subplots()
    fig.set_size_inches((10, 6))
    df.boxplot(column=element_fn, by=groupby_fns, ax=ax, rot=90)  # Use the element passed in, not a global
    # Add some weighted averages to the boxplot
    wt_means = grouped_weighted_average(df, [element_fn], "Interval", groupby_fns)
    x_locations = np.arange(1, len(wt_means) + 1)  # Pandas plots the boxplots on the x axis starting from 1
    plt.scatter(x_locations, wt_means, color="red")
    ax.set_yscale('log')
    # Get rid of the "Boxplot grouped by ..." title
    fig.suptitle('')
    # Optionally save the picture
    if save_folder is not None:
        save_path = os.path.join(save_folder, "Boxplot of " + element_fn + " by " + str(groupby_fns))
        plt.savefig(save_path)
```

In [34]:

```
for elem in econ_elems:
plot_pandas_bp(df_assay, elem, ["Ox","Zone Code"])#, save_folder = output_path)
```

In [59]:

```
def grouped_scatter(df, x_fn, y_fn, groupby_fns):
    fig, ax = plt.subplots()
    fig.set_size_inches((10, 6))
    grouped = df.groupby(groupby_fns)
    for group_name, group_data in grouped:
        ax.plot(group_data[x_fn], group_data[y_fn], marker='o', linestyle='', ms=8, label=group_name)
    ax.set_xlabel(x_fn)
    ax.set_ylabel(y_fn)
    ax.grid()
    legend = ax.legend(numpoints=1, loc='best')
```

In [60]:

```
grouped_scatter(df_assay, "Cu_pc", "Ag_ppm", "Zone Code")
```

In [37]:

```
"""
http://matplotlib.org/examples/statistics/histogram_demo_multihist.html
Demo of the histogram (hist) function with multiple data sets.
Plot histogram with multiple sample sets and demonstrate:
* Use of legend with multiple sample sets
* Stacked bars
* Step curve with a color fill
* Data sets of different sample sizes
"""
import numpy as np
import matplotlib.pyplot as plt
n_bins = 10
x = np.random.randn(1000, 3)
fig, axes = plt.subplots(nrows=2, ncols=2)
ax0, ax1, ax2, ax3 = axes.flat
colors = ['red', 'tan', 'lime']
ax0.hist(x, n_bins, density=True, histtype='bar', color=colors, label=colors)  # density replaces the deprecated normed
ax0.legend(prop={'size': 10})
ax0.set_title('bars with legend')
ax1.hist(x, n_bins, density=True, histtype='bar', stacked=True)
ax1.set_title('stacked bar')
ax2.hist(x, n_bins, histtype='step', stacked=True, fill=True)
ax2.set_title('stepfilled')
# Make a multiple-histogram of data-sets with different length.
x_multi = [np.random.randn(n) for n in [10000, 5000, 2000]]
ax3.hist(x_multi, n_bins, histtype='bar')
ax3.set_title('different sample sizes')
plt.tight_layout()
```

In [38]:

```
drill_1 = df_assay[df_assay["hole_id"]== "DH1"]
```

In [39]:

```
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True)
fig.set_size_inches((6, 6))
ax1.plot(drill_1["Cu_pc"], drill_1["From"] * -1, c="r")
ax1.set_xlabel("Cu_pc")
ax2.plot(drill_1["Pb_pc"], drill_1["From"] * -1, c="b")
ax2.set_xlabel("Pb_pc")
ax3.plot(drill_1["Ag_ppm"], drill_1["From"] * -1, c="g")
ax3.set_xlabel("Ag_ppm")
fig.subplots_adjust(wspace=0)
```

In [40]:

```
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
#Creating some data, with each coordinate and the values stored in separated lists
x = [10,60,40,70,10,50,20,70,30,60]
y = [10,20,30,30,40,50,10,70,80,90]
values = [1,2.2,1,3,4,6,7,7.5,8.3,10]
#Creating the output grid (100x100, in the example)
ti = np.linspace(0, 100.0, 100)
XI, YI = np.meshgrid(ti, ti)
#Creating the interpolation function and populating the output matrix value
rbf = interpolate.Rbf(x, y, values, function='linear')  # other functions to try: 'thin_plate', 'multiquadric'
ZI = rbf(XI, YI)
# Plotting the result
n = plt.Normalize(0.0, 100.0)
fig, ax = plt.subplots()
fig.set_size_inches((10, 6.18))
plt.pcolor(XI, YI, ZI)
plt.scatter(x, y, s=np.array(values) * 100, c=values)
plt.colorbar()
#Add some contours
CS = plt.contour(XI, YI, ZI, 10, colors='k')
plt.clabel(CS, inline=1, fontsize=10)
plt.title('RBF interpolation')
plt.xlim(0, 100)
plt.ylim(0, 100)
```

Out[40]:

When I was preparing for this talk I was planning to demonstrate how easy it is to use Principal Component Analysis (PCA) and apply it to geological applications. This webpage by Sebastian Raschka really helped me understand what PCA does; if you're interested in using PCA it's worth taking the time to go through it. It shows a handy trick to help decide how many components you may want to reduce the data to. At the Geochemistry course last week at UNSW I learned that Factor Analysis is more applicable to geological applications. A quick Google search told me that Factor Analysis is also available from scikit-learn, and even how it differs from PCA ("*PCA assumes equal noise variance for each feature*"). Unsurprisingly, all the Factor Analysis functions are the same as the PCA functions, so if you want to test out PCA just change "decomposition.FactorAnalysis" to "decomposition.PCA".

The following is an example of how to use Factor Analysis and K-means clustering to take a guess at classifying rock types. Depending on your application you may need to massage the data first to remove extreme values and outliers. There are various ways to do this, but using Mahalanobis distances may help; this webpage from scikit-learn shows how it can be applied.
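As a minimal sketch of that outlier-trimming idea (not a step used later in this notebook): scikit-learn's `EllipticEnvelope` fits a robust covariance estimate and flags points with large Mahalanobis distances. The contamination fraction and the synthetic data below are assumptions for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
X[:10] += 8  # inject some obvious outliers

# Fit a robust covariance estimate; predict() returns -1 for outliers, 1 for inliers
detector = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)
inlier_mask = detector.predict(X) == 1
X_clean = X[inlier_mask]
print(X_clean.shape[0])  # roughly 190 of the 200 samples are kept
```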

In [41]:

```
from sklearn import decomposition
from sklearn import cluster
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
```

First of all we need to select the columns of interest that we want to base the classification on.

In [42]:

```
cois =['Al_pc', 'Ba_ppm', 'Ca_pc', 'Cr_ppm', 'Fe_pc', 'K_pc', 'Mg_pc', 'Na_pc', 'Ti_pc', 'V_ppm']#'S_pc',
```

In [43]:

```
df_fa = df_assay[cois].dropna()
df_fa.shape
```

Out[43]:

In [44]:

```
standardised = StandardScaler().fit_transform(df_fa)
```

We then set the number of components to look at. In this example we're choosing seven because I have found that it works well on this dataset. Check out the link to Sebastian Raschka's example to see how to work out the eigenvalues for each eigenvector.
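That trick can be sketched with PCA (Factor Analysis doesn't expose variance ratios directly): inspect the cumulative explained variance and cut off where it flattens. The 90% threshold and random data below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).normal(size=(500, 10))
pca = PCA().fit(X)
# Cumulative fraction of variance explained by the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components explaining at least 90% of the variance
n_components = int(np.searchsorted(cum_var, 0.90) + 1)
```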

We then need to create the Factor Analysis instance and then fit the data and transform it.

In [45]:

```
n_components = 7
FA = decomposition.FactorAnalysis(n_components= n_components)
Y_FA = FA.fit_transform(standardised)
```

That's the Factor Analysis completed. Piece of cake! Next we need to classify the data into groups. There are many ways to do this in scikit-learn; an example of eight different methods is shown near the end of this notebook. Here we're going to use k-means cluster analysis to split the data into separate groups. This is an example of unsupervised learning.

First we create the instance and then fit the data. Then we add a column to our dataframe with the labels that the k-means clustering process has assigned to each record.

In [52]:

```
k_means = cluster.KMeans(init='k-means++', n_clusters= n_components, random_state= 456)
k_means.fit(Y_FA)
df_fa["K_means_Groups"] = k_means.labels_
```

In [53]:

```
count_order = df_fa["K_means_Groups"].value_counts().index # Lists the groups in order of counts
dictionary = dict(zip(count_order, range(1, len(count_order)+1))) # Creates a dictionary of groups and the new group names
df_fa["K_means_Groups"].replace(dictionary, inplace = True) # Replaces the groups with the new group names
df_fa["K_means_Groups"].value_counts() # Displays the new group names so we can look at them
```

Out[53]:

In [54]:

```
df_assay_mod = df_assay.join(df_fa[["K_means_Groups"]])
df_assay_mod.to_csv(os.path.join(output_path, "Test_kmeans.csv"))
```

In [55]:

```
np.round(df_fa.groupby("K_means_Groups").mean(),2)
```

Out[55]:

In [56]:

```
grouped_scatter(df_assay_mod, "Ca_pc", "Na_pc", "K_means_Groups")
```

In [51]:

```
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
np.random.seed(0)
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
clustering_names = [
    'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
    'SpectralClustering', 'Ward', 'AgglomerativeClustering',
    'DBSCAN', 'Birch']
plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)
plot_num = 1
datasets = [noisy_circles, noisy_moons, blobs, no_structure]
for i_dataset, dataset in enumerate(datasets):
    X, y = dataset
    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)
    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)
    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)
    # create clustering estimators
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=2)
    ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
                                           connectivity=connectivity)
    spectral = cluster.SpectralClustering(n_clusters=2,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=.2)
    affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                       preference=-200)
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock", n_clusters=2,
        connectivity=connectivity)
    birch = cluster.Birch(n_clusters=2)
    clustering_algorithms = [
        two_means, affinity_propagation, ms, spectral, ward, average_linkage,
        dbscan, birch]
    for name, algorithm in zip(clustering_names, clustering_algorithms):
        # predict cluster memberships
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)
        # plot
        plt.subplot(4, len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)
        if hasattr(algorithm, 'cluster_centers_'):
            centers = algorithm.cluster_centers_
            center_colors = colors[:len(centers)]
            plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1
```