%pylab inline
Consulting Geologist at H&S Consultants
This IPython Notebook was presented at a SMEDG meeting. The idea behind this notebook is to introduce geologists to some of the benefits of using Python for data analysis. You are encouraged to copy and paste the bits of code that you want to use into your own notebook.
In my experience geologists tend to use MS Excel for data analysis, but Python makes it much easier to dig deeper into data and therefore identify relationships that might otherwise be missed. Python code also tends to be more reusable, allowing easier automation of repetitive tasks.
Python's standard library gives us all the tools we need to operate on numbers. Here is an example where we do a simple calculation on the numbers 1 to 2,500,000.
import math as math
loops = 2500000
a = range(1, loops)
def math_function(x):
    return(3 * math.log(x) + math.cos(x) ** 2)
%timeit r = [math_function(x) for x in a]
It worked, but it was slow. Let's try it using NumPy.
import numpy as np
def numpy_function(array):
    return(3 * np.log(array) + np.cos(array)**2)
a = np.arange(1, loops)
%timeit r = numpy_function(a)
Much faster! Around 15 times faster. There are still faster ways to do calculations if you're really interested in speed, but unless you're writing serious code or programs I wouldn't bother. If some bits of your code are taking too long to run, check out numexpr and/or Cython. Here's a quick example with numexpr:
import numexpr as ne
ne.set_num_threads(4)
numexpr_func = '3 * log(a) + cos(a) ** 2'
%timeit r = ne.evaluate(numexpr_func)
Another major reason to use NumPy arrays is that we can operate on them element-wise. Below is an example of what happens if we multiply a Python list by two:
python_list = [1, 2, 3, 4, 5]
python_list*2
Compared to what happens if we multiply a numpy array by two:
numpy_array = np.array([1, 2, 3, 4, 5])
numpy_array*2
First of all, Pandas makes it easy to import data from a variety of file formats, including CSV and MS Excel.
To load a CSV file into memory you'll need to tell Pandas the file path. A quick way to do this is to go to the file in Windows Explorer, Shift + right-click it and then select "Copy as path". Then you can just paste the whole file path as I've done below.
file_path = "C:\Projects\Python Presentation\Data\ExampleData.csv"
It is a strong convention to import all the libraries you will use in a script at the beginning of the script, but we'll ignore that today.
import pandas as pd
The following line of code loads the whole CSV file into memory. If you have a massive CSV file you may want to read the data in chunks using the chunksize option, or load the data into an HDF5 store. That does get a little complicated though, so I try to avoid it.
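For reference, here is a rough sketch of both approaches (this obviously won't run without a real file at file_path; the chunk size and the column summed in the chunked version are just examples):
df_assay = pd.read_csv(file_path) # read the whole file in one go
# Or, for a huge file, process it in chunks to keep memory use down
cu_total = 0.0
for chunk in pd.read_csv(file_path, chunksize=100000):
    cu_total += chunk["Cu_pc"].sum() # "Cu_pc" is just an example column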
Unfortunately I can't supply the data that I used during the presentation. Instead, I'm just going to create some random data so that the code can be run independently. Don't be put off by the amount of code! The fact that the data is random means it won't produce any interesting patterns but if you imported your own data and used some of the code then who knows what you might uncover?
num_samples = 2000
num_drill_holes = 20
assay_fields = ['Ag_ppm', 'Au_ppm', 'Co_ppm', 'Cu_pc', 'Fe_pc', 'Mg_pc', 'Pb_ppm', 'Pb_pc', 'S_pc', 'Zn_ppm', 'Zn_pc',
'Bi_ppm','Ni_ppm', 'As_ppm', 'Cr_ppm', 'Zr_ppm', 'Y_ppm', 'W_ppm', 'U_ppm', 'V_ppm', 'Tl_ppm', 'Ti_pc',
'Th_ppm', 'Sr_ppm','Se_ppm', 'Sc_ppm', 'Sb_ppm', 'Rb_ppm', 'P_ppm','Nb_ppm', 'Na_pc', 'Mo_ppm',
'Mn_ppm', 'La_ppm', 'K_pc', 'Cd_ppm', 'Ca_pc', 'Be_ppm', 'Ba_ppm', 'Al_pc']
drill_hole_names = ["DH"+str(i+1) for i in range(num_drill_holes)]*100 # Make up the drill hole names (100 records for each drill hole)
drill_hole_names.sort()
means = np.random.random(len(assay_fields)) # Generate random means for each element
stds = np.random.random(len(assay_fields)) # Generate random standard deviations for each element
# Make a DataFrame of all the assays
df_assay = pd.DataFrame(np.random.lognormal(means, stds, (num_samples,len(assay_fields))), columns = assay_fields)
df_assay["hole_id"] = drill_hole_names # Add the hole_ids to the DataFrame
# Assign a zone code (e.g. area code) and oxidation code to each sample at random
df_assay["Zone Code"] = np.random.choice([1,2,3,4,99], num_samples)
df_assay["Ox"] = np.random.choice([2,3], num_samples)
# Randomly select an interval length from the listed numbers
df_assay["Interval"] = np.random.choice([0.5,0.8, 1, 1.5], num_samples)
# Calculate the To depths from the cumulative sum of the Intervals for each drill hole
df_assay["To"] = df_assay.groupby("hole_id")["Interval"].cumsum()
df_assay["From"] = df_assay["To"].shift(1) # Get the From depth from the To depth of the record above
# Assign zero depth to the first record of every drill hole
df_assay.loc[np.arange(0, num_samples, num_samples // num_drill_holes), "From"] = 0 # .loc replaces the deprecated .ix
# Define the order of columns when looking at the dataframe
df_assay = df_assay[["hole_id", "From", "To", "Interval", "Ox", "Zone Code"]+assay_fields]
So now that my CSV file/data is in memory we can ask it some questions, e.g. how many rows and columns do we have?
df_assay.shape
This tells me we have 2,000 rows and 46 columns. I'm not interested in all of those columns at the same time, so I'll make a list of the economic elements of interest. We can easily ask what the columns are called using the following:
df_assay.columns
econ_elems = ['Cu_pc', 'Pb_pc', 'Zn_pc', 'Ag_ppm', 'Au_ppm', 'S_pc']
We can then either refer to just these columns in the following way, where we use Pandas' built-in describe method to get some basic stats:
df_assay[econ_elems].describe()
or, if we're going to use it a lot, we can assign it to its own dataframe, e.g.
df_econ = df_assay[econ_elems].copy() # .copy() so we can add columns to df_econ later without a SettingWithCopyWarning
df_econ.head(10) # Shows the first 10 rows of the DataFrame
Pandas has some basic statistical methods built into it. One of the handy things these methods do is take care of missing data (NaN) for you.
This webpage introduces some of the basic functionality that Pandas offers. For example, in order to get the mean for each column:
df_econ.mean()
If we wanted to get a mean for each group in the table, e.g. for each zone, lithology or drill hole, we can use the Pandas groupby function like so:
grouped = df_assay.groupby("Zone Code")
grouped[econ_elems].mean()
We can group by a list of columns as well. Here we group by Oxidation and Zone Code by feeding groupby a list:
grouped = df_assay.groupby(["Ox","Zone Code"])
grouped[econ_elems].mean()
The above can easily be recreated in MS Excel using pivot tables, but if we wanted to do something a little more complicated we can write it as a function and then apply that to each group in the groupby object. Below is a simple example to get weighted averages using NumPy's average function, which accepts weights. There may be simpler ways to do this, but I found that this approach works so I'm happy with it.
def grouped_weighted_average(df, elem_list, weight_fn, groupby_fns):
    grouped_df = df.groupby(groupby_fns)
    df_averages = grouped_df[elem_list].sum() # This just creates a dataframe with the correct columns and groups
    for elem in elem_list:
        df_averages[elem] = grouped_df.apply(lambda x: np.average(x[elem], weights=x[weight_fn]))
    df_averages = np.round(df_averages, 3) # Round the numbers to 3 decimal places
    return(df_averages)
In this example we apply the function above and also demonstrate how the groupby function can take a list instead of a single column name
grouped_weighted_average(df_assay, econ_elems, "Interval", ['Ox', "Zone Code"])
The next bit creates a name for my output path. It takes the directory and the file name from the input file and creates a new folder (if it doesn't already exist) named after the input file (excluding the last 4 characters, which are ".csv").
NB. Saving output files and figures in this script won't work unless you define a file to read the data from or define an output path.
import os
file_dir, file_name = os.path.split(file_path)
output_path = os.path.join(file_dir, file_name[:-4])
if not os.path.exists(output_path):
os.makedirs(output_path)
output_path
If we wanted to save the weighted-averages dataframe we could assign it to a name, e.g. wt_averages, and then use the Pandas .to_csv method to write the file to the output folder, as in the following lines of code.
wt_averages = grouped_weighted_average(df_assay, econ_elems, "Interval", ['Ox', "Zone Code"])
wt_averages.to_csv(os.path.join(output_path, "wt_averages.csv"))
If you wanted to save several dataframes on separate sheets of an Excel file you can do this by setting up a dictionary, assigning each dataframe to the dictionary with the sheet name as the key, and then iterating over the dictionary to save each dataframe. For brevity we'll skip that, but there is a quick sketch below.
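A minimal sketch of that idea (assuming an Excel writer engine such as openpyxl or xlsxwriter is installed; the sheet names and output file name are just examples):
sheets = {"Wt Averages": wt_averages,
          "Econ Stats": df_assay[econ_elems].describe()}
with pd.ExcelWriter(os.path.join(output_path, "summary_tables.xlsx")) as writer:
    for sheet_name, df in sheets.items():
        df.to_excel(writer, sheet_name=sheet_name) # each dataframe goes to its own sheet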
Correlation coefficients are a common thing for geologists to look at. Below is an example of a function that produces Pearson's correlation coefficients for the dataframe. We then square them, as everyone is used to R squared. The groupby fields in this case are optional. The Pandas .corr() function also accepts a method, so we could produce a Spearman rank correlation by using .corr(method="spearman"). Additional info can be found here
def get_r2(df, elem_list, groupby_fns = None):
    if groupby_fns is not None:
        df = df.groupby(groupby_fns)
    df_corr = df[elem_list].corr()**2 # Squared because .corr() gives R but we're used to R squared
    return(df_corr)
get_r2(df_assay, econ_elems, groupby_fns = ["Ox", "Zone Code"])
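For example, a quick way to get the Spearman rank version mentioned above (without the grouping) would be something like:
spearman_r2 = df_assay[econ_elems].corr(method="spearman")**2
spearman_r2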
A lot of the time in MS Excel we are operating on a series of columns and then putting the result in a new column. This is also easy with Pandas.
Au_equiv = df_econ["Au_ppm"]*0.6
Ag_equiv = df_econ["Ag_ppm"]*0.01
Pb_equiv = df_econ["Pb_pc"]*0.2
Zn_equiv = df_econ["Zn_pc"]* 0.1
df_econ["Cu_Equivalent"] = Au_equiv + Ag_equiv + Pb_equiv + Zn_equiv + df_econ["Cu_pc"]
df_econ.head()
The Pandas library gives you some handy, quick plotting options. There are various types and they are all built on top of Matplotlib. Personally, I tend to use Matplotlib directly, as the graphs I make normally require some modification.
This webpage gives a few examples of what can be done using Pandas
Here is a very quick and dirty way to take a good look at correlations between elements:
temp = pd.plotting.scatter_matrix(df_econ) # scatter_matrix now lives in pandas.plotting
Here's an example to make a histogram:
df_assay["Cu_pc"].hist(bins=40, alpha = 0.6)
Often we will need to plot assay data on a log scale to show anything clearly. Below is an example of how we can modify the Pandas histogram by plotting it on a Matplotlib figure. This gives us the opportunity to change the x axis labels back into real numbers rather than log values.
fig, ax = plt.subplots() # create a figure and axes to plot on
fig.set_size_inches((10, 6))
np.log10(df_assay["Cu_pc"]).hist(bins=40, ax=ax, alpha = 0.3, color="g")
labels = ax.get_xticks()
new_labels = plt.xticks(labels, np.round(np.power(10, labels),2))
#plt.savefig(output_path + "\\Cu_Hist1.png", dpi = 600)
Boxplots are also easy and can be grouped by a field in the dataframe. If a list of element field names is supplied, separate subplots will be produced. The groupby fields can also be a list.
df_assay.boxplot(column= "Cu_pc", by = ["Zone Code"], rot = 90)
Below is an example of the same boxplot on a log scale. The last line (commented out) saves the figure to the output folder. dpi controls the dots per inch: the higher the dpi, the longer it takes to plot and the larger the file size. I've found that dpi=600 is about as high as you need to go for putting anything into MS Word. We're using randomly generated data so it doesn't tell us much, but you get the idea.
fig, ax = plt.subplots() # create a figure and axes to plot on
fig.set_size_inches((10, 6))
df_assay.boxplot(column= "Cu_pc", by = ["Ox","Zone Code"], ax=ax, rot = 90)
ax.set_yscale('log')
#plt.savefig(output_path + "\\Cu_Boxplot.png", dpi = 600)
If we want to use this plot again and again it's a good idea to put it in a function as in the following. Here we also use the weighted average function that we created earlier to calculate the averages and then plot them on the same graph. The option to save the figure has also been added.
def plot_pandas_bp(df, element_fn, groupby_fns, save_folder= None):
    fig, ax = plt.subplots()
    fig.set_size_inches((10, 6))
    df.boxplot(column= element_fn, by = groupby_fns, ax=ax, rot = 90)
    # Add some weighted averages to the boxplot
    wt_means = grouped_weighted_average(df, [element_fn], "Interval", groupby_fns)
    x_locations = np.arange(1, len(wt_means)+1, 1) # Pandas plots the boxplots on the x axis starting from 1
    plt.scatter(x_locations, wt_means, color = "red")
    ax.set_yscale('log')
    # Get rid of the "Boxplot grouped by ..." title
    fig.suptitle('')
    # Optionally save the picture
    if save_folder is not None:
        save_dir = save_folder + "\\Boxplot of " + element_fn + " by " + str(groupby_fns) + ".png"
        plt.savefig(save_dir)
It is then easy to loop through a list like we do here. This will produce a figure for each element in our econ_elems list.
for elem in econ_elems:
    plot_pandas_bp(df_assay, elem, ["Ox","Zone Code"])#, save_folder = output_path)
One of the things that really bugs me with MS Excel is the amount of faffing around needed to create plots that show multiple series on the same graph. In the following example we create a scatter plot with the help of the Pandas groupby function. Here we actually iterate over the grouped data in order to plot each group in a different colour. We'll put it in a function so we can use it again later if we want to.
def grouped_scatter(df, x_fn, y_fn, groupby_fns):
    fig, ax = plt.subplots() # create a figure and axes to plot on
    fig.set_size_inches((10, 6))
    grouped = df.groupby(groupby_fns)
    for group_name, group_data in grouped:
        ax.plot(group_data[x_fn], group_data[y_fn], marker='o', linestyle='', ms=8, label=group_name)
    ax.set_xlabel(x_fn)
    ax.set_ylabel(y_fn)
    ax.grid()
    legend = ax.legend(numpoints=1, loc= 'best')
grouped_scatter(df_assay, "Cu_pc", "Ag_ppm", "Zone Code")
There are loads of examples on the web that you can copy straight into your code and modify as you like. The Matplotlib gallery is a good place to check out what is available. Here is an example I copied straight out of the gallery (as I'm running out of time!)
"""
http://matplotlib.org/examples/statistics/histogram_demo_multihist.html
Demo of the histogram (hist) function with multiple data sets.
Plot histogram with multiple sample sets and demonstrate:
* Use of legend with multiple sample sets
* Stacked bars
* Step curve with a color fill
* Data sets of different sample sizes
"""
import numpy as np
import matplotlib.pyplot as plt
n_bins = 10
x = np.random.randn(1000, 3)
fig, axes = plt.subplots(nrows=2, ncols=2)
ax0, ax1, ax2, ax3 = axes.flat
colors = ['red', 'tan', 'lime']
ax0.hist(x, n_bins, density=True, histtype='bar', color=colors, label=colors) # density=True replaces the old normed argument
ax0.legend(prop={'size': 10})
ax0.set_title('bars with legend')
ax1.hist(x, n_bins, density=True, histtype='bar', stacked=True)
ax1.set_title('stacked bar')
ax2.hist(x, n_bins, histtype='step', stacked=True, fill=True)
ax2.set_title('stepfilled')
# Make a multiple-histogram of data-sets with different length.
x_multi = [np.random.randn(n) for n in [10000, 5000, 2000]]
ax3.hist(x_multi, n_bins, histtype='bar')
ax3.set_title('different sample sizes')
plt.tight_layout()
It's also easy to share one or both axes. Below is another example I modified from the gallery in five minutes. With a little more work this could turn into a nice way to plot downhole data; at the moment it just gives an idea of what is going on downhole. You can see how the data from the Pandas dataframe is fed to the Matplotlib plotting calls.
drill_1 = df_assay[df_assay["hole_id"]== "DH1"]
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True) # three subplots sharing the y (depth) axis
fig.set_size_inches((6, 6))
ax1.plot(drill_1["Cu_pc"],drill_1["From"]*-1, c= "r")
ax1.set_xlabel("Cu_pc")
ax2.plot(drill_1["Pb_pc"], drill_1["From"]*-1, c= "b")
ax2.set_xlabel("Pb_pc")
ax3.plot(drill_1["Ag_ppm"], drill_1["From"]*-1, c= "g")
ax3. set_xlabel("Ag_ppm")
fig.subplots_adjust(wspace=0)
With a bit of tweaking you can get some fancy looking graphs. Python makes it easy to automate plotting so you could, for example, write a function to iterate over each drill hole in your table or each standard (CRM) in your QAQC database.
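As a rough sketch of that idea (not something from the presentation, just the downhole plot above wrapped in a function and looped over a few holes):
def plot_downhole(df, hole_id, elements=["Cu_pc", "Pb_pc", "Ag_ppm"]):
    hole = df[df["hole_id"] == hole_id]
    fig, axes = plt.subplots(1, len(elements), sharey=True)
    fig.set_size_inches((6, 6))
    for ax, elem in zip(axes, elements):
        ax.plot(hole[elem], hole["From"]*-1) # negative depths so the hole plots downwards
        ax.set_xlabel(elem)
    fig.suptitle(hole_id)
    fig.subplots_adjust(wspace=0)
for hole_id in df_assay["hole_id"].unique()[:3]: # just the first three holes here
    plot_downhole(df_assay, hole_id)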
Below is an example that I got from this website by Roger Veciana i Rovira. It uses the Radial Basis Function interpolator from SciPy to interpolate grades between points. I added the contour lines and scaled the points to the value size. There are many ways to interpolate between points, so be careful which method you use.
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
#Creating some data, with each coordinate and the values stored in separated lists
x = [10,60,40,70,10,50,20,70,30,60]
y = [10,20,30,30,40,50,10,70,80,90]
values = [1,2.2,1,3,4,6,7,7.5,8.3,10]
#Creating the output grid (100x100, in the example)
ti = np.linspace(0, 100.0, 100)
XI, YI = np.meshgrid(ti, ti)
#Creating the interpolation function and populating the output matrix value
rbf = interpolate.Rbf(x, y, values, function='linear') # other functions to try are 'thin_plate' and 'multiquadric'
ZI = rbf(XI, YI)
# Plotting the result
fig, ax = plt.subplots() # create a figure and axes to plot on
fig.set_size_inches((10, 6.18))
plt.pcolor(XI, YI, ZI)
plt.scatter(x, y, s = np.array([values])*100, c = values)
plt.colorbar()
#Add some contours
CS = plt.contour(XI, YI, ZI, 10, colors='k')
plt.clabel(CS, inline=1, fontsize=10)
plt.title('RBF interpolation')
plt.xlim(0, 100)
plt.ylim(0, 100)
I hope the examples so far have shown you how easy Python is to read and write. In my opinion it beats MS Excel for many simple tasks. Some of the more advanced functions available in Python would be very difficult to achieve in Excel, and yet they're not actually any more difficult to apply than the examples we have seen so far.
When I was preparing for this talk I was planning to demonstrate how easy it is to use Principal Component Analysis (PCA) and apply it to geological problems. This webpage by Sebastian Raschka really helped me understand what PCA does; if you're interested in using PCA it's worth taking the time to go through it. It also shows a handy trick to help decide how many components you may want to reduce the data to. At the geochemistry course last week at UNSW I learned that Factor Analysis is more applicable to geological applications. A quick Google search told me that Factor Analysis is also available in scikit-learn, and how it differs from PCA ("PCA assumes equal noise variance for each feature"). Unsurprisingly, the Factor Analysis interface is the same as the PCA one, so if you wanted to test out PCA you would just change "decomposition.FactorAnalysis" to "decomposition.PCA".
The following is an example of how to use Factor Analysis and k-means clustering to take a guess at classifying rock types. Depending on your application you may need to massage the data first to get rid of extreme values and outliers. There are various ways to do this, but using Mahalanobis distances may help; this webpage from scikit-learn shows how it can be applied. A rough sketch of one approach is given below.
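As a very rough sketch of one way to flag outliers (this isn't from the presentation; it uses scikit-learn's EllipticEnvelope, which is based on a robust Mahalanobis-type distance, and the 2% contamination figure is just a guess):
from sklearn.covariance import EllipticEnvelope
X_check = df_assay[econ_elems].dropna() # econ_elems is just an example set of columns
ee = EllipticEnvelope(contamination=0.02) # assume roughly 2% of samples are extreme
outlier_flag = ee.fit_predict(X_check) # +1 = inlier, -1 = outlier
df_trimmed = X_check[outlier_flag == 1] # keep only the inliers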
from sklearn import decomposition
from sklearn import cluster
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
First of all we need to select the columns of interest that we want to base the classification on.
cois =['Al_pc', 'Ba_ppm', 'Ca_pc', 'Cr_ppm', 'Fe_pc', 'K_pc', 'Mg_pc', 'Na_pc', 'Ti_pc', 'V_ppm']#'S_pc',
Then we select these columns and name the dataframe. The dropna() function drops any rows (samples) that do not have a value for every one of the columns of interest. Most (all?) of the SciPy and scikit-learn functions do not work with missing data. With the original presentation data this left 6,554 samples; the randomly generated data here has no missing values, so all 2,000 samples remain.
df_fa = df_assay[cois].dropna()
df_fa.shape
We then need to standardise the data, which removes the mean and scales each column to unit variance.
standardised = StandardScaler().fit_transform(df_fa)
We then set the number of components to look at. In this example we're choosing seven because I found that this worked well on the original dataset. Check out the link to Sebastian Raschka's example to see how to work out the eigenvalues for each eigenvector, which helps with choosing the number of components. A quick sketch of a similar check is given below.
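As a hedged sketch of that kind of check (not the method from Sebastian Raschka's page, just PCA's explained_variance_ratio_ used as a rough guide to how many components carry most of the variance):
pca_check = decomposition.PCA().fit(standardised)
cumulative = np.cumsum(pca_check.explained_variance_ratio_)
for i, frac in enumerate(cumulative, start=1):
    print(i, round(frac, 3)) # cumulative fraction of variance explained by the first i components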
We then need to create the Factor Analysis instance, fit it to the data and transform the data.
n_components = 7
FA = decomposition.FactorAnalysis(n_components= n_components)
Y_FA = FA.fit_transform(standardised)
That's the Factor Analysis completed. Piece of cake! Then we need to classify the data into groups. There are many ways to do this in scikit-learn; an example comparing eight different methods is shown near the end of this notebook. Anyway, here we're going to use k-means cluster analysis to split the data into separate groups. This is an example of unsupervised learning.
First we create the instance and then fit the data. Then we add a column to our dataframe with the labels that the k-means clustering process has assigned to each record.
k_means = cluster.KMeans(init='k-means++', n_clusters= n_components, random_state= 456)
k_means.fit(Y_FA)
df_fa["K_means_Groups"] = k_means.labels_
The next bit isn't strictly necessary but I like to do it. k-means assigns labels as numbers from 0 to n_clusters-1, and these numbers are in no particular order. I prefer to have the groups numbered from 1 to n_clusters (7 here) according to the number of samples assigned, i.e. the group with the most samples becomes group 1 and the group with the fewest samples becomes group 7. This is also a chance to show another handy Pandas feature where we can use a dictionary to find and replace values. Then we take a look at the number of samples assigned to each group.
count_order = df_fa["K_means_Groups"].value_counts().index # Lists the groups in order of counts
dictionary = dict(zip(count_order, range(1, len(count_order)+1))) # Creates a dictionary of groups and the new group names
df_fa["K_means_Groups"].replace(dictionary, inplace = True) # Replaces the groups with the new group names
df_fa["K_means_Groups"].value_counts() # Displays the new group names so we can look at them
Then I want to export the data again. The problem is that I want all the data back, not just the columns that are in the df_fa table. This is easy because df_fa retained the original indexing from df_assay. We can therefore just use the Pandas join method on the new "K_means_Groups" series (column) and it will automatically put the values in the correct rows. Then we export to CSV so we can load it into Micromine and take a look.
df_assay_mod = df_assay.join(df_fa[["K_means_Groups"]])
df_assay_mod.to_csv(os.path.join(output_path, "Test_kmeans.csv"))
You could investigate how the grouping has been done by plotting a series of scatter plots grouped by the "K_means_Groups" column (modifying one of the examples above). Here we'll just take a look at the means:
np.round(df_fa.groupby("K_means_Groups").mean(),2)
Just to get a clearer idea, we'll plot a scatter plot grouped by the k-means classification using the grouped_scatter function we created earlier. This doesn't look great with randomly generated data, but with real data it can be used to quickly split rock types based on assay data. I wouldn't trust it implicitly though!
grouped_scatter(df_assay_mod, "Ca_pc", "Na_pc", "K_means_Groups")
If you wanted more definition within any of the groups you could repeat the process on each/any of the groups to subdivide them. You'll probably want to change n_components though. A rough sketch is shown below.
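A rough sketch of what that might look like (not from the presentation; it just repeats the standardise / Factor Analysis / k-means steps on the samples in group 1, and the choice of 3 components and 3 sub-clusters is arbitrary):
group_1 = df_fa[df_fa["K_means_Groups"] == 1][cois] # just the samples assigned to group 1
standardised_1 = StandardScaler().fit_transform(group_1)
Y_1 = decomposition.FactorAnalysis(n_components=3).fit_transform(standardised_1)
sub_kmeans = cluster.KMeans(init='k-means++', n_clusters=3, random_state=456).fit(Y_1)
df_fa.loc[group_1.index, "Sub_Groups"] = sub_kmeans.labels_ + 1 # hypothetical new column of sub-group labels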
scikit-learn has loads of clustering methods for classification, and there are other ways to classify data too. The following example compares eight different clustering methods. This webpage gives some background on the different cluster methods and this webpage gives examples of some other classifiers. The code below can be found by clicking on the image on that webpage; I just copied and pasted it here.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
np.random.seed(0)
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
clustering_names = [
'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
'SpectralClustering', 'Ward', 'AgglomerativeClustering',
'DBSCAN', 'Birch']
plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
hspace=.01)
plot_num = 1
datasets = [noisy_circles, noisy_moons, blobs, no_structure]
for i_dataset, dataset in enumerate(datasets):
    X, y = dataset
    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)
    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)
    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)
    # create clustering estimators
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=2)
    ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
                                           connectivity=connectivity)
    spectral = cluster.SpectralClustering(n_clusters=2,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=.2)
    affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                       preference=-200)
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock", n_clusters=2,
        connectivity=connectivity)
    birch = cluster.Birch(n_clusters=2)
    clustering_algorithms = [
        two_means, affinity_propagation, ms, spectral, ward, average_linkage,
        dbscan, birch]
    for name, algorithm in zip(clustering_names, clustering_algorithms):
        # predict cluster memberships
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int) # plain int instead of the deprecated np.int
        else:
            y_pred = algorithm.predict(X)
        # plot
        plt.subplot(4, len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)
        if hasattr(algorithm, 'cluster_centers_'):
            centers = algorithm.cluster_centers_
            center_colors = colors[:len(centers)]
            plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1
We have barely scratched the surface of the machine learning options available in scikit-learn. What we have done so far is use an unsupervised learning process (in this case k-means cluster analysis). There are also a large number of supervised learning methods available, in which you give the method a training set, e.g. a lithology code for each sample, and then get the method to classify the rest of the data for you. You could also feed it the training set itself, and the misclassifications might draw your eye to places where you want to double check your logging. This webpage gives some good background to machine learning in scikit-learn and has links to all the methods currently available. A rough sketch of the supervised idea is given below.
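As a hedged sketch of the supervised idea only (the "Lithology" column is hypothetical and doesn't exist in the random data generated above; the classifier is just an example):
from sklearn.ensemble import RandomForestClassifier
# Assume some samples carry a hypothetical "Lithology" code and the rest are blank (NaN)
logged = df_assay.dropna(subset=["Lithology"]) # the training set
unlogged = df_assay[df_assay["Lithology"].isnull()] # the samples we want classified
clf = RandomForestClassifier(n_estimators=100, random_state=456)
clf.fit(logged[cois], logged["Lithology"])
predicted = clf.predict(unlogged[cois]) # predicted lithology codes for the unlogged samples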
Give them a go on your datasets and see what works for you!