Analyzing Data with Azure Machine Learning Studio and Python


by Vlad Iliescu

vladiliescu.ro

The Data Science Process

A non-linear, iterative process

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and visualize results

Profit

Formalized by Joe Blitzstein and Hanspeter Pfister, for the Harvard data science course

1. An interesting question

Who would survive the Titanic?

2. Get Titanic survival data

Understand the dataset's features

In [6]:
from azureml import Workspace

ws = Workspace()
ds = ws.datasets['Kaggle Titanic - Train']
df = ds.to_dataframe()
In [7]:
df.head()
Out[7]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

survived - Survival

  • 1 = Yes
  • 0 = No

pclass - Ticket class

  • 1 = 1st class (Upper class)
  • 2 = 2nd class (Middle class)
  • 3 = 3rd class (Lower class)

sex - Sex

male/female

age - Age in years

Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
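
The xx.5 convention can be turned into a flag for estimated ages. This is a minimal sketch: the helper name `is_estimated_age` and the sample values are made up for illustration, and it assumes that an age below 1 is a genuine infant age rather than an estimate.

```python
import pandas as pd


def is_estimated_age(age: pd.Series) -> pd.Series:
    """Return True where the age is at least 1 and ends in .5 (an estimate)."""
    # Ages below 1 are fractional because they describe infants, so skip them.
    return (age >= 1) & (age % 1 == 0.5)


# Hypothetical ages, not taken from the dataset.
ages = pd.Series([22.0, 28.5, 0.42, 0.5, None])
flags = is_estimated_age(ages)
print(flags.tolist())
```

Missing ages come out as not estimated, since comparisons with NaN evaluate to False.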

sibsp - number of siblings / spouses aboard the Titanic

The dataset defines family relations as follows:

  • Sibling = brother, sister, stepbrother, stepsister
  • Spouse = husband, wife (mistresses and fiancés were ignored)

parch - number of parents / children aboard the Titanic

The dataset defines family relations as follows:

  • Parent = mother, father
  • Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

ticket - Ticket number

fare - Passenger fare

cabin - Cabin number

embarked - Port of Embarkation

  • C = Cherbourg
  • Q = Queenstown
  • S = Southampton

Let's see how well we can answer the interesting question at the moment

Take a look at the basic version of the experiment on the Cortana Intelligence Gallery, and notice how many null predictions we get.

Afterwards, see how well the algorithms perform once we pretend that every passenger whose survival we couldn't predict actually lived.

Metrics for evaluating a binary classification model

  • Accuracy
  • Precision
  • Recall
  • F1-Score

Confusion Matrix

             Predicted: Yes   Predicted: No
Actual: Yes  True Positive    False Negative
Actual: No   False Positive   True Negative

Precision = How many selected items are relevant (of the people we said would survive, how many actually survived)

Precision = TP / (TP + FP)


Recall = How many relevant items are selected (of the people who actually survived, how many we said would survive)

Recall = TP / (TP + FN)


F1-Score = Combines Precision and Recall into a single value (their harmonic mean)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
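
The formulas above can be checked by hand with a few lines of Python. This is a sketch only: the function name `classification_metrics` and the confusion-matrix counts are hypothetical, and Accuracy is computed with its usual definition, (TP + TN) / all predictions.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against empty denominators (no positive predictions or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, not results from the Titanic experiment.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts Precision (0.8) is higher than Recall (about 0.667): most predicted survivors are correct, but a third of the real survivors are missed.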

Conclusion

We want to maximize Accuracy, while also taking Precision/Recall into account

We'll focus on

  • Getting an overview of the data
  • Handling missing features
  • Visualizing features
  • Correlating features with the outcome and with themselves

Initial conclusions

  • Fix the columns with missing values (Cabin, Age, Embarked)
  • Ignore PassengerId and Ticket
  • Ignore names, but extract titles
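
One way to spot the columns with missing values is to count the nulls per column. The sketch below rebuilds a few columns of the five rows shown by df.head() earlier, so it runs without the Azure ML workspace; on the real data the same calls would be applied to `df`. The name `summary` is only illustrative.

```python
import numpy as np
import pandas as pd

# A few columns from the five rows displayed by df.head() above.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Number and share of missing values per column, worst offenders first.
summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "share": sample.isnull().mean(),
})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

In this small sample only Cabin has gaps; the Age and Embarked gaps show up only on the full dataset.
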
In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

def newplot(width=10, height=5):
    """Create a new figure of the given size and return its axes."""
    return plt.figure(figsize=(width, height)).gca()

Extract Deck from Cabin

In [8]:
# The deck is the leading letter of the cabin number; expand=False returns a Series
df['Deck'] = df['Cabin'].str.extract(r'^([A-Z])', expand=False)

df['Deck'].describe()
Out[8]:
count     204
unique      8
top         C
freq       59
Name: Deck, dtype: object
In [9]:
df['Deck'].value_counts()
Out[9]:
C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Deck, dtype: int64
In [10]:
df['Deck'].value_counts().plot.bar(ax = newplot())
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f25d0632b38>
In [11]:
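# Label passengers without a known cabin as deck 'X'
# (assigning back avoids the chained inplace fillna, which newer pandas versions warn about)
df['Deck'] = df['Deck'].fillna('X')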
df['Deck'].value_counts()
Out[11]:
X    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Deck, dtype: int64
In [12]:
df[['Fare', 'Deck']].boxplot(by='Deck', ax=newplot())
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f259872cc50>
In [13]:
sns.countplot(y='Deck', hue='Survived', data=df)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f25985f0ba8>
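
The two Deck steps (extract the leading letter, then label the gaps) can be wrapped in one helper so that they can be reapplied to another dataset. This is a sketch: `extract_deck` and its `missing_label` parameter are hypothetical names, and the cabin values are the ones visible in the outputs of this notebook.

```python
import numpy as np
import pandas as pd


def extract_deck(cabin: pd.Series, missing_label: str = "X") -> pd.Series:
    """Return the leading cabin letter, or missing_label when there is no cabin."""
    deck = cabin.str.extract(r"^([A-Z])", expand=False)
    return deck.fillna(missing_label)


# Cabin values that appear in the outputs of this notebook.
cabins = pd.Series([np.nan, "C85", "C123", "B28", "B42", "C148"])
decks = extract_deck(cabins)
print(decks.tolist())
```

On the real data this would be used as `df['Deck'] = extract_deck(df['Cabin'])`.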

Analyze Age

In [14]:
g = sns.FacetGrid(df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x7f25985d3860>
In [15]:
g = sns.FacetGrid(df, col='Pclass', hue='Survived', palette='Set2')

g.map(plt.hist, 'Age', alpha=.5, bins=20)
g.add_legend();

Apply MICE (Multiple Imputation by Chained Equations) to fill in missing Age values
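
The notebook does not show how MICE is applied. As a rough illustration only, scikit-learn's `IterativeImputer` (an experimental, MICE-inspired imputer) models each column with missing values as a function of the other columns. Choosing that library is an assumption, not necessarily what was used here, and the toy values below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical numeric features; Age has two gaps to fill.
toy = pd.DataFrame({
    "Age":    [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Pclass": [3, 1, 3, 1, 3, 1, 3, 2],
    "Fare":   [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 13.00],
    "SibSp":  [1, 1, 0, 1, 0, 0, 3, 0],
})

# Each round regresses Age on the other columns and refines the filled-in values.
imputer = IterativeImputer(max_iter=10, random_state=0, min_value=0)
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)

print(imputed["Age"].round(1).tolist())
```

By default this produces a single imputation; observed ages are left unchanged and only the gaps are estimated.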

Fill in Embarked

In [16]:
df[df['Embarked'].isnull()]
Out[16]:
     PassengerId  Survived  Pclass  Name                                       Sex     Age   SibSp  Parch  Ticket  Fare  Cabin  Embarked  Deck
61   62           1         1       Icard, Miss. Amelie                        female  38.0  0      0      113572  80.0  B28    NaN       B
829  830          1         1       Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0  0      0      113572  80.0  B28    NaN       B
In [17]:
sns.boxplot(x='Embarked', y='Fare', hue='Sex', data=df[df['Pclass'] == 1], palette='Set1', ax=newplot())
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f25980a9240>
In [18]:
df.groupby(['Sex', 'Pclass', 'Embarked'])['Fare'].median()
Out[18]:
Sex     Pclass  Embarked
female  1       C           83.1583
                Q           90.0000
                S           79.6500
        2       C           24.0000
                Q           12.3500
                S           23.0000
        3       C           14.4583
                Q            7.7500
                S           14.4500
male    1       C           61.6792
                Q           90.0000
                S           35.0000
        2       C           25.8604
                Q           12.3500
                S           13.0000
        3       C            7.2292
                Q            7.7500
                S            8.0500
Name: Fare, dtype: float64
In [19]:
# Both passengers are first-class women who paid 80.0; fill their port with 'S'
df['Embarked'] = df['Embarked'].fillna('S')
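
The notebook does not say why 'S' is chosen. One plausible reading is that both passengers are first-class women who paid a fare of 80.0, and the port whose median fare for that group is closest to 80.0 is Southampton. The sketch below checks that reading using the medians printed above; the variable names are illustrative.

```python
# Median fares for first-class women per port, copied from the groupby output above.
median_fare = {"C": 83.1583, "Q": 90.0000, "S": 79.6500}

# Fare paid by both passengers with a missing port.
observed_fare = 80.0

# Pick the port whose median fare is closest to the observed fare.
closest_port = min(median_fare,
                   key=lambda port: abs(median_fare[port] - observed_fare))
print(closest_port)
```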

Extract Title from Name

In [20]:
# The title is the word followed by a period, e.g. "Mr." in "Braund, Mr. Owen Harris"
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(df['Sex'], df['Title'])
pd.crosstab(df['Survived'], df['Title'])
Out[20]:
Title   Capt  Col  Countess  Don  Dr  Jonkheer  Lady  Major  Master  Miss  Mlle  Mme   Mr  Mrs  Ms  Rev  Sir
Sex
female     0    0         1    0   1         0     1      0       0   182     2    1    0  125   1    0    0
male       1    2         0    1   6         1     0      2      40     0     0    0  517    0   0    6    1
Out[20]:
Title     Capt  Col  Countess  Don  Dr  Jonkheer  Lady  Major  Master  Miss  Mlle  Mme   Mr  Mrs  Ms  Rev  Sir
Survived
0            1    1         0    1   4         1     0      1      17    55     0    0  436   26   0    6    0
1            0    1         1    0   3         0     1      1      23   127     2    1   81   99   1    0    1
In [21]:
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col', 
                                            'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

pd.crosstab(df['Sex'], df['Title'])

pd.crosstab(df['Survived'], df['Title'])
Out[21]:
Title   Master  Miss   Mr  Mrs  Rare
Sex
female       0   185    0  126     3
male        40     0  517    0    20
Out[21]:
Title     Master  Miss   Mr  Mrs  Rare
Survived
0             17    55  436   26    15
1             23   130   81  100     8
In [22]:
df[['Title', 'Survived']].groupby('Title').mean().plot.bar(ax=newplot())
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2597df95c0>
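
The title clean-up can also be written as a single helper. This is a sketch with hypothetical names (`TITLE_MAP`, `COMMON_TITLES`, `normalize_titles`). Unlike the cell above, which lists the rare titles explicitly, it treats every title outside the four common ones as 'Rare', so titles not seen in this training set are covered as well.

```python
import pandas as pd

# Spelling variants folded into the common titles.
TITLE_MAP = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
COMMON_TITLES = {"Master", "Miss", "Mr", "Mrs"}


def normalize_titles(names: pd.Series) -> pd.Series:
    """Extract the title from each name and collapse uncommon ones into 'Rare'."""
    titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
    titles = titles.replace(TITLE_MAP)
    # Anything that is not a common title (or could not be parsed) becomes 'Rare'.
    return titles.where(titles.isin(COMMON_TITLES), "Rare")


# Names taken from the outputs shown in this notebook.
names = pd.Series([
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
    "Allen, Mr. William Henry",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])
titles = normalize_titles(names)
print(titles.tolist())
```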

Combine SibSp and Parch into a single feature

In [23]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df.tail()
Out[23]:
     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket      Fare   Cabin  Embarked  Deck  Title  FamilySize
886  887          0         2       Montvila, Rev. Juozas                     male    27.0  0      0      211536      13.00  NaN    S         X     Rare   1
887  888          1         1       Graham, Miss. Margaret Edith              female  19.0  0      0      112053      30.00  B42    S         B     Miss   1
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  NaN   1      2      W./C. 6607  23.45  NaN    S         X     Miss   4
889  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369      30.00  C148   C         C     Mr     1
890  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376      7.75   NaN    Q         X     Mr     1
In [24]:
df.groupby(['FamilySize'])['Survived'].mean()
Out[24]:
FamilySize
1     0.303538
2     0.552795
3     0.578431
4     0.724138
5     0.200000
6     0.136364
7     0.333333
8     0.000000
11    0.000000
Name: Survived, dtype: float64
In [25]:
df.groupby(['FamilySize'], as_index=False)['Survived'].mean()\
    .plot.bar(x='FamilySize', ax=newplot())
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2597d86080>

Drop unnecessary columns

In [26]:
# Identifiers and the raw columns already replaced by Deck, Title and FamilySize
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch'],
        inplace=True)

df.head()
Out[26]:
   Survived  Pclass  Sex     Age   Fare     Embarked  Deck  Title  FamilySize
0  0         3       male    22.0  7.2500   S         X     Mr     2
1  1         1       female  38.0  71.2833  C         C     Mrs    2
2  1         3       female  26.0  7.9250   S         X     Miss   1
3  1         1       female  35.0  53.1000  S         C     Mrs    2
4  0         3       male    35.0  8.0500   S         X     Mr     1
In [27]:
sns.pairplot(df.dropna()[['Survived', 'Age', 'Fare', 'Pclass', 'FamilySize', 'Title', 'Deck']], hue='Survived', plot_kws={'alpha':0.3, 's':80})
Out[27]:
<seaborn.axisgrid.PairGrid at 0x7f2597da6438>

Let's try out the improved dataset

Fin