You can do some amazing work with machine learning. Predict how many crimes will occur in a city over a given timeframe, based on historical data. Identify signs of diabetic retinopathy in eye images to help diagnose the disease in areas with limited access to doctors. Discover correlations between driving patterns and the need for car maintenance. These are the kinds of apps that win hackathons and get featured in the news!
In this blog post, I want to give you 10 easy steps to get up and running with Azure Machine Learning. (Steps 1 and 2 could be done in advance of the hackathon, and some of the steps at the end are optional.)
Step 1: Set up an Azure account.
My teammate Joe Healy has listed the ways to get Azure. The easiest option to try it out for free is the free trial. But if you have an MSDN subscription, you can get free Azure credit with that. Startups can get free Azure credit through BizSpark. Students get some Azure with DreamSpark, but currently the Azure Machine Learning functionality isn’t included in this offer…check out this “Azure Machine Learning for Teaching” blog post for other options (near the bottom):
- There is a free tier that includes 10GB of Azure storage for your datasets and the ability to build Azure ML experiments with up to 100 modules that run for up to an hour. Get started with this here.
- Azure for Education is for faculty running courses using Azure, including Azure ML. Each student receives $100 of Azure credit per month for 6 months. The faculty member receives $250 per month for 12 months. You can apply anytime at http://www.microsoftazurepass.com/azureu.
- Azure Machine Learning for Research is for university faculty running data science courses who may need greater amounts of Azure storage and additional services such as HDInsight (Hadoop and Spark) or DocumentDB (NoSQL). Proposals are accepted every two months; you can find out more and apply at http://research.microsoft.com/en-us/projects/azure/ml.aspx.
Step 2: Work through the intro tutorial.
The Azure Machine Learning team provides a very nice walkthrough tutorial that covers a lot of the basics. I highly recommend going through it once. It takes you through the entire process: creating an AzureML workspace, uploading data, creating an experiment to predict someone’s credit risk, building, training, and evaluating the models, publishing your best model as a web service, and calling that web service.
Step 3: Find interesting data with which to work.
Now you need to learn how to import a data set into Azure Machine Learning, and where to find interesting data so you can build something amazing at your hackathon. You can upload local data (like a .csv file) from your machine or access data from elsewhere on the internet (like an OData feed provider). I cover both of these topics in detail in blog and video format, so enjoy whichever format you prefer. (Spoiler alert: my favorite sources of data are Kaggle, Data.gov, and the UCI machine learning repository.)
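If you want to sanity-check a dataset locally before uploading it to Azure ML Studio, a few lines of Python go a long way. Here is a minimal sketch using pandas; the UCI iris dataset URL below is just an example, and any CSV from the sources above works the same way:

```python
# A minimal sketch of inspecting a dataset locally before uploading it
# to Azure ML Studio. The URL points to the UCI "iris" dataset as an
# example; substitute whatever CSV you found in step 3.
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

df = pd.read_csv(url, header=None, names=columns)
print(df.head())       # eyeball the first few rows
print(df.describe())   # quick sanity check on ranges and missing values

df.to_csv("iris.csv", index=False)  # save locally, then upload as a dataset in Studio
```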
Step 4: Build out your experiment.
Many predictive experiments using supervised learning (regression, classification, or anomaly detection) will follow this basic pattern.
Drag the data set that you chose in step 3 onto your AzureML workspace. Then you may want to use the various Data Transformation modules to clean or reformat your data (such as removing rows with missing values).
Then you will split your data set into a training set and a test set. I usually split 75% training and 25% test. Why do we have to split it? Well, remember that with supervised learning, you need data with labeled examples. For example, let’s say I want to predict how much a house will cost based on its square footage and zip code. To train my model, I need a dataset of existing houses with their square footage, zip codes, and prices, like this:
| SquareFootage | ZipCode | Price   |
|---------------|---------|---------|
| 2000          | 48075   | 200,000 |
| 3000          | 48075   | 300,000 |
| 4000          | 48075   | 400,000 |
| 5000          | 48075   | 500,000 |
In this example, square footage and zip code are my features (inputs) and price is my label (output). I could train a model on some data like the above, and then use the trained model to predict prices, given only a square footage and zip code.
So, the reason I split the data is to provide most of it to train the model (the “Train Model” module processes that data to find correlations between the inputs and outputs), while holding back some labeled data to test the model we built. Then we can compare the price that the trained model predicts against the actual labeled price in the test dataset (in the “Score Model” module) to see how well the model is performing. (We can't use the same data for both…the model was built from the training data, so it will score deceptively well on it; we hold back unseen data to test.)
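To make the split concrete, here is a minimal Python sketch of the same 75/25 split using scikit-learn and the toy house-price table above. (In Azure ML Studio the Split module does this for you; this code is only to illustrate the idea.)

```python
# A sketch of the 75/25 train/test split described above, done locally
# with scikit-learn. Azure ML Studio's Split module does the equivalent.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "SquareFootage": [2000, 3000, 4000, 5000],
    "ZipCode":       [48075, 48075, 48075, 48075],
    "Price":         [200000, 300000, 400000, 500000],
})

X = data[["SquareFootage", "ZipCode"]]  # features (inputs)
y = data["Price"]                       # label (output)

# Hold back 25% of the labeled rows to test the trained model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```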
Finally, the “evaluate model” module lets us compare two models against each other to determine which performs better for our needs.
Step 5: Learn how to choose the right machine learning algorithm.
The hardest part for many developers is staring down the list of Azure Machine Learning algorithms (there are currently 25 of them) and trying to figure out which one would work best. I cover how to choose the best Azure Machine Learning algorithm in this blog post. To summarize, there are 4 categories of algorithms currently supported in Azure Machine Learning:
- Clustering: grouping similar data together
- Regression: predicting a value
- Classification: predicting a discrete category
- Anomaly detection: identifying data that is outside of the norm
Once you determine the category of algorithm that makes sense for your problem, you need to choose a specific algorithm within that category. The best resource for this is the Azure Machine Learning Cheat Sheet. It is a useful flowchart that helps you analyze your data and figure out which algorithm may perform best. However, keep in mind that there is some art to this – definitely try multiple algorithms and see which one gives the best results for your particular data set. There are other factors to keep in mind when choosing an algorithm as well: do you want to be able to incrementally update your trained model with new data? Is accuracy or training time more important to you? How large is your data set and how many features does each data point have? This article does a great job discussing some of the nuances.
Step 6: Refine your model.
Now comes the experimentation phase. I start by trying multiple algorithms and seeing which performs best*.
Each algorithm contains a number of initial parameters. Tweaking the initial parameters can greatly improve your results. The "Sweep Parameters" module can help by trying many different input parameters for you, and you can specify the metric that you want to optimize for (such as accuracy, precision, recall, etc.). See this article for a brief description.
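If you're curious what a parameter sweep looks like in code, here is a rough local analogue using scikit-learn's GridSearchCV. This is a sketch of the idea, not what the Sweep Parameters module runs internally, and the dataset and parameter values are just examples:

```python
# A rough local analogue of the "Sweep Parameters" module: GridSearchCV
# tries every combination of the listed parameters and keeps the one
# that scores best on the metric you choose.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],  # initial parameters to sweep over
    "max_depth": [3, 5, None],
}

# scoring can be "accuracy", "precision", "recall", etc. -- the same
# kinds of metrics Sweep Parameters lets you optimize for.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```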
Changing algorithms and adjusting their initial parameters can greatly affect your results. Here are some resources to help you learn to perfect your model:
- How to choose parameters to optimize your algorithms in Azure Machine Learning
- "Run and Fine-Tune Multiple Models" video by Data Science Dojo
* You might be asking: how do I figure out which is performing best? To evaluate your model, right-click on the output node of the “Evaluate Model” module and select “Visualize”.
The data provided is different depending on what category of algorithm you are using:
Regression models give you the mean absolute error, root mean squared error, relative absolute error, relative squared error, and the coefficient of determination. You want the errors to be as close to 0 as possible, and you want the coefficient of determination to be as close to 1 as possible.
Binary (two-class) classification models provide metrics on accuracy, precision, recall, F1 score (which is a combination of precision and recall), and AUC (area under the curve). You want all of these numbers to be as close to 1 as possible. It also provides the number of true positives, false positives, false negatives, and true negatives. You want the number of true positives and true negatives to be high, and the number of false positives and false negatives to be low.
Multiclass classification models provide a confusion matrix of actual vs. predicted instances.
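All of these metrics can also be computed by hand, which is a good way to internalize what “Visualize” is showing you. A small scikit-learn sketch with made-up predictions:

```python
# A sketch of computing the metrics described above with scikit-learn,
# using made-up predictions. Azure ML Studio's "Visualize" reports the
# same quantities for you.
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Regression: errors should be near 0, coefficient of determination near 1.
y_true_reg = [200000, 300000, 400000]
y_pred_reg = [210000, 290000, 405000]
print(mean_absolute_error(y_true_reg, y_pred_reg))
print(mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)  # root mean squared error
print(r2_score(y_true_reg, y_pred_reg))                   # coefficient of determination

# Binary classification: all of these should be near 1.
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))
print(precision_score(y_true_cls, y_pred_cls))
print(recall_score(y_true_cls, y_pred_cls))
print(f1_score(y_true_cls, y_pred_cls))
print(roc_auc_score(y_true_cls, y_pred_cls))
print(confusion_matrix(y_true_cls, y_pred_cls))  # true/false positives and negatives
```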
Step 7: Publish your model as a web service.
To publish your model, click the “SET UP WEB SERVICE” button in the bottom toolbar in Azure Machine Learning Studio. If there are multiple trained models in your experiment, select the “Train Model” module for the algorithm/trained model you want to use before clicking the button.
Choose to create a “Predictive Web Service”. The tool generates a new experiment with web service inputs and outputs. Verify that all of your data preprocessing modules still make sense when the service is called with new data. You can also use the “Project Columns” module to remove some columns from the web service inputs and outputs. Then run your predictive experiment and click “DEPLOY WEB SERVICE”.
There is further documentation on publishing your web service here. (You can also reference this step in the walkthrough I mentioned in step 2.)
Step 8: Call your web service.
Finally, you need to write a little code (or grab some sample code) to call your web service. The Azure web service that you created can operate two different ways:
- Request/Response - The user sends one or more rows of input data to the service over HTTP, and the service responds with a set of results.
- Batch Execution - The user sends the service the URL of an Azure blob that contains one or more rows of input data. The service stores the results in another blob and returns the URL of that container.
When you published the web service in the previous step, you were taken to a webpage documenting the different ways to call your service. Sample code is provided in C#, Python, and R. An Excel spreadsheet with macros to call the web service is also provided.
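To give you a feel for the Request/Response pattern, here is a minimal Python sketch modeled on the structure of that generated sample code. The URL, API key, and column names below are placeholders; copy the real values from your service's API help page:

```python
# A minimal Request/Response sketch, following the request structure of
# the sample code Azure ML generates. URL, API key, and column names
# are placeholders -- take the real ones from your service's API page.
import json
import requests  # pip install requests

url = ("https://YOUR_REGION.services.azureml.net/workspaces/YOUR_WORKSPACE"
       "/services/YOUR_SERVICE/execute?api-version=2.0&details=true")
api_key = "YOUR_API_KEY"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["SquareFootage", "ZipCode"],  # your features
            "Values": [["2500", "48075"]],                # one row to score
        }
    },
    "GlobalParameters": {},
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + api_key,
}

response = requests.post(url, data=json.dumps(payload), headers=headers)
print(response.json())  # the scored label (e.g., predicted price) comes back here
```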
The official documentation on calling your web service is here.
Step 9: Retrain your model over time.
Honestly, this step is beyond what you would really do at a time-constrained hackathon, but people ask me about it all the time so I wanted to include it. :)
You may have new data coming in continually, and want to occasionally retrain your ML model based on that new data. Here is the official documentation on how to retrain machine learning models programmatically.
PRO TIP: Bayesian algorithms update well. Neural nets and SVMs need to work on the entire training dataset in batch mode, so they don’t update as well.
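To see why that is, here is a small scikit-learn sketch: naive Bayes supports partial_fit, so new labeled rows can be folded into an existing model without retraining from scratch. (This is a local illustration of the idea, not the Azure ML retraining API linked above.)

```python
# Why Bayesian models "update well": naive Bayes can be updated
# incrementally with partial_fit, one batch of new rows at a time.
import numpy as np
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# Initial training batch.
X1 = np.array([[2000], [3000], [4000]])
y1 = np.array([0, 0, 1])  # e.g., 1 = "expensive"
model.partial_fit(X1, y1, classes=[0, 1])  # classes required on first call

# Later, new labeled data arrives -- fold it in without starting over.
X2 = np.array([[5000]])
y2 = np.array([1])
model.partial_fit(X2, y2)

print(model.predict([[4500]]))
```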
Step 10: Share the awesome and PROFIT
You have built something amazing. Now tell the story of why it’s cool. If you are presenting your work at the hackathon, include concrete data about the problem to establish why it matters. For example, look at the pitches of past hackathon winners.
“Ciris is a real-time color augmentation overlay that allows the colorblind to more clearly see contrasts on their desktop computers and mobile devices. 700 million people are color-blind worldwide. In the current world, we use color in charts, pictures, graphics, and clothing to convey information. These cues are lost on color-blind people. If these individuals could somehow glean this information, it would enrich their day-to-day lives and solve a whole host of problems.”
Including the number of people that colorblindness affects is very powerful. Use statistics and data whenever possible to show the work’s impact.
If you aren’t presenting your work at a hackathon, share it anyway! Post a comment below. Write a blog post. Present at a local user group. Hopefully you learned something and others will benefit from your knowledge, troubleshooting efforts, and lessons learned as well. You can also share your machine learning model to the Azure Machine Learning gallery with a button click from the bottom toolbar in AzureML Studio.
Useful Resources for Azure Machine Learning
- Get started with Azure Machine Learning
- Feature requests for Azure Machine Learning
- Microsoft Virtual Academy course on Azure Machine Learning