One goal of machine learning algorithms is to use past information to predict the future. The advantage of machine learning over traditional analytics is that the algorithm builds a good model automatically, saving time, helping guard against overfitting, and generally producing more robust results. To do this the algorithm builds a model, calculates the error rate of the model, adjusts parameters to lower the error rate, and iterates again, 'learning' from its mistakes.
There's a step in between: calculating the error rate requires us to split our dataset into a training set and a test set. We train the model on the training set and calculate the error rate on the test set. We need this separation because an error rate calculated on the same data the model was trained on would be deceptively low, and misleading for predicting future, unknown data. A separate test set is a better measure of how a model actually performs.
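To make the split concrete, here is a minimal sketch using scikit-learn's `train_test_split`; the feature list `X` and labels `y` are placeholders, not the temperature data discussed below.

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # placeholder features
y = [i % 2 for i in range(100)]  # placeholder labels

# Hold out 25% of the rows for testing; the model never sees them during
# training, so the test error is an honest estimate of future performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```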
Below, we've already split a dataset into training and testing sets; if you were to build a rudimentary model on the following data, how would you draw it?
Most people draw a monotonic line trending upwards, but this particular dataset was of temperature readings from San Francisco, which follow a generally defined cycle over the course of a year (we had plotted the number of days into the year against temperature). When we looked at the data in only two dimensions, we missed another factor: time (seasonality). This should influence our model. Examples like this are why it's important to inspect your dataset, think about your problem, and have the necessary domain knowledge before throwing an algorithm at a dataset.
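One way to let a model see that seasonality, as a hypothetical sketch (the original example doesn't specify a transformation), is to encode the day of the year as a cyclic feature so that December 31 and January 1 land next to each other:

```python
import numpy as np

def cyclic_day_features(day_of_year):
    """Map a day of the year (1-365) onto a circle so that
    the end of December and the start of January are close."""
    angle = 2 * np.pi * day_of_year / 365.0
    return np.sin(angle), np.cos(angle)

# Days 1 and 364 are far apart as raw numbers but close as seasons:
print(cyclic_day_features(1))    # approx (0.017, 1.000)
print(cyclic_day_features(364))  # approx (-0.017, 1.000)
```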
This process is also incredibly important for feature generation, which can reveal aspects of your dataset that lead to better prediction and classification results. One other important step is to cross-validate your data while building a model. Cross-validation can be trickier for time-series datasets, like our example, but for most data it is relatively painless to shuffle your data or add in k-fold cross-validation.
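As a minimal sketch of both options, assuming scikit-learn and a placeholder linear model: `KFold` with `shuffle=True` covers the shuffle-and-split approach, while `TimeSeriesSplit` is one way to cross-validate data like our temperature example without training on the future.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = np.arange(100).reshape(-1, 1)                     # placeholder: day index
y = np.sin(X.ravel() / 10) + rng.normal(0, 0.1, 100)  # placeholder: noisy seasonal target

model = LinearRegression()

# Standard shuffled 5-fold cross-validation -- fine when rows are independent.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# For time series, keep the folds in temporal order so each fold is
# validated only on data that comes after its training data.
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

print(kfold_scores.mean(), ts_scores.mean())
```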