Predicting when a batter is more likely to swing at a pitch can be useful for a pitcher. For instance, it can be useful to determine how likely a batter will swing on a full count. If he is inclined to swing, then the batter could be susceptible to chasing pitches out of the strike zone. Here, I take a supervised learning approach to predicting whether a batter will swing at a pitch based on pitch trajectory and game situation information from Pitchf/x. The binary classification models I test are logistic regression, naive Bayes, K-nearest neighbors, support vector machines, and random forest.
I first test various subsets of features with a random forest model. Once I select the subset of features that produces the best performance based on the area under receiver operating characteristic (ROC) curves, I use the same subset to build other models. ROC curves suggest that random forest is the best model. Additionally, random forest has the highest accuracy (78%) on predicting Jack Cust swings in 2008.