PredictingPitcherInjuries
By:KurtBullard,JakeMeagher,andDeclanGarvey
Introduction:
Wesetouttocreateamodel—usinglogisticregression—thatcouldhelppredictwhether
ornotapitcherwouldgetinjuredthefollowingseasonbasedontraditionalandadvanced
statisticsfromboththepreviousyearandtheentirecareerofeachpitcher.
ThismodelcouldassistMLBfrontofficesintargetingoravoidingcertainpitchersinfree
agencyandintrades,oratleastadjustingtheirevaluationsofthemaccordingly,whichcould
dramaticallyimproveteamperformancefromseasontoseason.Eachyear,generalmanagers
throughoutMajorLeagueBaseballdishouttensofmillionsofdollarstoacquirestarplayers
duringtheannualfreeagencyperiod.Usually,themajor“prizes”offreeagencyarestarpitchers
whocanhelpbolsterapitchingrotation.Thisyear,theBostonRedSoxinkedDavidPricetoa
sevenyear,$217millioncontract.Dayslater,onDec.4,2015,theArizonaDiamondbacks
signedstarpitcherZackGreinketoasixyear,$206millioncontract.Teamsspendalotof
moneyonthesehurlers,soit’simperativethattheymakesuretheseinvestmentsaresound.
Evaluatingpitchersforinjuryrisk,therefore,isoftheutmostinterestforeachteam,asmostof
themoneydesignatedinthesecontractsisguaranteed,andaninjurywouldspoilthe
investment.Overthepastfiveyears,over23percenthavebeenplacedontheDisabledList
(DL).
Methods:
DataCollection:
Welookedatpitcherdatafrom20102014tomatchtherespectiveinjurydatathatwe
foundfrom20112015,sincewewereconcernedwithhowlastyear’susageandperformance
mightaffectnextyear’sinjuryrisk.Wewantedtolookatpitcherspecificinjuries—onesthat
cameasaresultofpitchingstressputoncertainpartsofthebody,asopposedtoinjuriesthat
were“fluky.”Forthatreason,weonlyconsideredinjuriesinvolvingthearm,shoulder,back,and
side.Otherinjuries,weassumed,werenotinherentlyrelatedtopitchingstress(e.g.
gastrointestinal).Intotal,therewere3330pitcherseasonsthatmetthisinitialcriteria.
Ourindependentvariablescamefromthreedifferentsources:
BaseballReference:Strikeouts,Age,DummyVariableforAL/NL,Games,Games
Started,DummyVariableforStartingPitcher(GamesStarted>0),CompleteGames,Innings
Pitched,Hits,Runs,BB,FIP,BattersFaced,StrikePercentage,andCareerBattersFaced
BaseballInfoSolutionsData(fromFangraphs):PercentageofPitchThrownand
AverageVelocityfortheFollowingPitches:Fastball,Cutter,Slider,andCurveball
TommyJohnDatabase:Adummyvariablesignalingwhetherornotapitcherhadhad
TommyJohnSurgerybefore
Wedidenduptrimmingsomeofthedatathatwedidnotfindrepresentativeofadecent
MLBpitcher.Forone,wedidnotincludethreepitcherseasonswherethepitchermadean
appearancedidnotrecordanout,whichmesseswithFIPandmakestheseasonsunusable.
Also,ifapitcherdoesn’trecordanouttheentireyear,he’smorelikelythannotamediocre
talentthatshouldnotinfluencethemodel.
Inaddition,wedidnotincludepitcherswhorecordedlessthan10inningsinaseason.
Werealizethatthismaybeabitproblematicinthatsomepitchersmayhavegotteninjuredless
than10inningsintotheseasonanddidnotplayforthatreasonratherthannotplayduetolack
oftalent—sevenpercentofthesepitchersendedupgettinginjured.Thatbeingsaid,theoverall
injuryrateforpitchershoveredaround23%,somostofthepitcherswhopitchedfewerthan10
inningsweresimplyunderutilizedratherthaninjured.However,therewasverylittleBaseball
InfoSolutionsdataforpitcherswithfewappearances,soitwouldhavethrownofftheactual
impactofpitchspeedandselection.Attheend,wewereleftwith2749pitcherseasons.
VariableSelection:
Amongthevariablesweelectedtoeliminatefromtheregressionmodelwereinnings
pitched,whichshowedsignsofmulticollinearitywithbattersfaced(correlationgreaterthan.99).
Wedidkeepbattersfacedinthemodelthough.Wealsodidnotincludegamesstarted,asit
showedstrongmulticollinearitywithbattersfaced(0.941).
Anothervariablewedidawaywithwashitbypitches,asHBPandwildpitcheshadalso
shownsignsofmulticollinearity(.43).Wekeptwildpitches,butwedidtransformit(seebelow).
Hitsandrunsshowedcollinearity(.97),sowedecidedtokeephits,becausewehypothesize
thoseweremorestableacrossyearsthanwereruns,whichisbasedmoreontheclusteringof
hits.
Wealsonoticedcollinearitybetweenthreevariables—strikeouts,hits,andwalks—and
battersfaced(.93,.98,.89),whichmakessense,sincethemoreyouplay,themorehitsand
walksyouletupandstrikeoutsyoucanregister,regardlessofskilllevel.Thus,wetransformed
thethreecountingstatistics(SO,H,BB)intorates,comparingthemeachtobattersfaced
(SO/BF,H/BF,BB/BF).Asaresult,wewereabletoeliminatethestrongcorrelationbetweenthe
aforementionedvariablesandbattersfacedwiththesenewstatistics.Wealsotranslatedwild
pitchestoarateparameterforsimilarreasons,eventhoughthecollinearitybetweenWPand
BFwasnotstrong.
WealsonoticedarelationshipbetweenFIPandtherateparametersofSO,H,andBB,
sowedeletedFIPfromthemodel(0.5897501,0.4537724,0.321768).Thiscorrelationmakes
sense,becauseFIPisastatisticcreatedstrictlywithSO,BB,andHR.
Transformations:
In addition to removing some superfluous variables from the mix, we also decided to
transform a handful of others after examining their distributions.
Since the distribution of games was skewed right (see left), we used a square root
transformation to increase its normality (see right).
We also transformed batters faced, which also had a right-skewed distribution.
Consequently, we utilized another square root transformation to increase its symmetry and
normality.
We also transformed wild pitches per batters faced, which had a right-skewed
distribution. Thus, we utilized another square root transformation to increase its symmetry and
normality. The newly transformed distribution still shows a potential problem, as there seem to
be a lot of remaining zero values (or near-zero values), but we dealt with that later on, as we did
for the subsequent transformations as well.
Forsliderpercentage,weseeanotherrightskeweddistributionanduseasquareroot
transformationtoincreaseitssymmetryandnormality.Wecreatedanindicatorvariableto
interactwiththepercentageandvelocitydata,asnotallpitchershaveeverypitchtypeintheir
arsenal.Thisshouldnullifytheeffectofthezerovalues.
Forcutterpercentage,weseeaveryrightskeweddistributionandelecttousealog
transformation.Morespecifically,weactuallylog“CutterPercentage+0.01”toaccountforzero
values,becauseyoucannotlogzeroes.Wecreatedanindicatorvariabletointeractwiththe
percentageandvelocitydata,asnotallpitchershaveeverypitchtypeintheirarsenal.This
shouldnullifytheeffectofthezerovalues.
Forcurveballpercentage,weseearightskeweddistributionanduseasquareroot
transformationtoincreaseitsnormality.Wecreatedanindicatorvariabletointeractwiththe
percentageandvelocitydata,asnotallpitchershaveeverypitchtypeintheirarsenal.This
shouldnullifytheeffectofthezerovalues.
Fortotalbattersfaced,weseeaveryrightskeweddistributionandelecttousealog
transformationtoimprovesymmetryandnormality.
Thedistributionsfortheremainderofthevariablesappearedtobeapproximately
normal,sowefoundnoneedtotransformthem.Belowisakeyofthevariablesthatwewere
considering,posttransformations
:Results:
Model1:
Intotal,outofthe2749pitcherseasonsinthesample,644pitcherswereinjuredthe
followingseason,apercentageof23.4%.Usingthatdataandthevariablesabove,werana
stepwiselogisticregressionthatyieldedthefollowingresults:
Therearealotofvariablesincludedinthismodel.Someofthemmakealotofsense:
completegamesandcareerbattersfacedarebothverysignificantandpositive,sincetheyboth
inducemorestressonthearm.However,thismodelincludesseveralmeaningless,yet
significant,interactions,likehitsperbattersfacedandcurveballvelocity,orwildpitchesper
battersfacedandchangeupvelocity.Theseinteractionsledustoquestionthemodel’s
legitimacy,despitethemanyvariablesthatdidmakesense.
TheAkaikeinformationcriterion(AIC)ofthismodel,whichmeasuresitsquality,is
2738.2,whichwillberelevantlaterwhenwetestanothermodel.
Whenvisualizingtheresidualplotofourmodel,theresultinggraphlookedlikethis:
Althoughatfirstwewereconcernedbythediscretelinearityoftheplots,wequickly
realizedthattheabovegraphswasnotnecessarilyanindicatorofabadmodel.Because,ina
logisticregression,theoutcomeiscategorical(canonlytakeon0or1),theresidualsfora
noninjuredpitchercanonlybenegative,andtheresidualforaninjuredpitchercanonlybe
positive.Withrespecttothefirstplot,becausepredictedvaluesandresidualsmustsumto
eitherzerooroneforeachobservation,theresidualplotthereforefollowsalinearpattern.
Nonetheless,becauseofthehighamountofmeaninglessvariablesmentionedabove,we
lookedtocreateanewmodel.
Model2:
Oneconcernthatarosewithourfirstmodelwastheeffectofthemultitudeofzeroesin
thecutterpercentage,sliderpercentage,changeuppercentage,andcurveballpercentage
predictorvariables(seeTransformationssection).Weharboredfearthattheskewnessofthe
datawasthrowingofftheaccuracyofourmodel.So,inthismodel,weonlyusedfastball
percentageandfastballvelocity,andsubstitutedindicatorvariables,ratherthanpercentages
andvelocities,foreachofthefouroffspeedpitches.Muchofthesameinformationisstill
included,asthecomplementoffastballpercentageisoffspeedpercentage,andfastballvelocity
isalsofairlycorrelatedwiththevelocityofapitcher’sotherpitches.Wesetthethresholdfor
“havingapitch”at1percent,assomepitchersoccasionallythrowpitchesthatthey’renot
accustomedtothrowingregularly.
Thestepwiseregressionproducedthefollowingmodel:
TheAICinthismodelis2770.2,whichisslightlyhigherthanthefirstmodelthatwe
tested,signalingthatthefirstmodelisperhapsbetter.
ThelowestAICdoesn’tguaranteethe
bestmodel,butitoftenleadstoamoreusefulmodel.
Theresidualplotfollowsalinearpattern—aswasexplainedabove,thatisexpectedina
logisticregressionbasedonthebinaryoutcome.
CrossValidation:
Afterrunning2,000simulationsofcrossvalidation,trainingthedatausing2,000
observationstopredicttheother749,wefoundthattheaverageSumofSquaredErrorforeach
modelwasasfollows:
Model1Model2
4.9867784.676602
ThelowerSSEaverageinModel2suggests,contrarytotheAIC,thatModel2mayhave
morepredictivepower.
Sincethetwomeasuresofmodelqualitywerecontradicting,wehadtodecidewhich
modeltousegoingforward(inevitably,neitherbecameourfinalmodel).Wechosethefirst
modelasaresultofitslowerAIC,eventhoughitproducedahigherSSEinthecrossvalidation
process.Inaddition,alotofthepitchselectionrelatedvariablesweresignificant(slider
percentage,cuttervelocity,etc.).So,wefeltlikeitwascriticaltoincludeamodelthatcontained
thesevariables.
Predictionsfor2016Season(Attempt1):
Lastyear,598pitchersthrew10ormoreinnings.Wehadtoremove19pitchers
becausetheydidnothaveBaseballInfoSolutionsdata,whichleft579pitchersfromlastseason
whothrew10ormoreinningsandhadtherelevantvelocityandpitchselectiondata.
Whenweranourpredictions,however,wenoticedthatalotofpitchershadeithera
nearzeroornear1probabilityofgettinginjurednextseason,whichdoesnotmakesense.We
thennoticed,afterlookingthroughthepredictionsmoreclosely,thatreliefpitchersexhibiteda
100%probabilityofgettinginjured,whilestartingpitchershadanearzeropercentchanceof
gettinginjured.Ontheirown,theseresultsdonotmakesense.Butparticularlysincestarters
throwmorethanrelieversdo,thesepredictionsareespeciallypuzzling.Forthisreason,we
cannotgoforwardwiththismodel.Wehavetomakeanewmodelthatgivesbetter
predictions.
Model3:
Webelievethattheaforementionedproblemaroseduetooverfitting—themodelwas
notgenericenoughtoapplytoanewdataset.Inthisthirdmodel,weeliminatesomevariables
thatwedonotbelievearecriticaltotheregressionforfearofoverfitting.Weeliminatedgames,
league,wildpitchesperbattersfaced,hitsperbattersfaced,strikepercentage,startingpitcher,
aswellasallofthepitchselectiondatabesidesfastballpercentage,velocityandanindicative
variableforcutters.Wefeltcomfortablewiththeeliminationoftheseaforementionedvariables
forthefollowingreason:
Games:Althoughnotcorrelatedwithbattersfaced,itisameasureofusage.Batters
facedismorespecific,sowe’lleliminategames.
League:EventhoughpitchershavetobatintheNationalLeague,thisprobablydoes
notaffectpitcherinjury(atleastthetypesofinjurieswe’relookingat).
WildPitchesperBF:Wealreadyhaveameasureofaccuracyinwalks,sothere’snota
greatneedforanothervariablethatrepresentsasimilarmeasure.
HitsperBF:Wethinkthatthismaynotberelevant,sincehitsperbattersfacedmay
alreadybereflectedbyBF.
StrikePercentage:Wealreadyhaveameasureofaccuracyinwalks,sothere’snota
greatneedforanothervariablethatrepresentsasimilarmeasure.
SP:Inthefirstmodel,wefoundthatthestartingpitcherindicatorvariableledtoalotof
0%and100%values,sowewilltrytoleaveoutwhetherornotapitcherstartedgamesorcame
inlater.
PitchSelection:Fastballpercentagealsoholdsinitsvaluethecomplement,which
representstheamountofoffspeedthrown,whichcouldbepredictiveofinjurybasedoffarm
motion.Cuttershavealsobeenhypothesizedtoleadtoinjury,sowekeptanindicatorforthat.
Lastly,weonlyusefastballvelocity,aswebelievethatitshouldbecorrelatedwiththevelocity
ofotherpitches.
So,wekeeponlythecutterindicator,fastballvelocityandfastballpercentagefromthe
pitchselectiondata.
Wealsoincludedafewinteractionvariables:TommyJohnandFastballvelocity,
completegamesandbattersfaced,fastballpercentageandvelocity,strikeoutsperbattersfaced
andfastballvelocity,strikeoutsperbattersfacedandage,ageandvelocity,walksandtommy
john,andfinallythecutterindicatorandfastballpercentage.
Afterrunningthestepwiseregression,wefoundthefollowingresults:
Alotofvariablesinthismodelmakesense.Throwingmorecompletegamesleadstoa
higherinjuryrate,asdoesfacingmorebatters.HavingreceivedTommyJohnsurgerybefore
alsopointstohigherratesofinjury,asitmakessensethatpitcherswhohavehadarmtroublein
thepastmayhaveitinthefuture.Fastballpercentageisnegativelycorrelated,whichmakes
sensebecausethelessfastballsyouthrow,themoreoffspeedyouthrow,whichmeanshigher
stressonthearm.Careertotalbattersfacedisalsopositive,whichmakessensebecausethe
moreyouthrow,themoreofaphysicaltollittakesonyourarm.
Therearealsosomeinteractiontermsthataresignificant.Completegamesareless
impactfulwhenyou’vethrowntomorebatters,butthismayarisebecausethebestpitchersare
onesfacingthemostbattersandthrowingthemostcompletegames.Strikeoutsarealsomore
costlywhenyougetolder,asstrikeoutpitcherstendtothrowmorepitchesinanatbatthando
groundballandflyballpitchers.
Onevariableinparticularstandsoutasodd:age.Itisslightlynegative,butthisismost
likelybecauseofsamplebias—thebestpitcherswhoareoldprobablylastedthislongbecause
they’redurable,sothey’llmakeitseemliketheolderyouget,themoredurableyoubecome.
TheAICofthismodelis2867.2,whichwillberelevantwhencomparingittothenext
model.
Residuals:
The residual plot follows a linear pattern; as was explained above, that is
expected in a logistic regression because of the binary outcome.
Model4:
Forthismodel,weaddedbackafewvariablesjusttomakesurewedidn’toverreactto
ourfearofoverfitting.Weaddedbackwildpitchesperbattersfaced,hitsperbattersfaced,and
strikepercentage,andtheresultswereasfollows:
ThemodelissimilartoModel3,buttherearesomeinterestingvariablesselectedinthis
model.Forone,hitsperbattersfacedishugelynegative,whichdoesn’tmakesense,since
givinguphitsleadstomorebattersfaced.Strikeoutsperbattersfacedwithhitsperbatters
facedisalsohugelypositiveandsignificant,whichdoesn’tmakesenseforarelatively
meaninglessinteraction.Ageandstrikepercentageisalsosignificanthere,despiteitbeinga
meaninglessinteraction.
EventhoughtheAICis2817—50lowerthanModel3—itlooksliketheothermodel
makesmoresenseandcouldbeabetterpredictivemeasure.
Residuals:
The residual plot follows a linear pattern; as was explained above, that is expected in a
logistic regression because of the binary outcome.
CrossValidation:
Thecrossvalidationprocessisalreadydescribedabove,sothereisnoneedtoexplainit
again.Theresultsareasfollows:
Model1:4.98
Model2:4.67
Model3:3.22
Model4:4.23
WeseethatModel3isoverwhelminglybetteratmakingpredictionsthananyofthe
otherthreemodels,meaningthatthismodelinmostlikelihoodhasthemostexternalvalidity,
eventhoughitdoesnothavethelowestAIC.So,wewillchooseModel3asourfinalmodel
tomakepredictionsfornextseason.
Asareminder,Model3isbelow:
Thelogoddsofinjuryriskarenegativelycorrelatedwith(frommosttoleastsignificant,
wheresignificantrelationshipsarebolded):
CompleteGameswiththeLogofBattersFaced
Age
Strikeouts
FastballPercentage
TommyJohnWithFastballVelocity
WalksperbattersfacedandTommyJohn
Strikeoutsperbattersfaced
Walksperbattersfaced
CutterIndicator
FastballVelocity
Thelogoddsofinjuryriskarepositivelycorrelatedwith(frommosttoleastsignificant,
wheresignificantrelationshipsarebolded):
TotalBattersFaced
CompleteGames
LogofBattersFaced
FastballVelocitywithFastballPercentage
StrikeoutsperBattersFacedandAge
TommyJohn
FastballPercentageandAge
FastballPercentageandCutterIndicator
StrikeoutsPerBattersFacedandFastballVelocity
Themostsignificantvariablesinthismodelmakealotofsense.Intermsofpositive
correlations,completegames,totalbattersfaced,andthelogofthepreviousseason’sbatters
facedallmakesensebecauseitpunishesyouforthrowingalotofpitchesinindividualgames,
seasons,andcareers.Fastballvelocityandpercentagealsomakesense,sinceaquicker
fastballismoredangerouswhenyouthrowitmoreoften.Lastly,havinghadTommyJohnpoints
tohavinghadpreviousinjury,whichisagoodindicatorofgettinginjuredonceagain.
Thesignificantnegativecorrelationsalsomakesenseintermsofthestructureofthe
dataset.Fastballpercentageisnegativelycorrelatedwithinjuryrisk,asthelessfastballone
throws,themorearmstrainingoffspeedpitchesarethrown.Thebestpitchersaretheoneswho
throwthemoststrikeoutsandcompletegames,meaningtheyprobablyhavesoundmechanics,
explainingwhystrikeoutsandcompletegameswiththelogofbattersfacedaresignificant.
TommyJohnandFastballVelocityaresignificant,whichcouldserveasanindicatortothelevel
ofhealingfromthelastinjury—perhapspitcherswhorecovermorefullyfromitthrowfasterupon
return.Lastly,ageisalsointheregression,butthatismostlikelyduetosamplebiasandisa
slightshortcomingofourmodel:thepitcherswholastedto32pluswereusuallythereallygood
oneswhoinmostlikelihoodneverexperiencedanydevastatinginjury;meanwhile,alotofreally
youngpitchersgetinjuredandnevercomebackfromit.
Predictionsfor2016(Final):
Lastyear,598pitchersthrewmorethan10innings.Wehadtoremove19pitchers
becausetheydidnothaveBaseballInfoSolutionsdata,whichleft579pitchersfromlastseason
whothrew10ormoreinningsandhadtherelevantvelocityandpitchselectiondata.The
followingisthehistogramoftheinjuryriskpredictions:
Wepredictedtheaverageriskofapitchergettinginjuredin2016tobe23.2%,which
makessense,sincethefiveyearmeanwasinfact23.4%.Wewouldexpectfutureyearsto
hoveraroundthisvalue,sincetherehasn’tbeenasecularchangeinpitcherusageor
philosophy.
Therangeofthemodel’spredictionswentfromapeakof80.3%toalowof2.8%.We
presentthetop10mostlikelyandleastlikelypitcherstogetinjurednextseason,respectively.
TopTen
BottomTen
ThepitchermostlikelytogetinjuredisClevelandStarterJoshTomlin.Thestarting
pitcherisjustcomingoffofTommyJohnsurgery,andalsoonlythrows53%fastballs,whichwas
inthelowestquartilelastseason.He’sfaced1,675battersinhiscareeralready,andalsothrew
twocompletegameslastyear.Thepitcherwillturn31thisseasonaswell,whichdoesn’tbode
wellsincehestrikesoutapproximatelyoneeveryfourbatters.
HighuponthelistisAroldisChapman,whotheDodgerstriedtotradeforrecently.The
Reds’relieverstrikesouttwooutofeveryfivebatters,andalsothrowshisfastballanaverageof
99.5MPH,whichisthehighestintheleague.
TowardsthebottomofthelistisR.A.Dickey,whichalsomakessense,sinceheis
knuckleballpitcher,andthattendstoputlessstressonthearmthandocuttersandcurveballs
andhighvelocityfastballs.
FinalThoughts:
"Allmodelsarewrong,butsomeareuseful"GeorgeE.P.Box
Webelievethatthelogisticregressionpredictingpitcherinjuriesisausefulmodelgiven
thatitwasconstructedusingonlypubliclyavailablebaseballstatistics.However,thereismore
informationthatwewouldwanttomakeabettermodelthatwecan’thave—eatinghabitsand
trainingregimen,amongstotherthingsthatthereisnodatafor.Inaddition,pitchermechanics
arealsoalargecomponentofinjuryrisk—thosewhohaveworsefundamentalstendtoget
injuredatahigherrate.Usingthismodelinconjunctionwithqualitativeanalysisofone’s
pitchingmotionwouldperhapsbeevenmorehelpful.Nonetheless,ourmodelisausefulstartin
identifyingpitcherinjuryrisksothatbothpitchersthemselvesandteammanagementcanadjust
pitchingselectionandworkloadasameansofinjuryprevention.