
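// Configure an ML pipeline
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, NGram, RegexTokenizer, StopWordsRemover}

// Cleaner is not part of Spark ML; it is assumed to be a custom Transformer
// defined elsewhere, as are nGramGranularity, numTextFeatures, and train.
val cleaner = new Cleaner()
  .setInputCol("content")
  .setOutputCol("cleaned")
// Split the cleaned text on non-word characters
val tokenizer = new RegexTokenizer()
  .setInputCol(cleaner.getOutputCol)
  .setOutputCol("words")
  .setPattern("\\W")
val remover = new StopWordsRemover()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("filtered")
// Build n-grams from the remaining tokens
val ngram = new NGram()
  .setN(nGramGranularity)
  .setInputCol(remover.getOutputCol)
  .setOutputCol("ngram")
// Hash each n-gram into a fixed-size term-frequency vector
val hashingTF = new HashingTF()
  .setInputCol(ngram.getOutputCol)
  .setOutputCol("keys")
  .setNumFeatures(numTextFeatures)
// Rescale term frequencies by inverse document frequency
val idf = new IDF()
  .setInputCol(hashingTF.getOutputCol)
  .setOutputCol("features")
val pipeline = new Pipeline()
  .setStages(Array(cleaner, tokenizer,
    remover, ngram, hashingTF, idf))
// Fit the pipeline
val model = pipeline.fit(train)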
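
To make the data flow through these stages concrete, the following is a minimal plain-Scala sketch of the same steps (tokenize, remove stop words, build n-grams, hash term frequencies, apply IDF) on in-memory strings, without Spark. It is an illustration under stated assumptions, not the pipeline's implementation: the object name TextFeatureSketch, its helper functions, and the tiny stop-word set are hypothetical, the custom Cleaner stage is omitted, and the bucket function uses Scala's MurmurHash3.stringHash, so the indices will generally differ from those produced by Spark's HashingTF. The IDF weight uses the smoothed form log((m + 1) / (df + 1)) documented for Spark's IDF, where m is the number of documents and df the number of documents containing the term.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper object; names are illustrative, not part of Spark.
object TextFeatureSketch {
  // Assumption: a tiny stop-word set, much smaller than StopWordsRemover's default list.
  val stopWords: Set[String] = Set("the", "a", "an", "of", "and", "to")

  // Lowercase, split on single non-word characters, drop empty tokens.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W").filter(_.nonEmpty).toSeq

  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)

  // Join every run of n consecutive tokens with a space; no output if fewer than n tokens.
  def nGrams(tokens: Seq[String], n: Int): Seq[String] =
    if (tokens.length < n) Seq.empty
    else tokens.sliding(n).map(_.mkString(" ")).toSeq

  // Hashing trick: map a term to a bucket in [0, numFeatures).
  def bucket(term: String, numFeatures: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(term), numFeatures)

  // Sparse term-frequency vector: bucket index -> number of terms hashed there.
  def hashingTF(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.groupBy(bucket(_, numFeatures))
      .map { case (idx, ts) => idx -> ts.size.toDouble }

  // Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1)).
  def idf(docs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = docs.size.toDouble
    docs.flatMap(_.keys).groupBy(identity)
      .map { case (idx, occ) => idx -> math.log((m + 1) / (occ.size + 1)) }
  }

  // Run all steps over a small corpus and return one sparse TF-IDF vector per document.
  def featurize(corpus: Seq[String], n: Int, numFeatures: Int): Seq[Map[Int, Double]] = {
    val tf = corpus.map(d => hashingTF(nGrams(removeStopWords(tokenize(d)), n), numFeatures))
    val weights = idf(tf)
    tf.map(_.map { case (idx, count) => idx -> count * weights(idx) })
  }
}
```

For example, TextFeatureSketch.featurize(Seq("policy diffusion across states", "policy diffusion in cities"), 2, 1024) returns two sparse vectors in which the bucket for the shared bigram "policy diffusion" has weight zero, because an n-gram that occurs in every document carries no discriminating information under this IDF.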