Run Word Count With Scala and Spark on HDInsight

Previously, we tried to solve a word count problem with a Scala and Spark approach.

The next step is to deploy our solution to HDInsight using Spark, HDFS, and Scala.

We shall provision a Spark cluster:


Since we are going to use HDInsight, we can use Azure storage as our HDFS-compatible file system:


Then, we choose our instance types:


We are ready to create the Spark cluster:


Our data shall be uploaded to the HDFS file system.

To do so, we will upload our text files to the Azure storage account that is integrated with HDFS.

For more information on managing a storage account with the Azure CLI, check the official guide. Any text file will work.

azure storage blob upload mytextfile.txt sparkclusterscala example/data/mytextfile.txt 

Since we use HDFS backed by Azure storage, we need to make a small change to the original script so that it reads from the wasb:// path:

// Read the uploaded file from the cluster's default storage (WASB)
val text = sc.textFile("wasb:///example/data/mytextfile.txt")
// Split each line into words, pair each word with 1, and sum the counts per word
val counts = text.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
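To make the transformations easier to follow, here is a hedged sketch of the same pipeline in plain Scala collections (no Spark required); `reduceByKey` is approximated by grouping by key and summing, and the object name `WordCountSketch` is illustrative only:

```scala
// A local sketch of the word-count pipeline using plain Scala collections.
// reduceByKey on an RDD is approximated here by groupBy + per-key sum.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // split every line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy(_._1)                      // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello spark hello", "spark on hdinsight"))
    println(counts) // e.g. Map(hello -> 2, spark -> 2, on -> 1, hdinsight -> 1)
  }
}
```

Unlike the Spark version, this runs eagerly on a single machine; on the cluster, the same three steps are distributed across partitions.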

Then, we can upload our Scala class to the head node using SSH:

scp WordCountscala.scala demon@{your cluster}:

Running the script is straightforward:

spark-shell -i WordCountscala.scala 

Once the task is done, we are presented with the Spark prompt, and we can save our results to the HDFS file system.

scala> counts.saveAsTextFile("/wordcount_results")

And do a quick check.

hdfs dfs -ls /wordcount_results
hdfs dfs -text /wordcount_results/part-00000
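Each part file contains one `(word,count)` tuple per line, rendered via the tuple's `toString` (for example `(spark,2)`). As a hedged illustration of that format, here is a small plain-Scala sketch that parses such a line back into a pair; the object name `PartFileLine` is hypothetical, and it assumes words contain no commas or parentheses:

```scala
// Parses a line of saveAsTextFile output, e.g. "(spark,2)", back into a pair.
// Assumes the word itself contains no commas or parentheses.
object PartFileLine {
  def parse(line: String): (String, Int) = {
    val inner = line.stripPrefix("(").stripSuffix(")") // drop the tuple parens
    val idx   = inner.lastIndexOf(',')                 // split at the last comma
    (inner.substring(0, idx), inner.substring(idx + 1).toInt)
  }
}
```

For instance, `PartFileLine.parse("(spark,2)")` yields `("spark", 2)`.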

And that's it!