Thursday 5 November 2020

Spark submit Scala jar using Boto3

When I spin up an EMR cluster manually in the AWS console, I run the following after SSH'ing into my cluster:

spark-submit --master yarn-cluster --deploy-mode cluster --class spark_pkg.SparkMain \
    s3://mybucket/scala-1.0.jar -s arg1 -l arg2

How do I do this when using Boto3 in Python? Here is my steps definition:

steps = [
    {
        'Name': 'Running jar file',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 's3://mybucket/{0}'.format(jar_file),
            'Args': ['spark-submit', '--master yarn-cluster',
                     '--deploy-mode cluster', '--class spark_pkg.SparkMain',
                     '-s', arg1, '-l', arg2
            ]
        }
    }
]
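
For reference, I submit this list to an already-running cluster roughly like this (the region and cluster ID below are placeholders, not my real values):

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # placeholder region

# Attach the step to the running cluster; JobFlowId is the cluster ID.
response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster ID
    Steps=steps,
)
print(response['StepIds'])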

It looks like these arguments are incorrect: 'spark-submit', '--master yarn-cluster', '--deploy-mode cluster', '--class spark_pkg.SparkMain'. Judging by the usage message in the error output, they are being passed straight to my Scala program's own argument parser (which only understands -s and -l) rather than being interpreted by spark-submit.

How can I correctly define those arguments? The error I am getting is below.

Error: Unknown argument 'spark-submit'
Error: Unknown option --master yarn-cluster
Error: Unknown option --deploy-mode cluster
Error: Unknown option --class spark_pkg.SparkMain
Usage: spark-zoning [options]

  -l, --id1 <value>  
  -s --id2 <value>
Exception in thread "main" scala.MatchError: None (of class scala.None$)
    at spark_pkg.SparkMain$.main(SparkMain.scala:208)
    at spark_pkg.SparkMain.main(SparkMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
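
Update: I think I see what is happening. Because 'Jar' points at my application jar, EMR runs it directly with hadoop jar, so spark-submit is never invoked and all of its flags land in SparkMain's own parser (the scala.MatchError: None presumably just means the parse failed and my match on the parser's result does not handle None). The usual EMR pattern is to launch spark-submit through command-runner.jar, with every token split into its own list element; yarn-cluster is also the deprecated spelling of --master yarn --deploy-mode cluster. An untested sketch of what I believe the corrected step should look like:

steps = [
    {
        'Name': 'Running jar file',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            # command-runner.jar lets the step run spark-submit itself;
            # the application jar URI moves into Args.
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--master', 'yarn',
                '--deploy-mode', 'cluster',
                '--class', 'spark_pkg.SparkMain',
                's3://mybucket/{0}'.format(jar_file),
                '-s', arg1,
                '-l', arg2,
            ],
        },
    }
]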


