This post is from Allwebr.com. Today we will talk about how to secure your Dataproc clusters with a custom VPC.
When creating Dataproc clusters to run open source big data tools such as Spark, Hadoop, or Presto, it is imperative that they are configured with well-defined security settings to prevent unauthorized use of and access to the cluster. Google Cloud offers several options for controlling the security settings on your cluster.
So let’s jump in and learn more about this. In this video, we’ll be going over how to secure your Dataproc clusters using a custom virtual private cloud, or VPC. A VPC is a managed networking service for your Google Cloud resources. Resources within your VPC are secured via isolation from the public internet. For more info on virtual private cloud, check out the video link in the description. We’ll start by setting up the cluster to allow the cluster nodes to connect to one another.
The cluster cannot distribute workloads to its nodes without this. We’ll also allow SSH access into the cluster. SSH is important for submitting workloads with tools such as spark-submit and for making modifications to long-running clusters. We can do this by creating a custom VPC with two firewall rules: one that allows the nodes of the cluster to communicate with each other, and another that enables SSH access into the Dataproc cluster. These rules will provide the Dataproc cluster with basic security features.
Next, we’ll create the cluster with our custom VPC. We’ll then test the functionality of the VPC by submitting a Spark job to the cluster using the Cloud SDK, and show how to SSH into the cluster. We’ll then delete the firewall rules one at a time to show that they were governing access to the cluster and the job. Without the firewall rules, we are locked out, and the cluster is effectively unusable. You can jump ahead to any of these steps by visiting the chapters in the YouTube description below.
Start by searching for VPC network in the search bar, and click on it. Next, click Create VPC Network. Provide a name for the VPC network.
Next, we’ll create a subnet using the Custom Subnet Creation mode. Provide a name for your new subnet. Select a region. In this case, we’ll pick us-central1. Provide an IP address range. We’ll go with 10.0.0.0/9. Leave the rest of the defaults as is, and press Create. The network should take about 20 seconds to finish creating. When it’s finished, click on it.
Click on Firewall Rules, then click Add Firewall Rule. We’ll first create a firewall rule that will allow the nodes of the cluster to communicate with each other. Provide a name for the rule. Make sure the name of your network is selected. Make sure the direction of traffic is set to Ingress.
For target tags, provide the name dataproc-network. For source IP ranges, input 10.0.0.0/9. For protocols and ports, check the box next to TCP, and enter the ports 0-65535 to allow access on all ports. Also, check the box next to UDP and input All. Press Create. You’ll see the new firewall rule here.
Now, let’s create our second firewall rule by again clicking Add Firewall Rule. This firewall rule will allow SSH access to the cluster. Provide a name for the rule. Make sure the name of your network is selected. For target tags, again provide the name dataproc-network. For source IP ranges, input 0.0.0.0/0. For protocols and ports, check the box next to TCP, and enter the port 22, which is the port used for SSH login. Press Create.
We’ll now create our Dataproc cluster configured with our VPC. In the search bar, type in Dataproc, and click on it. Next, click Create Cluster. Provide a name. Select the same region as your VPC. Leave everything else in this tab as is.
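Stepping back to the network setup for a moment: the VPC, subnet, and two firewall rules created in the console so far could also be scripted with the gcloud CLI. The following is only a sketch that builds the commands as argument lists and prints them, without running anything. The network and rule names (my-vpc-network-demo, cluster-coms, allow-ssh) are the ones that appear later in this walkthrough; the subnet name is a hypothetical placeholder, since the video never states it.

```python
import shlex

# Values taken from the walkthrough, except SUBNET, which is a
# hypothetical placeholder (the transcript never names the subnet).
NETWORK = "my-vpc-network-demo"
SUBNET = "my-subnet-demo"  # hypothetical name
REGION = "us-central1"
SUBNET_RANGE = "10.0.0.0/9"
TAG = "dataproc-network"


def network_commands():
    """Return gcloud commands (as argument lists) for the VPC, subnet, and rules."""
    return [
        # Custom-mode VPC: no subnets are created automatically.
        ["gcloud", "compute", "networks", "create", NETWORK,
         "--subnet-mode=custom"],
        # One subnet in the chosen region with the chosen range.
        ["gcloud", "compute", "networks", "subnets", "create", SUBNET,
         f"--network={NETWORK}", f"--region={REGION}",
         f"--range={SUBNET_RANGE}"],
        # Rule 1: tagged nodes accept all TCP and UDP ports from the subnet range.
        ["gcloud", "compute", "firewall-rules", "create", "cluster-coms",
         f"--network={NETWORK}", "--direction=INGRESS",
         f"--target-tags={TAG}", f"--source-ranges={SUBNET_RANGE}",
         "--allow=tcp:0-65535,udp"],
        # Rule 2: tagged nodes accept SSH (TCP port 22) from any address.
        ["gcloud", "compute", "firewall-rules", "create", "allow-ssh",
         f"--network={NETWORK}", "--direction=INGRESS",
         f"--target-tags={TAG}", "--source-ranges=0.0.0.0/0",
         "--allow=tcp:22"],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in network_commands():
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can be pasted into Cloud Shell once the names have been adjusted.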
Next, select Customize Cluster. Under Network Configuration, for the primary network dropdown, select your network, and for the subnetwork dropdown, select your subnet. Under Network Tags, enter the same tag you used earlier, dataproc-network. Everything else will be left as is. Go ahead and press Create. We see our cluster is provisioning, which will take about 90 seconds to finish. We’ll skip the video ahead here.
Now that our cluster has finished provisioning, let’s try submitting a job to it using the Dataproc jobs API. Click on the Cloud Shell button to open up your Cloud Shell. Now we’ll copy this command into the terminal to submit an example Spark job. In this case, we’ll use SparkPi, which computes an approximate value of pi. Great. This worked. We can see the success message in the output.
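The transcript does not show the exact command that was pasted into Cloud Shell, so the following is only a sketch of what the cluster creation and SparkPi submission might look like with the gcloud CLI. The cluster and subnet names are hypothetical placeholders, and the jar path is the examples jar location used in the Dataproc documentation, assumed here rather than taken from the video. As before, the commands are built and printed, not executed.

```python
import shlex

CLUSTER = "my-dataproc-demo"  # hypothetical cluster name
SUBNET = "my-subnet-demo"     # hypothetical subnet name
REGION = "us-central1"
TAG = "dataproc-network"

# Assumed location of the Spark examples jar on the cluster nodes.
EXAMPLES_JAR = "file:///usr/lib/spark/examples/jars/spark-examples.jar"


def create_cluster_command():
    """Build the command that creates the cluster on the custom subnet."""
    return ["gcloud", "dataproc", "clusters", "create", CLUSTER,
            f"--region={REGION}", f"--subnet={SUBNET}", f"--tags={TAG}"]


def spark_pi_command(slices=1000):
    """Build the command that submits the SparkPi example job."""
    return ["gcloud", "dataproc", "jobs", "submit", "spark",
            f"--cluster={CLUSTER}", f"--region={REGION}",
            "--class=org.apache.spark.examples.SparkPi",
            f"--jars={EXAMPLES_JAR}",
            # Everything after "--" is passed to the job itself.
            "--", str(slices)]


if __name__ == "__main__":
    print(shlex.join(create_cluster_command()))
    print(shlex.join(spark_pi_command()))
```

The network tag matters here: both firewall rules target dataproc-network, so a cluster created without that tag would not be covered by either rule.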
Now, we’ll go back to our VPC and remove the first firewall rule. Again, search for VPC network. Click on my-vpc-network-demo, and then firewall rules. Check the box next to the cluster-coms rule, then press Delete. Now, we’ll try submitting the Spark job again in our Cloud Shell. After about 60 seconds, it will fail. We’ll again skip the video ahead. As you can see, the cluster failed to execute the job, as it wasn’t able to communicate with the other nodes. This confirms that the cluster-coms rule was what allowed the nodes to communicate.
Go back to the Dataproc page by typing Dataproc into the search bar. Click on the cluster, then click the VM Instances tab. Next to the master node, press SSH.
We’re now SSH’ed into the cluster. Let’s see if removing the firewall rule for SSH prevents us from doing this again. Close this SSH window, and now let’s search for VPC network again. Click on my-vpc-network-demo, and then firewall rules. Check the box next to the allow-ssh rule, then press Delete. Go back to the Dataproc page by typing Dataproc into the search bar.
From the Clusters page, click on the cluster. Click on the VM Instances tab. Next to the master node, click SSH. This process should hang for about a minute or so before failing to connect. We’ll skip the video again here. This confirms that the allow-ssh firewall rule was working as well.
In this video, we created a VPC network with two firewall rules in place to limit access. We then created a Dataproc cluster configured with the VPC network to secure it via isolation from the public internet.
We then tested the functionality of the cluster and showed that this functionality disappears once the firewall rules are removed. You can read more about network configuration for your Dataproc cluster in the Google Cloud documentation.
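The behaviour seen in the two deletion tests can be reproduced in a toy model. The sketch below is not how Google Cloud evaluates firewall rules; it ignores target tags, priorities, and deny rules, and only captures the idea that ingress traffic is dropped unless some allow rule matches it. The rule values mirror the ones entered in the console; the sample addresses and port 8088 are arbitrary illustrations.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    name: str
    source_range: str  # CIDR block allowed as a source
    protocol: str      # "tcp" or "udp"
    ports: range       # destination ports covered by the rule


def is_allowed(rules, src_ip, protocol, port):
    """Return True if any allow rule matches; otherwise ingress is denied."""
    src = ip_address(src_ip)
    return any(
        src in ip_network(rule.source_range)
        and rule.protocol == protocol
        and port in rule.ports
        for rule in rules
    )


def without(rules, name):
    """Return the rule set with every entry called `name` removed."""
    return [rule for rule in rules if rule.name != name]


# The two console rules; cluster-coms is split into its TCP and UDP parts.
RULES = [
    IngressRule("cluster-coms", "10.0.0.0/9", "tcp", range(0, 65536)),
    IngressRule("cluster-coms", "10.0.0.0/9", "udp", range(0, 65536)),
    IngressRule("allow-ssh", "0.0.0.0/0", "tcp", range(22, 23)),
]

if __name__ == "__main__":
    node, outsider = "10.0.0.5", "203.0.113.7"  # illustrative addresses
    print("node to node, both rules:", is_allowed(RULES, node, "tcp", 8088))
    no_coms = without(RULES, "cluster-coms")
    print("node to node, cluster-coms deleted:",
          is_allowed(no_coms, node, "tcp", 8088))
    print("outside SSH, allow-ssh present:",
          is_allowed(no_coms, outsider, "tcp", 22))
    print("outside SSH, allow-ssh deleted:",
          is_allowed(without(no_coms, "allow-ssh"), outsider, "tcp", 22))
```

In this model, deleting cluster-coms leaves SSH reachable but blocks node-to-node traffic, and deleting allow-ssh afterwards blocks everything, which matches the order of failures shown in the video.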