Up and Running: Data Engineering on the Google Cloud Platform
The completely free E-Book for setting up and running a Data Engineering stack on Google Cloud Platform.
NOTE: This book is currently incomplete. If you find errors or would like to fill in the gaps, read the Contributions section.
Table of Contents
Preface
Chapter 1: Setting up a GCP Account
Chapter 2: Setting up Batch Processing Orchestration with Composer and Airflow
Chapter 3: Building a Data Lake with Google Cloud Storage (GCS)
Chapter 4: Building a Data Warehouse with BigQuery
Chapter 5: Setting up DAGs in Composer and Airflow
Chapter 6: Setting up Event-Triggered Pipelines with Cloud Functions
Chapter 7: Parallel Processing with Dataproc and Spark
Chapter 8: Streaming Data with Pub/Sub
Chapter 9: Managing Credentials with Google Secret Manager
Chapter 10: Infrastructure as Code with Terraform
Chapter 11: Deployment Pipelines with Cloud Build
Chapter 12: Monitoring and Alerting
Chapter 13: Up and Running - Building a Complete Data Engineering Infrastructure
Appendix A: Example Code Repository
Chapter 9: Managing Credentials with Google Secret Manager
There will likely be times when you need to give your data pipelines access to credentials. For GCP resources we can manage access through permissions on our service accounts, but your pipelines will often need to reach systems outside of GCP. Google Secret Manager lets us securely store passwords and other secret information for exactly these cases.
In this chapter I will show you how to create and access secrets using Google Secret Manager. Then we'll create an Airflow DAG that simulates making an HTTP request to a secured web API and saving the results to GCS.
Creating a Secret
The first thing we need to do is enable the Google Secret Manager service:
> gcloud services enable secretmanager.googleapis.com
Creating a secret is quite simple. Here I'm creating a secret called "source-api-password" that contains the value "abc123":
> echo -n "abc123" | gcloud secrets create source-api-password --data-file=-
We can also create a secret where the value is the contents of a file:
> gcloud secrets create source-api-password-2 --data-file=my-password.txt
In Chapter 10: Infrastructure as Code with Terraform we'll discuss how to manage secrets with Terraform. While it is good practice to create secrets with your Infrastructure as Code solution, the values of those secrets still need to be added manually to ensure they are not saved in your code repositories.
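For example, once your Infrastructure as Code solution has created an empty secret, you can add its value as a new version from the command line (here "abc123" stands in for your real password):
> echo -n "abc123" | gcloud secrets versions add source-api-password --data-file=-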
Accessing a Secret
Accessing a secret is just as easy as creating it:
> gcloud secrets versions access latest --secret=source-api-password
abc123
Each value stored in a secret has a particular "version", so you can update a secret and still access its older values. In the code above we ask for the "latest" version, but we could also have specified the ID of a specific version.
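For example, you can list a secret's versions and then access an older value by its numeric ID (version 1 here is the value we stored when we created the secret):
> gcloud secrets versions list source-api-password
> gcloud secrets versions access 1 --secret=source-api-password
abc123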
While you will likely set secret values through the command line, you will most often access them from within your code:
from google.cloud import secretmanager

def get_secret(project, secret_name, version):
    client = secretmanager.SecretManagerServiceClient()
    # build the fully qualified path: projects/{project}/secrets/{secret}/versions/{version}
    secret_path = client.secret_version_path(project, secret_name, version)
    secret = client.access_secret_version(name=secret_path)
    return secret.payload.data.decode("UTF-8")

project = "de-book-dev"
secret_name = "source-api-password"
version = "latest"
plaintext_secret = get_secret(project, secret_name, version)
Using Google Secret Manager in Airflow
Below is an example of how you might typically use Google Secret Manager to access secrets within your DAG for getting data from a web API. The web API in this example doesn't exist (so the DAG won't work if you run it unmodified in your own Airflow instance), but the code below shows a common type of DAG that requires accessing a secret.
import base64
import datetime
import json

import requests
from google.cloud import secretmanager

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'DE Book',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': datetime.timedelta(seconds=30),
    'start_date': datetime.datetime(2020, 10, 17),
}
dag = DAG(
    'download_from_web_api',
    schedule_interval="0 0 * * *",  # run every day at midnight UTC
    max_active_runs=1,
    catchup=False,
    default_args=default_args
)
def get_secret(project, secret_name, version='latest'):
    client = secretmanager.SecretManagerServiceClient()
    secret_path = client.secret_version_path(project, secret_name, version)
    secret = client.access_secret_version(name=secret_path)
    return secret.payload.data.decode('UTF-8')
def generate_credentials():
    username = 'my_username'
    password = get_secret('my_project', 'source-api-password')
    # base64-encode "username:password" for an HTTP Basic Authorization header
    credentials = base64.b64encode(f'{username}:{password}'.encode()).decode()
    return f'Basic {credentials}'
def download_data_to_local():
    url = 'https://www.example.com/api/source-endpoint'
    credentials = generate_credentials()
    request_headers = {"Accept": "application/json",
                       "Content-Type": "application/json",
                       "Authorization": credentials}
    export_json = {
        "exportType": "Full_Data",
    }
    response = requests.post(url=url, json=export_json, headers=request_headers)
    response.raise_for_status()  # fail the task (and trigger retries) on a bad HTTP status
    data = response.json()
    # Composer automatically maps "/home/airflow/gcs/data/" to a bucket, so it can be treated as a local directory
    with open('/home/airflow/gcs/data/data.json', 'w') as f:
        json.dump(data, f)
t_download_data_to_local = PythonOperator(
    task_id='download_data_to_local',
    python_callable=download_data_to_local,
    dag=dag
)
t_copy_data_to_gcs = BashOperator(
    task_id='copy_data_to_gcs',
    bash_command='gsutil cp /home/airflow/gcs/data/data.json gs://my-bucket/web-api-files/',
    dag=dag
)
t_copy_data_to_gcs.set_upstream(t_download_data_to_local)
Cleaning Up
Google Secret Manager is quite cheap: each secret version costs 6 cents per month, plus 3 cents for every 10,000 accesses. Nonetheless, let's clean up what we're not using.
> gcloud secrets list
NAME                 CREATED              REPLICATION_POLICY  LOCATIONS
source-api-password  2021-01-13T04:07:31  automatic           -
> gcloud secrets delete source-api-password
Next Chapter: Chapter 10: Infrastructure as Code with Terraform