ETL
There are 2 ways to run the ETL process manually:
- Using the workflow. This processes everything in one command, but can be costly because unneeded steps may be executed along the way.
- Using crawlers and job execution directly. This requires more steps, but the process can be targeted so that only the required data is processed, which saves some cost.
To run the ETL workflow for an environment (in the following examples, the environment is dev), execute:
aws glue start-workflow-run --name wap-audit-log-processing-workflow-dev
The command above returns an ID called the “Run ID”. Replace dev with the name of the environment whose workflow you want to run.
To check the status of the run, use the “Run ID” returned by the command above:
aws glue get-workflow-run --name wap-audit-log-processing-workflow-dev --run-id put-run-id-here
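The start-and-poll sequence above can be wrapped in a small shell helper. This is a sketch: it assumes the AWS CLI is configured for the target account, and the helper name run_etl_workflow is our own invention; only the workflow name comes from this document.

```shell
# Sketch: start the ETL workflow for an environment and poll until it finishes.
run_etl_workflow() {
  env="$1"
  workflow="wap-audit-log-processing-workflow-${env}"
  # Start a workflow run and capture the Run ID it returns
  run_id=$(aws glue start-workflow-run --name "$workflow" \
      --query RunId --output text)
  echo "Started run: $run_id"
  # Poll until the run reaches a terminal state
  while :; do
    status=$(aws glue get-workflow-run --name "$workflow" \
        --run-id "$run_id" --query Run.Status --output text)
    echo "Status: $status"
    case "$status" in
      COMPLETED|STOPPED|ERROR) break ;;
    esac
    sleep 30
  done
}
# Usage: run_etl_workflow dev
```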
There are 3 types of data source currently supported by the ETL:
- hue: native HUE log files sent by the copy Lambda on the HUE native side.
- ac: AC operation log table dumps, fetched by a scheduled EC2 batch job.
- integrated: WorksAudit protobuf records sent through Firehose.
Running the ETL for a particular type of data source requires 3 steps:
- Run the data source crawler. Each data source type has its own crawler.
- Run the data source job.
- Run the output crawler.
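The three steps above can be chained from the CLI. A sketch follows; note that the crawler and job names are assumptions made up for illustration (this document does not list them), so substitute the real resource names from the Glue console before use.

```shell
# Wait until a crawler is idle again (Glue reports READY when it is not running).
# In practice you may need a short delay before polling, since the state only
# leaves READY once the crawler actually starts.
wait_for_crawler() {
  until [ "$(aws glue get-crawler --name "$1" \
        --query Crawler.State --output text)" = "READY" ]; do
    sleep 15
  done
}

# Sketch: run crawler -> job -> output crawler for one data source type.
run_source_etl() {
  src="$1"
  env="$2"
  # NOTE: hypothetical resource names; replace with the actual crawler/job names.
  src_crawler="wap-audit-log-${src}-crawler-${env}"
  job_name="wap-audit-log-${src}-job-${env}"
  out_crawler="wap-audit-log-output-crawler-${env}"

  # Step 1: run the data source crawler and wait for it to finish
  aws glue start-crawler --name "$src_crawler"
  wait_for_crawler "$src_crawler"

  # Step 2: run the ETL job for this data source and wait for completion
  run_id=$(aws glue start-job-run --job-name "$job_name" \
      --query JobRunId --output text)
  while :; do
    state=$(aws glue get-job-run --job-name "$job_name" --run-id "$run_id" \
        --query JobRun.JobRunState --output text)
    case "$state" in
      SUCCEEDED|FAILED|STOPPED|TIMEOUT|ERROR) break ;;
    esac
    sleep 30
  done
  echo "Job state: $state"

  # Step 3: run the output crawler so the new output is catalogued
  aws glue start-crawler --name "$out_crawler"
  wait_for_crawler "$out_crawler"
}
# Usage: run_source_etl hue dev
```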