The WorksAudit Book
ETL

ETL Operation

Running ETL

There are 2 ways to run the ETL process manually:

  1. Using the workflow. This processes everything with one command, but it can be costly because processes that are not needed may also be executed.
  2. Using crawlers and job execution. This takes more steps, but the run can be targeted so that only the required data is processed, which saves cost.

Running ETL Using Workflow

To run the ETL workflow for an environment (in the following example, the environment is dev), execute:

aws glue start-workflow-run --name wap-audit-log-processing-workflow-dev

The command above returns an ID called the “Run ID”. Replace dev with the name of the environment you want to run the ETL for.

To check the status of the run, pass the “Run ID” returned by the command above:

aws glue get-workflow-run --name wap-audit-log-processing-workflow-dev --run-id put-run-id-here
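The two commands above can be combined into a small helper that starts the workflow and waits for it to finish. This is a minimal sketch, not part of the original tooling: the function names `workflow_name` and `run_workflow` and the polling interval are assumptions, and the workflow name simply follows the pattern shown in the dev example. It relies on the `RunId` field returned by `start-workflow-run` and the `Run.Status` field returned by `get-workflow-run`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the default polling interval are assumptions.

# Build the workflow name for an environment, following the dev example.
workflow_name() {
  echo "wap-audit-log-processing-workflow-$1"
}

# Start the workflow for an environment and poll until it leaves a transient state.
# Usage: run_workflow <environment> [poll-interval-seconds]
run_workflow() {
  local env="$1" interval="${2:-30}" name run_id status
  name="$(workflow_name "$env")"

  # start-workflow-run returns the Run ID in the RunId field.
  run_id="$(aws glue start-workflow-run --name "$name" \
    --query RunId --output text)" || return 1
  echo "Run ID: $run_id"

  while true; do
    status="$(aws glue get-workflow-run --name "$name" --run-id "$run_id" \
      --query Run.Status --output text)" || return 1
    echo "Status: $status"
    case "$status" in
      RUNNING|STOPPING) sleep "$interval" ;;  # still in progress
      *) break ;;                             # COMPLETED, STOPPED or ERROR
    esac
  done

  # Succeed only if the workflow run itself completed.
  [ "$status" = "COMPLETED" ]
}
```

For example, `run_workflow dev 60` would start the dev workflow and check its status every 60 seconds. A status of COMPLETED only says that the workflow run finished, so it may still be worth checking the individual jobs for failures.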

Running ETL Using Crawlers and Job

The ETL currently supports 3 types of data source:

  1. hue: native HUE log files, sent by the copy Lambda on the HUE native side.
  2. ac: AC operation log table dumps, fetched by a regular EC2 batch job.
  3. integrated: WorksAudit protobuf-format data, sent through Firehose.

Running the ETL for a particular type of data source requires 3 steps:

  1. Run the data source crawler. This data source crawler is specific to each of
  2. Run the data source job.
  3. Run the output crawler.
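The three steps above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names (`run_crawler`, `run_job`, `run_etl_for_source`) are hypothetical, and the actual crawler and job names are not given here, so they are passed in as arguments. It uses `aws glue start-crawler`, `get-crawler`, `start-job-run` and `get-job-run`, and waits for each step to finish before starting the next.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, and the crawler and job names
# must be supplied by the caller because they are not listed in this page.

# Start a crawler and wait until it returns to the READY state.
run_crawler() {
  local name="$1" interval="${2:-30}" state
  aws glue start-crawler --name "$name" || return 1
  while true; do
    state="$(aws glue get-crawler --name "$name" \
      --query Crawler.State --output text)" || return 1
    [ "$state" = "READY" ] && break
    sleep "$interval"
  done
}

# Start a job and wait until it reaches a terminal state.
run_job() {
  local name="$1" interval="${2:-30}" run_id state
  run_id="$(aws glue start-job-run --job-name "$name" \
    --query JobRunId --output text)" || return 1
  while true; do
    state="$(aws glue get-job-run --job-name "$name" --run-id "$run_id" \
      --query JobRun.JobRunState --output text)" || return 1
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|STOPPED|TIMEOUT|ERROR|EXPIRED)
        echo "Job $name ended as $state" >&2
        return 1 ;;
    esac
    sleep "$interval"
  done
}

# Run the three steps in order, stopping at the first failure.
# Usage: run_etl_for_source <source-crawler> <job> <output-crawler> [interval]
run_etl_for_source() {
  local source_crawler="$1" job="$2" output_crawler="$3" interval="${4:-30}"
  run_crawler "$source_crawler" "$interval" &&
    run_job "$job" "$interval" &&
    run_crawler "$output_crawler" "$interval"
}
```

Chaining the steps with `&&` means the output crawler only runs when the job has succeeded, which avoids crawling partial output.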