ETL
There are 2 ways to run the ETL process manually:
- Using the workflow. This processes everything in one command, but can be costly because unneeded steps may be executed along the way.
- Using crawlers and job execution directly. This requires more steps, but the process can be targeted so that only the required data is processed, which saves some cost.
To run the ETL workflow for an environment (in the following examples, the environment is dev), execute:
aws glue start-workflow-run --name wap-audit-log-processing-workflow-dev
The command above returns an ID called the “Run ID”. Replace dev with the name of the environment whose workflow you want to run.
To check the status of the run, use the “Run ID” returned by the command above:
aws glue get-workflow-run --name wap-audit-log-processing-workflow-dev --run-id put-run-id-here
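The start-and-poll sequence above can be wrapped in a small shell helper. This is a sketch: it assumes the AWS CLI is configured for the target account, and the helper name run_etl_workflow is our own invention; only the workflow name comes from this document.

```shell
# Sketch: start the ETL workflow for an environment and poll until it finishes.
run_etl_workflow() {
  env="$1"
  workflow="wap-audit-log-processing-workflow-${env}"
  # Start a workflow run and capture the Run ID it returns
  run_id=$(aws glue start-workflow-run --name "$workflow" \
      --query RunId --output text)
  echo "Started run: $run_id"
  # Poll until the run reaches a terminal state
  while :; do
    status=$(aws glue get-workflow-run --name "$workflow" \
        --run-id "$run_id" --query Run.Status --output text)
    echo "Status: $status"
    case "$status" in
      COMPLETED|STOPPED|ERROR) break ;;
    esac
    sleep 30
  done
}
# Usage: run_etl_workflow dev
```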
There are 3 types of data source currently supported by the ETL:
- hue: native HUE log files sent by the copy Lambda on the HUE native side.
- ac: AC operation log table dumps, fetched by a scheduled EC2 batch job.
- integrated: WorksAudit protobuf records sent through Firehose.
Running the ETL for a particular type of data source requires 3 steps:
- Run the data source crawler. Each data source type has its own crawler.
- Run the data source job.
- Run the output crawler.
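The three steps above can be chained from the CLI. A sketch follows; note that the crawler and job names are assumptions made up for illustration (this document does not list them), so substitute the real resource names from the Glue console before use.

```shell
# Wait until a crawler is idle again (Glue reports READY when it is not running).
# In practice you may need a short delay before polling, since the state only
# leaves READY once the crawler actually starts.
wait_for_crawler() {
  until [ "$(aws glue get-crawler --name "$1" \
        --query Crawler.State --output text)" = "READY" ]; do
    sleep 15
  done
}

# Sketch: run crawler -> job -> output crawler for one data source type.
run_source_etl() {
  src="$1"
  env="$2"
  # NOTE: hypothetical resource names; replace with the actual crawler/job names.
  src_crawler="wap-audit-log-${src}-crawler-${env}"
  job_name="wap-audit-log-${src}-job-${env}"
  out_crawler="wap-audit-log-output-crawler-${env}"

  # Step 1: run the data source crawler and wait for it to finish
  aws glue start-crawler --name "$src_crawler"
  wait_for_crawler "$src_crawler"

  # Step 2: run the ETL job for this data source and wait for completion
  run_id=$(aws glue start-job-run --job-name "$job_name" \
      --query JobRunId --output text)
  while :; do
    state=$(aws glue get-job-run --job-name "$job_name" --run-id "$run_id" \
        --query JobRun.JobRunState --output text)
    case "$state" in
      SUCCEEDED|FAILED|STOPPED|TIMEOUT|ERROR) break ;;
    esac
    sleep 30
  done
  echo "Job state: $state"

  # Step 3: run the output crawler so the new output is catalogued
  aws glue start-crawler --name "$out_crawler"
  wait_for_crawler "$out_crawler"
}
# Usage: run_source_etl hue dev
```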