Integrated ETL runtime (#657459)
ETL transformations and jobs which were created in Pentaho Data Integration can now be managed and executed in ConSol CM. The solution consists of two parts:
- New ETL application, called ETL Runner
- Administration interface integrated into the Web Admin Suite
ETL Runner application
The ETL Runner application can either be deployed in the same application server as ConSol CM or run as a standalone Java application on the same or another machine.
- Overlay deployment: etl-runner-application-package-war.war
- Standalone deployment: etl-runner-application-package-app.jar
ETL Runner communicates with ConSol CM using a REST API.
ETL page in the Web Admin Suite
The Staging menu of the Web Admin Suite now includes the ETL page. The page consists of three tabs:
- Connection: Enter the URL of the ETL Runner application.
- Tasks: Lists all existing ETL tasks, i.e. all configurations which have been created for scheduling transformations or jobs. In the details of a task, you can configure the task and view its execution details.
- Files: Lists the content of the ETL workspace, i.e. the directory where all the required files are saved. This directory is placed on the machine where the ETL Runner application is executed.
Connection tab
In the Connection tab, you need to enter the URL to ETL Runner and the secret of ETL Runner. The URL depends on the deployment mode. Examples:
- Overlay deployment: http://localhost:8888/etl-runner
- Standalone deployment: http://localhost:8080
The secret must match the value of the application.secret setting in the etlRunnerApplication.properties file.
In addition, the path to the workspace on the ETL Runner machine is displayed read-only. It is configured in the application.workspace.directory setting of the etlRunnerApplication.properties file.
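The two settings mentioned above can be illustrated with a minimal Java sketch that reads them from properties text and compares a secret entered in the Connection tab with the configured one. Only the keys application.secret and application.workspace.directory come from this document; the class name EtlRunnerSettings, its methods and all sample values are hypothetical and do not describe how ETL Runner itself reads its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of ETL Runner: holds the two settings
// from etlRunnerApplication.properties that the Connection tab relies on.
final class EtlRunnerSettings {
    final String secret;
    final String workspaceDirectory;

    private EtlRunnerSettings(String secret, String workspaceDirectory) {
        this.secret = secret;
        this.workspaceDirectory = workspaceDirectory;
    }

    // Parses properties text and extracts the two documented keys.
    static EtlRunnerSettings parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new EtlRunnerSettings(
                props.getProperty("application.secret"),
                props.getProperty("application.workspace.directory"));
    }

    // True if the secret entered in the Web Admin Suite equals the configured one.
    boolean matchesSecret(String enteredSecret) {
        return secret != null && secret.equals(enteredSecret);
    }
}
```

With the sample text "application.secret=change-me" and "application.workspace.directory=/opt/etl/workspace" on separate lines, matchesSecret("change-me") returns true and workspaceDirectory holds the path that the Connection tab would display.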
Tasks tab
The Tasks tab shows a list of all ETL tasks. The task details consist of two tabs:
- Execution details: Shows the parameters of the next planned execution and the currently running / last execution. In addition, the results and logs of the last execution are displayed. If the options Gather metrics and Gather step performance are enabled in the configuration, metrics and performance details of the last execution are shown as well.
All the data is shown in its raw JSON or text format.
- Configuration: Allows you to configure the task. The configuration is saved in JSON format to the main workspace directory. The file name is composed as follows: etl-runner.<TASK_NAME>.json.
You need to provide a name for the task and choose the transformation or job which should be executed. .ktr and .kjb files which are saved in the workspace are suggested automatically. If you choose to upload a new file, it is saved to the main workspace directory. Alternatively, you can enter an absolute path to a file on the ETL Runner machine.
You must configure the scheduling, i.e. when the transformation or job will be executed. Execution can be triggered manually by clicking the Start icon, once at a defined point in time, or periodically. The periodic execution is defined using cron expressions. You can either enter a complete expression or edit the parts of the expression separately. In addition, you can show examples of cron expressions and view the next execution dates.
In addition, you can provide settings for logging, database access and execution, and add parameters and variables.
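The naming rule for task configuration files can be sketched in Java as follows. The pattern etl-runner.<TASK_NAME>.json is taken from this document; the class TaskConfigFiles and the sample task name are hypothetical, and the sketch assumes that the task name is inserted verbatim, which may not hold if ETL Runner sanitizes special characters.

```java
import java.nio.file.Path;

// Hypothetical helper illustrating the documented file name pattern.
final class TaskConfigFiles {

    // Builds the configuration file name for a task: etl-runner.<TASK_NAME>.json
    static String fileNameFor(String taskName) {
        return "etl-runner." + taskName + ".json";
    }

    // Resolves the configuration file inside the main workspace directory.
    static Path locate(Path workspaceDirectory, String taskName) {
        return workspaceDirectory.resolve(fileNameFor(taskName));
    }
}
```

For a task named nightly-import, fileNameFor returns etl-runner.nightly-import.json, which is the file to look for in the Files tab when reviewing or backing up a task configuration.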
Files tab
The page consists of a table which acts as a file browser. The users can browse the folders and upload and download files. The table shows the type of object (file or folder), name, extension (only files), size and last modification date.
The user can double-click to open a folder and navigate the directory. Above the table, the navigation path with links is shown. Next to the path, there is a Refresh icon to refresh the table. In addition, the user can navigate with the browser back button. When entering a keyword in the search field, the table is filtered accordingly. Links to search results in other locations are displayed above the table.
Text files which are smaller than 64 KB can be edited directly by double-clicking them. The content of larger files is cut off at 64 KB and editing is disabled to avoid performance problems due to loading large files in the browser. The users can download the complete workspace by clicking the Download all button. Individual files or folders can be downloaded by clicking the Download icon in the respective rows. Folders are downloaded as ZIP files.
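To check in advance whether a local text file would remain editable in the Files tab after upload, the size rule can be expressed as a small Java sketch. The 64 KB edit threshold comes from this document; the class EditableCheck is hypothetical, and the sketch assumes that 1 KB means 1024 bytes, which the document does not state.

```java
// Hypothetical helper mirroring the documented edit threshold of the Files tab.
final class EditableCheck {

    // Assumption: 1 KB = 1024 bytes, so the limit is 65536 bytes.
    static final long EDIT_LIMIT_BYTES = 64L * 1024L;

    // Files strictly smaller than the limit are expected to be editable.
    static boolean isEditable(long sizeInBytes) {
        return sizeInBytes < EDIT_LIMIT_BYTES;
    }
}
```

The size of a local file can be obtained with java.nio.file.Files.size(path) and passed to isEditable.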
The users can upload files to the current path by clicking the Upload file button. It is possible to upload a ZIP file containing several files and / or folders. It will be unpacked automatically.
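A ZIP file for such an upload can be created with any archiving tool. As one option, the following Java sketch packs a local folder with transformations, jobs and input files into a ZIP archive while keeping the relative folder structure. The class WorkspaceZip and its method are hypothetical and use only the standard java.util.zip API; empty folders are not included, and the target ZIP file should be located outside the source folder.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: packs a local folder into a ZIP file for upload.
final class WorkspaceZip {

    static void zipFolder(Path sourceDir, Path zipFile) throws IOException {
        // Collect all regular files below the source folder first.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // ZIP entry names use forward slashes on every platform.
                String entryName = sourceDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(entryName));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }
}
```

The resulting archive can then be selected with the Upload file button.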
New files or folders can be created in the current path by clicking the New file or New directory button. The user can rename files or folders by clicking the Rename icon in the respective row.
Migrating to the new ETL runtime
The following steps are required to start using the new ETL runtime:
- Install the ETL Runner application.
- Connect the Web Admin Suite to the ETL Runner application to prepare a workspace directory.
- Upload the transformations and jobs to the workspace.
- Create a task for each transformation or job. The .ktr and .kjb files from the workspace are suggested automatically in the Path to the transformation or job file field. A JSON file with the task configuration will be saved in the workspace.
- Upload the files needed by the transformation to the workspace, e.g. input files.
The usage of the workspace is recommended but not mandatory. If you want to save your files in other paths, you can provide absolute paths during the task creation. The paths must exist on the file system of the ETL Runner application.
To use JNDI for connecting to external databases, you can save the jdbc.properties file from the simple-jndi folder of Pentaho Data Integration to the ETL workspace. The required drivers need to be saved in the path configured in application.libs.directory of the etlRunnerApplication.properties file. In addition, there need to be entries with the drivers’ names and checksums starting with application.libs.checksums. Entries for the drivers of the databases supported by ConSol CM are present in the default configuration. Note that ETL Runner must be able to access the database URL at runtime.
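When adding a driver which is not covered by the default configuration, its checksum has to be calculated first. The following Java sketch computes a hexadecimal digest of a driver file. This document does not state which checksum algorithm ETL Runner expects or how the key after application.libs.checksums is composed, so SHA-256 is an assumption here, and the class DriverChecksum is hypothetical. Compare the result format with the default entries in etlRunnerApplication.properties before relying on it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes a checksum for a JDBC driver file.
// Assumption: SHA-256; the algorithm expected by ETL Runner is not documented here.
final class DriverChecksum {

    // Returns the lowercase hexadecimal SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(data);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Convenience overload which reads the driver file from disk.
    static String sha256Hex(Path driverFile) throws IOException {
        return sha256Hex(Files.readAllBytes(driverFile));
    }
}
```

If the default entries use a different digest length or notation, the algorithm name passed to MessageDigest.getInstance needs to be adjusted accordingly.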
Backwards compatibility
If you do not want to use the new ETL runtime, you can still use the previous approach of installing Pentaho Data Integration on the machine of the ConSol CM server and scheduling ETL transformations and jobs using other mechanisms, e.g. by creating executable files which are run automatically by operating system tools.