Batch Jobs
Step-by-step guide to creating, deploying and monitoring Batch Jobs in Datatailr
Understanding Batch Jobs
Batch Jobs are essential for automating tasks that can be executed without user interaction, making them ideal for data processing, automated testing, and other background tasks.
Datatailr allows you to execute scripts as Batch Jobs according to a schedule or trigger. These jobs can perform a wide range of tasks – from data analysis to sending out automated reports.
Key Concepts
- Task: A single job, defined by an image and a `__batch_main__` entrypoint.
- Batch Job: A set of tasks (jobs) that can be scheduled in Datatailr to execute automatically as a background process, without user interaction.
- Scheduling: Batch Jobs can be scheduled to run at specific times or in response to specific events.
- Entrypoint: A function with a specific signature that will be called at job runtime. Signature for Batch Jobs:

```python
def __batch_main__(sub_job_name, scheduled_time, runtime, part_num, num_parts, job_config, rundate, *args):
    # logic of your job
    return
```
- job_config: An arbitrary JSON object with extra arguments that is passed to jobs at runtime. It allows you to customize the behavior and execution of each job's `__batch_main__` entrypoint. For example, it can include a country code for which the job will process data, or a set of specific hyperparameters for an ML model (see the sketch after this list).
- Package: A directory with your code that is packaged using Package Builder and stored in the internal PyPi server. Packages can contain one or more entrypoints of type A (App), B (Batch Job), E (Excel Add-in), J (Jupyter Notebook), T (Tests).
- Image: Each job in a batch runs in its own isolated Docker container, created from a specified image. Jobs in a batch can use different images. An image contains all entrypoints from the internal packages included in it. If you want to use an image to create a Batch Job, the image has to include a batch entrypoint.
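For illustration, here is a minimal sketch of how an entrypoint might read values out of job_config, assuming it arrives as a Python dict. The keys (country, learning_rate) are hypothetical examples, not part of the Datatailr API:

```python
def __batch_main__(sub_job_name, scheduled_time, runtime, part_num, num_parts, job_config, rundate, *args):
    # job_config is the JSON supplied when the job is scheduled ('Extra Arguments' in the GUI,
    # ConfigurationJson in the scheduler API); the keys below are hypothetical examples.
    country = job_config.get('country', 'US')
    learning_rate = job_config.get('learning_rate', 0.01)
    print(f'Processing data for {country} on {rundate} with learning rate {learning_rate}')
    return
```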
Step 1: Preparing Your Batch Job Script
Before deploying a batch job, you’ll need a script that defines what the job will do.
Creating a Script
- Navigate to the cloned repository in your Datatailr IDE.
- Create a new Python script file for your batch job – for example, `my_batch.py`. Alternatively, open an existing file that you want to run as a task in a batch job.
Defining Batch Entrypoint
Define a `__batch_main__` entrypoint:
```python
def __batch_main__(sub_job_name, scheduled_time, runtime, part_num, num_parts, job_config, rundate, *args):
    # logic of your job
    return
```
Fill in the `__batch_main__` function according to what you want that job to do – for example, write code to process a dataset.
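As a concrete illustration, here is a minimal sketch of a filled-in entrypoint that processes a dataset. The pandas dependency, the input_path key in job_config and the amount column are assumptions made for this example, not part of the Datatailr API:

```python
import pandas as pd  # assuming pandas is available in the image used for the job

def __batch_main__(sub_job_name, scheduled_time, runtime, part_num, num_parts, job_config, rundate, *args):
    # Hypothetical example: sum up the 'amount' column of a CSV whose path comes from job_config.
    input_path = job_config.get('input_path', '/data/sales.csv')
    df = pd.read_csv(input_path)
    total = df['amount'].sum()
    print(f'{sub_job_name}: total amount for {rundate} is {total}')
    return total
```

Returning a value is optional; the next section shows how downstream jobs can consume it.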
Passing *args to Jobs
Your job can also return a result:

```python
def __batch_main__(sub_job_name, scheduled_time, runtime, part_num, num_parts, job_config, rundate, *args):
    # logic of your job
    return result
```
This result will be passed via *args to downstream dependencies of the Job, which you can define when scheduling your Batch Job to run.
In a dependent Job which receives the execution result from an upstream Job, you can get it as follows:

```python
def __batch_main__(sub_job_name, scheduled_time, runtime, part_num, num_parts, job_config, rundate, *args):
    # read the result of the Job 1
    result = args[0]
    # logic of your job
    return
```
Testing the Batch Job
To be able to test a batch job in your IDE before deploying it, add the following block at the end of the file:
```python
import datetime

if __name__ == '__main__':
    __batch_main__(None, None, None, None, None, {'my_param': my_param_value, …}, datetime.datetime.now().date())
```
Note – Don’t forget to fill in the job_config dict with parameters that you want to be passed to your job, if any.
You can now run the `.py` file to test your batch job. If everything works as expected, proceed to the next step.
Step 2: Packaging Your Batch Job
Once your script is ready, you need to package it and include it in an image.
Depending on whether you are adding the batch job into a package that already exists in Datatailr, or creating a new package for it, proceed with one of the following:
If Package Already Exists in Datatailr
If autobuild is set up for the package and image:
- Simply commit and push your changes, and Autobuilder will update the package automatically.
- If this package is included in any images which are set for autobuild, those images will be rebuilt as well.
- Wait for a notification from the Autobuilder and proceed to Step 3.

If autobuild is not set up:
- Manually build a new version of the package using Package Builder.
- Manually build a new version of an image (or a new image in case it does not exist yet) with your package using Image Builder.
- Proceed to Step 3.
If New Package is Created
- Check out your git repo in Package Builder and create a new package from the directory that contains your script.
- Using Image Builder, create a new image that includes the package with your batch job. If you want to include the package in an image which already exists, select «Build from image» instead of creating a new image.
- Add your package and build the image.
Step 3: Scheduling Your Batch Job
Once you have the image ready, let’s proceed to scheduling the batch job to run.
There are two ways to do this: via the GUI, or via the scheduler API. For the sake of simplicity we will start with the GUI, but feel free to skip ahead if you want to start with the API directly.
Scheduling via GUI
- From the Datatailr landing page, open Job Scheduler app and navigate to the «Batch Jobs» section.
- Click the + button in the top right corner in Job Scheduler to add a new batch job — a batch configuration pop-up will appear.
- The top half of this pop-up defines batch-wise settings, and the bottom part — job-wise settings.
- You can add more jobs to your batch using the + button in the bottom left corner of the popup.
- If the image(s) you want to use are in the Dev environment (which is the case if they weren't manually copied to pre or prod earlier), change the «Tag» field value from prod to dev.
- Using the «Image» selectbox, pick the image that you want your job to use.
- From the «Entrypoint» selectbox, choose an entrypoint for the job. The format is `package_name.file_name`, where package_name is the name of the Dt package that contains the entrypoint, and file_name is the name of the file where `__batch_main__` is defined.
- For every job in a batch you can create dependencies using the «Dependencies» selectbox — this defines the order in which jobs will be executed.
- Fill out the «Name», «Group», «Schedule» and «Description» fields.
- Do not forget to specify job_config in JSON format in the «Extra Arguments» field if you want to pass any parameters to the job (see the example below).
- You can set CPU/Memory requirements on a per-job level. Other fields are optional.
- Once you are happy with the configuration, click «OK». The Batch Job will appear in the «Batch Jobs» section with status Scheduled. The «Next Run» column shows the closest scheduled run of the batch.
- You can also trigger a batch run outside of its schedule if you right click on the batch and choose «Run».
Note – in this case it may take a couple of minutes for a batch to start running if new VMs have to be brought up to run the batch.
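The «Extra Arguments» field accepts plain JSON. As a hypothetical example (the keys are made up for illustration), a job_config could look like `{"country": "US", "learning_rate": 0.01}`. At runtime this object is passed to `__batch_main__` as the job_config argument, as shown in the Key Concepts section.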
Scheduling via API
If you prefer, a Batch Job can also be scheduled programmatically (by executing a Python script) instead of being defined in the user interface via the Job Scheduler Batch Jobs tab. The following creates a scheduled Batch Job that runs in the same manner as if it were defined in the Datatailr user interface. The scheduler API is modeled after Apache Airflow.
Create a new script in your IDE with the following content:
```python
from dt.scheduler.api import DAG, Schedule, Task
from datetime import datetime, timedelta

SCHEDULE_IN_MINUTES_FROM_NOW = 5  # feel free to modify or set other schedule
scheduled_time = datetime.now() + timedelta(minutes=SCHEDULE_IN_MINUTES_FROM_NOW)
SCHEDULE = Schedule(at_minutes=[scheduled_time.minute], at_hours=[scheduled_time.hour], timezone='UTC')

with DAG(Name='My Batch Job', Tag='dev', Schedule=SCHEDULE) as dag:
    job_1 = Task(Name='Job #1',
                 Image='Name of your Image',
                 Description='Provide a short description for the job',
                 dag=dag,
                 Entrypoint='package_name.file_name',
                 ConfigurationJson={'param_1': param_1_value})

    job_2 = Task(Name='Job #2',
                 Image='Name of your Image',
                 Description='Provide a short description for the job',
                 dag=dag,
                 Entrypoint='package_name.file_name',
                 ConfigurationJson={'param_2': param_2_value})

    job_2 >> job_1  # dependencies definition - job_2 will run after job_1 finishes

    dag.save()  # this schedules the batch. Use dag.run() to run immediately
```
Running the above script in your IDE will schedule the Batch Job to run every day, starting 5 minutes from now.
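If you want a fixed daily time instead of an offset from now, a minimal sketch using only the Schedule parameters shown above (other options are not covered here) could replace the SCHEDULE line in the script:

```python
# Hypothetical example: run every day at 06:30 UTC
SCHEDULE = Schedule(at_minutes=[30], at_hours=[6], timezone='UTC')
```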
Note – You can use list comprehension as follows to create multiple tasks easily –
```python
jobs = [Task(Name=f'Job #{i}',
             Image='Name of your Image',
             Description='Provide a short description for the job',
             dag=dag,
             Entrypoint='package_name.file_name',
             ConfigurationJson={'param_1': param_1_value, 'job_num': i})
        for i in range(5)]

jobs >> job_1  # group of tasks can depend on single tasks
job_2 >> jobs  # and the other way around
```
Modify parameters as you see fit and execute the script to schedule the batch.
Step 4: Monitoring Your Batch Job
Datatailr Job Scheduler also provides tools to monitor the execution and outcome of your batch jobs. In order to do so, navigate to the «Batch Runs» section, where past and ongoing batch job executions are displayed.
Once your batch job starts running, it will appear at the top of the list with status Running.
Expand and examine the row. In the dropdown you can find the individual jobs from that batch run, each appearing as soon as it starts running. You can expand every job in a batch to examine its Stdout, Stderr and the job_config passed to the job. Logs are updated dynamically, close to real time.
In any batch run you can also find a tab with a Gantt chart of execution times, which can be very useful to better understand the structure of your batch and to quickly check on what is going on in a running batch.
When a job finishes running, its status changes to Success if it finishes successfully, or Failure if its execution fails. You can expand the batch run and sort the list of included jobs by status to locate failed ones, examine their logs and find out what went wrong.
Rerunning failed Batch Jobs
Once you fix the error, push your changes to trigger an image autobuild (in case the image is set for autobuild). After the fixed version of the image is ready, find the failed run, right-click on it and select «Rerun» to rerun only the failed jobs, or «Rerun (clean)» to rerun the whole batch from scratch.
Congratulations on deploying your first Batch Job on the Datatailr platform!