DOI: 10.59350/j7m7h-9ss82

Data Import by CI Job


A data import is part of most of our projects, and in some cases we need to re-import all the data from time to time. This post shows how to delegate this task to a GitLab CI job and start it by adding a keyword to your commit message.

What is the problem?

There are several circumstances in which a data import (or a re-import) is required. One reason is that the code for the import process has changed and we want these changes to be applied to our instances. Tying the import to the Git commit that introduces the code change lets us track changes and data imports together. However, not every push to GitLab should trigger this job, as it can be time-consuming and may cause a service interruption.

In general, we should automate this task as much as possible.

GitLab CI

GitLab CI can add jobs to the pipeline when a specific condition evaluates to true, so we can use a keyword in the commit message as that condition. GitLab itself offers predefined keywords that change the CI behavior, like [skip CI] to omit the pipeline completely.

General Setup

First of all, I recommend introducing a new stage for the data import, called data_import for example.
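A minimal sketch of the stage declaration, assuming a pipeline that already has build, test and deploy stages. Those three names are placeholders; only data_import comes from this post.

```yaml
# .gitlab-ci.yml
# build, test and deploy are assumed placeholder stages;
# data_import is the new stage for the import job.
stages:
  - build
  - test
  - deploy
  - data_import
```

Listing data_import last means the import starts only after the jobs in the preceding stages have succeeded.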

We set up a job with the rules key to contain our defined condition:

import:
  image: bash:latest
  stage: data_import
  rules:
    - if: '$CI_COMMIT_MESSAGE =~ /ci-reimport-data/'
  script:
    - printf "your data import here\n" 

This job will be added to the pipeline whenever your commit message contains ci-reimport-data. If not, it will not be present in the pipeline at all. Put the keyword in a separate paragraph of your message, with a blank line between it and your semantic commit message, like

feat: the most awesome feature ever

incredible performance boost for blockchain records

ci-reimport-data

or, via CLI, use multiple -m arguments like

git commit -a -m "commit title" -m "extensive description" -m "ci-reimport-data"

Branch-aware

To make the job aware of the branch you are working on, the condition can be extended. This is useful when a data import should target a specific instance. We want to test our changes in a topic branch first, so this job should not run for pipelines triggered from the develop and main branches.

$CI_COMMIT_MESSAGE =~ /ci-reimport-data/ && $CI_COMMIT_REF_NAME != "develop" && $CI_COMMIT_REF_NAME != "main"

To omit this job for a pipeline on a merge request or tag, add && $CI_COMMIT_TAG == null && $CI_MERGE_REQUEST_ID == null to the condition.
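As the condition grows, it might be easier to read when the exclusions are split into separate rules. The following is a sketch of that alternative, not the setup used in this post: it relies on GitLab evaluating rules from top to bottom and stopping at the first match, with when: never removing the job from the pipeline.

```yaml
import:
  image: bash:latest
  stage: data_import
  rules:
    # Rules are evaluated in order; the first matching rule decides.
    # Never run on the develop and main branches.
    - if: '$CI_COMMIT_REF_NAME == "develop" || $CI_COMMIT_REF_NAME == "main"'
      when: never
    # Never run for tag pipelines (the variable is set only for tags).
    - if: '$CI_COMMIT_TAG'
      when: never
    # Never run for merge request pipelines.
    - if: '$CI_MERGE_REQUEST_ID'
      when: never
    # Otherwise, run when the commit message contains the keyword.
    - if: '$CI_COMMIT_MESSAGE =~ /ci-reimport-data/'
  script:
    - printf "your data import here\n"
```

If none of the rules match, the job is not added to the pipeline, which is the same outcome as with the single combined condition.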

API-triggered Pipelines

When a pipeline is triggered via the API (e.g. from the web UI), we cannot alter the latest commit message. But we can inject an environment variable into the pipeline and simply check for it:

($CI_COMMIT_MESSAGE =~ /ci-reimport-data/ || $REIMPORT == "true")
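One way to make the variable easier to discover is to declare it globally with a default value and a description, so that it appears prefilled when a pipeline is started manually from the web interface. This is a sketch under assumptions: it requires a GitLab version that supports variable descriptions, and the default value and description text are made up for illustration.

```yaml
# REIMPORT is the variable name used in this post; the default value
# and the description text are assumptions for illustration.
variables:
  REIMPORT:
    value: "false"
    description: "Set to true to run the data import job in this pipeline."

import:
  image: bash:latest
  stage: data_import
  rules:
    # Run on the commit keyword or when REIMPORT was set to "true".
    - if: '$CI_COMMIT_MESSAGE =~ /ci-reimport-data/ || $REIMPORT == "true"'
  script:
    - printf "your data import here\n"
```

With the default left at "false", the job is still added only when the keyword is in the commit message or someone sets REIMPORT to true for that pipeline.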

Git Flow

When you add your code following a branching model like git flow, changes to the develop and main branches are usually made via merge requests. In this case you have to add the keyword to the merge commit message, which can be done via the web interface.

Ahiqar project

In the Ahiqar project (project description, GitLab) the import job is sensitive to either the commit message containing ci-reimport-data or a variable $REIMPORT set to true. The job is also aware of the branch and will prepare the data import on the corresponding instance only (topic branches → test instance; develop → development instance; main → production instance).

The job setup for topic branches looks like this:

import-test:
  image: curlimages/curl
  stage: data_import
  rules:
    - if: '($CI_COMMIT_MESSAGE =~ /ci-reimport-data/ || $REIMPORT == "true") && $CI_COMMIT_REF_NAME != "develop" && $CI_COMMIT_REF_NAME != "main" && $CI_COMMIT_TAG == null && $CI_MERGE_REQUEST_ID == null'
  script:
    - curl "https://ahikar-test.sub.uni-goettingen.de/api/import-data?token=${APP_DEPLOY_TOKEN}"
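The jobs for the other instances would follow the same pattern with the branch test inverted. The sketch below shows what a counterpart for the develop branch might look like. The job name import-develop and the variable DEV_IMPORT_URL are hypothetical, since the actual job and the endpoint of the development instance are not shown in this post.

```yaml
# Hypothetical counterpart for the develop branch (not the project's actual job).
import-develop:
  image: curlimages/curl
  stage: data_import
  rules:
    - if: '($CI_COMMIT_MESSAGE =~ /ci-reimport-data/ || $REIMPORT == "true") && $CI_COMMIT_REF_NAME == "develop" && $CI_COMMIT_TAG == null && $CI_MERGE_REQUEST_ID == null'
  script:
    # DEV_IMPORT_URL is an assumed CI/CD variable holding the import
    # endpoint of the development instance.
    - curl "${DEV_IMPORT_URL}?token=${APP_DEPLOY_TOKEN}"
```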

Next Steps

One idea to improve usability might be to refine the condition so that it reacts to checkboxes ticked in the merge request description. It would let us enable or disable a job by clicking a checkbox before a merge is performed.

Do you have your own idea for improving this? Leave a comment below 👇 and tell us about it!