Easily create Python packages for Airflow MWAA

Angelos Alexopoulos
3 min read · Jan 31, 2023


Lately, we decided to move our custom Airflow workflow servers to the more managed approach of AWS MWAA. The main motivation behind this migration was to leverage the AWS-managed environment and integrations, gain automatic scaling, and avoid the various issues of managing security patches and libraries on our workflow EC2 machines.

Here is a detailed and self-explanatory architecture diagram for MWAA:

[MWAA architecture diagram]

On the other hand, MWAA does not seem to follow Airflow version updates quickly enough (it was stuck on version 2.2.2 for a long time; it now supports 2.4, while the official Airflow release is 2.5). Finally, we had some second thoughts about the cost of MWAA, since the smallest offered environment, which supports up to 50 DAGs, deploys a minimum of 5 machines in total (2 web servers, 2 schedulers, and 1 Celery worker).

The MWAA user guide in the AWS documentation is extensive and detailed. The Code Examples section offers practical examples that cover most use cases.

Before migrating to MWAA we tried and tested our DAGs with the aws-mwaa-local-runner. Its CLI builds a Docker container image locally that is similar to the MWAA production image. This allows you to run a local Apache Airflow environment to develop and test DAGs, custom plugins, and dependencies before deploying to MWAA. In our experience it is not identical to AWS MWAA, but it gives you a good idea of potential code issues. It will not, of course, catch problems with networking, AWS permissions, and so on.
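
As a rough sketch of the local workflow, following the project's README at the time of writing (the exact script names may differ between versions):

# Clone the local runner and build an image similar to the MWAA production image
git clone https://github.com/aws/aws-mwaa-local-runner.git
cd aws-mwaa-local-runner
./mwaa-local-env build-image

# Start a local Airflow environment; put your DAGs under the dags/ folder
./mwaa-local-env start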

Here I would like to share a simple but really helpful tip that would have saved me a lot of time had I known it earlier. In our production environment we chose very limited Web Server Access: entirely private, with NO INTERNET ACCESS. This is very good for the security of your environment, and it means the Airflow UI can be reached only from inside the VPC.

The main drawback of private access is that Airflow fails to install the pip requirements on the web server machines, since they cannot reach the public pip repository. To overcome this limitation you can either create a private pip repository or, alternatively, package the required Python libraries as wheel packages. Please follow the steps below:

  1. Upload your requirements.txt with the required Python libraries and the correct versions, then update the Airflow environment. pip will install the libraries on the scheduler machines, but the installation will fail on the web server, so the Airflow UI will not be available.
  2. Upload a new requirements.txt without any libraries. This time the web server will succeed, so the UI will be accessible again.
  3. Create a sample DAG that packages all the Python libraries installed in step #1:
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator  # the bash_operator module path is deprecated in Airflow 2.x

with DAG(
    "mwaa_tests",                     # DAG id
    start_date=datetime(2023, 1, 1),  # start date, the 1st of January 2023
    schedule_interval="@daily",       # Airflow preset: run once every day
    catchup=False,                    # do not backfill runs before today
) as dag:
    # The BashOperator runs the bash command stored in the mwaa_command variable
    commands = BashOperator(
        task_id="run_command",
        bash_command=Variable.get("mwaa_command"),
    )

4. We can now execute any bash command we want on the Celery worker environment. So let's create a new Airflow variable named mwaa_command with the following value:

python3 -m pip download -r /usr/local/airflow/requirements/requirements.txt -d /usr/local/airflow/libs; aws s3 cp /usr/local/airflow/libs s3://<YOUR_BUCKET>/mwaa_libs --recursive
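
The variable can be created from the Airflow UI under Admin → Variables. Then trigger the DAG manually from the UI or, where the Airflow CLI is available (for example inside the local runner), with something like:

airflow dags trigger mwaa_tests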

5. When the DAG runs, the pip download command creates a folder under /usr/local/airflow/libs with all the wheel packages, and the aws s3 cp command uploads them to one of our S3 buckets.

6. We can then download all the wheel packages from s3://<YOUR_BUCKET>/mwaa_libs.
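
For example, from any machine with access to the bucket, using the same <YOUR_BUCKET> placeholder:

aws s3 cp s3://<YOUR_BUCKET>/mwaa_libs ./mwaa_libs --recursive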

7. Zip the downloaded mwaa_libs folder into plugins.zip and upload it to the plugins location configured for your MWAA environment.
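
A minimal sketch, assuming your environment reads its plugins file from s3://<YOUR_BUCKET>/plugins.zip (adjust the path to your own setup):

# Keep mwaa_libs as the top-level folder so the wheels extract
# to /usr/local/airflow/plugins/mwaa_libs on the MWAA machines
zip -r plugins.zip mwaa_libs

# Upload to the S3 path your MWAA environment is configured to read plugins from
aws s3 cp plugins.zip s3://<YOUR_BUCKET>/plugins.zip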

8. Update requirements.txt to point to the full path of each wheel package by replacing the library names with file paths. Add the following 2 lines at the beginning of the file so that pip will not search remote servers for the libraries.

e.g. a requirements.txt file with only the requests package:

--find-links /usr/local/airflow/plugins
--no-index
/usr/local/airflow/plugins/mwaa_libs/requests-2.28.2-py3-none-any.whl
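
Note that pip download also fetches the transitive dependencies of each library (for requests these include urllib3, certifi, charset-normalizer, and idna), and their wheels land in mwaa_libs as well. A variant worth considering, assuming the same layout, is to point --find-links at the mwaa_libs folder itself so pip can resolve those dependencies from the local wheels by name:

--find-links /usr/local/airflow/plugins/mwaa_libs
--no-index
requests==2.28.2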

In this way we can install Python libraries even in private-access environments.

P.S. After writing this article I saw that AWS describes a similar approach in their Best Practices for Dependencies.
