Build An Airflow Data Pipeline To Download Podcasts [Beginner Data Engineer Tutorial]

https://www.youtube.com/watch?v=s-r2gEr7YW4&ab_channel=Dataquest

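1. Think before you start

Tip: It is a good idea to design the pipeline steps up front, before building anything.

For example, for a pipeline like the one pictured:

The first step is create_table_sqlite

The second step is get_episodes

and so on.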
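One way to write such a plan down before touching Airflow is as a small dependency map. This is only a sketch in plain Python (standard library, Python 3.9+): the names `pipeline_plan` and `run_order` are made up for illustration, and only the two steps named above are listed.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps that must finish before it starts.
# Only the two steps named in the notes are listed; later steps would be
# added to this dictionary in the same way.
pipeline_plan = {
    "create_table_sqlite": set(),
    "get_episodes": {"create_table_sqlite"},
}

# static_order() yields the steps in an order that respects the dependencies.
run_order = list(TopologicalSorter(pipeline_plan).static_order())
print(" -> ".join(run_order))
```

Having the order on paper first makes it easier to turn each entry into an Airflow task later.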

2. Installation

If you follow the official documentation, installation is quick and simple and does not require docker-compose.

https://airflow.apache.org/docs/apache-airflow/stable/start/local.html

- Running airflow in standalone mode lets you manage the data and the podcasts in the same folder.

- However, standalone mode should not be used in a production environment.

3. Basic concepts

task & dag

dag - the object that tells Airflow how the pipeline is defined

task - the object that defines each individual step inside the pipeline

pendulum - a date/time library, used here to set start times and end times

4. Airflow Configuration

a. Open the config file with nano ~/airflow/airflow.cfg or vi ~/airflow/airflow.cfg, then

b. set dags_folder to the folder you want
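The setting lives under the `[core]` section of airflow.cfg, with the key `dags_folder`. If you would rather script the change than edit by hand, something like the sketch below could work. The helper name `set_dags_folder` and the example paths are made up, and the demo runs against a throwaway file rather than your real config. Note that rewriting a file with `configparser` drops its comments, so editing by hand is safer for the real airflow.cfg.

```python
import configparser
import tempfile
from pathlib import Path


def set_dags_folder(cfg_path: Path, dags_folder: str) -> None:
    """Set [core] dags_folder in an Airflow-style config file (hypothetical helper)."""
    # interpolation=None so that any '%' characters in the file are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)
    if not parser.has_section("core"):
        parser.add_section("core")
    parser.set("core", "dags_folder", dags_folder)
    with cfg_path.open("w") as fh:
        parser.write(fh)


# Demo on a temporary file so the real ~/airflow/airflow.cfg is never touched.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "airflow.cfg"
    cfg.write_text("[core]\ndags_folder = /old/dags\n")

    set_dags_folder(cfg, "/home/user/podcast_dags")  # example path only

    check = configparser.ConfigParser(interpolation=None)
    check.read(cfg)
    new_value = check.get("core", "dags_folder")
```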

5. Setting up a DB with sqlite3

a. sqlite3 episodes.db

b. .databases

c. .quit

d. airflow connections add 'podcasts' --conn-type 'sqlite' --conn-host '/blahblah~~~~~/episodes.db'

e. Note that SQLite is meant for testing only; for the metadata database backend, the Airflow docs recommend PostgreSQL or MySQL in production.

https://airflow.apache.org/docs/apache-airflow/2.3.2/howto/set-up-database.html
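Steps a to c can also be reproduced from Python with the standard `sqlite3` module, which is handy for checking that the database file behaves as expected. This is a rough sketch: the `episodes` table and its columns are an assumption made for illustration, and the file is created in a temporary folder rather than at the real path.

```python
import sqlite3
import tempfile
from pathlib import Path

# Assumed schema for illustration; adjust the columns to your own pipeline.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS episodes (
    link TEXT PRIMARY KEY,
    title TEXT,
    published TEXT,
    description TEXT
);
"""

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "episodes.db"

    # Connecting creates the file if it does not exist (like `sqlite3 episodes.db`).
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(CREATE_SQL)
        conn.commit()

        # Rough equivalent of `.databases` in the sqlite3 shell.
        databases = conn.execute("PRAGMA database_list").fetchall()

        # List the tables that now exist in the file.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
    finally:
        conn.close()
```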

6. Operator vs task decorator

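a. Adding the @task decorator to a function turns it into a Python operator task (available since Airflow 2)

b. With the built-in Operators, you can interact with many different platforms without using the decorator.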

7. set_downstream

a. Calling SqliteOperator.set_downstream(~~) makes the Operator run first, before ~~ is executed

8. Installing apache-airflow[pandas]

a. pip install 'apache-airflow[pandas]'

9. Hook

a. A hook helps you run queries a little more easily.

License: Attribution, NonCommercial, NoDerivatives

UGONG2SAN
