Reducing Docker Image Sizes with Multi-Stage Builds and Distroless
Imagine you are a Data Engineer at a large company with multiple deployments per day. You’re using Docker images to containerize your ETL jobs, which consume data from an external API and load it into your data warehouse. You’ve noticed that your CI/CD pipeline takes around 15 minutes to complete, as each deployment requires building, […]
Data Engineering in Azure: understand PDFs using LLMs
Dealing with unstructured data is always interesting, especially when it means building a solution to parse PDFs. Many companies and individuals use PDF files daily, and PDFs are used to distribute all kinds of information, from simple text to complex tables and diagrams. Over the years, there have been multiple approaches to convert (unstructured) data from […]