Lavian Dsouza

Data Engineer & Analytics Specialist

Master's in Mathematics | Python | SQL | Java | Scala | Data Visualization | Cloud Infrastructure | Machine Learning

Professional Summary

Mathematical problem solver and results-driven Data Engineer with 8+ years of experience designing and optimizing scalable data pipelines, ETL workflows, and analytical systems. Expert in applying graph theory, statistical inference, and operations research to deliver actionable insights, reduce processing times by up to 70%, and enhance system efficiency. Proficient in Python, SQL, Apache Spark, and cloud platforms (AWS, Azure, GCP, Oracle), with strong expertise in data governance, automation, and predictive analytics. Passionate about transforming raw data into high-impact solutions in finance, logistics, and AI domains.

Key Skills and Tools

Programming & Languages

Python

Python is a programming language that lets you work quickly and integrate systems more effectively. The official home of the Python Programming Language.

Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

NumPy

NumPy is the fundamental package for scientific computing with Python.

Scikit-learn

scikit-learn is an open source Python module that integrates a range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language.

Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

Plotly

Plotly is a graphing library for making interactive, publication-quality graphs, including line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes charts, polar charts, and bubble charts.

SQL

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system.

R

R is a programming language for statistical computing and data visualization. It has been adopted in the fields of data mining, bioinformatics, and data analysis.

DAX

Data Analysis Expressions (DAX) is a formula expression language used in Analysis Services, Power BI, and Power Pivot in Excel. DAX formulas include functions, operators, and values to perform advanced calculations and queries on data in related tables and columns in tabular data models.

Bash

Bash is a Unix shell and command language written by Brian Fox for the GNU Project as a free software replacement for the Bourne shell.

PowerShell

PowerShell is a task automation and configuration management program from Microsoft, consisting of a command-line shell and the associated scripting language.

Data Engineering & ETL

Apache Spark

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Dask

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.

Ray

Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads — from reinforcement learning to deep learning to tuning, serving, and model building.

dbt

dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.

Airbyte

Airbyte is an open-source data integration platform that syncs data from applications, APIs & databases to data warehouses, lakes and other destinations.

Debezium

Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases.

Airflow

Apache Airflow® is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.

Prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

Dagster

Dagster is an orchestration platform for the development, production, and observation of data assets.

Great Expectations

Great Expectations is the leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams.

Databases & Storage

PostgreSQL

PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.

MySQL

MySQL is an open-source relational database management system.

CockroachDB

CockroachDB is the SQL database for building global, scalable cloud services that survive disasters.

TiDB

TiDB is an open-source, cloud-native, distributed SQL database for elastic scale and real-time analytics.

Redis

Redis is an open source (BSD licensed), in-memory data structure store used as a database, cache, message broker, and streaming engine.

MongoDB

MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas.

Neo4j

Neo4j is the world’s leading graph database, with native graph storage and processing.

Cassandra

Apache Cassandra® is a widely used NoSQL database providing linear scalability and fault tolerance on commodity hardware or cloud infrastructure.

Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases.

InfluxDB

InfluxDB is the open source time series database. It is designed to handle high write and query loads and is an integral component of the TICK stack.

DuckDB

DuckDB is an in-process SQL OLAP database management system.

ClickHouse

ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real-time using SQL queries.

Data Lakes & OLAP

Delta Lake

Delta Lake is an open-source storage layer that brings reliable lakehouse architecture to data lakes.

Iceberg

Apache Iceberg is an open table format for huge analytic datasets.

Hudi

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development.

Trino

Trino is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Druid

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets.

Doris

Apache Doris is an easy-to-use, high-performance and real-time analytical database based on MPP architecture, known for its high concurrency and low latency.

StarRocks

StarRocks is a next-gen, high-performance analytical data warehouse that enables real-time, multi-dimensional, and highly concurrent data analysis.

Analytics & Visualization

Power BI

Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights.

Tableau

Tableau is visual analytics software for business intelligence. See and understand any data with Tableau.

Superset

Apache Superset is a modern data exploration and visualization platform.

Metabase

Metabase is an open source business intelligence tool. It lets you ask questions about your data, and displays answers in formats that make sense, whether that's a bar graph or a detailed table.

Redash

Redash helps you make sense of your data. Connect and query your data sources, build dashboards to visualize data and share them with your company.

Dash

Dash is a Python framework for building analytical web applications. No JavaScript required.

Streamlit

Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science.

ML & AI Tools

MLflow

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

Kubeflow

Kubeflow is the machine learning toolkit for Kubernetes.

DVC

Data Version Control or DVC is an open-source tool for data science and machine learning projects. It is designed to make ML models shareable and experiments reproducible.

Vertex AI

Vertex AI is a machine learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.

PaLM 2

PaLM 2 is a state-of-the-art language model with improved multilingual, reasoning and coding capabilities.

Cloud & DevOps

AWS RDS

Amazon Relational Database Service (Amazon RDS) is a managed service that makes it easier to set up, operate, and scale a relational database in the cloud.

AWS EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

AWS Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Azure Cosmos DB

Azure Cosmos DB is a fully managed NoSQL database service for modern app development.

Azure Event Hubs

Azure Event Hubs is a big data streaming platform and event ingestion service.

GCP BigQuery

BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility.

GCP Dataproc

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.

Snowflake

Snowflake is a cloud-based data warehousing company that was founded in 2012. It offers a cloud-based data storage and analytics service, generally termed "data warehouse-as-a-service".

Dataiku

Dataiku is the platform for Everyday AI, systemizing the use of data for exceptional business results.

Git

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

Docker

Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.

JIRA

Jira is a proprietary issue tracking product developed by Atlassian that allows bug tracking and agile project management.

Oracle Fusion ERP

Oracle Fusion Cloud ERP is a cloud-based enterprise resource planning suite for midsize to enterprise-level customers.

Microsoft Dynamics 365

Dynamics 365 is a set of interconnected, modular SaaS applications and services designed to both transform and enable your core customers, employees, and business activities.

Salesforce

Salesforce is a customer relationship management solution that brings companies and customers together.

HubSpot

HubSpot offers a full platform of marketing, sales, customer service, and CRM software — plus the methodology, resources, and support — to help businesses grow better.

Power Apps

Power Apps is a suite of apps, services, connectors, and data platform that provides a rapid application development environment to build custom apps for your business needs.

SharePoint

SharePoint empowers teamwork with dynamic and productive team sites for every project team, department, and division.

Certifications

IBM Data Engineering Professional Certificate

Courses:

  • Introduction to Data Engineering
  • Python for Data Science, AI & Development
  • Introduction to Relational Databases (DB2)
  • Databases and SQL for Data Science with Python
  • Hands-on Introduction to Linux Commands and Shell Scripting
  • Relational Database Administration (DBA)
  • ETL and Data Pipelines with Shell, Airflow and Kafka
  • Getting Started with Data Warehousing and BI Analytics
  • Introduction to Big Data with Spark and Hadoop
  • Data Engineering and Machine Learning using Spark
  • Data Engineering Capstone Project

Snowflake Data Engineering Professional Certificate

Courses:

  • Snowflake - The Complete Masterclass (2024 Edition)
  • Snowflake for Developers, Data Science, AI and ML
  • Snowflake - SnowPro Core Certification Exam Prep
  • Snowflake - SnowPro Advanced Data Engineer Certification
  • Mastering Snowflake Cloud Data Warehouse
  • Other Snowflake-specific courses from Udemy or Pluralsight

Google Cloud Data Analytics Professional Certificate

Courses:

  • Google Data Analytics Foundations
  • Foundations: Data, Data, Everywhere
  • Ask Questions to Make Data-Driven Decisions
  • Prepare Data for Exploration
  • Process Data from Dirty to Clean
  • Analyze Data to Answer Questions
  • Share Data Through the Art of Visualization
  • Data Analysis with R Programming
  • Google Data Analytics Capstone

AWS Developer Specialization

Courses:

  • Developing on AWS
  • AWS Developer: Getting Started
  • AWS Developer: Building on AWS
  • AWS Developer: Deployment and Security
  • AWS Developer: Optimization
  • AWS Developer: Lambda Deep Dive
  • AWS Developer: Designing and Developing
  • AWS Developer: Deploying and Managing
  • AWS Certified Developer - Associate (DVA-C01)
  • AWS Developer Associate 2022
  • AWS Certified Developer Associate (DVA-C02)

Hadoop & Spark Fundamentals / Big Data Processing Using Hadoop

Courses:

  • Big Data Hadoop and Spark Developer
  • Spark Fundamentals
  • Spark Developer in Scala
  • Spark Developer in Python
  • Hadoop Developer In Real World
  • CCA 175 Spark and Hadoop Developer
  • CCA 175 Spark and Hadoop Developer - Python
  • Big Data Hadoop Architect (All in 1)
  • Hadoop Administration

Google Advanced Data Analytics Professional Certificate

Courses:

  • Foundations of Advanced Data Analytics
  • Statistical Analysis with Python
  • Python Data Analytics
  • Data Storytelling and Visualization with Tableau
  • Data Processing with Python
  • Machine Learning Introduction for Everyone
  • Advanced Google Data Analytics Capstone

Google IT Automation with Python Professional Certificate

Courses:

  • Crash Course on Python
  • Using Python to Interact with the Operating System
  • Introduction to Git and GitHub
  • Troubleshooting and Debugging Techniques
  • Configuration Management and the Cloud
  • Automating Real-World Tasks with Python

Tableau Business Intelligence Analyst Professional Certificate

Courses:

  • Tableau Fundamentals
  • Data Visualization with Tableau
  • Data Storytelling and Visualizations with Tableau
  • Data Analysis with Tableau
  • Data-Driven Decisions with Beginner Tableau
  • Tableau Business Intelligence Analyst Professional Certificate Capstone

IBM RAG and Agentic AI: Build Next-Gen AI Systems Professional Certificate

Courses:

  • Introduction to Generative AI and LLMs
  • Generative AI Prompt Engineering
  • Building Generative AI Applications Using LangChain
  • Retrieval Augmented Generation (RAG) Concepts
  • Building Retrieval Augmented Generation (RAG) Systems with Vector Databases
  • LangGraph: Multi-Agent Workflows

IBM AI Engineering with Python, PyTorch & TensorFlow Professional Certificate

Courses:

  • Machine Learning with Python
  • Scalable Machine Learning on Big Data using Apache Spark
  • Introduction to Deep Learning & Neural Networks with Keras
  • Deep Neural Networks with PyTorch
  • Building Deep Learning Models with TensorFlow
  • AI Capstone Project with Deep Learning

Meta Data Analyst with GenAI Professional Certificate

Courses:

  • Introduction to Data Analytics
  • Data Analysis with Python
  • Data Analysis with R Programming
  • Data Visualization with Tableau
  • GenAI for Data Analysts
  • GenAI for Data Visualization
  • Data Analyst Capstone Project

Google Generative AI Learning Path

Courses:

  • Introduction to Generative AI
  • Introduction to Large Language Models
  • Introduction to Responsible AI
  • Generative AI Fundamentals
  • Create Image Captioning Models
  • Introduction to Generative AI Studio

IBM Data Management Professional Certificate

Courses:

  • Introduction to Relational Databases
  • Databases and SQL for Data Science
  • Database Administration Fundamentals
  • Relational Database Administration (DBA)
  • Relational Database Design
  • Relational Database Implementation

Google IT Support Professional Certificate

Courses:

  • Technical Support Fundamentals
  • The Bits and Bytes of Computer Networking
  • Operating Systems and You: Becoming a Power User
  • System Administration and IT Infrastructure Services
  • Introduction to Hardware and Operating Systems
  • IT Security: Defense against the digital dark arts

Oracle Cloud and AI Specialization

Courses:

  • Oracle Cloud Infrastructure Foundations
  • Oracle Cloud Infrastructure Architect Associate
  • Oracle Cloud Infrastructure Architect Professional
  • Oracle Cloud Infrastructure AI Foundations
  • Oracle Cloud Infrastructure Generative AI Professional
  • Oracle Cloud Infrastructure Data Foundations Associate

Introduction to MongoDB

Courses:

  • Introduction to MongoDB
  • MongoDB Basics
  • MongoDB Aggregation Framework
  • MongoDB Security
  • MongoDB Performance
  • MongoDB Administration
  • MongoDB and Python
  • MongoDB and Ruby
  • MongoDB and PHP
  • MongoDB and Node.js

SQL for Any IT Professional Specialization

Courses:

  • SQL Server for Database Administrators
  • SQL Server for Developers
  • SQL Server for Data Analysts
  • SQL Server for Business Intelligence Professionals
  • SQL Server for Everyone
  • SQL Server for Beginners

Google Cybersecurity Professional Certificate

Courses:

  • Foundations of Cybersecurity
  • Play It Safe: Manage Security Risks
  • Connect and Protect: Networks and Network Security
  • Tools of the Trade: Linux and SQL
  • Assets, Threats, and Vulnerabilities
  • Sound the Alarm: Detection and Response
  • Automate Cybersecurity Tasks with Python
  • Put It to Work: Prepare for Cybersecurity Jobs

Professional Experience

Data Analyst (Data Engineering)

AD Ports Group • Sep 2024 – Present

  • Designed scalable data pipelines integrating Microsoft Dynamics 365 and Oracle Fusion using Python and SQL, reducing reporting time by 40% and overstocking by 20%.
  • Developed automated Power BI dashboards and models with Kafka streaming and Great Expectations for data quality, improving operational efficiency across 10+ departments.
  • Conducted predictive modeling of procurement trends using Prophet, Lasso, and Ridge regression, and optimized Trino/Presto queries and Delta Lake storage to achieve 15–20% efficiency gains.
  • Automated cross-department workflows with Apache Airflow and DataHub metadata management, reducing manual effort by 70%.

Technical Data Analyst (ERP + CRM)

GlowTouch Technologies • Sep 2021 – May 2024

  • Built data extraction pipelines with SQL, Python, and Debezium, integrating HubSpot, Salesforce, and Microsoft Dynamics 365, cutting troubleshooting time by 35%.
  • Automated ETL scripts and log analysis using Apache Spark, improving process efficiency by 30%.
  • Developed internal compliance dashboards and fraud monitoring automation using Power BI and Elasticsearch, ensuring scalable infrastructure and 20% faster data delivery.
  • Implemented data governance practices with Apache Atlas and QA coordination, reducing response times by 25%.

Data Associate (Data Handling & Automation)

Amazon • Aug 2020 – Feb 2021

  • Handled high-volume customer data with ETL workflows in Excel and SQL, reducing invoice dispute resolution time by 25%.
  • Supported backend incident diagnostics with Python pipelines, improving audit efficiency by 15%.
  • Built reporting tools incorporating MongoDB and basic ML for trend analysis, enhancing SLA adherence by 20%.

Data Representative – Technical Support

Concentrix • Nov 2017 – Oct 2019

  • Analyzed system logs and automated processes with Python and batch scripts, reducing manual effort by 30%.
  • Managed SAP CRM data flows and ETL processes, improving issue resolution by 25%.
  • Supported network security audits and infrastructure maintenance using Active Directory and Linux tools.

Technical Support Engineer

Anmol Solutions • Jun 2016 – Aug 2017

  • Managed server operations, SQL pipelines, and VPN/Active Directory setups, achieving 95% uptime.
  • Installed and maintained hardware/software, optimizing data storage and retrieval, reducing downtime by 40%.
  • Integrated POS and accounting software via Python scripting, improving operational efficiency by 30%.

Projects

  • Gold Price Prediction Engine: Time-series forecasting with Prophet, Lasso/Ridge regression, and Spark, delivering predictions at 5-minute intervals.
  • Formula 1 Dashboard: Interactive analytics with Plotly Dash and NetworkX, hosted on GitHub.
  • Discovering Amazon Archaeological Sites: Geospatial analysis using Python, rasterio, Plotly, and OpenAI APIs for 3D modeling.
  • Cyclistic Bike-Share Growth Optimization: R, SQL, Python analytics, recommending strategies to increase revenue 10–15%.
  • NYC Taxi Fare Prediction: Regression models with PACE framework, improving fare transparency.
  • User Churn Prediction (Waze): Supervised ML models to forecast churn and enhance retention.
  • Claim vs. Opinion Classification (TikTok): NLP-based classification system for content moderation.

Education

M.Sc. in Mathematics

St. Aloysius College, Mangalore (2013 – 2015)

B.Sc. in Physics, Chemistry, and Mathematics

St. Aloysius College, Mangalore (2010 – 2013)

Contact

Lavian Dsouza | Abu Dhabi, UAE | lavianvishal23@gmail.com | +971 54 752 6875 | LinkedIn: https://www.linkedin.com/in/lavian-d-4975442ab/

Currently Available For

  • Data Engineering Contracts
  • Cloud Migration Projects
  • ML Pipeline Development
  • Consulting & Architecture