Big Data Engineer
- on request
- 61-306 Poznań
- Europe
- pl | en | de
- 04.11.2024
Brief introduction
Qualifications
Project & professional experience
11/2022 – present
Role description
Working on a credit and liquidity risk stress-testing system.
Programming languages: Scala, Python, Java.
Used technologies: Spark, Hadoop, Dremio, ActivePivot, Iceberg, AWS.
• Refactored existing ETL jobs to improve code reuse and make the code easier to understand, test and extend.
• Changed the partitioning of existing datasets.
• Achieved a substantial performance improvement by optimizing existing Spark jobs (see the sketch below):
- using the DataFrame and Dataset APIs instead of RDDs
- using built-in Spark functions instead of custom row transformations
- using aggregations that support partial aggregation
- using broadcast joins
• Implemented a tool that verifies whether the differences in the number of records and
calculated measures between consecutive days fall within specified thresholds.
• Improved unit test coverage.
Apache Hadoop, Apache Spark, Java (general), Python, Scala, Amazon Web Services (AWS)
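A minimal PySpark sketch of the optimization pattern above; the table, column and path names are illustrative assumptions, not identifiers from the project:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("risk-etl-sketch").getOrCreate()

    # Large fact table and a small dimension table (placeholder paths).
    exposures = spark.read.parquet("s3a://bucket/exposures/")
    ratings = spark.read.parquet("s3a://bucket/ratings/")

    daily = (
        exposures
        .join(broadcast(ratings), "counterparty_id")  # broadcast join: the big side is not shuffled
        .withColumn("weighted", F.col("exposure") * F.col("risk_weight"))  # built-in functions, no UDF
        .groupBy("business_date", "rating")
        .agg(F.sum("weighted").alias("total_exposure"))  # sum supports partial aggregation
    )

    daily.write.mode("overwrite").partitionBy("business_date").parquet("s3a://bucket/out/")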
6/2022 – 10/2022
Role description
Worked on the migration from an on-premises Hadoop cluster to AWS.
Used technologies: Hadoop, EMR, Spark, Airflow, S3, Docker, Terraform.
Programming languages: Scala, Python.
• Converted multiple MapReduce jobs to Spark jobs.
• Updated Spark jobs to use DataFrames and Datasets instead of RDDs.
• Optimized existing Spark jobs.
• Created Airflow DAGs to schedule data processing (see the sketch below).
• Deployed infrastructure using Terraform.
Python, Scala, Amazon Web Services (AWS), Apache Hadoop, Apache Spark, Docker
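A minimal sketch of such an Airflow DAG, submitting a Spark step to a running EMR cluster; the DAG id, schedule, Airflow Variable and script path are assumptions:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

    SPARK_STEP = [{
        "Name": "daily-transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/daily_transform.py"],  # placeholder job script
        },
    }]

    with DAG(
        dag_id="daily_transform",
        start_date=datetime(2022, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit the Spark job as an EMR step; the cluster id comes from an Airflow Variable.
        EmrAddStepsOperator(
            task_id="submit_spark_job",
            job_flow_id="{{ var.value.emr_cluster_id }}",
            steps=SPARK_STEP,
        )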
3/2021 – 6/2021
Role description
Used technologies: Hadoop, Spark, Hive, Docker, Kubernetes, Airflow, Terraform, AWS, MS SQL Server, Snowflake.
Programming language: Python.
• Implemented multiple PySpark jobs running on a Kubernetes cluster to transform data from MS SQL Server and store it in S3 (see the sketch below).
• Implemented Airflow pipelines to schedule PySpark jobs and define dependencies between
them.
• Deployed infrastructure using Terraform.
Apache Hadoop, Microsoft SQL-Server (MS SQL), Apache Spark, Docker, Snowflake, Amazon Web Services (AWS), Kubernetes, Python
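A minimal sketch of the MS SQL Server to S3 pattern from the first bullet; connection details, table, column and bucket names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mssql-to-s3-sketch").getOrCreate()

    # Read one table over JDBC (requires the MS SQL JDBC driver on the classpath).
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://db-host:1433;databaseName=sales")
        .option("dbtable", "dbo.orders")
        .option("user", "etl_user")
        .option("password", "...")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load()
    )

    # Write to S3 as Parquet, partitioned by an assumed date column.
    orders.write.mode("overwrite").partitionBy("order_date").parquet("s3a://my-bucket/raw/orders/")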
5/2020 – 6/2022
Role description
Used technologies: Hadoop, Spark, Hive, Databricks, Docker, Kubernetes, EMR, S3, CloudFormation, Lambda, DynamoDB, Akka HTTP, Flask, Gunicorn.
Programming languages: Scala, Python.
• Designed and implemented a feature store for machine learning; prepared a framework for efficiently calculating thousands of different aggregate values (features) from terabytes of data.
• Dockerized Spark applications to run them as containers on an EMR cluster in a more isolated and standardized way.
• Implemented an application for serving a machine learning model as a REST API on a Kubernetes cluster using Flask, Gunicorn and the TensorFlow Serving API. Significantly improved the API's response time by using an approximate nearest neighbor search algorithm.
• Implemented a Lambda function that transforms new objects created in an S3 bucket and stores the records in a DynamoDB table (see the sketch below).
• Implemented a REST API application using Akka HTTP to serve recommendations stored in a DynamoDB table.
• Optimized existing Spark applications.
• Worked with data scientists to optimize their solutions and make them production-ready.
Apache Hadoop, Apache Spark, Databricks, Docker, Python, Scala, Amazon Web Services (AWS), Kubernetes
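A hedged sketch of the Lambda function mentioned above: it is triggered by S3 object creation, transforms the new object and stores records in DynamoDB. The table name and the newline-delimited JSON layout are assumptions:

    import json
    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("recommendations")  # assumed table name

    def handler(event, context):
        # S3 "ObjectCreated" notifications arrive as a list of records.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            for line in body.splitlines():              # assumes one JSON record per line
                table.put_item(Item=json.loads(line))   # keys must match the table schema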
1/2018 – 4/2020
Role description
Used technologies: Hadoop, Spark, Kafka, Hive, Flume, HBase, Oozie, Splunk, Ansible.
Programming languages: Scala, Python.
• Implemented report generators for a core banking platform using Spark.
• Implemented Spark jobs for file compaction and repartitioning to improve the performance of report generators and Hive queries (see the sketch below).
• Implemented random data generators to verify the performance of Spark applications; analyzed performance-test outputs and made the necessary improvements.
• Worked on the migration from the Cloudera to the MapR Hadoop distribution.
• Used Flume to read messages from Kafka, transform them and persist them into HDFS and HBase.
• Used Sqoop to ingest data from an Oracle database into HDFS.
• Implemented ETL jobs that transform files of various formats into Avro.
• Automated application deployment using Ansible, which greatly reduced the number of issues during production deployments.
Apache Kafka, Python, Scala, Ansible, Apache Hadoop, Apache Spark
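The compaction jobs were written in Scala; this PySpark sketch, with placeholder paths and file counts, only illustrates the pattern:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

    # Read a partition that has accumulated many small files (placeholder path).
    events = spark.read.parquet("hdfs:///data/events/date=2019-01-01")

    # Rewrite it with a controlled number of output files so report generators
    # and Hive queries open far fewer files; pick the count per data volume.
    events.repartition(16).write.mode("overwrite").parquet(
        "hdfs:///data/events_compacted/date=2019-01-01"
    )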
1/2018 – 2/2018
Role description
Used technologies: Spark, Athena, S3, EMR, Neo4j.
Programming language: Python.
• Used PySpark and GraphFrames to run graph algorithms and compared the performance with Neo4j (see the sketch below).
• Used PySpark to transform data stored in S3 and generate CSV files for import into Neo4j.
Apache Spark, Python
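A minimal GraphFrames sketch of such a run, assuming placeholder paths and schemas; the graphframes package must be available to Spark:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("graph-sketch").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # required by connectedComponents

    # The vertex DataFrame needs an "id" column; edges need "src" and "dst".
    vertices = spark.read.parquet("s3a://bucket/vertices/")
    edges = spark.read.parquet("s3a://bucket/edges/")

    g = GraphFrame(vertices, edges)
    components = g.connectedComponents()

    # Export as CSV, e.g. for a Neo4j import or a performance comparison.
    components.write.option("header", True).csv("s3a://bucket/export/components/")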
7/2017 – 11/2017
Role description
Developed an application for collecting risk data from various sources and processing it in real time.
Used technologies: Hadoop, Spark Streaming, Kafka, Avro, Camel.
Apache Hadoop, Apache Kafka, Apache Spark, Scala, Apache Camel
7/2015 – 8/2017
Role description
Developed a system that uses data from public sources to infer "who knows who" relationships and helps companies identify valuable relations among their existing customers.
Used technologies: Neo4j, Cassandra, Spark Streaming, Spark GraphX, Spring, Spray, ActiveMQ, Docker, Redis, Solr.
Details:
• Designed and implemented an algorithm for inferring "knows" relationships between persons using Spark.
• Designed and implemented an algorithm for finding the ultimate beneficial owner of a company using Spark GraphX.
• Created a Neo4j server plugin for finding shortest paths between graph nodes according to defined business rules (see the sketch below).
• Implemented REST services that run Cypher queries to retrieve data from nodes and relationships.
• Implemented fast data import into the Neo4j database by writing directly to the store files using the batch inserter API.
• Used Spark to transform data stored in Cassandra into a format that can easily be imported into the Neo4j database.
• Designed and implemented synchronization between Cassandra and Neo4j using an event-driven architecture.
• Implemented node search in the graph using Cypher queries and a Lucene index.
• Data modeling.
• Configured and tuned the Neo4j database.
Apache Spark, Docker, Java (general), Scala, Apache Solr
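The plugin itself was written against Neo4j's Java API; the sketch below only illustrates a rule-constrained shortest-path query, issued through the official Neo4j Python driver with placeholder labels, properties and depth limit:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    # Shortest "knows" path between two persons, capped at an assumed depth of 6.
    QUERY = """
    MATCH (a:Person {id: $src}), (b:Person {id: $dst}),
          p = shortestPath((a)-[:KNOWS*..6]-(b))
    RETURN [n IN nodes(p) | n.id] AS path
    """

    with driver.session() as session:
        for record in session.run(QUERY, src="p1", dst="p2"):
            print(record["path"])

    driver.close()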
7/2014 – 6/2015
Role description
Developed the PSIcarlos system for optimal planning and precise balancing of crude oil
transportation.
Used technologies: Spring, Hibernate, ActiveMQ, Oracle, Apache Tomcat.
Details:
• Designed and implemented new system functions based on defined requirements.
• Prepared technical documentation.
• Cooperated closely within an international team; discussed customer requirements.
• Provided technical support for system users.
Oracle Database, Apache Tomcat, Hibernate (Java), Spring Framework
3/2014 – 5/2014
Role description
Developed the V-Desk workflow system for document circulation.
Used technologies: WPF, WinForms, MS SQL.
Details:
• Developed (and mainly optimized) a service for automatic text recognition from scanned documents and for retrieving key information from them using regular expressions.
• Developed an application for document scanning and barcode recognition.
Microsoft SQL-Server (MS SQL), C#, Java (general)
6/2012 – 2/2014
Role description
Co-authored a call center system.
Used technologies: WCF, WPF, Mono, PostgreSQL, MongoDB, Asterisk.
Details:
• Designed scalable system architecture.
• Developed multithreaded WCF services.
• Implemented calling in different modes by sending requests to, and handling events from, the Asterisk PBX via the AMI protocol.
• Implemented automatic calling by integrating the Asterisk PBX with the PostgreSQL database to create dynamic call queues.
• Developed the predictive-dialer algorithm that calculates the number of calls to place based on collected statistics, e.g. the percentage of answered calls and average talk time (see the sketch below).
• Built the mechanism for sending, mixing, compressing and saving recorded calls to the database.
MongoDB, PostgreSQL, C#
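The system was written in C#; this Python sketch only illustrates the predictive-dialer calculation described above, with an assumed over-dial formula based on the observed answer rate:

    import math

    def calls_to_place(agents_becoming_free: int, answer_rate: float) -> int:
        """How many calls to dial so that roughly one answered call lands per
        agent about to become free, given the observed answer rate."""
        if answer_rate <= 0:
            return agents_becoming_free        # no statistics yet: dial 1:1
        # Over-dial by the inverse of the answer rate so enough calls get picked up.
        return math.ceil(agents_becoming_free / answer_rate)

    # Example: 5 agents about to become free, 40% of dialed calls get answered.
    print(calls_to_place(5, 0.4))  # -> 13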
7/2011 – 2/2013
Role description
Developed the Verax Network Management System.
Used technologies: Spring, Hibernate, Adobe Flex, Oracle, MS SQL.
Details:
• Implemented advanced plugins for problem detection and real-time monitoring of devices and applications such as:
- PostgreSQL and MySQL databases
- Active Directory service
- VMware ESX servers and virtual machines
- .NET applications
- Windows and Unix workstations
- Cisco, MRV and Juniper routers and switches
- APC UPS devices
- Devices of undetected type
• Created a module for monitoring changes in software installed on detected devices.
Microsoft SQL-Server (MS SQL), Oracle Database, Hibernate (Java), Java (general), Spring Framework
Certificates
Google Cloud
openHPI
Neo4j
dbt Labs
Snowflake
Databricks
Amazon Web Services Training and Certification
MapR Technologies (acquired by Hewlett Packard Enterprise in 2019)
Coursera
Coursera
Coursera
Coursera
Coursera
Personal data
- Polish (native)
- English (fluent)
- German (basic)
- European Union