
All things data related — Data Engineering/DWH/ETL/BI/Data Science. Professional Data Wizard @

A polyglot notebook app: prototype and play around using Python and Scala in the same notebook.


I first saw Polynote, I think, around October 2019 in this blog post about it, and I fell in love with it at that moment. For someone who knew only Python, trying out Scala with almost zero effort was a dream come true. I could compare Python vs. Scala execution times, try Dataset vs. DataFrame, work through tutorials, or even use it to help write a blog post. But at that time, the authors said it wasn’t perfect and was in its initial stages. One and a half years have passed, and I’m back to rekindle this spark (pun intended).

I have plans to create an…

Introduction to working with Docker and creating your own development environment


Working in a fast-changing environment, with more and more tools being released as open source, I made a mess of my laptop: installing tons of applications and forgetting to clean them up. The same thing happened with Python — no virtual environments, so basically many libraries were free to roam my laptop.

The initial solution was to remove Python and all the other clutter from my laptop. But I figured that later, if I did a PoC for a blog post or for work, I would end up in the same place (though not with python if I…

My story of exploring and analyzing data and trying to build a robust pipeline with a classification task on my hands, having almost zero experience in data science


I have a statistical background, though I’ve never in my life actually worked with statistics. I’ve gotten rusty; short fragments remain about what MSE, AUC, etc. are, but that is it. I’d have to start almost from zero if I wanted to progress further (the statistics courses I took at university were 10 years ago).

I work as a data developer/engineer by day and am a father of two 24/7. Whenever I have free time from these activities, I try to do something that interests me. Data science had been in my backlog for some time, but I couldn’t find…

And why you should try it too!


Every five years I set goals for myself, a sort of life “sprint planning” for the next period. This time, I realized it would be fun to see myself presenting at some conference. One of the ways I could start is writing blog posts; as I see it, a presentation is just a fancier way of doing the same thing, right? You choose a topic, do a deep dive and comparisons, come out with conclusions, and present them to your reader/listener. …

Short and comprehensive information about different data modeling techniques

In this guide, I’ll try to cover several methodologies, explain their differences, discuss when and why (from my point of view) one is better than another, and maybe introduce some tools you can use when modeling a DWH (Data Warehouse) or EDW (Enterprise Data Warehouse).

My story

I started my BI (Business Intelligence) career in a small company that consulted other companies on how to improve their processes, or just helped them build a BI system so that they could make decisions themselves.

How I imagined working during that time (almost straight out of university): I would come to…

Making your Apache Spark application run faster with minimal changes to your code!


While developing Spark applications, one of the most time-consuming parts is optimization. In this blog post, I’ll give some performance tips and cover configuration parameters that (at least for me) were unknown when I started.

So I’m going to cover these topics:

What can we improve?

Working with multiple small files?

spark.sql.files.openCostInBytes (from the documentation) — the estimated cost to open a file, measured by the number of bytes that could be scanned at the same time. This is used when putting multiple files into a partition. It is better…
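For reference, this parameter (together with the related `spark.sql.files.maxPartitionBytes`) can be set in `spark-defaults.conf`. The values below are the documented defaults, shown only as a starting point; tune them for your own workload:

```properties
# spark-defaults.conf (default values shown)
# 4 MB: estimated cost, in scannable bytes, of opening one file
spark.sql.files.openCostInBytes      4194304
# 128 MB: maximum bytes packed into a single partition when reading files
spark.sql.files.maxPartitionBytes    134217728
```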


Coming from a relational DB background, I wasn’t familiar with any NoSQL-type DBs. When I switched to a more Big Data-oriented position, I had to adjust and understand them to do my job correctly. This blog post walks through a general setup of MongoDB from the Docker image, explains some of the different partitioner options, and gives a simple use-case example with performance tips.

MongoDB Setup

MongoDB will be run from a Docker image, which you can pull and start with:

docker pull mongo
sudo docker run -d -p 27017:27017 -v ~/data:/data/db mongo


In this post, I’ll share my experience with the Spark function explode, and one case where I’m happy that I avoided using it and built a faster approach for a particular use case.

Disclaimer: I’m not saying there is always a way to avoid explode and the resulting expansion of your data set in memory. …
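As a rough sketch of what explode does (illustrated here with pandas’ analogous `DataFrame.explode`, since the concept is the same): each element of an array column becomes its own row, which is exactly why it can blow up the data set size in memory.

```python
import pandas as pd

# Each row holds a list of tags.
df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

# Exploding emits one row per list element: 2 rows become 3.
# With long lists, the blow-up is far worse.
exploded = df.explode("tags")
```

In Spark, the equivalent is `pyspark.sql.functions.explode` applied to an array column, and the memory-growth concern is the same.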

A quick recap of what I covered in the first part: Dask beats Pandas and Spark at read + group by + mean + print top five rows. You can check it out in “Single Node processing — Spark, Dask, Pandas, Modin, Koalas vol. 1”.

In this post, I’ll compare how Dask, Spark, and Pandas read a CSV file, apply some arbitrary calculations (with some tips on performance), and output to a single CSV file again.


Some calculations in pandas are fastest if you can express them as a column expression, i.e. a vectorized operation over whole columns.

Simple Columnar expression

What it…
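A minimal sketch of the idea (column names are made up for illustration): the vectorized column expression works on whole columns at once, while an `apply` over rows calls a Python function per row and is typically much slower.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorized column expression: one operation over entire columns.
df["total"] = df["price"] * df["qty"]

# Row-by-row equivalent via apply, shown only for comparison; it is
# much slower because it invokes a Python lambda for every row.
df["total_slow"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)
```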

For a long time, I’ve been hearing and seeing blog posts saying “use Pandas/Spark/Dask; it’s better than the others.” From my point of view, it was a stalemate where anything could happen. Finally, I got bored of hearing the same thing over and over, so I decided to benchmark the most popular ones and find out for myself which one to use in which case.

By the way, check out Vol. 2 of Single Node Processing.


The first idea was to go to Kaggle and download some of the data sets from there, but since I’m trying new things, I’ve…
