Apache – Here we come…

In the past year or so, a long-time friend, colleague, software mentor and Jedi Padawan, Yaniv Rodenski, approached me and asked me to join his passion project, Amaterasu.

The project has a clear goal in mind: to introduce a CI/CD platform for Big Data pipelines.

Joining Yaniv and his Australian mates (that’s how you say it, down there in upside-down country, right? mate?) on this project, it felt good to work on a project that is open-sourced and driven towards Big Data.

Being a long-time user of many of the Apache Big Data projects (Spark, Flink, the Hadoop stack, etc.), it has always been a wish of mine to actually stand behind one of these projects, and recently, that wish looks like it will come true!

In the meantime, here is a talk that Yaniv and I delivered at the Israeli meetup group “Big Things”, hosted by Demi Ben-Ari and Shlomi Hassan (WARNING: this meetup was right after a full-day workshop I delivered that morning, which I finished preparing about 15 minutes before it started; hence the mumbling and beer).


The OOP King is dead

The ideas and paradigms of OOP (Object-Oriented Programming) have been around for quite some time. They have been monumental in helping to produce massive software systems and have helped programmers use pre-defined constructs, called “design patterns”, to build them.

The state of software development has changed. In the past, software was built from the ground up: entire components were built, and the construction of each component, its internal structure and the way it communicates with other components, were critical to the success of the project. Therefore, good software had to begin with intensive planning, and OOP thinking helps you do just that.

Today, things tend to look different. In the past few years, several technologies have become more and more popular in a way that makes us (or at least, me) rethink how we design software.

The infrastructure on which we run has changed. Cloud providers provide. They provide servers that are easy to set up and maintain through different scripts and configurations. And if a server goes down? Screw it. We’ll set up a new one, in just a few clicks (or none).

This way of doing things forces our software to be stateless, so our application cannot rely on data in memory or on local disks. What can we use in order for our application to be stateless? Cloud providers provide. They provide ready-made software to be used in our own: queues, NoSQL, YesSQL, blob storage, anything you need to save data in a central, replicated, safe manner.

But it doesn’t have to be the cloud provider’s own platforms. We can pick and choose our queues, NoSQL, SQL etc., and still not have to install and maintain each one with care. Container technologies like Docker enable entire solutions to be integrated together with just a few script files.

I can have Hadoop + Spark + Cassandra + Couchbase + ElasticSearch + Kafka + PostgreSQL up on my laptop, write some code in between, and Poof – I have a solution ready to be deployed in minutes on any scale, doing whatever it is I want it to do.
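A stack like that can be sketched, for example, as a docker-compose file. The service names, image tags and ports below are my own illustration, not from the original post; adjust them to whatever versions you actually need:

```yaml
# Illustrative docker-compose sketch of a few of the services mentioned above.
# Images shown are commonly used ones; tags and ports are assumptions.
version: "2"
services:
  zookeeper:
    image: zookeeper:3.4
  kafka:
    image: wurstmeister/kafka   # hypothetical choice of Kafka image
    ports:
      - "9092:9092"
    depends_on:
      - zookeeper
  cassandra:
    image: cassandra:3
    ports:
      - "9042:9042"
  elasticsearch:
    image: elasticsearch:5
    ports:
      - "9200:9200"
  postgres:
    image: postgres:9.6
    ports:
      - "5432:5432"
```

One `docker-compose up` later, and the whole stack is listening on localhost.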

The last thing that has changed dramatically is the software that we need to write. Bits and pieces of software are available to us, so we can ramp up our software without re-inventing everything. In a world where Maven repositories, NuGet, NPM, pip and Bower exist, who needs to write everything from scratch?

I know that this does not always apply. Some software needs to be implemented from scratch, usually when performance needs to be controlled at every level. When that is the case, it is hard to rely on 3rd-party software. But in most cases, relying on test-proven components from central repositories will save you hundreds and thousands of lines of code that you’ll probably just screw up anyway.

In a world where all of this exists and is easy to use, who am I to spend two months of development discussing inheritance issues, which classes should inherit from where, and whether or not to declare a field member, an interface or something else? Software is moving fast, and so should mine.


Spark 2 – What’s new?

I’ve prepared a lecture titled “What’s new in Spark 2?”. The slides can be found here.

The Spark 2.x line started in July 2016, and version 2.0.0 alone had over 1,000 JIRA issues associated with it. For comparison, version 1.6.0, the last release in the 1.x line, had just over 600 issues.

Most changes were made to the Spark SQL library, as Databricks seems to be building it up to be the future of Spark.

SparkSession is introduced as a new entry point and will replace SQLContext and HiveContext (which are kept for backwards compatibility).
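A minimal sketch of what the new entry point looks like (the app name and local master here are my own illustration, not from the talk):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession unifies the old SQLContext and HiveContext into one entry point.
val spark = SparkSession.builder()
  .appName("whats-new-in-spark-2")
  .master("local[*]")       // for local experimentation only
  .enableHiveSupport()      // optional: what HiveContext used to give you
  .getOrCreate()

// The old context is still reachable for backwards compatibility:
val sqlContext = spark.sqlContext
```

`getOrCreate()` also means you no longer juggle multiple contexts: it returns the existing session if one is already running.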

SparkSession creates Datasets (DataFrame is now just an alias for Dataset[Row]), which are now capable of streaming. This is one of the coolest new features, though still tagged as ‘preview’, and should be interesting to follow in the future.
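To give a feel for it, here is a sketch of the streaming API on Datasets (the input path and the word-count logic are illustrative assumptions, not from the talk):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()
import spark.implicits._

// Read a directory of text files as an unbounded DataFrame (Dataset[Row]);
// the path is just an example.
val lines = spark.readStream.text("/tmp/incoming")

// The same Dataset operations apply to the stream as to a batch Dataset:
val words  = lines.as[String].flatMap(_.split(" "))
val counts = words.groupBy("value").count()

// Start the query, printing updated counts to the console as files arrive.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

The point being: the code in the middle is plain Dataset code; only the `readStream`/`writeStream` edges know it is a stream.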