Your big data is not that big - here is why

There is no clear definition of how large a data set must be to count as Big Data. The term is thrown around for all sorts of data sizes, and technologies like Hadoop, Cassandra, and Storm are associated with it.
Modern databases can easily handle a few terabytes of data; only beyond that do you need Big Data analytics tools.

Developers typically get involved when the data that needs analysis no longer fits in Excel.

Beyond the scope of Excel:

If the data is too big for Excel, our tool of choice is Pandas, which is built on NumPy. Using Pandas, you can load hundreds of megabytes of data into memory and perform all sorts of data crunching and analysis. It can write the results out as CSV.
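
As a rough illustration of that load, crunch, and export workflow, here is a minimal sketch. The column names and the tiny inline data set are made up for the example; in practice you would pass a file path to `read_csv` instead of an in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical sales data; in practice this would be a CSV file on disk.
raw_csv = io.StringIO(
    "region,product,amount\n"
    "north,widget,120.5\n"
    "south,widget,80.0\n"
    "north,gadget,45.25\n"
    "south,gadget,60.0\n"
)

# Load the whole data set into memory as a DataFrame.
df = pd.read_csv(raw_csv)

# Crunch: total amount per region.
totals = df.groupby("region", as_index=False)["amount"].sum()

# Export the result as CSV text (pass a path to write a file instead).
result_csv = totals.to_csv(index=False)
print(result_csv)
```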

Pandas can handle tens of gigabytes of data, provided you have capable hardware. For example, we could easily load 10GB of data on a laptop with 16GB of RAM and a 240GB SSD.
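
Whether a data set fits depends on its in-memory size, which can differ from its size on disk. The sketch below shows one way to check the footprint of a DataFrame and shrink it by choosing smaller dtypes. The frame and its columns are hypothetical, and downcasting is only safe when your values fit the smaller type.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one million rows of an integer id and a float reading.
n_rows = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n_rows, dtype="int64"),
    "reading": np.zeros(n_rows, dtype="float64"),
})

# deep=True also counts the contents of object (string) columns.
bytes_used = int(df.memory_usage(deep=True).sum())
print(f"before: {bytes_used / 1e6:.1f} MB")

# Downcasting to narrower dtypes is one common way to reduce the footprint.
df["id"] = df["id"].astype("int32")
df["reading"] = df["reading"].astype("float32")
bytes_after = int(df.memory_usage(deep=True).sum())
print(f"after:  {bytes_after / 1e6:.1f} MB")
```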

My data is in Terabytes:

In this case you can go for PostgreSQL, which can easily handle a few terabytes of data. You get the full power of an RDBMS with SQL querying capabilities.

What if my data is more than 5TB?

Then you need Big Data analytics tools like Hadoop or Cassandra. Keep in mind that Hadoop does not offer the query capabilities of PostgreSQL. It has no built-in concept of indexing, so your queries will generally result in a full scan of the data; that will often lead to memory errors if you are not on powerful hardware.
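
To make the difference between an indexed lookup and a full table scan concrete, here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server; the table and index names are hypothetical. The query plan reports a scan before the index exists and an index search afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 1000, "x") for i in range(10_000)],
)

query = "SELECT COUNT(*) FROM events WHERE user_id = 42"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one plan step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


# Without an index, the engine has to read every row of the table.
plan_without_index = plan(query)

# With an index on user_id, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, and PostgreSQL's `EXPLAIN` output looks different again, but the scan-versus-index distinction is the same idea.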

The choice of tool is decided by the size of the data and the result we want to obtain. For example, choosing Hadoop for 1TB of data may not be a good choice; a standard RDBMS can handle it just fine.
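
The size tiers in this post can be summarized as a small helper. This is only a sketch of the rule of thumb: the function name and the exact cut-offs (10GB for Pandas, 5TB for PostgreSQL) are assumptions taken from the examples above, not hard limits, and the Excel tier is left out because no size is given for it.

```python
GB = 10**9
TB = 10**12


def choose_tool(data_bytes, pandas_limit=10 * GB, rdbms_limit=5 * TB):
    """Suggest a tool from the data size, following the rough tiers above."""
    # Assumed cut-off: data that fits in RAM on a well-equipped laptop.
    if data_bytes <= pandas_limit:
        return "Pandas"
    # Assumed cut-off: a single relational database handles a few terabytes.
    if data_bytes <= rdbms_limit:
        return "PostgreSQL"
    return "Hadoop or Cassandra"


print(choose_tool(500 * 10**6))
print(choose_tool(1 * TB))
print(choose_tool(8 * TB))
```

Size is only half of the decision; the kind of result you need should also shape the choice, which a helper like this does not capture.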