Big Data Analysis Architecture Considerations (MPP DWs vs MRFs)



In this post, we will focus on two of the most commonly used Big Data processing/storage approaches: Massively Parallel Processing Data Warehouses (MPP DWs) and MapReduce Frameworks (MRFs). The intended audience is someone who has a basic understanding of data concepts and is interested in data infrastructure considerations. Links are included for some of the more complex vocabulary/topics to provide additional context for the layman. Please feel free to reach out to me personally with constructive feedback on any of the topics covered here.

This section summarizes and analyzes the high-level differences between MPP DWs and MRFs (e.g., Spark and Hadoop). Some people may argue that Spark does not use MapReduce (for more info on this, please read this post), but for simplicity we will categorize all open-source, distributed, general-purpose cluster-computing frameworks as “MapReduce Frameworks”.

What Are They?

From a purely practical standpoint, one can think of a Massively Parallel Processing Data Warehouse (MPP DW) as a columnar, scaled-out database. MPP DWs use the concept of Sharding to distribute data and processing across many nodes. For the user/developer, interacting with an MPP DW is virtually the same as interacting with a standard Relational Database: they can view the schemas and tables, and execute queries against those tables. Some examples of MPP databases are Teradata, Greenplum, Netezza (IBM), Redshift (AWS), Azure SQL DW (Azure), BigQuery (Google Cloud Platform), and Vertica (HP). MPP DWs use schema-on-write, which means the schema is defined when the data is written to the MPP DW.
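To make schema-on-write concrete, here is a minimal sketch using Python's built-in sqlite3 module standing in for an MPP DW (the table and column names are hypothetical; a real MPP DW enforces the same declare-first behavior, just distributed across many nodes):

```python
import sqlite3

# Schema-on-write: the table structure is declared BEFORE any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Rows are validated against the declared schema at write time.
conn.execute("INSERT INTO sales VALUES (?, ?)", ("west", 120.5))
conn.execute("INSERT INTO sales VALUES (?, ?)", ("east", 80.0))

# Querying looks exactly like querying any standard relational database.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 200.5
```

This is why the developer experience on an MPP DW feels familiar: the SQL is the same as a single-node database, and the sharding happens underneath.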

MapReduce Framework (MRF) technologies use Virtual Machines (VMs) to distribute the computational processing of large datasets. This group of VMs is generally referred to as a “Cluster”. Contrary to MPP DWs, where the primary language used for processing is SQL, MRFs can process data using a variety of languages, including Pig Latin (Hadoop), HiveQL (Hadoop), Scala (Spark), Python (Spark), and R (Spark). Scripts/jobs are written to transform the data, ranging from simple processes (joins, casts, math operations, etc.) to more complex ones (running ML algorithms against datasets). MRFs generally use schema-on-read, which can be more difficult to manage for tables with many columns, since the structure must be re-applied by every job that reads the data.
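By contrast, schema-on-read means raw records are stored as-is and a structure is imposed only when a job reads them. A minimal sketch in plain Python (the field names are hypothetical; a real MRF would apply this logic in parallel across the cluster):

```python
import json

# Raw, untyped records as they might land in a data lake / HDFS.
raw_lines = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": "7", "stray_field": true}',
]

# The schema (field names and types) is applied at READ time, not write time.
# Unknown fields are simply ignored; types are coerced on the way in.
def read_with_schema(line):
    record = json.loads(line)
    return {"user": str(record["user"]), "clicks": int(record["clicks"])}

parsed = [read_with_schema(line) for line in raw_lines]
print(sum(r["clicks"] for r in parsed))  # 10
```

Note that nothing stopped the malformed/extra field from being written in the first place; the burden of handling it falls on every reader, which is exactly why wide schema-on-read tables become hard to manage.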

When Should I Use One Over the Other?

This section analyzes the costs/benefits of choosing one of these technologies over the other for a specific business application. Table 1 provides a breakdown of metrics for relevant features of each technology.

MPP DWs are best utilized for Business Intelligence solutions, which require less complex processing of your data: regular transformations, aggregations, and relatively simple IFTTT (if this, then that) logic. It is extremely difficult to run Machine Learning (ML) models in MPP DWs, primarily because SQL was not designed for training or running ML models. MPP DWs should also not be used where very high write throughput (inserts, updates, and deletes) is required, primarily because their columnar storage makes row-level writes expensive. If your company's data exceeds a magnitude of 100s of TBs, an MPP DW will start to get very expensive. You pay more for an MPP DW, with the trade-off of not having to pay employees as much to use/manage the system; however, at a certain point it may be worth paying more for the developers and less for the system itself.

Due to their ability to process data in a variety of programming languages (Python, R, Scala, etc.), MRFs excel at running ML models on large datasets. The underlying code for MRFs enables parallel processing of data using complex R and Python scripts. While MRFs are great for running ML models against large datasets, they are more difficult to manage and configure. The frameworks themselves are very inexpensive, as they are built on open-source software, but they require highly skilled programmers and developers to manage and configure the system. If configured properly, you can also find the right tools (Hive, Presto, etc.) for Business Intelligence solutions on MRFs.
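The parallel map-and-reduce pattern these frameworks are built on can be sketched with Python's standard library. Here a thread pool stands in for a cluster of VMs, and the word-count job is a hypothetical example, not a real Spark/Hadoop API:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from functools import reduce

# Map step: each worker processes its own partition of the data
# (a real MRF would run this on many VMs across the cluster).
def count_words(partition):
    return Counter(word for line in partition for word in line.split())

partitions = [["big data big"], ["data big"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, partitions))

# Reduce step: merge the per-partition results into a final answer.
totals = reduce(lambda a, b: a + b, partials)
print(totals["big"])  # 3
```

The same split-process-merge shape applies whether the map step is a simple count or a complex ML training script, which is why MRFs scale from basic transformations up to model training.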

Most companies should have both an MPP DW and an MRF, and should know when to use the appropriate tool for each business application that arises.


