<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/">
  <channel>
    <title>Hadoop</title>
    <link>https://www.linuxjournal.com/</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>Datamation's "Leading Big Data Companies" Report</title>
  <link>https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report</link>
  <description>  &lt;div data-history-node-id="1339522" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The Big Data market is in a period of remarkable transition. If keeping tabs on
this dynamic sector is in your wheelhouse, &lt;a href="http://www.datamation.com"&gt;Datamation&lt;/a&gt; has made your homework
easier by developing "Leading Big Data Companies", a report that provides "a
snapshot of a market sector in transition". Covering everything from established
legacy vendors to start-ups, the report details the numerous strategies at work
in today's Big Data landscape. The core technologies employed by this
diverse group of vendors include cloud, open source, AI and several others.
&lt;/p&gt;

&lt;p&gt;
The
report is part of Datamation's ongoing focus on the latest emerging tech for
the enterprise. In the decade or so since Hadoop emerged at Yahoo!, Big Data
has burgeoned in popularity as ever more firms seek insights from the massive
amounts of data at their disposal.
&lt;/p&gt;

&lt;p&gt;
Big Data has matured differently from most technologies: no single leader
has emerged after nearly a decade, and the analytics industry finds itself
still in growth mode, dynamic and challenging for those trying to make sense
of it on their own.
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f3.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 13 Oct 2017 16:13:37 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339522 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>How YARN Changed Hadoop Job Scheduling</title>
  <link>https://www.linuxjournal.com/content/how-yarn-changed-hadoop-job-scheduling</link>
  <description>  &lt;div data-history-node-id="1335912" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/adam-diaz" lang="" about="https://www.linuxjournal.com/users/adam-diaz" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Adam Diaz&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Scheduling means different things depending on the audience. To many
in the business world, scheduling is synonymous with workflow management:
the coordinated execution of a collection of scripts or programs for a
business workflow, with monitoring, logging and execution guarantees built
into a WYSIWYG editor. Tools like Platform Process Manager come to mind as
an example. To others, scheduling is about process or network scheduling.
In the distributed computing world, scheduling means job scheduling or,
more correctly, workload management.
&lt;/p&gt;

&lt;p&gt;
Workload management is not only about how a specific unit
of work is submitted, packaged and scheduled, but also about how it runs,
handles failures and returns results. The HPC definition is fairly
close to the Hadoop definition of scheduling. One interesting way that
HPC scheduling and resource management cross paths is within the
Hadoop on Demand project. The Torque resource manager and the Maui Meta
Scheduler were both used for scheduling in the Hadoop on Demand project
during Hadoop's early days at Yahoo.
&lt;/p&gt;

&lt;p&gt;
This article compares and
contrasts the historically robust field of HPC workload management with
the rapidly evolving field of job scheduling happening in Hadoop
today.
&lt;/p&gt;

&lt;p&gt;
Both HPC and Hadoop can be called distributed computing, but they diverge
rapidly in architecture. HPC typically uses a share-everything architecture,
with compute nodes sharing common storage. In this case, the data for
each job has to be moved to the node via the shared storage system. A
shared storage layer makes writing job scripts a little easier, but it
also demands more expensive storage technology. The share-everything
paradigm also creates an ever-increasing demand on the network as the
cluster scales, so HPC centers quickly find they must move to
higher-speed networking to support parallel workloads at scale.
&lt;/p&gt;

&lt;p&gt;
Hadoop, on the other hand, uses a share-nothing architecture, meaning
that data is stored on individual nodes using local disk. Hadoop moves work to
the data and leverages inexpensive, fast local storage (JBOD) as much as
possible. A local storage architecture scales nearly linearly due to the
proportional increase in CPU, disk and I/O capacity as node count increases. A
fiber network is a nice option with Hadoop, but two bonded 1GbE interfaces or a
single 10GbE interface is, in many cases, fast enough. Using the slowest
practical networking technology provides a net savings to a project budget.
&lt;/p&gt;
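
&lt;p&gt;
A quick back-of-the-envelope sketch in Python illustrates why (every hardware
figure below is an assumed, illustrative number, not a benchmark):
&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Aggregate scan bandwidth of local JBOD storage versus one shared
# storage link. All figures are illustrative assumptions.
DISKS_PER_NODE = 12      # assumed spindles per data node
DISK_MB_S = 100          # assumed sequential read rate per spindle
SHARED_LINK_MB_S = 1250  # assumed single 10GbE path to shared storage

for nodes in (1, 10, 40):
    local_mb_s = nodes * DISKS_PER_NODE * DISK_MB_S
    print("%3d nodes: %6d MB/s local, %4d MB/s shared ceiling"
          % (nodes, local_mb_s, SHARED_LINK_MB_S))
&lt;/pre&gt;

&lt;p&gt;
Local bandwidth grows with every node added, while the shared link remains a
fixed ceiling; that is the near-linear scaling the share-nothing design buys.
&lt;/p&gt;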

&lt;p&gt;
In the Hadoop
philosophy, funds really should be allocated to additional data nodes, and the
same can be said for CPU, memory and the drives themselves. Adding nodes is
what makes the entire cluster both more parallel in operation and more
resistant to failure. The use of mid-range componentry, also called commodity
hardware, is what makes it affordable.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/how-yarn-changed-hadoop-job-scheduling" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 27 Jun 2014 20:59:23 +0000</pubDate>
    <dc:creator>Adam Diaz</dc:creator>
    <guid isPermaLink="false">1335912 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Introduction to MapReduce with Hadoop on Linux</title>
  <link>https://www.linuxjournal.com/content/introduction-mapreduce-hadoop-linux</link>
  <description>  &lt;div data-history-node-id="1085922" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/adam-monsen" lang="" about="https://www.linuxjournal.com/users/adam-monsen" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Adam Monsen&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
When your data and work grow, and you still want to produce results in a
timely manner, you start to think big. Your one beefy server reaches its
limits. You need a way to spread your work across many computers. You
truly need to scale out.
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;
In pioneer days they used oxen for heavy pulling, and when
one ox couldn't budge a log, they didn't try to grow a larger ox. We
shouldn't be trying for bigger computers, but for more systems of
computers.—Grace Hopper
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
Clearly, cluster computing is old news. What's changed? Today:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
We collect more data than ever before.
&lt;/li&gt;

&lt;li&gt;
Even small-to-medium-size businesses can benefit from tools like Hadoop and
MapReduce.
&lt;/li&gt;

&lt;li&gt;
You don't have to have a PhD to create and use your own cluster. 
&lt;/li&gt;

&lt;li&gt;
Many decent free/libre open-source tools can help you easily cluster commodity
hardware.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
Let me start with some simple examples that will run on one machine and
scale to meet larger demands. You can try them on your laptop and then
transition to a larger cluster—like one you've built with commodity
Linux machines, your company or university's Hadoop cluster or Amazon
Elastic MapReduce.
&lt;/p&gt;

&lt;span class="h3-replacement"&gt;
Parallel Problems&lt;/span&gt;

&lt;p&gt;
Let's start with problems that can be divided into smaller
independent units of work. These problems are roughly classified as
"embarrassingly parallel" and are—as the term
suggests—suitable for
parallel processing. Examples:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
Classify e-mail messages as spam.
&lt;/li&gt;

&lt;li&gt;
Transcode video.
&lt;/li&gt;

&lt;li&gt;
Render an Earth's worth of map tile images.
&lt;/li&gt;

&lt;li&gt;
Count logged lines matching a pattern.
&lt;/li&gt;

&lt;li&gt;
Figure out errors per day of week for a particular application.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
Now the hard work begins. Parallel computing is complex. Race conditions,
partial failure and synchronization impede our progress. Here's where
MapReduce saves our proverbial bacon.
&lt;/p&gt;
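
&lt;p&gt;
To make "embarrassingly parallel" concrete before bringing in Hadoop, here is
a minimal local sketch in Python (the logs/*.log path and ERROR pattern are
hypothetical): each worker counts matches in its own file, sharing nothing
until the final sum.
&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Count lines matching a pattern across many files, one file per worker.
# The workers never coordinate while running; that independence is what
# makes the problem embarrassingly parallel.
import glob
from multiprocessing import Pool

PATTERN = "ERROR"  # hypothetical pattern of interest

def count_matches(path):
    with open(path, errors="replace") as f:
        return sum(1 for line in f if PATTERN in line)

if __name__ == "__main__":
    with Pool() as pool:
        counts = pool.map(count_matches, glob.glob("logs/*.log"))
    print(sum(counts))
&lt;/pre&gt;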

&lt;span class="h3-replacement"&gt;
MapReduce by Example&lt;/span&gt;

&lt;p&gt;
MapReduce is a coding pattern that abstracts away many of the tricky bits of
scalable computation. We're free to focus on the problem at hand, but
it takes practice. So let's practice!
&lt;/p&gt;

&lt;p&gt;
Say you have 100 10GB log files from some custom
application, roughly a terabyte of data. You do a quick test and estimate
it will take your desktop days to grep every line (assuming you could even
fit the data on your desktop). And that's before you add in logic to
group by host and calculate totals. Your tried-and-true shell utilities
won't help, but MapReduce can handle this without breaking a sweat.
&lt;/p&gt;
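
&lt;p&gt;
To give a feel for the shape of the solution, here is a minimal Hadoop
Streaming pair in Python. Everything specific in it (the ERROR pattern and a
host field in the first column) is a hypothetical stand-in, not this
application's actual log format.
&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# mapper.py: emit "host\t1" for each log line matching the pattern.
# Assumes, hypothetically, a space-delimited line with the host first.
import sys

PATTERN = "ERROR"  # hypothetical pattern of interest

for line in sys.stdin:
    fields = line.split()
    if fields and PATTERN in line:
        print("%s\t1" % fields[0])
&lt;/pre&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# reducer.py: sum per-host counts. Hadoop Streaming sorts mapper output
# by key, so all lines for a given host arrive contiguously.
import sys

current_host, total = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    host, count = line.rstrip("\n").split("\t")
    if host != current_host:
        if current_host is not None:
            print("%s\t%d" % (current_host, total))
        current_host, total = host, 0
    total += int(count)
if current_host is not None:
    print("%s\t%d" % (current_host, total))
&lt;/pre&gt;

&lt;p&gt;
Because Streaming just reads and writes standard I/O, you can rehearse the
whole pipeline with plain shell pipes (cat sample.log | ./mapper.py | sort |
./reducer.py) before submitting it to a real cluster.
&lt;/p&gt;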

&lt;p&gt;
First let's look at the raw data. Log lines from the custom application
look like this:

&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/introduction-mapreduce-hadoop-linux" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Wed, 05 Jun 2013 19:26:51 +0000</pubDate>
    <dc:creator>Adam Monsen</dc:creator>
    <guid isPermaLink="false">1085922 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
