<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/">
  <channel>
    <title>Hadoop</title>
    <link>https://www.linuxjournal.com/</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>Datamation's "Leading Big Data Companies" Report</title>
  <link>https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report</link>
  <description>  &lt;div data-history-node-id="1339522" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The Big Data market is in a period of remarkable transition. If keeping tabs on
this dynamic sector is in your wheelhouse, &lt;a href="http://www.datamation.com"&gt;Datamation&lt;/a&gt; has made your homework
easier by developing "Leading Big Data Companies", a report that provides "a
snapshot of a market sector in transition". Covering everything from established
legacy vendors to start-ups, the report details the numerous strategies at work
in today's Big Data landscape. The core technologies employed by this
diverse group of vendors include cloud, open source, AI and several others.
&lt;/p&gt;

&lt;p&gt;
The
report is part of Datamation's ongoing focus on the latest emerging tech for
the enterprise. In the decade or so since Hadoop emerged at Yahoo!, Big Data
has burgeoned in popularity as ever more firms seek insights from the massive
amounts of data at their disposal.
&lt;/p&gt;

&lt;p&gt;
Big Data has matured differently from most technologies: no single leader
has emerged after nearly a decade, and the analytics industry finds itself
still in growth mode, dynamic and challenging for those trying to make sense
of it on their own.
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f3.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 13 Oct 2017 16:13:37 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339522 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>How YARN Changed Hadoop Job Scheduling</title>
  <link>https://www.linuxjournal.com/content/how-yarn-changed-hadoop-job-scheduling</link>
  <description>  &lt;div data-history-node-id="1335912" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/adam-diaz" lang="" about="https://www.linuxjournal.com/users/adam-diaz" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Adam Diaz&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Scheduling means different things depending on the audience. To many
in the business world, scheduling is synonymous with workflow management:
the coordinated execution of a collection of scripts or programs for a
business workflow, with monitoring, logging and execution guarantees built
into a WYSIWYG editor. Tools like Platform Process Manager come to mind as
an example. To others, scheduling is about process or network scheduling.
In the distributed computing world, scheduling means job scheduling or,
more correctly, workload management.
&lt;/p&gt;

&lt;p&gt;
Workload management is not only about how a specific unit
of work is submitted, packaged and scheduled, but also about how it runs,
handles failures and returns results. The HPC definition is fairly
close to the Hadoop definition of scheduling. One interesting way that
HPC scheduling and resource management cross paths is within the
Hadoop on Demand project. The Torque resource manager and the Maui Meta
Scheduler were both used for scheduling in the Hadoop on Demand project
during Hadoop's early days at Yahoo.
&lt;/p&gt;

&lt;p&gt;
This article compares and
contrasts the historically robust field of HPC workload management with
the rapidly evolving field of job scheduling happening in Hadoop
today.
&lt;/p&gt;

&lt;p&gt;
Both HPC and Hadoop can be called distributed computing, but they diverge
rapidly in architecture. HPC typically uses a share-everything architecture,
with compute nodes sharing common storage. In this case, the data for
each job has to be moved to the node via the shared storage system. A
shared storage layer makes writing job scripts a little easier, but it
also demands more expensive storage technology. The share-everything
paradigm also creates an ever-increasing demand on the network as the
cluster scales, so HPC centers quickly find they must move to
higher-speed networking to support parallel workloads at scale.
&lt;/p&gt;

&lt;p&gt;
Hadoop, on the other hand, uses a share-nothing architecture, meaning
that data is stored on individual nodes using local disk. Hadoop moves work to
the data and leverages inexpensive, fast local storage (JBOD) as much as
possible. A local storage architecture scales nearly linearly due to the
proportional increase in CPU, disk and I/O capacity as node count increases. A
fiber network is a nice option with Hadoop, but two bonded 1GbE interfaces or a
single 10GbE interface is, in many cases, fast enough. Using the slowest
practical networking technology provides a net savings to a project budget.
&lt;/p&gt;
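
&lt;p&gt;
A quick back-of-the-envelope sketch in Python illustrates why (every hardware
figure below is an assumed, illustrative number, not a benchmark):
&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Aggregate scan bandwidth of local JBOD storage versus one shared
# storage link. All figures are illustrative assumptions.
DISKS_PER_NODE = 12      # assumed spindles per data node
DISK_MB_S = 100          # assumed sequential read rate per spindle
SHARED_LINK_MB_S = 1250  # assumed single 10GbE path to shared storage

for nodes in (1, 10, 40):
    local_mb_s = nodes * DISKS_PER_NODE * DISK_MB_S
    print("%3d nodes: %6d MB/s local, %4d MB/s shared ceiling"
          % (nodes, local_mb_s, SHARED_LINK_MB_S))
&lt;/pre&gt;

&lt;p&gt;
Local bandwidth grows with every node added, while the shared link remains a
fixed ceiling; that is the near-linear scaling the share-nothing design buys.
&lt;/p&gt;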

&lt;p&gt;
In the Hadoop
philosophy, funds really should be allocated to additional data nodes, and the
same can be said for CPU, memory and the drives themselves. Adding nodes is
what makes the entire cluster both more parallel in operation and more
resistant to failure. The use of mid-range componentry, also called commodity
hardware, is what makes it affordable.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/how-yarn-changed-hadoop-job-scheduling" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 27 Jun 2014 20:59:23 +0000</pubDate>
    <dc:creator>Adam Diaz</dc:creator>
    <guid isPermaLink="false">1335912 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Introduction to MapReduce with Hadoop on Linux</title>
  <link>https://www.linuxjournal.com/content/introduction-mapreduce-hadoop-linux</link>
  <description>  &lt;div data-history-node-id="1085922" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/adam-monsen" lang="" about="https://www.linuxjournal.com/users/adam-monsen" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Adam Monsen&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
When your data and work grow, and you still want to produce results in a
timely manner, you start to think big. Your one beefy server reaches its
limits. You need a way to spread your work across many computers. You
truly need to scale out.
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;
In pioneer days they used oxen for heavy pulling, and when
one ox couldn't budge a log, they didn't try to grow a larger ox. We
shouldn't be trying for bigger computers, but for more systems of
computers.—Grace Hopper
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
Clearly, cluster computing is old news. What's changed? Today:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
We collect more data than ever before.
&lt;/li&gt;

&lt;li&gt;
Even small-to-medium-size businesses can benefit from tools like Hadoop and
MapReduce.
&lt;/li&gt;

&lt;li&gt;
You don't have to have a PhD to create and use your own cluster. 
&lt;/li&gt;

&lt;li&gt;
Many decent free/libre open-source tools can help you easily cluster commodity
hardware.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
Let me start with some simple examples that will run on one machine and
scale to meet larger demands. You can try them on your laptop and then
transition to a larger cluster—like one you've built with commodity
Linux machines, your company or university's Hadoop cluster or Amazon
Elastic MapReduce.
&lt;/p&gt;

&lt;span class="h3-replacement"&gt;
Parallel Problems&lt;/span&gt;

&lt;p&gt;
Let's start with problems that can be divided into smaller
independent units of work. These problems are roughly classified as
"embarrassingly parallel" and are—as the term
suggests—suitable for
parallel processing. Examples:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
Classify e-mail messages as spam.
&lt;/li&gt;

&lt;li&gt;
Transcode video.
&lt;/li&gt;

&lt;li&gt;
Render an Earth's worth of map tile images.
&lt;/li&gt;

&lt;li&gt;
Count logged lines matching a pattern.
&lt;/li&gt;

&lt;li&gt;
Figure out errors per day of week for a particular application.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
Now the hard work begins. Parallel computing is complex. Race conditions,
partial failure and synchronization impede our progress. Here's where
MapReduce saves our proverbial bacon.
&lt;/p&gt;
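
&lt;p&gt;
To make "embarrassingly parallel" concrete before bringing in Hadoop, here is
a minimal local sketch in Python (the logs/*.log path and ERROR pattern are
hypothetical): each worker counts matches in its own file, sharing nothing
until the final sum.
&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Count lines matching a pattern across many files, one file per worker.
# The workers never coordinate while running; that independence is what
# makes the problem embarrassingly parallel.
import glob
from multiprocessing import Pool

PATTERN = "ERROR"  # hypothetical pattern of interest

def count_matches(path):
    with open(path, errors="replace") as f:
        return sum(1 for line in f if PATTERN in line)

if __name__ == "__main__":
    with Pool() as pool:
        counts = pool.map(count_matches, glob.glob("logs/*.log"))
    print(sum(counts))
&lt;/pre&gt;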

&lt;span class="h3-replacement"&gt;
MapReduce by Example&lt;/span&gt;

&lt;p&gt;
MapReduce is a coding pattern that abstracts away many of the tricky bits of
scalable computation. We're free to focus on the problem at hand, but
it takes practice. So let's practice!
&lt;/p&gt;

&lt;p&gt;
Say you have 100 10GB log files from some custom
application, roughly a terabyte of data. You do a quick test and estimate
it will take your desktop days to grep every line (assuming you could even
fit the data on your desktop). And that's before you add in logic to
group by host and calculate totals. Your tried-and-true shell utilities
won't help, but MapReduce can handle this without breaking a sweat.
&lt;/p&gt;
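
&lt;p&gt;
To give a feel for the shape of the solution, here is a minimal Hadoop
Streaming pair in Python. Everything specific in it (the ERROR pattern and a
host field in the first column) is a hypothetical stand-in, not this
application's actual log format.
&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# mapper.py: emit "host\t1" for each log line matching the pattern.
# Assumes, hypothetically, a space-delimited line with the host first.
import sys

PATTERN = "ERROR"  # hypothetical pattern of interest

for line in sys.stdin:
    fields = line.split()
    if fields and PATTERN in line:
        print("%s\t1" % fields[0])
&lt;/pre&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# reducer.py: sum per-host counts. Hadoop Streaming sorts mapper output
# by key, so all lines for a given host arrive contiguously.
import sys

current_host, total = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    host, count = line.rstrip("\n").split("\t")
    if host != current_host:
        if current_host is not None:
            print("%s\t%d" % (current_host, total))
        current_host, total = host, 0
    total += int(count)
if current_host is not None:
    print("%s\t%d" % (current_host, total))
&lt;/pre&gt;

&lt;p&gt;
Because Streaming just reads and writes standard I/O, you can rehearse the
whole pipeline with plain shell pipes (cat sample.log | ./mapper.py | sort |
./reducer.py) before submitting it to a real cluster.
&lt;/p&gt;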

&lt;p&gt;
First let's look at the raw data. Log lines from the custom application
look like this:

&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/introduction-mapreduce-hadoop-linux" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Wed, 05 Jun 2013 19:26:51 +0000</pubDate>
    <dc:creator>Adam Monsen</dc:creator>
    <guid isPermaLink="false">1085922 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
