<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/">
  <channel>
    <title>Big Data</title>
    <link>https://www.linuxjournal.com/</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>FOSS Project Spotlight: Sawmill, the Data Processing Project</title>
  <link>https://www.linuxjournal.com/content/foss-project-spotlight-sawmill-data-processing-project</link>
  <description>  &lt;div data-history-node-id="1339777" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/daniel-berman" lang="" about="https://www.linuxjournal.com/users/daniel-berman" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Daniel Berman&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;&lt;em&gt;Introducing Sawmill, an open-source Java library for enriching, transforming and filtering JSON documents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're into centralized logging, you are probably familiar with the ELK Stack: Elasticsearch, Logstash and Kibana. Just in case you're not, ELK (or Elastic Stack, as it's being renamed these days) is a package of three open-source components, each responsible for a different task or stage in a data pipeline.&lt;/p&gt;

&lt;p&gt;Logstash is responsible for aggregating the data from your different data sources and processing it before sending it off for indexing and storage in Elasticsearch. This is a key role. How you process your log data directly impacts your analysis work. If your logs are not structured correctly and you have not configured Logstash correctly, your logs will not be parsed in a way that enables you to query and visualize them in Kibana.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://logz.io"&gt;Logz.io&lt;/a&gt; used to rely heavily on Logstash for ingesting data from our customers, running multiple Logstash instances at any given time. However, we began to experience some pain points that ultimately led us down the path to the project that is the subject of this article: Sawmill.&lt;/p&gt;

&lt;span class="h3-replacement"&gt;Explaining the Motivation&lt;/span&gt;

&lt;p&gt;Over time, and as our data pipelines became more complex and heavy, we began to encounter serious performance issues. Our Logstash configuration files became extremely complicated, which resulted in extremely long startup times. Processing also was taking too long, especially in the case of long log messages and in cases where there was a mismatch between the configuration and the actual log message.&lt;/p&gt;

&lt;p&gt;The above points resulted in serious stability issues, with Logstash coming to a halt or sometimes crashing. The worst thing about it was that troubleshooting was a huge challenge. We lacked visibility and felt a growing need for a way to monitor key performance metrics.&lt;/p&gt;

&lt;p&gt;There were additional issues we encountered, such as dynamic configuration reload and the ability to apply business logic, but suffice it to say, Logstash was simply not cutting it for us.&lt;/p&gt;

&lt;span class="h3-replacement"&gt;Introducing Sawmill&lt;/span&gt;

&lt;p&gt;Before diving into Sawmill, it's important to point out that Logstash has developed since the time we began working on this project, with new features that help deal with some of the pain points described above.&lt;/p&gt;

&lt;p&gt;So, what is Sawmill?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/logzio/sawmill"&gt;Sawmill&lt;/a&gt; is an open-source Java library for enriching, transforming and filtering JSON documents.&lt;/p&gt;

&lt;p&gt;For Logstash users, the best way to understand Sawmill is as a replacement for the filter section in the Logstash configuration file. Unlike Logstash, Sawmill does not have any inputs or outputs to read and write data. It is responsible only for data transformation.&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/foss-project-spotlight-sawmill-data-processing-project" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 22 Mar 2018 19:50:11 +0000</pubDate>
    <dc:creator>Daniel Berman</dc:creator>
    <guid isPermaLink="false">1339777 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Galit Shmueli et al.'s Data Mining for Business Analytics (Wiley)</title>
  <link>https://www.linuxjournal.com/content/galit-shmueli-et-als-data-mining-business-analytics-wiley</link>
  <description>  &lt;div data-history-node-id="1339537" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The updated 5th edition of &lt;em&gt;Data Mining for Business Analytics&lt;/em&gt;, from
Galit Shmueli and collaborators and published by &lt;a href="http://wiley.com"&gt;Wiley&lt;/a&gt;, is a standard guide to data mining and analytics that adds
two new co-authors and a trove of new material vis-à-vis its predecessor. R is a
free, open-source and increasingly popular software environment for statistical
computing and graphics. Bearing the subtitle &lt;em&gt;Concepts, Techniques, and
Applications in R&lt;/em&gt;, the new 5th edition of &lt;em&gt;Data Mining for
Business Analytics&lt;/em&gt;
continues to provide an applied approach to data-mining concepts and methods,
using R as the canvas on which to illustrate them. 
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f8.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;p&gt;
With the book, readers
learn how to implement a variety of popular data-mining algorithms in R to tackle
business problems and opportunities. Material covered in-depth includes both
statistical and machine-learning algorithms for prediction, classification,
visualization, dimension reduction, recommender systems, clustering, text mining
and network analysis. 
&lt;/p&gt;

&lt;p&gt;
The new 5th edition includes a dozen case studies drawn from business and
government that demonstrate applications of the data-mining techniques
described, as well as exercises in each chapter that help readers gauge and
expand their comprehension of, and competency with, the material. &lt;em&gt;Data Mining for
Business Analytics&lt;/em&gt; can serve as either a textbook or a reference for
analysts, researchers and practitioners working with quantitative methods in
myriad fields.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/galit-shmueli-et-als-data-mining-business-analytics-wiley" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 03 Nov 2017 16:11:00 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339537 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>InfluxData</title>
  <link>https://www.linuxjournal.com/content/influxdata</link>
  <description>  &lt;div data-history-node-id="1339533" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
What is ephemeral data, you ask? &lt;a href="https://www.influxdata.com"&gt;InfluxData&lt;/a&gt; can supply the answer, because
handling it is the business of the company's InfluxData open-source platform
that is custom-built for metrics and events. Ephemeral data is transitory,
existing only briefly, and is becoming vital for modern applications built where
containers, microservices and sensors can come and go and are intermittently
connected. The updated InfluxData 1.3 Platform can handle a billion (yes, with a
"b"!) unique time series, making it easier to handle ephemeral data coming
from containers or adding and removing sensors in IoT-tracking systems. 
InfluxData
addresses the explosion of data points and sources, monitoring and controls
requiring nanosecond precision coming from sensors and microservices. 
&lt;/p&gt;

&lt;p&gt;
The
InfluxData platform provides a comprehensive set of tools and services to
accumulate metrics and events data, analyze the data and act on the data via
powerful visualizations and notifications. New features in release 1.3 include
time-series indexing, high-availability anomaly detection, query language
improvements and automatic cluster rebalancing. InfluxData calls the new release
"one of the most significant technical advancements in the platform to
date".
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f7.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/influxdata" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 27 Oct 2017 17:03:50 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339533 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Learning Data Science</title>
  <link>https://www.linuxjournal.com/content/learning-data-science</link>
  <description>  &lt;div data-history-node-id="1339530" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've written about data science and
machine learning. In case my enthusiasm wasn't obvious from my
writing, let me say it plainly: it has been a long time since I last
encountered a technology that was so poised to revolutionize the world
in which we live.
&lt;/p&gt;

&lt;p&gt;
Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly
every possible topic you can imagine, for free. You can analyze that
data, publish it on a blog, and get reactions from governments and
companies.
&lt;/p&gt;

&lt;p&gt;
I remember learning in high school that the difference between freedom
of speech and freedom of the press is that not everyone has a printing
press. Not only has the internet provided everyone with the
equivalent of a printing press, but it has given us the power to
perform the sort of analysis that until recently was exclusively
available to governments and wealthy corporations.
&lt;/p&gt;

&lt;p&gt;
During the past year, I have increasingly heard that data science is
the sexiest profession of the 21st century and the one that will
be in greatest demand. Needless to say, those two things make for a very
appealing combination! It's no surprise that I've seen a major uptick
in the number of companies inviting me to teach on this subject.
&lt;/p&gt;

&lt;p&gt;
The upshot is that you—yes, you, dear reader—should spend time
in the coming weeks, months and years learning whatever you can
about data science. This isn't because you will change jobs and
become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at
it, because you will be able to use the tools of data science to analyze
past performance and make predictions based on it.
&lt;/p&gt;

&lt;p&gt;
Back when I started to develop web applications, it was the norm to
have a database team that created the tables and queries. Nowadays,
although there certainly are places that have a full-time database staff, the
assumption is that every developer has at least a passing familiarity
with relational (or even NoSQL) databases and how to work with
them. In the same way that developers who understand databases are
more powerful than those who don't, people in the computer field who
understand data science are more powerful than those who don't.
&lt;/p&gt;

&lt;p&gt;
There is a bit of bad news on this front, though. If you thought that
the pace of technological change in programming and the web moved at a
breakneck pace, you haven't seen anything yet! The world of data
science—the tools, the algorithms, the applications—is moving
at an overwhelming speed. The good news is that everyone is
struggling to keep up, which means if you find yourself
overwhelmed, you're probably in very good company. Just be sure to keep
moving ahead, aiming to increase your understanding of the theory,
algorithms, techniques and software that data scientists use.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/learning-data-science" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 24 Oct 2017 12:19:27 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339530 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Datamation's "Leading Big Data Companies" Report</title>
  <link>https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report</link>
  <description>  &lt;div data-history-node-id="1339522" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The Big Data market is in a period of remarkable transition. If keeping tabs on
this dynamic sector is in your wheelhouse, &lt;a href="http://www.datamation.com"&gt;Datamation&lt;/a&gt; has made your homework
easier by developing "Leading Big Data Companies", a report that provides "a
snapshot of a market sector in transition". Ranging from established legacy
vendors to start-ups, this report details the numerous strategies that are
exploited in today's Big Data landscape. 
The core technologies employed by this
diverse group of vendors include cloud, open source, AI and several others. 
&lt;/p&gt;

&lt;p&gt;
This
report is part of Datamation's ongoing focus on the latest emerging tech for
the enterprise. In the mere seven years that have passed since Yahoo! introduced
Hadoop, Big Data has burgeoned in popularity as ever more firms seek insights from
the massive amounts of data at their disposal. 
&lt;/p&gt;

&lt;p&gt;
Because Big Data has matured
differently from most technologies in that no single leader has emerged after
nearly a decade, the analytics industry finds itself still in growth mode, making
it dynamic and challenging for those trying to make sense of it on their own.
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f3.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 13 Oct 2017 16:13:37 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339522 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Novelty and Outlier Detection</title>
  <link>https://www.linuxjournal.com/content/novelty-and-outlier-detection</link>
  <description>  &lt;div data-history-node-id="1339508" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked at a number of ways
machine learning can help make predictions. The basic idea is
that you create a model using existing data and then ask that model to
predict an outcome based on new data.
&lt;/p&gt;

&lt;p&gt;
So, it's not surprising that one of the most amazing ways machine
learning is being applied is in predicting the future. Just a few days
before writing this piece, it was announced that machine learning
models actually might be able to predict earthquakes—a goal that
has eluded scientists for many years and that has the potential to
save thousands, and maybe even millions, of lives.
&lt;/p&gt;

&lt;p&gt;
But as you've also seen, machine learning can be used to
"cluster" data—that is, to find patterns that humans either can't or won't see,
and to try to put the data into various "clusters", or machine-driven
categories. By asking the computer to divide data into distinct
groups, you gain the opportunity to find and make use of previously
undetected patterns.
&lt;/p&gt;

&lt;p&gt;
Just as clustering can be used to divide data into a number of
coherent groups, it also can be used to decide which data points
belong inside a group and which don't. In "novelty
detection", you
have a data set that contains only good data, and you're trying to
determine whether new observations fit within the existing data
set. In "outlier detection", the data may contain outliers,
which you
want to identify.
&lt;/p&gt;
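
&lt;p&gt;
As a rough sketch of what novelty detection can look like in practice (an
illustration with invented numbers, not an excerpt from the article), the
following assumes scikit-learn's OneClassSVM estimator trained only on
"known good" measurements:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.svm import OneClassSVM

# Invented "known good" observations, e.g. normal response times in ms
good = [[100], [102], [98], [101], [99], [103]]

# Train only on clean data, then ask about new observations
model = OneClassSVM(nu=0.1, gamma='auto')
model.fit(good)

# predict() returns 1 if a new value fits the training data, -1 if not
print(model.predict([[100], [500]]))
&lt;/code&gt;
&lt;/pre&gt;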

&lt;p&gt;
Where could such detection be useful? Consider just a few
questions you could answer with such a system:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
&lt;p&gt;
Is there an unusual number of login attempts from a particular IP
address?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Are any customers buying more than the typical number of products
at a given hour?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Which homes are consuming above-average amounts of water during a
drought?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Which judges convict an unusual number of defendants?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Should a patient's blood tests be considered normal, or are there
outliers that require further checks and examinations?
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
In all of those cases, you could set thresholds for minimum and maximum
values and then tell the computer to use those thresholds in
determining what's suspicious. But machine learning changes that
around, letting the computer figure out what is considered "normal"
and then identify the anomalies, which humans then
can investigate. This allows people to concentrate their energies on
understanding whether the outliers are indeed problematic, rather than
on identifying them in the first place.
&lt;/p&gt;
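
&lt;p&gt;
For the outlier-detection side, a minimal illustrative sketch (again using
invented data rather than anything from the article) might hand a column of
daily login counts to scikit-learn's IsolationForest and let it flag the
anomalies:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.ensemble import IsolationForest

# Invented daily login counts, one row per IP address
logins = [[12], [15], [14], [13], [11], [16], [250]]

# Let the model decide what "normal" looks like and flag the rest
model = IsolationForest(contamination=0.1, random_state=0)
model.fit(logins)

# predict() returns 1 for inliers and -1 for outliers
print(model.predict(logins))
&lt;/code&gt;
&lt;/pre&gt;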

&lt;p&gt;
So in this article, I look at a number of ways you can try to
identify outliers using the tools and libraries that Python provides
for working with data: NumPy, Pandas and scikit-learn. Just which
technique and tools will be appropriate for your data depends on what
you're doing, but the basic theory and practice presented here should
at least provide you with some food for thought.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/novelty-and-outlier-detection" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 28 Sep 2017 12:31:03 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339508 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Classifying Text</title>
  <link>https://www.linuxjournal.com/content/classifying-text</link>
  <description>  &lt;div data-history-node-id="1339480" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked
at several ways one
can apply machine learning, both supervised and
unsupervised. This time, I want to bring your attention to a
surprisingly simple—but powerful and widespread—use of machine
learning, namely document classification.
&lt;/p&gt;

&lt;p&gt;
You almost certainly have seen this technique used in day-to-day
life. Actually, you might not have seen it in action, but you
certainly have benefited from it, in the form of an email spam filter.
You might remember that back in the earliest days of spam filters, you
needed to "train" your email program, so that it would know what your
real email looked like. Well, that was a machine-learning model in
action, being told what "good" documents looked like, as opposed to
"bad" documents. Of course, spam filters are far more sophisticated
than that nowadays, but as you'll see over the course of this
article, there are logical reasons why spammers include
innocent-seeming (and irrelevant to their business) words in the text
of their spam.
&lt;/p&gt;

&lt;p&gt;
Text classification is a problem many businesses and
organizations have to deal with. Whether it's classifying legal
documents, medical records or tweets, machine learning can help you
look through lots of text, separating it into different groups.
&lt;/p&gt;

&lt;p&gt;
Now, text classification requires a bit more sophistication than
working with purely numeric data. In particular, it requires that you
spend some time collecting and organizing data into a format that
a model can handle. Fortunately, Python's scikit-learn comes with a
number of tools that can get you there fairly easily.
&lt;/p&gt;

&lt;span class="h3-replacement"&gt;
Organizing the Data&lt;/span&gt;

&lt;p&gt;
Many cases of text classification are supervised learning
problems—that is, you'll train the model, give it inputs (for example,
text documents) and the "right" output for each input (for
example, categories). In scikit-learn, the general template for supervised
learning is:

&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
model = CLASS()              # instantiate one of scikit-learn's estimator classes
model.fit(X, y)              # train on known inputs X and known outputs y
model.predict(new_data_X)    # predict outputs for previously unseen inputs
&lt;/code&gt;
&lt;/pre&gt;


&lt;p&gt;
&lt;code&gt;CLASS&lt;/code&gt; is one of the 30 or so Python classes that come with
scikit-learn, each of which implements a different type of
"estimator"—a machine-learning algorithm. Some estimators work best with
supervised classification problems, some work with supervised
regression problems, and still others work with clustering (that is,
unsupervised classification) problems. You often will be able to
choose from among several different estimators, but the general format
remains the same.
&lt;/p&gt;
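
&lt;p&gt;
To make that template concrete, here is a small illustrative sketch (the
documents are invented, and MultinomialNB simply stands in for
&lt;code&gt;CLASS&lt;/code&gt;) that turns raw text into word counts with CountVectorizer
and then follows the same fit/predict pattern:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training documents and their categories
docs = ["meeting at noon", "project update attached",
        "win a free prize now", "cheap prize offer"]
y = ["good", "good", "spam", "spam"]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The same fit/predict template, with MultinomialNB as CLASS
model = MultinomialNB()
model.fit(X, y)
print(model.predict(vectorizer.transform(["free prize meeting"])))
&lt;/code&gt;
&lt;/pre&gt;&lt;/div&gt;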
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/classifying-text" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 05 Sep 2017 14:35:37 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339480 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Unsupervised Learning</title>
  <link>https://www.linuxjournal.com/content/unsupervised-learning</link>
  <description>  &lt;div data-history-node-id="1339461" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked into machine learning and
how you can build a model that describes the world in some way. All of
the examples I looked at were of "supervised learning", meaning
that you loaded data that already had been categorized or classified in
some way, and then created a model that "learned" the ways
the inputs mapped to the outputs. With a good model, you then
were able to predict the output for a new set of inputs.
&lt;/p&gt;
&lt;p&gt;
Supervised learning is a very useful technique and is quite
widespread. But, there is another set of techniques in machine
learning known as &lt;em&gt;unsupervised learning&lt;/em&gt;. These techniques, broadly
speaking, ask the computer to find the hidden structure in the
data—in other words, to "learn" what the meaning of the data is, what
relationships it contains, which features are of importance, and which
data records should be considered to be outliers or anomalies.
&lt;/p&gt;

&lt;p&gt;
Unsupervised learning also can be used for what's known as
"dimensionality reduction", in which the model functions as a
preprocessing step, reducing the number of features in order to
simplify the inputs that you'll hand to another model.
&lt;/p&gt;

&lt;p&gt;
In other words, in supervised learning, you teach the computer about
your data and hope that it understands the relationships and
categorization well enough to categorize data it hasn't
seen before successfully.
&lt;/p&gt;

&lt;p&gt;
In unsupervised learning, by contrast, you're asking the computer to
tell you something interesting about the data.
&lt;/p&gt;

&lt;p&gt;
This month, I take an initial look at the world of unsupervised
learning. Can a computer categorize data as well as a human? How can
you use Python's scikit-learn to create such models?
&lt;/p&gt;
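
&lt;p&gt;
As a taste of what that can look like with scikit-learn, here is a minimal
illustrative sketch (the two-dimensional points are invented for the example)
that asks KMeans to split unlabeled data into two clusters:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.cluster import KMeans

# Invented, unlabeled two-dimensional observations
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Ask for two clusters; no "right answers" are supplied
model = KMeans(n_clusters=2, random_state=0)
model.fit(points)

# Each observation is assigned to cluster 0 or cluster 1
print(model.labels_)
&lt;/code&gt;
&lt;/pre&gt;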

&lt;span class="h3-replacement"&gt;
Unsupervised Learning&lt;/span&gt;

&lt;p&gt;
There's a children's card game called &lt;em&gt;Set&lt;/em&gt; that is a useful way to
think about machine learning. Each card in the game contains a
picture. The picture contains one, two or three shapes. There are
several different shapes, and each shape has a color and a fill
pattern. In the game, players are supposed to identify groups of
three cards using any one of those properties. Thus, you could
create a group based on the color green, in which all cards are green
in color (but contain different numbers of shapes, different shapes and
different fill patterns). You could create a group based on the number of
shapes, in which every card has two shapes, but those shapes can be of any
type, any color and any fill pattern.
&lt;/p&gt;
                 
&lt;p&gt;
The idea behind the game is that players can create a variety of
different groups and should take advantage of this in order to win
the game.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/unsupervised-learning" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 10 Aug 2017 12:13:28 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339461 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>JMR SiloStor NVMe SSD Drives</title>
  <link>https://www.linuxjournal.com/content/jmr-silostor-nvme-ssd-drives</link>
  <description>  &lt;div data-history-node-id="1339460" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Compute-intensive workflows are the environments in which the newly developed &lt;a href="http://jmr.com"&gt;JMR&lt;/a&gt;
SiloStor NVMe family of SSD drives is designed to show its colors. Ideal for HPC,
data centers, genome research, content creation, CGI/animation, codec processing
and gaming, among others, the SiloStor drive family comes in three NVMe/PCIe
configurations: single-drive module, x4 PCIe connectivity in 512GB/1TB/2TB
capacities; dual-drive, x8 connectivity in 1TB/2TB/4TB capacities; and quad-drive
module, x8 connectivity, available in 2TB/4TB/8TB capacities. The dual- and
quad-drive cards incorporate a PCIe switch, and the drives can be striped (on a
single card) for additional performance. 
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12217f1.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;p&gt;
All SiloStor designs incorporate active
heatsink coolers on the drive modules themselves, maintaining low operating
temperatures even during intensive sequential write operations. Key performance
metrics include an average access time of &lt;1ms, 2 million hours MTBF, 1,200
TBW minimum endurance, 90,000/70,000 IOPS random 4K read/write speed and
4,000/3,000 MB/s sequential read/write speed.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/jmr-silostor-nvme-ssd-drives" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Wed, 09 Aug 2017 15:50:57 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339460 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Kodiak Data's MemCloud</title>
  <link>https://www.linuxjournal.com/content/kodiak-datas-memcloud</link>
  <description>  &lt;div data-history-node-id="1339452" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Scientists working with big data regularly confront the high cost 
of acquiring the computational power needed to push the boundaries
and innovate in data science. In an effort to bridge the Big Data
infrastructure chasm, Kodiak Data—a leader in cluster virtualization
technology—presents &lt;a href="http://www.memcloud.works"&gt;MemCloud&lt;/a&gt;, an
 innovative IaaS solution that accelerates the entire Big Data deployment
chain. 
&lt;/p&gt;

&lt;p&gt;
MemCloud is also "the first memory-speed cloud infrastructure
solution for big data scientists and software developers"
that provides big data analytic clusters "at up to one-fifth
the cost and five times the performance of typical leading cloud
hosting services". MemCloud is built on Kodiak Data's Virtual
Cluster Infrastructure platform, "the only solution capable of
in-software provisioning of compute, networking, storage and data at
the cluster level within minutes". 
&lt;/p&gt;

&lt;p&gt;
Besides the hosted cloud
service option, MemCloud also is available as a compact on-premises
appliance for private clouds, an industry first, asserts Kodiak.
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12202f8.png" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/kodiak-datas-memcloud" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 04 Aug 2017 15:30:56 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339452 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
