<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/">
  <channel>
    <title>Big Data</title>
    <link>https://www.linuxjournal.com/</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>FOSS Project Spotlight: Sawmill, the Data Processing Project</title>
  <link>https://www.linuxjournal.com/content/foss-project-spotlight-sawmill-data-processing-project</link>
  <description>  &lt;div data-history-node-id="1339777" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/daniel-berman" lang="" about="https://www.linuxjournal.com/users/daniel-berman" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Daniel Berman&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;&lt;em&gt;Introducing Sawmill, an open-source Java library for enriching, transforming and filtering JSON documents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're into centralized logging, you are probably familiar with the ELK Stack: Elasticsearch, Logstash and Kibana. Just in case you're not, ELK (or Elastic Stack, as it's being renamed these days) is a package of three open-source components, each responsible for a different task or stage in a data pipeline.&lt;/p&gt;

&lt;p&gt;Logstash is responsible for aggregating the data from your different data sources and processing it before sending it off for indexing and storage in Elasticsearch. This is a key role. How you process your log data directly impacts your analysis work. If your logs are not structured correctly and you have not configured Logstash correctly, your logs will not be parsed in a way that enables you to query and visualize them in Kibana.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://logz.io"&gt;Logz.io&lt;/a&gt; used to rely heavily on Logstash for ingesting data from our customers, running multiple Logstash instances at any given time. However, we began to experience some pain points that ultimately led us down the path to the project that is the subject of this article: Sawmill.&lt;/p&gt;

&lt;span class="h3-replacement"&gt;Explaining the Motivation&lt;/span&gt;

&lt;p&gt;Over time, and as our data pipelines became more complex and heavy, we began to encounter serious performance issues. Our Logstash configuration files became extremely complicated, which resulted in extremely long startup times. Processing also was taking too long, especially in the case of long log messages and in cases where there was a mismatch between the configuration and the actual log message.&lt;/p&gt;

&lt;p&gt;The above points resulted in serious stability issues, with Logstash coming to a halt or sometimes crashing. The worst thing about it was that troubleshooting was a huge challenge. We lacked visibility and felt a growing need for a way to monitor key performance metrics.&lt;/p&gt;

&lt;p&gt;There were additional issues we encountered, such as dynamic configuration reload and the ability to apply business logic, but suffice it to say, Logstash was simply not cutting it for us.&lt;/p&gt;

&lt;span class="h3-replacement"&gt;Introducing Sawmill&lt;/span&gt;

&lt;p&gt;Before diving into Sawmill, it's important to point out that Logstash has developed since the time we began working on this project, with new features that help deal with some of the pain points described above.&lt;/p&gt;

&lt;p&gt;So, what is Sawmill?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/logzio/sawmill"&gt;Sawmill&lt;/a&gt; is an open-source Java library for enriching, transforming and filtering JSON documents.&lt;/p&gt;

&lt;p&gt;For Logstash users, the best way to understand Sawmill is as a replacement for the filter section in the Logstash configuration file. Unlike Logstash, Sawmill does not have any inputs or outputs to read and write data. It is responsible only for data transformation.&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/foss-project-spotlight-sawmill-data-processing-project" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 22 Mar 2018 19:50:11 +0000</pubDate>
    <dc:creator>Daniel Berman</dc:creator>
    <guid isPermaLink="false">1339777 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Galit Shmueli et al.'s Data Mining for Business Analytics (Wiley)</title>
  <link>https://www.linuxjournal.com/content/galit-shmueli-et-als-data-mining-business-analytics-wiley</link>
  <description>  &lt;div data-history-node-id="1339537" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The updated 5th edition of &lt;em&gt;Data Mining for Business Analytics&lt;/em&gt;, from
Galit Shmueli and collaborators and published by &lt;a href="http://wiley.com"&gt;Wiley&lt;/a&gt;, is a standard guide to data mining and analytics that adds
two new co-authors and a trove of new material vis-à-vis its predecessor. R is a
free, open-source and increasingly popular software environment for statistical
computing and graphics. Bearing the subtitle &lt;em&gt;Concepts, Techniques, and
Applications in R&lt;/em&gt;, the new 5th edition of &lt;em&gt;Data Mining for
Business Analytics&lt;/em&gt;
continues to provide an applied approach to data-mining concepts and methods,
using R as the canvas on which to illustrate them. 
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f8.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;p&gt;
With the book, readers
learn how to implement a variety of popular data-mining algorithms in R to tackle
business problems and opportunities. Material covered in-depth includes both
statistical and machine-learning algorithms for prediction, classification,
visualization, dimension reduction, recommender systems, clustering, text mining
and network analysis. 
&lt;/p&gt;

&lt;p&gt;
The new 5th edition includes a dozen case studies drawn from business and
government that demonstrate applications of the data-mining techniques
described, as well as exercises in each chapter that help readers gauge and
expand their comprehension of, and competency with, the material. &lt;em&gt;Data Mining for
Business Analytics&lt;/em&gt; can serve as either a textbook or a reference for
analysts, researchers and practitioners working with quantitative methods in
myriad fields.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/galit-shmueli-et-als-data-mining-business-analytics-wiley" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 03 Nov 2017 16:11:00 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339537 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>InfluxData</title>
  <link>https://www.linuxjournal.com/content/influxdata</link>
  <description>  &lt;div data-history-node-id="1339533" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
What is ephemeral data, you ask? &lt;a href="https://www.influxdata.com"&gt;InfluxData&lt;/a&gt; can supply the answer, because
handling it is the business of the company's InfluxData open-source platform
that is custom-built for metrics and events. Ephemeral data is transitory,
existing only briefly, and is becoming vital for modern applications built where
containers, microservices and sensors can come and go and are intermittently
connected. The updated InfluxData 1.3 Platform can handle a billion (yes, with a
"b"!) unique time series, making it easier to handle ephemeral data coming
from containers or adding and removing sensors in IoT-tracking systems. 
InfluxData
addresses the explosion of data points and sources, monitoring and controls
requiring nanosecond precision coming from sensors and microservices. 
&lt;/p&gt;

&lt;p&gt;
The
InfluxData platform provides a comprehensive set of tools and services to
accumulate metrics and events data, analyze the data and act on the data via
powerful visualizations and notifications. New features in release 1.3 include
time-series indexing, high-availability anomaly detection, query language
improvements and automatic cluster rebalancing. InfluxData calls the new release
"one of the most significant technical advancements in the platform to
date".
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f7.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/influxdata" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 27 Oct 2017 17:03:50 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339533 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Learning Data Science</title>
  <link>https://www.linuxjournal.com/content/learning-data-science</link>
  <description>  &lt;div data-history-node-id="1339530" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've written about data science and
machine learning. In case my enthusiasm wasn't obvious from my
writing, let me say it plainly: it has been a long time since I last
encountered a technology that was so poised to revolutionize the world
in which we live.
&lt;/p&gt;

&lt;p&gt;
Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly
every possible topic you can imagine, for free. You can analyze that
data, publish it on a blog, and get reactions from governments and
companies.
&lt;/p&gt;

&lt;p&gt;
I remember learning in high school that the difference between freedom
of speech and freedom of the press is that not everyone has a printing
press. Not only has the internet provided everyone with the
equivalent of a printing press, but it has given us the power to
perform the sort of analysis that until recently was exclusively
available to governments and wealthy corporations.
&lt;/p&gt;

&lt;p&gt;
During the past year, I have increasingly heard that data science is
the sexiest profession of the 21st century and the one that will
be in greatest demand. Needless to say, those two things make for a very
appealing combination! It's no surprise that I've seen a major uptick
in the number of companies inviting me to teach on this subject.
&lt;/p&gt;

&lt;p&gt;
The upshot is that you—yes, you, dear reader—should spend time
in the coming weeks, months and years learning whatever you can
about data science. This isn't because you will change jobs and
become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at
it, because you will be able to use the tools of data science to analyze
past performance and make predictions based on it.
&lt;/p&gt;

&lt;p&gt;
Back when I started to develop web applications, it was the norm to
have a database team that created the tables and queries. Nowadays,
although there certainly are places that have a full-time database staff, the
assumption is that every developer has at least a passing familiarity
with relational (or even NoSQL) databases and how to work with
them. In the same way that developers who understand databases are
more powerful than those who don't, people in the computer field who
understand data science are more powerful than those who don't.
&lt;/p&gt;

&lt;p&gt;
There is a bit of bad news on this front, though. If you thought that
the pace of technological change in programming and the web moved at a
breakneck pace, you haven't seen anything yet! The world of data
science—the tools, the algorithms, the applications—is moving
at an overwhelming speed. The good news is that everyone is
struggling to keep up, which means if you find yourself
overwhelmed, you're probably in very good company. Just be sure to keep
moving ahead, aiming to increase your understanding of the theory,
algorithms, techniques and software that data scientists use.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/learning-data-science" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 24 Oct 2017 12:19:27 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339530 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Datamation's "Leading Big Data Companies" Report</title>
  <link>https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report</link>
  <description>  &lt;div data-history-node-id="1339522" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The Big Data market is in a period of remarkable transition. If keeping tabs on
this dynamic sector is in your wheelhouse, &lt;a href="http://www.datamation.com"&gt;Datamation&lt;/a&gt; has made your homework
easier by developing "Leading Big Data Companies", a report that provides "a
snapshot of a market sector in transition". Ranging from established legacy
vendors to start-ups, this report details the numerous strategies that are
exploited in today's Big Data landscape. 
The core technologies employed by this
diverse group of vendors include cloud, open source, AI and several others. 
&lt;/p&gt;

&lt;p&gt;
This
report is part of Datamation's ongoing focus on the latest emerging tech for
the enterprise. In the mere seven years that have passed since Yahoo! introduced
Hadoop, Big Data has burgeoned in popularity as ever more firms seek insights from
the massive amounts of data at their disposal. 
&lt;/p&gt;

&lt;p&gt;
Because Big Data has matured
differently from most technologies in that no single leader has emerged after
nearly a decade, the analytics industry finds itself still in growth mode, making
it dynamic and challenging for those trying to make sense of it on their own.
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12237f3.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/datamations-leading-big-data-companies-report" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 13 Oct 2017 16:13:37 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339522 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Novelty and Outlier Detection</title>
  <link>https://www.linuxjournal.com/content/novelty-and-outlier-detection</link>
  <description>  &lt;div data-history-node-id="1339508" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked at a number of ways
machine learning can help make predictions. The basic idea is
that you create a model using existing data and then ask that model to
predict an outcome based on new data.
&lt;/p&gt;

&lt;p&gt;
So, it's not surprising that one of the most amazing ways machine
learning is being applied is in predicting the future. Just a few days
before writing this piece, it was announced that machine learning
models actually might be able to predict earthquakes—a goal that
has eluded scientists for many years and that has the potential to
save thousands, and maybe even millions, of lives.
&lt;/p&gt;

&lt;p&gt;
But as you've also seen, machine learning can be used to
"cluster" data—that is, to find patterns that humans either can't or won't see,
and to try to put the data into various "clusters", or machine-driven
categories. By asking the computer to divide data into distinct
groups, you gain the opportunity to find and make use of previously
undetected patterns.
&lt;/p&gt;

&lt;p&gt;
Just as clustering can be used to divide data into a number of
coherent groups, it also can be used to decide which data points
belong inside a group and which don't. In "novelty
detection", you
have a data set that contains only good data, and you're trying to
determine whether new observations fit within the existing data
set. In "outlier detection", the data may contain outliers,
which you
want to identify.
&lt;/p&gt;
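
&lt;p&gt;
As a rough sketch of what novelty detection can look like in practice (an
illustration with invented numbers, not an excerpt from the article), the
following assumes scikit-learn's OneClassSVM estimator trained only on
"known good" measurements:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.svm import OneClassSVM

# Invented "known good" observations, e.g. normal response times in ms
good = [[100], [102], [98], [101], [99], [103]]

# Train only on clean data, then ask about new observations
model = OneClassSVM(nu=0.1, gamma='auto')
model.fit(good)

# predict() returns 1 if a new value fits the training data, -1 if not
print(model.predict([[100], [500]]))
&lt;/code&gt;
&lt;/pre&gt;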

&lt;p&gt;
Where could such detection be useful? Consider just a few
questions you could answer with such a system:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
&lt;p&gt;
Is there an unusual number of login attempts from a particular IP
address?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Are any customers buying more than the typical number of products
at a given hour?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Which homes are consuming above-average amounts of water during a
drought?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Which judges convict an unusual number of defendants?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Should a patient's blood tests be considered normal, or are there
outliers that require further checks and examinations?
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
In all of those cases, you could set thresholds for minimum and maximum
values and then tell the computer to use those thresholds in
determining what's suspicious. But machine learning changes that
around, letting the computer figure out what is considered "normal"
and then identify the anomalies, which humans then
can investigate. This allows people to concentrate their energies on
understanding whether the outliers are indeed problematic, rather than
on identifying them in the first place.
&lt;/p&gt;
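
&lt;p&gt;
For the outlier-detection side, a minimal illustrative sketch (again using
invented data rather than anything from the article) might hand a column of
daily login counts to scikit-learn's IsolationForest and let it flag the
anomalies:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.ensemble import IsolationForest

# Invented daily login counts, one row per IP address
logins = [[12], [15], [14], [13], [11], [16], [250]]

# Let the model decide what "normal" looks like and flag the rest
model = IsolationForest(contamination=0.1, random_state=0)
model.fit(logins)

# predict() returns 1 for inliers and -1 for outliers
print(model.predict(logins))
&lt;/code&gt;
&lt;/pre&gt;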

&lt;p&gt;
So in this article, I look at a number of ways you can try to
identify outliers using the tools and libraries that Python provides
for working with data: NumPy, Pandas and scikit-learn. Just which
technique and tools will be appropriate for your data depends on what
you're doing, but the basic theory and practice presented here should
at least provide you with some food for thought.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/novelty-and-outlier-detection" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 28 Sep 2017 12:31:03 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339508 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Classifying Text</title>
  <link>https://www.linuxjournal.com/content/classifying-text</link>
  <description>  &lt;div data-history-node-id="1339480" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked
at several ways one
can apply machine learning, both supervised and
unsupervised. This time, I want to bring your attention to a
surprisingly simple—but powerful and widespread—use of machine
learning, namely document classification.
&lt;/p&gt;

&lt;p&gt;
You almost certainly have seen this technique used in day-to-day
life. Actually, you might not have seen it in action, but you
certainly have benefited from it, in the form of an email spam filter.
You might remember that back in the earliest days of spam filters, you
needed to "train" your email program, so that it would know what your
real email looked like. Well, that was a machine-learning model in
action, being told what "good" documents looked like, as opposed to
"bad" documents. Of course, spam filters are far more sophisticated
than that nowadays, but as you'll see over the course of this
article, there are logical reasons why spammers include
innocent-seeming (and irrelevant to their business) words in the text
of their spam.
&lt;/p&gt;

&lt;p&gt;
Text classification is a problem many businesses and
organizations have to deal with. Whether it's classifying legal
documents, medical records or tweets, machine learning can help you
look through lots of text, separating it into different groups.
&lt;/p&gt;

&lt;p&gt;
Now, text classification requires a bit more sophistication than
working with purely numeric data. In particular, it requires that you
spend some time collecting and organizing data into a format that
a model can handle. Fortunately, Python's scikit-learn comes with a
number of tools that can get you there fairly easily.
&lt;/p&gt;

&lt;span class="h3-replacement"&gt;
Organizing the Data&lt;/span&gt;

&lt;p&gt;
Many cases of text classification are supervised learning
problems—that is, you'll train the model, give it inputs (for example,
text documents) and the "right" output for each input (for
example, categories). In scikit-learn, the general template for supervised
learning is:

&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
model = CLASS()              # instantiate one of scikit-learn's estimator classes
model.fit(X, y)              # train on known inputs X and known outputs y
model.predict(new_data_X)    # predict outputs for previously unseen inputs
&lt;/code&gt;
&lt;/pre&gt;


&lt;p&gt;
&lt;code&gt;CLASS&lt;/code&gt; is one of the 30 or so Python classes that come with
scikit-learn, each of which implements a different type of
"estimator"—a machine-learning algorithm. Some estimators work best with
supervised classification problems, some work with supervised
regression problems, and still others work with clustering (that is,
unsupervised classification) problems. You often will be able to
choose from among several different estimators, but the general format
remains the same.
&lt;/p&gt;
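
&lt;p&gt;
To make that template concrete, here is a small illustrative sketch (the
documents are invented, and MultinomialNB simply stands in for
&lt;code&gt;CLASS&lt;/code&gt;) that turns raw text into word counts with CountVectorizer
and then follows the same fit/predict pattern:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training documents and their categories
docs = ["meeting at noon", "project update attached",
        "win a free prize now", "cheap prize offer"]
y = ["good", "good", "spam", "spam"]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The same fit/predict template, with MultinomialNB as CLASS
model = MultinomialNB()
model.fit(X, y)
print(model.predict(vectorizer.transform(["free prize meeting"])))
&lt;/code&gt;
&lt;/pre&gt;&lt;/div&gt;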
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/classifying-text" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 05 Sep 2017 14:35:37 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339480 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Unsupervised Learning</title>
  <link>https://www.linuxjournal.com/content/unsupervised-learning</link>
  <description>  &lt;div data-history-node-id="1339461" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked into machine learning and
how you can build a model that describes the world in some way. All of
the examples I looked at were of "supervised learning", meaning
that you loaded data that already had been categorized or classified in
some way, and then created a model that "learned" the ways
the inputs mapped to the outputs. With a good model, you then
were able to predict the output for a new set of inputs.
&lt;/p&gt;
&lt;p&gt;
Supervised learning is a very useful technique and is quite
widespread. But, there is another set of techniques in machine
learning known as &lt;em&gt;unsupervised learning&lt;/em&gt;. These techniques, broadly
speaking, ask the computer to find the hidden structure in the
data—in other words, to "learn" what the meaning of the data is, what
relationships it contains, which features are of importance, and which
data records should be considered to be outliers or anomalies.
&lt;/p&gt;

&lt;p&gt;
Unsupervised learning also can be used for what's known as
"dimensionality reduction", in which the model functions as a
preprocessing step, reducing the number of features in order to
simplify the inputs that you'll hand to another model.
&lt;/p&gt;

&lt;p&gt;
In other words, in supervised learning, you teach the computer about
your data and hope that it understands the relationships and
categorization well enough to categorize data it hasn't
seen before successfully.
&lt;/p&gt;

&lt;p&gt;
In unsupervised learning, by contrast, you're asking the computer to
tell you something interesting about the data.
&lt;/p&gt;

&lt;p&gt;
This month, I take an initial look at the world of unsupervised
learning. Can a computer categorize data as well as a human? How can
you use Python's scikit-learn to create such models?
&lt;/p&gt;
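
&lt;p&gt;
As a taste of what that can look like with scikit-learn, here is a minimal
illustrative sketch (the two-dimensional points are invented for the example)
that asks KMeans to split unlabeled data into two clusters:
&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
from sklearn.cluster import KMeans

# Invented, unlabeled two-dimensional observations
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Ask for two clusters; no "right answers" are supplied
model = KMeans(n_clusters=2, random_state=0)
model.fit(points)

# Each observation is assigned to cluster 0 or cluster 1
print(model.labels_)
&lt;/code&gt;
&lt;/pre&gt;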

&lt;span class="h3-replacement"&gt;
Unsupervised Learning&lt;/span&gt;

&lt;p&gt;
There's a children's card game called &lt;em&gt;Set&lt;/em&gt; that is a useful way to
think about machine learning. Each card in the game contains a
picture. The picture contains one, two or three shapes. There are
several different shapes, and each shape has a color and a fill
pattern. In the game, players are supposed to identify groups of
three cards using any one of those properties. Thus, you could
create a group based on the color green, in which all cards are green
in color (but contain different numbers of shapes, different shapes and
different fill patterns). You could create a group based on the number of
shapes, in which every card has two shapes, but those shapes can be of any
type, any color and any fill pattern.
&lt;/p&gt;
                 
&lt;p&gt;
The idea behind the game is that players can create a variety of
different groups and should take advantage of this in order to win
the game.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/unsupervised-learning" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 10 Aug 2017 12:13:28 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339461 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>JMR SiloStor NVMe SSD Drives</title>
  <link>https://www.linuxjournal.com/content/jmr-silostor-nvme-ssd-drives</link>
  <description>  &lt;div data-history-node-id="1339460" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Compute-intensive workflows are the environments in which the newly developed &lt;a href="http://jmr.com"&gt;JMR&lt;/a&gt;
SiloStor NVMe family of SSD drives is designed to show its colors. Ideal for HPC,
data centers, genome research, content creation, CGI/animation, codec processing
and gaming, among others, the SiloStor drive family comes in three NVMe/PCIe
configurations: single-drive module, x4 PCIe connectivity in 512GB/1TB/2TB
capacities; dual-drive, x8 connectivity in 1TB/2TB/4TB capacities; and quad-drive
module, x8 connectivity, available in 2TB/4TB/8TB capacities. The dual- and
quad-drive cards incorporate a PCIe switch, and the drives can be striped (on a
single card) for additional performance. 
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12217f1.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;p&gt;
All SiloStor designs incorporate active
heatsink coolers on the drive modules themselves, maintaining low operating
temperatures even during intensive sequential write operations. Key performance
metrics include an average access time of &lt;1ms, 2 million hours MTBF, 1,200
TBW minimum endurance, 90,000/70,000 IOPS random 4K read/write speed and
4,000/3,000 MB/s sequential read/write speed.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/jmr-silostor-nvme-ssd-drives" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Wed, 09 Aug 2017 15:50:57 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339460 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Kodiak Data's MemCloud</title>
  <link>https://www.linuxjournal.com/content/kodiak-datas-memcloud</link>
  <description>  &lt;div data-history-node-id="1339452" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Scientists working with big data regularly confront the high cost 
of acquiring the computational power needed to push the boundaries
and innovate in data science. In an effort to bridge the Big Data
infrastructure chasm, Kodiak Data—a leader in cluster virtualization
technology—presents &lt;a href="http://www.memcloud.works"&gt;MemCloud&lt;/a&gt;, an
 innovative IaaS solution that accelerates the entire Big Data deployment
chain. 
&lt;/p&gt;

&lt;p&gt;
MemCloud is also "the first memory-speed cloud infrastructure
solution for big data scientists and software developers"
that provides big data analytic clusters "at up to one-fifth
the cost and five times the performance of typical leading cloud
hosting services". MemCloud is built on Kodiak Data's Virtual
Cluster Infrastructure platform, "the only solution capable of
in-software provisioning of compute, networking, storage and data at
the cluster level within minutes". 
&lt;/p&gt;

&lt;p&gt;
Besides the hosted cloud
service option, MemCloud also is available as a compact on-premises
appliance for private clouds, an industry first, asserts Kodiak.
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12202f8.png" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/kodiak-datas-memcloud" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 04 Aug 2017 15:30:56 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339452 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
