There’s a saying we’ve all heard: when something is difficult, it’s an uphill battle. And lacking the right environment to prepare your structured and unstructured data is like an uphill battle followed by bumpy switchbacks – all while climbing barefoot.
Preparing analytics-ready data from structured and unstructured sources is hard enough already, and without the right tools and processes in place it can erode the gains analytics is supposed to deliver. Data prep in its current state is a challenge for most organizations. This morning at the inaugural NG Analytics Summit, I shared my view on how to address the data preparation time suck.
Think of the 80/20 rule. Like good ground beef, you want the meat on the 80 percent side of things and the fat on the 20 percent side. Unfortunately, most data scientists are dealing with 80 percent fat and 20 percent meat. The expectations of what data scientists do today are very similar to those of yesterday’s statistician, except there’s a lot more data preparation – a.k.a. fat – to do. Today’s data scientists have far more responsibility on their hands: handling much more data and many more data structures, and pulling data together in technologies (like Hadoop) where there’s little to no structure around the data. The upfront work involves de-normalizing and normalizing data, recoding attributes, data exploration, univariate analysis and data profiling. And on top of all this, it needs to be coded. After the laborious task of traditional data preparation is completed, data scientists typically have only 20 percent of their time left to spend in production mode, where they can finally tune their algorithms.
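To give a feel for what that hand-coded prep looks like, here is a minimal sketch of two of the steps named above: recoding an attribute and running a univariate profile. It is an illustration only, not any particular product’s method. The sample records, the field names (`age`, `state`) and the helper functions are all hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample records; field names are invented for illustration.
records = [
    {"age": "34", "state": "ma"},
    {"age": "51", "state": "MA"},
    {"age": "", "state": "ny"},
    {"age": "29", "state": "NY "},
]


def recode_state(value):
    """Recode a free-text state value to a trimmed, upper-case code."""
    return value.strip().upper()


def profile_numeric(rows, field):
    """Univariate profile of a numeric field stored as text."""
    raw = [row.get(field, "") for row in rows]
    values = [float(v) for v in raw if v.strip()]  # skip blank entries
    profile = {"count": len(raw), "missing": len(raw) - len(values)}
    if values:  # summary statistics only make sense with at least one value
        profile.update(
            min=min(values),
            max=max(values),
            mean=mean(values),
            median=median(values),
        )
    return profile


def profile_categorical(rows, field, recode=None):
    """Frequency count of a categorical field, optionally recoded first."""
    values = [row.get(field, "") for row in rows]
    if recode is not None:
        values = [recode(v) for v in values]
    return Counter(values)


age_profile = profile_numeric(records, "age")
state_counts = profile_categorical(records, "state", recode=recode_state)

print(age_profile)
print(state_counts)
```

Even this toy version needs separate handling for missing values, type conversion and inconsistent coding, which is the kind of overhead that adds up across real data sets.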
But there is a better way.
RedPoint is aiming to reverse this 80/20 proportion. We’re trying to make the process as fast and automated as possible so our customers have more time to spend tuning their algorithms – the meat.
Programming frameworks like Hadoop allow us to gather far more data than before. Our data used to live in relational databases, but technologies like Hadoop now put even more emphasis on the work of data scientists, and that’s a great thing.
Even compared with other options, we take a vastly different approach to solving the same problem. For example, we took a sample word count problem, like one you’d run on a Project Gutenberg text. We set up a cluster of desktop discards and coded the job in MapReduce. It involved six hours of development and six minutes of runtime, and it needed extensive optimization. In our application running on Hadoop, there was no coding involved: 15 minutes of development, three minutes of runtime and no tuning or optimization required. Our customers can quickly and easily perform virtually any data management function within the Hadoop cluster. There’s no need to move data out of Hadoop.
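For readers who haven’t seen the word count problem, the sketch below shows its map, shuffle and reduce phases in plain Python. It is a single-process illustration of the idea only. It is not the MapReduce job we benchmarked and not our application, and the function names and sample text are assumptions made for the example.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical two-line sample standing in for a full book.
sample = ["It was the best of times,", "it was the worst of times."]
counts = word_count(sample)

for word in sorted(counts):
    print(word, counts[word])
```

On a real cluster, the map and reduce steps run in parallel across machines and the framework handles the shuffle. That is where most of the development and tuning effort in a hand-coded job tends to go.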
We think this is so important because, when data prep takes up 80 percent of the effort in modern data architectures, it has a direct impact on productivity and automation.
We perform all the functionality inside the data lake, on data in its raw document format. We do the ETL, the quality, the matching, the MDM, de-duping, parsing, merge/purge, address standardization, master keys, and more. We make the data immediately available to any application, directly in Hadoop. We can auto-structure data instantly.
Empowering organizations to break free from sole reliance on IT for data preparation is critical in a modern data-driven organization. We believe that setting up a data preparation environment where data scientists and analysts spend less time on data prep and more time on business-critical analytic projects will help produce more and better analytics, leading to better business outcomes. Besides, what’s the point of having all this data if you can’t effectively monetize it?