I thought it might be beneficial to our partner community to highlight some of the common questions that kept coming up around Hadoop and big data, and to summarize the panel's responses. We actually agreed most of the time, which made the task of summarizing the panel's opinions not overly daunting.
Does Hadoop replace my existing Data Warehouse?
Panel says: No. Hadoop can be an extremely valuable extension to your data warehouse and can even offload some services from it (such as ETL), but it does not replace it. Hadoop is not an RDBMS; it is not an ACID-compliant database; it is not even a database. It is a file system (the Hadoop Distributed File System, or HDFS) paired with an analytic/calculation engine (MapReduce). Yes, we can add SQL services like Hive and other processing engines like Spark, but that still doesn't replace an Enterprise Data Warehouse. Hive and the other SQL-on-Hadoop tools don't implement the full ANSI SQL standard; they support a subset of ANSI SQL-92 features, and pushing warehouse-style workloads onto them has significant speed/performance implications. Hadoop is complementary to your data warehouse.
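To make the "file system plus calculation engine" distinction concrete, here is a minimal word-count sketch in the classic MapReduce style, written for Hadoop Streaming so the map and reduce steps are plain Python scripts that read stdin and write stdout. The script names and logic below are illustrative assumptions on my part, not something the panel prescribed:

```python
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each word. Hadoop sorts mapper
# output by key before the reduce step, so identical words arrive
# together and a simple running total works.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Notice what's missing: no tables, no transactions, no query optimizer. You point the Hadoop Streaming jar at an HDFS input directory and an output directory and let the framework handle distribution; everything database-like has to be layered on top.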
Of course, if we really wanted to complicate things, we could dig deeper into what you consider to be a data warehouse, and we would get a variety of answers that run the spectrum. If the answer was something like "our data warehouse is really just a repository of data from a handful of sources, without any complex schemas or modeling," then maybe you "could" actually move everything to Hadoop. But since that is fairly academic and probably of limited applicability to most enterprise customers, I'll stick with my original answer of: no.
What about Spark? Does it replace Hadoop?
Once again: No. Spark is an in-memory processing engine that can run on top of HDFS or stand-alone. As an in-memory engine, Spark is much faster than the traditional MapReduce approach. Spark can process data from HDFS, Hive, Flume and other data sources extremely fast, which lets Hadoop serve as an effective streaming or near-real-time analytics platform. Spark can replace MapReduce as the right tool for many jobs, but it is just one part of the Hadoop ecosystem, which includes tools such as MapReduce, Spark, Storm, Hive, HBase, Flume, etc.
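For a feel of why the in-memory part matters, here's a small PySpark sketch; the HDFS path and app name are made up for illustration. The first action pulls the data off HDFS and caches it in memory, so the second query avoids the disk round-trip that a chained MapReduce job would pay again:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Read a text file straight out of HDFS (hypothetical path) and
# mark it to be cached in memory on first use.
lines = spark.read.text("hdfs:///data/raw/logs").cache()

# This first action materializes the cache ...
total = lines.count()

# ... so this second pass runs against memory, not disk.
errors = lines.filter(lines.value.contains("ERROR")).count()

print("lines: %d, errors: %d" % (total, errors))
spark.stop()
```

That same pattern is what makes Spark attractive for iterative work like machine learning, where the same dataset gets scanned many times over.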
Are dedicated programmers/developers needed to deploy/manage a Hadoop system? Do I need to hire a Data Scientist?
You will certainly need some folks with Hadoop skills, database/data management skills, system admin skills, programming skills and analytics skills. Currently, the market isn't oversaturated with Hadoop admins who possess all of these skills along with several deployments and a few years of management experience under their belts (I think we'll see more over the next few years). Experienced DBAs can usually be effective Hadoop admins, as can good system admins (i.e. folks who know more than just navigating the GUI).
As for the data scientist, they're great if you can find one (and afford him/her). You're talking about someone who gets statistics, algorithms, coding, data and database technologies, and the underlying business logic. In many cases, companies are leveraging the skills of multiple individuals already on staff rather than hiring a dedicated data scientist.