https://www.quora.com/topic/MapReduce
Answer written ·
MapReduce
· 2013
What is an intuitive explanation of MapReduce?
Pararth Shah, Stanford MS CS '15, IIT-Bombay BTech CS '13, Google intern
'12, ML enthusiast
Updated Feb 2, 2014 · Upvoted by Amogh Akshintala, PhD Computer Science, University of North
Carolina at Chapel Hill (2020)
Long long ago, in a galaxy far far away, there lived an energetic young space
commander named Sheriff Sequential. Although he ruled the tiny planet
of Pentium Single-Core 1.3ghz, he harbored ambiti... (more)
Answer written ·
MapReduce
· 2014
How does SAP's Big Data platform HANA differ from
Hadoop and MapReduce platforms?
Brian Feeny, SAP HANA Certified
Written Jan 21, 2014 · Upvoted by Manish Kaduskar, Architect at SAP and Abhishek Kumar Singh, SAP
HANA Trainer
They are nothing alike. It is like comparing Apples and Oranges. SAP HANA is
an in-memory database. Fundamentally, it operates just like many other OLAP
databases, with many performance enhanceme... (more)
Answer written ·
MapReduce
· 2011
How does YARN compare to Mesos?
Jay Kreps, I work with Hadoop at LinkedIn
Updated Sep 8, 2011 · Upvoted by Alex Feinberg, Distributed systems engineer
Both systems have the same goal: allowing you to share a large cluster of
machines between different frameworks.
For those who don't know, NextGen MapReduce is a project to factor the
existing Ma... (more)
Answer written ·
MapReduce
· 2012
What is the best directory structure to store different types
of logs in HDFS over the time?
Eric Sammer, ex-Cloudera ('10-'14), wrote Hadoop Operations for O'Reilly
Written Apr 25, 2012
Best, as you know, is subjective. The way I approach this is to consider what
directory structures are used for:
Organization
Access control, visibility, and auditing
Resource control and allocation ...
(more)
Answer written ·
MapReduce
· 2010
What is Hadoop not good for?
Sameer Al-Sakran
Written Dec 1, 2010
Assuming you're talking about the MapReduce execution system and not
HDFS/HBase/etc --
Easy things out of the way first:
Real time anything
You can use hadoop to do precalculations, but will nee... (more)
Answer written ·
MapReduce
· 2015
What is Map-Reduce?
Eric Wu, worked at Google
Written Jan 21, 2015
Let's say we have the text for the State of the Union address and we want to
count the frequency of each word. You could easily do this by storing each
word and its frequency in a dictionary and looping through all of the words in
the speech. However, imagine if we have the text for all of Wikipedia (say, a
billion words) and we wanted to do the same thing. Our poor computer would
be stuck looping for ages!
MapReduce is a programming model that allows tasks (like counting
frequencies) to be performed simultaneously (in parallel) on many (distributed)
computers. Now, instead of one computer having to loop through a billion
words, we can have 1,000 computers simultaneously looping through only
a million words each -- that's a 1000x improvement!
There are two parts to MapReduce: map(), and reduce().
Map() takes in a word and "emits" a key-value pair. In this case, key = a
word, and value = 1 (we just have one instance of the word). For instance, the
phrase "bright yellow socks" would emit ("bright", 1), ("yellow", 1), and ("socks",
1). Map() is called for every single word -- all one billion of them.
(You might be asking why we did this very redundant step. Bear with me for a
second!)
Now, we have one billion key-value pairs. What do we do with these?
This is where Reduce() comes in. All of the key-value pairs with the same key
are combined and fed into Reduce(). Let's take "hello" for instance, and say
that there are 500,000 occurrences of the word in Wikipedia. Then, we have
500,000 key-value pairs that are passed into a single Reduce() call. Now, all we have
to do is loop through 500,000 ("hello", 1) pairs and sum up all of the values.
This is much easier than having to loop through a billion. Reduce() then "emits"
another key-value pair; this time, it's ("hello", 500000), or the word, and the
total sum of the values associated with that word. Repeat the Reduce() call for
every other word, and we've got the frequency for every word!
By distributing this massive job to many computers, if any one of these
computers fails for any reason, we can just restart the specific segment it was
working on. Imagine if this happened on a single computer!
Note that this is a very simple example. MapReduce has been used on many
problems that can be fit into this Map()/Reduce() model, such as Google's
search index, machine learning, and statistical analysis.
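To make the flow above concrete, here is a minimal single-machine sketch in Scala; the object name, the function names, and the use of plain collections in place of a real distributed runtime are illustrative assumptions, not part of the original answer:

// map(): each word "emits" a key-value pair (word, 1).
// reduce(): all pairs sharing a key are combined and their values summed.
object WordCount {
  def mapStep(text: String): Seq[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  def reduceStep(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // expected counts: bright -> 2, yellow -> 1, socks -> 2 (map order may vary)
    println(reduceStep(mapStep("bright yellow socks bright socks")))
  }
}

In the real thing, mapStep runs on many machines at once and the grouping (the "shuffle") happens over the network before each reduceStep call.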
21.4k Views · View Upvotes
Answer written ·
MapReduce
· 2014
What is the difference between Apache Spark and Apache
Hadoop (Map-Reduce) ?
Suman Bharadwaj, was big data developer at Intel
Updated Mar 28, 2016
I'll mention the differences on the shuffle side, at a very high level and as I
understand them, between Apache Spark and Apache Hadoop MapReduce.
Since a few folks have already mentioned di...
(more)
Answer written ·
MapReduce
· 2015
What makes Spark faster than MapReduce?
Sandy Ryza, Apache Hadoop PMC, software engineer at Cloudera
Written Mar 9, 2015
I think there are three primary reasons.
The main two reasons stem from the fact that, usually, one does not run a
single MapReduce job, but rather a set of jobs in sequence.
1. One of the main l... (more)
Answer written ·
MapReduce
· 2011
How does YARN compare to Mesos?
Jie Li, Data Infra at Pinterest
Written Dec 14, 2011
Thanks for Jay Kreps and Arun C Murthy's good summary!
After playing around with both Mesos and YARN, I want to elaborate on one of Arun C
Murthy's points, "Maybe that is the way to look at Mesos v/s YA... (more)
Answer written ·
MapReduce
· 2014
How does YARN compare to Mesos?
Matthew G Trifiro, SVP Marketing & Business Development, Mesosphere
Written Feb 8, 2014
There is one gigantic difference between YARN and Mesos—so big, in fact, that I
am surprised that I am the first to mention it.
YARN is Hadoop-specific and is, therefore, specifically targeted at
sche... (more)
Answer written ·
MapReduce
· 2014
What are the advantages of DAG (directed acyclic graph)
execution of big data algorithms over MapReduce?
Tathagata Das, PMC of Apache Spark, builds all things streaming in Spark.
Written Nov 24, 2014 · Upvoted by Don van der Drift, Quora Data Scientist
[Disclaimer: I am an Apache Spark committer]
TL;DR - Conceptually, the DAG model is a strict generalization of the
MapReduce model. DAG-based systems like Spark and Tez that are aware of the
whole DAG of operations can do better global optimizations than systems like
Hadoop MapReduce, which are unaware of the DAG to be executed.
Long version:
Conceptually speaking, the MapReduce model simply states that distributed
computation on a large dataset can be boiled down to two kinds of
computation steps - a map step and a reduce step. One pair of map and reduce
does one level of aggregation over the data. Complex computations typically
require multiple such steps. When you have multiple such steps, it essentially
forms a DAG of operations. So a DAG execution model is essentially a
generalization of the MapReduce model.
While this is the theory, different systems implement this theory in different
ways, and that is where the "advantages" and "disadvantages" come from.
Computations expressed in Hadoop MapReduce boil down to multiple
iterations of (i) read data from HDFS, (ii) apply map and reduce, (iii) write back
to HDFS. Each map-reduce round is completely independent of the others, and
Hadoop has no global knowledge of what MR steps are going to
come after each MR. For many iterative algorithms this is inefficient, as the
data between each map-reduce pair gets written to and read from the filesystem.
Newer systems like Spark and Tez improve performance over Hadoop by
considering the whole DAG of map-reduce steps and optimizing it globally (e.g.,
pipelining consecutive map steps into one, not writing intermediate data to
HDFS). This avoids writing data back and forth after every reduce.
Storm, being a streaming system, is slightly different from the batch processing
systems referred to earlier. It also sets up a DAG of nodes and lets the records stream
between the nodes. It's best to compare Storm with Spark Streaming
(a streaming system built over Spark) rather than Hadoop MapReduce. Both accept a
DAG of operations representing the streaming computation, but then process
the DAG in slightly different ways. Storm sets up a DAG of nodes and allocates
each operation in the DAG to different nodes. Spark Streaming does not
pre-allocate; rather, it uses the underlying Spark mechanisms to dynamically
allocate tasks to available resources. This gives different kinds of performance
characteristics.
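As a small illustration of that global optimization, here is a hedged Spark sketch in which consecutive map-like steps pipeline into a single stage, with a shuffle only at the reduce; the paths and the assumed SparkContext sc are placeholders, not from the answer:

// Consecutive narrow transformations (flatMap, map, map) are pipelined into
// one stage; nothing is written to HDFS between them, unlike chained MR jobs.
val counts = sc.textFile("hdfs:///input")     // assumes an existing SparkContext sc
  .flatMap(_.split(" "))
  .map(_.toLowerCase)                         // fused with flatMap into the same stage
  .map(word => (word, 1))
  .reduceByKey(_ + _)                         // the only shuffle in the whole DAG
counts.saveAsTextFile("hdfs:///output")       // a single write at the end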
21.6k Views · View Upvotes · Answer requested by Umanga Bista
Answer written ·
MapReduce
· Mar 1
What are the main ideas behind map reduce?
Michael Ernest, Production-level work in Hadoop operations and application
architectures
Written Mar 1
In short, that a wide domain of computing tasks can be expressed in two
phases of operation: scanning and transforming data (mapping), followed by
consolidating that data to some value or summary (reducing).
...
(more)
Answer written ·
MapReduce
· 2014
What exactly is Apache Spark and how does it work?
Reynold Xin, Chief Architect @ Databricks
Updated Sep 3, 2014 · Upvoted by Thia Kai Xin, Data scientist at Lazada, Co-Founder of DataScience
SG. and Sean Owen, Director, Data Science @ Cloudera
Originally Answered: How does Apache Spark work?
In many ways, Spark is a better implementation of the MapReduce paradigm
(not a huge surprise since Matei Zaharia who created Spark also worked on
Hadoop MapReduce in its early days).
From a progr... (more)
Answer written ·
MapReduce
· 2010
What's the best way to come up to speed on MapReduce,
Hadoop, and Hive?
Amund Tveit, Co-Founder of Atbrox
Updated Apr 27, 2010
The best way is to start as hands-on as possible, e.g. download Hadoop,
write some mappers and reducers, and run them on a few datasets. The original
MapReduce paper [1] describes a few examples that are nice to start with:
a) word count
b) distributed grep
c) reverse web link graph
d) term vector per host
e) inverted index
f) distributed sort
Then continue at larger scale with Elastic MapReduce [2-4], or set up and use
Hadoop on your own cluster.
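To get a feel for example (b) before touching a cluster, here is a minimal single-machine sketch of distributed grep in map/reduce terms; the object and function names are my own, and plain Scala collections stand in for real mappers and reducers:

// "Distributed grep" from the MapReduce paper: map emits a line if it
// matches the pattern; reduce is the identity (here we just print matches).
object DistributedGrep {
  def mapStep(pattern: String)(line: String): Seq[(String, String)] =
    if (line.contains(pattern)) Seq((pattern, line)) else Seq.empty

  def main(args: Array[String]): Unit = {
    val lines = Seq("the map step", "the reduce step", "shuffle and sort")
    lines.flatMap(mapStep("step")).foreach { case (_, line) => println(line) }
  }
}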
References
[1] http://labs.google.com/papers/ma...
[2] http://aws.amazon.com/elasticmap...
[3] http://developer.amazonwebservic...
[4] http://aws.amazon.com/about-aws/...
6k Views · View Upvotes
Answer written ·
MapReduce
· 2011
What is MapReduce?
Mohsin Shafeeque Hijazee, I've been programming since 2001 and my
languages include x86 Assembly, C, C++, C#, Java, Python, Ruby, Gro...
Written Oct 26, 2011
MapReduce, when spelled jointly, is a distributed computing paradigm inspired
by ideas from functional languages. To process a large number of inputs,
there are two functions not surprisingly name... (more)
Answer written ·
MapReduce
· 2011
What are some good class projects for machine learning
using MapReduce?
Alex Kamil, studied at Columbia University
Updated Oct 30, 2014 · Upvoted by Amund Tveit, PhD in machine learning. and Manohar Kuse, PhD
Candidate researching computer vision and machine learning in robotics.
Try implementing some ML algorithms not yet covered (or poorly covered)
in Apache Mahout: What are some important algorithms not yet covered in
Mahout? , and What are the top 10 data mining or mac... (more)
Answer written ·
MapReduce
· Mar 27
What are examples in which MapReduce does not work?
What are examples in which it works well? What are the
security issues involved with the cloud?
Nikita Gureev, M.S. from Royal Institute of Technology (2018)
Written Mar 27
While it is three separate questions, I’ll try to cover the first two and provide
some musing for the third one.
MapReduce[1] is generally used in Big Data tasks in order to process enormous
amount ...
(more)
Answer written ·
MapReduce
· 2014
Why did Google stop using MapReduce and start
encouraging Cloud Dataflow?
Kenneth Tran, Machine Learning Engineer at Microsoft
Updated Jul 1, 2014 · Upvoted by David Marek, worked at Google and Jeff Nelson, Invented
Chromebook, Former Googler
I'm not sure if Google has stopped using MR completely. My guess is that no
one is writing new MapReduce jobs anymore, but Google would keep running
legacy MR jobs until they are all replaced or be... (more)
Answer written ·
MapReduce
· 2011
How does YARN compare to Mesos?
Arun C Murthy, Founder & Architect, Hortonworks Former Architect & Lead,
Hadoop Map-Reduce Team, Yahoo; Frmr. VP, Apache H...
Updated Mar 5, 2012
Good summary Jay, thanks.
I'd like add some clarifications:
# It's trivial to simulate the Mesos offer/reject model in YARN. You ask for
*any*, i.e. non-locality specific, resources and then rejec... (more)
Answer written ·
MapReduce
· 2011
What are some good MapReduce implementations for
graphs?
Ankur Dave, Spark committer at UCB AMPLab
Updated Mar 30, 2014 · Upvoted by Neha Narkhede, Co-founder and CTO at Confluent and Apurv
Verma, ML@GATech
Before looking at individual implementations, it's helpful to narrow down the
search by identifying some essential features of distributed graph processing
frameworks. Here are three important ones:
...
(more)
Answer written ·
MapReduce
· 2015
Is Hadoop dead and is it time to move to Spark?
Sean Owen, Director, Data Science @ Cloudera
Updated Jun 24, 2015
ATA I think this is like asking: is Linux dead and should we move to Docker? or
something like that. Linux is of course shorthand for a large number of related
technologies at this point, even if i... (more)
Answer written ·
MapReduce
· 2013
What is an intuitive explanation of MapReduce?
Gautam Singaraju, inquisitive.
Written Nov 15, 2013
Let's say there is an assorted box of candies and a teacher wants to count how
many different kinds of chocolate exist. One could probably do it by counting a
single jar: 5 red, 10 green, etc.
Now l... (more)
Answer written ·
MapReduce
· 2014
What is the difference between MapReduce, artificial
intelligence, and machine learning? Or rather, how are they
related?
Smit Mehta, Googler
Written Sep 26, 2014
MapReduce - An engineering solution to handle operations (like sorting,
searching, etc) on huge data sets.
Imagine, if you want to sort 1000 strings. It's easy and doable on a single
computer. Th... (more)
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Adam Albertson, works at Amazon Web Services
Written Mar 18
In short, All of the Above.
Big Data is a rather large field and to be successful in it, you need to be pretty
well rounded. This means not allowing yourself to be so narrowly focused that
you’re a ...
(more)
Answer written ·
MapReduce
· Tue
Can I use Hadoop as a PC?
Amr Salah, Hadoop Developer
Written Tue
The nature of the problems that Hadoop solves is processing massive amounts of
data. If you assign tasks to Hadoop, they must be splittable across machines.
So a giant CSV file can be split and...
(more)
Answer written ·
MapReduce
· 2012
What are the advantages of Hadoop over distributed
RDBMS?
Charles Zedlewski, I work at Cloudera, purveyors of Apache Hadoop and
related things.
Written Feb 25, 2012
I think even at fairly large scales of data (let's say a few hundred TBs), it's rare
that people are using Hadoop as a substitute for a parallel database. Hadoop
has some different design goals and consequently a different set of strengths &
weaknesses relative to parallel RDBMS.
Hadoop is a lot more flexible. Unlike parallel RDBMS you don't have to pre-
process the data before you can use it. You don't have to design a star schema
or update some data dictionary or manipulate it with a separate ETL process.
Moreover you can change the schema after the fact with very little cost or
effort. There are some workloads where flexibility is very valuable and those
workloads are moving to Hadoop pretty quickly.
Hadoop is more scalable. The largest publicly discussed Hadoop cluster
(Facebook's) was at 30 petabytes mid last year, and it has grown since then. I
don't think there are parallel RDBMSes that have come close to those kinds of
numbers.
Hadoop is more economical. Factoring in all the costs you can get down to
~$500 / TB pretty easily with Hadoop and even lower if cost is what really
matters to you. There's no parallel RDBMS that comes close to those numbers
to my knowledge. Since log, text and image data is often much bulkier than
transactional data it's often been kept out of parallel RDBMS since the
economics just wouldn't make any sense.
Parallel RDBMS has a lot to say for itself as well. It's much more optimized /
optimizable (e.g. various indexing, caching, join strategies, etc) which makes it
more performant and efficient for many workloads. It supports a much richer
set of industry standard SQL functions and a broader range of tools that
support those SQL functions. It also provides lower latency for interactive
queries.
I don't think the dividing line is so much structured / unstructured data but
rather about asking repeated questions of known data (where a fixed schema
and optimization pays off) versus asking novel questions of not well known data
(where a fixed schema and all those optimization techniques are a hindrance).
But in general this is a "right tool for the right job" kind of answer.
23.9k Views · View Upvotes
Answer written ·
MapReduce
· Feb 2
Do most machine learning algorithms run in batch, or do
they run every time they get a new bit of data?
Håkon Hapnes Strand, Machine Learning Engineer
Written Feb 2
We need to make two distinctions here.
First of all, we need to distinguish on the type of learning. There are two ways
of training machine learning models:
Offline learning: The model is trained once...
(more)
Answer written ·
MapReduce
· Mar 16
Does HBase use Hadoop to store data or is a separate
database independent of Hadoop?
Mark Vogt
Written Mar 16
Part of your confusion arises from how you’re using the term “Hadoop” itself…
Hadoop is NOT one thing; even now in early 2017 Hadoop is considered to be
at least 2 things:
1. The Map/Reduce Processing Engine on top of HDFS; and
2. The Hadoop Distributed File System (HDFS) itself.
It’s clear from your question that you’re thinking “Hadoop = HDFS”, but it’s
important to be more precise than this, because you can see from all the
comments that misinformation abounds in Big Data…
RE-PHRASING your question as “Does HBase use HDFS to store data, or is it a
separate database independent of HDFS?”, you can see that the best answer
has been provided:
HBase is a type of database system (storage + mechanisms for
adding/changing/deleting/searching) consisting of one or more often-
massive (think “millions of columns, billions of rows”) column-family
based tables;
These tables are split up along horizontal lines to form massive strips
called table “regions”;
These regions (“table-ets”) are then stored as “HFiles” in HDFS, as if
they were any other type of (unstructured) file, allowing for distributed
storage and processing of those files by hundreds or even thousands of
servers, each processing (again, for a database, “processing” consists of
adds/changes/deletes/searches) its own little portion of this massive
table;
LONG ANSWER TO THE SHORT QUESTION:
HBase USES HDFS as its storage mechanism, by partitioning massive table
structures into smaller “HFiles” which are then treated like any other file in
HDFS.
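As a hedged aside (not part of the original answer): because region files are plain HDFS files, you can see them with the ordinary HDFS API. The /hbase/data path below assumes a default hbase.rootdir:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List HBase's table data directory like any other HDFS directory;
// the entries ultimately contain the regions' HFiles.
object ListHBaseFiles {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())   // picks up core-site.xml settings
    fs.listStatus(new Path("/hbase/data")).foreach(s => println(s.getPath))
  }
}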
Hope this helps anyone else who’s been struggling with this concept.
Mark in North Aurora
180 Views
Answer written ·
MapReduce
· 2011
Would Facebook have been able to scale effectively if Google
had not publicly described MapReduce in 2004 and BigTable
in 2006?
Edmond Lau, former Engineer at Google (2006-2008)
Written Oct 5, 2011 · Upvoted by Thach Nguyen, interned at Google and Brian Schmitz, worked at
Google
Yes, but Facebook would have had to significantly invest in building out its
distributed data computation and warehousing layers to be able to scale
effectively, and it would've delayed its climb ... (more)
Answer written ·
MapReduce
· Mon
Can I find any sample Hadoop clusters online so that I can
practice Hadoop development?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Mon
I have not seen any cluster available for free. You have to pay a bit to work with
one. All major clouds have their offerings with per-hour charges. In Microsoft
Azure, if you set up a cluster for $15 for 24 hours....
(more)
Answer written ·
MapReduce
· Feb 3
What are the best ways to learn about Hadoop source?
Oleksii Yermolenko, Hadoop software developer at Prometric (2015-present)
Written Feb 3
Feel free to join my blog about Big Data - https://oyermolenko.blog. I’ve been
working with Hadoop for a couple of years and in this blog want to share my
experience from the early start. My blog i...
(more)
Answer written ·
MapReduce
· 2011
Would Facebook have been able to scale effectively if Google
had not publicly described MapReduce in 2004 and BigTable
in 2006?
Jay Kreps, I work with Hadoop at LinkedIn
Written Feb 28, 2011 · Upvoted by Jack Lindamood and Olaoluwa 'Ola' Okelola, Engineer at Facebook
Yes.
My understanding is that Facebook's live site is served off memcached and
MySQL which obviously have little to do with the Google papers.
Their big, difficult apps like photo sharing, ad tar... (more)
Answer written ·
Cascading
· Mar 10
How do I display a red circle in Cascading Style Sheets?
Rob Parham, LAMP stack developer
Written Mar 10
You can’t display anything with CSS alone, but if you put this class on a block
level element it will display as a red circle:
.circle{
border-radius:50%;
border:1px solid red;
background:red;...
(more)
Answer written ·
MapReduce
· 2010
What are the advantages/disadvantages running Cloudera's
distribution for Hadoop on EC2 instances rather than using
Amazon's Elastic Map Reduce Service?
Jeff Hammerbacher, Founder and Chief Scientist at Cloudera (2008-present)
Written Nov 10, 2010 · Upvoted by Eric Sammer, works at Cloudera and Henry Robinson, works at
Cloudera
Some nice aspects of EMR:
Dynamic MapReduce cluster sizing.
Ease of use for simple jobs via their proprietary web console.
Great documentation.
Integrates nicely with other Amazon Web Services.
Some nice aspects of CDH:
CDH is open source; you have access to the source code and can
inspect it for debugging purposes and make modifications as required.
CDH can be run on a number of public or private clouds using an open
source framework, Whirr [1], so you're not tied to a single cloud
provider.
With CDH, you can move your cluster to dedicated hardware with little
disruption when the economics make sense. Most non-trivial
applications will benefit from this move.
CDH packages a number of open source projects that are not included
with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue.
You have access to the complete platform composed of data collection,
storage, and processing tools.
CDH packages a number of critical bug fixes and features and the most
recent stable releases, so you're usually using a more stable and
feature-rich product. For example, we added support for Hadoop 0.20
on 9/10/09 [2], while EMR did not have support for 0.20 until 6/2/10
[3].
You can purchase support and management tools for CDH via Cloudera
Enterprise [4].
CDH uses the open source Oozie [5] framework for workflow
management. EMR implemented a proprietary "job flow" system before
major Hadoop users standardized on Oozie for workload management.
CDH uses the open source Hue [6] framework for its user interface. If
you require new features from your web interface, you can easily
implement them using the Hue SDK.
CDH includes a number of integrations with other software
components of the data management stack, including Talend,
Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others.
If you have an existing analytics infrastructure, it's easy to integrate
CDH.
CDH has been designed and deployed in common Linux environments
and you can use standard tools to debug your programs. I have not
experienced this firsthand, but a number of our customers have
reported that it's difficult to debug failing programs on EMR because
of the custom version of Hadoop being run and the opacity of the
environment in which it is run. Again, this is hearsay, so if others
disagree, I'll remove this point.
[1] http://incubator.apache.org/whirr/
[2] http://www.cloudera.com/blog/200...
[3] http://aws.amazon.com/releasenot...
[4] http://www.cloudera.com/products...
[5] http://yahoo.github.com/oozie/
[6] https://github.com/cloudera/hue
14.1k Views · View Upvotes · Not for Reproduction
Answer written ·
MapReduce
· Feb 18
Can MapReduce read input from memory instead of file?
Fred Williams II, IBM systems programmer since 1981
Written Feb 18
Let’s see here…
Your computer “always” reads things from memory. The processor in your
machine has things called “registers” which contain memory addresses (or
instructions what to do with those mem...
(more)
Answer written ·
MapReduce
· Jan 14
What is the relationship between MapReduce and NoSQL?
Nicolae Marasoiu, 3+ years big data: Hadoop, HBase, Spark, Storm, Druid,
Zookeeper, Pig, Consul.
Written Jan 14
They both appeared as solutions for handling more data and more users.
MapReduce is efficient for batch processing: big throughput (can process
millions of input records per second, depending on the clu...
(more)
Answer written ·
MapReduce
· Feb 3
What are some good examples of problems I can solve with
MapReduce for time-series analytics?
Bob Marshall, B. S. Chemistry & Computer Science, University of Notre Dame
(1980)
Written Feb 3
MapReduce works on key:value transformations. Tom White’s book Hadoop:
The Definitive Guide provides a MapReduce example of processing the National
Weather Service’s historical climate data to determine the highest and lowest
temperature for each year across all the weather stations in the country. Each
mapper processes the data for its input split, writing the highest and lowest
temperatures (values) for each year of data (key). Following completion of all
mappers, the data is shuffled and sorted such that all the key:value pairs for
each unique key will be combined on a particular node. Then, reducers run to
determine the highest and lowest values for each year. This is a great example
of a time series problem, with rather large granularity of time values.
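A minimal sketch of that per-year high/low pattern, with invented sample records and plain Scala collections standing in for real mappers and reducers:

// Each record is (year, temperature). groupBy plays the role of the
// shuffle/sort that brings all pairs for one key (year) to one reducer.
object YearlyExtremes {
  def main(args: Array[String]): Unit = {
    val readings = Seq((1950, 22), (1950, -3), (1951, 30), (1951, 5))
    readings
      .groupBy(_._1)                      // shuffle and sort by key (year)
      .map { case (year, recs) =>         // reduce: highest and lowest per year
        val temps = recs.map(_._2)
        (year, temps.max, temps.min)
      }
      .foreach { case (y, hi, lo) => println(s"$y: high=$hi, low=$lo") }
  }
}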
303 Views
Answer written ·
MapReduce
· 2014
What are the differences between batch processing and
stream processing systems?
Sean Owen, Director, Data Science @ Cloudera
Written Oct 27, 2014
ATA I think the example is wrong in a few ways. Although people use the word
in different ways, Hadoop refers to an ecosystem of projects, most of which are
not processing systems at all. It contains MapReduce, which is a very batch-
oriented data processing paradigm.
Spark is also part of the Hadoop ecosystem, I'd say, although it can be used
separately from things we would call Hadoop. Spark is a batch processing
system at heart too. Spark Streaming is a stream processing system.
To me a stream processing system:
Computes a function of one data element, or a smallish window of
recent data
Computes something relatively simple
Needs to complete each computation in near-real-time -- probably
seconds at most
Computations are generally independent
Asynchronous - source of data doesn't interact with the stream
processing directly, like by waiting for an answer
A batch processing system to me is just the general case, rather than a special
type of processing, but I suppose you could say that a batch processing system:
Has access to all data
Might compute something big and complex
Is generally more concerned with throughput than latency of individual
components of the computation
Has latency measured in minutes or more
I sometimes hear streaming used as a sort of synonym for real-time. Real-time
stuff usually takes the form of needing to respond to an event in milliseconds,
as in a synchronous API. This isn't streaming to me.
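As a rough illustration of the contrast drawn above (a hedged sketch, not from the answer): the host, port, path, and the assumed StreamingContext ssc and SparkContext sc are placeholders:

import org.apache.spark.streaming.Seconds

// Stream side: a simple function of a smallish window of recent data,
// with latency measured in seconds.
val recent = ssc.socketTextStream("localhost", 9999)
recent.window(Seconds(30)).count().print()

// Batch side: has access to all the data; throughput matters more than latency.
val total = sc.textFile("hdfs:///all-data").count()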
28.8k Views · View Upvotes · Answer requested by Prashant Raaghav
Answer written ·
MapReduce
· Mar 20
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Don S, Lived in India, UK, USA
Written Mar 20
First and foremost you must learn HDFS which is the distributed file system
used in Hadoop and the related linux commands.
Then you must learn the Hadoop architecture; though you will not use it on ...
(more)
Answer written ·
MapReduce
· 2013
Will Spark overtake Hadoop? Will Hadoop be replaced by
Spark?
Matei Zaharia, CTO @ Databricks
Written Nov 18, 2013
Originally Answered: Will Apache Spark ever overtake Apache Hadoop?
It depends a bit on what you mean by "Hadoop". Some people take Hadoop to
mean a whole ecosystem (HDFS, Hive, MapReduce, etc), in which case Spark is
designed to fit well within the ecosystem (reading from any input source that
MapReduce supports through the InputFormat interface, being compatible
with Hive and YARN, etc). Others refer to Hadoop MapReduce in particular, in
which case I think it's very likely that non-MapReduce engines will take over in
a lot of domains, and in many cases they already have.
From this latter point of view, perhaps the most interesting thing about Spark
is that it shows that a lot of workloads can be captured efficiently by the same,
simple generalization of the MapReduce model. Spark can achieve (and
sometimes beat) state-of-the-art performance in not only simple ETL, but also
machine learning, graph processing, streaming, and relational queries.
Importantly, this means that applications can combine these workloads more
efficiently. For example, once you ETL data in, you can easily compute a report
or run a training algorithm on the same in-memory data. Furthermore, you get
the same programming interface to combine these jobs in, and only one system
to manage and install.
How much will this matter? It's hard to predict, but one possibility is that after
experimenting with specialized computing models, distributed programmers
will want to have a general model, in the same way that programmers for a
single machine settled on general-purpose languages. Having a general
platform is even more important in big data, because data is so expensive to
move across systems! In this case, Spark shows that many of the tricks used in
specialized systems today (e.g. column-oriented processing, graph partitioning
tricks) can be implemented on a general platform.
In any case, it is a first-order goal of the system to stay compatible with the
wider Hadoop ecosystem, and just give people better ways to compute on the
same data. The Hadoop ecosystem is also moving quickly towards supporting
alternative programming models, through efforts like YARN.
24.8k Views · View Upvotes
Answer written ·
MapReduce
· Jan 18
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Evan Mouzakitis, Research Engineer at Datadog, wrote about Hadoop
Written Jan 18
One of the big draws for Hadoop is that it moves the computation to the data,
rather than the other way around. This reduces overall computation time as
data does not need to travel across a network in order to be operated on. So, to
more pointedly answer your question: the scheduler (usually YARN) schedules
the map to occur on a node that is as close to the data as possible. You can
read more about Hadoop architecture here.
124 Views · View Upvotes
Answer written ·
MapReduce
· 9h
Can I use Hadoop as a PC?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written 9h ago
Hadoop has a specific purpose, and that is to process massive amounts of data
in parallel and nothing else. If you have 10 low-valued transport motor cars, can
you add them together to function as a single lu...
(more)
Answer written ·
MapReduce
· Jan 16
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Joe Nap, Sr. Data Engineer at Blue Apron
Written Jan 16
Your latter intuition is correct. Files are split in HDFS across physical machines
in block sizes of (typically) 64MB or 128MB. This is done 1) for high availability
of the data, in the event a machine or disk dies. And 2) So that processing
(mapping) of each block is split across physical processors to increase the
throughput of data processed.
Each block is typically replicated 3 times. Beyond simply keeping extra copies
of the data, the replicas are placed so that 2 copies are stored on one
networked rack, while the 3rd is stored on a remote rack. That way, if you have
an individual machine failure, you can continue processing on the same rack,
while if you experience a network failure, the scheduler can reschedule that
chunk of work on another, accessible rack. It's often said that it's cheaper to
move the computation than to move the data.
HDFS Architecture Guide
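A small sketch of how those two settings surface through Hadoop's configuration API; the values shown are the common defaults, and setting them programmatically like this is just an illustration:

import org.apache.hadoop.conf.Configuration

// Replication factor and block size are ordinary configuration properties
// (cluster-wide defaults that can also be overridden per file).
val conf = new Configuration()
conf.set("dfs.replication", "3")         // two copies on one rack, a third on a remote rack
conf.set("dfs.blocksize", "134217728")   // 128 MB blocks (older versions defaulted to 64 MB)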
243 Views · View Upvotes
Answer written ·
MapReduce
· Feb 1
Do most machine learning algorithms run in batch, or do
they run every time they get a new bit of data?
Zeeshan Zia, PhD in Computer Vision and Machine Learning
Written Feb 1
Most papers in machine learning conferences talk about or use algorithms
running in batch mode. Almost all ML algorithms taught in pretty much any
basic ML course operate in batch-mode.
In the real-...
(more)
Answer written ·
MapReduce
· Feb 1
Why doesn't MapReduce use memory?
Vidhi Goel, worked at Samsung Electronics
Written Feb 1
This is indeed a very good question. MapReduce traditionally uses local disks to
store chunks of data local to a worker.
Recently, there have been developments to make it in-memory. For reference, see
In-Memory MapReduce.
Quoted from the webpage: “Apache Ignite comes with in-memory
implementation of Hadoop MapReduce APIs which provides significant
acceleration over the native Hadoop MapReduce implementation.”
Another alternative to MapReduce is Spark, which uses in-memory RDDs and
provides scalable fault tolerance. It is also a more generalized framework.
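A brief sketch of the Spark point: once an RDD is cached, later jobs reuse it from memory instead of re-reading it from disk as classic MapReduce would. The path and the assumed SparkContext sc are placeholders:

val logs = sc.textFile("hdfs:///logs").cache()          // mark the RDD for in-memory reuse
val errors   = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
val warnings = logs.filter(_.contains("WARN")).count()  // second pass is served from memory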
166 Views
Answer written ·
MapReduce
· Jan 17
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Eduardo Avaria Suazo, Big Data Architect at Globant (2016-present)
Written Jan 17
Data is already split, distributed, replicated, and referenced in memory at
the namenode. When an MR application is submitted, the code is packaged and
distributed, making sure that all data is processed by a mapper, so the general
idea is that the code is moved to where the data is and not the other way
around.
The exception to this is where for some reason the data is not usable as-is (e.g.,
a compressed .gz file that spans multiple machines and needs to be
reconstructed), or you define another mapper behavior so that data needs to be
moved to complete the task.
115 Views
Answer written ·
Cascading
· Mar 10
How do I display a red circle in Cascading Style Sheets?
Christopher Bobby, Web Dev
Written Mar 10
.className or #IDname of your html element {
width: 150px;
height: 150px;
border-radius: 50%;
background-color: red;
/* make your width and height the same, and always set border-radius to 50% or
higher, though 50% will just do the work */
}
Answer written ·
MapReduce
· Jan 14
Will MapReduce work with shape based image retrieval?
Zeeshan Zia, PhD in Computer Vision and Machine Learning
Written Jan 14
Yes, it will. It's ideally suited for retrieval. That's precisely why Google focused
on these particular primitives as the basis for their compute infrastructure.
Shape-based retrieval will also essentially represent the query and database
images in some feature space, and exploit Map and Reduce operations for
finding distances from the query image ("map"; or do some kind of Locality
Sensitive Hashing, inverted indexing, etc.) and return the examples with
minimum distance ("reduce").
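A toy sketch of that retrieval pattern, with invented two-dimensional feature vectors; the names and the distance function are mine, not the answer's:

// "map": compute a distance from the query to every database image;
// "reduce": keep the image at minimum distance.
object NearestImage {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    val query = Array(0.1, 0.9)
    val db = Map("img1" -> Array(0.2, 0.8), "img2" -> Array(0.9, 0.1))
    val (bestId, bestDist) = db.map { case (id, f) => (id, dist(query, f)) }
      .minBy(_._2)
    println(s"$bestId at distance $bestDist")
  }
}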
455 Views · View Upvotes
Answer written ·
MapReduce
· 2014
Why did Google stop using MapReduce and start
encouraging Cloud Dataflow?
Jeff Nelson, Invented Chromebook, Former Googler
Written Jun 28, 2014 · Upvoted by Dhananjay Nakrani, works at Google and Samyak Datta, former
Software Engineering Intern at Google
I'd speculate this is a move toward finer grain concurrency mechanisms, which
MapReduce doesn't provide. The multi-core CPUs and GPUs provide plenty of
opportunity for concurrency, it's just a que... (more)
Answer written ·
MapReduce
· Feb 2
Do most machine learning algorithms run in batch, or do
they run every time they get a new bit of data?
Harut Martirosyan, Ms.C Applied Mathematics & Computer Science, State
Engineering University of Armenia (2008)
Written Feb 2
Mostly, algorithms run in batch mode, but there are cases where they are real-
time. For instance, the EdgeRank algorithm can run in real-time, whereas more
complex algorithms can't afford that due to time ...
(more)
Answer written ·
MapReduce
· Jan 16
I've seen a number of homegrown solutions, but are there
any MR/HPC platforms which exist for lower-latency use
cases?
Greg Keller, Co-Founder at R Systems NA, Inc. (2007-present)
Written Jan 16
You would need to define the latency you are fighting more specifically to be
sure. The original homegrown Hadoop used old and low-end hardware, under
the assumption that it was more about bulk than speed.
The filesystem made multiple copies of everything for resilience, so you
didn't need expensive shared storage and networking. But it turns out that if you
already have the networking (RDMA on IB or 40/10GbE) and fast parallel
filesystems, every node can see all the data, so better performance is
achieved for many tasks while reducing the need to make extra copies of blocks
and the performance that that steals.
Your latency issue might be solved either by setting the filesystem up to make
more than 3 copies, if your workload is bursty, or by a strong shared
filesystem, if your workload is relentless with no time for all the duplication.
95 Views · Answer requested by Abdelaziz Mohamed
Answer written ·
MapReduce
· Tue
Can I use Hadoop as a PC?
Simon Thompson, nom nom nom nom. Cookie.
Written Tue
Let’s say you connected together 10 lawn mowers, each mower is 35 hp. Does
this give you a sports car? Nope, it gives you a big lawnmower, and that’s what
Hadoop is - it’s a lawnmower for data. If you have 100 nodes in a Hadoop
cluster you can chew a lot more data than if you have 10, but you can’t render
a single stream of 60fps 4k video with it. (ok, if you had 100 nodes that could
each render then you could… but you don’t and you can’t!)
28 Views · View Upvotes
Answer written ·
MapReduce
· 2012
How does Hadoop compare to Google's internal MapReduce
implementation as of 2011?
Cosmin Negruseri, problem setter in Google Code Jam, trainer for the
Romanian IOI team
Updated Sep 22, 2012
Distributed sorting is a great benchmark for mapreduce implementations
because it exercises all the parts of the framework well.
The most time consuming part probably is sending the data over the ... (more)
Answer written ·
MapReduce
· 2010
What are some promising open-source alternatives to
Hadoop MapReduce for map/reduce?
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at
Techammer
Updated Jul 27, 2012
Hadoop consists of many subprojects: HDFS, MapReduce, Hive, Pig, HBase,
and Avro. I believe this question refers to the MapReduce implementation,
which can operate over a variety of storage systems. I agree that Hadoop could
use decent competition; having used the software daily for many years, I
disagree that it's becoming more complex with each release. As an example,
see the heroic efforts of Chris Douglas on https://issues.apache.org/jira/b... to
remove some of the tuning parameters for a MapReduce job. I suppose that's
for another question though.
Both MapReduce implementations mentioned above (CouchDB and MongoDB)
require that data live in the data stores specified and present far different
semantics than those described in the Google paper.
Here are some MapReduce-ish implementations, all of which are either coupled
to a single storage system, a single programming language, or implement only
a small subset of the features of a mature MapReduce implementation:
Disco: http://discoproject.org
Misco: http://www.cs.ucr.edu/~jdou/misco/
Phoenix: http://mapreduce.stanford.edu
Cloud MapReduce: http://code.google.com/p/cloudma...
bashreduce: http://blog.last.fm/2009/04/06/m...
Qizmt: http://code.google.com/p/qizmt
HTTPMR: http://code.google.com/p/httpmr
Galago's TupleFlow: http://www.galagosearch.org/guid...
Skynet: http://skynet.rubyforge.org
Sphere: http://sector.sourceforge.net
Riak: http://riak.basho.com/mapreduce....
Starfish: http://rufy.com/starfish/doc/
Octopy: http://code.google.com/p/octopy/
MPI-MR: http://www.sandia.gov/~sjplimp/m...
Filemap: http://mfisk.github.com/filemap/
Plasma MapReduce: http://projects.camlcity.org/pro...
Mapredus: http://rubygems.org/gems/mapredus
Mincemeat: http://remembersaurus.com/mincem...
MapReduceTitan: http://www.kitware.com/InfovisWi...
GPMR: http://www.idav.ucdavis.edu/rese...
Elastic Phoenix: https://github.com/adamwg/elasti...
Peregrine: http://peregrine_mapreduce.bitbu...
R3: http://heynemann.github.com/r3/
Also, Microsoft's DryadLINQ is available under an academic license (not quite
open source) at http://research.microsoft.com/en....
Disclosure: I founded Cloudera, a company that provides commercial support
for a distribution of Hadoop, among other things.
33.5k Views · View Upvotes · Not for Reproduction
Answer written ·
MapReduce
· 2013
Will Spark overtake Hadoop? Will Hadoop be replaced by
Spark?
Sean Owen, Director, Data Science @ Cloudera
Written Nov 18, 2013
Originally Answered: Will Apache Spark ever overtake Apache Hadoop?
Spark is in a sense already part of Hadoop. It already runs on YARN, which is
Hadoop 2's generalized execution environment (Launching Spark on YARN) --
not just Mesos. And for example we (Cloudera) support it via Databricks on top
of CDH (Databricks and Cloudera Partner to Support Spark). So there is no
either/or here to begin with.
The larger point, I suppose, is that Hadoop is not one thing to be replaced by
one other thing. It actually names a large ecosystem of components. Spark
itself has no counterpart for a lot of what's under the Hadoop umbrella (M/R,
Zookeeper, Sentry, Hue, HDFS, etc.) But it is also almost surely true that many
things in the Hadoop ecosystem will subsume others. M/R is not going away for
example, but, it is not the right tool for many jobs on Hadoop, and Spark is a
right-er tool for many of those things, so it or something like it is going to
replace plenty of M/R usages.
To your particular points:
Spark is not an ML library itself, but has a small library called MLlib associated
with it. Spark is a better execution environment for anything iterative, and lots
of ML is, so it's going to do better than M/R-based things like Mahout for
speed. For non-iterative computations there's not really an advantage, and I
would imagine that more mature M/R-based implementations, even, are
preferable to MLlib for now. For the niche of algorithms that are naturally
graph-oriented, I also don't think Spark has an advantage over specialist graph
frameworks like GraphLab. For general ML, maybe so.
Spark itself doesn't have ETL-oriented tooling like Pig or CDK (someone
correct me?). As an architecture, it's better for ETL-like jobs that involve
anything that looks like a join. For simpler ETL, M/R and its associated tooling
are likely still the natural choice -- this is what they were built for.
Shark demonstrates how much better a non-M/R architecture can be for join-
like operations that you execute through things like Hive -- Shark is a lot faster,
although Hive is closing the gap ably given that it's M/R-based. Shark still
won't generally catch up to specialist Hadoop SQL engines like Impala (see
even Impala 1.0 performance vs Shark: Big Data Benchmark). It's a great
option since it is generally compatible with the same formats, metastore, query
language, etc as all of these. Lots of good choice here.
This is all to say that Spark + its tools are very good, given how much is offered
from just this one project. The good news is that there is no either/or choice,
not anymore.
15.2k Views · View Upvotes · Answer requested by Mohitdeep Singh
Answer written ·
MapReduce
· Feb 16
How do I implement SCD type 2 using Pig, Hive, and
MapReduce on Hadoop?
Anonymous
Written Feb 16
Given that Hive now supports ACID properties with CRUD capabilities, this may
be done as in any other traditional database.
434 Views
Answer written ·
MapReduce
· Mar 30
What makes Spark faster than MapReduce?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Mar 30
Spark's in-memory capabilities make it faster. MapReduce writes
intermediate results back to disk and reads them back, so in a single
MapReduce job the data is written to and read from disk multiple times. Disk
reads are the worst enemy of data processing time. So Spark uses its in-memory
capabilities and keeps the data in memory during data processing. A very
simple example (I guess this will not be accurate, but it may clear up the point):
z = (a + b) * c
MapReduce will process this as follows:
Read a, b from disk
process a + b
write the result to disk
read the intermediate result and c from disk
process the data and write the result back to disk
Spark will process it as follows:
Read a and b from disk
process (a + b) and keep the result in memory
get c from disk and process the final result
write the result back to disk
Notice that because the intermediate results are kept in memory, Spark is
faster.
But of course you need more memory to take advantage of Spark, which is
expensive; otherwise it will be only as good as MapReduce, or worse.
45 Views
Answer written ·
MapReduce
· Apr 1
Do you have real-time experience with Hadoop?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 1
Got 4 years with Hadoop, Spark, Storm with others (MapReduce, Hive, Pig,
Oozie)
https://www.linkedin.com/in/shah...
75 Views
Answer written ·
MapReduce
· Apr 6
Is Hadoop dead and is it time to move to Spark?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 6
No, this is not true. MapReduce/Hadoop and Spark will stay together. Both
have their own use cases where they are useful. In short, use Spark when you
need to process data quickly or data is coming in streams, and use Hadoop
when you can process data in batches.
So not every application is suitable for Spark. Here is a short list of
applications suitable for Spark:
When data is accumulating for, say, an hour and you need to process it
quickly, within a few minutes, to update results every hour
Data is coming in streams and you need to process it as quickly as
possible (Spark Streaming)
Machine learning algorithms need many iterations, so they need faster in-
memory processing
And here is a list of applications that do not need Spark:
You need to process updated data, say, every day, and the data should be
ready the next day for reviews or analysis. So the data can be scheduled
to be processed in batch during night hours. For example, if an application
sends recommended products to subscribers each day, then Spark is
not needed, as Hadoop can process the data in batch
When you have a longer time, say a week or 2 weeks, to process the data.
So keep in mind that if you can afford longer data processing latency, then
move the application to Hadoop and MapReduce. It will be much more
economical. It would be a waste of money to buy an expensive machine that
processes the data in 10 minutes and then sits idle for the remaining 23 hours
and 50 minutes; why not buy a less expensive machine and process the data in,
say, 10 hours?
67 Views
Answer written ·
MapReduce
· 2013
Why does Apache Mahout seem so popular if scikit-learn has
a much more comprehensive list of algorithms?
Charles H Martin, Calculation Consulting; we predict things
Written Dec 31, 2013 · Upvoted by Vladimir Novakovski, started Quora machine learning team, 2012-
2014 and Yuval Feinstein, Algorithmic Software Engineer in NLP,IR and Machine Learning
Scikit-learn is limited to a single processor and runs in memory, and is
therefore limited to problems that fit into memory and don't require a lot of
time to run, or can be run in batch mode.
For, say, SVM classification, this is fine, as it hooks into the very famous libsvm
and liblinear packages. When the problem is too large for liblinear, the
preferred solution is Vowpal Wabbit, which runs off disk and can be used for
pseudo-online learning.
For clustering a large collection of documents, scikit-learn and VW are not so
great... although, in principle, I recall that VW has something.
The best Python solution for this is probably gensim, which has a wrapper for
ARPACK... an old-school Fortran SVD code that uses MPI for parallelization.
ARPACK - Arnoldi Package
This is what we used when I was at Demand Media.
For a more modern solution, you can also use GraphLab, which implements
an in-memory, parallel clustering algo, and GraphChi, which implements
the same off disk. Note that the original version 1.0 of GraphLab used an MPI
scheme for distributed, in-memory parallelism, although, from what I
understand, this is not a very important use case right now.
Most shops are now aggregating and placing all of their data into HDFS, and
with all the data in one place, it is now possible to start doing actual data
science. For my clients who have data in Hadoop, I recommend using map-
reduce to generate features, and then running shared-memory ML using
scikit-learn, gensim, or GraphLab. R is also very popular.
It is, however, painful to pull all of the data out of HDFS at times.
Apache Mahout is integrated into Hadoop/HDFS and implements distributed-
memory algos which can be applied to data sets that are much larger than can
be handled by other techniques. For example, trying to cluster 100M
documents, or creating a very large-scale, Netflix-style recommender. This is
the motivation.
10.1k Views · View Upvotes
Answer written ·
MapReduce
· 2011
Does Bing use Hadoop or any other implementation of
MapReduce?
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at
Techammer
Written Aug 26, 2011
Bing uses a file system called Cosmos [1] that is conceptually similar to
HDFS. Dryad [2] is their execution infrastructure; it's more expressive than
MapReduce. The majority of the queries run ove... (more)
Answer written ·
MapReduce
· Apr 6
What is the equivalent of MapReduce reducer in spark?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 6
If you are familiar with Scala then you know what the alternative is (I am only
familiar with Scala). Look at this WordCount code in Spark Scala:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word,
1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
These are three lines of code. The first and third lines read the input and save
the results. Look at the second line. Here is what it is doing:
1. flatMap splits the sentences into words
2. map makes (word, 1) tuples
3. and now the reduce part: reduceByKey(_ + _)
I hope you understand
65 Views · View Upvotes
Answer written ·
Amazon Elastic MapReduce
· Mar 17
How elastic is Amazon's cultural fit?
Toby Thain, former SDE II at Amazon (2013-2014)
Updated Mar 23
I’m going to say: Not very.
Starting right from the interview—which is evaluated in terms of the leadership
principles—you’ll have to adapt to the Amazon way of working, and embrace
those principles...
(more)
Answer written ·
MapReduce
· Mar 6
What are the ways to find top-k records in Hadoop using
Java?
Akhilesh Joshi, studies Computer Science & Data Science at The University of
Texas at Dallas (2018)
Written Mar 6
Someone please enlighten me on the same. I followed this approach:
1. wrote a simple word count program that you can find on any Hadoop
website
2. Then took the partition file and sorted using t...
(more)
Answer written ·
Cascading
· Mar 10
How do I display a red circle in Cascading Style Sheets?
Nicos Tombros, Student
Written Mar 10
In your HTML:
<div></div>
In your CSS:
div{border-radius:50%;background:#F00;height:100px;width:100px;position:relative}
This produces a div with width and height of 100px, a background color of red
(n...
(more)
Answer written ·
MapReduce
· Mar 2
What is the relationship between MapReduce and NoSQL?
Anonymous
Written Mar 2
MapReduce is offline processing. NoSQL is mostly real-time serving. I'm
worried about the methodology you used to study this topic if you need to ask
this kind of question on Quora.
33 Views
Answer written ·
MapReduce
· 2010
What is Hadoop not good for?
Bill McColl, Way back in the 1980s I did some research on parallel functional
programming models. In the 1990s, along wi...
Updated Dec 5, 2010
Great question. In my recent New York Times article "Beyond Hadoop: Next-
Generation Big Data Architectures" I mentioned a number of areas where
Hadoop can be anything from 10x-10000x too slow for w... (more)
Answer written ·
MapReduce
· Mar 21
How is "Edureka" for learning Hadoop?
Don S, Lived in India, UK, USA
Written Mar 21
Edureka is one of the best places to learn Hadoop online. Their classes are
regular and it's a mix of both explanations and hands-on. Moreover, they also
provide class recordings, which you should use to go through the topics after the
class. Apart from the course material, the faculty also teaches many practical
use cases.
98 Views · Answer requested by E.sailesh Patro
Answer written ·
MapReduce
· Apr 6
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 6
No, it does not distribute data. The idea is to take the processing to the data
instead of moving the data to the processing. Think about a GB file compared
with the 5 KB program file that processes that data. What is bet...
(more)
Answer written ·
MapReduce
· Jan 26
Is it possible to make a MapReduce Hadoop program using
VB.NET?
Harut Martirosyan, works at PicsArt Photo Studio
Written Jan 26
Not in this life, please.
84 Views
Answer written ·
MapReduce
· 2013
Why does Apache Mahout seem so popular if scikit-learn has
a much more comprehensive list of algorithms?
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at
Techammer
Written Dec 30, 2013 · Upvoted by Vladimir Novakovski, started Quora machine learning team, 2012-
2014
If you're in the Bay Area in February of 2014, Cloudera (company) is hosting a
meetup to work on the integration of scikit-learn with Apache
Spark: https://github.com/scikit-learn/....
So, people ... (more)
Answer written ·
Cascading
· Sat
What is a cascadable logic device?
Ioannis Panagiotopoulos
Written Sat
Thanks. I am not sure, though; it seems that it refers to the ability to transfer
information from one element to the neighbouring one without loss.
6 Views
Answer written ·
Cascading
· Apr 6
What is a cascadable logic device?
Nikos
Written Apr 6
Have a look at: Advances in Imaging and Electron Physics, Vol 142, page 5
(Introduction)
10 Views · View Upvotes
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Peng Cheng, I'm the lead committer of ISpark (Jupyter kernel)
Written Mar 18
Spark; please don’t learn MapReduce 1.0 as a framework, it can’t be deader.
MapReduce as an algorithm & architecture, however, is very important.
100 Views · View Upvotes
Answer written ·
MapReduce
· Apr 7
How do I explore data in R with MapReduce? If I will make a
prediction, how can I use R llpackages for prediction with
MapReduce?
Justin Watkins, Big Data trainer with Think Big Analytics, Certified Hadoop
Professional
Written Apr 7
Instead of investigating R on MapReduce, investigate using Spark. Apache
Spark is a different distributed processing engine that can run on top of
Hadoop, that works in a similar - but different - ...
(more)
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Tousif Khan, works at EPlanet Communications -
Written Mar 18
Hey,
Take a look at one of my posts here where I did a comparison of both of them;
it may help you make a decision:
http://www.tkhan.org/why-would-i...
140 Views · View Upvotes
Answer written ·
MapReduce
· Feb 19
What is Map-Reduce?
Shiva Bhusal, works at Bowling Green State University
Written Feb 19
In simple words, MapReduce is breaking a task into several parts and then
combining the results. Map does the splitting part and Reduce does the
combining part.
99 Views · View Upvotes
Answer written ·
MapReduce
· Mar 21
Are Oozie and MongoDB needed for Hadoop?
Deepak Kumar, GNNIT from GNIIT Software Engineering (2000)
Written Mar 21
Hi,
No, Oozie and MongoDB are not needed for Hadoop, although Oozie can be
installed with Hadoop for managing Hadoop jobs.
Check Big Data tutorials, technologies, questions and answers.
Thanks
36 Views
Answer written ·
MapReduce
· Jan 19
How did Doug Cutting create Hadoop?
Manish Ranjan, studied at Apache Hadoop
Written Jan 19
Who better to talk about it than Cutting himself. Here is where he talks about
it.
85 Views · View Upvotes
Answer written ·
MapReduce
· Mar 2
Why doesn't MapReduce use memory?
Anonymous
Written Mar 2
The whole point of MapReduce is to process data in such large volumes that
they couldn't fit in memory.
18 Views
Answer written ·
MapReduce
· 2013
What are some interesting beginner level projects that can
be built using Apache Hadoop?
Sean Owen, Director, Data Science @ Cloudera
Updated Jan 8, 2015
Try making a job that efficiently counts word co-occurrence, and occurrence,
and then computes term log-likelihood similarity. This is interesting in practical
terms because it's the essential basi... (more)
Answer written ·
MapReduce
· 2014
Am I expected to know how to use MapReduce in an
interview for an internship position at Google?
Gayle Laakmann McDowell, worked at Google
Written Oct 2, 2014 · Upvoted by Brian Schmitz, worked at Google and Jeremy Miles, Quantitative
Analyst at Google (2015-present)
Not unless there's a reason to believe you would or should know it. Perhaps if
your focus is something with system design you should understand it. But in
general, no. Google is more concerned with... (more)
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Anonymous
Written Mar 18
If you are really serious about getting into the Big Data field, then this question is
irrelevant. You will have to study both.
Coming to your question - you are comparing Apples to Oranges (for the most
part)
...
(more)
Answer written ·
Cascading
· Feb 28
In cascading dropdown, how can I set an option if it is
selected using JavaScript and the options are set in variables
(variablename=options)?
Med Unes, Full stack (PHP/SYMFONY - JS) Web developer
Written Feb 28
Well, I’ve created a code snippet that demonstrates how to achieve this.
Here it is: Live demo
47 Views
Answer written ·
Bulk Synchronous Paral...
· Mar 22
4 bit synchronous counter with parallel load?
Bryce Bradford
Written Mar 22
Try putting this question into any search bar of your choice, google, yahoo,
bing, etc, and your answer will be one of the first few results.
83 Views · View Upvotes
Answer written ·
Bulk Synchronous Paral...
· Mar 28
4 bit synchronous counter with parallel load?
Simon Fitch, BEng from King's College London (1989)
Written Mar 28
74′161
61 Views · View Upvotes
Answer written ·
MapReduce
· Apr 7
Is Hadoop MapReduce dead and will be replaced by Apache
Spark?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 7
Not at all. Both Spark and Hadoop complement each other and they have their
own use cases. I have answered such questions multiple times, but here I go
again. Speed is not the only factor to consider when ...
(more)
Answer written ·
Bulk Synchronous Paral...
· Mar 22
4 bit synchronous counter with parallel load?
Graham Cox, Has a degree in Electronics Design and Technology.
Written Mar 22
Yes, please. With 7-segment decoded outputs and look-ahead carry on the side.
95 Views · View Upvotes