https://www.quora.com/topic/MapReduce
Answer written ·
MapReduce
· 2013
What is an intuitive explanation of MapReduce?
Pararth Shah, Stanford MS CS '15, IIT-Bombay BTech CS '13, Google intern
'12, ML enthusiast
Updated Feb 2, 2014 · Upvoted by Amogh Akshintala, PhD Computer Science, University of North
Carolina at Chapel Hill (2020)
Long long ago, in a galaxy far far away, there lived an energetic young space
commander named Sheriff Sequential. Although he ruled the tiny planet
of Pentium Single-Core 1.3ghz, he harbored ambiti... (more)
Answer written ·
MapReduce
· 2014
How does SAP's Big Data platform HANA differ from
Hadoop and MapReduce platforms?
Brian Feeny, SAP HANA Certified
Written Jan 21, 2014 · Upvoted by Manish Kaduskar, Architect at SAP and Abhishek Kumar Singh, SAP
HANA Trainer
They are nothing alike. It is like comparing Apples and Oranges. SAP HANA is
an in-memory database. Fundamentally, it operates just like many other OLAP
databases, with many performance enhanceme... (more)
Answer written ·
MapReduce
· 2011
How does YARN compare to Mesos?
Jay Kreps, I work with Hadoop at LinkedIn
Updated Sep 8, 2011 · Upvoted by Alex Feinberg, Distributed systems engineer
Both systems have the same goal: allowing you to share a large cluster of
machines between different frameworks.
For those who don't know, NextGen MapReduce is a project to factor the
existing Ma... (more)
Answer written ·
MapReduce
· 2012
What is the best directory structure to store different types
of logs in HDFS over the time?
Eric Sammer, ex-Cloudera ('10-'14), wrote Hadoop Operations for O'Reilly
Written Apr 25, 2012
Best, as you know, is subjective. The way I approach this is to consider what
directory structures are used for:
Organization
Access control, visibility, and auditing
Resource control and allocation ...
(more)
Answer written ·
MapReduce
· 2010
What is Hadoop not good for?
Sameer Al-Sakran
Written Dec 1, 2010
Assuming you're talking about the MapReduce execution system and not
HDFS/HBase/etc --
Easy things out of the way first:
Real time anything
You can use hadoop to do precalculations, but will nee... (more)
Answer written ·
MapReduce
· 2015
What is Map-Reduce?
Eric Wu, worked at Google
Written Jan 21, 2015
Let's say we have the text for the State of the Union address and we want to
count the frequency of each word. You could easily do this by storing each
word and its frequency in a dictionary and looping through all of the words in
the speech. However, imagine if we have the text for all of Wikipedia (say, a
billion words) and we wanted to do the same thing. Our poor computer would
be stuck looping for ages!
MapReduce is a programming model that allows tasks (like counting
frequencies) to be performed simultaneously (in parallel) on many (distributed)
computers. Now, instead of one computer having to loop through a billion
words, we can have 1,000 computers simultaneously looping through only
a million words each -- that's a 1000x improvement!
There are two parts to MapReduce: map(), and reduce().
Map() takes in a word and "emits" a key-value pair. In this case, key = a
word, and value = 1 (we just have one instance of the word). For instance, the
phrase "bright yellow socks" would emit ("bright", 1), ("yellow", 1), and ("socks",
1). Map() is called for every single word -- all one billion of them.
(You might be asking why we did this very redundant step. Bear with me for a
second!)
Now, we have one billion key-value pairs. What do we do with these?
This is where Reduce() comes in. All of the key-value pairs with the same key
are combined and fed into Reduce(). Let's take "hello" for instance, and say
that there are 500,000 occurrences of the word in Wikipedia. Then, we have
500,000 key-value pairs that are passed into a single Reduce() call. Now, all we have
to do is loop through 500,000 ("hello", 1) pairs and sum up all of the values.
This is much easier than having to loop through a billion. Reduce() then "emits"
another key-value pair; this time, it's ("hello", 500000), or the word, and the
total sum of the values associated with that word. Repeat the Reduce() call for
every other word, and we've got the frequency for every word!
By distributing this massive job to many computers, if any one of these
computers fails for any reason, we can just restart the specific segment it was
working on. Imagine if this happened on a single computer!
Note that this is a very simple example. MapReduce has been used on many
problems that can be fit into this Map()/Reduce() model, such as Google's
search index, machine learning, and statistical analysis.
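To make the flow above concrete, here is a minimal single-machine sketch in Scala; the object name, the function names, and the use of plain collections in place of a real distributed runtime are illustrative assumptions, not part of the original answer:

// map(): each word "emits" a key-value pair (word, 1).
// reduce(): all pairs sharing a key are combined and their values summed.
object WordCount {
  def mapStep(text: String): Seq[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  def reduceStep(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // expected counts: bright -> 2, yellow -> 1, socks -> 2 (map order may vary)
    println(reduceStep(mapStep("bright yellow socks bright socks")))
  }
}

In the real thing, mapStep runs on many machines at once and the grouping (the "shuffle") happens over the network before each reduceStep call.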
21.4k Views · View Upvotes
Answer written ·
MapReduce
· 2014
What is the difference between Apache Spark and Apache
Hadoop (Map-Reduce) ?
Suman Bharadwaj, was big data developer at Intel
Updated Mar 28, 2016
I'll mention the differences on the shuffle side, at a very high level and as I
understand them, between Apache Spark and Apache Hadoop MapReduce.
Since a few folks have already mentioned di...
(more)
Answer written ·
MapReduce
· 2015
What makes Spark faster than MapReduce?
Sandy Ryza, Apache Hadoop PMC, software engineer at Cloudera
Written Mar 9, 2015
I think there are three primary reasons.
The main two reasons stem from the fact that, usually, one does not run a
single MapReduce job, but rather a set of jobs in sequence.
1. One of the main l... (more)
Answer written ·
MapReduce
· 2011
How does YARN compare to Mesos?
Jie Li, Data Infra at Pinterest
Written Dec 14, 2011
Thanks for Jay Kreps and Arun C Murthy's good summary!
After playing around with both Mesos and YARN, I want to elaborate on one of Arun C
Murthy's points, "Maybe that is the way to look at Mesos v/s YA... (more)
Answer written ·
MapReduce
· 2014
How does YARN compare to Mesos?
Matthew G Trifiro, SVP Marketing & Business Development, Mesosphere
Written Feb 8, 2014
There is one gigantic difference between YARN and Mesos—so big, in fact, that I
am surprised that I am the first to mention it.
YARN is Hadoop-specific and is, therefore, specifically targeted at
sche... (more)
Answer written ·
MapReduce
· 2014
What are the advantages of DAG (directed acyclic graph)
execution of big data algorithms over MapReduce?
Tathagata Das, PMC of Apache Spark, builds all things streaming in Spark.
Written Nov 24, 2014 · Upvoted by Don van der Drift, Quora Data Scientist
[Disclaimer: I am an Apache Spark committer]
TL;DR - Conceptually, the DAG model is a strict generalization of the
MapReduce model. DAG-based systems like Spark and Tez that are aware of the
whole DAG of operations can do better global optimizations than systems like
Hadoop MapReduce, which are unaware of the DAG to be executed.
Long version:
Conceptually speaking, the MapReduce model simply states that distributed
computation on a large dataset can be boiled down to two kinds of
computation steps - a map step and a reduce step. One pair of map and reduce
does one level of aggregation over the data. Complex computations typically
require multiple such steps. When you have multiple such steps, it essentially
forms a DAG of operations. So a DAG execution model is essentially a
generalization of the MapReduce model.
While this is the theory, different systems implement this theory in different
ways, and that is where the "advantages" and "disadvantages" come from.
Computations expressed in Hadoop MapReduce boil down to multiple
iterations of (i) read data from HDFS, (ii) apply map and reduce, (iii) write back
to HDFS. Each map-reduce round is completely independent of the others, and
Hadoop has no global knowledge of what MR steps are going to
come after each MR. For many iterative algorithms this is inefficient, as the
data between each map-reduce pair gets written to and read from the filesystem.
Newer systems like Spark and Tez improve performance over Hadoop by
considering the whole DAG of map-reduce steps and optimizing it globally (e.g.,
pipelining consecutive map steps into one, not writing intermediate data to
HDFS). This avoids writing data back and forth after every reduce.
Storm, being a streaming system, is slightly different from the batch processing
systems referred to earlier. It also sets up a DAG of nodes and lets the records stream
between the nodes. It's best to compare Storm with Spark Streaming
(a streaming system built over Spark) rather than Hadoop MapReduce. Both accept a
DAG of operations representing the streaming computation, but then process
the DAG in slightly different ways. Storm sets up a DAG of nodes and allocates
each operation in the DAG to different nodes. Spark Streaming does not
pre-allocate; rather, it uses the underlying Spark mechanisms to dynamically
allocate tasks to available resources. This gives different kinds of performance
characteristics.
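As a small illustration of that global optimization, here is a hedged Spark sketch in which consecutive map-like steps pipeline into a single stage, with a shuffle only at the reduce; the paths and the assumed SparkContext sc are placeholders, not from the answer:

// Consecutive narrow transformations (flatMap, map, map) are pipelined into
// one stage; nothing is written to HDFS between them, unlike chained MR jobs.
val counts = sc.textFile("hdfs:///input")     // assumes an existing SparkContext sc
  .flatMap(_.split(" "))
  .map(_.toLowerCase)                         // fused with flatMap into the same stage
  .map(word => (word, 1))
  .reduceByKey(_ + _)                         // the only shuffle in the whole DAG
counts.saveAsTextFile("hdfs:///output")       // a single write at the end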
21.6k Views · View Upvotes · Answer requested by Umanga Bista
Answer written ·
MapReduce
· Mar 1
What are the main ideas behind map reduce?
Michael Ernest, Production-level work in Hadoop operations and application
architectures
Written Mar 1
In short, that a wide domain of computing tasks can be expressed in two
phases of operation: scanning and transforming data (mapping), followed by
consolidating that data to some value or summary (reducing).
...
(more)
Answer written ·
MapReduce
· 2014
What exactly is Apache Spark and how does it work?
Reynold Xin, Chief Architect @ Databricks
Updated Sep 3, 2014 · Upvoted by Thia Kai Xin, Data scientist at Lazada, Co-Founder of DataScience
SG. and Sean Owen, Director, Data Science @ Cloudera
Originally Answered: How does Apache Spark work?
In many ways, Spark is a better implementation of the MapReduce paradigm
(not a huge surprise since Matei Zaharia who created Spark also worked on
Hadoop MapReduce in its early days).
From a progr... (more)
Answer written ·
MapReduce
· 2010
What's the best way to come up to speed on MapReduce,
Hadoop, and Hive?
Amund Tveit, Co-Founder of Atbrox
Updated Apr 27, 2010
The best way is to start as hands-on as possible, e.g. download Hadoop,
write some mappers and reducers, and run them on a few datasets. The original
MapReduce paper [1] describes a few examples that are nice to start with:
a) word count
b) distributed grep
c) reverse web link graph
d) term vector per host
e) inverted index
f) distributed sort
Then continue at larger scale with Elastic MapReduce [2-4], or set up and use
Hadoop on your own cluster.
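To get a feel for example (b) before touching a cluster, here is a minimal single-machine sketch of distributed grep in map/reduce terms; the object and function names are my own, and plain Scala collections stand in for real mappers and reducers:

// "Distributed grep" from the MapReduce paper: map emits a line if it
// matches the pattern; reduce is the identity (here we just print matches).
object DistributedGrep {
  def mapStep(pattern: String)(line: String): Seq[(String, String)] =
    if (line.contains(pattern)) Seq((pattern, line)) else Seq.empty

  def main(args: Array[String]): Unit = {
    val lines = Seq("the map step", "the reduce step", "shuffle and sort")
    lines.flatMap(mapStep("step")).foreach { case (_, line) => println(line) }
  }
}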
References
[1] http://labs.google.com/papers/ma...
[2] http://aws.amazon.com/elasticmap...
[3] http://developer.amazonwebservic...
[4] http://aws.amazon.com/about-aws/...
6k Views · View Upvotes
Answer written ·
MapReduce
· 2011
What is MapReduce?
Mohsin Shafeeque Hijazee, I've been programming since 2001 and my
languages include x86 Assembly, C, C++, C#, Java, Python, Ruby, Gro...
Written Oct 26, 2011
MapReduce, when spelled jointly, is a distributed computing paradigm inspired
by ideas from functional languages. To process a large number of inputs,
there are two functions not surprisingly name... (more)
Answer written ·
MapReduce
· 2011
What are some good class projects for machine learning
using MapReduce?
Alex Kamil, studied at Columbia University
Updated Oct 30, 2014 · Upvoted by Amund Tveit, PhD in machine learning. and Manohar Kuse, PhD
Candidate researching computer vision and machine learning in robotics.
Try implementing some ML algorithms not yet covered (or poorly covered)
in Apache Mahout: What are some important algorithms not yet covered in
Mahout? , and What are the top 10 data mining or mac... (more)
Answer written ·
MapReduce
· Mar 27
What are examples in which MapReduce does not work?
What are examples in which it works well? What are the
security issues involved with the cloud?
Nikita Gureev, M.S. from Royal Institute of Technology (2018)
Written Mar 27
While it is three separate questions, I’ll try to cover the first two and provide
some musing for the third one.
MapReduce[1] is generally used in Big Data tasks in order to process enormous
amount ...
(more)
Answer written ·
MapReduce
· 2014
Why did Google stop using MapReduce and start
encouraging Cloud Dataflow?
Kenneth Tran, Machine Learning Engineer at Microsoft
Updated Jul 1, 2014 · Upvoted by David Marek, worked at Google and Jeff Nelson, Invented
Chromebook, Former Googler
I'm not sure if Google has stopped using MR completely. My guess is that no
one is writing new MapReduce jobs anymore, but Google would keep running
legacy MR jobs until they are all replaced or be... (more)
Answer written ·
MapReduce
· 2011
How does YARN compare to Mesos?
Arun C Murthy, Founder & Architect, Hortonworks Former Architect & Lead,
Hadoop Map-Reduce Team, Yahoo; Frmr. VP, Apache H...
Updated Mar 5, 2012
Good summary Jay, thanks.
I'd like add some clarifications:
# It's trivial to simulate the Mesos offer/reject model in YARN. You ask for
*any*, i.e. non-locality specific, resources and then rejec... (more)
Answer written ·
MapReduce
· 2011
What are some good MapReduce implementations for
graphs?
Ankur Dave, Spark committer at UCB AMPLab
Updated Mar 30, 2014 · Upvoted by Neha Narkhede, Co-founder and CTO at Confluent and Apurv
Verma, ML@GATech
Before looking at individual implementations, it's helpful to narrow down the
search by identifying some essential features of distributed graph processing
frameworks. Here are three important ones:
...
(more)
Answer written ·
MapReduce
· 2015
Is Hadoop dead and is it time to move to Spark?
Sean Owen, Director, Data Science @ Cloudera
Updated Jun 24, 2015
ATA I think this is like asking: is Linux dead and should we move to Docker? or
something like that. Linux is of course shorthand for a large number of related
technologies at this point, even if i... (more)
Answer written ·
MapReduce
· 2013
What is an intuitive explanation of MapReduce?
Gautam Singaraju, inquisitive.
Written Nov 15, 2013
Let's say there is an assorted box of candies and a teacher wants to count how
many different kinds of chocolate exist. One could probably do it by counting a
single jar: 5 red, 10 green, etc.
Now l... (more)
Answer written ·
MapReduce
· 2014
What is the difference between MapReduce, artificial
intelligence, and machine learning? Or rather, how are they
related?
Smit Mehta, Googler
Written Sep 26, 2014
MapReduce - An engineering solution to handle operations (like sorting,
searching, etc) on huge data sets.
Imagine, if you want to sort 1000 strings. It's easy and doable on a single
computer. Th... (more)
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Adam Albertson, works at Amazon Web Services
Written Mar 18
In short, All of the Above.
Big Data is a rather large field and to be successful in it, you need to be pretty
well rounded. This means not allowing yourself to be so narrowly focused that
you’re a ...
(more)
Answer written ·
MapReduce
· Tue
Can I use Hadoop as a PC?
Amr Salah, Hadoop Developer
Written Tue
The nature of the problems that Hadoop solves is processing massive amounts of
data. If you assign tasks to Hadoop, they must be splittable across machines.
So a giant CSV file can be split and...
(more)
Answer written ·
MapReduce
· 2012
What are the advantages of Hadoop over distributed
RDBMS?
Charles Zedlewski, I work at Cloudera, purveyors of Apache Hadoop and
related things.
Written Feb 25, 2012
I think even at fairly large scales of data (let's say a few hundred TBs), it's rare
that people are using Hadoop as a substitute for a parallel database. Hadoop
has some different design goals and consequently a different set of strengths &
weaknesses relative to parallel RDBMS.
Hadoop is a lot more flexible. Unlike parallel RDBMS you don't have to pre-
process the data before you can use it. You don't have to design a star schema
or update some data dictionary or manipulate it with a separate ETL process.
Moreover you can change the schema after the fact with very little cost or
effort. There are some workloads where flexibility is very valuable and those
workloads are moving to Hadoop pretty quickly.
Hadoop is more scalable. The largest publicly discussed Hadoop cluster
(Facebook's) was at 30 petabytes mid last year, and it has grown since then. I
don't think there are parallel RDBMSes that have come close to those kinds of
numbers.
Hadoop is more economical. Factoring in all the costs you can get down to
~$500 / TB pretty easily with Hadoop and even lower if cost is what really
matters to you. There's no parallel RDBMS that comes close to those numbers
to my knowledge. Since log, text and image data is often much bulkier than
transactional data it's often been kept out of parallel RDBMS since the
economics just wouldn't make any sense.
Parallel RDBMS has a lot to say for itself as well. It's much more optimized /
optimizable (e.g. various indexing, caching, join strategies, etc) which makes it
more performant and efficient for many workloads. It supports a much richer
set of industry standard SQL functions and a broader range of tools that
support those SQL functions. It also provides lower latency for interactive
queries.
I don't think the dividing line is so much structured / unstructured data but
rather about asking repeated questions of known data (where a fixed schema
and optimization pays off) versus asking novel questions of not well known data
(where a fixed schema and all those optimization techniques are a hindrance).
But in general this is a "right tool for the right job" kind of answer.
23.9k Views · View Upvotes
Answer written ·
MapReduce
· Feb 2
Do most machine learning algorithms run in batch, or do
they run every time they get a new bit of data?
Håkon Hapnes Strand, Machine Learning Engineer
Written Feb 2
We need to make two distinctions here.
First of all, we need to distinguish on the type of learning. There are two ways
of training machine learning models:
Offline learning: The model is trained once...
(more)
Answer written ·
MapReduce
· Mar 16
Does HBase use Hadoop to store data or is a separate
database independent of Hadoop?
Mark Vogt
Written Mar 16
Part of your confusion arises from how you’re using the term “Hadoop” itself…
Hadoop is NOT one thing; even now in early 2017 Hadoop is considered to be
at least 2 things:
1. The Map/Reduce Processing Engine on top of HDFS; and
2. The Hadoop Distributed File System (HDFS) itself.
It’s clear from your question that you’re thinking “Hadoop = HDFS”, but it’s
important to be more precise than this, because you can see from all the
comments that misinformation abounds in Big Data…
RE-PHRASING your question as “Does HBase use HDFS to store data, or is it a
separate database independent of HDFS?”, you can see that the best answer
has been provided:
HBase is a type of database system (storage + mechanisms for
adding/changing/deleting/searching) consisting of one or more often-
massive (think “millions of columns, billions of rows”) column-family
based tables;
These tables are split up along horizontal lines to form massive strips
called table “regions”;
These regions (“table-ets”) are then stored as “HFiles” in HDFS, as if
they were any other type of (unstructured) file, allowing for distributed
storage and processing of those files by hundreds or even thousands of
servers, each processing (again, for a database, “processing” consists of
adds/changes/deletes/searches) its own little portion of this massive
table;
LONG ANSWER TO THE SHORT QUESTION:
HBase USES HDFS as its storage mechanism, by partitioning massive table
structures into smaller “HFiles” which are then treated like any other file in
HDFS.
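As a hedged aside (not part of the original answer): because region files are plain HDFS files, you can see them with the ordinary HDFS API. The /hbase/data path below assumes a default hbase.rootdir:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List HBase's table data directory like any other HDFS directory;
// the entries ultimately contain the regions' HFiles.
object ListHBaseFiles {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())   // picks up core-site.xml settings
    fs.listStatus(new Path("/hbase/data")).foreach(s => println(s.getPath))
  }
}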
Hope this helps anyone else who’s been struggling with this concept.
Mark in North Aurora
180 Views
Answer written ·
MapReduce
· 2011
Would Facebook have been able to scale effectively if Google
had not publicly described MapReduce in 2004 and BigTable
in 2006?
Edmond Lau, former Engineer at Google (2006-2008)
Written Oct 5, 2011 · Upvoted by Thach Nguyen, interned at Google and Brian Schmitz, worked at
Google
Yes, but Facebook would have had to significantly invest in building out its
distributed data computation and warehousing layers to be able to scale
effectively, and it would've delayed its climb ... (more)
Answer written ·
MapReduce
· Mon
Can I find any sample Hadoop clusters online so that I can
practice Hadoop development?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Mon
I have not seen any cluster available for free. You have to pay a bit to work with
one. All major clouds have their offerings with per-hour charges. In Microsoft
Azure, if you set up a cluster for $15 for 24 hours....
(more)
Answer written ·
MapReduce
· Feb 3
What are the best ways to learn about Hadoop source?
Oleksii Yermolenko, Hadoop software developer at Prometric (2015-present)
Written Feb 3
Feel free to join my blog about Big Data - https://oyermolenko.blog. I’ve been
working with Hadoop for a couple of years and in this blog want to share my
experience from the early start. My blog i...
(more)
Answer written ·
MapReduce
· 2011
Would Facebook have been able to scale effectively if Google
had not publicly described MapReduce in 2004 and BigTable
in 2006?
Jay Kreps, I work with Hadoop at LinkedIn
Written Feb 28, 2011 · Upvoted by Jack Lindamood and Olaoluwa 'Ola' Okelola, Engineer at Facebook
Yes.
My understanding is that Facebook's live site is served off memcached and
MySQL which obviously have little to do with the Google papers.
Their big, difficult apps like photo sharing, ad tar... (more)
Answer written ·
Cascading
· Mar 10
How do I display a red circle in Cascading Style Sheets?
Rob Parham, LAMP stack developer
Written Mar 10
You can’t display anything with CSS alone, but if you put this class on a block
level element it will display as a red circle:
.circle{
border-radius:50%;
border:1px solid red;
background:red;...
(more)
Answer written ·
MapReduce
· 2010
What are the advantages/disadvantages running Cloudera's
distribution for Hadoop on EC2 instances rather than using
Amazon's Elastic Map Reduce Service?
Jeff Hammerbacher, Founder and Chief Scientist at Cloudera (2008-present)
Written Nov 10, 2010 · Upvoted by Eric Sammer, works at Cloudera and Henry Robinson, works at
Cloudera
Some nice aspects of EMR:
Dynamic MapReduce cluster sizing.
Ease of use for simple jobs via their proprietary web console.
Great documentation.
Integrates nicely with other Amazon Web Services.
Some nice aspects of CDH:
CDH is open source; you have access to the source code and can
inspect it for debugging purposes and make modifications as required.
CDH can be run on a number of public or private clouds using an open
source framework, Whirr [1], so you're not tied to a single cloud
provider.
With CDH, you can move your cluster to dedicated hardware with little
disruption when the economics make sense. Most non-trivial
applications will benefit from this move.
CDH packages a number of open source projects that are not included
with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue.
You have access to the complete platform composed of data collection,
storage, and processing tools.
CDH packages a number of critical bug fixes and features and the most
recent stable releases, so you're usually using a more stable and
feature-rich product. For example, we added support for Hadoop 0.20
on 9/10/09 [2], while EMR did not have support for 0.20 until 6/2/10
[3].
You can purchase support and management tools for CDH via Cloudera
Enterprise [4].
CDH uses the open source Oozie [5] framework for workflow
management. EMR implemented a proprietary "job flow" system before
major Hadoop users standardized on Oozie for workload management.
CDH uses the open source Hue [6] framework for its user interface. If
you require new features from your web interface, you can easily
implement them using the Hue SDK.
CDH includes a number of integrations with other software
components of the data management stack, including Talend,
Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others.
If you have an existing analytics infrastructure, it's easy to integrate
CDH.
CDH has been designed and deployed in common Linux environments
and you can use standard tools to debug your programs. I have not
experienced this firsthand, but a number of our customers have
reported that it's difficult to debug failing programs on EMR because
of the custom version of Hadoop being run and the opacity of the
environment in which it is run. Again, this is hearsay, so if others
disagree, I'll remove this point.
[1] http://incubator.apache.org/whirr/
[2] http://www.cloudera.com/blog/200...
[3] http://aws.amazon.com/releasenot...
[4] http://www.cloudera.com/products...
[5] http://yahoo.github.com/oozie/
[6] https://github.com/cloudera/hue
14.1k Views · View Upvotes · Not for Reproduction
Answer written ·
MapReduce
· Feb 18
Can MapReduce read input from memory instead of file?
Fred Williams II, IBM systems programmer since 1981
Written Feb 18
Let’s see here…
Your computer “always” reads things from memory. The processor in your
machine has things called “registers” which contain memory addresses (or
instructions what to do with those mem...
(more)
Answer written ·
MapReduce
· Jan 14
What is the relationship between MapReduce and NoSQL?
Nicolae Marasoiu, 3+ years big data: Hadoop, HBase, Spark, Storm, Druid,
Zookeeper, Pig, Consul.
Written Jan 14
They both appeared as solutions for handling more data and more users.
MapReduce is efficient for batch processing: big throughput (can process
millions of input records per second, depending on the clu...
(more)
Answer written ·
MapReduce
· Feb 3
What are some good examples of problems I can solve with
MapReduce for time-series analytics?
Bob Marshall, B. S. Chemistry & Computer Science, University of Notre Dame
(1980)
Written Feb 3
MapReduce works on key:value transformations. Tom White’s book Hadoop:
The Definitive Guide provides a MapReduce example of processing the National
Weather Service’s historical climate data to determine the highest and lowest
temperature for each year across all the weather stations in the country. Each
mapper processes the data for its input split, writing the highest and lowest
temperatures (values) for each year of data (key). Following completion of all
mappers, the data is shuffled and sorted such that all the key:value pairs for
each unique key will be combined on a particular node. Then, reducers run to
determine the highest and lowest values for each year. This is a great example
of a time series problem, with rather large granularity of time values.
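A minimal sketch of that per-year high/low pattern, with invented sample records and plain Scala collections standing in for real mappers and reducers:

// Each record is (year, temperature). groupBy plays the role of the
// shuffle/sort that brings all pairs for one key (year) to one reducer.
object YearlyExtremes {
  def main(args: Array[String]): Unit = {
    val readings = Seq((1950, 22), (1950, -3), (1951, 30), (1951, 5))
    readings
      .groupBy(_._1)                      // shuffle and sort by key (year)
      .map { case (year, recs) =>         // reduce: highest and lowest per year
        val temps = recs.map(_._2)
        (year, temps.max, temps.min)
      }
      .foreach { case (y, hi, lo) => println(s"$y: high=$hi, low=$lo") }
  }
}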
303 Views
Answer written ·
MapReduce
· 2014
What are the differences between batch processing and
stream processing systems?
Sean Owen, Director, Data Science @ Cloudera
Written Oct 27, 2014
ATA I think the example is wrong in a few ways. Although people use the word
in different ways, Hadoop refers to an ecosystem of projects, most of which are
not processing systems at all. It contains MapReduce, which is a very batch-
oriented data processing paradigm.
Spark is also part of the Hadoop ecosystem, I'd say, although it can be used
separately from things we would call Hadoop. Spark is a batch processing
system at heart too. Spark Streaming is a stream processing system.
To me a stream processing system:
Computes a function of one data element, or a smallish window of
recent data
Computes something relatively simple
Needs to complete each computation in near-real-time -- probably
seconds at most
Computations are generally independent
Asynchronous - source of data doesn't interact with the stream
processing directly, like by waiting for an answer
A batch processing system to me is just the general case, rather than a special
type of processing, but I suppose you could say that a batch processing system:
Has access to all data
Might compute something big and complex
Is generally more concerned with throughput than latency of individual
components of the computation
Has latency measured in minutes or more
I sometimes hear streaming used as a sort of synonym for real-time. Real-time
stuff usually takes the form of needing to respond to an event in milliseconds,
as in a synchronous API. This isn't streaming to me.
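As a rough illustration of the contrast drawn above (a hedged sketch, not from the answer): the host, port, path, and the assumed StreamingContext ssc and SparkContext sc are placeholders:

import org.apache.spark.streaming.Seconds

// Stream side: a simple function of a smallish window of recent data,
// with latency measured in seconds.
val recent = ssc.socketTextStream("localhost", 9999)
recent.window(Seconds(30)).count().print()

// Batch side: has access to all the data; throughput matters more than latency.
val total = sc.textFile("hdfs:///all-data").count()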
28.8k Views · View Upvotes · Answer requested by Prashant Raaghav
Answer written ·
MapReduce
· Mar 20
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Don S, Lived in India, UK, USA
Written Mar 20
First and foremost you must learn HDFS which is the distributed file system
used in Hadoop and the related linux commands.
Then you must learn the Hadoop architecture; though you will not use it on ...
(more)
Answer written ·
MapReduce
· 2013
Will Spark overtake Hadoop? Will Hadoop be replaced by
Spark?
Matei Zaharia, CTO @ Databricks
Written Nov 18, 2013
Originally Answered: Will Apache Spark ever overtake Apache Hadoop?
It depends a bit on what you mean by "Hadoop". Some people take Hadoop to
mean a whole ecosystem (HDFS, Hive, MapReduce, etc), in which case Spark is
designed to fit well within the ecosystem (reading from any input source that
MapReduce supports through the InputFormat interface, being compatible
with Hive and YARN, etc). Others refer to Hadoop MapReduce in particular, in
which case I think it's very likely that non-MapReduce engines will take over in
a lot of domains, and in many cases they already have.
From this latter point of view, perhaps the most interesting thing about Spark
is that it shows that a lot of workloads can be captured efficiently by the same,
simple generalization of the MapReduce model. Spark can achieve (and
sometimes beat) state-of-the-art performance in not only simple ETL, but also
machine learning, graph processing, streaming, and relational queries.
Importantly, this means that applications can combine these workloads more
efficiently. For example, once you ETL data in, you can easily compute a report
or run a training algorithm on the same in-memory data. Furthermore, you get
the same programming interface to combine these jobs in, and only one system
to manage and install.
How much will this matter? It's hard to predict, but one possibility is that after
experimenting with specialized computing models, distributed programmers
will want to have a general model, in the same way that programmers for a
single machine settled on general-purpose languages. Having a general
platform is even more important in big data, because data is so expensive to
move across systems! In this case, Spark shows that many of the tricks used in
specialized systems today (e.g. column-oriented processing, graph partitioning
tricks) can be implemented on a general platform.
In any case, it is a first-order goal of the system to stay compatible with the
wider Hadoop ecosystem, and just give people better ways to compute on the
same data. The Hadoop ecosystem is also moving quickly towards supporting
alternative programming models, through efforts like YARN.
24.8k Views · View Upvotes
Answer written ·
MapReduce
· Jan 18
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Evan Mouzakitis, Research Engineer at Datadog, wrote about Hadoop
Written Jan 18
One of the big draws for Hadoop is that it moves the computation to the data,
rather than the other way around. This reduces overall computation time as
data does not need to travel across a network in order to be operated on. So, to
more pointedly answer your question: the scheduler (usually YARN) schedules
the map to occur on a node that is as close to the data as possible. You can
read more about Hadoop architecture here.
124 Views · View Upvotes
Answer written ·
MapReduce
· 9h
Can I use Hadoop as a PC?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written 9h ago
Hadoop has a specific purpose, and that is to process massive amounts of data
in parallel and nothing else. If you have 10 low-valued transport motor cars, can
you add them together to function as a single lu...
(more)
Answer written ·
MapReduce
· Jan 16
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Joe Nap, Sr. Data Engineer at Blue Apron
Written Jan 16
Your latter intuition is correct. Files are split in HDFS across physical machines
in block sizes of (typically) 64MB or 128MB. This is done 1) for high availability
of the data, in the event a machine or disk dies. And 2) So that processing
(mapping) of each block is split across physical processors to increase the
throughput of data processed.
Each block is typically replicated 3 times. Beyond simply keeping extra copies
of the data, the replicas are placed so that 2 copies are stored on one
networked rack, while the 3rd is stored on a remote rack. That way, if you have
an individual machine failure, you can continue processing on the same rack,
while if you experience a network failure, the scheduler can reschedule that
chunk of work on another, accessible rack. It's often said that it's cheaper to
move the computation than to move the data.
HDFS Architecture Guide
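A small sketch of how those two settings surface through Hadoop's configuration API; the values shown are the common defaults, and setting them programmatically like this is just an illustration:

import org.apache.hadoop.conf.Configuration

// Replication factor and block size are ordinary configuration properties
// (cluster-wide defaults that can also be overridden per file).
val conf = new Configuration()
conf.set("dfs.replication", "3")         // two copies on one rack, a third on a remote rack
conf.set("dfs.blocksize", "134217728")   // 128 MB blocks (older versions defaulted to 64 MB)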
243 Views · View Upvotes
Answer written ·
MapReduce
· Feb 1
Do most machine learning algorithms run in batch, or do
they run every time they get a new bit of data?
Zeeshan Zia, PhD in Computer Vision and Machine Learning
Written Feb 1
Most papers in machine learning conferences talk about or use algorithms
running in batch mode. Almost all ML algorithms taught in pretty much any
basic ML course operate in batch-mode.
In the real-...
(more)
Answer written ·
MapReduce
· Feb 1
Why doesn't MapReduce use memory?
Vidhi Goel, worked at Samsung Electronics
Written Feb 1
This is indeed a very good question. MapReduce traditionally uses local disks to
store chunks of data local to a worker.
Recently, there have been developments to make it in-memory. For reference, see
In-Memory MapReduce.
Quoted from the webpage: “Apache Ignite comes with in-memory
implementation of Hadoop MapReduce APIs which provides significant
acceleration over the native Hadoop MapReduce implementation.”
Another alternative to MapReduce is Spark, which uses in-memory RDDs and
provides scalable fault tolerance. It is also a more generalized framework.
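A brief sketch of the Spark point: once an RDD is cached, later jobs reuse it from memory instead of re-reading it from disk as classic MapReduce would. The path and the assumed SparkContext sc are placeholders:

val logs = sc.textFile("hdfs:///logs").cache()          // mark the RDD for in-memory reuse
val errors   = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
val warnings = logs.filter(_.contains("WARN")).count()  // second pass is served from memory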
166 Views
Answer written ·
MapReduce
· Jan 17
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Eduardo Avaria Suazo, Big Data Architect at Globant (2016-present)
Written Jan 17
Data is already split, distributed, replicated, and referenced in memory at
the namenode. When an MR application is submitted, the code is packaged and
distributed, making sure that all data is processed by a mapper, so the general
idea is that the code is moved to where the data is and not the other way
around.
The exception to this is where for some reason the data is not usable as-is (e.g.,
a compressed .gz file that spans multiple machines and needs to be
reconstructed), or you define another mapper behavior so that data needs to be
moved to complete the task.
115 Views
Answer written ·
Cascading
· Mar 10
How do I display a red circle in Cascading Style Sheets?
Christopher Bobby, Web Dev
Written Mar 10
.className or #IDname of your html element {
width: 150px;
height: 150px;
border-radius: 50%;
background-color: red;
/* make your width and height the same, and always set border-radius to 50% or
higher, though 50% will just do the work */
}
Answer written ·
MapReduce
· Jan 14
Will MapReduce work with shape based image retrieval?
Zeeshan Zia, PhD in Computer Vision and Machine Learning
Written Jan 14
Yes, it will. It's ideally suited for retrieval. That's precisely why Google focused
on these particular primitives as the basis for their compute infrastructure.
Shape-based retrieval will also essentially represent the query and database
images in some feature space, and exploit Map and Reduce operations for
finding distances from the query image ("map"; or do some kind of Locality
Sensitive Hashing, inverted indexing, etc.) and return the examples with
minimum distance ("reduce").
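A toy sketch of that retrieval pattern, with invented two-dimensional feature vectors; the names and the distance function are mine, not the answer's:

// "map": compute a distance from the query to every database image;
// "reduce": keep the image at minimum distance.
object NearestImage {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    val query = Array(0.1, 0.9)
    val db = Map("img1" -> Array(0.2, 0.8), "img2" -> Array(0.9, 0.1))
    val (bestId, bestDist) = db.map { case (id, f) => (id, dist(query, f)) }
      .minBy(_._2)
    println(s"$bestId at distance $bestDist")
  }
}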
455 Views · View Upvotes
Answer written ·
MapReduce
· 2014
Why did Google stop using MapReduce and start
encouraging Cloud Dataflow?
Jeff Nelson, Invented Chromebook, Former Googler
Written Jun 28, 2014 · Upvoted by Dhananjay Nakrani, works at Google and Samyak Datta, former
Software Engineering Intern at Google
I'd speculate this is a move toward finer grain concurrency mechanisms, which
MapReduce doesn't provide. The multi-core CPUs and GPUs provide plenty of
opportunity for concurrency, it's just a que... (more)
Answer written ·
MapReduce
· Feb 2
Do most machine learning algorithms run in batch, or do
they run every time they get a new bit of data?
Harut Martirosyan, Ms.C Applied Mathematics & Computer Science, State
Engineering University of Armenia (2008)
Written Feb 2
Mostly, algorithms run in batch mode, but there are cases where they are real-
time. For instance, the EdgeRank algorithm can run in real-time, whereas more
complex algorithms can't afford that due to time ...
(more)
Answer written ·
MapReduce
· Jan 16
I've seen a number of homegrown solutions, but are there
any MR/HPC platforms which exist for lower-latency use
cases?
Greg Keller, Co-Founder at R Systems NA, Inc. (2007-present)
Written Jan 16
You would need to define the latency you are fighting more specifically to be
sure. The original homegrown Hadoop used old and low-end hardware, under
the assumption that it was more about bulk than speed.
The filesystem made multiple copies of everything for resilience, so you
didn't need expensive shared storage and networking. But it turns out that if you
already have the networking (RDMA on IB or 40/10GbE) and fast parallel
filesystems, every node can see all the data, so better performance is
achieved for many tasks while reducing the need to make extra copies of blocks
and the performance that that steals.
Your latency issue might be solved either by setting the filesystem up to make
more than 3 copies, if your workload is bursty, or by a strong shared
filesystem, if your workload is relentless with no time for all the duplication.
95 Views · Answer requested by Abdelaziz Mohamed
Answer written ·
MapReduce
· Tue
Can I use Hadoop as a PC?
Simon Thompson, nom nom nom nom. Cookie.
Written Tue
Let’s say you connected together 10 lawn mowers, each mower is 35 hp. Does
this give you a sports car? Nope, it gives you a big lawnmower, and that’s what
Hadoop is - it’s a lawnmower for data. If you have 100 nodes in a Hadoop
cluster you can chew a lot more data than if you have 10, but you can’t render
a single stream of 60fps 4k video with it. (ok, if you had 100 nodes that could
each render then you could… but you don’t and you can’t!)
28 Views · View Upvotes
Answer written ·
MapReduce
· 2012
How does Hadoop compare to Google's internal MapReduce
implementation as of 2011?
Cosmin Negruseri, problem setter in Google Code Jam, trainer for the
Romanian IOI team
Updated Sep 22, 2012
Distributed sorting is a great benchmark for mapreduce implementations
because it exercises all the parts of the framework well.
The most time consuming part probably is sending the data over the ... (more)
Answer written ·
MapReduce
· 2010
What are some promising open-source alternatives to
Hadoop MapReduce for map/reduce?
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at
Techammer
Updated Jul 27, 2012
Hadoop consists of many subprojects: HDFS, MapReduce, Hive, Pig, HBase,
and Avro. I believe this question refers to the MapReduce implementation,
which can operate over a variety of storage systems. I agree that Hadoop could
use decent competition; having used the software daily for many years, I
disagree that it's becoming more complex with each release. As an example,
see the heroic efforts of Chris Douglas on https://issues.apache.org/jira/b... to
remove some of the tuning parameters for a MapReduce job. I suppose that's
for another question though.
Both MapReduce implementations mentioned above (CouchDB and MongoDB)
require that data live in the data stores specified and present far different
semantics than those described in the Google paper.
Here are some MapReduce-ish implementations, all of which are either coupled
to a single storage system, a single programming language, or implement only
a small subset of the features of a mature MapReduce implementation:
Disco: http://discoproject.org
Misco: http://www.cs.ucr.edu/~jdou/misco/
Phoenix: http://mapreduce.stanford.edu
Cloud MapReduce: http://code.google.com/p/cloudma...
bashreduce: http://blog.last.fm/2009/04/06/m...
Qizmt: http://code.google.com/p/qizmt
HTTPMR: http://code.google.com/p/httpmr
Galago's TupleFlow: http://www.galagosearch.org/guid...
Skynet: http://skynet.rubyforge.org
Sphere: http://sector.sourceforge.net
Riak: http://riak.basho.com/mapreduce....
Starfish: http://rufy.com/starfish/doc/
Octopy: http://code.google.com/p/octopy/
MPI-MR: http://www.sandia.gov/~sjplimp/m...
Filemap: http://mfisk.github.com/filemap/
Plasma MapReduce: http://projects.camlcity.org/pro...
Mapredus: http://rubygems.org/gems/mapredus
Mincemeat: http://remembersaurus.com/mincem...
MapReduceTitan: http://www.kitware.com/InfovisWi...
GPMR: http://www.idav.ucdavis.edu/rese...
Elastic Phoenix: https://github.com/adamwg/elasti...
Peregrine: http://peregrine_mapreduce.bitbu...
R3: http://heynemann.github.com/r3/
Also, Microsoft's DryadLINQ is available under an academic license (not quite
open source) at http://research.microsoft.com/en....
Disclosure: I founded Cloudera, a company that provides commercial support
for a distribution of Hadoop, among other things.
33.5k Views · View Upvotes · Not for Reproduction
Answer written ·
MapReduce
· 2013
Will Spark overtake Hadoop? Will Hadoop be replaced by
Spark?
Sean Owen, Director, Data Science @ Cloudera
Written Nov 18, 2013
Originally Answered: Will Apache Spark ever overtake Apache Hadoop?
Spark is in a sense already part of Hadoop. It already runs on YARN, which is
Hadoop 2's generalized execution environment (Launching Spark on YARN) --
not just Mesos. And for example we (Cloudera) support it via Databricks on top
of CDH (Databricks and Cloudera Partner to Support Spark). So there is no
either/or here to begin with.
The larger point, I suppose, is that Hadoop is not one thing to be replaced by
one other thing. It actually names a large ecosystem of components. Spark
itself has no counterpart for a lot of what's under the Hadoop umbrella (M/R,
Zookeeper, Sentry, Hue, HDFS, etc.) But it is also almost surely true that many
things in the Hadoop ecosystem will subsume others. M/R is not going away for
example, but, it is not the right tool for many jobs on Hadoop, and Spark is a
right-er tool for many of those things, so it or something like it is going to
replace plenty of M/R usages.
To your particular points:
Spark is not an ML library itself, but has a small library called MLlib associated
with it. Spark is a better execution environment for anything iterative, and lots
of ML is, so it's going to do better than M/R-based things like Mahout for
speed. For non-iterative computations there's not really an advantage, and I
would imagine that more mature M/R-based implementations, even, are
preferable to MLlib for now. For the niche of algorithms that are naturally
graph-oriented, I also don't think Spark has an advantage over specialist graph
frameworks like GraphLab. For general ML, maybe so.
Spark itself doesn't have ETL-oriented tooling like Pig or CDK (someone
correct me?). As an architecture, it's better for ETL-like jobs that involve
anything that looks like a join. For simpler ETL, M/R and its associated tooling
are likely still the natural choice -- this is what they were built for.
Shark demonstrates how much better a non-M/R architecture can be for join-
like operations that you execute through things like Hive -- Shark is a lot faster,
although Hive is closing the gap ably given that it's M/R-based. Shark still
won't generally catch up to specialist Hadoop SQL engines like Impala (see
even Impala 1.0 performance vs Shark: Big Data Benchmark). It's a great
option since it is generally compatible with the same formats, metastore, query
language, etc as all of these. Lots of good choice here.
This is all to say that Spark + its tools are very good, given how much is offered
from just this one project. The good news is that there is no either/or choice,
not anymore.
15.2k Views · View Upvotes · Answer requested by Mohitdeep Singh
Answer written ·
MapReduce
· Feb 16
How do I implement SCD type 2 using Pig, Hive, and
MapReduce on Hadoop?
Anonymous
Written Feb 16
Given that Hive now supports ACID properties with CRUD capabilities, this may
be done as in any other traditional database.
434 Views
Answer written ·
MapReduce
· Mar 30
What makes Spark faster than MapReduce?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Mar 30
Spark's in-memory capabilities make it faster. MapReduce writes
intermediate results back to disk and reads them back, so in a single
MapReduce job the data is written to and read from disk multiple times. Disk
reads are the worst enemy of data processing time. So Spark uses its in-memory
capabilities and keeps the data in memory during data processing. A very
simple example (I guess this will not be accurate, but it may clear up the point):
z = (a + b) * c
MapReduce will process this as follows:
Read a, b from disk
process a + b
write the result to disk
read the intermediate result and c from disk
process the data and write the result back to disk
Spark will process it as follows:
Read a and b from disk
process (a + b) and keep the result in memory
get c from disk and process the final result
write the result back to disk
Notice that because the intermediate results are kept in memory, Spark is
faster.
But of course you need more memory to take advantage of Spark, which is
expensive; otherwise it will be only as good as MapReduce, or worse.
45 Views
Answer written ·
MapReduce
· Apr 1
Do you have real-time experience with Hadoop?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 1
Got 4 years with Hadoop, Spark, Storm with others (MapReduce, Hive, Pig,
Oozie)
https://www.linkedin.com/in/shah...
75 Views
Answer written ·
MapReduce
· Apr 6
Is Hadoop dead and is it time to move to Spark?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 6
No, this is not true. MapReduce/Hadoop and Spark will stay together. Both
have their own use cases where they are useful. In short, use Spark when you
need to process data quickly or data is coming in streams, and use Hadoop
when you can process data in batches.
So not every application is suitable for Spark. Here is a short list of
applications suitable for Spark:
When data is accumulating for, say, an hour and you need to process it
quickly, within a few minutes, to update results every hour
Data is coming in streams and you need to process it as quickly as
possible (Spark Streaming)
Machine learning algorithms need many iterations, so they need faster in-
memory processing
And here is a list of applications that do not need Spark:
You need to process updated data, say, every day, and the data should be
ready the next day for reviews or analysis. So the data can be scheduled
to be processed in batch during night hours. For example, if an application
sends recommended products to subscribers each day, then Spark is
not needed, as Hadoop can process the data in batch
When you have a longer time, say a week or 2 weeks, to process the data.
So keep in mind that if you can afford longer data processing latency, then
move the application to Hadoop and MapReduce. It will be much more
economical. It would be a waste of money to buy an expensive machine that
processes the data in 10 minutes and then sits idle for the remaining 23 hours
and 50 minutes; why not buy a less expensive machine and process the data in,
say, 10 hours?
67 Views
Answer written ·
MapReduce
· 2013
Why does Apache Mahout seem so popular if scikit-learn has
a much more comprehensive list of algorithms?
Charles H Martin, Calculation Consulting; we predict things
Written Dec 31, 2013 · Upvoted by Vladimir Novakovski, started Quora machine learning team, 2012-
2014 and Yuval Feinstein, Algorithmic Software Engineer in NLP,IR and Machine Learning
Scikit-learn is limited to a single processor and runs in memory, and is
therefore limited to problems that fit into memory and don't require a lot of
time to run, or can be run in batch mode.
For, say, SVM classification, this is fine, as it hooks into the very famous libsvm
and liblinear packages. When the problem is too large for liblinear, the
preferred solution is Vowpal Wabbit, which runs off disk and can be used for
pseudo-online learning.
For clustering a large collection of documents, scikit-learn and VW are not so
great... although, in principle, I recall that VW has something.
The best Python solution for this is probably gensim, which has a wrapper for
ARPACK... an old-school Fortran SVD code that uses MPI for parallelization.
ARPACK - Arnoldi Package
This is what we used when I was at Demand Media.
For a more modern solution, you can also use GraphLab, which implements
an in-memory, parallel clustering algo, and GraphChi, which implements
the same off disk. Note that the original version 1.0 of GraphLab used an MPI
scheme for distributed, in-memory parallelism, although, from what I
understand, this is not a very important use case right now.
Most shops are now aggregating and placing all of their data into HDFS, and
with all the data in one place, it is now possible to start doing actual data
science. For my clients who have data in Hadoop, I recommend using map-
reduce to generate features, and then running shared-memory ML using
scikit-learn, gensim, or GraphLab. R is also very popular.
It is, however, painful to pull all of the data out of HDFS at times.
Apache Mahout is integrated into Hadoop/HDFS and implements distributed-
memory algos which can be applied to data sets that are much larger than can
be handled by other techniques. For example, trying to cluster 100M
documents, or creating a very large-scale, Netflix-style recommender. This is
the motivation.
10.1k Views · View Upvotes
Answer written ·
MapReduce
· 2011
Does Bing use Hadoop or any other implementation of
MapReduce?
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at
Techammer
Written Aug 26, 2011
Bing uses a file system called Cosmos [1] that is conceptually similar to
HDFS. Dryad [2] is their execution infrastructure; it's more expressive than
MapReduce. The majority of the queries run ove... (more)
Answer written ·
MapReduce
· Apr 6
What is the equivalent of MapReduce reducer in spark?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 6
If you are familiar with Scala then you know what the alternative is (I am only
familiar with Scala). Look at this WordCount code in Spark Scala:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word,
1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
These are three lines of code. The first and third lines read the input and save
the results. Look at the second line. Here is what it is doing:
1. flatMap splits the sentences into words
2. map makes (word, 1) tuples
3. and now the reduce part: reduceByKey(_ + _)
I hope you understand
65 Views · View Upvotes
Answer written ·
Amazon Elastic MapReduce
· Mar 17
How elastic is Amazon's cultural fit?
Toby Thain, former SDE II at Amazon (2013-2014)
Updated Mar 23
I’m going to say: Not very.
Starting right from the interview—which is evaluated in terms of the leadership
principles—you’ll have to adapt to the Amazon way of working, and embrace
those principles...
(more)
Answer written ·
MapReduce
· Mar 6
What are the ways to find top-k records in Hadoop using
Java?
Akhilesh Joshi, studies Computer Science & Data Science at The University of
Texas at Dallas (2018)
Written Mar 6
Someone please enlighten me on the same. I followed this approach:
1. wrote a simple word count program that you can find on any Hadoop
website
2. Then took the partition file and sorted using t...
(more)
Answer written ·
Cascading
· Mar 10
How do I display a red circle in Cascading Style Sheets?
Nicos Tombros, Student
Written Mar 10
In your HTML:
<div></div>
In your CSS:
div{border-radius:50%;background:#F00;height:100px;width:100px;position:relative}
This produces a div with width and height of 100px, a background color of red
(n...
(more)
Answer written ·
MapReduce
· Mar 2
What is the relationship between MapReduce and NoSQL?
Anonymous
Written Mar 2
MapReduce is offline processing. NoSQL is mostly real-time serving. I'm
worried about the methodology you used to study this topic if you need to ask
this kind of question on Quora.
33 Views
Answer written ·
MapReduce
· 2010
What is Hadoop not good for?
Bill McColl, Way back in the 1980s I did some research on parallel functional
programming models. In the 1990s, along wi...
Updated Dec 5, 2010
Great question. In my recent New York Times article "Beyond Hadoop: Next-
Generation Big Data Architectures" I mentioned a number of areas where
Hadoop can be anything from 10x-10000x too slow for w... (more)
Answer written ·
MapReduce
· Mar 21
How is "Edureka" for learning Hadoop?
Don S, Lived in India, UK, USA
Written Mar 21
Edureka is one of the best places to learn Hadoop online. Their classes are
regular and it's a mix of both explanations and hands-on. Moreover, they also
provide class recordings, which you should use to go through the topics after the
class. Apart from the course material, the faculty also teaches many practical
use cases.
98 Views · Answer requested by E.sailesh Patro
Answer written ·
MapReduce
· Apr 6
Does the map function in Hadoop MapReduce distribute the
data from HDFS?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 6
No, it does not distribute data. The idea is to take the processing to the data
instead of moving the data to the processing. Think about a GB file compared
with the 5 KB program file that processes that data. What is bet...
(more)
Answer written ·
MapReduce
· Jan 26
Is it possible to make a MapReduce Hadoop program using
VB.NET?
Harut Martirosyan, works at PicsArt Photo Studio
Written Jan 26
Not in this life, please.
84 Views
Answer written ·
MapReduce
· 2013
Why does Apache Mahout seem so popular if scikit-learn has
a much more comprehensive list of algorithms?
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at
Techammer
Written Dec 30, 2013 · Upvoted by Vladimir Novakovski, started Quora machine learning team, 2012-
2014
If you're in the Bay Area in February of 2014, Cloudera (company) is hosting a
meetup to work on the integration of scikit-learn with Apache
Spark: https://github.com/scikit-learn/....
So, people ... (more)
Answer written ·
Cascading
· Sat
What is a cascadable logic device?
Ioannis Panagiotopoulos
Written Sat
Thanks. I am not sure, though; it seems that it refers to the ability to transfer
information from one element to the neighbouring one without loss.
6 Views
Answer written ·
Cascading
· Apr 6
What is a cascadable logic device?
Nikos
Written Apr 6
Have a look at: Advances in Imaging and Electron Physics, Vol 142, page 5
(Introduction)
10 Views · View Upvotes
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Peng Cheng, I'm the lead committer of ISpark (Jupyter kernel)
Written Mar 18
Spark; please don’t learn MapReduce 1.0 as a framework, it can’t be deader.
MapReduce as an algorithm & architecture, however, is very important.
100 Views · View Upvotes
Answer written ·
MapReduce
· Apr 7
How do I explore data in R with MapReduce? If I will make a
prediction, how can I use R llpackages for prediction with
MapReduce?
Justin Watkins, Big Data trainer with Think Big Analytics, Certified Hadoop
Professional
Written Apr 7
Instead of investigating R on MapReduce, investigate using Spark. Apache
Spark is a different distributed processing engine that can run on top of
Hadoop, that works in a similar - but different - ...
(more)
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Tousif Khan, works at EPlanet Communications -
Written Mar 18
Hey,
Take a look at one of my posts here where I did a comparison of both of them;
it may help you make a decision:
http://www.tkhan.org/why-would-i...
140 Views · View Upvotes
Answer written ·
MapReduce
· Feb 19
What is Map-Reduce?
Shiva Bhusal, works at Bowling Green State University
Written Feb 19
In simple words, MapReduce is breaking a task into several parts and then
combining the results. Map does the splitting part and Reduce does the
combining part.
99 Views · View Upvotes
Answer written ·
MapReduce
· Mar 21
Are Oozie and MongoDB needed for Hadoop?
Deepak Kumar, GNNIT from GNIIT Software Engineering (2000)
Written Mar 21
Hi,
No, Oozie and MongoDB are not needed for Hadoop, although Oozie can be
installed with Hadoop for managing Hadoop jobs.
Check Big Data tutorials, technologies, questions and answers.
Thanks
36 Views
Answer written ·
MapReduce
· Jan 19
How did Doug Cutting create Hadoop?
Manish Ranjan, studied at Apache Hadoop
Written Jan 19
Who better to talk about it than Cutting himself. Here is where he talks about
it.
85 Views · View Upvotes
Answer written ·
MapReduce
· Mar 2
Why doesn't MapReduce use memory?
Anonymous
Written Mar 2
The whole point of MapReduce is to process data in such large volumes that
they couldn't fit in memory.
18 Views
Answer written ·
MapReduce
· 2013
What are some interesting beginner level projects that can
be built using Apache Hadoop?
Sean Owen, Director, Data Science @ Cloudera
Updated Jan 8, 2015
Try making a job that efficiently counts word co-occurrence, and occurrence,
and then computes term log-likelihood similarity. This is interesting in practical
terms because it's the essential basi... (more)
Answer written ·
MapReduce
· 2014
Am I expected to know how to use MapReduce in an
interview for an internship position at Google?
Gayle Laakmann McDowell, worked at Google
Written Oct 2, 2014 · Upvoted by Brian Schmitz, worked at Google and Jeremy Miles, Quantitative
Analyst at Google (2015-present)
Not unless there's a reason to believe you would or should know it. Perhaps if
your focus is something with system design you should understand it. But in
general, no. Google is more concerned with... (more)
Answer written ·
MapReduce
· Mar 18
What should I learn, Hadoop (MapReduce/Pig/Hive) or
Spark?
Anonymous
Written Mar 18
If you are really serious about getting into the Big Data field, then this question is
irrelevant. You will have to study both.
Coming to your question - you are comparing Apples to Oranges (for the most
part)
...
(more)
Answer written ·
Cascading
· Feb 28
In cascading dropdown, how can I set an option if it is
selected using JavaScript and the options are set in variables
(variablename=options)?
Med Unes, Full stack (PHP/SYMFONY - JS) Web developer
Written Feb 28
Well, I’ve created a code snippet that demonstrates how to achieve this.
Here it is: Live demo
47 Views
Answer written ·
Bulk Synchronous Paral...
· Mar 22
4 bit synchronous counter with parallel load?
Bryce Bradford
Written Mar 22
Try putting this question into any search bar of your choice, google, yahoo,
bing, etc, and your answer will be one of the first few results.
83 Views · View Upvotes
Answer written ·
Bulk Synchronous Paral...
· Mar 28
4 bit synchronous counter with parallel load?
Simon Fitch, BEng from King's College London (1989)
Written Mar 28
74′161
61 Views · View Upvotes
Answer written ·
MapReduce
· Apr 7
Is Hadoop MapReduce dead and will be replaced by Apache
Spark?
Shahzad Aslam, Software / BigData Engineer (https://goo.gl/wc54BM)
Written Apr 7
Not at all. Both Spark and Hadoop complement each other and they have their
own use cases. I have answered such questions multiple times, but here I go
again. Speed is not the only factor to consider when ...
(more)
Answer written ·
Bulk Synchronous Paral...
· Mar 22
4 bit synchronous counter with parallel load?
Graham Cox, Has a degree in Electronics Design and Technology.
Written Mar 22
Yes, please. With 7-segment decoded outputs and look-ahead carry on the side.
95 Views · View Upvotes