Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka


A very warm welcome to all of you to this Spark tutorial from Edureka. Before I start, can I get a quick confirmation from all of you that I am loud and clear? On your right-hand side you will find a chat option or a question panel; you can just type into either of those. Very good, thank you, thank you Saurabh. So, as you can see, I have asked you a question and you have just posted a response here, so feel free to interrupt me anytime in the middle, whenever you have any doubt, and I will be answering, okay? You can just interrupt me from there and we can take up your questions. Let's start.

What can you expect from this webinar? What is Apache Spark, and why are we learning this new technology? In today's world you must be hearing a lot about Apache Spark; people say Apache Spark is the next big thing. Why are people saying that? What are the features in Apache Spark because of which we say it is the next big thing? What are the use cases related to Apache Spark? What does the Apache Spark ecosystem look like? We will also do some hands-on examples during the session, and in the end I will walk you through a project related to Apache Spark. So that is what you can expect from this session.
Moving further, before we even talk about what Apache Spark is, it is very important to understand Big Data, because that is what Apache Spark will be used on, right? So what does this keyword Big Data mean? That is the first thing we are going to discuss. Now, if I ask you what Big Data is, what do you understand by Big Data? What would be your response? Can I get some answers? On your right-hand side you will see a question panel, you can just answer from there. Seriously, please make this a little interactive, it will really help you to understand this topic well. I assure you that by the end of this class you will all leave with a good knowledge of what Apache Spark is, but you need to help me make it interactive. So tell me, what do you understand by the keyword Big Data?

Very good, one of you is saying that it refers to the huge amount of data that is generated every minute on the internet from various sources. Very good answer. So what are we saying? A large amount of data generated on the internet or in corporate networks, okay. It can be text, images, video, streams, very good. But just look at your statement: you are saying that a large volume of data is what we call Big Data. Is that really the case? Can I call just the volume of data Big Data? No, volume is just one property of the data. If I still need to define what Big Data is, I need to define it in broader terms. I need to talk not only about the volume but also about the various sources from which the data is getting generated. For example, Facebook is generating a lot of data, news channels, the medical domain, all these domains are generating Big Data.
Now if I talk about the various kinds of sources generating it, we are also talking about variety, right. And in the end we will also be talking about velocity, the speed with which this data is growing, because look at Facebook: it is just a 10 or 12 year old company, it is not a very old company. In these 10 to 12 years, Facebook has grown its data exponentially; they are dealing with a very huge amount of data. A few months back I read a post from Mark Zuckerberg, the CEO of Facebook; he mentioned on his Facebook timeline that Facebook today has a number of users equivalent to the number of people who were living on the planet 100 years ago. That is a big statement. Yes Sammy, we can also deal with unstructured data, I am coming to that point. So they are talking about a big thing, right. Now this is a challenge for Facebook; you can imagine how big an amount of data we are talking about.
Now with respect to the number of users itself it sounds like huge data. And what are the activities you do on Facebook? You post, right: maybe you type a message, you also upload your pictures, you upload your videos, you upload your audio, you do all that stuff. Now, is that the kind of formatted data we used to store in our RDBMS systems? The answer is no, right. They are definitely not very well formatted data; they are a different category of data, and that category is called unstructured data. Now, can your RDBMS systems deal with that kind of data? The answer is no. Our RDBMS can deal with all the structured kind of data, data which has some sort of pattern. When we talk about Hadoop, we also talk about audio and video, which in other words we call unstructured data, okay. So that is also a format, the variety of data we deal with. So we cannot just say that because the data is huge we call it Big Data, no, that is just one property. What if I have unstructured data? Even if it is small in size, we still have to use the Hadoop tools, the Big Data tools, to solve it, because our RDBMS is not efficient at solving all those kinds of problems. So that is one thing.

Now, whatever data you get can also have some sort of problems: there can be missing data, there can be corrupt data. How to deal with that data is called veracity, and that is also one property of Big Data.
So you can see that Big Data is not just about volume; it consists of multiple factors like velocity, variety and veracity. All of these are important components of Big Data. Like I just said, Facebook in 12 years has grown its data so much; in terms of the number of users alone it sounds like Big Data, and on top of that the users are doing all these activities on the platform, so imagine how much data Facebook must be dealing with. Similarly, not only Facebook: if we talk about Instagram, every minute so many posts are getting liked, and I am talking about every minute, not even on a per-day basis. On YouTube, every minute, hours and hours of video are getting uploaded, but when you search anything on YouTube, does it make your query slow? No. How are they able to handle all that data so efficiently? We can talk about Facebook, where every minute so many users are posting things or liking something, so many events are occurring; we can talk about Twitter, where so many tweets are happening every minute. So much activity is happening per minute, and we are only talking per minute; you can imagine what must be happening overall. In fact, there is a fun statistic which tells us that every two years the data in the world is getting doubled. If you want to reach the moon, just burn all the data you have right now onto discs, and you will be able to reach the moon twice. That is the amount of data we are dealing with at the current moment.
Moving further, now imagine what is going to happen by 2020. Whenever I take these batches I always tell people that you are all sitting on a data bomb, and that bomb is going to go off very soon, because what is currently happening is that only 4 to 5% of the companies which work with data have realized the potential of the data. Now the challenge with the rest is that they are hesitant to move towards Big Data, to decide whether it is safe to use the Hadoop tools or not, and the reason is that they are afraid of what will happen in case they shift to the Big Data domain tomorrow: will they be getting good support, will they get the number of people who will be able to solve the problems for them? They are still thinking about all these problems, and they are hesitant for the same reason to use technologies like the Big Data tools. But they cannot stay like this for long, because there will definitely be a stage where they will not be able to use an RDBMS at all, or any traditional system at all, and in that situation they will need to make this transition. So it is expected that by 2020 this 5% of companies will grow to 40%, and imagine: right now itself, when you go to indeed.com or naukri.com, you see so many jobs popping up for Apache Spark, Big Data and all. Imagine what is going to happen by 2020.
There will be a huge demand and less supply of people, I can definitely say this. So, if you are working in, let's say, a database company, there must be senior managers, maybe senior directors, maybe VPs, and you must sometimes be thinking that these people were really lucky: they started their careers 20 years back when Oracle DB or RDBMS were just coming up, and today they have become VPs while I am still sitting at the software developer position. That is a very general thought which runs through some of your minds, I am pretty sure about it. Now you are sitting at exactly the same kind of position. Tomorrow's generation, your future generation, is going to think in exactly the same way; they will also be thinking that these guys were lucky, this Big Data domain was just coming up, they started with Apache Spark and today they have become VPs while I am still sitting at this position. So you will be occupying that place very soon, because this is a domain which is going to blast, that is for sure. And this is not me telling it, these are all predictions from the analysts, and I am not talking about small ones; you can just read the blogs, you can easily find all those things. In fact a lot of people have even gone to the level of saying that in the next five years the companies which do not transform towards Big Data or Apache Spark will not even be able to survive in the market. This is also being said by the analysts.
Now imagine by 2020 how much data we will be dealing with: you can talk about any mall, shopping cart, vehicles, all these sorts of events are generating data, so imagine the amount of data we are going to deal with. In fact you might have heard about this term IoT, the Internet of Things. That itself requires Big Data, right, because it is generating so much data. So, so many things will be happening around you.

Now, talking about Big Data Analytics: what exactly is this Big Data Analytics, what exactly do we do there? First of all let us understand what analytics is. Analytics is the process where you are given the data and you generate some insight from it, some meaningful insight. You want to get some information from the data, because when the data is just sitting with you, you do not have any idea about what is there in the data; but when you work with that data as an analyst, you want to generate some meaningful information out of the data, and that is called analytics. Now, the major challenge with Big Data is that the data has grown in volume to such a great extent: how can we analyze that data, can we use that data to gain some business insight? All those points we want to understand, and that domain is called Big Data Analytics.

Now there are two sorts of analytics which are generally done: the first sort is called Batch Analytics, and the second sort is called Real Time Analytics. What is all that? Let's understand it one by one. What exactly are Batch Analytics and Real Time Analytics?
Now, everybody must be using a washing machine at home, or at least must have heard about a washing machine. What do you really do? Either you collect your clothes and then wash them all on some day, or as soon as you take off your clothes you wash them immediately. Generally you do the first part, right: you keep collecting the clothes, and maybe one day you just put them all in the washing machine and process all your clothes in one go. This kind of processing, where you collect some data and then process it later, is called Batch Processing. So when you do some sort of processing on historical data, that is called Batch Processing. Now for Real Time Processing, let's see one example.
Let's say we are talking about your credit card transactions; I am pretty sure most of you must be using a credit card or debit card online. Even if you make a payment to Edureka, you might be doing it online, so definitely everybody must be using their cards. Now let's say you are sitting right now in India, in Bangalore city, and doing a credit card transaction, and immediately, ten minutes later, your card is also swiped in the US. Is that possible? Definitely not. But do you think it makes sense for the bank to just let the transaction happen and later check whether it was a genuine transaction or not? Definitely not, they don't want to wait, because otherwise, if the fraud happens, it will be their loss. So what do they do? As soon as they receive a real time event that a person is trying to swipe the card at some location which does not look like a genuine transaction, they will either start sending you an OTP, or they will block that transaction, or they will immediately give you a call and ask whether you have done this transaction because it looks unusual to them. They will start asking all those questions, and only once you approve will they let that transaction happen. So is this processing happening on historical data or on the current data? The current data. Which means we are doing this processing in real time: as and when the data is coming, I am doing the processing immediately. As soon as I swipe the card, in real time, my system should get activated, start running the algorithms, and check whether I should allow this transaction or not. This second type of processing is called Real Time Processing.

So, just to explain the difference between Batch Processing and Real Time Processing: Batch Processing works on historical data, while the second kind of processing works on the immediate data. That's the difference between them.
While we are talking about all that, let me mention a few Real Time Analysis use cases. I just talked about the credit card use case in banking. It is also very important for government agencies, for example when you are applying for your Aadhaar card; if you are in India you might have done that. Let me give you one more instance of Real Time Processing: if we talk about any stock market analysis, right, Stock Market Analysis. There are a lot of companies, I am not sure whether you have heard about these names, like Morgan Stanley and Goldman Sachs. What do they do? They have developed smart algorithms. You give them your money, or your stocks, and that algorithm will do a prediction and tell them, okay, this stock price is going to go high, this stock price is going to go low. They are not making their algorithms public, because if they do that it will be their loss; but what they have is a smart algorithm, and that algorithm is working in real time. If at any moment they see any unusual event because of which the market can go down, or the price of a particular stock can come down, they will immediately sell it, so that the customer does not go into a loss; and if they find some event in real time where a stock can make a profit, they will buy that stock. So this set of algorithms is running at real time scale. All these companies are using this Real Time Processing part.

Similarly there can be multiple examples: telecom companies, healthcare. In healthcare it is very important: a patient is coming in, and as and when the patient arrives we immediately want to get some insights from whatever information is given, and based on that do some processing, meaning start treating the patient. So all those things are also happening in real time.
Now, why Apache Spark when Hadoop is already there? Why were we talking about this Batch Processing and Real Time Processing? Let us understand that part. Point number one, which is very important: in Hadoop you can only do batch processing, which means Hadoop is not meant for real time processing. So what happens is, let's say you collected some data on day one; only on day two will you be able to process it, something of that sort. I am not saying that it has to be a full day; even if the data is, let's say, one hour old, that is still historical data, but you will not be able to process the data immediately as it arrives. That is what is being done in Hadoop systems. But when we talk about Apache Spark, there is no such restriction: as and when the data is coming, you can immediately process it. Immediate processing can happen in the case of Spark.

Now you can ask me another question: so is Spark only for real time data? No, Spark can deal with historical data, meaning batch kind of processing, as well as real time processing. So it can do both kinds of processing, and that's an advantage of Apache Spark. Is it the only advantage? No, let's understand two more things with respect to Apache Spark.

When we talk about Hadoop, Hadoop only does batch processing, while Spark can also do real time processing. So, the same thing which I just explained: with Spark you can handle data from multiple sources, you can process the data in real time, and it is very easy to use. Now, has anybody here already written MapReduce programs? If you have done that, you might know that MapReduce is not that easy; like Samir has done it, Samir, you can easily confirm that. For beginners, learning MapReduce is not an easy task, it takes time, it is complicated in terms of writing the program. With Spark things are very easy. And Spark has one more big advantage: faster processing. Spark can process data very fast in comparison to your MapReduce programs, and that is one of the major advantages of Apache Spark.

Let us now go and understand this in detail. Once I explain this part, you will be clear about why MapReduce is slower, why Apache Spark is faster, why we are making all these statements, what Apache Spark is and how it works. So let's understand this part.
So I am going to move to my whiteboard, just give me a moment, let me share my screen, okay. Let's go step by step. Let's understand what MapReduce was and what the problem with MapReduce was; remember I just said that MapReduce is slower, so what is the reason? I am going to take you into a little detail, so let's understand this part. Let me take an example: let's say I have a file, and that file is going to have some data, let's say apple, banana. I am assuming that all of you already have knowledge about Hadoop systems, you know how data is getting stored and processed in a Hadoop system; if you do not, you need to let me know. We split the data into 128 MB blocks, so I am assuming you are all already aware of this topic. Now orange, let me copy this; let's say my data is of this sort. Let's say this is my file, and I am telling you that this file is, let's say, of 256 MB. Now if I am dividing this data into my default block size, how many blocks will it create? Two blocks. So it is going to create two blocks of 128 MB each.
Now let's say your boss comes to you and says: I have a problem like this and you need to give me the solution, a word count problem. He says: in this file I have only three keywords, apple, banana, orange. How many times is apple occurring in this file, how many times is banana occurring, how many times is orange occurring? You came back and started working on it, and you thought it is an easy problem: I can divide my file into two parts of 128 MB each, and I am going to work on it in a distributed fashion.

In order to work in a distributed fashion, let's say you start solving the problem this way. I am going to denote apple by A, orange by O, banana by B, just to make it a little simpler. Now what are you going to do? Let's say apple comes first: you write apple and put a one in front of it; the second word is banana: you put a one in front of it, because it is occurring for the first time; then orange comes: again you append a one in front of it. Now apple comes again; you see that apple has already occurred before and the count was one, so this time you increase the count by one and make it two. You keep applying this algorithm in a similar manner for banana, and you keep doing this for your first block. For the second block, which may be getting processed on some other machine, you do exactly the same steps.

Now what is the next step you will be doing in this case? The next step is that you will be combining the outputs, whatever came out. So let's say first you want to combine apple: how many times did it occur? Let's say from block one you got the output apple, 20, and from block two you got the output a, 34. Similarly you did it for banana; let's say you got 56, and similarly for the banana from the other block, and for this orange and that orange. Then in the end you combine these and give the output: something like a, 20, 34, and then you do the same for banana and for orange, and in the end you combine them into the final counts. You bring this solution to your boss and tell him: I have solved this problem. Your boss is not going to be happy with you. Why?
He is going to tell you that there is a problem with this approach, this is not the right approach. Can anybody tell me where the performance bottleneck is here? Why am I telling you that this is not the right approach? Can anybody look at this and tell me where the problem is?

So, someone is saying that the aggregation part is the bottleneck; somebody else is saying that one block has to wait for the other. Let's say that problem is not there, let's say it is very quick; one having to wait for the other is not really the issue here, assume they are connected in such a way that one does not need to wait for the other. So what is the actual problem here, and what would be the solution? Can you see the problem? This 128 MB file, do you think it is small? When I have only text data, 128 MB is not small at all. Now when you are doing this step, don't you think you are decreasing your performance? Because every time an element comes in, you are going back and checking whether that element has occurred before or not, and then you are incrementing its count. Don't you think this is a bottleneck for us? I don't want to do this, because every time a new entry comes in we need to go back and check whether that element has occurred before or not; this is the major bottleneck in your algorithm.
Now, how has MapReduce solved this? What is the correct solution? Let us see how we can solve this. Where was this bottleneck coming from? It was coming because we were looking back every time. How about if we remove that bottleneck? So let me remove this solution and let us build a better solution. This time, let's say I am going to write apple, one; I am going to write banana, one; I am going to write orange, one. Now when apple comes again, this time I am not going to go and look back; I am again going to just put a comma one in front of it. So whatever key comes, I am just appending a one in front of it. Similarly, for my second block I do exactly the same thing: I am not going back and checking my previous entries.

Now, in the next step, wherever apple came I want to bring those entries together. So I am going to combine these entries from both machines: wherever apple is occurring, let's bring them together, apple comma one, apple comma one, apple comma one, from both the machines. How can we do it? By bringing everything together on one machine and then doing a sorting step. A similar thing I am going to do for banana: banana comma one, banana comma one, and I keep doing this; and the same for orange.

Now what is the next step? In the next step it is just going to combine up all the ones: wherever apple comma one was coming, I just add them up, and similarly for banana. Can everybody smell the solution now? What is the next thing I need to do? I just need to combine everything, aggregate everything. So let's say apple is occurring three times, that will give an output a comma three, then b comma three, whatever the number of ones would be, I will be combining them into that output; let's say a comma three, b comma three and so on, I am just giving an example here.

Now this is how MapReduce solves the problem. So if you see, what are the steps we did? The first step we did is called the Mapper phase. The second part, those two steps, is called the Sort and Shuffle phase. And the third step, the aggregation you are doing, is called the Reducer phase. These are the three steps involved in MapReduce programming.
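Just to make those three phases concrete, here is a tiny, purely illustrative sketch in plain Scala (not the actual Hadoop MapReduce API) that mirrors the whiteboard example: emit (word, 1) without looking back, bring equal keys together, then add up the ones.

```scala
// a minimal sketch of the three phases on an in-memory list of words
val words = List("apple", "banana", "orange", "apple", "banana", "apple")

// Mapper phase: emit (key, 1) for every word, never looking back
val mapped = words.map(w => (w, 1))

// Sort & Shuffle phase: bring all pairs with the same key together
val grouped = mapped.groupBy { case (word, _) => word }

// Reducer phase: add up the ones for each key
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// counts is Map(apple -> 3, banana -> 2, orange -> 1)
```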
Now this is how you will be solving your problem. Okay, we understood this part, but why MapReduce is slower is still a mystery to us, because we definitely want to understand why we keep saying that MapReduce is slower. To explain this, I am assuming that my replication factor is one; again, I am assuming that you know all these facts from Hadoop systems, and if you do not, you need to ask me, okay, so that I can give you an appropriate example. So right now I am assuming the replication factor is one.

Now what is going to happen? I have these two machines here; again I am assuming you know this, and if you do not you have to stop me: these two are your data nodes, okay. These two are data nodes, and the data resides at the data nodes. Let's say this is your block B1 and this is your block B2. Since I am considering that my replication factor is one, this B1 block is residing in the hard disk of data node one, and this B2 block is residing in the hard disk of data node two. This is, let us say, data node one, and this is data node two.

Now notice what is going to happen: where do you perform your processing, do you perform your processing at the disk level or at the memory level? Can I get an answer, where does your processing happen? Memory; it is always the memory where the processing happens. So now, let's say the mapper code comes in, because the first code which is going to run will be the mapper code. When the mapper code comes to this machine, this block B1 will be moved out from the disk, meaning copied from the disk to the memory of this machine; so the B1 block will come into the memory of this machine and the mapper code will be executed.
Similarly, the B2 block of the other machine will also come into the memory of that machine, and the mapper will get executed there. Now, if you are coming from a computer science background, or even if you are not, you might have heard that whenever an input-output operation happens, and when I say input-output operation I mean when you read any data from your disk or write data to your disk, it degrades your performance, because you have to do a disk seek and all that stuff. That is the reason it makes the performance slow. Now, in this example, can you notice that I am doing an input-output operation? This is my input path, copying the data into memory. Now the mapper output is produced; let me call this output O1 and that output O2. What is going to happen next? All this output will be written back to the disk: this O1 will be stored back here, O2 will be stored back there. What happened? I have stored my mapper output back on the disk, and if you notice, this is again an input-output operation; I am doing an output operation to my disk.

A question is coming up: what would happen if the block size is large, will it be an efficient use of the memory? Right now we have to assume that this memory is good enough to hold at least 128 MB of data, otherwise you will get an error. In MapReduce programming you will simply get an error if you have, let's say, 128 MB of data but less than 128 MB of memory. Spark, on the other hand, has a very smart way to solve this problem: Spark does not have any problem even if you have less memory, it still takes care of it; that is a very interesting story about Spark, but when it comes to MapReduce, it simply gives an error. Clear, Sarah? That is the reason we actually divide our data into 128 MB blocks, so that at least our memory should be enough to handle it.
Now, what will happen next? I have got my mapper outputs and I have already incurred those input-output operations. Now the sort and shuffle will happen, and the sort and shuffle will happen on one machine. So let's say this step is happening on one machine; if you notice, the data has to come from all the machines to that one machine. Let's say they decide to do the sort and shuffle on data node one; then there will be a network transfer of data, this O2 will come over here. After that, the sort and shuffle step will happen; let's say the output which comes out of it is O3. So O2 is brought into memory over the network, and O3 is again saved to the disk. After that the reducer will run: the reducer will bring O3 into memory, and then it pushes the final output to the disk.

So, so many input-output operations are happening in just one program: the mapper has done input-output, the sort and shuffle phase has done a network transfer and then more input-output, and in the third step the reducer has done input-output operations again. Can you see how much input-output is happening in one program? That is the exact reason your MapReduce programs are slower in nature. Is everyone clear why MapReduce programs are slower in nature? If you have already executed a program, for example in MapReduce, you might have noticed that when you execute it, it does not give an immediate output, it takes a good amount of time to execute overall, and the reason is that there are so many input/output operations. Thanks, Ratish.
Let's move on. So this is the problem with MapReduce; now let us see how Apache Spark solves this problem and why it is faster, why we are saying that Spark will be able to give the output in less time. Let's understand that. In order to explain this, let's say I have a file again here, and let's say my data is like this: one, three, and some more numbers; similarly there is more data, 34, 78, three, six; and similarly, let's say we have more data here, 23, 67, one, nine, and so on. Now I am telling you that this file is of 384 MB, and the second thing is, let's say the name of the file is F.txt. That is the name of the file.

Now I am writing some code that may look alien to you; please do not worry about it, because I will be explaining this portion. Do not worry about what this is, I am just writing it down for now; let's understand what exactly we are doing. In this example, let's say I have created a cluster: this is my name node, and these are your data nodes. Now, I told you that this file F.txt is of 384 MB, so it is very obvious that my file will be divided into three parts: block B1, block B2, block B3. Again I am assuming here that my replication factor is one. I am calling this the B1 block, this the B2 block and this the B3 block, each of 128 MB.

Now what would be my next step? We have just understood that we have these blocks, and let's say this file is residing in my HDFS. So where does it reside? On the disk. So let's keep it on the disk: the B1 block here, the B2 block here, the B3 block here. They will be residing on the disks of the data nodes; that is where our HDFS, our data, lives.
Now, before I even explain this part, let me tell you one more thing: what is the main entry point in Java, without which you cannot write any program? Anybody know? The main function. Without the main function you cannot do anything. In Apache Spark also there is one main entry point without which none of your applications will work, and that entry point is called the Spark Context. We also denote the Spark Context as sc. This is the main entry point, and it resides at the master machine, so we will be keeping this sc there. When you write your Java programs, let's say you have written one project, there will be a separate main function for that project, and for another project there will be a separate main function; similarly, this sc will be separate for each individual application.

Now let's understand what this first line of code does. Ignore this RDD word for a moment; you can just relate RDD to some data type, like for example in Java we have the String data type. I am not saying we can literally replace RDD with String, it just means that number is a variable, so let's assume for now that number is a variable. We have just seen that sc means the Spark Context, without which your Spark application cannot be executed. Now, this textFile is an API of Apache Spark; we will understand it in more detail in our regular sessions, but let me just give you an idea of what this textFile API does. In Apache Spark, whatever file name you have written inside it, F.txt here, it will go and search for that file and will load it into the memory of your machines. What do I mean by that?
Now, in this case F.txt is available across three machines: F.txt is the B1 block, the B2 block and the B3 block. So what is going to happen? Let me draw this: let's say this is the RAM of each machine. What will happen is that the B1 block will be copied, I am not saying moved, it will be copied, to the memory of this machine, the B2 block will be copied to the memory of that machine, and the B3 block will be copied to the memory of the third machine. This is how your blocks will be brought into the machines' memory. So we have the B1, B2 and B3 blocks here, and I am assuming that my memory is big enough to hold all this data.

Now, what about the block sizes? It is not mandatory that all your block sizes are the same, they can be different as well, it doesn't matter; whatever the block size is, irrespective of that, it is going to copy the block into the memory. That is what happens with my first line of code. Now, these three blocks which are sitting in memory combined are called an RDD; these three pieces which are sitting together in memory are our RDD. And what is the name of this RDD? We have given it the name numberRDD.
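Here is a minimal sketch, in Scala, of what that first line from the whiteboard looks like in a real program; the application name and the local master are my own assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// the main entry point: a SparkContext, one per application
val conf = new SparkConf().setAppName("NumberExample").setMaster("local[*]")
val sc   = new SparkContext(conf)

// the first line from the whiteboard: load F.txt and call the resulting RDD "numberRDD"
val numberRDD = sc.textFile("F.txt")
```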
What is an RDD? An RDD is distributed data sitting in memory. What is the full form of RDD? Resilient Distributed Dataset. Now let me ask you one question: is this distributed data or not? Yes, it is, it is distributed data. And what do you understand by resilient? Can I get an answer, what do you understand by this keyword, resilient? Resilient basically means reliable; that is the plain English meaning, you can think of it that way.

Now this brings me to a question. Whenever I am talking about RAM: if I am keeping the data in RAM, that is the most volatile thing in my whole system, because if you restart your laptop, everything gets erased from your RAM, it is the most volatile thing. Yet I am still saying that the RDD is resilient, reliable. How? Am I not going to lose the data the moment I restart my laptop or a machine goes down?

Remember the replication factor? Let's say the replication factor is two. In that case, let's say I have a few more machines: the B1 block also sits here, the B2 block is also copied here, the B3 block is also copied here, on some other machines. Now what is going to happen? Let's say one of the machines goes down, let's just kill it. What did we lose? The block that was in its memory, let's say B3, is lost from there. Now, is B3 available anywhere else? Yes, a copy of B3 is sitting on another machine. So what is going to happen? Spark will immediately load B3 into the memory of that other machine; now B1 and B3 will both be residing together in the memory of that machine, and these three blocks will again constitute the RDD. So B3 is brought back into memory and the RDD is immediately recreated. That is how an RDD ensures that it is resilient: even if you lose the data, or you lose any of the machines, it does not matter, it takes care of it. So this is the resilient portion.
Now let's move further. So we just understood what an RDD is and, secondly, that it is resilient. Let's take one more step. We have created our numberRDD; now I am also creating a filter RDD, but this time I am going to create it on top of my numberRDD, as numberRDD.map(...). Again, this map is an API; we will be understanding this part in detail in our regular sessions, but let me just give you a brief introduction to it. Whatever code you write inside this map API will be executed; whatever code you write inside this line will be executed. Right now I have just written some English words in this place, "logic to find values less than 10"; you have to replace this English phrase with some programming logic. It can be a Python program, it can be a Scala program, it can be anything, whatever program you want to write, you can mention it there. So whatever code you write, your map function, your map API, will be responsible for executing it.

Now, one more point about what we are doing here: RDDs are always immutable. Whenever I say that RDDs are immutable, it means that once you have already put the block B1 into memory, you will not be able to make any change in your block B1.
Now what is going to happen? Let us come to this part. Let's say you have written some Scala function or Python function which just finds all the values which are less than 10. In this B1 block, the output would be, let's say, one comma three, because those values are less than 10; let's call this new block B4. What would be the output from the second block? It will be three comma six; let's call this block B5. Similarly, from the third block I get, let's say, one comma nine; let's call that block B6. So this is the B4 block, this is the B5 block and this is the B6 block.

Now what is happening here? This B1 block which is sitting in memory: when this code is executed, the execution will happen on this B1 block and a new block, B4, will be created. I am not making any change in the B1 block; I am just running the processing on this B1 block and creating a new block which I am calling the B4 block. Similarly, from this B2 block you generate the B5 block, which will again be sitting in memory, and similarly the B6 block will be generated here. Now in this case your B1 block and B4 block will both be residing in memory together; similarly B2 and B5 will be residing together, and B3 and B6 will be residing together. Collectively, these three blocks B4, B5 and B6 will again be called an RDD, and the name of that RDD would be filter1 RDD.
Clear, everyone? What an RDD is and how an RDD works, is this concept clear to everyone? So this is how Spark works. Now let me ask you a question: don't you think this will be faster? Are we doing many input-output operations, like we were doing in MapReduce? No. The only input-output operation happens at the first step, when I was reading the F.txt file. After that my data is always sitting in memory, and that is the reason I am not doing any sort of input-output after that, and that is the reason it is going to give you faster output. That is why Spark is faster in comparison to MapReduce.

But doesn't the RAM become a constraint? Yeah, definitely that concern is there. But Spark has a big advantage: even if your RAM is not enough, it can handle it, and that concept is called the pipelining concept. I am not going to cover it in this session, but yes, even if your memory is less, Spark takes care of it; it is a very interesting concept, in fact. So that makes Spark a very smart framework, and that is the reason people are going for this framework.

In our regular sessions we go over all these topics in detail: what happens in this situation, what happens in that situation, all those things we cover. Will it spill the extra data to my disk? No, it will not move the data to the disk, but it will still be able to handle it; that is the best thing, right? You must be curious about how, but that is what pipelining is about. Is there any limitation on the number of concurrent read requests? No, you can read as many times as you want; only if you want to write is there a problem, there is no limitation on reads.
Now let's take one more step; we have just covered that part. If you notice what is happening here: I have a filter1 RDD now, so let me denote it as f1, this is my filter1 RDD. Is this filter1 RDD dependent on something? Yes, it is dependent on my numberRDD. Is my numberRDD dependent on something? Yes, on F.txt. And is this file dependent on anything? No. Can you see that this is a graph which I have just drawn here? This graph is maintained by the Spark Context as soon as you execute all these statements, and this graph is called a DAG, a directed acyclic graph; it is also called the lineage. So what happens in the lineage: it maintains all the information about the dependencies, like f1 has a dependency on number, and number has a dependency on F.txt. This dependency graph which it maintains is called the lineage, and this is a very important part of the topic.

Now if you notice what is happening: this B4 block got generated from the B1 block, this B5 block got generated from the B2 block, and this B6 block got generated from the B3 block. In other terms, I can say that this filter1 RDD got generated with the help of the numberRDD: number was already an RDD, and from that numberRDD I created a new RDD which is called filter1 RDD. This step is called a Transformation step.

Now, are we printing any output here? No, we are only keeping the data in memory. In Java we used to use the print statement; in Spark we don't have a print statement, but instead we have a collect statement. So if I want to print B4, B5, B6, which means I want to print the filter1 RDD, I can write filter1.collect, and this will bring B4, B5, B6 back to your sc, to the driver. This thing you are doing here with collect, whenever you are fetching or printing any output, we call it an Action. So this step in Spark is called an Action. This is how you work on it: there are two major kinds of steps, one is a Transformation, where you convert one form of RDD into another form of RDD, and the second is called an Action, where you get your output. These are very important points to keep in mind while working on Apache Spark.
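Continuing the earlier sketch, this is roughly how the transformation, the lineage and the action look together in code; toDebugString is just one way to peek at the lineage Spark is tracking.

```scala
// transformation only: nothing has executed yet, Spark has just recorded the lineage
val filter1RDD = numberRDD
  .flatMap(_.split(","))
  .map(_.trim.toInt)
  .filter(_ < 10)

println(filter1RDD.toDebugString)        // shows the DAG / lineage: filter1 -> number -> F.txt

// action: triggers the actual execution and brings the results back to the driver
filter1RDD.collect().foreach(println)
```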
Let's go back to our slides. Any questions on this before I move on? I can always come back to it. Everyone clear about this step? Good. Moving back: if you notice here, this is the same thing we just discussed, Batch and Real Time processing. This is how it is done, and this is what I just discussed about Spark, that Spark provides Real Time processing. So basically RDD creation starts with a transformation? Yes, yes. Now, faster processing is the part which we have just discussed, and you can also see that it is very easy to use, very easy to use in comparison to MapReduce. If you have already done MapReduce programming, or if you remember that apple, orange, banana example, the way we did it here is definitely much simpler in comparison to your MapReduce program. If you see the MapReduce code, it is complicated in nature, but the Spark program is very simple to look at; that is the reason your Spark programs are very simple in nature.
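To give a feel for that simplicity, this is roughly what the whole apple/banana/orange word count looks like in Spark (Scala). This is a generic sketch, not code shown in the session, and the file name fruits.txt is an assumption; reduceByKey does the sort-and-shuffle for you.

```scala
// word count in a few lines: map emits (word, 1), reduceByKey shuffles and sums
val counts = sc.textFile("fruits.txt")      // assumed input file with apple/banana/orange words
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)           // e.g. (apple,3), (banana,2), (orange,1)
```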
Now moving further, let's understand some Spark success stories. There are a lot of people who are using Spark these days. If we talk about the stock market, the stock market is using Apache Spark a lot because of its faster processing capability, because it is easier in nature, plus a lot of other things which are available (mumbling). Twitter sentiment analysis: maybe something is trending, and based on that some company wants to make some profit out of it, maybe they start doing a campaign based on that. Banking credit card fraud detection: I already gave the example of credit cards, where let's say some fraud is being detected, maybe they detect that a transaction does not sound genuine; can we handle this with MapReduce? It is practically impossible, because MapReduce cannot even perform real time processing, and secondly, even if we try to apply it on historical data it will be slower; that's the challenge there. In the medical domain also we apply Apache Spark a lot. So these are the areas where Apache Spark is getting used.
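The session does not show any streaming code, but just to give a flavor of how that real time side looks, here is a minimal Spark Streaming sketch; the socket source on localhost:9999 and the "FRAUD" marker are purely my assumptions for illustration.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// micro-batches of 5 seconds on top of the existing SparkContext
val ssc = new StreamingContext(sc, Seconds(5))

// assumed source: a text stream on localhost:9999 (e.g. started with `nc -lk 9999`)
val events = ssc.socketTextStream("localhost", 9999)

// flag suspicious events as they arrive, instead of waiting for a batch job
events.filter(line => line.contains("FRAUD")).print()

ssc.start()
ssc.awaitTermination()
```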
Talking about Spark now: we have already discussed what Spark is, and so far we have seen the Real Time processing side and everything. Apache Spark is an open source cluster computing framework; it is available to you for free, of course, you need not pay to work on it, and that is also one of the important reasons why Apache Spark is famous. You can perform Real Time processing, batch kind of processing, every kind of processing on it. You can do parallel programming, you can do data parallelism, and it also takes care of fault tolerance; we have already seen that in the resilient part. It is reliable, and that notion is also called fault tolerance. What will I get as an output if I use the collect function just after creating the first RDD? It will just print out the original data, more or less. In fact I will do a practical execution and show you one example afterwards, so that you are clear about how exactly it is done, how you can load the data and how you can see the data inside; I will show you a practical also, just within a few minutes.

Great, now let's move further. So this is about Apache Spark, and it is very easy for me to explain all these things now because we have already seen them. Is Spark always used with Hadoop, or can it be used standalone? Yes, that is a fact, you can use it standalone as well. There is no need for a Hadoop cluster, you can simply create a Spark setup on your own simple Windows machine and start working on it without requiring anything else; you can keep the file locally and work on it, that is the fun part about it, you need not require HDFS at all. I will show you one example of that also, so that you are clear how I can use Apache Spark in a standalone way; I do not even require HDFS to be connected, that's a fun fact.

So many advantages you can make out on your own: Spark gives almost 100x faster speed. Don't you think that is an awesome speed? 100x; I am not talking about double or triple the speed, I am talking about 100x times faster, which makes Spark very powerful. You might be hearing a lot that many companies are migrating from MapReduce to Apache Spark; why? I hope you got your answer: it is simple, as well as making your processing speed so fast. Caching is also very powerful; what exactly persistence is and all, we will go into detail in the regular sessions, but we can cache your data in memory, which is quite helpful in most of the cases.
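A tiny sketch of that caching idea, reusing the numberRDD from before (the storage level shown is just one common choice):

```scala
import org.apache.spark.storage.StorageLevel

numberRDD.persist(StorageLevel.MEMORY_ONLY)   // or simply numberRDD.cache()

numberRDD.count()     // first action computes the RDD and keeps it in memory
numberRDD.collect()   // later actions reuse the cached copy instead of re-reading F.txt
```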
You can deploy your application through YARN or as a standalone cluster. Now this is a very good feature: let's say you have already configured your Hadoop and all; you need not change your cluster specifically for Apache Spark, the plus point is that whatever you are using for your MapReduce you can use for your Apache Spark. Similarly, Spark can be programmed in multiple programming languages, like R and Python (mumbling); there are a lot of languages, Java can also be used, so these four languages are used at the current moment. On the question asked: they are both pretty much the same, sort of.

Now moving further, let's see how we use Hadoop together with Apache Spark. Spark with HDFS makes it more powerful, because you can execute your Spark applications on top of HDFS very easily. The second thing is Spark plus MapReduce programming: Spark can be used along with MapReduce programming in the same Hadoop cluster, so you can run some applications with MapReduce and, in the same cluster, you can run your regular Spark applications; there is no need to change anything. That is one of the powerful things: you need not create a separate cluster for Spark and a separate cluster for MapReduce. Similarly, as I just explained, even if you have already configured YARN, you can use it for Apache Spark. This is very powerful, because usually all of the older MapReduce applications were deployed on YARN, and now Spark can take advantage of that; so for companies who want to migrate from MapReduce to Apache Spark it makes life very easy, because you can just directly start, you need not change the cluster manager, you can directly start working on it. For people who do not know what YARN is, just a brief about it: it is a cluster resource manager. Let's see a few more things. Now, what happens with Spark?
With Hadoop you can combine both the things, that was one point. Spark is not intended to replace Hadoop, keep this in mind; in fact you can take it as an extension of your Hadoop framework. People have this confusion a lot, they say that Spark is going to replace Hadoop; no, it is not going to replace it, because you are still using all those things, you are using HDFS, you are using YARN, it is just that the processing style is changing. So Spark is not going to replace Hadoop; in fact you can call it an extension of the Hadoop framework.

Second, when we talk about Spark with MapReduce, they can also work together. They are very rare applications now, but there can be applications where some part of the code is written in Spark and some part of the code is written in MapReduce; this is all possible. So let's say a company is transforming its code from MapReduce to Apache Spark; they require time, so maybe the part of the code which is really important for them they can start processing with Apache Spark, and the rest of the MapReduce code they can leave as it is. You can keep slowly converting like that, because combined they can also work. So if you use Spark standalone, it does not provide any distributed storage? Definitely, I mean, if you are using it standalone, let's say you are not using HDFS, then in that case you are not leveraging distributed storage; Apache Spark will be working as a single process.
Now moving further, what are the important features of Apache Spark? Definitely the speed; polyglot, and polyglot means the multiple languages which you can use, Scala, Python, Java, R, so many languages. You can perform so much analytics with in-memory computation: when we are executing everything in memory, this is called in-memory computation. You can integrate it with Hadoop, and you can also apply machine learning, and this makes Apache Spark very powerful, so powerful that even Hadoop's own offering does not compare. In Hadoop we have Mahout; is there anybody who has not heard about Mahout? I hope everybody has, but if not, let me just explain: Mahout is a MapReduce-based programming framework which is used to write your machine learning algorithms, so you can write your machine learning algorithms with Mahout, and Mahout somehow converts the problem into MapReduce terms and gets it done. But now, MapReduce itself is slower, plus machine learning algorithms are highly iterative in nature, and because of this your execution becomes very slow in Mahout: machine learning algorithms are already iterative and slow, plus MapReduce programming is slower in nature, and because of that Mahout sometimes takes a very long time to give an output; I am not even talking about minutes, sometimes even a smaller data set can take hours to execute.

Now this was a major problem with Mahout. What did Spark do? Spark came up with a very famous library called Spark MLlib, which is a substitute for Mahout. In MLlib all the processing is going to happen in memory, so there will be no input-output operation; even the iterations will be happening in memory, and this makes things very fast. Because of this, the MapReduce-based approach which was used by Mahout, people stopped using it; in fact, the core developers of Mahout themselves migrated towards MLlib. Even if you talk to those core developers of Mahout today, they themselves recommend that if you want to execute machine learning programs, better execute them in the Spark framework, using Spark MLlib, rather than executing them in Hadoop. So that is the reason that for machine learning algorithms on Big Data, everyone is moving towards Spark MLlib.
When we talk about the features of Spark: Spark can run up to 100x faster — we already know why, we have already discussed the in-memory speed. When we talk about being polyglot, we just discussed that you can write in Scala, Python, Java and other languages; many languages are supported. Now the next Spark feature is important: Lazy Evaluation.
Let me take you back to the slide for this one. What actually happens, how does this execution work? First of all, it is not the case that as soon as you hit sc.textFile the data is immediately loaded into memory; it does not work like that. In fact, as soon as you hit that line it creates the num RDD with its blocks — B1, B2, B3 — but they are empty initially, they do not hold any data. Then, when you create the filter RDD from num, it again creates blocks B4, B5 and B6, but they are all empty too, there is no data inside them. But as soon as you call filter1.collect, what happens? The call goes to filter1, which is nothing but B4, B5, B6, and says: I want to print your data. Filter1 says: I don't have any data, I am currently blank; so filter1 goes and requests the num RDD for the data. B1, B2, B3 are also empty at that point, so they also say: I am blank; the request goes all the way back to f.txt, f.txt loads the data into num, num feeds the data into filter1, and then filter1 gives the output. This is called Lazy Evaluation: until you hit an action, nothing is printed and no execution happens beforehand. All the execution starts only at the moment you hit an action. If you are coming from a Pig programming background you might have already seen this behaviour: until you issue a DUMP statement, nothing executes beforehand. Why lazy evaluation? Because we do not want to occupy memory unnecessarily: until we actually want to display something, no execution happens, so the data does not sit in memory for no reason. That is what is called Lazy Evaluation.
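(For reference, here is a minimal Scala sketch of that flow, assuming f.txt simply holds one number per line and that the cutoff of 10 matches the earlier example; it is an illustration, not the exact code from the slide.)

    // Nothing is read yet: textFile only records the plan for the num RDD (B1, B2, B3 stay empty).
    val num = sc.textFile("f.txt")

    // Still nothing executes: filter1 is another lazily defined RDD (B4, B5, B6 stay empty).
    val filter1 = num.filter(line => line.trim.toInt < 10)

    // collect is an action, so only now does Spark load f.txt, run the filter
    // and bring the matching values back to the driver.
    filter1.collect().foreach(println)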
Let us come back to the slides. The next property is Real Time computing: as and when the data arrives you can immediately start processing it in memory; that is the fourth property, which we have already seen. The fifth: you can integrate with HDFS and you can integrate with MapReduce, the same thing we discussed, and you can also perform machine learning on top of it. So this is how you apply your machine learning; these are the major features of Spark. Now let's take a break, and after that I will talk about the ecosystem, because that is a detailed topic where I need to spend a good amount of time. After the break there are still a lot of topics to cover; we will also do a practical, followed by a project at the end — we will walk through what kind of project you will build in the later Apache Spark sessions. So let's do all of that after the break; let's take a break of 10 minutes and be back by 4:30, friends. We will start with the ecosystem and the practicals, and this is going to be very important, so please be back by 4:30. So, is everyone back? Can I get a quick confirmation that everyone is back and able to hear me? Good. Let's move further.
Now, everything we have been working on so far — for example creating RDDs — is part of Spark Core. Spark Core is the main engine, and on top of it all the other libraries are built. For example, we have Spark SQL: there you can write a query in SQL style, and internally it gets converted into Spark code, which means the computation still happens in memory on RDDs. The second component is Spark Streaming; this is the major component that makes real-time processing possible, so Spark Streaming helps you perform real-time processing. Then Spark MLlib, for the machine learning algorithms I discussed when talking about Mahout: Spark MLlib is largely a replacement for Mahout, because algorithms that were taking hours with MapReduce on Hadoop take only minutes or seconds in MLlib — that is a major improvement and that is the reason people are shifting towards it. Then there is GraphX, where you can perform graph-style computation; for example, friend recommendations on Facebook: internally a graph is generated and that is how the suggestions are produced. Any graph sort of computation is done using GraphX. SparkR is a newly added member; it is still being worked on in these versions. R is an open-source language used by analysts, and the Spark community wants to bring all those analysts onto Spark, so they are working hard on SparkR. It is already available, and it is going to be the next big thing in the market.
Now, how does this ecosystem look in a bit more detail? For example, when we talk about Spark SQL: most of the time every computation happens on RDDs, but in Spark SQL we have something called a DataFrame. A DataFrame is very analogous to an RDD; the only difference is that the data sitting in memory is in tabular format, so along with the row information you also have column information. That is the reason we do not call it an RDD — we call it a DataFrame.
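(A rough Scala sketch of that difference, assuming a Spark 1.x shell where sqlContext is pre-created and a hypothetical people.json file.)

    // An RDD is just a distributed collection of records with no schema attached.
    val rdd = sc.textFile("people.json")

    // A DataFrame carries column information (a schema) on top of the same distributed data.
    val df = sqlContext.read.json("people.json")
    df.printSchema()   // column names and types
    df.show()          // tabular view: rows plus columns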
Similarly, on the machine learning side we have something called an ML pipeline, which makes it easier to combine multiple algorithms into a single workflow; that is what the ML pipeline gives you on top of MLlib. Now, Spark Core: as we already discussed, any data residing in memory is called an RDD, and that is what the Spark Core component handles. It lets you work on a large-scale parallel system, because the data ends up distributed across the cluster, so the computation also happens in parallel. That is your Spark Core component.
When we talk about the architecture of Spark, you can relate it to Hadoop: the driver program sits on the master machine, and on the master your SparkContext runs; similarly, the Worker Node plays the role of a data node, and in Spark we simply call them worker nodes. Now, there must be a place in memory where you keep your blocks; that space in memory is called an executor. As you can see, there are two worker nodes here and each has an executor, meaning the space in its RAM where the blocks will be kept. The logic you run on an RDD — for example that map or filter to get all the values less than 10 — the code you execute on an RDD, is called a task. In the middle sits the cluster manager, for example YARN or Mesos, whichever you want to use: everything flows through your SparkContext, the cluster manager takes charge of the execution, and then inside the executors on your worker nodes the tasks run. You can also cache your data if you wish to, to speed up repeated processing.
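(A small sketch of that caching point; the file name is just an assumption.)

    val num = sc.textFile("a.txt")
    num.cache()   // ask Spark to keep this RDD in executor memory once it is computed
    num.count()   // first action: reads the file and fills the cache
    num.count()   // second action: served from the cached blocks, no re-read from disk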
Now let's talk about Spark Streaming. We have been saying for a while that Spark gives you a real-time kind of processing; what happens here is that as soon as data arrives, you split it into small micro-batches and immediately process each batch in memory. That is done with the help of Spark Streaming, and the sequence of micro-batches you create is called a DStream. We are only talking at a very high level about all of this, because we just want to give you an idea of how things work; when we go into the full sessions these topics are covered in depth. In just two and a half to three hours it is impossible to cover everything, so take this as an overview of all the topics. Someone asked: is it the same Spark engine underneath as for batch? Yes — the Spark engine does the work; Spark Streaming converts your incoming stream into those micro-batches and hands them over, helping you process the data. That is the role of the Spark engine here.
Coming back to Spark Streaming: as I was saying, you can bring real-time data in, and the data can be pulled from multiple sources. You can use Kafka, you can use HBase, flat files — any sort of source — and bring the data into the Spark system in real time. After that you can apply anything on it: you can apply Spark SQL, meaning run SQL on top of it; you can run your machine learning code on top of it; you can apply plain RDD code on top of it — anything — and store the output back into HDFS, a database, Kafka, Elasticsearch, wherever you want. The main point is that as the data arrives in real time, you can immediately start processing it, and the other libraries can pick it up immediately and act on it. This is the standard picture: you pull the data from Kafka, HDFS/S3 or any other source, bring it into Spark Streaming, then save it to HDFS, a database, or maybe a UI dashboard. So you take an input data stream, convert it into batches of small chunks of data, and the processing and the output both happen batch by batch. Each of those small batches of data can be thought of as a small RDD, which is why it is denoted here as a DStream: a sequence of small RDDs, each covering a short interval of time, with the outputs produced afterwards. That is a very high-level picture of how Spark Streaming works.
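(A minimal Spark Streaming sketch in Scala, assuming a plain socket source on localhost:9999 purely for illustration; the Kafka and HBase receivers mentioned above have their own connector APIs.)

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batches of 5 seconds: each batch becomes a small RDD inside the DStream.
    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // A classic streaming word count, applied to every micro-batch.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // keep the streaming job running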
Similarly, Spark SQL. This is a very powerful component because it can give you results very quickly: running SQL on top of Spark is what we call Spark SQL. Spark SQL can handle structured and semi-structured data, but it cannot handle unstructured data — since we are running SQL-style queries, it only makes sense for it to work on structured and semi-structured data. It supports various formats: you can bring data in from Parquet, JSON, Hive and so on; likewise you can mix your queries with DataFrames and convert to RDDs as well, so all of those things are possible in Spark SQL. The performance, if I compare it with Hive, is much higher: in the benchmark chart, the red bar is the Hadoop/Hive system, and you can easily see that Spark SQL takes far less time in comparison. That is the major advantage
of using Spark SQL. For connectivity it uses JDBC or ODBC drivers, so you can create connections to it from standard tools. You can also create user-defined functions, just like in Hive; you can do that in Spark as well. If a pre-built function exists you use it; if not, you create your own and then execute it. If you do not know what a UDF is: it is a concept from Hive, and a general one too, where you can create your own function, write your own code for it, and use it inside your SQL queries; that is called a UDF.
That is how your Spark SQL layer works. Now, what does the usual workflow look like? You have a data source from which you get the data; you convert it into the DataFrame API, which is analogous to an RDD but in tabular format, so it carries the rows as well as the column information; the Spark SQL service then runs the computation, and in the end you get your result.
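(A hedged sketch of that workflow in a 1.x shell; the people.json file, its name and age columns, and the toUpper UDF are all assumptions for illustration.)

    // Data source -> DataFrame (tabular, with column information).
    val people = sqlContext.read.json("people.json")
    people.registerTempTable("people")

    // A user defined function, just like in Hive.
    sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)

    // Run SQL on top of it; the result comes back as another DataFrame.
    sqlContext.sql("SELECT toUpper(name), age FROM people WHERE age > 20").show()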
So that is a high-level picture of how Spark SQL works. Now let's talk about MLlib, the machine learning library. There are two kinds of algorithms: supervised and unsupervised. In supervised learning you already know part of the output — you have labelled data — and using that you predict something new; in unsupervised learning you do not know anything about your data, you do not even have previous outputs, and you still want to extract some structure from it; that is unsupervised learning. MLlib handles both kinds: under supervised we have examples like classification and regression, and under unsupervised we have clustering, SVD and so on; all of these are available in the package. Someone is asking: is there any limitation here for Hive users? No, there is no such limitation, Sammy; you can execute everything that is available. In fact, apart from the SparkContext you also have a HiveContext, and if you want to execute your Hive queries you can do it with the help of that HiveContext. So there is no such limitation: you can keep writing your queries in Hive and execute them directly through Spark.
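(A small HiveContext sketch; the table name is hypothetical, and this assumes a Spark build with Hive support.)

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Existing Hive queries can run unchanged through Spark.
    hiveContext.sql("SELECT COUNT(*) FROM my_hive_table").show()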
Now moving further, what are the various data sources we have in Spark SQL? We just discussed this: we have Parquet, we have JSON; let me quickly go back and show you again — you can also get data from CSV, from HBase, and from databases such as Oracle and MySQL. All of these sources are available, so there are a lot of data sources you can pull from.
Now, what happens in classification, generally? Just to give you an example: you must have a spam folder in your Gmail — I hope everybody has seen it. When any new email comes in, how does Google decide whether it is spam or not? That is an example of classification. Clustering: you might have seen Google News — when you search for something it groups all the related news articles together; that is clustering. Regression is also very important: say you have a house you want to sell, and you have no idea what the optimal price should be; regression helps you estimate that. Collaborative filtering: you might have seen on Amazon that they show you a recommendation — "you may want to buy this because you bought that" — and that is done with the help of collaborative filtering, which is the algorithm used for recommendations.
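(As a taste of MLlib, here is a clustering sketch with KMeans — the unsupervised, Google-News-style grouping mentioned above. The input file and the parameters are assumptions.)

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each line of the (hypothetical) file holds space-separated numeric features.
    val points = sc.textFile("kmeans_data.txt")
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
      .cache()

    // Group the points into 3 clusters; the iterations run in memory rather than as MapReduce jobs.
    val model = KMeans.train(points, 3, 20)
    model.clusterCenters.foreach(println)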
GraphX is another important component: you can model and solve graph-style problems with it. There are a couple of concepts: you have vertices — Bob, Carol and so on in this picture, which you can also call nodes — and the connectors between them are called edges, which denote the relationships. If there is an arrow on the edge, it is a directed graph, just like what we saw with the lineage earlier. What are the use cases? There can be many, but let's see a few examples. All of you must have used Google Maps: at the back end, with graph processing of the kind GraphX does, when you search for a route it does not just look at one path — it explores multiple paths and shows you the most optimal one, maybe by time or maybe by distance, and computing over all those paths quickly is exactly this sort of graph computation. Similarly there are many examples in recommendations: Twitter or LinkedIn give you friend or connection recommendations, and that also works on a graph — they generate the graph and, based on it, compute and give you the output. So GraphX is a very strong library available to us.
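(A tiny GraphX sketch using the Bob/Carol style vertices from the slide; the third vertex and the relationship labels are made up.)

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices: (id, name). Edges: directed connections carrying a relationship label.
    val vertices = sc.parallelize(Seq((1L, "Bob"), (2L, "Carol"), (3L, "Alice")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

    val graph = Graph(vertices, edges)
    // How many incoming connections does each person have?
    graph.inDegrees.collect().foreach(println)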
Now, before I move to the project, I want to show you some practical work: how we execute things in Spark. Let me take you to the VM; these machines are provided by Edureka, so you need not worry about where to get the software from or how to install it — everything is taken care of by Edureka. Once you come to it you will see a machine like this, basically a blank desktop. To start working, you open the terminal by clicking the black terminal icon. After that, how do I work with Spark? To execute a Spark program using the Scala language you enter spark-shell. If you type spark-shell, it takes you to the Scala prompt, where you can write your Spark program using the Scala programming language. Notice that it also prints the Spark version — 1.5.2 in my case — so that is the version of Spark you are on. You can also see that a SparkContext is available: when you get connected to the Spark shell, sc is available to you by default; let it connect, it takes some time. Now we are connected to the Scala prompt. If I want to come out of it I just type exit and it drops me out of the shell. Secondly, I can also write my program in Python: if I want to program Spark with the Python language I connect through PySpark, so I just need to type pyspark. I am not going to connect to it now because I will be explaining everything in Scala, but if you want to, you can type pyspark. So let's get connected to spark-shell again.
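(For reference, the two shells are launched from the terminal roughly like this; this assumes Spark's bin directory is on the PATH, as it is on the course VM.)

    spark-shell   # Scala prompt with a SparkContext (sc) pre-created
    pyspark       # the same, but with a Python prompt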
Meanwhile, while this is connecting, let us create a file. Currently, if you notice, I already have f.txt; let's say I do cat a.txt and I have some data: one, two, three, four, five. This is the data I have with me. Now let me push this file to HDFS. First, let me quickly check whether it is already available in the distributed file system: hadoop dfs -cat a.txt. Okay, there is no such file, so let me put it there: hadoop dfs -put a.txt, which places it in the default HDFS location; now if I want to read it from HDFS I can cat that path. I am assuming you are already aware of these HDFS basics; you can see the one, two, three, four, five is now coming from the Hadoop file system.
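(The terminal steps just described look roughly like this; hadoop fs is the newer spelling of the hadoop dfs command used in the demo, and the destination is simply the HDFS home directory.)

    cat a.txt                    # local file containing: 1 2 3 4 5
    hadoop fs -cat a.txt         # check HDFS first: the file does not exist yet
    hadoop fs -put a.txt a.txt   # copy it into the default HDFS home directory
    hadoop fs -cat a.txt         # the same content is now served from HDFS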
Now I want to use this file inside Spark; how do I do that? Let me come back to the shell. In Scala we do not declare types like integer or float the way we do in Java, where you write something like int a = 10; in Scala we do not give the type explicitly. Instead we use var: if I write var a = 10, Scala automatically identifies that it is an integer value, and if you notice, it tells me that a is of type Int. If I want to update this value to 20 I can do that; but if I try to assign "ABC" to it, that throws an error, because a is already inferred as an integer and I am trying to assign a string to it — that is why you get the error. Similarly, there is one more keyword called val: val b = 10 works exactly the same way but with one difference — if I now do b = 20 you will see an error, and the reason is that when you define something as val it is a constant, not a variable anymore, so you are not able to update that value. So this is how you write it in Scala: var for a variable, val for a constant.
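(The same point written out as a small Scala sketch.)

    var a = 10      // the type Int is inferred automatically
    a = 20          // fine: a var can be reassigned
    // a = "ABC"    // compile error: a is already an Int

    val b = 10      // a val is a constant
    // b = 20       // error: reassignment to val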
Now let's use this for the example we just learnt. Say I want to create an RDD: val num = sc.textFile("a.txt") — remember, this is the API we have already seen for reading a file. If I give it a.txt, it creates an RDD, and Spark tells me it created an RDD of String type. Now if I want to read this data I call num.collect, and it prints the values that were in the file; the line you see here is coming back from memory, which is why it shows up in this particular way. So that is how you perform this step.
Now, the second thing: I told you Spark can also work as a standalone system. Right now we executed this against HDFS; if I want to execute it on my local file system instead, can I do that? Yes, you can. The difference comes in the path you give: instead of giving the path as before, you put the file keyword in front of it and then give your local path — for example /home/edureka/ is a local path, not an HDFS path — so you write file:///home/edureka/a.txt. If you give this, the file is loaded into memory, but from your local file system rather than from HDFS; that is the difference, and in this second case I am not even touching HDFS. Now, can you tell me why this just failed saying the input path does not exist? Because I made a typo in the path. And notice why I did not get this error earlier, at the time of textFile: the file did not exist, but I still did not get any error, because of lazy evaluation. Lazy evaluation meant that even though I had given a wrong path, Spark just created an empty RDD and did not execute anything, so you only see the output — or the error — when you hit the action, collect. To fix it I correct the path to edureka, and this time when I execute it, it works; you can see the output one, two, three, four, five.
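(Side by side, the two ways of pointing textFile at the data; the /home/edureka path is the one used on the VM and is an assumption here.)

    // From HDFS (the default when Spark is configured on top of Hadoop):
    val fromHdfs  = sc.textFile("a.txt")

    // From the local file system, using the file:// prefix:
    val fromLocal = sc.textFile("file:///home/edureka/a.txt")

    // A wrong path only fails when an action runs, because of lazy evaluation:
    fromLocal.collect()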
So this time it worked fine; you should now be clearer about lazy evaluation and what happens when you give a wrong file name. A question: suppose I want to use Spark in production but not on top of Hadoop, is that possible? Yes, you can do that; usually that is not what you do, but if you want to, you can. There are a lot of options — you can also deploy it on an Amazon cluster, for example. How will it be distributed in that case? You will rely on some other distributed storage; you will not get HDFS and its redundancy, but Amazon's storage is enough for that, so that is how you would use it. So this is how you perform these steps in Spark; I will keep explaining how you work with it as we go, but this is how things work. Now let us see an interesting use case; for that, let us go back to the slides. This is going to be very interesting: the use case is earthquake detection using Spark.
You might have seen that many earthquakes strike Japan; even if you have not experienced one, you have certainly heard about it. So how do we approach that problem with Spark? I am just going to give you a glimpse of the kind of problems we solve in the sessions; we are not going to walk through it in full detail, but you will get an idea of how powerful Spark is — just a little background here, and all of these topics are covered properly during the sessions. As everybody knows, an earthquake is a shaking of the surface of the earth: your home starts shaking, all those events happen. If you are from India you might have seen the recent earthquake that originated in Nepal, and even a couple of days back there was another incident, so these quakes keep coming. Now, the important part: if it is a major event — a big earthquake, a tsunami, forest fires, a volcano — it is very important to estimate that it is coming; they should be able to predict it beforehand. It should not happen that only at the last moment, after the quake has already arrived and people are threatened, they react; they should be able to estimate and predict all of this beforehand. This kind of early-warning system is what Japan is using today, so this is a real-world use case I am presenting: Japan is already using this pipeline with Apache Spark to address the earthquake problem, and we are going to see how they are using it.
Let's see what happens in the Japan earthquake model whenever an earthquake is coming. For example, at 2:46 p.m. on 11 March 2011, the Japan earthquake early warning was issued. As soon as it was predicted, they immediately started sending alerts to schools, to lifts, to factories, to every station and through the TV stations; they informed everyone immediately, so the students in schools got time to get under their desks, and the bullet trains stopped before the shaking began — bullet trains run at very high speed, and they wanted to ensure there were no casualties because of that — so all the bullet trains stopped, and the elevators that were running stopped, because otherwise accidents could happen. Sixty seconds before the main shaking they were able to inform almost everyone: they sent the messages and broadcast on TV, all of it immediately, to all the people, so that whoever could receive it got at least this warning, and that saved a huge number of lives. How were they able to achieve that? All of this was done with the help of Apache Spark — that is the most important point here — and you can see that everything they are doing, they are doing on a real-time system. They could not just collect the data and process it later; they did everything in real time: they collected the data, processed it immediately, and as soon as they detected the earthquake they sent out the warning. This happened in 2011, and they keep using the system because Japan is one of the areas most frequently affected by all of this.
So, as I said, the main requirements: we should be able to process the data in real time; we should be able to handle data from multiple sources, because the readings can come from many different sensors and event feeds, on the basis of which we predict that a quake can happen; it should be easy to use, because if it is very complicated for the end user it becomes hard to actually solve the problem; and in the end we need to be able to send the alert messages out to everyone. All of those things are taken care of by Spark. Now, in an earthquake there are two kinds of waves: a primary wave and a secondary wave. The primary wave is the one that starts at the epicentre and expands outward first. The secondary wave is the more severe wave, which starts after the primary one; once it starts it can do the maximum damage, because the primary wave is only the initial wave and the secondary wave comes on top of it. There are more details to this which I am not going into here, but what we are going to do using Spark is compute our ROC for the prediction model.
So let's go and see on the machine how we will calculate this ROC — using which we will solve the problem later — with the help of the Spark system. Let us come back to the machine; in order to work on that, first exit from the shell. Once you have exited, here is what you are going to do. I have already created this project and kept it here, because we just want to give you an overview: let me go to my Downloads section, where there is a project with an src folder — this is your project. Initially you will not have all of these directories; if I go to my Downloads I have this project, and initially the target directory and the project directory will not be there. We will be using SBT; if you do not know SBT, it is the Scala build tool, which takes care of all your dependencies. It is very similar to Maven — if you already know Maven this will feel familiar — but I prefer SBT because it is easier to write than Maven. You will write a file called build.sbt. In this file you give the name of your project,
the version, the Scala version you are using, and the dependencies you need along with their versions. For example, for Spark I am using version 1.5.2, so you are telling SBT: whatever I write in my program, if I require anything related to Spark Core, go to org.apache.spark, download it and install it; if I require any dependency for a Spark Streaming program, for this particular version 1.5.2, go to the same repository and fetch that; and the same kind of entry for Spark MLlib.
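(A build.sbt along those lines might look like this; the project name and Scala version are assumptions, while the Spark version matches the 1.5.2 used in the demo.)

    name := "EarthquakeDetection"
    version := "1.0"
    scalaVersion := "2.10.5"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.5.2",
      "org.apache.spark" %% "spark-streaming" % "1.5.2",
      "org.apache.spark" %% "spark-mllib"     % "1.5.2"
    )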
Once you have done this, you create the folder structure: you need to create an src folder, inside it a main folder, and inside that another folder called scala. Inside that you keep your programs; here you can see streaming.scala, network.scala and r.scala — let's treat them as a black box for now; that is where the code to solve this problem statement is written. Then you come out of there, go to your main project folder, and from there run sbt package. According to your build.sbt, it will start downloading whatever dependencies your program requires — Spark Core, Spark Streaming, Spark MLlib — and install them. I am not going to execute it now because I have already done it before and it takes some time. Once you have built the package, you will find that the target and project directories have been created.
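(The resulting layout and the build step, roughly; the folder name r2 and the file names are the ones shown in the demo.)

    r2/
      build.sbt
      src/
        main/
          scala/
            streaming.scala
            network.scala
            r.scala

    sbt package   # run from the project root: resolves the dependencies and builds the jar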
Once that is done, you go to Eclipse. Let me open my Eclipse — I already have the program in front of me, but let me tell you how you bring it in: go to the Import option, choose "Existing Projects into Workspace", then select your main project — for example this r2 project — and click OK; the project will then appear in the workspace. Next, go to src/main/scala; ignore the other programs, I only need r.scala, because that is where I have written my main function. Once you open it, choose Run As > Scala Application and the code will start executing. Let's see the output: once it has finished executing you can see the area under ROC printed, and all of this is computed by the Scala Spark program. Similarly, there are other programs in the project that help stream the data in; I am not walking through all of that.
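(The area-under-ROC figure printed by the demo could be produced with MLlib's evaluation API; this is a sketch under the assumption that the model emits (score, label) pairs, not the project's actual code.)

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.rdd.RDD

    // scoreAndLabels: (predicted score, actual label 0.0 or 1.0) pairs produced by the model.
    def areaUnderRoc(scoreAndLabels: RDD[(Double, Double)]): Double =
      new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()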
Now let's come back to the slides and see what the next step is. You can see a sheet gets created where I keep my ROC values, and after you have computed the ROC you generate a graph. Now, Japan is one of the areas most affected by earthquakes, and the issue is that you do not want to start sending alerts even for a minor tremor. In fact, the buildings and infrastructure in Japan are built in such a way that if any earthquake below magnitude six hits, the homes are designed so that there will be no damage. That is the important point when you work with the Japanese model: below magnitude six they are not even worried, but above six they are. So you generate a graph — you can do that with Spark as well — and once you have this graph, anything going above six means we should immediately start sending alerts. If you connect this to the program we just created and executed, and visualize the same result, this is what is happening: the chart shows my ROC, and if the predicted earthquake is going to be greater than magnitude six, then and only then raise the alarm and alert all the people; otherwise stay calm. That is the kind of project we generally build in our Spark course.
It is not the only project; we also build multiple other projects — for example, a model of how a retailer like Walmart analyses whatever sales are happening using Apache Spark and, at the end, visualizes the output of that analytics. We walk you through all of those when we do the full course; you learn everything there and see how all these projects use it. Right now, since you do not yet know every topic, you may not get 100% of the project, but once you know each and every topic you will have a clear picture of how Spark handles all these use cases. So that is what we wanted to discuss in this second part. I hope this session was useful for all of you, that you got some insight into how Apache Spark works, why we go with Apache Spark, what the important components are and why they matter.
Any questions from anyone? Please ask. Someone asks: is Apache Spark real time or near real time? Usually we call it near real time: you can process data as it arrives, but there will always be some delay. Even my voice is reaching you after a few milliseconds, and when you look at my screen you are not seeing it at the exact instant it happens; strictly speaking, exact real time cannot be achieved — there will always be at least a minor delay, and that is what we call near real time. That is generally what we design. Any other questions from anyone? (Attendee: this session was very helpful, I learned a lot today, thanks.)
If you want to learn any of this in more detail you can get in touch with Edureka; I am one of the trainers there. And let me tell you, this is one of the hottest topics in the market right now and there are so many jobs available — do not just go by my word, go and explore yourself and you will see how many Big Data jobs there are; that is the reason a lot of people are moving towards Apache Spark. I have had so many students learn it and make a shift in their careers, and a lot of people have successfully got jobs in this domain. So thank you everyone for making this interactive; I hope you loved this Edureka session, and I would love to see you in another Edureka session sometime. Thank you, everyone. I hope you enjoyed listening to this video; please be kind enough to like it, and you can comment with any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to our Edureka channel to learn more. Happy learning.


100 Replies to “Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka”

  1. Hi, I have few questions:
    1.) Its about difference between Hadoop and Spark, you told that there are lot of i/p o/p operations in hadoop whereas in spark you said it happens only once when blocks are copied in memory and rest of the operations are performed in memory itself, so i wanted to ask when entire operation is completed so i/p o/p operation might be again required to copy the result to disk or result stays in memory itself in case of spark?

    2.) Also, when we use map and reduce functions in spark python, how does those things works then? All the map operations are done in memory like that of hadoop? but what about reduce thing as reduce will merge result of two blocks so, don't you think that again network overhead will occur when we pass data from another disk to the disk in which we need to do reduce operation and the that disk will again copy that data to its memory? Can you explain how exactly it will work in case of spark?

  2. hey, could you please help ?
    I want to create a sample application using Storm 1.0.2/1.0.3.
    I tried, but I got an exception

    [{:type java.io.FileNotFoundException
    :message
    C:UsersjkumarAppDataLocalTemp93b802d6-1143-492d-bff9-5e6552167ec0workers-usersd7e0ca4f-6892-465f-9775-591b3e2e071c
    (The system cannot find the path specified)
    :at [java.io.FileInputStream open0 nil -2]}]

    please help me

    Thanks in advance

  3. Hi, I don't have knowledge of Hadoop and I am willing to learn Spark. Please let us know the details: when I join your institute's Spark course, can I get a clear understanding of the Spark topics, or do I need to do Hadoop training first and then take the Spark training? Also let us know the duration of the courses and the timing details. Kindly share the contact details as well.

  4. Hi.. 🙂

    I have a question, I heard Apache PIG is also used for processing Live Streaming Data.
    Can you please confirm if I am correct? and in this session trainer said only Apache Spark is the tool which supports Live Streaming Data. Which context is correct. Please clear my confusion.

    Regards
    Santhosh

  5. This is the best spark demo I have ever heard. Very clear and planned way of explaining things! Have taken up Hadoop basics classes with Edureka, which are great! Planning to enroll for spark as well. Would you explain more realtime use cases in spark training? Hadoop basics doesn't have use case explanation, which is the only drawback of the course! Great going , thanks a lot for this video.

  6. Very good Explanation. Awesome content.

    I have a question.

    When Map function is executed the results are given as a block in memory. This is fine. In the example provided in the video, the map function doesn't require any further computation( since the job is to take numbers less than 10). What about for a job like Word count.
    1. How would the output of the map function be?
    Is it same as Map function in MapReduce (apple,1 (apple,1) (apple,1) (banana,1),(banana,1),(banana,1),(orange,1),(orange,1),(orange,1))?
    Or we can write the code for reducing also in the same map function giving output as ((apple,3) (orange,3)(banana,3))??

    2. And are the blocks from each data node will be sent to a single data node to execute the further computation?? (as in reduce in map reduce)??

    Thanks in Advance

  7. Hello. This is very informative. I think the resiliency concept which you explained here is a bit improper. Resiliency in Spark is with respect to the lineage and not the replication factor as RDD can be written at the speed of DRAM so there is no need to replicate each byte. Awaiting your reply.

  8. I completed hadoop coaching few days back. I would like to learn spark and scala .Is this 39 videos good enough for Spark AND Scala Training?

  9. So far this is 4th course I am watching, Instructors from Edureka are amazing. Very well explained RDD in first half. Worth watching !!!

  10. I guess the "collect" keyword is because it collects the processed data from all the blocks and throw it to the output

  11. Excellent session…very informative..trainer is too good and explained all concepts in detail…thanks lot

  12. Its an amazing video . Gives a complete concept of spark as well as its implementation in real world. Thanks

  13. Cheers to Edureka ! Very Well explained . Please Upload " Using Python With Apache Spark " Videos too !!

  14. Loved your video. Explained the basic details in a best possible way. Would wait for your new videos on this topic..Can you share the github link for the earthquake project?

  15. Nicely explained, i am in the process of learning Machine learning algorithm in Python & R. I may have to learn Spark in future 🙂

  16. Hi, It is very much informative lecture …I have a plan to write my thesis in apache spark …could please suggest me good topic ..please it will be a great help thanks.

  17. thank you so much for this wonderful tutorial.. I have a question.. while discussing about lazy evaluation, you mentioned that for B1 to B6 RDD memory is allocated, but they remain empty till collection is invoked. My qs is.. what is the size size of the memory that is allocated for each RDD? How does the framework predict the size before hand for each RDD without processing the data? eg, B4, B5 , B6 might have different sizes and smaller or equal to B1, B2, B3 respectively… I didn't get this part. Could you please clarify?

  18. What are the chances for someone with Linux background and basic Java knowledge but not software programming experience to get a job as Spark after taking online training and passing the Apache Spark certification. have someone here got a job without experience?

  19. One of the best video I ever watched.  MapReduce was not explained in this way wherever i checked.   Really thank you to post this.  Use Cases are really good.  Worth the time watching almost 2 hrs.  5 star to you the instructor.  Very impressed.

  20. Great session ..very informative .. Can you please share the sequence of videos in Apache Spark and Scala learning playlist.. Thanks in advance

  21. Good lecture. An action is a trigger for lazy eval to start right? .collect() is not equivalent to printing..

  22. 103 history students spotted who do not know anything about computers and are down-voting this excellent post without any reason 😛

  23. This makes things more clear after my Data Science class lol. Thank you so much for a great tutorial, I think this will sharpen me up.

  24. Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Apache Spark Certification Training Curriculum, Visit the website: http://bit.ly/2Q2PV3x

  25. HI all i had a doubt i had a 1 PB data to be processed in Spark. If i am trying to read whether 1PB of data will be stored in memory are not how it will process could anyone please help me,

  26. Thank you for guiding students like us sir. Appreciate your knowledge and ability to pass it to us. It was a great session.

  27. Loved the way how the trainer explained about it. Watched for the first time and it cleared all my doubts. Thanks, edureka.

  28. 26:34 Why do we consider the step for incrementing our indicies as a bottle neck, but we don't consider sorting as a bottle neck?

    EDIT: I think I understand the bottleneck. If we don't know what the all the possible words are, then we can't have a simple array index based counter. Instead we we would use a hashmap, and would need to check for the existence of the word in the hashmap.

    for each word in the file
    if the word is in our hashmap
    increment hashmap
    if the word is not in our hashmap
    create the index, and increase the hashmap

    This is a looping bottleneck for sure

  29. From where i can read this kinda core information about Spark and Hadoop….any links or way to find documents…
