Text Representation Using Bag Of Words (BOW): NLP Tutorial For Beginners - S2 E3

Bag of words (a.k.a. BOW) is a technique used for text representation in natural language processing. In this NLP tutorial, we will go over how a bag of words works and also write some code for email classification that uses a bag of words and the Naive Bayes classifier in machine learning.
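
As a rough illustration of the idea (not the code from the video), a bag of words can be built by hand: collect the vocabulary, then count how often each vocabulary word occurs in each document. The sample messages and the helper name bag_of_words are made up for this sketch, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of text documents."""
    # Lowercase and split on whitespace; real tokenizers are more careful.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every word a fixed column position.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; absent words count as 0.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["free prize win free", "meeting at noon"]
vocab, vectors = bag_of_words(docs)
```

Each vector has one entry per vocabulary word, so word order inside a message is discarded and only the counts remain.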
Code: github.com/codebasics/nlp-tut...
Exercise: github.com/codebasics/nlp-tut...
⭐️ Timestamps ⭐️
00:00 Theory
08:00 Coding
Complete NLP Playlist: • NLP Tutorial Python
🔖Hashtags🔖
#nlp #nlptutorial #nlppython #nlpbagofwords #bagofwords #bagofwordsexample #bagofwordsusingnlp #bagofwordsnlp
Do you want to learn technology from me? Check codebasics.io/?... for my affordable video courses.
Need help building software or data analytics/AI solutions? My company www.atliq.com/ can help. Click on the Contact button on that website.

Comments: 36

  • @codebasics
    @codebasics 2 years ago

    Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

  • @harshalbhoir8986
    @harshalbhoir8986 1 year ago

    This was such a cool explanation. Thank you so much!!

  • @napoleanbonaparte9225
    @napoleanbonaparte9225 1 month ago

    So Sir, what we can get from this video is that we can find the precision % for spam and the number of spam messages in the list of emails. We can do that in two ways: 1. using train_test_split and MultinomialNB, or 2. using CountVectorizer directly inside a Pipeline(). So bag of words simply means collecting the same words together, as we collected spam here.
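
The two routes mentioned in this comment can be sketched side by side. This is a minimal sketch, not the notebook from the video: the four messages and their labels are invented, and the step names "vectorizer" and "nb" are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset: 1 = spam, 0 = ham.
messages = [
    "win a free prize now",
    "free cash offer claim now",
    "are we meeting for lunch today",
    "please send the report by monday",
]
labels = [1, 1, 0, 0]

# Way 1: vectorize explicitly, then train the classifier on the counts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(messages)
manual_model = MultinomialNB()
manual_model.fit(X_counts, labels)

# Way 2: let a Pipeline chain the same two steps.
pipeline_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
pipeline_model.fit(messages, labels)

new_message = ["claim your free prize"]
# Way 1 needs the text vectorized first; Way 2 accepts raw text.
manual_pred = manual_model.predict(vectorizer.transform(new_message))
pipeline_pred = pipeline_model.predict(new_message)
```

Both routes train the same model; the pipeline just saves the separate transform call at prediction time.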

  • @AmitKumar-BIDSP
    @AmitKumar-BIDSP 1 year ago

    Great presentation; Thank you

  • @tomernx5
    @tomernx5 1 year ago

    Great video! thanks!

  • @mohammadriyaz5586
    @mohammadriyaz5586 4 months ago

    Thank you so much, sir, for the easiest explanation.

  • @user-eu3xt2nk7v
    @user-eu3xt2nk7v 11 months ago

    Great work! Keep it up.

  • @ashishpanchal701
    @ashishpanchal701 1 year ago

    Hello Sir, thank you for being such a wonderful teacher!!! I just had a doubt about the Naive Bayes model built in this video: where have we used stemming or lemmatization, which would make it an NLP problem? If we have not used them, then won't it be a simple Naive Bayes machine learning problem?

  • @codebasics

    @codebasics 1 year ago

    It is not as if you have to do stemming etc. for it to count as an NLP problem. Here, as you can see, we got pretty good accuracy without stemming etc. Hence you can consider this both an NLP and a machine learning problem. In fact, ML is used to solve NLP problems, so NLP is at a higher level. You can definitely try stemming etc.; I would request you to build that out and see how the model performance improves.

  • @JakeThalacker
    @JakeThalacker 1 year ago

    Is there a simple way to edit this to use bigrams instead of single words?
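
One possible way to do this, assuming the video's vectorizer is scikit-learn's CountVectorizer, is its ngram_range parameter. The two sample documents below are made up for the sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win free prize", "free prize inside"]

# ngram_range=(2, 2) keeps only two-word sequences;
# (1, 2) would keep single words as well as bigrams.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(docs)

# vocabulary_ maps each learned bigram to its column index.
bigram_vocab = sorted(bigram_vectorizer.vocabulary_)
```

The rest of the training code should not need to change, since the classifier only sees a count matrix either way.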

  • @pineappleld
    @pineappleld 5 months ago

    In this example I can see the goal was to classify messages as either ham or spam. Is it possible to build something similar that classifies into four options?
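
MultinomialNB in scikit-learn accepts more than two labels, so a four-way version could look roughly like the sketch below. The four categories, the one-message-per-class training set, and the step names are all invented for illustration; a real dataset would need many examples per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data with four hypothetical categories.
messages = [
    "win free prize",
    "team meeting agenda",
    "invoice payment due",
    "family dinner sunday",
]
labels = ["spam", "work", "billing", "personal"]

four_way_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
four_way_model.fit(messages, labels)

# The classifier stores every label it saw during training.
known_classes = list(four_way_model.named_steps["nb"].classes_)
prediction = four_way_model.predict(["payment for invoice"])
```

Nothing in the pipeline changes compared with the two-class case; only the label column has more distinct values.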

  • @swanandAragade
    @swanandAragade 1 year ago

    Even if we give a spam mail as input, it doesn't detect it as spam; it only shows ham for all mail.

  • @nareshkumarvanga3127
    @nareshkumarvanga3127 2 months ago

    Thank you Guruji

  • @roopeshn9394
    @roopeshn9394 1 year ago

    Hi sir, if you can assist me with my issue in any way, please do. I have four or five columns of data in dict format with keys and values, and I need to make a sentence or narrative from this data. Is that possible? If so, please guide me, sir. Input: { "source": "Sanju", "type": "message.cloud.display.AUTO_SCALING", "value": "1" } Output: Sanju has value with this type of "message.cloud.display.AUTO_SCALING"

  • @pragtisood3239
    @pragtisood3239 1 year ago

    Doubt: Where can we get the CSV file from?

  • @vivekjha9952
    @vivekjha9952 2 years ago

    Hi Dhaval sir, I want to learn data science technology from you and be mentored too. Could you please provide guidance for an IT professional with 8 years of experience who wants to transition to data science? I am not able to figure out which institute to select.

  • @nemsingh6035

    @nemsingh6035 1 year ago

    Follow codebasics

  • @surinder3677
    @surinder3677 4 months ago

    Notes: 16:40 CountVectorizer

  • @hsekar6701
    @hsekar6701 10 months ago

    I'm unable to download the en_core_web_sm pipeline! Could anyone please help me?

  • @siddhanthardikar2468
    @siddhanthardikar2468 9 months ago

    Hello sir, I had one doubt: are fit_transform and transform the same? I ask because to transform X_train you used v.fit_transform(X_train.values), and for X_test you used v.transform(X_test). I hope you can clear this doubt for me. Thank you.

  • @siddharthrox

    @siddharthrox 8 months ago

    I don't know if you've figured this out by now or not. I'll share my understanding anyway. fit_transform will try to learn the vocabulary from the training data. After that it will create a matrix representation based on what it learned. But in transform, it is assumed that the learning has already happened and only a matrix representation needs to be generated. That is why you see that fit_transform is used with training data and transform is used with test data.
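
The difference described in this reply can be seen on a toy example. This is a sketch with made-up sentences; the variable name v follows the one quoted in the question above.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "lunch meeting today"]
test_docs = ["free lunch tomorrow"]

v = CountVectorizer()
# fit_transform learns the vocabulary from the training text and encodes it.
X_train_counts = v.fit_transform(train_docs)
# transform reuses that vocabulary; the unseen word "tomorrow" is ignored.
X_test_counts = v.transform(test_docs)

learned_vocab = sorted(v.vocabulary_)
test_row = X_test_counts.toarray()[0].tolist()
```

Here the test row only has counts for "free" and "lunch", because those are the only test words that were part of the vocabulary learned from the training text.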

  • @gopalpawar7352
    @gopalpawar7352 2 years ago

    Sir, please create a full-stack developer course, with videos and playlists...

  • @ayushgupta80
    @ayushgupta80 4 months ago

    Bag of words: the size of the vector equals the size of the vocabulary (all elements are 0 except for the words present in the statement). Sparse representation: it may consume too much memory and computing resources.
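
To make the point about vector size and sparsity concrete, here is a small sketch with made-up documents. It relies on CountVectorizer returning a SciPy sparse matrix, which stores only the non-zero counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now", "lunch meeting today", "free lunch"]
X = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix

n_docs, vocab_size = X.shape       # one row per document, one column per word
stored = X.nnz                     # non-zero counts actually stored
dense_cells = n_docs * vocab_size  # cells a dense array would need
print(f"{stored} of {dense_cells} cells are non-zero")
```

On a real corpus the vocabulary runs to thousands of words while each message uses only a handful, so the gap between the two numbers becomes much larger than in this toy case.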

  • @bhaskarbsarkar5232
    @bhaskarbsarkar5232 1 year ago

    Doubt: When vectorizing, we are taking X_train. According to my understanding, the vectorization builds a vocabulary w.r.t. the data given. So, is it better to take the whole X instead of X_train to build the vocab, and split into train and test after that? There is a possibility that some words would be in the test data and not in the train data. And when I took X for vectorization, the vocab size increased. So, what is the correct method here?

  • @codebasics

    @codebasics 1 year ago

    Excellent question, Bhaskar. In our case, we had more than 4k samples in the training set, which probably covered the majority of the vocabulary in the test samples as well. The right way would be to create a CountVectorizer and call .fit (instead of fit_transform) on the entire dataset. After that, just call .transform on the training set and later on the test set.

  • @malshininissanka4106

    @malshininissanka4106 1 year ago

    @@codebasics Shouldn't we treat the test data as unseen data? If we fit CountVectorizer on the entire dataset, might data leakage happen?
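
The two options discussed in this sub-thread differ only in which text the vocabulary is learned from. The sketch below uses made-up sentences to show what the test split contributes when it is included in the fit; it does not settle which choice is better for a given project.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "lunch meeting today"]
test_docs = ["free lunch tomorrow"]

# Option A: vocabulary learned from the training split only.
train_only = CountVectorizer().fit(train_docs)
# Option B: vocabulary learned from everything, test text included.
all_data = CountVectorizer().fit(train_docs + test_docs)

# Words that enter the vocabulary only because the test text was seen.
extra_words = set(all_data.vocabulary_) - set(train_only.vocabulary_)
```

A common conservative habit is to fit on the training split only, for example by putting the vectorizer inside a Pipeline, so that nothing about the test text influences training.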

  • @uptoolate1896
    @uptoolate1896 1 year ago

    And that was the moment that ignoring his suggested prerequisites finally caught up with me.

  • @vyaspadala8468
    @vyaspadala8468 2 years ago

    Sir, please provide us with a big data engineer and data science course.

  • @arvindatmuri5604

    @arvindatmuri5604 2 years ago

    Have a look at the NumPy, pandas, Matplotlib and machine learning courses; data science is mostly covered by these topics.

  • @changeorbeextinct
    @changeorbeextinct 1 year ago

    if any email has Nigeria and prince then it is authentic.. NOT :) BTW, great videos.

  • @debarghabhattacharjee4000
    @debarghabhattacharjee4000 2 years ago

    Please provide the spam.csv file....

  • @bhaskarbsarkar5232

    @bhaskarbsarkar5232 1 year ago

    It's in the git repo itself.

  • @kinghezzy

    @kinghezzy 1 year ago

    I can't find it there.

  • @rinkisingh5529
    @rinkisingh5529 1 year ago

    Why does your wife use your account to watch CID? And you have mentioned this in at least 2 of your videos, are you trying to cover something up.. Someone call the CID to investigate :)

  • @ramandeepbains862
    @ramandeepbains862 1 year ago

    Solution for the bug "AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names_out'": use get_feature_names() instead of get_feature_names_out(). It's a version issue. Sample code: v.get_feature_names()[790:800]
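
Since which method exists depends on the installed scikit-learn release (to my knowledge get_feature_names_out arrived in 1.0 and the older get_feature_names was removed in 1.2), a small helper can try the newer name first. The helper name feature_names and the sample sentences are made up for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return feature names on both newer and older scikit-learn releases."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer releases return a NumPy array; convert it to a plain list.
        return list(vectorizer.get_feature_names_out())
    # Fallback for older releases that only have get_feature_names().
    return list(vectorizer.get_feature_names())

v = CountVectorizer()
v.fit(["free prize now", "lunch meeting today"])
names = feature_names(v)
```

With the video's vectorizer, feature_names(v)[790:800] would play the role of the sample slice in the comment above.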
