Everything new and interesting in Machine Learning, Deep Learning, Data Science, & Artificial Intelligence. Hoping to build a community of data science geeks and talk about future tech! Projects demos and more! Subscribe for awesome videos :)
@4:38 are you sure d_q is the number of total time steps? I think it's supposed to be the dimension of the query & key.
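For reference on the question above: in standard scaled dot-product attention, d_q (often written d_k) is the dimension of the query and key vectors, not the number of time steps; the scores are scaled by its square root. A minimal NumPy sketch, with sizes chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 64                      # 4 time steps, query/key dimension 64

Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

# Scores are scaled by sqrt(d_k) -- the query/key dimension, not the step count
scores = Q @ K.T / np.sqrt(d_k)                            # shape (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
out = weights @ V
print(out.shape)  # (4, 64)
```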
@scott7948 · 16 hours ago
In the final video, are you going to show an example where you feed data into the model and interpret the output? It would be good to see any preprocessing of the data to get it into the right format to feed into the model. I'm keen to use this model for a time-series forecasting exercise, 8 timesteps ahead.
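The preprocessing asked about above usually amounts to slicing the series into sliding windows. A hypothetical sketch of windowing a 1-D series for an 8-steps-ahead target (the function name and sizes are illustrative, not from the video):

```python
import numpy as np

def make_windows(series, input_len, horizon):
    """Slice a 1-D series into (input window, horizon-steps-ahead target) pairs."""
    X, y = [], []
    for i in range(len(series) - input_len - horizon + 1):
        X.append(series[i : i + input_len])
        y.append(series[i + input_len + horizon - 1])  # value `horizon` steps past the window
    return np.array(X), np.array(y)

series = np.arange(20.0)                       # toy series: 0, 1, ..., 19
X, y = make_windows(series, input_len=5, horizon=8)
print(X.shape, y.shape)  # (8, 5) (8,)
```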
@user-qd2oc6xq8n · 1 day ago
Can you suggest an interactive AI neural network model for a school project? Your videos are nice and I understand them easily. Please tell.
@MadMax-ph1rl · 1 day ago
All these Hindu Indian scientists were really selfish to keep their studies and research within a certain group. And when someone from the West discovered the exact same thing, some 200 or 300 years later, they started saying "no, we discovered it hundreds of years ago". So why didn't you spread that knowledge? Because of people like this, our scientific knowledge and this so-called modern world are 100 years behind.
@anirudh514 · 1 day ago
Very well explained
@ashishanand9642 · 2 days ago
Why is this so underrated? This should be on everyone's playlist for linear regression. Hats off, man :)
@user-oj2wg8og9e · 2 days ago
wonderful explanation!!!
@ArielOmerez · 2 days ago
C
@ArielOmerez · 2 days ago
B
@ArielOmerez · 2 days ago
D
@bartekdurczak4085 · 2 days ago
Good explanation, but the noises are a little bit annoying. Thank you bro <3
@youtubeaccount8613 · 2 days ago
appreciate this! thank you so much!
@nirorit · 4 days ago
Based on what we've studied in class (information theory and machine learning), bigger batches are more accurate, as they minimize the MSE (mean squared error) of the cost function, iirc. So if I were to trust my memory/understanding and my professor, then, unlike what you said: smaller batches are better because they are faster to compute than a full-data batch, and also because they introduce more randomness (larger error) during training, which can help escape high local minima.
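Both halves of the trade-off discussed above can be checked numerically: mini-batch gradients are unbiased estimates of the full-batch gradient, but each individual estimate is noisy, and that noise is what can help escape poor minima. A small sketch on toy linear-regression data (batch size and sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true = 10_000, 3.0
x = rng.standard_normal(n)
y = w_true * x + 0.1 * rng.standard_normal(n)

def grad(w, idx):
    # gradient of the MSE cost 0.5 * (w*x - y)^2 over the chosen sample indices
    return np.mean((w * x[idx] - y[idx]) * x[idx])

w = 0.0
full_grad = grad(w, np.arange(n))                           # exact full-batch gradient
mini_grads = [grad(w, rng.choice(n, 32)) for _ in range(1000)]

print(np.mean(mini_grads) - full_grad)   # ~0: mini-batch gradients are unbiased...
print(np.std(mini_grads))                # ...but each one is individually noisy
```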
@psiddartha7115 · 6 days ago
I am a non-engineer; how should I prepare?
@eeera-op8vw · 6 days ago
good explanation for a beginner
@LNJP13579 · 7 days ago
Brother, you have summarized really well in such a short video. Every second was GOLD 🙂
@sudlow3860 · 7 days ago
With regard to the quiz I think it is B D B. Not sure how this is going to launch a discussion though. You present things very well.
@CodeEmporium · 6 days ago
Ding ding ding! Good work on the quiz! While this may or may not spark a discussion, just wanted to say thanks for participating :)
@wowcat4426 · 7 days ago
Cringe
@rpraver1 · 7 days ago
Also as always great video, hoping in future you deal with encoder only and decoder only transformers...
@CodeEmporium · 6 days ago
Yep! For sure. Thank you so much!
@theindianrover2007 · 7 days ago
cool!
@CodeEmporium · 6 days ago
Thank you 🙏
@rpraver1 · 7 days ago
Not sure if just me, but starting at about 4:50 your graphics are so dark... maybe go to a white background or light gray, like your original png...
@CodeEmporium · 6 days ago
Yea. Let me try brightening them up for future videos if I can. Thanks for the heads up
@LeoLan-vv1nq · 7 days ago
Amazing work, can't wait for next episode !
@-beee- · 7 days ago
I would love if the quizzes had answers in the comments eventually. I know this is a fresh video, but I want to check my work, not just have a discussion 😅
@dumbol8126 · 7 days ago
Is this the same as what TimesFM uses?
@neetpride5919 · 7 days ago
Why aren't the padding tokens appended during data preprocessing, before the inputs are turned by the feedforward layer into the key, query, and value vectors?
@slayer_dan · 7 days ago
Adding padding before forming K, Q, and V vectors would insert extra tokens into the input sequences, altering their lengths and potentially distorting the underlying data structure. As a result, the subsequent computation of K, Q, and V vectors would incorporate these padding tokens, affecting the model's ability to accurately represent the original data.

During the attention calculation, these padding tokens would influence the attention scores, potentially diluting the focus on the actual content of the input sequences. This could lead to less effective attention patterns and hinder the model's ability to learn meaningful representations from the data.

Furthermore, applying padding after forming K, Q, and V vectors allows for the efficient use of masking techniques to exclude padding tokens from the attention mechanism. By setting the attention scores corresponding to padding positions to negative infinity before the softmax operation, the model effectively ignores these tokens during attention calculation. This approach preserves the integrity of the input sequences, ensures accurate attention computations, and maintains the model's focus on relevant information within the data.

P.S. I used ChatGPT to format my answer because it can do this thing better.
@neetpride5919 · 6 days ago
@slayer_dan How could it possibly save computing power to pad the matrices with multiple 512-element vectors, rather than simply appending <PAD> tokens to the initial sequence of tokens?
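The masking step discussed in this thread can be sketched in a few lines of NumPy: attention scores pointing at pad positions are set to negative infinity before the softmax, so those positions receive exactly zero weight. Sizes here are illustrative, not from the video:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k, real_len = 6, 8, 4          # last 2 of 6 positions are <PAD>

Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.zeros((seq_len, seq_len))
mask[:, real_len:] = -np.inf              # block attention *to* the pad positions
masked = scores + mask

# softmax: exp(-inf) = 0, so pad columns contribute nothing
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ V
print(weights[:, real_len:].max())        # 0.0: pad columns get no attention weight
```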
@eadweard. · 7 days ago
In answer to your question, I can either: A) mono-task or B) screw up several things at once
@Ishaheennabi · 7 days ago
Love from kashmir india bro!❤❤❤
@algorithmo134 · 8 days ago
cant wait for more deep learning in depth coding and tutorials! Would love to see deep learning in time series :D
@CodeEmporium · 7 days ago
Nice! Currently making this playlist for the Informer architecture. You can check out a few videos on this in the playlist "Informer from scratch".
@user-mr3se3jk1r · 8 days ago
You have missed the concept of teacher forcing during training
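For readers unfamiliar with the term raised above: teacher forcing means that during training the decoder is fed the ground-truth previous token at each step, not its own (possibly wrong) prediction. A minimal illustrative sketch; the decoder step is a stand-in, not the video's model:

```python
# Toy target sentence for a seq2seq decoder
target = ["<s>", "le", "chat", "dort", "</s>"]

def decoder_step(prev_token):
    # stand-in for a real decoder step; a trained model would predict the next token
    return "<hypothetical prediction>"

# Teacher forcing: at step t the decoder receives target[t] (ground truth),
# and the loss compares its output against target[t + 1]
for t in range(len(target) - 1):
    pred = decoder_step(target[t])        # input is the ground-truth previous token
    expected = target[t + 1]              # what the loss would compare `pred` against

decoder_inputs = target[:-1]              # i.e. the shifted-right target sequence
print(decoder_inputs)
```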
@rasikannanl3476 · 8 days ago
great .. so many thanks ... need more explanation
@aakarshrai5833 · 8 days ago
Bro, could you please label your equations? It'll be helpful.
@algorithmo134 · 8 days ago
Hi @CodeEmporium, do you have the solution to quiz 2 at 8:46?
@joeybasile1572 · 8 days ago
Nice man
@himanshusingh2980 · 8 days ago
Really want to hear the Indian accent of this guy 😅😂
@katerinaneprasova2939 · 9 days ago
Are there right answers to the quiz somewhere? Would be helpful to put them in the description.
@katnip1917 · 9 days ago
Great Video!! Thank you for the explanation. My question is, why not use the current state in the target network, instead of the next state?
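On the question above: in DQN-style learning, the target network is evaluated at the *next* state because the Bellman backup defines the value of (s, a) as the immediate reward plus the discounted value of the best action available at s'. A toy tabular sketch (sizes, transition, and learning rate are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.1

q_online = rng.standard_normal((n_states, n_actions))
q_target = q_online.copy()            # target net: a frozen copy, synced every N steps

s, a, r, s_next = 0, 1, 1.0, 3        # one (state, action, reward, next state) transition

# Bellman backup: value of (s, a) = reward + discounted value of the NEXT state,
# which is why the target network is evaluated at s' rather than s
td_target = r + gamma * q_target[s_next].max()

old_q = q_online[s, a]
q_online[s, a] = old_q + lr * (td_target - old_q)   # move the online estimate toward the target
```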
@rajeshve7211 · 9 days ago
Best ever explanation of BERT! Finally understood how it works :)
@kenesufernandez1281 · 9 days ago
✨💖
@jonfat4371 · 9 days ago
Very great explanation, but for god's sake, stop the irritating noises. I'm losing it, man… what would happen if you continued normally?!
@abinav92 · 9 days ago
Good video! Well explained. In real life though a particular time series will correlate with itself and depend on other time series. Any way to take this into account to improve predictions?
@burakkurt1907 · 10 days ago
May God bless you
@lazarus8011 · 10 days ago
Good video here's a comment for the algorithm
@yaminevire7854 · 10 days ago
I am from Bangladesh ❤❤
@rpraver1 · 10 days ago
As always, great video, looking forward to next video on the code...
@StraightToTheAve · 10 days ago
My brain can’t comprehend how some things were created
@davefaulkner6302 · 11 days ago
Thanks for your efforts to explain a complicated subject. A couple of questions: did you intentionally skip the layer normalization, or did I miss something?

Also, the final linear layer in the attention block has dimension 512 x 512 (input size, output size). Does this mean that each token (logit?) output from the attention layer is passed token-by-token through the linear layer to create a new set of tokens, that set being of size sequence length? This connection between the attention output and the linear layer is baffling me. The output of the attention layer is (sequence length x transformed embedding length), or (4 x 512), ignoring the batch dimension of the tensor, yet the linear layer accepts a (1 x 512) input and yields a (1 x 512) output. So is each (1 x 512) token in the attention output sequence passed one at a time through the linear layer? And does this imply that the same linear layer is used for all tokens in the sequence?
@jorgesanabria6484 · 11 days ago
Would historical nutritional data count?
@hackie321 · 12 days ago
Can you please blow up the Llama/Llama 2 architecture and code for us? Eagerly waiting for your LLM videos.
@CodeEmporium · 12 days ago
Yep! That’s definitely a future playlist idea
@hackie321 · 12 days ago
@CodeEmporium Awesome. Thanks
@tripathi26 · 12 days ago
This is interesting. Eagerly looking forward to next episodes ❤
@yolemmein · 12 days ago
Very useful and great explanation! Thank you so much!