Machine Learning in Python: Principal Component Analysis (PCA) for Handling High-Dimensional Data

Science & Technology

In this video, I will be showing you how to perform principal component analysis (PCA) in Python using the scikit-learn package. PCA is a powerful unsupervised learning approach that enables the analysis of high-dimensional data and reveals the contribution of each descriptor to the distribution of the data clusters. In particular, we will be creating the PCA scree plot, scores plot and loadings plot.
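
For reference, here is a minimal sketch of the core steps covered (assuming scikit-learn and numpy are installed; the full code is in the notebook linked below):

    import numpy as np
    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    iris = datasets.load_iris()
    X = StandardScaler().fit_transform(iris.data)  # center and scale the 4 features

    pca = PCA(n_components=3)
    scores = pca.fit_transform(X)  # scores: each sample projected onto PC1-PC3

    # Explained variance per PC -- the numbers behind the scree plot
    # (roughly [0.73, 0.23, 0.04] for the standardized iris data)
    print(pca.explained_variance_ratio_)

    # Loadings: how much each original feature contributes to each PC
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
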
🌟 Buy me a coffee: www.buymeacoffee.com/dataprof...
📎CODE: github.com/dataprofessor/code...
⭕ Playlist:
Check out our other videos in the following playlists.
✅ Data Science 101: bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: bit.ly/datascience-youtuber-p...
✅ Data Science Virtual Internship: bit.ly/dataprofessor-internship
✅ Bioinformatics: bit.ly/dataprofessor-bioinform...
✅ Data Science Toolbox: bit.ly/dataprofessor-datascie...
✅ Streamlit (Web App in Python): bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: bit.ly/dataprofessor-google-c...
✅ Pandas Tips and Tricks: bit.ly/dataprofessor-pandas
✅ Python Data Science Project: bit.ly/dataprofessor-python-ds
✅ R Data Science Project: bit.ly/dataprofessor-r-ds
⭕ Subscribe:
If you're new here, it would mean the world to me if you would consider subscribing to this channel.
✅ Subscribe: kzread.info...
⭕ Recommended Tools:
Kite is a FREE AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you’re typing. I've been using Kite and I love it!
✅ Check out Kite: www.kite.com/get-kite/?...
⭕ Recommended Books:
✅ Hands-On Machine Learning with Scikit-Learn : amzn.to/3hTKuTt
✅ Data Science from Scratch : amzn.to/3fO0JiZ
✅ Python Data Science Handbook : amzn.to/37Tvf8n
✅ R for Data Science : amzn.to/2YCPcgW
✅ Artificial Intelligence: The Insights You Need from Harvard Business Review: amzn.to/33jTdcv
✅ AI Superpowers: China, Silicon Valley, and the New World Order: amzn.to/3nghGrd
⭕ Stock photos, graphics and videos used on this channel:
✅ 1.envato.market/c/2346717/628...
⭕ Follow us:
✅ Medium: bit.ly/chanin-medium
✅ FaceBook: / dataprofessor
✅ Website: dataprofessor.org/ (Under construction)
✅ Twitter: / thedataprof
✅ Instagram: / data.professor
✅ LinkedIn: / chanin-nantasenamat
✅ GitHub 1: github.com/dataprofessor/
✅ GitHub 2: github.com/chaninlab/
⭕ Disclaimer:
Recommended books and tools are affiliate links that give me a portion of sales at no cost to you, which helps support the improvement of this channel's contents.
#dataprofessor #PCA #clustering #cluster #principalcomponentanalysis #scikit #scikitlearn #sklearn #prediction #jupyternotebook #jupyter #googlecolab #colaboratory #notebook #machinelearning #datascienceproject #randomforest #decisiontree #svm #neuralnet #neuralnetwork #supportvectormachine #python #learnpython #pythonprogramming #datascience #datamining #bigdata #datascienceworkshop #dataminingworkshop #dataminingtutorial #datasciencetutorial #ai #artificialintelligence #tutorial #dataanalytics #dataanalysis #factor #principalcomponent #principalcomponents #pc #machinelearningmodel

Comments: 93

  • @DataProfessor (4 years ago)

    Anyone interested in a tutorial video on Bioinformatics (using Python)? Comments down below 👇 If you find value in this video, please give it a Like 👍 and Subscribe ❤️for more videos on Data Science. 😃

  • @marcofestu (4 years ago)

    I do 😁

  • @DataProfessor (4 years ago)

    @marcofestu It's coming up 😃

  • @WhatsAI (4 years ago)

    Hey, I love your videos! They inspired me to start my own YouTube channel explaining (popularizing) the most-used terms in artificial intelligence for everyone outside the field, so people know more about this "black box" of our domain. I use some simple animations to help describe those terms in the simplest and most concise way possible. I also use an artificial-intelligence voice, since my own accent and voice wouldn't work well (and I loved the idea of an AI teaching AI, ahah!). That said, I'd love to collaborate with you on a video, explaining a term you use, like a little 30-sec to 1-min insert from me in your video, or anything you'd like to do! If you are interested or want to know more, feel free to check out some of my videos on my account or contact me! (Sorry for this comment-message; I couldn't find any email to message you directly.)

  • @shwetaredkar734 (4 years ago)

    Yes, I am. Working on protein-drug interactions using deep learning and machine learning.

  • @rongchai7895 (4 years ago)

    Yes! I wouldn't mind a full tutorial on a bioinformatics project, like an ML application on real NGS data. I am also interested in learning scRNA-seq and ChIP-seq from you. Sorry, asked too much, lol. Love your videos, keep it up!

  • @tjf7101 (1 year ago)

    I'm taking an online class from a brick-and-mortar school, and this was part of this week's "lecture". I have to say, this all seems thoughtful and very well presented. If I were part of this program and knew what you were talking about, I bet it would be great 👍

  • @christopherreif3624 (2 years ago)

    Thank you for this video, I've been struggling with understanding PCA for a good minute, but your video explained it extremely well! Please keep posting more like this.

  • @DataProfessor (2 years ago)

    You're very welcome! Glad it helped :)

  • @dvir-ross (1 year ago)

    This video helped me a lot! Thanks!

  • @andreamarkos (2 years ago)

    Thank you Data Professor, excellent intro! Subscribed, at once! It is important to analyze the correlation matrix to identify highly intercorrelated variables, and then the loadings, in order to interpret each component semantically: the meaning of PC1, PC2 and PC3.

  • @DataProfessor (2 years ago)

    Welcome aboard!

  • @jeffersonjones7863 (4 years ago)

    Very informative! Thanks for this important lesson 🙂 keep it up!

  • @DataProfessor (4 years ago)

    Thanks for the kind words of inspiration! 😃

  • @tinanajafpour7214 (1 year ago)

    thank you so much for the tutorial

  • @satishchhatpar (3 years ago)

    Good Explanation. Thank you for sharing.

  • @lucaslessa5442 (2 years ago)

    Insane. I love it. Please continue with these videos

  • @DataProfessor (2 years ago)

    Glad it's helpful!

  • @chisomokwueze967 (3 years ago)

    Hi Professor, thank you so much for such an educational tutorial. I have a dataset with shape (99, 25); how can I use PCA to select the 10 or 8 best features that explain at least 90% of the variance of the dataset? In summary, how do I use PCA for feature selection, without transforming the features into principal components (i.e. PC1, PC2, ...)? I just need the features for further classification.

  • @lucianb8034 (3 years ago)

    Thanks, one question though - how come most tutorials run the main sklearn PCA function before determining how many components to use in the PCA itself? Wouldn't it be better to determine the variance ratio between the components BEFORE choosing how many components to use (e.g. 1 or 2 or 3)? Isn't it a potential waste of time to start with a PCA, then find out from the scree plot afterwards that you should (could) have used more/fewer components? I think I'm missing something.

  • @leonardolp21 (3 years ago)

    Incredible tutorial, congratulations! The 3D visualizations are awesome and help a lot in understanding the loadings vs. attributes relationship. I was wondering whether it isn't possible to access the scoring coefficient matrix that is used internally by transform(X). Do you know how I can achieve that?

  • @vijayalaxmiise1504 (1 year ago)

    Thanks a lot, Professor

  • @marcofestu (4 years ago)

    Wow I loved those plots 😍

  • @DataProfessor (4 years ago)

    Yes, they look great and are very informative too! 😃

  • @dr.merlot1532 (3 years ago)

    I'd just like to remind people using PCA to consider centering their data. Also consider both the variance and covariance forms of PCA.

  • @Esthermore90 (2 years ago)

    Excellent, thank you for sharing

  • @DataProfessor (2 years ago)

    My pleasure!

  • @adir9290 (3 years ago)

    Hello Professor, I really want to appreciate the beautiful work you are doing on your channel. I have watched some of your videos and I will say the simplicity with which you deliver your lectures blows me away. I am trying to do a PCA on some data stored in an npy file, but I have had no luck, as I don't know how to go about using an npy file for PCA. I would appreciate your guidance. Thank you.

  • @DataProfessor (3 years ago)

    Hi, thanks for the kind words, and I am glad that you're finding this channel helpful. On to your question: as npy files are generated by NumPy, after reading the file in, you can try converting it to a pandas DataFrame; from there you can follow the steps mentioned in this video (see the sketch below). Hope this helps.
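
    A minimal sketch of that conversion (the file name and column labels here are placeholders):

        import numpy as np
        import pandas as pd

        X = np.load('data.npy')  # assumed shape: (n_samples, n_features)
        df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
        # df can now go through the same scaling/PCA steps as in the video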

  • @bryan_truong (2 years ago)

    This would've been more helpful if you explained how to determine the number of components. Because it seems like you just assumed it would be 3 because you knew there were three target labels (the three different species). If you didn't already have output/target labels and this was TRULY an unsupervised approach, it would have been useful to see how you arrive at 3 components.

  • @andrerr30 (3 years ago)

    How can we merge the 3D graphs (plots) of samples (flowers) and variables together? (To do so we must standardize the sample data, mustn't we?)

  • @fatur36 (2 months ago)

    Dear Prof, which variables do PC1, PC2 and PC3 represent? Certain variables, or combinations of variables? It is very important to explain this when we have a 3-D graph for non-technical users/audiences. Thanks in advance

  • @dalpyma8791 (2 years ago)

    Thanks for the video. I have a question: I have >100K sensors, and each sensor has between 2000-4000 values. My DataFrame looks like this: ['sensorNr', 'values'], in long-table format. How can I classify/cluster my data series into two categories?

  • @pateltapasvi7277 (7 months ago)

    I don't know why, but although all the code executes successfully on my dataset, the final scree plot is not appearing - the output is blank. What could the reason be?

  • @stlo0309 (2 years ago)

    Hello! You demonstrated brilliantly everything there is to do with simple datasets where, for each data sample, its features are single numbers. But can this be implemented for a 3-dimensional dataset as well? More specifically, my dataset is an np.array of shape (2000, 64, 400): here 2000 is the number of data samples, each with 64 streams of 400 data points, and I'm looking to reduce these 64 streams to some lower number...

  • @dalpyma8791 (2 years ago)

    Great video! Thanks a lot. I have a question: what should I do in order to classify data series (100k curves, with only two columns = ['sensorid', 'values']; each sensor has 2000-4000 point values)? The idea is to cluster/classify each sensor curve into two groups!

  • @marcyboy641 (4 years ago)

    Interesting! I was just following an online course about PCA😄

  • @DataProfessor (4 years ago)

    Thanks for your comment! Glad you found us. 😃

  • @username42 (3 years ago)

    Which course are you taking, Marcy?

  • @ramaselvanathan9924 (3 years ago)

    How do you use this method in spectral data analysis? For example, there are 4 different samples and they have values at various wavelengths. How do I reduce the wavelength features?

  • @DataProfessor (3 years ago)

    The spectral data could be used as input to PCA; the resulting scores plot would allow you to investigate the similarities or differences of the samples, while the loadings plot would allow you to investigate the relative importance of the features to the observed similarities/differences shown by the scores plot.

  • @kamlesh6290 (3 years ago)

    Thanks professor 😊

  • @DataProfessor (3 years ago)

    A pleasure!

  • @datnguyenduc9266 (1 year ago)

    Great video! I love your tutorial; it helps me a lot in understanding PCA. Btw, can we eliminate PC3, with only 3% of the variance, and run the model again with n_components=2? I think that 3% of variance is not very critical to the model?!

  • @juandiegorojas6356 (1 year ago)

    Hello, a question: I have problems running this part, anyone please help:

        Y_label = []
        for i in Y:
            if i == 0:
                Y_label.append('Setosa')
            elif i == 1:
                Y_label.append('Versicolor')
            else:
                Y_label.append('Virginica')
        Species = pd.DataFrame(Y_label, columns=['Species'])

  • @nathaneasley4582 (3 years ago)

    How do you transform your own dataset into sklearn's format?

  • @DataProfessor (3 years ago)

    Hi Nathan, let's say that our dataset is in a CSV file. Here is what we can do: 1. Read the CSV file using pandas pd.read_csv() and assign it to a variable (e.g. df). 2. Using the dataset assigned to df (in step 1), apply the following code to split the dataset into training and test sets: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2). 3. That's all - the resulting dataframes can be used by scikit-learn to build ML models (see the sketch below). Hope this helps 😃
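
    Roughly, the steps above look like this ('data.csv' and the 'target' column name are placeholders for your own file):

        import pandas as pd
        from sklearn.model_selection import train_test_split

        df = pd.read_csv('data.csv')     # step 1: read the CSV into df
        X = df.drop(columns=['target'])  # feature columns
        Y = df['target']                 # label column
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)  # step 2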

  • @petelok9969 (2 years ago)

    Dear Professor, can I ask whether you would consider these interpretations as more or less correct: the scores plot looks at the extent to which each PC captures the behaviour of each sample, whereas the loadings plot is an indicator of how much each variable weights the PC. Also, is it possible to derive the loadings from the score values? Btw, thanks very much, I'm getting so much out of your tutorials. Peter

  • @username42 (3 years ago)

    Can we also plot 2D scatter plots of different combinations, instead of 3D, with a for loop in plotly, or do we need to use matplotlib for that? Also, what are the uses of PCA in bioinformatics?

  • @DataProfessor (3 years ago)

    Yes, we can definitely iterate through the features and display different combinations of 2D scatter plots. There are many uses of PCA; it's really a useful algorithm. Here are some: PCA can generally be used to provide a visualization of the distribution of the data samples. It also allows quick visualization and comparative analysis of whether 2 datasets share a similar distribution to one another. In bioinformatics and cheminformatics, the PCA scores plot has been used for segmenting datasets. Additionally, the PC coefficients can also be used as a compressed form of all the variables in the dataset. Furthermore, analysis of the PCA loadings plot may allow us to gain insights into feature importance.

  • @username42 (3 years ago)

    @DataProfessor Thanks for the answer, but how do we trust PCA and other unsupervised techniques, since when we change some parameters during PCA (and other unsupervised techniques like k-means, etc.) the scores plots can change slightly? Do we also need to perform cross-validation like we do with supervised techniques? And, as you might know, orthogonality, normalization, etc. applied to the raw data can also change the results in the scores and loadings plots of the PCA, so how do we decide which one is correct?

  • @nicholaswarner1897 (1 year ago)

    I'm a very amateur Python user and I always struggle with the first part of these tutorials. I have an Excel file that I want to structure in the same way as the dataset you're importing into Python, so that the code works. How do I view the dataset you are importing, to compare my file to that dataset?

  • @BrianBeanBag (2 years ago)

    How do you get it to print the feature names and target names? When I do the same, using the same dataset, I get an error saying: 'DataFrame' object has no attribute 'feature_names'. The same goes for the step "Assigning Input (X) and Output (Y)": it simply tells me that the DataFrame has no attribute called "data" or "target".

  • @DataProfessor (2 years ago)

    This works if it's the example dataset from scikit-learn. For all other datasets read from CSV files, the feature names should automatically be read from the 1st line of the CSV file.

  • @BrianBeanBag (2 years ago)

    @DataProfessor Ahh, I see. I used the CSV file. Thanks a lot!

  • @user-kx8gv2yk4v (3 years ago)

    you are my hero!

  • @DataProfessor (3 years ago)

    Thanks!

  • @hakanbayer3020 (3 years ago)

    Hi sir, thank you very much for the video! I have an experimental dataset of time-dependent signals: 800 (time) x 49 (signals) voltage values. I used PCA and reduced it to 800x2. How can I reduce it further and extract information from this set for an ML application, and is there any other feature extraction method that you can advise for signal feature extraction?

  • @DataProfessor (3 years ago)

    As you said, your dataset of 800 x 49 is 800 rows x 49 columns, so using PCA the columns reduce to 2, which are the 2 PCs. One important thing to consider is to calculate the cumulative variance of these 2 PCs (the sum of the explained variances for PC1 and PC2); if it is too low, then you can consider using more PCs, such that the cumulative variance reaches an acceptable value (hopefully 70%). A quick sketch:
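
    (X standing in for the 800 x 49 array from the question)

        import numpy as np
        from sklearn.decomposition import PCA

        pca = PCA(n_components=2).fit(X)
        cumvar = np.cumsum(pca.explained_variance_ratio_)
        print(cumvar[-1])  # if well below ~0.70, refit with more components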

  • @stlo0309 (2 years ago)

    My data is quite similar to yours! I'm looking to reduce dimensionality as well haha

  • @nathanthreeleaf4534 (2 years ago)

    So, by default, the first descriptor always includes the most variance, regardless of what that variable contains? That seems pretty flawed if that's the case. I'm dealing with a dataset where the first variable is zip code, which has little if anything to do with the target variable. It seems like you could sort your data however you want and the very first variable will always be the "most important."

  • @DataProfessor (2 years ago)

    Hi, by "the first descriptor" I presume you're referring to the first principal component (PC1), and yes, it contains the most variance, with subsequent PCs containing less and less. As for the contents of the individual PCs, no, PC1 does not represent the first descriptor in your dataset. Rather, each PC is a fusion of all the descriptors in your dataset: the signal from all of these descriptors is represented in the first few PCs, whereas the noise occupies the latter PCs. (See the sketch below.)
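
    One way to see this fusion (assuming a fitted pca object and a list feature_names of the original column names):

        import pandas as pd

        # each row of pca.components_ expresses one PC as a weighted
        # combination of ALL original descriptors, not just the first column
        weights = pd.DataFrame(pca.components_,
                               columns=feature_names,
                               index=[f'PC{i+1}' for i in range(len(pca.components_))])
        print(weights.loc['PC1'].abs().sort_values(ascending=False))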

  • @nathanthreeleaf4534 (2 years ago)

    @DataProfessor Thank you for the clarification and for taking the time to respond. I think I had a misunderstanding of what the outputs of running a PCA really represent. I thought that since I had 50 variables, that was why I'd have 50 principal components. Additionally, I thought the purpose of running a PCA was to trim down a dataset - so if you had 50 variables, you might realize you only "need", let's say, 5 to represent the same amount of variance, AND that those 5 correspond to variables in your dataset. Since this does not seem to be the case, I am at a loss for understanding how PCA saves you any time, since you have to run your PCA with your full dataset each time, given there is nothing telling you how meaningful each variable is. I'll have to do some more research. Thank you again.

  • @shobanaathiappan4275 (4 years ago)

    Hi, why not use the PCA function directly?

        from sklearn.decomposition import PCA
        pca = PCA(n_components=3)
        pca.fit(X)

  • @DataProfessor (4 years ago)

    That works too 😃

  • @username42 (3 years ago)

    Yes, and why do this function and the other provide different outcomes, then?

  • @tomisonoakinrimisi7032 (3 years ago)

    Hi sir, I'm trying to do PCA in a different programming language. If the loading values and scores have the same magnitude but different signs (positive or negative), does it matter?

  • @DataProfessor (3 years ago)

    The values are the relative magnitudes of likeness (similarities and dissimilarities). I would think the signs are indicative of the direction and spatial location of the data samples (in the scores plot) and variables (in the loadings plot).

  • @tomisonoakinrimisi7032 (3 years ago)

    @DataProfessor Thank you very much

  • @sushmithapulagam6021 (4 years ago)

    How could we know at the start that the number of components needs to be only 3, as in pca = PCA(n_components=3)?

  • @DataProfessor (4 years ago)

    Great question. We generally look at the cumulative variance: if it gives more than 70% of the total variance, then that is the optimal number of PCs to use. Or you could follow the Haaland and Thomas approach, where you look at the MSE versus the number of PCs; if increasing the number of PCs does not provide a marked improvement, then we select as the optimal number the earliest point at which no further improvement of the total variance is observed. Hope this helps. A sketch of the 70% heuristic is below.
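
    (X assumed to be the standardized feature matrix)

        import numpy as np
        from sklearn.decomposition import PCA

        pca = PCA().fit(X)                          # keep all components first
        cumvar = np.cumsum(pca.explained_variance_ratio_)
        n_pcs = int(np.argmax(cumvar >= 0.70)) + 1  # earliest PC count reaching 70%
        print(n_pcs, cumvar[n_pcs - 1])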

  • @petelok9969 (2 years ago)

    Hi Professor, where is the input file, the iris dataset? Is it a CSV file somewhere on your GitHub? Peter

  • @DataProfessor (2 years ago)

    Hi, as mentioned in the provided Jupyter notebook, the iris dataset used herein was from the sklearn library, as follows:

        from sklearn import datasets
        iris = datasets.load_iris()
        X = iris.data
        Y = iris.target

  • @petelok9969 (2 years ago)

    @DataProfessor Ah OK, thanks Professor. I'm quite new to Python, but can I ask: was this an example dataset already within the sklearn library, or was it something you actually imported into that library? Regards, Peter

  • @DataProfessor (2 years ago)

    @petelok9969 Hi, no worries; with some practice you'll be on your way to mastering this. The one used in this tutorial is an example dataset from the sklearn library. That said, there are various ways to get data in; for instance, you can use pandas to read in a CSV file, both locally and remotely from the cloud.

  • @Diamond_Hanz (1 year ago)

    My gdown guy!

  • @mankomyk (1 year ago)

    I took n_components=10, but their cumulative sum is still only 58%. What does that say about my data, and what should I do next? (My data contains 31000 rows and 280 features, and the target column has two categories.) Thanks!!

  • @DataProfessor (1 year ago)

    This indicates that the first 10 principal components account for 58% of the data's variance.

  • @yes_cassi (3 years ago)

    A professor indeed

  • @DataProfessor (3 years ago)

    Thanks!

  • @mellmiss8522 (3 years ago)

    I need PCA code for Spyder 3.8

  • @spicytuna08 (2 years ago)

    Great stuff, Data Professor. Would this work for 200 features? Thanks

  • @DataProfessor (2 years ago)

    Yes, and more as well.

  • @bryanchambers1964 (2 years ago)

    I like the vid, but I'm trying to see how I can find out which columns of the original data frame are most related to each principal component.

  • @DataProfessor (2 years ago)

    Hi, you can look into the loadings, which compress the original variables into the lower-dimensional space accounted for by the new PCs (e.g. PC1, PC2, PC3, etc.). See the sketch below.
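
    A sketch of pulling that out (pca assumed fitted on the scaled data, df the original feature DataFrame):

        import numpy as np
        import pandas as pd

        loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
        loadings_df = pd.DataFrame(loadings,
                                   index=df.columns,
                                   columns=[f'PC{i+1}' for i in range(loadings.shape[1])])
        # columns with the largest absolute loading are most related to PC1
        print(loadings_df['PC1'].abs().sort_values(ascending=False))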

  • @bryanchambers1964 (2 years ago)

    @DataProfessor Thanks, yeah, I thought about that later. So the data frame that shows the PCs will have the same number of rows as the number of columns of the data frame we are working with - I think that's what you're saying?

  • @DataProfessor (2 years ago)

    @bryanchambers1964 Yes, the number of rows will stay the same, while the number of columns will be significantly reduced (compressed from a high dimension to a low dimension; in practice, the number of columns is reduced). For example:
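
    (pca assumed fitted with n_components=3)

        scores = pca.transform(X)
        print(X.shape, '->', scores.shape)  # (n_samples, n_features) -> (n_samples, 3)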
