The most common mistake in biostatistics

Want to take a class with me? Visit simplistics.net and sign up! See you there!
*Technical side note: zero-inflated simply means you have a distribution with an abnormally large number of zeroes. To model zero-inflated data, it's common to use models that combine logistic regression (to predict 0 vs. 1+) and a count model such as Poisson regression (to predict the size of 1+). So it's technically incorrect to say that a distribution that has this characteristic (combining yes/no with degree) is "zero-inflated." In other words, zero-inflated describes the distribution, not how the distribution is modeled. Make sense?
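
To make the side note concrete, here is a minimal simulation sketch of what "an abnormal number of zeroes" looks like. It assumes a zero-inflated Poisson process; the function names and the values of `p_zero` and `lam` are illustrative choices, not anything taken from the video.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_zero_inflated_poisson(rng, p_structural_zero, lam):
    """Return 0 with probability p_structural_zero; otherwise draw from Poisson(lam)."""
    if rng.random() < p_structural_zero:
        return 0
    return sample_poisson(rng, lam)

rng = random.Random(0)
n, p_zero, lam = 10_000, 0.4, 3.0  # hypothetical values chosen for illustration
draws = [sample_zero_inflated_poisson(rng, p_zero, lam) for _ in range(n)]

observed_zero_share = draws.count(0) / n
poisson_zero_share = math.exp(-lam)                      # P(0) under a plain Poisson(lam)
zip_zero_share = p_zero + (1 - p_zero) * math.exp(-lam)  # P(0) under the zero-inflated mixture

print(f"observed share of zeros:      {observed_zero_share:.3f}")
print(f"plain Poisson would predict:  {poisson_zero_share:.3f}")
print(f"zero-inflated model predicts: {zip_zero_share:.3f}")
```

The gap between the observed share of zeros and what a plain Poisson predicts is the "inflation"; it is a property of the data, whichever model is later used to fit it.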

Comments: 74

  • @trini-rt6xn
    2 months ago

    I'm not a Statistician or a Biostatistician, and I'm not even good at Math, but your explanation was so crystal clear even I can understand it. Sweet! And I've had Senior Level Management folk - VPs, SVPs - from major Big Pharma companies ask me to keep hacking away at data that is as plain as daylight, like the Continuous Variable Distribution you showed in this video, and I keep asking myself: "am I so stupid? Am I missing something obvious?" After all, the data is being summarized and showing whatever it's showing, but somehow the big folks want it to show something else. And I'm always like "what else do you want it to show? It is what it is!" Of course, I swallow my pride and hide my impatience because maybe, just maybe, I'm really stupid. But after months of slicing and dicing data into invisible chunks, it always comes back to where I started. Scary! Thanks again for making advanced topics palatable for myself and others like me. It gives us hope.

  • @QuantPsych

    2 months ago

    Thanks!

  • @hamidjess
    2 months ago

    This is a Nobel Prize in Languages right here.

  • @jishanzaman3421
    21 days ago

    Gem of Stat in this Era

  • @galenseilis5971
    2 months ago

    Yeah, an "optimal cutoff" requires a well-defined optimization problem. It requires an objective function to be either minimized or maximized. Vaguely pointing at a continuous empirical distribution does not constitute such clarity.
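
The point about needing an objective function can be sketched in a few lines. This is only an illustration under assumed conditions: the simulated scores, the 30% prevalence, and the two cost settings are all hypothetical. It shows that the "optimal" cutoff moves as soon as the stated costs change, so a cutoff has no meaning without them.

```python
import random

def expected_cost(cutoff, scores, has_condition, cost_miss, cost_false_alarm):
    """Average misclassification cost when everyone with score >= cutoff is flagged."""
    total = 0.0
    for score, sick in zip(scores, has_condition):
        flagged = score >= cutoff
        if sick and not flagged:
            total += cost_miss          # missed a true case
        elif flagged and not sick:
            total += cost_false_alarm   # flagged a healthy person
    return total / len(scores)

def best_cutoff(scores, has_condition, cost_miss, cost_false_alarm):
    """Search the observed scores for the cutoff that minimizes expected cost."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda c: expected_cost(c, scores, has_condition, cost_miss, cost_false_alarm),
    )

# Hypothetical data: 30% prevalence, scores shifted upward for true cases.
rng = random.Random(1)
has_condition = [rng.random() < 0.3 for _ in range(1_000)]
scores = [rng.gauss(1.5 if sick else 0.0, 1.0) for sick in has_condition]

cutoff_equal_costs = best_cutoff(scores, has_condition, cost_miss=1.0, cost_false_alarm=1.0)
cutoff_costly_misses = best_cutoff(scores, has_condition, cost_miss=10.0, cost_false_alarm=1.0)

print(f"cutoff when both errors cost the same: {cutoff_equal_costs:.2f}")
print(f"cutoff when a miss costs 10x more:     {cutoff_costly_misses:.2f}")
```

Same data, same distribution, two different "optimal" cutoffs; the only thing that changed is the objective.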

  • @zimmejoc
    2 months ago

    The inflection idea to make language continuous is something we already do. Daft and Lengel talk about media richness. Papers are less rich than talking because tone and inflection don’t come through in a paper. That’s a huge simplification of their premise, but that’s the gist.

  • @QuantPsych

    2 months ago

    Very cool!

  • @antoniobarros3415
    2 months ago

    As always, the vlog is excellent. It brings to mind a quote from Frank Harrell on categorisation: "Employ it when the intention is to mislead the reader" ;-)

  • @galenseilis5971

    2 months ago

    I've enjoyed perusing Harrell's biostatistics book.

  • @yiannisspanos694
    2 months ago

    I used the book by Imbens and Rubin (2016) to measure treatment effects in my MSc thesis. It's a sub-classification, based on the Propensity score (PS), a continuous variable. The sample is split on the median such that the average PS of the treated is equal to the average PS of the controls in each stratum. The results are somewhat sensitive to how the sample is stratified, but the stratification is done using a very specific algorithm. I would be interested to hear your thoughts on that book. Note that the PS in my sample was analytically derived, not estimated.

  • @anasbit2
    2 months ago

    I am really interested in the continuous along with categorical findings when we have to make a decision, as you said. I would really appreciate it if I could find a paper that demonstrates this approach, or uses it.

  • @galenseilis5971

    2 months ago

    I recommend looking up "decision theory".

  • @antoniobarros3415

    2 months ago

    or 10.1186/s41512-019-0064-7

  • @galenseilis5971
    2 months ago

    It is interesting to consider how to model a random variable X that quantifies an extent when we also have an indicator variable Y that tells us whether there was any extent to X at all. If we have labels then we can model X * Y. Without labels I think a mixture could be reasonable. If you have labels for some but not all of the data then that sounds like a missing data problem. There you should consider whether the missingness is MCAR, MAR, or MNAR. If it is one of the former two, then model-driven imputation may be possible. All the better if you impute a probability distribution over the missing values rather than filling in just one value.

  • @NicholasBerryman-zs8ip
    2 months ago

    I always liked Fuzzy Logic as a framework for interpreting linguistic categorical concepts continuously and vice versa. 'Fuzzification' and 'Defuzzification' are permanent fixtures in my toolbox for when I need to explain the ideas that you talk about here.

  • @QuantPsych

    2 months ago

    I have no idea what that is....I'm starting to rethink my attempt to talk about linguistics :)

  • @galenseilis5971

    2 months ago

    Personally, I do not like fuzzy logic (or fuzzy set theory). I have yet to encounter a problem where I thought it was conceptually better than a probabilistic model. It isn't that "cold", "warm", or "hot" are fuzzy. Rather, there are different probabilities of someone saying any of those labels given some factors, including the physical temperature. Often mixed effects and mixture models explain a lot of variation in such responses.

  • @NicholasBerryman-zs8ip

    2 months ago

    @@galenseilis5971 While I will agree that mixed effects models often explain 'fuzzy' ideas better than fuzzy logic, I do think you're throwing some baby out with the bath water here. I have often found it a good explanatory approach (not modelling approach) to get people out of thinking in the ways this video criticizes. When people ask me to treat an obviously continuous variable as a categorical one, I've often found it helpful to present some edge cases and say e.g. 'Could we not say that this point is "more positive" than this other point? Perhaps there's some implicit information in your wording of "positive" that the model wouldn't get if we treated a fuzzy thing like language as something perfectly exact', and that often gets the idea across, where I find bringing in probabilistic ideas often confuses people not well versed in them. And that's without getting into fields like control theory, where probabilistic thinking doesn't make sense - it doesn't make sense to say a train has a 50% chance of breaking where we mean it is breaking at 50% power.

  • @galenseilis5971

    1 month ago

    @@NicholasBerryman-zs8ip I have not seen the light yet, but I'll admit I have not looked into to what extent it helps knowledge translation to talk in terms of fuzzy logic. It could be that even if it is debatable on intellectual grounds, some people may just find fuzzy logic more intuitive for some problems. I find that fuzziness in this ontological sense is nonsensical and doesn't help me understand the systems I study, and hence my own dislike of it as an interpretation of the math being used. Probability theory has applications in control theory, but we don't need to jump into that level of breadth to discuss your example. What I understand from your phrasing is that there is a bounded variable "power" which has a well-defined 50% point between the lowest value and the greatest value, and that there is a variable for whether the system is broken which is a binary state of a system. This binary variable can be formulated as an indicator function for which some of its pre-image corresponds to "broken" and the complement set corresponds to "not broken", which gives us a way of classically labeling the state space (i.e. the partitioning is "sharp" in fuzzy set theory terms). While probability is unnecessary in this example, so is fuzzy logic. There are also cases where the system is not always broken at 50% power, but rather is sometimes broken or sometimes not broken (as labelled in the data). A probability approach would say there is a probability p of the system being broken at 50% power. A fuzzy logic approach would say the system is z broken and simultaneously 1-z not broken. In my point of view the probability interpretation of this generalization of the binary broken/not broken is preferable over the fuzzy logic interpretation.

  • @galenseilis5971

    1 month ago

    @@NicholasBerryman-zs8ip There may well be upsides to the knowledge translation side of fuzzy logic. It is not something I have looked into.

  • @McDreamyn_mdphd
    2 months ago

    I've encountered that many tend to create categorical variables to use as predictors in logistic regression models, so that the value on the logit scale can be easily interpreted as an odds ratio. But what they don't realize is that the values can be recoded to keep the continuous distribution of the variable but transform it so that a value of 0 indicates the 25th percentile and a value of 1 indicates the 75th percentile. Now, in theory, you are still interpreting the values as if they were a binary variable, but at least you do not lose statistical power by capping the natural variability of an informative covariate.

  • @swinginkeke

    2 months ago

    Can you walk me through a practical example of this? I’m a biostatistician at a hospital and our docs ALWAYS want odds, even at the expense of losing data/power/etc. I like this idea, but haven’t come across it before.

  • @galenseilis5971

    2 months ago

    Hmm, I cannot say that I find this use case compelling. The canonical logistic regression is already clear enough to interpret as-is without further tinkering. Not that I think other categorizing strategies are appealing here either.

  • @antoniobarros3415

    2 months ago

    @@swinginkeke Probably they should take responsibility for the decision. They should have a look at DCA (decision curve analysis).

  • @McDreamyn_mdphd

    2 months ago

    @@swinginkeke Rescale each value as (X_i − X_25th percentile) / (X_75th percentile − X_25th percentile), so that 0 = 25th percentile and 1 = 75th percentile for person i on variable X.
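
A minimal sketch of the rescaling described in this reply, using only the Python standard library. The function name and the example values are hypothetical; the quartile method (`"inclusive"`) is one of several conventions and is an assumption here.

```python
import statistics

def rescale_to_interquartile_range(values):
    """Shift and scale so the 25th percentile maps to 0 and the 75th percentile maps to 1."""
    q25, _median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - q25) / (q75 - q25) for v in values]

# Made-up lab values for nine patients, already sorted for readability.
raw = [70, 85, 90, 100, 110, 120, 130, 145, 160]
rescaled = rescale_to_interquartile_range(raw)

for original, new in zip(raw, rescaled):
    print(f"{original:>4} -> {new:+.3f}")
```

The variable stays continuous: values below the 25th percentile become negative and values above the 75th exceed 1. Because this is a linear transformation, a logistic regression coefficient on the rescaled variable equals the raw coefficient multiplied by the interquartile range, so its exponent reads as the odds ratio comparing a patient at the 75th percentile with one at the 25th.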

  • @McDreamyn_mdphd

    2 months ago

    @@galenseilis5971 Well, I agree, but for publication purposes in medical journals, there is often less interest in understanding a single unit by unit increase in the log odds of say some diagnostic test and instead there is a desire to transform the interpretation to something that is clinically meaningful. If I have a patient and I want to understand the dose-effect of a statin on an inflammatory marker (troponin), the transformation I outlined above is a very straightforward approach of translating the odds ratio into a very easy and understandable metric, especially for clinicians who may not necessarily be adept at reading medical literature. Over my career, I have learned that success is less about what I know, and instead what I can do to demystify the numbers and make them clinically relevant to my peers in the publication process.

  • @PATRICKCONNOLLY-ub2vb
    2 months ago

    Medical doctors are indoctrinated to think of the world in terms of decision points and cutoffs. This is why the doctor demands discretization of a continuous response. He wants to have a decision point, where test results above that point indicate treatment. Continuous distributions are much more challenging to deal with. If you have a guy whose test score is in the middle of the pack, what do you do, give him half the treatment? What if the treatment is a surgical procedure? This is why doctors demand cutoffs. They are not morons, they just have a different set of priorities and constraints.

  • @galenseilis5971

    1 month ago

    "They are not morons, they just have a different set of priorities and constraints." I think that's the crux of it right there. This doesn't excuse poor statistical practices when they occur, but they're just humans trying to solve problems like the rest of us. Mapping continuous random variables to discrete random variables usually makes the most sense 'after' the joint probability distribution over the variables has been well-specified. The discrete outcomes can be the options to be decided upon, hopefully along with probabilities computed via the change of variables for the mapping, so that decisions under risk can be made. Even when we have discrete random variables to begin with, probability distributions (excepting Dirac delta distributions) do not usually commit to a single answer anyway. If I look at demand for hospital beds and I want to know how many beds will be enough for the next planning period, I can only allocate one counting number of beds; not the whole ensemble. Making decisions commits us to one mutually exclusive option over the others, with some risk of it being incorrect.

  • @nl7247
    2 months ago

    Please also discuss the problems when categorical data are analysed as continuous data. Thank you for your videos.❤

  • @QuantPsych

    2 months ago

    What problem? That's very common to do that. For example, male/female becomes 1/0 and we can use regression to do a t-test. Unless you mean something else?
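
The 1/0 coding mentioned in this reply can be checked with a small sketch. The data below are made up, and the helper `ols_slope_intercept` is written out by hand for illustration; the point is that the regression slope on a 0/1 predictor is exactly the difference in group means, which is the quantity a two-sample t-test examines.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: group coded 0/1, with a continuous outcome.
group = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [4.1, 5.0, 4.6, 5.3, 6.2, 5.8, 6.9, 6.5]

slope, intercept = ols_slope_intercept(group, outcome)
mean_group0 = sum(y for g, y in zip(group, outcome) if g == 0) / group.count(0)
mean_group1 = sum(y for g, y in zip(group, outcome) if g == 1) / group.count(1)

print(f"slope:               {slope:.3f}")
print(f"difference in means: {mean_group1 - mean_group0:.3f}")
print(f"intercept:           {intercept:.3f} (mean of the group coded 0: {mean_group0:.3f})")
```

The 0 and 1 are labels for two groups, and the fitted line is only ever evaluated at those two points, so nothing in the model implies that intermediate values exist.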

  • @nl7247

    2 months ago

    @@QuantPsych I mean using continuous numbers to analyze categories; e.g., we don't really consider that there could be a 0.73 in the gender range when we use 1 or 0 (or 2) to represent only two genders (not going to get into the recent gender classification discussion here). Or something that should only take integer values, where making it continuous does not make sense in the real world, although we often say or hear that people have an average of 0.83 cars... Thank you for your thoughts and reply.

  • @galenseilis5971

    1 month ago

    @@nl7247 One of the ways that models can be less realistic is to ignore the set of possible outcomes. If I have a count variable, e.g. Poisson, the expected value is not in general an observable outcome. That's okay if you are truly interested in the expected value. If you are not interested in the expected value then you should use something else like a distribution over the observables.

  • @planetary-rendez-vous
    2 months ago

    I categorized my gene expression into low, medium, and high because we have no idea how to analyze something without resorting to p-value comparisons (is there a difference in the means, plug in whatever model you have depending on normality).

  • @galenseilis5971

    2 months ago

    What mathematical and computational methods you should use will depend on the question you're trying to answer (assuming that your data can answer that question in principle).

  • @planetary-rendez-vous

    2 months ago

    @@galenseilis5971 In our experiment we wanted to see if there is a "correlation" between a specific gene signature and the rest of the data, that is, if we could sort our tumor samples or explore any kind of pattern based on a specific gene signature that the tumor expresses. Anyway, I'm not sure if that is a good idea, but we did it.

  • @galenseilis5971

    2 months ago

    @@planetary-rendez-vous Well, such exploration of data could reveal interesting patterns. Naturally, if you find an interesting pattern you might next want to find out if it generalizes to future samples. Finding a pattern in a single sample is still only an early step towards an independently-verifiable phenomenon. And beyond a stable statistical pattern there are causal models. These come in different forms, but I recommend Judea Pearl's work on causality. I think his diagrammatic approach is a good entry point.

  • @qwerty11111122
    2 months ago

    10:00 consider the bumblebee

  • @QuantPsych

    2 months ago

    TIL about bumblebee languages :) Fascinating stuff!

  • @tulipped
    2 months ago

    Myanmese (or Burman, depending on who you ask).

  • @QuantPsych

    2 months ago

    Excellent! I was hoping I'd get someone who knows :)

  • @royals2013
    2 months ago

    “But previous literature did” hm ok yeah let’s shy away from that excuse

  • @QuantPsych

    2 months ago

    Seriously!

  • @galenseilis5971

    2 months ago

    What excuse? Fife cited a paper. What's wrong with that? If you read the paper and find problems with it that's one thing, but citing a source for a claim isn't an excuse as I understand it. Please elaborate.

  • @royals2013

    2 months ago

    @@galenseilis5971 PI only wants to do something because previous literature did something. Statistics evolves, better practices emerge. Bad statistics are replicated far too often from people assuming the original methods are appropriate.

  • @galenseilis5971

    2 months ago

    @@royals2013 I understand the context of your comment better. Thank you. Yeah, that's a really good point. A lot of literature has false claims in it, and that is something to have some vigilance about. Assuming uncritically that previous literature is *de facto* correct is unwise. It sounds like Fife has read the paper, albeit a substantial amount of time ago. If we want to further evaluate the paper that is up to us. In this context I think Fife is just making a claim with a citation. You're right that we should not take the conclusions of the paper at face value, but I don't think it is excuse-making to cite previous work as evidence for a claim.

  • @royals2013

    2 months ago

    @@galenseilis5971 ah wasn’t meaning it as a quote from quantpsych btw lol just a quote that I hear a lot from PIs. Totally agree w everything said in the vid

  • @galenseilis5971
    2 months ago

    I think what actually stops me from ever using median splits is that the decisions I help people make with statistics don't involve the median. It just doesn't have any relevance to the problems I work on.

  • @antoniobarros3415

    2 months ago

    I came across several PhDs that were based on that particular subject matter.

  • @galenseilis5971

    2 months ago

    @@antoniobarros3415 Their PhD dissertations depended on using the median? It certainly could happen.

  • @antoniobarros3415

    2 months ago

    @@galenseilis5971 sure do. terrible. I cannot however id the PhD!

  • @swinginkeke
    2 months ago

    Totally agree in theory, but docs love ORs and the Titanic turns slowly. How can I better communicate interpretability of betas if I keep the outcome continuous? “For each year older the kiddo is, we see delay to initial imaging increase by 1.6 days.” The blank stares haunt my dreams.

  • @QuantPsych

    2 months ago

    True. Probably better to show them a plot.

  • @galenseilis5971

    1 month ago

    Are you conflating the random variables with the (conditional) expected values of the variables? They are not the same in aspects that are important for planning.

  • @galenseilis5971
    2 months ago

    I could see someone trying to partition the data if they saw a bimodal distribution and no apparent labels to explain that bimodality, but I would still prefer a mixture model. A mixture model allows the assignment of probabilities to the apparent subpopulations.
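
As a rough illustration of the mixture-model alternative raised in this comment, here is a bare-bones EM fit of a two-component Gaussian mixture. Everything in it is an assumption made for the sketch: the simulated bimodal data, the crude starting values, and the fixed number of iterations. A real analysis would normally use an established library rather than this hand-rolled version.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_two_gaussian_mixture(data, n_iter=100):
    """Fit a two-component Gaussian mixture by EM.

    Returns (weight of component 0, means, sds, responsibilities from the final E-step).
    """
    lo, hi = min(data), max(data)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # crude starting values
    sds = [(hi - lo) / 4.0, (hi - lo) / 4.0]
    weight = 0.5
    resp = [0.5] * len(data)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 0.
        resp = []
        for x in data:
            dens0 = weight * normal_pdf(x, means[0], sds[0])
            dens1 = (1.0 - weight) * normal_pdf(x, means[1], sds[1])
            resp.append(dens0 / (dens0 + dens1))
        # M-step: update parameters using those probabilities as soft weights.
        n0 = sum(resp)
        n1 = len(data) - n0
        means = [
            sum(r * x for r, x in zip(resp, data)) / n0,
            sum((1.0 - r) * x for r, x in zip(resp, data)) / n1,
        ]
        sds = [
            math.sqrt(sum(r * (x - means[0]) ** 2 for r, x in zip(resp, data)) / n0),
            math.sqrt(sum((1.0 - r) * (x - means[1]) ** 2 for r, x in zip(resp, data)) / n1),
        ]
        weight = n0 / len(data)
    return weight, means, sds, resp

# Hypothetical bimodal sample: 300 points around 0 and 200 points around 5.
rng = random.Random(2)
data = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

weight, means, sds, resp = fit_two_gaussian_mixture(data)
print(f"estimated share of lower component: {weight:.2f}")
print(f"estimated means: {means[0]:.2f}, {means[1]:.2f}")
```

Unlike a hard split, each observation keeps a membership probability (the values in `resp`), so points that fall between the two modes are not forced into one group.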

  • @planetary-rendez-vous

    2 months ago

    Nah I split my continuous variable into 3 categories, low, medium, high because we have no idea how to analyze it. 🤡

  • @galenseilis5971

    2 months ago

    @@planetary-rendez-vous lol

  • @galenseilis5971

    2 months ago

    @@planetary-rendez-vous In all seriousness, my advice is to ask for help when you don't know how to analyze your data.

  • @galenseilis5971

    2 months ago

    @@planetary-rendez-vous Ah, I just noticed you did below. You might get answers in a comments section like this. You can also reach out to statisticians (like Dustin) or others at industry and academic institutions. In my experience they're happy to help when they have time.

  • @planetary-rendez-vous

    2 months ago

    @@galenseilis5971 Yes of course, however I had 2 lab experiences and both didn't have statisticians. One lab had an epidemiologist and tbf it is not the same as a statistician who understands the theory behind the methods. So the epidemiologist doesn't have any reservations about using categorizations; we truly didn't know better, and it is very likely that nobody knows better in current environments except specialized ones with dedicated researchers on good statistical practices.

  • @DistortedV12
    2 months ago

    If all you know is ANOVA, what would you do instead?

  • @galenseilis5971

    1 month ago

    What alternatives are appropriate to ANOVA depends on the analysis problem; there isn't a one-size-fits-all approach. Saying that something isn't ANOVA is like saying something isn't a banana; it doesn't narrow things down very much. Start with the problem you want to solve and search for or develop the best method you can for it.

  • @QuantPsych

    24 days ago

    Hire a statistician :)

  • @gimanibe
    2 months ago

    First of all you’re kind of crazy, in a good way 😂; secondly, I will use “languagatized” with my students! Boo to discretizing continuous variables!

  • @galenseilis5971

    2 months ago

    Fife's kind of crazy is a fun type of crazy.

  • @zimmejoc
    2 months ago

    I just don’t get the median split. Why take a continuous variable with all its information and then strip it away and turn it into a discrete one with two values. It’s a loss of data. EDIT: I commented the moment you said median split. Nice to see your elaboration support my dislike of the practice. 😁
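
The information loss from a median split can be seen in a quick simulation. This is a sketch under assumed conditions: a bivariate normal relationship with a true correlation of 0.5, a sample size and seed chosen arbitrarily, and a hand-written `pearson_r` helper.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation computed from first principles."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(3)
n = 5_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Noise variance 0.75 gives y unit variance and a true correlation of 0.5 with x.
y = [0.5 * xi + rng.gauss(0.0, math.sqrt(0.75)) for xi in x]

median_x = statistics.median(x)
x_split = [1 if xi > median_x else 0 for xi in x]  # the median split

r_continuous = pearson_r(x, y)
r_split = pearson_r(x_split, y)

print(f"correlation using continuous x:   {r_continuous:.3f}")
print(f"correlation using median-split x: {r_split:.3f}")
```

For normally distributed data, splitting one variable at the median shrinks the correlation by a factor of about 0.80 (the square root of 2/π), so the same relationship looks weaker and takes a larger sample to detect.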

  • @QuantPsych

    2 months ago

    I'm glad we agree!

  • @fruithillfarm6113
    2 months ago

    Diagnostic criteria require an optimal cutoff. Those cutoffs are not arbitrary or determined by one dataset (the focus of researchers). Clinicians often conceptualize the data continuously (e.g., pre-diabetic, higher risk for cardiovascular disease, pre-clinical risk for stress-mediated chronic disease development), but patients want to know if they have a condition or not (category). Clinical scientists don't categorize everything because we only know how to use ANOVAs, but what a condescending standpoint. Eliminating categorical cutoffs eliminates diagnoses. I'm good with that, but really, as a patient, are you?

  • @galenseilis5971

    1 month ago

    Eventually mutually exclusive choices about whether or how to treat have to be made which induces some amount of discreteness.

  • @QuantPsych

    24 days ago

    Did I say that *all* clinical scientists categorize things for an ANOVA? I don't believe I said that. Some do. I have known many to do it. It's not condescending to accurately state that some people categorize so they can use their ANOVAs. I think you're missing my point. I recognize (and say as much in the video) that sometimes, at the end of the day, you need to make a decision (e.g., a diagnosis). I do not object to discretizing data at that point (provided we keep in mind the data are continuous). Rather, my objection is discretizing before doing analyses (and analyzing categorical versions of our continuous variables).