Strings in Rust FINALLY EXPLAINED!

Science and Technology

The ultimate Rust lang tutorial. Follow along as we go through strings in Rust. We will be talking about UTF-8, the &str and String types, indexing into strings, and more!
📝 Get notified when the Rust Cheatsheet comes out: letsgetrusty.com/cheatsheet
The Rust book: doc.rust-lang.org/stable/book/
Chapters:
0:00 Intro
1:09 What is a string?!
6:53 &str and String
10:21 Creating strings
12:19 Manipulating strings
13:31 Concatenating strings
15:29 Indexing into a string
19:44 Strings and functions
20:32 Outro
#letsgetrusty #rustlang #tutorial

Comments: 197

@letsgetrusty 2 years ago
📝 Get your *FREE Rust cheat sheet* : www.letsgetrusty.com/cheatsheet
@antonioquintero-felizzola5334 2 years ago
This has become my favorite RUST channel on KZread.
@stardustbiscuits
2 years ago
Because this is the only rust channel on KZread
@Yotanido 2 years ago
Little note about function parameters: Taking &str instead of String is good, but only if you don't need an owned String. If you do need an owned String, make sure to take a String, so the caller can decide how the owned string is generated. (For example, the caller might already have a String. If you take a &str, you need an unnecessary clone.) Now to contradict myself: If you do need an owned String, it might be best to use an impl Into<String> instead. This way, the caller can pass in a &str as well. Improves ergonomics. In the same vein, taking impl AsRef<str> instead of &str also allows your function to take an owned string. It is trivial to put an & in front, so it's not as important as Into, but it also slightly improves ergonomics. Depending on what you actually do with it, you might even want your string input to be IntoIterator<Item = char> or something. This does decrease ergonomics, since the caller now needs to call chars on the string, but it does mean your function will also work with, for example, Vec<char>. If you sometimes return an owned string and sometimes a &str, you can use Cow<str>. For example, if you sometimes return a string literal, Cow
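The parameter choices described in this comment can be sketched in a few lines of Rust. This is only an illustration: the names `shout`, `User`, `make_user`, `byte_len`, `count_vowels` and `greeting` are invented here and do not come from the video.

```rust
use std::borrow::Cow;

// Only reads the text, so borrowing a &str is enough.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

struct User {
    name: String,
}

// Needs an owned String: the caller may pass a String (moved, no clone)
// or a &str (converted on the spot).
fn make_user(name: impl Into<String>) -> User {
    User { name: name.into() }
}

// Accepts &str, String, &String and friends without the caller writing `&`.
fn byte_len(s: impl AsRef<str>) -> usize {
    s.as_ref().len()
}

// Works on any source of chars, such as `text.chars()` or a Vec<char>.
fn count_vowels(chars: impl IntoIterator<Item = char>) -> usize {
    chars.into_iter().filter(|c| "aeiouAEIOU".contains(*c)).count()
}

// Sometimes returns a borrowed literal, sometimes a freshly built String.
fn greeting(name: Option<&str>) -> Cow<'static, str> {
    match name {
        Some(n) => Cow::Owned(format!("hello, {n}")),
        None => Cow::Borrowed("hello, stranger"),
    }
}

fn main() {
    println!("{}", shout("hi"));
    println!("{}", make_user("ann").name);
    println!("{}", make_user(String::from("bob")).name);
    println!("{}", byte_len(String::from("abc")));
    println!("{}", count_vowels("rust lang".chars()));
    println!("{}", greeting(None));
    println!("{}", greeting(Some("Ferris")));
}
```

The `Cow::Borrowed` branch performs no allocation, which is the reason to reach for `Cow` when most calls return a literal.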
@jonathanmoore5619 2 years ago
I think this is probably your best video yet. It's great that you've gone a little bit deeper. Programmers really need to know this stuff. Thanks for your effort.
@JaycenGiga 2 years ago
At 20:50 you talk about fixed-length encoding using four bytes. This is what UTF-32 does. AFAIK none of the major languages uses it by default, but Python has an interesting take on it: When a string is created, the interpreter chooses the narrowest fixed-width representation that fits every character in it (1, 2, or 4 bytes per code point: Latin-1, UCS-2, or UCS-4/UTF-32), so that constant-time indexing is always possible but not too much memory is wasted. This of course only works because Python strings are immutable.
@dacid44
1 year ago
Rust's char type is effectively a UTF-32 code unit (a 4-byte Unicode scalar value), as far as I understand it. Which means you could get very similar functionality to a fixed character length encoding by simply using a Vec<char> (or a &[char] to emulate a string slice), if you so wished. Granted, not all the operations that are possible for the normal string types are implemented for a Vec<char>, but given the size of the Rust ecosystem, there's probably a library to make it possible.
@haniyasu8236
1 year ago
I'm like 90% sure that Java `String`s are UTF-32. From my recollection, the `char` type in Java is pretty explicitly an `int` under the hood as well.
@abcxyz5806
1 year ago
@Haniyasu char is UTF-16. But if I remember correctly, multibyte encoding does not work in Java. That could have changed by now, I haven't used Java since Java 8.
@upgradeplans777 2 years ago
As a note on your comment at the end: The type that you said that Rust does not have would be represented by Vec<char> in Rust applications. It is not equivalent to rune slices in Go ([]rune), but intended for the same usage. In general, Go slices are similar to Rust vectors. However, there is a difference between char and rune: In Go, rune is an alias for int32. In Rust, char is its own type. With Rust's emphasis on memory safety, safe Rust code cannot generate invalid chars. This means that its behavior is different from u32, the type that it would otherwise be equivalent to. In Go, it is perfectly possible to create meaningless runes. The same is true for strings, by the way: safe Rust code cannot generate invalid UTF-8 values, but no such limitation exists in Go. As I understand it, these are equivalent types between Go and Rust:
- string -> &[u8]
- rune -> i32
- byte -> u8
- []byte -> Vec<u8>
- []rune -> Vec<i32>
- [7]byte -> [u8; 7]
- [7]rune -> [i32; 7]
The other Rust types that we mentioned (char, &str, String, Vec<char>, etc) have additional memory safety guarantees that Go does not provide.
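The validity guarantees mentioned in this comment can be observed with standard library calls alone. In the sketch below, `rune_to_char` is a hypothetical helper invented for illustration; it treats an `i32` the way a Go rune would be stored and shows that not every such value is a valid Rust `char`.

```rust
// Hypothetical helper: try to interpret a Go-style rune (an i32) as a Rust char.
// Negative values, surrogates and values above U+10FFFF are rejected.
fn rune_to_char(r: i32) -> Option<char> {
    if r < 0 {
        None
    } else {
        char::from_u32(r as u32)
    }
}

fn main() {
    // 0x41 is 'A'; 0xD800 is a UTF-16 surrogate and therefore not a valid char.
    println!("{:?}", rune_to_char(0x41));
    println!("{:?}", rune_to_char(0xD800));
    println!("{:?}", rune_to_char(-1));

    // Arbitrary bytes are fine in a Vec<u8> ...
    let bytes: Vec<u8> = vec![0x66, 0x6F, 0xFF];
    // ... but turning them into a String is checked, and 0xFF is never valid UTF-8.
    println!("{}", String::from_utf8(bytes.clone()).is_err());
    // The lossy conversion substitutes U+FFFD for the invalid byte.
    println!("{}", String::from_utf8_lossy(&bytes));
}
```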
@mistakenmeme 2 years ago
Keep it up!! As Rust grows, you will one day be remembered as one of the O.G. Rust youtubers!
@cramhead 2 years ago
Liked the little explanation of UTF-8 encoding. Thanks for making the videos. It's helpful to refresh what I’ve read and get a few tips too.
@billhurt3644 1 year ago
This might be the best string and UTF-8 encoding video I’ve ever seen. So many experienced, professional programmers really do not understand how strings actually work, even in their own language of choice. And it truly did demystify for me Rust's behavior around the different string types. Much appreciated.
@abhishekdas829 2 years ago
Probably the best explanation of string, utf-8, ascii types I’ve encountered in my 15 year career! Keep up the good work!
@dabzilla05 2 years ago
This was driving me crazy last week. Really glad to see good, thorough showcase of this concept in Rust. Appreciate your content! Keep at it, this will be big when Rust blows up.
@Mehraj_IITKGP 11 months ago
Here is a brief summary of UTF-8 encoding:
- UTF-8 (Unicode Transformation Format, 8-bit) is an encoding scheme, just like ASCII, for representing Unicode characters.
- In UTF-8, ASCII characters are represented as themselves using a single byte, which means that any valid ASCII text is also valid UTF-8 text. Therefore, UTF-8 is backward compatible with ASCII.
- Characters that require more than one byte are encoded using a sequence of two to four bytes.
- A code point refers to a numerical value assigned to each character or symbol in the Unicode standard.
- Code points are written in hexadecimal notation and are typically prefixed with "U+" to distinguish them from other numerical values.
- For example, the character "é" (Latin Small Letter E with Acute) can be written as the single code point U+00E9, which UTF-8 encodes as the bytes 0xC3 0xA9. It can also be written as two code points, the base character "e" (U+0065) followed by the combining acute accent (U+0301), which UTF-8 encodes as the bytes 0x65 0xCC 0x81.
- A grapheme refers to a visual unit of a written language. It represents a single user-perceived character, which may be a combination of code points that are displayed together.
- In a Unicode-unaware string, the len() function returns the number of bytes, not the number of characters.
- In a Unicode-aware string, the len() function returns the number of characters.
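The difference between bytes, code points and what a reader perceives as one character can be checked directly in Rust. A minimal sketch, where `byte_and_char_counts` is a name made up for this example:

```rust
// Returns (number of UTF-8 bytes, number of Unicode scalar values).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{00E9}"; // "é" as one code point
    let decomposed = "e\u{0301}"; // "e" followed by a combining acute accent

    // Both render as the same perceived character but differ in memory.
    assert_eq!(precomposed.as_bytes(), [0xC3, 0xA9]);
    assert_eq!(decomposed.as_bytes(), [0x65, 0xCC, 0x81]);

    println!("{:?}", byte_and_char_counts(precomposed));
    println!("{:?}", byte_and_char_counts(decomposed));

    // String equality in Rust compares bytes, so the two forms are not equal.
    println!("{}", precomposed == decomposed);
}
```

Note that `str::len()` in Rust always counts bytes, even though `str` is guaranteed to hold valid UTF-8.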
@WizardOfArc 2 years ago
I learned something new about how UTF-8 works! Thank you!
@rebelmachine88 2 years ago
The super in-depth content I crave! Great video
@NOPerative 1 year ago
Def one of the best, light weight (and logic dense) videos I've seen regarding rust string types from a practical standpoint (concerning aspiring Rustaceans). Excellent vid.
@danielhadad4911 2 years ago
Your channel is so cool, thanks for putting the effort in making these tutorials.
@carlesxaviermunyozbaldo4782 2 years ago
Great explanation of this important topic in Rust. Thank you very much!
@chanhhua7050 2 years ago
Wow, this is extremely useful! I come from the Java world and it really confused me when working with Rust strings, especially when I needed to deal with ideographic characters. Thank you!
@DavidTorralbaGoitia 1 year ago
Amazing content for a beginner! Super helpful! Thank you very much! 🙌
@noblenetdk 1 year ago
splendid and thoroughly explained. Bravo!
@GolangDojo 2 years ago
Shit just got serious
@hermannpaschulke1583 2 years ago
Most people say strings in rust are complicated. For me, it makes a lot of sense how it's handled. I do quite a bit of C programming on µCs, and there everything is char*
@sconosciutosconosciuto2196
2 years ago
Is rust bad for microcontrollers?
@saadisave
2 years ago
@@sconosciutosconosciuto2196 Embedded Rust usually doesn't have the full standard library. It has parts of std called core and alloc. If your resources are limited, you can use a CString, which is null terminated and equivalent to char*.
@jamesbond_007 1 year ago
Finally! A great explanation of what's going on between string slices and Strings -- thank you! I also appreciated your delving into Unicode encoding -- I was worried you were going to rabbit hole on binary representations of characters, but you did exactly the right thing in terms of explaining how Unicode encoding works, and how it solves the "where am I" problem when you have a pointer to an arbitrary byte in a Unicode string (i.e. "am I at the start?" "where is the next char boundary?") -- I love that you explicitly mentioned that the first byte of a multibyte sequence is distinguishable by its high-order bits, and that it encodes the length of the sequence. You mentioned indexing into a Unicode string was a linear operation, which is true, but the number of steps is proportional to the number of characters, not the number of bytes -- if you have 4 4-byte Unicode chars in a string, traversal takes only 4 operations, not 16, due to this clever encoding of the first byte.
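The "hop from leading byte to leading byte" idea in this comment can be written out by hand. This is a sketch for illustration only, not how the standard library implements `chars()`; `utf8_seq_len` and `count_by_hopping` are invented names.

```rust
/// Length in bytes of the UTF-8 sequence that starts with `first`,
/// or None if `first` is a continuation byte or can never start a sequence.
fn utf8_seq_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1), // 0xxxxxxx
        0xC2..=0xDF => Some(2), // 110xxxxx (0xC0 and 0xC1 would be overlong)
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF4 => Some(4), // 11110xxx (larger values exceed U+10FFFF)
        _ => None,              // 10xxxxxx continuation bytes and invalid bytes
    }
}

/// Counts code points by reading only the first byte of each sequence.
fn count_by_hopping(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut i = 0;
    let mut hops = 0;
    while i < bytes.len() {
        // A &str is always valid UTF-8, so position `i` holds a leading byte.
        i += utf8_seq_len(bytes[i]).expect("str is valid UTF-8");
        hops += 1;
    }
    hops
}

fn main() {
    let crabs = "\u{1F980}\u{1F980}\u{1F980}\u{1F980}"; // four 4-byte characters
    println!("{} bytes, {} hops", crabs.len(), count_by_hopping(crabs));
}
```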
@malharvora1281 2 years ago
Nice and useful intro to Rust strings. Thanks :)
@JeremyChone 2 years ago
Very nice in depth video.
@RufusROFLpunch 2 years ago
This was great. I wish I had this when I was first learning Rust.
@0xedb 2 years ago
This earns you the title of professor. Hardly seen anything better explained than this!
@muffledcry 1 year ago
This is a great guide to a subject that has given me a ton of grief in Rust. Awesome job!
@alexeyermolaev6002 1 year ago
Awesome content, thanks for your work!
@johnyepthomi892 1 year ago
Namaste brother. Your videos are too good. Keep this format going, tackling each topic standalone or mixed if it’s contextually relevant.
@davidtremaine8076 1 year ago
You have such great videos. Learning a lot.
@jeremietamburini 1 year ago
Fantastic explanation, thank you very much! 👍
@thesuperyou2829 2 years ago
what an effort man.... hats off
@dieust 1 year ago
Really helpful, thanks !
@skyeplus 2 years ago
The thing I like about Rust is you can take a buffer out of one type and transfer it to another. For instance, you can convert between String, Vec<u8> and Box<str>, while keeping the same underlying buffer without reallocation and copying. I wish C++ had this. It has node transfer in limited form for lists and, in newer standards, for maps.
@SolomonUcko
1 year ago
Note that converting to a `Box<str>` reallocates if the length is less than the capacity, but other than that, owned conversions just transfer ownership of the existing allocation.
@skyeplus
1 year ago
@@SolomonUcko oh, OK. Good to know. I'm still at the very beginning of learning Rust.
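The buffer hand-off discussed in this thread can be observed by comparing pointers before and after the conversions. A small sketch; `round_trip_keeps_buffer` is a name made up for this example.

```rust
// Converts String -> Vec<u8> -> String and reports whether the heap
// buffer stayed at the same address the whole time.
fn round_trip_keeps_buffer(s: String) -> bool {
    let before = s.as_ptr();
    let bytes: Vec<u8> = s.into_bytes(); // takes over the String's buffer
    let back = String::from_utf8(bytes).expect("bytes came from a String");
    back.as_ptr() == before
}

fn main() {
    println!("{}", round_trip_keeps_buffer(String::from("hello")));

    // Box<str> has no spare capacity, so this conversion may shrink the
    // allocation first when capacity is larger than the length.
    let mut s = String::with_capacity(64);
    s.push_str("hello");
    let boxed: Box<str> = s.into_boxed_str();
    let s_again: String = boxed.into_string();
    println!("{s_again}");
}
```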
@Pilosofia 1 year ago
you are the best one to explain the difference between the two.
@__abhish 2 years ago
really nice content man. Keep it rusty xD
@primingdotdev 2 years ago
awesome video, just a small thing. Newer programmers may be confused by the lookup vs search times for a character. For UTF-8 (or any variable length encoding) if you want to lookup the nth scalar or grapheme you need to do a linear walk through of the string to count off every time you get to the end of a sequence of bytes representing a scalar/grapheme but for a fixed length encoding (runes, UTF-32) you can rely on each unit (scalar/char/grapheme) being a fixed size and you can just skip (n - 1) * 4 bytes (UTF-32 uses 4 bytes per scalar) to land in the right place. Not sure if I just confused a bunch of people or added clarity for some.
@SolomonUcko
1 year ago
UTF-32 has fixed length code points/units, but not grapheme clusters
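The lookup cost difference described in this thread can be seen in Rust by comparing a walk over a `&str` with indexing into a `Vec<char>`. A minimal sketch; `nth_scalar` is a hypothetical helper name.

```rust
// Finds the n-th Unicode scalar value by decoding from the start: O(n).
fn nth_scalar(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let text = "na\u{EF}ve caf\u{E9}"; // "naïve café"

    // Decode once into fixed-width 4-byte units, then index in O(1).
    let fixed: Vec<char> = text.chars().collect();

    assert_eq!(nth_scalar(text, 2), Some('\u{EF}'));
    assert_eq!(fixed[2], '\u{EF}');

    // The price of constant-time indexing is memory: 12 bytes versus 40.
    println!("{} bytes as UTF-8", text.len());
    println!("{} bytes as Vec<char>", fixed.len() * std::mem::size_of::<char>());
}
```

As the reply points out, either approach addresses code points only; a grapheme cluster can still span several `char` values.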
@prueba875 1 year ago
Very interesting video thanks!
@exhaustedrose 1 year ago
Thanks for this refresher, I was getting pretty rusty.
@rustlabs7932 2 years ago
Simply brilliant !!!
@johnandrews5414 2 years ago
Outstanding video.
@dreastonbikrain1896 1 year ago
Thank you, you demystified Unicode for me, now I see how I would implement some of the UnicodeSegmentation crate myself :)
@nofaldiatmam8905 9 months ago
Learned more about UTF, bytes and strings in a 20 min video than in my 4 years of uni, thanks man ✌️
@mumk 1 year ago
FINALLY, thanks bro
@JakobKenda 1 year ago
2:04 you could create an array of chars or a Vec<char>, since chars are 4 bytes long.
@rtdietrich 5 months ago
Great Video!!!!
@viacheslav1392 2 years ago
Привіт світ ("Hello world") - was great!)
@emvdl 2 years ago
Thanks! 🤙
@idiot7leon 2 years ago
Thanks!
@tobi9648 6 months ago
I really like all of your videos, they are among the best out there. You mentioned that you came from a JavaScript background. How did you become a pro Rust developer? Did you learn the language privately and then search for a job, or was it a fluid transition within the company you worked in? I'm a professional C#/TypeScript/JavaScript developer and would like to jump on the Rust train :-)
@partisan-bobryk 2 years ago
🇺🇦 Thank you for the explanation and examples!
@menardmaranan9356 2 years ago
What's the extension you're using for type autocomplete?
@jonathanmoore5619 2 years ago
Let's get goddamn rusty!
@Captainlonate 2 years ago
How do you get VSCode to show the type annotations for let bindings automatically?
@joehsiao6224 1 year ago
Great content! Can you do a video of chars in Rust and unicode scalar values? After hours of searching on the Internet, I am still confused.
@coolbrotherf127 1 year ago
I started with creating char arrays in C back in the day so this isn't too crazy compared to that.
@shukterhousejive 2 years ago
How does Rust handle interoperability between string implementations (OSString, CString, a hypothetical UTF-32 String etc.)? Is there enough compiler sugar to pass a reference to an alternate String type to &str, or is there an "IString" trait you can implement or is there a lot of myString.as_str() involved?
@valthorhalldorsson9300
2 years ago
it’s always manual conversions, though you have a lot of options depending on what you need - rust is super strict about not performing expensive conversions automatically (especially ones that can fail, like converting a byte string to a utf-8 string)
@xrafter
2 years ago
usually you will see something like AsRef in the standard library
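The explicit conversions mentioned in these replies look roughly like the following with the standard library's `OsString`, `CString` and `CStr`. This is a sketch under the assumption that only std types are involved; `os_len` is an invented helper that shows the `AsRef` pattern.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};

// Accepts &str, String, OsString, &OsStr, PathBuf and more through AsRef.
fn os_len(s: impl AsRef<OsStr>) -> usize {
    s.as_ref().len()
}

fn main() {
    // &str -> OsString cannot fail.
    let os: OsString = OsString::from("report.txt");
    // OsString -> String can fail, so the result type says so.
    let back: Result<String, OsString> = os.clone().into_string();
    println!("{:?}", back);

    // &str -> CString fails if the text contains an interior NUL byte.
    let c: CString = CString::new("hello").expect("no interior NUL");
    println!("{}", CString::new("a\0b").is_err());

    // &CStr -> &str fails if the bytes are not valid UTF-8.
    let c_ref: &CStr = c.as_c_str();
    println!("{:?}", c_ref.to_str());

    println!("{}", os_len("abc"));
    println!("{}", os_len(os));
}
```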
@jimshtepa5423 2 years ago
Bogdan, thank you! Very cool material
@minecrafter8863 2 years ago
Hey bro, thanks for video. Do you use rust for blockchain development? The guys at Solana would love to have a series on that development on solana!
@qwerwasd1724 3 months ago
I'm a little late, but what is the binary? And how does it differ from the stack and heap?
@mr.x5582 10 months ago
goated content
@tanuvishu 2 years ago
This is the first time I really understood UTF-8
@winsonleow9660 1 year ago
I am trying to switch over as well from nodejs, still struggling with rust..
@yapayzeka 1 year ago
this channel is a blessing. best explanation out there. thank you
@dibyojyotibhattacherjee4279 2 years ago
Hey do u get the warning that says something like could not access incremental compilation directory?.
@letsgetrusty
2 years ago
Nope
@proloycodes
2 years ago
looks like you are using termux cuz i also get that
@theana5550 2 years ago
Okay... it's been about a month, but you can just use a Vec<char> for constant time lookups.
@letsgetrusty
2 years ago
A char does not equal a user perceived character. 1 user perceived character can be multiple chars.
@theana5550
2 years ago
@@letsgetrusty yeah, true
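The point made in the reply can be checked with the Hindi greeting "नमस्ते". The sketch below spells the word with escapes so that each code point is visible; `to_scalars` is a name invented for this example.

```rust
// Decodes a string into a vector of Unicode scalar values.
fn to_scalars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // "नमस्ते" written as its six code points.
    let namaste = "\u{928}\u{92E}\u{938}\u{94D}\u{924}\u{947}";
    let chars = to_scalars(namaste);

    println!("{} bytes", namaste.len()); // each code point takes 3 bytes here
    println!("{} chars", chars.len());

    // Index 3 is the virama U+094D, a combining sign. A reader would not
    // point at it as a character of its own, so constant-time indexing into
    // a Vec<char> still does not address user-perceived characters.
    println!("{:?}", chars[3]);
}
```

Splitting by user-perceived characters needs grapheme cluster segmentation, which the standard library does not provide.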
@foobar1269 2 years ago
String in Rust made my head spin. Coming from Ruby and Python string is very easy.
@alagaika8515
1 year ago
I have been using Python since the Python 2 days and while it seemed simpler by trying to automatically convert between unicode and bytes, this was a source for really confusing errors. In fact, Python 3 became more strict in this area by introducing the same separation between bytes and (character) strings that Rust is essentially using - which saves programmers from a lot of hard to track errors. The difference is that in Python, indexing into a string now counts the characters, which has an unexpected complexity, while Rust counts the bytes and then checks that the result is valid.
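The "count the bytes, then check that the result is valid" behavior of Rust described here can be demonstrated with `str::get` and `str::is_char_boundary`. A minimal sketch; `safe_prefix` is a hypothetical helper written for this example.

```rust
// Returns the longest prefix of `s` that is at most `max_bytes` long
// and ends on a character boundary.
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let s = "Привіт"; // six Cyrillic letters, two bytes each

    println!("{}", s.len());          // byte length
    println!("{}", &s[0..2]);         // the first letter
    println!("{:?}", s.get(0..1));    // None: byte 1 is inside a character
    println!("{}", safe_prefix(s, 3));
    // `&s[0..1]` would panic at runtime instead of returning garbage.
}
```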
@skyeplus 2 years ago
As far as I understand, in Go you take a string, which is a slice of bytes, and build another slice of int32, one per Unicode code point. So it's nothing special or advantageous. You just paid for a full string decoding once. This would be equivalent to chars().map(|c| c as u32).collect::<Vec<u32>>() in Rust. But because iterators in Rust are lazy, you don't pay for creation of a new vector each time unless you explicitly want to.
@proloycodes
2 years ago
shouldn't it be `.chars().map(|x| x as u32).collect::<Vec<u32>>()`?
@skyeplus
2 years ago
@@proloycodes You're correct. Fixed.
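The expression from this thread, next to its lazy counterpart, might look as follows. `to_runes` is an invented name, chosen only to echo Go's terminology.

```rust
// Eager: decodes the whole string and allocates a vector, like []rune(s) in Go.
fn to_runes(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect::<Vec<u32>>()
}

fn main() {
    let s = "h\u{E9}llo"; // "héllo"

    let runes = to_runes(s);
    println!("{:#x}", runes[1]); // constant-time lookup after paying for decoding

    // Lazy: the iterator decodes only as far as the search needs to go
    // and allocates nothing.
    let first_non_ascii = s.chars().find(|c| !c.is_ascii());
    println!("{:?}", first_non_ascii);
}
```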
@driftwood-f4p 8 months ago
How do you enter emoji in your code?
@johnpett524 2 years ago
Excuse me! Can you tell me how your vscode can show type of rust variable on the left side? Thanks a lot
@OhMyYasmine
2 years ago
@Let's Get Rusty I was wondering the exact same thing
@antoniojohnson7693
1 year ago
That's the rust language server.
@antoniojohnson7693
1 year ago
Or should I say, Rust Analyzer extension.
@salvadorvillarreal1643 1 year ago
I know about the stack and the heap (and the register), but I've never heard of the "application's binary". What is it? Do you have another video where you explain it? Thanks!
@dynfoxx
1 year ago
There are a few different ways a program can get read-only or read-write memory. Your executable normally contains chunks, or describes chunks, of memory that it needs. The OS then copies or creates these chunks from your program and marks them as read-only or read-write. They are different sections of the program than the heap and stack. That's the basics, hope it makes sense.
@salvadorvillarreal1643
1 year ago
@@dynfoxx Definitely! Thank you!
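In Rust terms, the memory baked into the executable is where string literals live, which is why they have the `'static` lifetime. A small sketch of the three places string data can sit; the function name `literal` is made up, and the exact section the bytes end up in depends on the platform and linker.

```rust
// The bytes of a string literal are part of the compiled program, typically
// in a read-only data section, so the reference is valid for the whole run.
fn literal() -> &'static str {
    "hello"
}

fn main() {
    let in_binary: &'static str = literal();

    // to_owned() allocates on the heap and copies the bytes there.
    let on_heap: String = in_binary.to_owned();

    // A fixed-size byte array is stored inline, here in main's stack frame.
    let on_stack: [u8; 5] = *b"hello";

    // Same contents, different locations in memory.
    assert_eq!(in_binary.as_bytes(), on_heap.as_bytes());
    assert_eq!(in_binary.as_bytes(), on_stack);
    assert_ne!(in_binary.as_ptr(), on_heap.as_ptr());
    println!("{in_binary} {on_heap}");
}
```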
@available898 1 year ago
interesting that you see ASCII as a map from integers to characters and not a map from characters to integers :)
@736939 1 year ago
"Application's binary" - is that the static memory?
@TheRealWinsletFan 1 month ago
The proliferation of code pages was an issue many years before the world wide web hit critical mass. Not everyone developed code to run in only one Country/Region.
@hytryi_huy 1 year ago
10:37 wow that's so cute, thank you))
@ksnyou 2 years ago
18:03 Namaste!
@Maaruks 1 year ago
What is string slice?
@hoverpillow6106 9 months ago
Thank you for your work.
@kishanbsh 2 years ago
Definitely learnt unicode.. thanks!!.. wondering how this complexity is being hidden in other languages 🤔..
@xrafter
2 years ago
They usually hide this checking and validation away from you . in js for example the engine will do it .
@trejkaz
1 year ago
@@xrafter In JS they just let you make the mistakes. "🤔".length == 2, for example. And to split grapheme clusters, you will have to go find an external library, because it can't do that at all. The only thing other languages can do is streamline the usage of the complex stuff. In Elixir, for example, they hide the concept of string normalisation by making string equality work correctly out of the box. They still have an iterator over grapheme clusters but it's in the core library so you don't have to pull a separate dependency to get it. Older languages as a rule just get string handling wrong and push the complexity to the developer. It isn't so much hidden, as completely absent.
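The `"🤔".length == 2` result in JavaScript comes from counting UTF-16 code units. Rust can report the same string in all three units, which makes the difference visible. A minimal sketch; `lengths` is an invented helper.

```rust
// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values).
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    let thinking = "\u{1F914}"; // the thinking-face emoji from the comment
    let (utf8, utf16, scalars) = lengths(thinking);
    println!("UTF-8 bytes: {utf8}");
    println!("UTF-16 code units: {utf16}"); // the number JavaScript reports
    println!("scalar values: {scalars}");
}
```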
@Maaruks 1 year ago
what is byte?
@Chastor97 1 year ago
wow. I'm from JS too
@marcorodrigues1331 2 years ago
What intrigues me is that, according to that explanation, in UTF-8 we should have 2,164,864 possibilities and not 1,112,064. So let's see:
- 1-byte characters (0xxxxxxx): 2^7 = 128 possibilities
- 2-byte characters (110xxxxx 10xxxxxx): 2^11 = 2,048 possibilities
- 3-byte characters (1110xxxx 10xxxxxx 10xxxxxx): 2^16 = 65,536 possibilities
- 4-byte characters (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx): 2^21 = 2,097,152 possibilities
Total possibilities of characters: 128 + 2,048 + 65,536 + 2,097,152 = 2,164,864
So why 1,112,064?
@MalleusImperiorum
2 years ago
In 2003, RFC 3629 restricted UTF-8 to code points up to U+10FFFF and excluded the surrogate range U+D800..U+DFFF, to make UTF-8 compatible with UTF-16.
@marcorodrigues1331
2 years ago
@@MalleusImperiorum thanks for the clarification!
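The figure 1,112,064 follows from the two restrictions named in the reply: 0x110000 code points (U+0000 to U+10FFFF) minus the 2,048 surrogates. The larger total in the question also counts bit patterns that decode to values above U+10FFFF and "overlong" patterns that repeat a value already covered by a shorter form. The arithmetic can be verified in Rust, since `char` accepts exactly the valid scalar values; `count_scalar_values` is a name made up for this check.

```rust
// Counts every u32 in the Unicode range that Rust accepts as a char.
fn count_scalar_values() -> usize {
    (0..=0x10FFFF_u32)
        .filter(|&n| char::from_u32(n).is_some())
        .count()
}

fn main() {
    let code_points = 0x10FFFF_u32 + 1;       // 1,114,112
    let surrogates = 0xDFFF_u32 - 0xD800 + 1; // 2,048
    println!("{}", code_points - surrogates);
    println!("{}", count_scalar_values());
}
```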
@alexandershemelin6605 2 years ago
cool man
@fabianmallmann4834 1 year ago
Good stuff! Just getting into Rust as a mainly JS-Dev and this is by far the best content out there! One question: What VS-Code extension are you using, that shows these greyed out type-annotations and function-parameters?
@nic37ry
1 year ago
It's built-in in vs code and it's generated by the Language Server and it's called inlay hints
@PeterPerhac 1 year ago
Y U no have THANKS button? I would have bought you a pint! Thanks mate, fantastic explanation. Had some aha moments. Thanks again
@Maaruks 1 year ago
ACII ? what is ascii?
@abcxyz5806 1 year ago
Why is there both to_string() and to_owned()?
@Maaruks 1 year ago
what is emoji?
@dorktales254 2 years ago
Nice shirt
@Maaruks 1 year ago
please teach me about integers? what are integers?
@LinuThomas 5 months ago
Nice Video :) GOD Help
@trejkaz 1 year ago
Rust exposing the complexity of strings isn't because it's a low-level language - it's because it's a modern language. Working with strings in any language, you're expected to deal with the complexity of grapheme clusters. The only difference here is that they're strongarming you a bit into handling them correctly by removing some of the common ways people do things incorrectly. You gave JavaScript as an example, but the only reason you don't see this in JavaScript is that they simply give you no way to do things correctly in the first place. If you take Java, which is of a similar sort of vintage, it does have routines in the standard library to deal with it. Elixir is another modern language which also tries to implement strings more correctly - in the case of Elixir, though, string equality is implemented correctly by default (unlike in Rust), and you're allowed to iterate grapheme clusters without pulling in a dependency (unlike in Rust).
@milind_patil 2 years ago
I love the Namaste 🙏
@Maaruks 1 year ago
Whats is STR?
@chrisalexthomas 1 year ago
You skipped straight past the original solution to the 128-character limit, which was code pages and encodings. UTF-8 was the better solution that came out of this, but at first UTF-8 was not widely used. UTF-16 was, but the problem with it was that it doubled the amount of memory required to represent every character, even those that could be represented by ASCII. So it was not popular, because computers back in the 90's had such limited memory. UTF-8 became popular in the early 2000's, but the problem was that it had not been standardised on and everybody was still using other character encodings. So there was this fun problem of writing data into databases in the wrong encoding and then decoding it in the wrong way too, leading to all sorts of fun jobs for me fixing people's databases, which nobody really understood how to fix. These days this problem is all fixed (I guess, right? right guys?). But yeah, just wanted to write about how UTF-8 was NOT the original solution to the ASCII problem.
@romanmahotskyi6898 2 years ago
Thank you (:
@Maaruks 1 year ago
whats is binary?
@gamer-gw9iy 2 years ago
Let's get rusty!
@farleylai110 1 year ago
Why expose the complexities of UTF-8 encoding to programmers when most of the time it is user-perceived chars they deal with?