Ep.3 - Of All The Algorithms In All The World [Machine Learning]

Updated: Mar 13

In this episode Holly rants about America trying to find Russian missile silos, why self-driving cars sometimes drive into cyclists, and how to predict house prices.



The amount of times I've like gone out in my car to IKEA or something, and I've thought...I am gonna kill a cyclist today. Oh, do you know what we should do? We should- can- can we use machine learning to create us an intro?


You absolutely can use machine learning to create us an intro, on account of the fact that we use machine learning for everything else.




So ...


I think that would be, I think that would be really relevant.


Machine learning to, uh, do the transcription. Machine learning to do the intro. And also we could use machine learning to, to write the script of the episode because it would probably be as coherent as the last word series that I, that I said. The last episode...I wasn't very coherent in some of my words.


Yeah I wasn't either, really. I think you spent about 18 hours editing all of my, like word breaks and like ums and stuff and fillers out of that.


It was 15 hours, the last edit.


Oh my God.


15 hours in total. That was far more than it needed to be. And it was far tighter in the end than it needed to be, but I was kind of like trying to set a line in the sand.


Yeah. Where I didn't sound like a complete idiot. I appreciate that. Thank you, by the way. <laugh>


It was a bag of words.


<laugh> That's me, bag of [words].


That's a machine learning joke, we'll, we'll get to that.


Uh huh, okay, cool. So, um, I don't actually know like anything at all about machine learning. So can you give me like a quick intro to what machine learning actually is?


I can give you a quick intro into what machine learning actually is, but one of the points that I wanted to make from this episode was you don't actually need to know what machine learning is to get the benefit from it. And this is one of the big things that I'm hopefully gonna try and cover in this episode is the fact that people talk about machine learning as if it's very difficult, as if it's difficult to get into, difficult to understand. No doubt at some point we'll mention my exam. And one of the things that I was guided on prior to taking my machine learning exam was, there's a lot of math in it. Like it's very, very math heavy, and it wasn't anywhere near as bad as I was expecting it to be - in part because people had like , um , hyped it for that. But yeah , I'll try and do a concise explanation of what machine learning is. Machine learning is a subset of artificial intelligence, where we are aiming to use learning algorithms, to automate actions. That is about as good as it could get.


Okay. So what kind of applications do we have for this then? What sort of actions are we looking to automate when we use it ?


This is the thing, because machine learning is such a broad field now, there's so many different applications that it's difficult to try and group them in, in one way. I , I try to think of different ways of kind of summarizing the field of machine learning into like one or two representative examples. And as far as I can see there , there really isn't a representative example, 'cause the field is so big now, but one of the places I guess we should start the discussion is, machine learning is useful for making automated predictions. So if you want a good example of where machine learning could be used - it could be something like predicting house prices, where you have existing data and you want to make a prediction from that existing data.


Okay. So say on like Rightmove or something, when you're looking for, um, a particular property that you wanna buy and it says, other houses in this area are worth this much money? Is that like an application of that? Or is that just drawing a dataset and averaging it?


It depends of course, because it could be, they are just taking things like the prices on your street that are most recent. So say from this postcode or this general area, what are the last five house prices take an average of that, that wouldn't be machine learning. That is just an average, but predicting house prices is an example of a problem that is well suited to machine learning. So that's one of the places when you're looking at, at, machine learning, be it like learning how it works, learning what it is or if you're a business that's looking to gain the benefit of machine learning is, do we have a machine learning problem? Do we have a problem that machine learning is well suited to solve? And I think very often when people look negatively at machine learning, there's a lot of reasons that people look negatively at machine learning, but one of them is the problem that they're trying to solve is not well suited to machine learning. For example, the description that I gave a second ago , making predictions based on prior data, you need prior data, or you need the ability to gather that prior data . So if you don't have it, you , you have nothing to base the prediction off.
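To make the distinction above concrete, here is a minimal sketch in Python. It assumes a tiny made-up list of recent sales and a single hypothetical feature (floor area); neither comes from the episode. The first function is the plain average of recent prices, which is not machine learning. The second fits a straight line to the prior data with ordinary least squares and then uses it to predict a price for a house it has not seen.

```python
from statistics import mean

# Hypothetical prior data: (floor area in square metres, sale price).
# These numbers are invented purely for illustration.
recent_sales = [
    (70, 210_000),
    (85, 255_000),
    (100, 300_000),
    (120, 360_000),
    (150, 450_000),
]


def average_price(sales):
    """Not machine learning: just the mean of the most recent sale prices."""
    return mean(price for _, price in sales)


def fit_line(sales):
    """Learn a slope and intercept from prior data (ordinary least squares)."""
    xs = [area for area, _ in sales]
    ys = [price for _, price in sales]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in sales) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_price(model, area):
    """Use the learned line to predict a price for an unseen floor area."""
    slope, intercept = model
    return slope * area + intercept


model = fit_line(recent_sales)
print(average_price(recent_sales))  # same answer whatever the house looks like
print(predict_price(model, 200))    # depends on the feature of the new house
```

The average gives every house on the street the same number, while the fitted line responds to the feature it was trained on. With no prior sales at all, neither approach has anything to work from, which is the point made above about needing prior data.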


So not really great for novel problems, then?


It depends what you mean entirely by novel problems. There are some problems that machine learning has effectively solved or certainly made, um, huge ground towards, that prior to machine learning would've been considered novel problems and are now very, very simple. An example of that would be handwriting recognition. Handwriting recognition is a great problem for machine learning. For one thing, it has that, that initial step that I said of you need the data, we just get a lot of people to write things <laugh> and then you've got a lot of handwriting data and, and you can step through that process. Now I think handwriting recognition, especially on mobile devices, is kind of just known to be a solved problem. Most mobile devices have some kind of handwriting recognition as, as a feature of them. So I think people would now look at handwriting recognition as not a novel problem, but before machine learning it would've absolutely been a novel problem. So yeah, it depends a little bit what you mean by that term.


Okay. And so say like handwriting prediction, do you mean like using the Apple Pencil on an iPad, for example?


Yeah. So, um, really any of those systems and, and they all work, uh, roughly in the same way, but yeah, that is, that is a prime example. On iPads, you can use the Apple Pencil, you can write something and then either it can save effectively a vector of what you've written or it can convert that into text. So when I say handwriting recognition, that's what I mean, you know, you're writing something by hand and then the computer is turning that into typed text.


And how well does that typically work? Do you have any kind of biases in the, the data sets? Are there any issues with like reliability of data?


No, it works perfectly!




What are you getting at?


Absolutely nothing.


So I guess what , what is being hinted at here is I recently stumbled into a problem with, with the Apple Pencil. And this is indicative of a bigger problem with, with artificial intelligence systems in general. And <laugh> , it's like a whole tangent we could go down in terms of like, I have RSI and one of the ways that I manage having RSI is using inputs other than a keyboard. So the Apple Pencil is great for that because it allows me not to make the problems I have with my fingers worse. And I use the Apple Pencil and I'm very sorry to anybody who's on Android, but the entire ecosystem in this room is-, it's all Apple, I use the Apple Pencil other pencils do exist. <laugh> I use the Apple Pencil, I use it all the time and I use it at least a couple of hours every day. And I have very, very few problems with it. But recently I had an interesting problem, which to me is kind of whilst , I don't know , because I don't know the, the , um , backend systems that Apple use. It looks very much to me like a machine learning problem that's called overfitting. So we can get into what that is in a second, but in short , I can try to write a certain word and my Apple Pencil will absolutely never recognize the word that I'm trying to write. It always recognizes a different word. And I got very frustrated at this and I made a , a 48 second video of me just trying to write this word and the Apple Pencil continuously , uh , failing to, to recognize it. And I posted that to, to Twitter, 'cause that's where you put all of your frantic videos and internet rants. And the thing with that was an awful lot of people would jump on that immediately and be like, oh, artificial intelligence is useless. Oh, you know, they said machine learning was gonna take over the world and look at how it fails. The word that I was trying was the French word for yes, Oui. 
So O U I and the Apple Pencil was always picking up Q U I and we'll get into like what that might be and what overfitting is in a second, but it looked funny, and it looked funny in that kind of 48 second snippet that I'd posted. And I think the big thing that it misses is, I have used the Apple Pencil for hundreds of hours and don't have these problems, but as soon as there is any issue, it's like, oh, artificial intelligence is useless. And I think that's something that we very, very commonly see with artificial intelligence systems, where if it makes any mistake, people then say, oh, it's useless and we should throw it out. Even if it's been operating suitably for a long time, or if it's been even operating better than a human could operate and people will ignore all of that prior evidence and just say, oh, it's useless.


I think a lot of that is about marketing in technology to be fair. Like you always get kind of a new concept that comes along and then people lose interest in whatever the previous hot topic or big thing was. And I think in recent years it was kind of machine learning and artificial intelligence, particularly in things like antivirus software and then people kind of ditched that and jumped on the blockchain train <laugh> so yeah. Yeah. I think, um, people are quite quick to kind of throw the towel in when something's flawed.


I think this is the same problem though, where people ascribe to sentences meanings other than are included in the words of the sentence. So a , a product vendor might say we use artificial intelligence, which might be true if they're using machine learning, machine learning is a subset of artificial intelligence, it can be true that they're using artificial intelligence, but some people will see that marketing material and jump on it and say, oh, but artificial intelligence doesn't exist, assigning a different definition to that word than was intended. Or they will talk about it as if like, oh, there's no such thing as a generalized intelligence. And it's like, that's not what that marketing material is saying. They're literally just saying that they're using artificial intelligence , uh , in the same sense that I'm sure the Apple Pencil is. A software product that, that I work on uses machine learning in the backend. We don't ever shout about that. We don't ever really talk about the , the way the machine learning works. Not because of any like intellectual property reasons or anything like that. But anytime you mention machine learning, people immediately throw the product out and say, oh, this is , this is useless. It's got this fake artificial intelligence thing. And they think they've built Androids that have generalized intelligence and like , no, we just use machine learning in the backend. It's actually really good. <laugh>


<laugh> Well, what do you use it for?


Uh , yeah . So we have a huge amount of , um , I guess I should start the beginning. What is the software product? It's a vulnerability management platform, right? So we have data about vulnerabilities that we've discovered and we have data about systems that we are working on. And one of the things that we might want to do is predict changes in those systems. So you could imagine, for example, if you're monitoring many tens of thousands of systems and you want to know, have any of those systems changed that vague definition can be difficult to work with and also having a human step through all of those many tens of thousands of systems to individually check, if any system has changed, can, can be difficult. But if you have a huge amount of data of what that system looked like previously, and you feed that into a learning algorithm, that learning algorithm is able to make predictions based on new data it gets , it might be able to predict that a system has changed. If a system has changed, its security stance might have changed. So that could be an indicator that you want to assign a human to that, to then take a look at it. So that's one of the things, or if you have , um, found a vulnerability in a system and you know, an awful lot about this specific system's configuration, you might want to look through the data set of all of the other systems that you're monitoring and say, are there any systems in here that are similar? This is a categorization problem. It's, it's different, but still a machine learning problem. Um , machine learning can be, can be very good at that as well. Show me things that are similar to this example that I've got here. So we use machine learning for that stuff, but yeah , very often, if any product dares mention artificial intelligence or machine learning, people throw the marketing material out saying, you know, this , this problem of, oh, they're pretending they've invented generalized intelligence. 
It's like, no, we're just using a learning algorithm, that's it.
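The "show me things that are similar to this example" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host names and configuration features are hypothetical, and the scoring uses a plain Jaccard similarity rather than a trained model, so it is not the method the product described above actually uses.

```python
def jaccard(a, b):
    """Share of configuration features two systems have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def most_similar(target_features, inventory, top_n=2):
    """Rank monitored systems by similarity to the system with the known issue."""
    scored = [
        (jaccard(target_features, features), name)
        for name, features in inventory.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]


# Hypothetical inventory of monitored systems and their observed features.
inventory = {
    "web-01": {"linux", "nginx", "openssl-1.1", "port-443"},
    "web-02": {"linux", "nginx", "openssl-3.0", "port-443"},
    "db-01": {"linux", "postgres", "port-5432"},
    "win-01": {"windows", "iis", "port-443"},
}

# Hypothetical system on which a vulnerability has just been found.
vulnerable = {"linux", "nginx", "openssl-1.1", "port-443", "port-80"}

print(most_similar(vulnerable, inventory))
```

A human can then be pointed at the top few matches instead of stepping through tens of thousands of systems by hand.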


Yeah. I went to a talk previously, I think BSides Manchester 2018. And it sort of touches on a couple of problems that you just mentioned and people absolutely do have this attitude that because a product uses machine learning or artificial intelligence that it needs to be kind of more capable than it actually is. And this talk was about NextGen antivirus software that used machine learning to predict whether a piece of software would be malicious or not. And effectively somebody had managed to circumvent this software by just feeding it bad data for training purposes, which completely broke the software.


Yeah. And the- there's a whole field of research in terms of adversarial applications of, of machine learning and trying to develop inputs for machine learning algorithms that will give bad outputs. So for example, just a kind of off the cuff example here of if you're developing a machine learning system to do facial recognition, people are obviously gonna wanna look at that system and say, is there any way that we can circumvent it. Antivirus systems that use machine learning? Absolutely people are gonna look at that and to see, is there any way that we can circumvent it or bypass it or whatever term you prefer there. And that can be both ways as well. So having legitimate software flagged as illegitimate - or malicious - and malicious software not being picked up as malicious, of course that's gonna happen. But yeah, I think where people draw the line for how good machine learning has to be is sometimes a little bit extreme. Like people talk about these systems as if they have to be 100% perfect and very often they really don't. You can get a benefit without it being perfect. I, I think one of the things like I'd love to talk about our, um, transcript from the, the podcast of, of how we put the transcriptions together. But I wanna pause that for, for a second and give you a really good example that I think people will disagree with and reasonable adults can disagree, but it's a really good example of this problem. How good do self-driving cars have to be until we should allow them on the roads? Now, what I would say, uh, as a, as an opening proposition, I'm not saying this is my opinion, but as an opening proposition, if you were to say machine learning, uh, or rather self-driving cars have to be as good as humans to be out on the road. That sounds on the face of it to be a sensible statement. The self-driving cars have to be as good as humans, but one of the things with that is humans are really bad at driving <laugh>.


Oh, I know. <laugh>


People drive whilst using mobile phones, they drive distracted, drive drunk, mess around with the satnav when they're supposed to be driving, humans are really bad drivers. So you, you might want to draw the line slightly higher than that. And so self-driving cars have to be better than humans. And then you start getting into problems of not to fall down the tunnel <laugh> of, uh, the trolley problem. But some people might draw that line that self-driving cars have to be perfect drivers in every situation before they would allow them on, on the roads. And that is a strange stance to me, a stance that I think some people do genuinely hold because, well, if they're better than people are, why don't we, why don't we move to the improved system?


Yeah. We sort of let perfect be the enemy of good in that situation. I think what I would do with that sort of problem is maybe plot sort of like quality of drivers, generally like a graph, maybe plot, um, car accidents, or like fatalities or anything, discard any data that was caused by someone say being distracted while they were driving or driving drunk or tired or something, because that would probably get rid of the vast majority of accidents. And then you would end up with broadly speaking, a barometer of what good would be for a self-driving car that is as safe as a normal, like safe driver who wouldn't drive kind of drunk or distracted or mess with their phone while they're in the car or something. And I think that would be a pretty good bar to set. It sort of, I guess, discards all of the bad data.


The problem with that in terms of a machine learning problem would be, how do you gather that data and who is in charge of discarding? Oh, that one doesn't count because it was a drunk driver or that one doesn't count because it was a distracted driver and those kinds of things also that might be possibly the , the wrong problem to solve. Because if you were to think of self-driving cars as replacements, and if you were to say, this person has gone to a bar and has gotten drunk, would we prefer them to drive themselves home or if their car was a self-driving car- I think in that situation where you're saying self-driving cars have parity with humans, they're as bad as the average human, you would prefer them to have a self-driving car 'cause in that situation for that isolated incident, that person is probably going to crash.


Yeah, sure.


So one of the ways of doing that can be effectively accidents per mile or I guess maybe that should be miles per accident, but you get the point that I'm looking at here is in terms of on average, how many miles can a person drive without an accident? And some people have several a year and some people can drive for a decade and have none. But on average, how many miles does the average person do before they have an accident? And then compare that to self-driving cars in terms of how many miles do they get through without having an accident? And in many instances, self-driving cars are already beating humans. Of course, in part, this is marketing, and in part this is selected data because we don't have level five self-driving cars. They can't drive in every instance, but where we have things like cars that can drive themselves on the motorway or on the highway, cars that can park themselves, those kinds of things, even that's beneficial. You know, I'm not saying that we, we should aim to have a perfect self-driving car that can drive in every instance before we allow them on the roads, even just, hey, I'm not very good at parking, my car can park itself. That's an improvement. We should look to allow those kinds of things.
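The miles-per-accident comparison described above is simple arithmetic, sketched here in Python. The mileage and accident counts are placeholder numbers invented for illustration, not real statistics, and the function name is an assumption of this sketch.

```python
def miles_per_accident(total_miles, accidents):
    """Average distance covered between accidents; infinite if there were none."""
    if accidents == 0:
        return float("inf")
    return total_miles / accidents


# Placeholder figures, invented for illustration only.
human = miles_per_accident(total_miles=1_000_000, accidents=4)
automated = miles_per_accident(total_miles=600_000, accidents=2)

print(f"human drivers:     {human:,.0f} miles per accident")
print(f"automated driving: {automated:,.0f} miles per accident")
print("automated looks safer" if automated > human else "human looks safer")
```

The caveat about selected data applies directly: if the automated miles were all driven on motorways and the human miles were driven everywhere, the two figures are not measuring the same thing, however clean the division looks.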


Yeah. I think it solves an accessibility issue as well for people who wouldn't be able to drive otherwise or who have mobility problems. It makes it a little bit easier for them to kind of get around, I guess. So, yeah. And the example that you gave before about, you know , if somebody had gotten drunk and had a self-driving car, would we prefer their car to be self-driving and for that to be how they got home.


Yeah. This is one of the problems. So of course we we're selecting data here. So you have a chosen data bias because of course somebody could equally pick a situation in which a self-driving car might misbehave and then say, well, in this instance, what would you prefer? So for example, a truck is driving down the road and the load that that truck is carrying is traffic lights and the self-driving car that is following that car full of traffic lights is going absolutely mental because it thinks there's 11 traffic lights in front of it . And it doesn't know what to do.


Was that a real example? Because I think I saw that like on Twitter a couple of months ago.


There's been several examples of this and we, we definitely should get into, um, the kinds of problems that can occur with, with machine learning. But yeah, we've seen self-driving cars, naming no particular brand, that have detected the moon as an amber traffic light. We've had them, um, following cars whose load is traffic lights getting confused at how many traffic lights there are in front of it. We have things like colored lights on the street. So maybe a green light or a red light just outside of a store that it has interpreted as a traffic light. You definitely can select problems like that. And you also have training set bias in terms of if you don't teach the system or if you don't, more accurately, supply it training data so that it can learn about a feature, it will not know about that feature. For example, if you do not teach your self-driving car what a cycle lane is, you'll have problems the first time it comes across a cycle lane, as one, naming no brands, self-driving car manufacturer found. Yeah, Uber-


Uber make self-driving cars?


Uber do make self-driving cars.




Yeah, this is one of the things. So, um, who are the biggest manufacturers of self-driving cars? Google, Uber, Tesla - not necessarily the brands that people expect.


I didn't actually know there was a Google car.


The, there there'll be some pedants in the audience, no doubt, who will say it's not Google, it's an Alphabet subsidiary and yeah, fine. It's the Google mothership.


Pedants. She said pedants, not peasants.


I'm not sure I did. <laugh> It's Waymo, is Google's. So Waymo is, uh, the, the name of the brand.


I'll have to have a look. Not that I want a self-driving car,


Not one that drives into cyclists.


I don't really like cyclists. I recently moved to London and I've had to drive in London a couple of times. And I just like- cyclists outside London seem to be like reasonably compos mentis and like, know what they're doing and have like a sense of danger and self-preservation. London? Absolutely not. No, crazy. The amount of times I've like gone out in my car to IKEA or something and I've thought, I am gonna kill a cyclist today and it's gonna be completely an accident, but I just can't do anything about it, 'cause they drive like they're crazy.


It's just the rage. You can't, you can't contain yourself. <laugh>


Me? <Laugh> I'm a really calm person.


I think something I wanna get on the record as well is I am not an advocate for self-driving cars. I am a machine learning advocate and self-driving cars just seems to be the, the go-to example of kind of like everyday applications of machine learning that that people are familiar with. But really that's only because it's- it's big and like attention grabbing because you could look at something like CAPTCHAs as being a system inherently tied to machine learning in several different ways that people bump into all the time.


Ah, so how does that work?


CAPTCHAs? There's there's a, a lot of different systems. What, what we're looking at here is an anti-automation protection. So where you want to set up a system that cannot be automated. So a simple example of that might be something like a login form. You don't want people to be able to automate the login process, 'cause you might be trying to protect against something like bruteforce attacks. That's one of the ways of doing that. There's the protections, rate limiting, account lockout, all of those kinds of things, but one of them is anti-automation and you might try and develop a system that humans can easily complete, but that computers can't with the intention of therefore blocking automated access. That is a harder problem than it seems. And we can talk about the technical side of it, but you actually mentioned earlier, the problem that I see with things like CAPTCHAs in many instances, and that is accessibility. If you have a system that is designed to keep computers out or I guess more accurately designed to be easy for humans to interface with, but difficult for computers to interface with - which humans? The most, I think when people think of CAPTCHAs, what they think of is the, the swirly text one where it says, you know, type the letters in the box and the letters have been obscured in some way, italicized, changed in colors. Maybe there's something overlaying them to obfuscate them, different- different lines through them, that kind of thing. Originally that was a difficult problem for computers to solve - optical character recognition. It can also be very difficult for somebody who's maybe using a screen reader or who ha- is maybe, uh, has sight issues, those kinds of things from accessing. So yeah, that's a whole different side of machine learning is actually where it's an accessibility problem.


So are there different kinds of machine learning?


There are, we should probably finish off, um, different kinds of CAPTCHAs first as well. And then I, then I'll go into different kinds of machine learning. So, um, CAPTCHAs started off as like the swirly text one that I think everybody's familiar with. The one that we see more often these days is where it shows you images and it says click all of the attack helicopters or click all of the technicals or-


I-I've never seen one that asked me to click attack helicopters, I'm not gonna lie. It's usually like, click all the traffic lights and there's a teeny tiny corner of a traffic light in another square. And you're like, does that one count or not? Do you want the actual bulb? Or like the fitting around? Yeah. Like, and you, you click it and it lets you through anyway, but like it's not-


Or you fail because there was a traffic light behind a wall and you weren't familiar with the location.


Yeah, no completely.


I wonder why suddenly companies might have huge data sets of images that they might want to know where the traffic lights are in those images. I wonder what that's related to. So yeah there's-




Go ahead.


You're gonna have to cut this out, but do you remember the thingy before where you said, um, that there was like American military bases and Russia?


We can talk about that.




Um, so, um, so yeah, so that's another where, where it presents the images and it asks the information about the, the images and then if you get a certain proportion right, uh, it can let you through. And then there's also CAPTCHAs that are behavioral based. So some of the people who develop some CAPTCHA systems might not want you to know actually what it's testing you on, but might monitor your behavior on a website or behavior across your internet session to see if you are behaving like a bot, or not. Originally the obfuscated letters worked really well because optical character recognition was difficult for computers. Now it's not, we're, we're getting to a stage now where detecting the content of images is becoming easier for automation systems to do. Also just on a cybersecurity note, you can just pay people or otherwise coerce people into filling the CAPTCHAs in for you. So if you are, you know, a major crime group and you have financial backing or something like that, you could just pay people to fill those CAPTCHAs in for you.


Or coerce them apparently.


Yeah. I like the term coerce in that context because it can mean many different things. Right? So I think whenever you say coerce somebody people think threats of physical violence? No, not necessarily. Um, I can give you an example that might be-


Would it be physically reeducate? <Laugh>


<Laugh> Physically reeducate? No. Uh, what it could be is, um, if I have a botnet, so if I have many thousands of machines that are infected with malicious software and I'm trying to break into some system, and that requires me to, to bypass a CAPTCHA on my infected machines, I could just pop the CAPTCHA up and say, oh, your system is going to update in the next 60 seconds. If you don't want the system to update, fill this CAPTCHA in. And actually all I'm doing is coercing you into filling the CAPTCHA in for me, you can do that or I could pay you to fill a CAPTCHA in. Yeah. So just generically using humans to defeat the anti-automation.


That sounds like crime to me.


In that context, it is. But it's actually something that can be applied more generally. So when we get into talking about training sets and things like that, you need the data to give to the learning algorithm, to set up the machine learning model. If you don't have the data, you need to gather it in some way. So say for example, you're trying to make a machine learning system that can accurately look at a picture and tell you if the picture is of a cat or a dog, you'll need to get loads and lots of pictures of cats and dogs, and then you need to separate them into these are cats and these are dogs.


Do you know who does this? Datadog?




<Laugh> They actually do. That's their CAPTCHA. It's like, is this a cat or a dog. And most of the time, they're pictures of dogs, but sometimes it's a cat and you have to-


Or you are not very good at telling the difference.


I'm really good at telling the difference between cats and dogs, I have loads of them.


How would you know?


I have loads of them.




Cats are really small. Dogs are like horse sized.




They're not real.


Okay. So, uh, if you have a training set, you need to, to label that training set or in that context, you'd need to label that training set. And one of the ways that you can do that is pay people. I will show you some photos and you label them whether they're cats or dogs. And then I now have a labelled data set that I could then give to my supervised learning algorithm.


So is that what CAPTCHAs are, are they basically using people to train data sets?


Yeah. So in many instances they are. So for example, if you think of the swirly text one, the, the original CAPTCHAs that people are most familiar with. If we are looking to do something like train a machine learning system to read signs, I could present to you two photographs of signs and ask you what is written on the signs. And I could do that where one of those images, the system knows what the answer is. And that is the one that it is validating you are a human with. And the other one, it might not know the answer, but if it presents it to many people and all of those people agree on what's written on the sign, it can learn that. So yes, you can use CAPTCHAs to, to label data. So earlier when you said, sometimes I don't fill the images out correctly and it still lets me through. Yeah. Because the ones that you might have mislabeled might have been part of it gathering training data. So it's tested you on others, you got the others correct. And then the ones that you got wrong were the ones that it itself didn't yet know the answer for. So could be that.
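The two-image scheme described above can be sketched roughly as follows. This is a hedged illustration in Python, not how any particular CAPTCHA service works: the function names, the vote threshold and the agreement threshold are all assumptions made for the example.

```python
from collections import Counter


def passes_captcha(answer_for_known_image, known_label):
    """Validate the person only on the image whose answer is already known."""
    return answer_for_known_image.strip().lower() == known_label.lower()


def consensus_label(answers, min_votes=3, min_agreement=0.8):
    """Accept a label for the unknown image once enough people agree on it.

    Returns the agreed label, or None if there are too few answers or the
    answers disagree too much (thresholds are hypothetical).
    """
    if len(answers) < min_votes:
        return None
    normalised = [a.strip().lower() for a in answers]
    label, count = Counter(normalised).most_common(1)[0]
    return label if count / len(normalised) >= min_agreement else None


# One person answers both images; only the known one decides pass or fail.
print(passes_captcha("Stop", "STOP"))

# Answers collected from several people for the image with no known label.
print(consensus_label(["STOP", "stop", "Stop ", "stop", "shop"]))
print(consensus_label(["stop", "shop", "step"]))
```

This also shows why a wrong answer on the unknown image does not block the person who gave it: their answer is just one vote towards a label, and a single mislabel is outvoted as long as most people answer correctly.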


So basically I'm mislabeling the training data set and slowly preventing our computer overlords from rising up.


Yeah. And that is a real big problem. So when it comes to machine learning, we should probably pause for a second to talk about the fact that there is broadly, I'm gonna have to generalize somewhat and skip over some details. But broadly speaking, there are three kinds of machine learning algorithm. You have supervised learning, unsupervised learning, and then reinforcement learning. Supervised learning is where we're giving it a labelled data set, and unsupervised learning is where we're not giving it a labelled data set. So unsupervised learning is gonna be suitable for categorization problems. So here is a bunch of data, split that data into K categories, where you don't necessarily have to tell it what items you're looking for. We'll get into- the correct term there is probably features. Um, we'll get to that in a second though. So that would be unsupervised learning. Whereas supervised learning would be where you're supplying labeled data. But the thing is, as you've mentioned, you can have bad data within your training set, and bad data could be many different things. So, to go back to our earlier example, if you're trying to predict house prices, you might have a whole load of information about houses. Now there's a lot of different problems from a machine learning point of view here. And one of the things when it comes to data science and building machine learning systems is, a lot of the effort spent in machine learning systems is not actually on the learning side of things. A lot of it is on things like which algorithm should we use? How should we set up the model? And problems with the training set. If we have a whole bunch of pieces of information about a house, and we have different columns on that, we would call those columns features. So one of the features might be number of bedrooms, one of the features might be location, those kinds of things.
The first problem we've got is we don't know which of those features are important. Some might have no effect on the prediction for the house price. Some might have a strong effect on the prediction of the house price. And we don't necessarily know which is which. Some of the features might be incomplete, so we might have a column for, does this house have a swimming pool? And some of those columns might say yes, some of those columns might say no, and some of them might be blank. So we don't know whether they have or not. And that can be a problem from a data set point of view. And if you look at something like that, let's take a simple example. If we have medical data about people and we are trying to predict from this medical data if a person has a certain medical condition, some of the information about them might be very important, but the information that we get might be problematic. So we might get an entry where the age field is 150. Okay, this person is probably not 150 years old. That data is probably just an error. It is written down wrong. It might be something like, oh, that's the person's age in months, for some reason. It might have been supposed to be 15, but there was a typo. It might just be wrong. It might be missing, uh, or you might have something like some of the data has age ranges. So instead of it saying 15, it is 15 to 20 and we don't know the specifics. And then how we overcome those problems is one of the most interesting areas of machine learning for me. It's like, oh, we have all of this training data. What do we do? We could simply get rid of all of those entries that have problematic data. The problem there would be, what percentage of your training set is problematic? Are you getting rid of huge amounts of data because there's some problem, a missing value, incorrect value, something like that? How do you know if that value is bad or not?
So say it's not 150, which I think we can say pretty accurately is not a valid value, but maybe it's something like 101. It's feasible that a patient could be that old, but it would be an outlier. So how do we handle those? And also if we have empty values, what do we do there? So for an empty value, for example, you could in some instances talk to an expert in that field and say, what do you think this value is likely to be? Probably not a good one for age, but again, there could be some predictions. If the person is known to be retired, you might make a prediction that they're over 65 then. You could look to do something like the average value. So replace that value with the average from the data set. Now, if you take the average value of your data set, or if you take a prediction based on other characteristics, like are they retired or not, you would get very different values and that can introduce bias. Or if you have missing values in your data set, you might develop a machine learning system to complete the data set, to use to develop your machine learning system. So there's a lot of different things you can do in that area, but yeah, for supervised learning, one of the biggest problem spaces is how do you get the data? How do you clean that data so that, you know, it's, um, complete? And then also once you've got all of this, how do you know which parts of it are important? Is it important that the house has got a swimming pool or not? Is it important that the house has got three floors or not? Is it important what the roof insulation is made of? You dunno, or certainly, I dunno, 'cause I don't sell houses for a living, but my point being there, you could have a huge amount of data that is actually irrelevant to the prediction that you're trying to make. There's a big time to start about training sets. It's important though. Yeah. Yeah. And people get cool job titles like data scientist. That's a cool job title.
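The options described above (dropping bad rows, filling with the overall average, or filling from a related characteristic such as retirement status) can be sketched in a few lines of Python. The records, the `MAX_PLAUSIBLE_AGE` cut-off and the function names are made up for illustration; a real cut-off would need input from someone who knows the data.

```python
from statistics import mean

MAX_PLAUSIBLE_AGE = 120  # assumed cut-off, chosen only for this example

records = [
    {"age": 34, "retired": False},
    {"age": 150, "retired": False},  # almost certainly an error
    {"age": None, "retired": True},  # missing value
    {"age": 101, "retired": True},   # unusual but feasible, so it is kept
    {"age": 45, "retired": False},
]


def is_usable(age):
    return age is not None and 0 <= age <= MAX_PLAUSIBLE_AGE


def drop_bad_rows(rows):
    """Option 1: discard every row with a missing or implausible age."""
    return [r for r in rows if is_usable(r["age"])]


def impute_with_mean(rows):
    """Option 2: replace bad ages with the mean of all usable ages."""
    fill = mean(r["age"] for r in rows if is_usable(r["age"]))
    return [r if is_usable(r["age"]) else {**r, "age": fill} for r in rows]


def impute_by_group(rows, key="retired"):
    """Option 3: replace bad ages with the mean age of rows sharing `key`,
    falling back to the overall mean if that group has no usable ages."""
    overall = mean(r["age"] for r in rows if is_usable(r["age"]))
    groups = {}
    for r in rows:
        if is_usable(r["age"]):
            groups.setdefault(r[key], []).append(r["age"])
    return [
        r if is_usable(r["age"])
        else {**r, "age": mean(groups[r[key]]) if r[key] in groups else overall}
        for r in rows
    ]


print(len(drop_bad_rows(records)), "rows survive dropping")
print([r["age"] for r in impute_with_mean(records)])
print([r["age"] for r in impute_by_group(records)])
```

On this tiny data set the two imputation strategies fill the same gaps with quite different ages, which is the bias risk raised above: the choice of fill strategy changes what the model later learns from.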


Yeah. They're fun at parties, actually, data scientists. Big fan of shots, usually. Um, so supervised learning then. Um, other types?


So supervised learning is where your learning algorithm takes labeled data. So the next big one is unsupervised learning. And this is where you don't have labeled data. You just have, uh, a whole bunch of data and you feed it into a learning algorithm. And that learning algorithm will separate that data into a number of groups. The number of groups is something that you can configure, generally speaking. This could be useful if you have data that you just want to categorize, generically speaking. You've dropped a bunch of photographs on the floor and you want to reorganize them in some way that is meaningful. The problem with that is, with categorization systems like this, you are not necessarily telling the system how you would like it to categorize. It's just going to do it depending on the specifics of the algorithm. And that can lead to some interesting problems. One example of machine learning systems making useless predictions that I can think of was a story that I read a long time ago where, allegedly, and I haven't looked into this, I dunno if it's true or not, but it was just, I think, a good example of this kind of problem. Allegedly the US military wanted to be able to develop a system that they could supply photographs to. And that system would make predictions as to whether that photograph contained Russian missile silos or not. So the idea being that they could scan over Russia and they could take loads and loads of photos of Russia, feed it into a system and the system would tell them whether a missile silo was there. That sounds like a really good machine learning problem. Especially if you look at some of the problems that we've been talking about. So, you know, we are talking about, if I give you a photograph, can you tell me if it's a cat or a dog? We're doing the same thing here. If I give you a photograph, can you tell me if there's a missile silo there?
The problem is the data that you supply, how you clean that data, the specifics of the algorithm, the specifics of how the model is set up is really, really important. And if you either don't put the effort in or make some bad choices along the way, you can get unpredictable results out. The story goes that the US military spent however many tens of millions of dollars, allegedly, on developing this system. The idea being that they want to then be able to supply a photograph of Russia, where they don't know if there's a missile base there or not, or ideally not one, but tens of thousands of photographs of Russia and have the system predict whether there is a missile base there or not. And in actuality, what the system predicted was whether or not it was a photograph of America. And that was what the system had been tailored to learn unintentionally. So they gave it all of these photographs and I'm sure there's some technical reason, like perhaps the colors in the image were very slightly different because of the angle of the satellite, it was passing through a different amount of atmosphere or something like that. Or maybe it was a lens effect due to angle.
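The grouping step described earlier in this answer, where unlabelled data is split into a configurable number of groups, can be sketched with a naive k-means loop. This is one common clustering algorithm picked for illustration, not something named in the episode; the points and the choice to start from the first `k` points are assumptions made so the example is repeatable.

```python
def kmeans(points, k, iterations=20):
    """Naive k-means on 2-D points.

    Starts from the first k points (an arbitrary, repeatable choice) and
    alternates between assigning points to the nearest centroid and moving
    each centroid to the mean of its members.
    """
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared distance.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignments, centroids


points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Nothing tells the algorithm what the groups should mean. It separates whatever differs most in the numbers it is given, which is how a system can end up sorting photographs by something other than what its builders intended.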


Maybe it was just that all of the pictures of America had missile bases in and all of the pictures of ground were actually from Canada.


Yeah. Maybe something like that. Yeah. The training set was bad, is, I'm sure, what the problem is. So that's the kind of thing that I was trying to build up to there where, when you're building systems to categorize data, you're not necessarily gonna get out what you want from that system if you make poor choices or if you have a biased training set.


So I guess the kind of problem that you wanna solve, you need to establish first, what sort of approach you wanna take and which kind of machine learning you wanna employ for that. Otherwise you could end up with it solving a different problem or approaching it in the wrong way, because you've not been specific enough with your query.


It- it's deeper than that. So, um, I gave so many examples of where the training data can just be wrong, missing values, incorrect values, typos, those kinds of things, but you can also have it where the training set data is completely correct and completely valid, but not in a format that's appropriate for the machine learning algorithm. So for example, if we are predicting salaries and we have a whole bunch of information about people who work in the industry, and we want to supply that to a system that can predict salaries, one of the things that you would presume, no doubt, is an important feature, an important piece of, uh, data or category of data, uh, would be job title. But the algorithm is unlikely to know what a job title means. If you have junior consultant, senior principal, director of, vice president of, the algorithm wouldn't inherently know what that is. So you would have to prepare that data in some way. So that brings us onto ordinal and nominal data. So ordinal is data that has an order and nominal is just data that is a name of something that is not inherently ordered. What state are you from is nominal. There's no inherent order between, you know, Ohio and Idaho-


Why are we American now?


Well, <laugh> what I was wondering there is whether the audience would immediately jump on, oh, of course, Ohio is better or whichever. Oh, also incredibly worried that I wouldn't, off the cuff, be able to name more than two states <laugh> so maybe we should have picked counties. The reason that I didn't pick counties is English counties sometimes have very strange names. And I thought that might confuse the audience.


It's also because one time Holly and I had a "draw the world from memory" competition. And I was pretty sure she made up a bunch of countries I'd never heard of. Um, Holly won that one because I didn't know enough about geography to be able to win.


Oh, I won that, uh, through knowledge of Uzbekistan, Tajikistan and all of these, um-


I'm pretty sure it was Turkmenistan actually. I thought you'd made it up.


Just to be clear, Turkmenistan is a real place. So we have data where the order is important. Job title seniority, that is an example of that. And then you have examples of data where the order isn't important, or in most contexts isn't likely to be important, like the state that you were born in, for example. So we would have to prepare that data before we give it to the algorithm. Another example could be, as well, scale, so you might want to rescale the data. If you're doing something like salary, you might want to rescale that so that it is values between zero and one or something like that. It really depends on the problem that you're trying to solve. The general point there is just, you probably need to prepare the data in some way. I don't necessarily want to get into it, but some of the methods of encoding have got kind of cool names. One example would be one-hot encoding, which is a method of encoding data. I just think that's a really cool name.
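The three preparation steps mentioned above (encoding ordered categories, encoding unordered categories, and rescaling numbers to between zero and one) might look like this in plain Python. The seniority ordering in `SENIORITY` is an assumption invented for the example, as are the function names.

```python
# Assumed ordering from least to most senior, for illustration only.
SENIORITY = ["junior consultant", "senior consultant", "principal",
             "director", "vice president"]


def ordinal_encode(title):
    """Ordinal data: map each title to its position in the known order."""
    return SENIORITY.index(title)


def one_hot_encode(value, categories):
    """Nominal data: one column per category, 1 for the match, 0 elsewhere,
    so no artificial order is implied between categories."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest is 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(ordinal_encode("principal"))
print(one_hot_encode("Ohio", ["Idaho", "Ohio"]))
print(min_max_scale([30_000, 60_000, 90_000]))
```

Encoding the states as 0 and 1 in a single column would suggest that one state is "more" than the other, which is why unordered categories get one column each.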


So, so far the problems with machine learning that we've covered are the reliability or the, I guess, data integrity of the training data set that you use-


I think, yeah, I think if we were trying to kind of summarize my vague rant on machine learning into what problems might you come across, the first is missing or incomplete values in the training set, errors in the training set, so just where, um, data is incorrect, where data needs to be encoded. These are all generic data science problems and relatively well documented within the literature of how you best should handle those. But one of the things that we haven't mentioned, which is probably worth pointing out, is where the training set itself is inherently biased. There can be a lot of different reasons for this. One of them might simply be that the training set that you are supplying to the machine learning algorithm is in some inherent order. And that order might add bias to the training set in some way that's unexpected. That is a relatively easy thing to solve. Just randomize the data that you give it, you know, don't give it in a fixed order, but supply it to the algorithm in a randomized order. That's a pretty easy example to handle if you know to expect that as a potential problem. But what can be harder to handle is where your training set just has biased data. So if you are looking at, for example, predicting salaries, as we mentioned earlier, if your training set is entirely made up of the salaries of people in Seattle, that might not apply, it might not be generalized enough to apply to people in another city like New York.
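Both points above can be sketched briefly: shuffling removes any bias from the order of the rows, but it does nothing about a training set drawn from one city only. The salary rows and function names are invented for the example.

```python
import random
from collections import Counter


def shuffled_copy(rows, seed=None):
    """Return the rows in random order without modifying the original list."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def counts_by(rows, field):
    """Count rows per value of a field, to spot a lopsided training set."""
    return Counter(row[field] for row in rows)


# Made-up rows; every one of them comes from the same city.
salaries = [
    {"city": "Seattle", "salary": 95_000},
    {"city": "Seattle", "salary": 120_000},
    {"city": "Seattle", "salary": 140_000},
    {"city": "Seattle", "salary": 180_000},
]

training_rows = shuffled_copy(salaries, seed=42)
print(counts_by(training_rows, "city"))
```

The count makes the second problem visible before any training happens: however well the rows are shuffled, there is no New York data for the model to learn from.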


<laugh> Or I guess, um, if it's the salaries of only, like, minorities in a particular field, maybe like tech, for example, then salaries generally for, say, white men might be higher.


Absolutely. And that is actually a really good example to give, where it might be that your training data is biased, or it might be that the data that you are trying to predict is biased. It might simply be the case that there are certain categories of people who are just paid more than others for no reason relevant to the prediction. It could be that men are paid more than women, for example. Handling that within the machine learning algorithm would be important. At the very least, it would be important to ensure that the training set was representative of the system that you're trying to predict. So if your training set is entirely made of male salaries, it's not going to predict female salaries accurately because there is an issue in the system that your training set doesn't account for.


I guess, um, another example of this that you mentioned was Face ID on iPhones.


Yeah. Face ID, or really any kind of, um, system like that is a really good example. There was a certain digital camera many years ago now, which was the first example that I saw of this, where the digital camera was using facial recognition. There's a lot of reasons why a digital camera might wanna use facial recognition. It might be a part of its auto focus system, or it might be something like if it detects a face within the photograph, it puts the camera into portrait mode instead of landscape mode automatically. I've seen cameras before where, when you smile, they'll take the photograph automatically and those kinds of things. There's a lot of different reasons behind why a system might have facial recognition built into it. The problem there could be, is your training set representative of your user base? So if your training data is photographs of your employees and you're a US based company, there might be bias within your employees that isn't representative of your user base. So for example, you get a whole bunch of people in San Francisco and take photographs of them, and then use that as the training set for your facial recognition system. And then you sell those cameras in China. You might have issues there with training set bias because your user base is not represented within the training set. This leads onto definitions that are worth mentioning, which are, um, overfitting and underfitting. Overfitting would be where the learning performs very well on your training set, but not very well on your production data set. I can give you an example, uh, so earlier we mentioned my woes with the Apple Pencil and I thought that the Apple Pencil problem, it struck me as an overfitting problem. And the reason for that was the word that I was trying to write was OUI. Now in English OUI is not a common series of letters.
And certainly at the beginning of a word, I couldn't think of an English word off the top of my head that begins OUI, but there's many words that I could think of off the top of my head, English words that begin QU, or even QUI. So if the system is doing handwriting recognition and it has been trained against a corpus of English words, or if it's trained against sets of people's handwriting who are writing in English, um, it might perform very well. And then when you write in a different language, it might not perform as well because there might be character sequences that it doesn't recognize, or even more directly, there might be literal characters that it doesn't recognise. You know, does the system know about accents? Does the system know about non-English alphabet letters, those kinds of things. So, yeah, the problem that I had with the Apple Pencil, it could well be overfitting, where it is presuming that OU is unlikely, but QU is very likely, so it's presuming it's going to be QU when it wasn't.
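To make the OUI versus QUI intuition concrete, here is a toy character-pair scorer built from a handful of English words. It is purely illustrative: the word list is a tiny made-up sample, the scoring rule is an assumption, and it says nothing about how Apple's handwriting recognition actually works.

```python
from collections import Counter

# Tiny assumed sample standing in for an English-only training corpus.
ENGLISH_WORDS = ["quick", "quiet", "quit", "queen", "quote",
                 "our", "out", "house", "ounce"]


def bigram_counts(words):
    """Count adjacent character pairs; '^' marks the start of a word."""
    counts = Counter()
    for word in words:
        padded = "^" + word
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


def familiarity(candidate, counts):
    """Multiply (count + 1) for each character pair in the candidate.
    A higher score means the sequence looks more like the training words."""
    padded = "^" + candidate
    score = 1
    for i in range(len(padded) - 1):
        score *= counts[padded[i:i + 2]] + 1
    return score


counts = bigram_counts(ENGLISH_WORDS)
print("qui:", familiarity("qui", counts))
print("oui:", familiarity("oui", counts))
```

With only English words to learn from, the scorer rates "qui" as more familiar than "oui". A recogniser leaning on a prior like this could nudge an ambiguous handwritten O towards Q, which is the kind of mismatch between training data and real use described above.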


I still feel like this is going a little bit better than the last episode where you kind of put me on the spot and asked me a bunch of questions about chaos engineering and I was like, oh no, my show notes are terrible.


<laugh> This is the funny thing, 'cause we, you know, we prepare like, oh, what are the things that we wanna talk about? And all of the things that I wanna talk about are just like, oh, um, I have RSI, we should probably talk about that. It's quite an interesting topic. And then what we actually get into is no, no, no. Allow me to explain to you what overfitting is. <laugh>


I mean, there are worse things.


No, I'm not bad at French. Apple is bad at reading French.


I asked Siri if, um, she understood French once and she did not, because she was set up in English. And I feel like that's like an interesting feature that they should introduce, like to be able to switch between languages without having to go into your settings and change the language that it's in.


So this is one of the things, actually, that I dunno how Android handles this, 'cause I've only used it on the iPhone, but when you're writing, I very often will swap between English and French whilst writing. So I do that a lot and I find that things like autocorrect work very well, where it seems to realize that you've swapped language and it kind of catches up with you very, very quickly. So that's good. But then I found that the handwriting recognition didn't seem to work very well at all with French. Some of the problems that I had with the handwriting recognition, you could kind of see what the problem is if you imagine, so "I write": J apostrophe. J'écris.




Um, so if you take the French phrase "I write", J apostrophe, the problem is I don't dot my Js. So when I write it by hand, I have a J without a dot above it and then an apostrophe.


Oh, is that how you write? I always put like a little line across the top of mine, like a T.


Oh, okay. Um, but yeah, you can see the point that, because I don't do that, I just have the lower body of the J and then I have an apostrophe next. Right. You can see how the system might get confused by that. 'Cause is that an apostrophe or is it the dot above the J? And again, that can lead to a problem where, based on its training set, it almost always sees the J character have something above it. So it might presume, or it might have stronger confidence that, oh, that must be the dot above the J, when in actuality I don't do that. But I acknowledge that most people probably do either cross the J or dot the J, but I don't. So you have problems like that as well, where does the training set represent the user base? It's that same problem.


Yeah. So I guess in that situation, you would end up with a grammatical error or it might correct to a completely different word.


So in most instances for that example, it just ignored the apostrophe. That was just what it was doing. It was leaning in that direction: that must be the dot above the J, this person must have just written this word. Because handwriting recognition, of course, is like, you don't want it necessarily to correct you too much, because I might be purposely writing a misspelled word or a foreign word, a word that's just not within its dictionary. Oh, one of the things I wanted to talk about just before we close out was my, uh, machine learning exam. So, uh, I recently passed the AWS machine learning certification. We should probably do an episode at some point about my experiences with certifications, because I think my experiences with certifications make you angry.


It does, it does a little bit, but, um, I approach certifications differently now. I used to, um, when I was younger and we hung out a lot more, you would do this thing that really irritated me, where you'd be like, oh, I've got a week off, I'm gonna do an exam, and you would do your exam and get a certification. And I used to get really stressed about exams. And therefore it would stress me out that that was how you approached them. It was kind of seat-of-your-pants, like on-the-fly stuff. And I couldn't empathize at all, but then I started working in security while doing a master's degree <laugh> and last year, kind of end of the year, like maybe September, October time, I had to take some leave from work and we were in lockdown, so there was nothing I could do. And I absolutely was not gonna sit at home for a week with nothing to do in lockdown by myself. So I did an AWS exam and I felt like I understood you a little bit more then. I didn't enjoy it.


Which exam?


Uh, Solutions Architect Associate.


Did it go well?


Yeah, I passed it. I'm gonna do my security specialty kind of later this year now.


It's funny to me that when I asked the question, did it go well, you immediately default to, did you pass or not? Which isn't what I meant by that. We should save this for a story for another time. We should do an episode on, um, certifications, 'cause I definitely have an unpopular approach to certifications. And um, I definitely think there's something to talk about there, but just one of the things I wanted to mention before we close out was the machine learning exam that I did, AWS machine learning specialty. What I wanted to point out was within the field of machine learning, there is an awful lot of math and there's definitely an awful lot of statistics. And I think some of these certifications, like the AWS machine learning certification, people might get put off by if they're not so confident in their, uh, math ability, or if they just think that a lot of the content is gonna be focused on that. And in actuality, it's not. Some people might get a little bit mad about that because math is such a strong part of machine learning as a field. But as I said, right in the opening of this episode, machine learning can be approached in a lot of different ways. If an organization is looking to get the benefit of machine learning, they could, for example, use a system whose backend is a machine learning system, but is provided to them just as a black box of just, it works. Um, AWS has several of these, AWS Rekognition, AWS Polly. These are systems that in their backend are machine learning systems, but they're set up in such a way that you can just consume them. You don't need to understand specifically how it works. So a part of the exam is knowledge of those services. A big chunk of the exam is what we spoke about earlier, which is effectively data management, feature engineering, handling errors, those kinds of things. That's a significant part of the exam. There is math in the exam. It's not much at all.
If you can do square roots, exponentiation, summation, that's everything you would need for that exam.
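For a sense of what that level of math looks like in practice, here is root mean squared error, a standard way to score predictions such as house prices, written with nothing beyond a square root, a power and a sum. Choosing this particular formula is an assumption made for illustration; the episode does not say which formulas appear on the exam.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square each miss, average, take the root."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up house prices in thousands, and a model's guesses for them.
actual_prices = [200, 350, 500]
predicted_prices = [210, 340, 530]
print(rmse(actual_prices, predicted_prices))
```

The result is in the same units as the prices, so it reads as a typical size of miss rather than a pass or fail mark.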


I, I don't even know what half of those words mean. Okay so basically I am a machine learning expert now.


Um, probably not. And there's probably been some terrible definitions of some of the words, but hopefully for those who've been listening in, what I wanted to get across, the big takeaway that I wanted from this episode is, don't look at machine learning systems as having to have a hundred percent accuracy for them to have value. Many of these systems have benefits in time saving and cost saving, in just giving you a ballpark estimation. They can be valuable without getting it a hundred percent right. They're making predictions.


And some of them turn into Nazi chatbots.


Quite a lot of them actually turn into Nazi chatbots.


It's unfortunate really.


A surprising number.






Then you have to brain them.


You have to shut them down.


Like in I, Robot, imagine like I, Robot, but with Nazis.


I, Robot, but with Proud Boys.


Oh, oh, she went there. <laugh> So right, if somebody came into this with like no knowledge of machine learning, um, and they'd kind of heard of a couple of cloud services, maybe AWS services, what would you recommend they get started with? What's the easiest route into this, into kind of understanding it and getting a little bit hands on and not kind of just being immersed fully in all of the theory and data science behind it?


One of the big things that people should look to gain is knowing when a problem is well suited to machine learning, or I guess in some instances it can be just as important to know when a problem is not well suited for machine learning, so that you don't put an awful lot of time or effort into looking into using machine learning to automate something if it's not a good fit. As I mentioned earlier, if you can't gather sufficient data, if you can't, for a supervised learning system, appropriately label that data, those are likely to be problems that machine learning isn't gonna be a good fit for. But if you can get to the point where you can understand, just as a glancing overview, when machine learning might be beneficial to a system, that is probably enough for most people who work within it. The reason there being, if you have a problem that is well suited to machine learning, then you can take the time to look into how to apply machine learning to that problem. Because as we've talked about, there's lots of different situations where machine learning would be very different. Predicting house prices or recognizing the content of images are two very different problems. So the amount of knowledge that you would need in that area, if you're looking to predict prices of houses, predict salaries, those kinds of things. If you have a regression problem, knowing how convolutional neural networks work might not be beneficial to you in that problem. So don't get-


Sorry, what- what's a con- convolution, convolutional, convolutional neural network. What- what is that?


So there's convolutional neural networks, there's recurrent neural networks. Um, there's the whole area of deep learning that I've successfully managed to navigate you around until the end of this episode. So we will have to do a part two specifically on deep learning and neural networks in general. But yeah, if people are looking at this episode just as a how to get started in machine learning, the thing that I would look for is what makes a good machine learning problem and some of the terminology. So I've been using terms like feature when I mean, like, category of data-


Or bug? <laugh>.




She's just looking at me like I've got two heads, for anybody listening.


Yeah um, so, so just picking up the terminology and picking up the, the different areas would, would probably be enough. <Pause> Convolutional.



