So What? The Marketing Analytics and Insights Show

So What? How do I clean and prep my data for analysis, part 1

So What? Marketing Analytics and Insights Live

airs every Thursday at 1 pm EST.

You can watch on Facebook Live or YouTube Live. Be sure to subscribe and follow so you never miss an episode!

 

In this week’s episode of So What? we focus on the basics of a data analysis project. We walk through general terminology, the exploratory data analysis process, and general best practices for prep. Catch the replay here:

So What? How do I clean and prep my data for analysis, part 1

In this episode you’ll learn: 

  • general terminology and definitions
  • getting organized and setting yourself up for success
  • what tools you should have at the ready

Upcoming Episodes:

  • Data Cleaning (series): Prepping the data – 2/4/2021
  • Data Cleaning (series): QA’ing the data – 2/11/2021
  • Data Cleaning (series): Let the machines do it – 2/18/2021

 

Have a question or topic you’d like to see us cover? Reach out here: https://www.trustinsights.ai/insights/so-what-the-marketing-analytics-and-insights-show/

AI-Generated Transcript:

Katie Robbert 0:26
Well, hey, it’s been a minute since we’ve been on the air, probably since sometime last year. But we’re back. I’m Katie, I’m joined by Chris and John, and this is So What?, the marketing analytics and insights live show from Trust Insights. This week, we’re going to start our series on data prep for analysis. There’s a lot to cover, so in this show in particular, we’re going to start with the basics: some terminology, how you get yourself set up for success, and what tools you should have at the ready. If you have questions, feel free to drop them in at any time; if we don’t cover something, feel free to ping us after the show and we’ll try to cover it next week. So Chris, John: so what? Prepping your data for analysis. Why is this a topic that we’re considering talking about over a series of a few weeks? Why is it such a big deal?

Christopher Penn 1:19
Well, you know, data is an ingredient. And if you’re cooking and you don’t have ingredients, or you don’t know what they are, or you don’t know what you’re supposed to do with them, it makes it really difficult. It also makes for really disgusting food at the end. It’s like, oh look, broken seashells, that’s an ingredient, right? No, no. So there are a couple of different things that are important to think about with data. First is understanding what good data is, then how we get it, and then what do we do with it; those are kind of the big three questions. So to start off, I think we should probably set some basic foundational stuff, like: what is data? There are really four-ish big kinds of data. There’s subjective data and objective data, which is opinion versus fact, and then there’s number data and non-number data. And you can kind of draw a two-by-two matrix, if you will, of those different kinds. A lot of what we deal with in marketing analytics is in the objective data, facts of some kind, whether numeric or not numeric, but there is also still some subjective. So a couple of examples. Good numeric objective data is really easy: website visitors. It’s a number, and there’s no shortage of that, and that’s what a lot of the software and tools most marketers deal with are accustomed to. There is objective, non-numeric data: this is like emails, text, dimensions in datasets. These are things that aren’t numbers, but they’re still important. You would need to process this if you want to understand, for example, the sentiment of a piece of text, or even just what topics are on a web page. Then you get into the subjective side, and this is where things get really interesting. You have things like subjective numeric data, which sounds like a contradiction in terms, but it’s not when you think about a survey: on a scale of one to five, how much do you like this coffee, and everyone says it’s a four. Okay, it’s a number, but it is also subjective. It’s not a fact, it’s just a person’s opinion, and we have a lot of that data. And then you have subjective non-numeric data. This is where you have opinions: review data, people writing blog posts about your content, social media; even things like focus groups in market research and opinion polls are all subjective, non-numeric data. And so one of the toughest things we have to do as marketers first is figure out: what do we have in each of these four categories?
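To make that two-by-two concrete, here’s a minimal sketch in R (the language the team says it codes in later in the episode); the example values are just illustrations of each quadrant:

    # A two-by-two of marketing data types:
    # objective vs. subjective, numeric vs. non-numeric.
    data_types <- data.frame(
      example     = c("website visitors", "page topics", "1-to-5 survey score", "review text"),
      objectivity = c("objective", "objective", "subjective", "subjective"),
      form        = c("numeric", "non-numeric", "numeric", "non-numeric")
    )
    print(data_types)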

Katie Robbert 4:01
Well, and I think it’s interesting, too, because, and this is something that we’ve been talking about a lot internally, you know, what is data? Obviously, it’s what you describe; you can put it into one of those four buckets. But I think the most common understanding of data is that it’s a set of numbers, and that’s it. And so I think giving it those qualifiers of subjective or objective, you know, whether it’s fact or opinion, and then whether it’s structured or unstructured, numeric or non-numeric, really helps move along the conversation as to why we want to dig into this over the course of a couple of weeks. Because it’s not just looking at a set of numbers and saying, okay, that’s the biggest number, therefore analysis done; there’s a lot that goes into it. John, you talk with a lot of people in the network. What kinds of misconceptions do you think there are around, you know, what is data?

John Wall 4:59
Yeah, well, I think a huge part of it, and it goes straight to our challenge as marketers, is the fact that it’s the subjective data that tends to be most useful as far as what you’re going to be doing for future products, or also in the B2B space, where you don’t have tons of objective data. That’s kind of where you can at least make some movements. The easy action is always in the objective data; you know, you can look at website hits, or likes, or whatever. But really, until you dig into the subjective side, that’s where you tend to find the tougher questions and the value. So the real data challenge for me is kind of: where do you want to land? Where do you look first? And the easy answers always tend to be on the objective side. But again, the subjective side is where you get to purchasing, which is irrational. And so, you know, you need that data to help guide you for where you want to go.

Katie Robbert 5:54
So Chris, where do we start? What are some of the things that we need to understand about good data before you can even start to think, okay, I need to do an analysis?

Christopher Penn 6:06
Yeah, so good data has six qualities. I’m going to switch over our display here; you can actually go over to TrustInsights.ai and get this from the Instant Insights section of the website. This is our Data Quality Framework, and good data is these six things, right? It’s clean, which means that it’s free of errors. If you’ve ever looked in your CRM or your marketing automation software and seen [email protected], your data is not clean, because you’ve got junk in there. It’s complete. This is one that’s a big problem. We were just looking at a client’s Google Analytics yesterday, and there was a huge period, about three and a half weeks, where it looks like they forgot to put their tracking code back on their website. So there are like 21 days with zeros on their site, and that’s a problem, because you’ve got a major analysis gap. Next, data has to be comprehensive, meaning it covers the questions being asked. If I am asking you for return on investment, and you give me all your revenues but you don’t give me any of your costs, I can’t do return on investment, because the formula is earned minus spent, divided by spent; if there’s no spent data, you can’t do ROI. So that’s one where, again, you have to be very clear at the beginning of a data project what your goals and strategies are, so that you know whether your data answers the question. The corollary to that is your data has to be chosen well, which means try to have as little irrelevant data as possible. This is marketing’s bane. We’re in the middle of a very large attribution project for one of our large automotive clients right now, and we started that project with something like 215 different data points, different data series, which, you know, if you imagine a spreadsheet, just goes on and on and on. A major part of the data preparation process, to even do analysis for that, is to say, okay, we’ve got to figure out which of these 215 are relevant, and then remove the stuff that’s not. Data has to be credible, which means it has to come from good sources, and it has to be chosen from a relevant time period, making sure that it’s fresh and it’s not biased in any way, which is really hard to do, and most people don’t think about it. And the last part is it has to be in a calculable format that both humans and machines, machines in particular, can use. Again, this is something that you don’t think a lot about until you start working with it and realize just how messy data is. To show you a very quick, simple example: you go to, say, Google Analytics, you hit the Export button, and it spits out a spreadsheet. You look at this and go, okay, that’s cool. But when you zoom in, the first six lines of this thing are junk, right? So you can’t just put this into a piece of software and expect it to work. If something as simple as Google Analytics can’t even spit out clean, usable data, what hope do you have for anything else? And at the bottom of every Google Analytics export, they always put a summary line; again, if you don’t know to look for it, it completely screws up your analysis. So it’s little things like that. For each of these buckets, you’ve got to have almost a checklist, if you will, and you could almost score it on a clipboard, you know, one to five, for any given data source: how does it rank for each of these six factors?
And I guarantee if you just spend some time with your data, you’re going to find that some of your data scores really low.
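As an aside, here’s what handling that kind of export can look like in R; a minimal sketch, assuming a hypothetical file called ga_export.csv with six junk lines at the top and a summary block at the bottom, as in Chris’s example. Inspect your own export before skipping anything:

    # Eyeball the raw file first to see where the real header row starts.
    raw_lines <- readLines("ga_export.csv")
    head(raw_lines, 10)

    # If the first six lines are junk, skip them on import:
    ga <- read.csv("ga_export.csv", skip = 6, stringsAsFactors = FALSE)

    # Drop trailing summary rows: keep only rows whose first column
    # parses as a date (the date format here is an assumption).
    ga <- ga[!is.na(as.Date(ga[[1]], format = "%m/%d/%y")), ]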

Katie Robbert 9:32
A quick little anecdote: a lot of people know that I used to work in risk management in the pharmaceutical industry. One of the challenges that we always had with reporting data to pharma quarter over quarter was that we were never getting consistent data from the agencies who were supposed to be reporting it. Basically, we had this network of clinics that were supposed to be reporting data every single day, and every quarter, the network would look different. Some states would come on, some states would come off, the timeframes would be all over the place. And it was pretty much a nightmare to say, consistently, this is the data that we can compare quarter over quarter and do any kind of analysis on, to say, you know, year over year, this is what trends look like. Obviously, we made it work; we had large quantities of data, so there was a lot of inference that could be made, and that’s a whole other topic. But, Chris, to your point, even when you’re talking about things like Google Analytics and you’re trying to do month-over-month, week-over-week comparisons, if it’s not consistent, if it’s not complete, if it’s not clean, calculable, credible, you can’t do something basic, like compare what happened last week to what happened the week prior.

Christopher Penn 10:51
Exactly. So those are the things you have to be able to score your data with, in order to be able to start using it. When it comes to getting organized and setting up for success, one of the things that you need to know is: what does the data analysis process look like? Even before we start talking about fancy stuff like data science and machine learning and advanced analytics, just being able to look at our data and go, hmm, what do we need to know? I know for next week we’re going to start digging into examples of how this looks, but I want to give you a preview of what the overall framework looks like. Data analysis is really sort of an eight-step process. First, you figure out what your goal and strategy is: what are you trying to prove? If you don’t know what you’re looking for, it’s kind of a lost cause. The second part, which is surprisingly difficult for marketing, is the data collection itself. How do you get at the stuff? Just this morning in our Slack (if you go to TrustInsights.ai/analytics-for-marketers, you can join our free Slack group), we were having discussions about getting data into different dashboards, getting it from one place to another. It’s really hard to do. There are a lot of different systems out there: you’ve got your marketing automation system, your CRM, your social media channels, your social management tools, your ad systems, you name it. So how do you collect all this data and get it into one place so you can even start working with it? The third step is attribute classification, which is a fancy term for saying: what’s in here, right? What is it? What kinds of data is it, like we were talking about? Is it subjective? Is it objective? Is it numbers? Is it not numbers? What’s in the box? And again, this is something where you will find surprises. When you look at, say, social media data, on the surface it looks like objective, numeric data, but sometimes there’s really challenging stuff in there. One of the easiest ones for understanding just how much of a problem this is, is Instagram. When you look at Instagram data, you get a URL to the image or video that was shared, you get the username, the engagement numbers, and then the description. Unless you have really good machine learning software that can describe the image, you may have a total mismatch between what’s in the description and what’s in the actual image. We all have that one friend, right, who posts a picture of, well, before the pandemic, them on a beach with a drink of some kind, and the caption says, “Well, this sucks,” right? We all know that they’re being silly. But if you didn’t have the image, you just had “Well, this sucks,” and you could have a pretty severe disconnect if you’re trying to do some analysis. So even something like that matters in attribute classification. After that, you do initial analysis: looking at one variable, looking at multiple variables. This is where you do quality checks: is stuff missing, is it broken, are there unusual things in the data that you didn’t expect to be there, anomalies? I had this experience with my own website. I was pulling submissions from a contact form, and a couple of weeks ago, a few spam bots stopped by, and so right in the middle of, you know, the download, first name, last name stuff was a whole bunch of, like, porn links.
That’s not what I expect to see on my contact form. But you’ll have anomalies like that in your data; anytime you’re getting data from the general public, you’ll get anomalies.
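For readers who want to try that initial “what’s in the box” pass themselves, here’s a minimal sketch in R; the tiny data frame is hypothetical, standing in for collected form submissions:

    # A made-up dataset standing in for contact form submissions.
    my_data <- data.frame(
      first_name = c("Ann", "Bob", NA),
      visits     = c(120, 87, 5400),
      comments   = c("great post", "thanks!", "buy here http://spam.example"),
      stringsAsFactors = FALSE
    )

    sapply(my_data, class)   # attribute classification: numbers or not?
    colSums(is.na(my_data))  # quality check: missing values per column
    summary(my_data$visits)  # ranges: anything out of whack?
    grepl("http", my_data$comments, fixed = TRUE)  # flag link-stuffed rows (spam bots)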

Katie Robbert 14:30
Well, and Chris, as we’re walking through the basics, you just gave a good example of a type of anomaly. Can you give an example in numeric data? Let’s say you’re looking at your website traffic data for a past month. What is an example of an anomaly that somebody could or should be looking out for?

Christopher Penn 14:54
So there are a couple of different ways that can go. If it just goes to zero, that’s an anomaly; the conclusion is you broke your Google Analytics. The opposite one is like, oh look, 10,000 people stopped by your site in one day because somebody posted a link to a blog post of yours on a popular subreddit. That would be a clear case of an anomaly. It’s not something that you could probably sustain; it was just one of those weird one-time things. If you owned, say, GameStop stock this whole week, compared to the last five years, you’ve seen one of those. One of the important things to talk about, and this is a more advanced statistical discussion, is when something is an anomaly, versus when it’s a breakout, which is where something changes and then stays changed for a little while, versus a trend, where something changes and stays changed, right? You need to be able to differentiate between the three. But all of that has to happen in that initial analysis step. If you don’t do that, you don’t know what’s in the box.
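Here’s a minimal sketch in R of flagging that kind of one-day spike, using made-up daily traffic and a simple robust z-score; this is illustrative, not the show’s prescribed method:

    set.seed(42)
    traffic <- c(rpois(30, 500), 10000, rpois(30, 500))  # a Reddit-style one-day spike

    # Distance from the median, scaled by the MAD, flags the outlier day.
    z <- abs(traffic - median(traffic)) / mad(traffic)
    which(z > 5)  # a breakout or trend would flag a whole run of days instead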

Katie Robbert 15:50
Well, and I think that’s an important point, too, that we could explore for a second. Just because you have an anomaly in your data, is that a bad thing, and should you immediately remove it? Let’s say, for example, you have that one spike on one day because you had a bot attack, or it was a legit spike on one day because it happened to be, you know, national data processing day and everybody came to your website to learn about data processing. Do you have to remove that from your data? And how do you go about acknowledging that? Where do you even say, hey, when I did this analysis, there’s some data missing, because it didn’t make sense?

Christopher Penn 16:33
You’re jumping ahead.

Katie Robbert 16:38
Okay, let me pull it back, pull it back.

Christopher Penn 16:41
So after you’ve done the initial analysis, that’s when you do requirements verification. This is when you say, okay, we set out our goals, we set our strategy; can we answer the questions that we’ve been asked to answer with this data? And at this point, you may have to stop and say, no, we can’t; we need more data, we need better data. Going back to the example of the attribution project we’re doing for one of our clients, we’ve had to stop and start that project five or six times now, because we’re like, okay, this department didn’t get us the data in time to start the project. That’s one of the reasons why that first step, the goal, the strategy, and all that planning, is so important, because otherwise you do this process a lot. The next step, and this is where we’re really going with a lot of this, is preparation, where you talk about centering and scaling and cleaning. This, Katie, is where you were talking about, like, what do you do with some of those anomalies? What do you do with missing data? For example, if you’re doing time series analysis, centering and scaling are really important, because what you want to do is get as close to apples-to-apples with your time series data as possible. If you’ve got website traffic and, you know, retweets, they’re on very different scales: website traffic is probably in the tens of thousands of visitors, and your retweets are probably in the tens. So you want to normalize that, using any number of mathematical techniques, to get them closer together so that you can do your time series forecasting. Some really advanced machine learning models will actually do that for you automatically, because they know you’re probably not going to, because you’re lazy; I am, anyway. So they know to do those things. The other one is that cleaning stage: what do we do with an anomaly? It depends on what you’re doing with the data. If you’re trying to do basic analysis, like what happened this month, you have to leave it in, right? If you’re trying to do trend analysis, you might want to smooth it down with, like, a seven-day moving average. Or if you’re trying to get a sense of what’s really happening, you might use a median and then have, like, a 5% cutoff, so you cut off the top 5% of values and the bottom 5% of values of any data set; that can take out those short, bursty little anomalies. But it all depends. Again, going back to that goal and strategy: what you’re doing with the data dictates how you prepare it.
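Those three preparation moves, sketched minimally in R with made-up series (a hedged illustration, not a full recipe):

    set.seed(1)
    visitors <- rpois(60, 20000)  # tens of thousands
    retweets <- rpois(60, 30)     # tens

    # Centering and scaling: put both series on a comparable footing.
    scaled <- data.frame(
      visitors = as.numeric(scale(visitors)),
      retweets = as.numeric(scale(retweets))
    )

    # Smoothing: a seven-day moving average damps short spikes.
    ma7 <- stats::filter(visitors, rep(1 / 7, 7), sides = 2)

    # Trimming: drop the top and bottom 5% of values.
    cuts <- quantile(visitors, c(0.05, 0.95))
    trimmed <- visitors[visitors >= cuts[1] & visitors <= cuts[2]]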

Katie Robbert 19:09
So I cut my teeth on data analysis in the academic world, and there are very strict rules in, you know, clinical trials around how you can analyze the data. And I don’t want to say that when I stepped out of that world into the marketing world the rules went out the window, because obviously they didn’t, but the rules do feel a little bit more loose to me. So for example, in terms of the consistency of the date range: if you don’t have 100% apples-to-apples data on the date range, so from midnight on the first to midnight on the 31st in every single data set that you’re putting together, it’s invalid; you can’t do it. The same is not true, at least in my experience, in a non-regulated world, like marketing data, digital analytics, those kinds of things. Why is that, Chris? Why is it more acceptable to have less strict rules, and how do we, I guess, compensate for that to make it feel more valid?

Christopher Penn 20:16
That goes back to goal and strategy, right? It’s level of risk. If I screw up a drug analysis, I could literally kill millions of people, right? If I screw up a time series forecast for when to send an email, probably nobody’s going to die. You might not get your bonus, but no one’s going to die from that. And so it all comes down to what level of risk is acceptable. And honestly, that, which is not part of the exploratory process, but is literally the namesake of the show, is the “so what”: what are you going to do with the data? If there’s a, you know, 10% variance in your email marketing forecast, is that enough to care? Knowing you’ve got to send on Tuesday, whether it’s a 10% variance or a 50% variance, you’re still going to send your email on Tuesday, right? You’ve got seven choices for what day of the week you send your email; there’s much lower risk. And so you can be more lax in some of these statistical techniques, depending on the level of risk. If you were building a data analysis to tell a company where to spend $10 million, you might want to employ a lot of rigor, because that could be a very expensive mistake. Again, still, no one’s going to die, but it could endanger the company’s profitability.

Katie Robbert 21:39
So it sounds like one of the things that needs to go into the goal and strategy section is that risk analysis: if we get it wrong, or if we spend $100,000 and it’s the wrong thing, can we live with that? You know, I know that we’ve talked with other marketers who purposely have almost a negative ROI on their campaigns, where they know they’re spending more and they’re losing money, but they’re okay with that. And as someone who does the books for the company, that freaks me out and gives me hives, but it’s a risk that they’re willing to live with, because that’s just part of their business model. And I think all of those questions, what-ifs, and so-whats, that all has to get sorted out first. And that is one of the things that we see. Again, John, I know you talk with a lot of the network and the prospects: people just sort of jumping ahead to the outcome and what they want to see, versus what do we even have to work with.

John Wall 22:49
Yeah, there are often two different camps. There’s the one side where you’re just going to go and dig once and try and find an answer to one question, and that’s totally different from the projects where you’re like, okay, we’re going to report on this every month for the rest of our lives. And I definitely want to give a plug for documentation. Talking about all this stuff, where you’re validating the data and you’re maybe having to delete columns and change things: every time you do that process, you really want to have a set of running notes where you’re saying, okay, delete columns X and JJ, and all that kind of stuff, so that as fast as possible you can get to the point where you can pay another human to do it, or automate it, or whatever. You want to learn those lessons. Another thing with that: you don’t want to, three cycles later, forget one of the steps and pollute your data pile. So yeah, do your documentation. That’s my PSA for the day.

Katie Robbert 23:42
I think it’s a really good PSA. And again, like I mentioned, I cut my teeth in the pharmaceutical, academic world, where methodology is everything; you have to document every single thing you did with the data. So if you removed a number on line 157, that had better go into your methodology. And it’s something that, when we’re doing this analysis in the marketing space, in less regulated spaces, I very rarely see: methodology statements from a team doing an analysis, whether for themselves, for their upper management, or on behalf of a client. And to me, that’s such a miss, because if you’re doing the analysis right, there’s no reason not to talk through what it is you did to get there. It doesn’t mean you’re giving away all your secrets; it just means that you’re being transparent around how you arrived at the answers that you’re sharing. So I think the methodology

John Wall 24:38
and job security Venn diagram, there’s not much overlap in that.

Christopher Penn 24:45
That’s true. The other thing on the risk side, too, is, again, what happens with the data. Like when we used to work at an agency where, you know, there was a tremendous amount of reporting, but honestly, we could have colored crayons on napkins and handed them to the clients, because the clients never read any of the reports, ever. And so there was no risk level; you literally could have made anything up. Now, we are fortunate that, starting our own company, we have been freed of a lot of the restraints of the way that our old shop used to do things, and we actually do insist that clients read what we produce. But, again, that comes down to risk. If you’re just cranking out PowerPoints every month for your reporting, and nobody looks at them, your risk level is low, and so you don’t have to be particularly strict about the rules from a risk standpoint. That said, it’s not a bad habit to get into, being strict with the rules, just for your own professional development, so that if you do change jobs and you move into a company that’s more stringent about the rules, you’ll be okay in that company.

Katie Robbert 25:48
Well, I would like to backtrack a little bit with a disclaimer: just because people were not looking at the data does not mean that we weren’t stringent about how the data was being collected. We took a lot of pride in it. And I think that’s something that is also worth mentioning. You talked a little bit about that QA process, Chris, but it’s a question we’ve asked a lot of our partners and our friends in the marketing space: what is your QA process? And a lot of times, we’ll just sort of get a blank stare of, what do you mean? So that seems like another thing to build in as you’re thinking through, I need to do this analysis. Even if it’s a small analysis, there should be some level of QA built in to make sure there are no copy-and-paste errors, or, if you’ve set up an automated process, that it’s working correctly. Chris, do you want to talk through a little bit of what you’ve experienced with and without QA, and what people should be looking for?

Christopher Penn 26:49
The number one thing is to get humans out of the process, out of your data, as quickly as possible, because humans are the source of errors. Automate this stuff as much as possible, and that’s something we’ll tackle in future episodes: how do you automate as much of this as possible and have the machines enforce rules that the humans may have forgotten, whether it’s normalization, or deletion, or anomaly detection, or imputation. You do not want people doing that stuff, because people just screw it up all the time. I screw it up, even when looking at my own work and just sanity-checking things: is this within the expected range for this variable? I screw things up all the time. So one of my personal imperatives is: how can I get myself out of the process as much as possible, double- and triple-check the code itself, walk through it, talk it through with other members of the team or with the customer to make sure that the process is sound, and then get humans away from it and let machines enforce the process for you.
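A minimal sketch in R of what “let the machines enforce the rules” can look like; the checks, column names, and ranges here are hypothetical stand-ins for whatever your own requirements define:

    # Automated sanity checks, run before any analysis touches the data.
    check_data <- function(df) {
      stopifnot(
        nrow(df) > 0,          # we actually have data
        !any(is.na(df$date)),  # no missing dates
        all(df$visits >= 0),   # no negative traffic
        all(df$visits < 1e6)   # within the expected range
      )
      df  # returned unchanged only if every rule passes
    }

    daily <- data.frame(
      date   = as.Date("2021-01-01") + 0:6,
      visits = c(512, 498, 530, 0, 505, 520, 515)
    )
    daily <- check_data(daily)  # halts loudly instead of relying on human eyeballs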

John Wall 27:52
Katie, I wanted to ask you about that. Because, you know, for so many of the business things that I’ve done, methodology and documentation come after, if they’re even done at all, because, again, management just wants the answer; they don’t care. But in the realm that you came from, where methodology is a legal requirement and has to be done, how did that normally work? Because, you know, for us, maybe it’s a Google Doc of some kind, just running text. But did you actually have materials that had to always ride along with the data? How did you manage that?

Katie Robbert 28:24
Well, not to date myself, but a lot of stuff was actually collected on physical pieces of paper that were then kept under lock and key. Literally, the box with all the data in it would be locked, and then the keys would have to be kept somewhere else, because it was protected health information. But then somebody would have to input it into some sort of a system, and so all of those steps had to be documented. A lot of times, before the project even started, you had to outline the process and have a committee approve it, to say, yes, this is a valid process, in order to ensure that your data is correct. And then you had to have one or two research assistants double-checking every single number. So it’s a lengthy process. It’s why clinical trials tend to be so expensive: there are so many quality assurance checks involved, but it’s for the safety of human lives. Now, in the marketing space, Chris, to your point, if you’re sending out an email, it’s probably not life and death. However, if someone’s paying you six figures for some kind of analysis, or even $500 for an analysis, you want to make sure that it’s correct. And one of the things that I’ve seen a lot of people skip over is that QA process. Even if you’re pulling really straightforward data directly from Google Analytics, it never hurts to have somebody else take a look at it: hey, can you check my work, just make sure I didn’t copy and paste something incorrectly. You don’t even have to go down the road of having lots of code and processes and automation. I mistype numbers all the time, like writing 75 when I meant 57; that could be a big error down the line if I don’t catch it up front. So as you’re going through this exploratory data analysis process, make sure that you’re building in those QA checkpoints along the way, because if you get too far down the road, backtracking to find out where things went wrong can become a very expensive endeavor for something that was meant to be simple in the first place.

Christopher Penn 30:33
Yep. So, the last two steps in the process. Feature engineering is when you’re taking the data that you have and making new data out of it. The easiest example of this is a date: you can take a date and break it into the year, the month, the day of the month, the day of the year, the weekday numerically, the weekday by name. You can create all these additional features from just that one data point, which you can then use for analysis. You see this a lot in stuff like email and social media marketing; “what day of the week is the best day to tweet” is a silly example, but all of that has to be engineered from the data you already have. And so that’s something that, again, is planned in the requirements process, but it is a discrete step, because you typically have to either write code or have tools do it for you. And then the last part is, once you’ve done all this stuff, you’ve got to do something with it. What happens with the data? Does it become insights and analysis? Does it go into a formal data science process to become a machine learning model, a piece of technology, a piece of software? Not everything does, but you at least have to be able to say, okay, after this whole process, here’s what we’ve shipped, here’s the thing that you can use to make a decision with. One of the big pitfalls that we’ve seen, way more often than is comfortable, is you do all this and then you create shelfware: here’s your PowerPoint slide deck, and it gets put on a shelf and nobody ever looks at it. Okay, well, what was the point of spending all those hours and all those resources if you never actually do anything with the data? It goes back to that goal and strategy: what decisions do you want to make with this data? When we talk about the data preparation process and the data analysis process, you’ve got to have that goal in mind up front, because otherwise it’s just an academic exercise.
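Chris’s date example, sketched in base R; the input date is arbitrary:

    d <- as.Date("2021-01-28")  # one data point: a single date

    features <- data.frame(
      year         = as.integer(format(d, "%Y")),
      month        = as.integer(format(d, "%m")),
      day_of_month = as.integer(format(d, "%d")),
      day_of_year  = as.integer(format(d, "%j")),
      weekday_num  = as.integer(format(d, "%u")),  # 1 = Monday
      weekday_name = weekdays(d)
    )
    print(features)  # one date becomes six usable features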

Katie Robbert 32:20
I’ll be honest: a lot of times people will say, “I’m just curious, I want to see what the data says.” If that’s the response you get from the person asking you to do the analysis, push back and ask for something more specific, because that really is going to be a waste of your time.

Christopher Penn 32:37
Yep. And in terms of the tools to do this stuff, you can do a surprisingly large amount of it in simple spreadsheet software. Microsoft Excel is a pretty darn good tool, right? It’s the number one data analysis tool on the planet for a reason: it’s good enough for a lot of different tasks. Some of the other tools that we use: we do use a lot of databases, SQL databases, BigQuery, stuff to store large datasets, and we do a ton of coding in the statistical programming language R. But you can do that in Python and Julia and Scala, you know, depending on the language you’re comfortable with. If you aren’t going that route, there are also tons of off-the-shelf tools you can use for visualization, for steps like your initial analysis; software like Tableau, which is a Salesforce product, is a really great tool. For some of the preparation and feature engineering, you can use tools like IBM Watson Studio and its many varied modules, some of which are no-code: you just click and drag the stuff that you want cleaned up. So there are plenty of options for tools. The trick, as you’re going through this, is to say, okay, what tools do I need at each stage in the process, and do I have them? Or, if there’s a stage in the process that doesn’t seem familiar, to say, okay, what do I need to learn? What skills do I need to learn, what processes or recipes, and then what technology? So really, the whole people-process-platform framework is kind of a mini checklist at each stage of the exploratory data analysis process: what do we have, what don’t we have, what do we know well, what recipes have we already cooked?

Katie Robbert 34:16
And it may sound daunting to go through this whole process and do the checklist, but one of the things we talk about a lot is getting yourself organized up front with your requirements and knowing what tools you’re going to need. It’s a lot like, Chris, your analogies about cooking: when you have your mise en place and you have everything organized, once you do that step, everything else goes so much faster, because you know exactly what you’re working with. It actually helps keep you focused as well; it keeps you on task. If you’re working at a company where everything is billed at an hourly rate and you have to stay on budget, it really helps you understand how much time you’re going to need to do these things. So doing this work up front to set your goal, set your strategy, organize your tools, ask your questions, set up your QA process, how you’re going to check for anomalies: you really can’t skip this step. Now, if you have bazillions of dollars, you know, basically F-you money, and you don’t really care, then yeah, go ahead and skip right to analysis; it doesn’t matter. 99.9% of the people who are going to be doing analysis do not have that F-you money. So it’s probably a good idea to get organized up front and really explore what it is you’re going to need to do, even if it’s a simple analysis, because, to your point, Chris, it can become a repeatable process and something you can eventually just automate, and you can move on to more interesting and exciting stuff.

Christopher Penn 35:45
Exactly. A simple mise en place example: I was getting ready to make some chicken soup earlier today for dinner tonight, got everything ready, turned the Instant Pot on, and then realized I forgot to put the chicken in. Something as essential as that. Now think about that from a marketing data perspective: if you were trying to do social media engagement analysis and you forgot, like, Facebook, even though it’s one of your big channels, that’d be a pretty big oops. But it happens. So when we talk about getting ready to do data analysis, the data preparation process and the exploratory data analysis process, having that cookbook and following the recipe is so important. If there’s one place where I know I personally go wrong the most, it’s not following my own processes. If there’s one thing I’ve done that makes things go way faster, it’s having the existing code; we’ll just take that code, port it to new things, and be up and running much, much faster. The worst thing I’ve seen is watching somebody start a project and just open up a brand-new template in Tableau or whatever and start from scratch again. Like, why are you doing that? You should be taking your existing code and tools and updating them, sure, tuning them up all the time. But you should never be starting from scratch if you can possibly avoid it.

Katie Robbert 37:11
I think another really good example of this, and this is something we were talking about with one of our partners this morning, is not having even the right systems in place to collect the data. A very common issue that we see (and we are happy to help with it anytime) is companies will install Google Analytics, but they won’t also install Tag Manager. And so their ability to collect data accurately, to collect goal data, conversion data, which is most likely what they care about (what are people doing, are they buying stuff), is compromised, because they’re not collecting it efficiently, they’re not collecting it correctly, they’re not collecting it completely. And then they want to go ahead and start running campaigns and making big business decisions, but they didn’t step back to first do that foundational work to say, do we even have the right data? They just jump ahead to step six, and all of a sudden they’re making decisions based on things that aren’t real.

Christopher Penn 38:13
Yep. So that’s the process, and the lay of the land, if you will. It’s really not about the tools; you can get by with a lot of the tools you already have. It very much is about the process, and the skills that the people bring to the process, in order to make this stuff work well, consistently, over a long period of time. Any final parting thoughts?

John Wall 38:39
Plan on having to do it again, and plan on having to explain every part of it. Oh, so true.

Katie Robbert 38:46
But John, to your point, your PSA: if you document from the get-go, when people ask questions, you have it ready. You can answer the question, what happened to line 32? Hold on, let me tell you exactly what happened to line 32. And you also build trust with the people you’re doing the analysis for. So, next week, we actually get to talk about the data itself.

Christopher Penn 39:10
Alright, stay tuned; we’ll see you next week. Thanks for watching today. Be sure to subscribe to our show wherever you’re watching it. For more resources and to learn more, check out the Trust Insights podcast at TrustInsights.ai/tipodcast, and our weekly email newsletter at TrustInsights.ai/newsletter. Got questions about what you saw in today’s episode? Join our free Analytics for Marketers Slack group at TrustInsights.ai/analytics-for-marketers. See you next time.

Transcribed by https://otter.ai


Need help with your marketing AI and analytics?


Get unique data, analysis, and perspectives on analytics, insights, machine learning, marketing, and AI in the weekly Trust Insights newsletter, INBOX INSIGHTS. Subscribe now for free; new issues every Wednesday!

Click here to subscribe now »

Want to learn more about data, analytics, and insights? Subscribe to In-Ear Insights, the Trust Insights podcast, with new episodes every Wednesday.
