As a PM, you need the skill of data analysis. How?
Part 3 of 3 of the series: Develop the critical PM skill of data analysis
In the previous two issues I explained why it’s important for PMs to know the skill of data analysis, and when to use it. This issue goes into some details of how to analyze data with specific tools.
SQL is a universal standard
As a PM, you should be fairly proficient with SQL (Structured Query Language). SQL continues to be the most popular language for data systems. So unlike software programming, where there is a large diversity of languages, SQL is the universal language for the vast majority of cases you’ll encounter as a PM (and even as an engineer). Get proficient at SQL early in your career, and it will pay dividends for you over and over again. (Note that there are different flavors of SQL, but it’s very easy to pick up additional SQL varieties after you learn your first one.)
The theory behind SQL and relational databases can be pretty involved, especially if you don’t have a technical or math background. But PMs of all backgrounds can quickly get up to speed using the basics of SQL for very practical day-to-day tasks. In other words, focus on the practice, not the theory.
Getting set up
The very first step is getting read access to a database. This can be the actual production database if you’re working in a very small startup, or a mirror (copy) in a larger company. In some cases, you’ll be working in a data warehouse, where data has been processed into a more manageable and understandable format. In any case, you may need help from a software engineer or a tech admin in this initial step to get access. Fortunately, it’s a reasonable request because most tech folks understand modern PMs need pretty direct access to data to do their job. Read access is pretty safe from an engineering standpoint and also allows you as a PM to do 99% of the data analysis tasks you need anyway. If you get stuck at this stage, it’s very reasonable to ask your manager for help.
In sum, get a technical person at your company to provide you with read access to a database of relevant data. Ask them to give you pointers to get set up with a SQL-compatible interface. It could be a SQL app that connects to a database, or a simple terminal interface. Also, make sure to understand how often the data is refreshed. Are you querying against a real-time production database? Or are you looking at data that is refreshed nightly? (The latter is not useful for investigating urgent production issues, but is useful for looking at product usage trends.)
The best way to learn SQL is to actually use it. Just simple trial and error with Google and Stack Overflow. Recall you aren’t trying to understand the theory. Just get good at the practice. Over time, theory and deeper understanding will develop.
The easiest and most relevant way for a PM to start looking at data is reviewing customer information in a database. (A note on privacy: Make sure you are following the privacy rules in your organization when reviewing sensitive data.) If possible, find an actual reported customer problem, such as an issue in your support team’s backlog. You should be able to find a customer ID or an email address. Dig into the database for that customer. For example, there is often a table that represents customers, so you would execute something like:
select customer_id, name, email, phone_number
from customers
where customer_id = '12345'
When doing triage for a customer problem, it’s good to have a baseline understanding of the customer. If you don’t have access to a customer profile in an admin portal, this should be your first query. The customer ID is likely used in other tables as you continue to do the investigation. So if you don’t have it, make sure you get it. For example, if you’re given an email address from the support department, you execute this query instead:
select customer_id, name, email, phone_number
from customers
where email = 'firstname.lastname@example.org'
Next, you want to pick up the trail of breadcrumbs left by this customer. What were they doing last before they encountered the reported problem? In many systems, there’s at least one events table that tracks actions performed by customers. So you would do something like:
select customer_id, action_name, action_type, system, time
from events
where customer_id = '12345'
order by time desc
limit 50
This query lists the 50 newest events of the customer that were logged by the system. From here, you may be able to deduce that the customer consistently started seeing error messages after they updated their name in the system. This is something clear and actionable that you can share with the engineering team to fix.
When you are deep into an investigation, you should have a scratchpad open where you can quickly jot down various IDs as you try to uncover the series of events leading up to the reported problem. Only once you have satisfactorily discovered the root cause should you write a single clean query combining everything, execute it, take a screenshot of it, and clearly document the problem in your issue tracker before sending it to your engineering team to fix. For example, you may combine the above two queries with a join to show the information together.
select customer_id, email, action_name, action_type, system, time
from events
join customers using (customer_id)
where email = 'firstname.lastname@example.org'
order by time desc
limit 50
There are a lot more details here, and I’ll point to additional resources later in this issue. But this is the general approach to investigating one-off problems: First, zero in on a single customer that is being impacted, and identify the root cause. Then identify the impact. Can you see the same error or problem behavior from other users? If yes, document those cases as well, as they should inform your prioritization of a fix.
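To see what that impact assessment looks like end-to-end, here’s a minimal runnable sketch using Python’s built-in sqlite3 module. The data is made up, and the table and column names (events, action_name, customer_id) mirror the hypothetical schema in this issue, not any real system:

```python
# Hypothetical sketch: after finding the root cause for one customer,
# estimate impact by counting other customers who hit the same error.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table events (
        customer_id text,
        action_name text,
        time text
    )
""")
conn.executemany(
    "insert into events values (?, ?, ?)",
    [
        ("12345", "name_update_error", "2020-09-03"),
        ("67890", "name_update_error", "2020-09-04"),
        ("67890", "name_update_error", "2020-09-05"),
        ("11111", "submit_comment",    "2020-09-05"),
    ],
)

# How many distinct customers saw this error, and how often?
row = conn.execute("""
    select count(distinct customer_id), count(*)
    from events
    where action_name = 'name_update_error'
""").fetchone()
print(row)  # (2, 3): 2 affected customers, 3 error events total
```

Numbers like these ("2 customers hit this 3 times this month") are exactly what makes a bug report actionable for prioritization.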
Most of the time, a software bug is causing the problem, so it’s not really your job to understand why the problem manifests. But from investigating the data, you can readily assess the impact itself. That’s the power of data analysis: you can deeply understand systems and customer impact without necessarily having to go into the code. There are tricks to further investigate the code itself as a non-technical PM that don’t require you to be an experienced engineer who understands the entire codebase. But that’s for another newsletter issue.
Aggregation techniques are necessary to combine raw data points into valuable insight. In the simplest case of capturing usage, you might have something like:
select count(*)
from events
where action_type = 'submit_comment'
  and '2020-09-01' < time and time < '2020-10-01'
This would count the number of times anyone submitted a comment in a productivity app, for example, during the month of September. If you wanted to count the number of unique users that submitted a comment, you would replace count(*) with count(distinct customer_id).
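To make the difference between counting events and counting unique users concrete, here’s a self-contained sketch using Python’s built-in sqlite3 module, with a handful of made-up rows:

```python
# Made-up data: two customers, three comment submissions in September.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (customer_id text, action_type text, time text)")
conn.executemany(
    "insert into events values (?, ?, ?)",
    [
        ("a", "submit_comment", "2020-09-02"),
        ("a", "submit_comment", "2020-09-10"),
        ("b", "submit_comment", "2020-09-15"),
        ("b", "delete_comment", "2020-09-16"),
    ],
)

# count(*) counts events; count(distinct customer_id) counts unique users
total, unique_users = conn.execute("""
    select count(*), count(distinct customer_id)
    from events
    where action_type = 'submit_comment'
      and '2020-09-01' < time and time < '2020-10-01'
""").fetchone()
print(total, unique_users)  # 3 comments submitted by 2 unique users
```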
More interestingly, you often want to calculate the statistics of a given metric in a given time period. You may be tempted to calculate an average, but from a product perspective, you should avoid that. Averages don’t typically help with product decisions, and are often even harmful. For example, you might determine that the average credit card processing time for users in your ecommerce app is 2.3 seconds. That’s great you think. But what if a good proportion of users are experiencing times of up to 10 seconds? That would be a horrible experience. Don’t rely on averages. A better characterization is using percentiles. This query calculates a few of them.
-- assuming the measured value lives in a column named duration
select
  percentile_cont(0.50) within group (order by duration) as p50,
  percentile_cont(0.80) within group (order by duration) as p80,
  percentile_cont(0.90) within group (order by duration) as p90,
  percentile_cont(0.95) within group (order by duration) as p95,
  percentile_cont(0.99) within group (order by duration) as p99
from durations
where type = 'credit_card_processing'
  and '2020-09-01' < time and time < '2020-10-01'
If p95 here turns out to be 3.2 seconds, it means that 95% of credit card processing times take up to 3.2 seconds. That’s great! Using percentiles to analyze your data allows you to quickly identify where in your product users are suffering the most pain, and should allow you to make better product decisions generally. It also allows you to set reasonable business (and thus, product-technology) goals. For example, a goal might be to reduce p95 to 2.5 seconds by the end of the quarter, with reducing p99 to 2.8 seconds as a stretch goal.
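The averages-versus-percentiles point is easy to see with a toy calculation. This sketch uses made-up numbers: 90 fast checkouts and 10 very slow ones:

```python
# Toy illustration: the average looks healthy while the 95th percentile
# reveals the pain a tenth of users are actually feeling.
import statistics

times = [1.5] * 90 + [10.0] * 10  # durations in seconds (made up)

avg = statistics.mean(times)
p95 = sorted(times)[int(0.95 * len(times)) - 1]  # simple percentile estimate

print(f"average: {avg:.2f}s")  # average: 2.35s - looks acceptable
print(f"p95: {p95:.1f}s")      # p95: 10.0s - 1 in 10 users waits 10 seconds
```

The average of 2.35 seconds hides the fact that 10% of users are waiting 10 seconds, which is exactly the trap described above.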
The above query captures statistics within a given time period. But you typically want to perform this analysis over multiple periods of time, and even surface it in some dashboard tool, so you can see the metrics evolve over time. So aggregation in time is also important. Using the same credit card processing example, you could have something like:
select
  date_trunc('month', time) as month,
  percentile_cont(0.50) within group (order by duration) as p50,
  percentile_cont(0.80) within group (order by duration) as p80,
  percentile_cont(0.90) within group (order by duration) as p90,
  percentile_cont(0.95) within group (order by duration) as p95,
  percentile_cont(0.99) within group (order by duration) as p99
from durations
where type = 'credit_card_processing'
  and time > '2020-01-01'
group by 1
order by 1 asc
This would give you the credit card processing percentiles per month, since January 2020.
SQL is very powerful and can be a bit intimidating, especially if you are new to databases. But if you start with the simplest concepts and apply them directly to your daily work, you’ll catch on very quickly. If you’re eager to get started but don’t have an easily accessible database at the moment, I highly recommend working through PostgreSQL Exercises. It’s very much focused on doing, which is the best way to learn SQL (and product management in general!).
Dashboards and data visualization tools
Generating results with SQL is the first step in data analysis. Longer-term, you want to invest in more scalable data tools. Data visualization tools are probably the most important here, as they serve as an important channel for communicating data to other stakeholders. For example, if you want to show product usage trends to a different department, you might need something more robust than a table of values that is manually entered once a month in a spreadsheet. Fortunately, this is an area you don’t have to worry about much as a PM. Software engineers and data teams will provide you the tools. And if those tools don’t meet your needs, you need to make that known to them as a user.
In particular, visual tools should provide you a mission control style dashboard at a minimum, to help you visualize your query results without needing to re-enter them every time. Unlike SQL, there are many different options in the marketplace here, and different companies will have different needs and legacy reasons for selecting certain choices. So it’s not important to try to learn them ahead of time. Just pick up the skills of the particular tool your company offers as you ramp up in a new PM role, for example. However, if you do want to spend a few weekends getting at least a cursory understanding of these tools, it will always benefit your resume to do so. Some popular options include Tableau, Looker, and Power BI.
Don’t over-invest in learning these tools if you are not using them day to day. Lastly, if your company doesn’t have anything set up currently, you should at least be using the charting functionality in Microsoft Excel / Google Spreadsheets to show your work to stakeholders within your organization.
In this issue, I discussed primarily SQL as the language of modern data analysis. Many companies, especially larger and older organizations, still rely heavily on Microsoft Excel or Google Spreadsheets. As a PM, you may not have access to a database in these companies, and may just be given data exports where it’s expected you will use Excel. That’s unfortunate, but do take it as an opportunity to maximize the value of the resources you are provided. Demonstrate to stakeholders that, using what little constrained data you have, you can provide immense product development value. Once you’ve shown this, it will be easier to convince folks to give you more direct data access.
Machine learning and AI
Unless your product directly uses machine learning (ML) or artificial intelligence (AI), there isn’t any urgent reason to become an expert in these topics as a PM. The aforementioned techniques are more than sufficient to analyze data for typical product needs. Nonetheless, do know that basic ML/AI techniques are very accessible to non-technical PMs. You could pick up the skills and tools to do this type of data analysis just by taking a few hours of free online courses. And the prerequisite here is just high-school-level math, minus calculus. The excellent Coursera ML course by Andrew Ng is now a classic for those interested in getting a simple taste of these topics. But again, know that these are nice-to-have skills for an early-career practicing PM, not something to focus on right away.
Scripting is another optional tool in a PM’s toolchain that’s a nice-to-have for data analysis. It allows for more power and flexibility, but in most cases it’s unnecessary.
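As a taste of what scripting buys you: if all you have is a CSV export of, say, credit card processing durations, a few lines of Python can compute the same percentiles you’d otherwise get from SQL. The file contents and column name here are invented for illustration:

```python
# Hypothetical sketch: compute percentiles from a CSV export instead of SQL.
import csv
import io
import statistics

# Stand-in for a real export, e.g. open("durations_export.csv")
export = io.StringIO("duration\n1.9\n2.1\n2.3\n2.6\n3.2\n")

durations = [float(row["duration"]) for row in csv.DictReader(export)]

# statistics.quantiles with n=100 returns 99 cut points, usable as percentiles
cuts = statistics.quantiles(durations, n=100, method="inclusive")
print(f"p50={cuts[49]:.2f}  p95={cuts[94]:.2f}")  # p50=2.30  p95=3.08
```

Once data no longer fits comfortably in a query result or a spreadsheet, a small script like this is the natural next step, but again, it’s optional for most PM work.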
Critical PM skill of data analysis
This is the final issue of this series, Develop the critical PM skill of data analysis. Visit the website to see past issues.
Thanks for reading! 🙏
Mentorship in your inbox: Subscribe to Product Management 101
Product Management 101 covers practical tips and basic skills for aspiring and new product managers.
I’m Victor Wu, Head of Product at Tilt Dev. I’ve been creating digital products for over 15 years, and mentoring folks for much of that time. Subscribe, and reply to email updates with questions you’d ask in a real-life mentoring session. I’ll answer them in future issues.