In the past few years, influencer marketing has gone from a guerrilla marketing tactic to a staple of any large brand campaign, and nowadays if you haven’t at least tried influencer marketing, you’re doing it wrong. But while there is an endless number of new influencers popping up every week, the challenge of finding and choosing those who will give a positive return on investment is only becoming more difficult. When building Trefiel, we used influencer marketing almost exclusively to grow beyond the $20k/month mark, but even after booking hundreds of influencers we still struggled to predict which ones would reliably result in more sales revenue than the cost of their promotion. In reality, the term ‘influencer’ has become an oxymoron – most of the influencers on social media have zero influence, and are just trying to make a quick buck without providing real value to consumers or brands.

The problem


The problem with influencers is that while you can get a good idea of how many people are liking or engaging with their content, there isn’t a strong relationship between the size of the audience and the sales produced. Some of our small mummy blogger influencers with highly engaged audiences of 50k followers would make more sales from a post costing $150 than other bikini girls with whom we spent $1000+ on a single post. The style of content they produce, the amount of effort they put into the posts, the levels of engagement with their audience and who their audience is all matter. I like to think of the size of an influencer’s audience as setting the upper bound of the possible revenue gained from a campaign post, but it offers little value when it comes to predicting exactly how much your brand will gain from a post with that influencer. And while there are influencer analytics tools on the market that can help you find and analyse influencers and general stats about their audiences (eg InfluencerDB, Socialblade, Iconosquare), none of them do a good job of teasing their actual ‘influence’ out of these stats, or integrating it into a prediction tool that could help you make better choices when choosing influencers to work with.

To help with this, I would use InfluencerDB as my primary research tool, and build my own spreadsheets with stats from the platform (such as followers, average likes per post, average comments per post) as well as some additional features to better estimate follower engagement. While this helped, it still didn’t do a very good job of capturing an influencer’s overall level of influence, and we still had to go through their accounts one by one to look at the individual posts and comments. After hundreds of bookings, we learned a bunch of our own strategies to help predict whether an influencer would work long before we spent any money on them. The downside was that because none of the tools automated this for us on a mass scale, we still had to trawl through each influencer’s posts reading the comments, looking at their followers’ demographics, seeing how many other brands were working with the influencer on an ongoing basis, and tracking the engagement on the brands’ pages after sponsored posts. This resulted in huge investments of time to try and find the needles in the haystack (only about a dozen out of the hundreds of influencers actually had highly engaged audiences and provided consistent results for our brand, but we still had to go through them one by one to find out).

The need for personalised recommendations

This brings me to the last and most critical part of predicting influencer success. If you’re an eCommerce brand and have assessed the number of likes and comments on an influencer’s previous posts for other brands, you still have no idea how many sales the brand has made thanks to the post (unless you know the brand personally and they’re willing to share the sales results). This is one of the huge problems – you may be able to see how many likes and comments an influencer’s post for another brand received, but you have no easy way of measuring how many of those translated to new followers, website visits, or ultimately sales.

Influencers don’t want to share campaign results (if they do, you’re less likely to do a test campaign and it helps you bargain on price), the brands they are working with don’t want to share their sales results (it’s usually contractually confidential, and could lead to you competing for an influencer’s limited inventory if they’ve previously been successful), and most importantly of all, most of the time the influencers and agencies don’t even know themselves (I’m yet to ask for a case study from an agency and receive anything useful – it’s always campaign post impressions and a couple of example comments, which mean absolutely nothing when attempting to predict sales).

While there are some tools available that attempt to assess how much influence an influencer actually has (eg Hypetap, which gives a score showing an influencer’s suitability for their platform), most of these are fairly naive and focused on the size of the audience and likes per post, rather than how many of their followers actually engage with the brands they advertise in their posts or buy the products promoted. Furthermore, the scores they offer are based on the influencer overall, and don’t necessarily help you as a brand owner in trying to predict whether they will work for your own brand. This is important because while our bikini model influencer would have sold a whole lot of bikinis, her audience wasn’t as interested in moisturising face masks.

Right now, it’s impossible to know for certain if an influencer’s audience will buy your product unless you pony up the cash for a test post, so wouldn’t it be great if there was a tool that could help you predict this based on an influencer’s posts for other brands, depending on how similar they are to yours, and help you decide which influencers are worth the money (and time) to book?

My solution


So, I thought I’d come up with a tool that takes the data available through the Instagram API and builds some prediction models to do this job for you as a brand owner. Now, to go and build this tool would require aggregating data on tens of thousands of influencer and brand accounts over a period of 2-3 months, which would mean needing to replicate the data warehouse of many of the analytics tools already available on the market, and ain’t nobody got time for that. So instead, here is my spec on how I would build it, and how this would work for the user.

UPDATE: As of April 4th, it looks like Facebook has deprecated many of the Instagram API endpoints required not only by my proposed tools, but also by most of the social media analytics tools on the market right now.

How it works

When choosing influencers, there are two possible ways a brand owner may want to explore the market:

  1. When the brand owner wants to start from scratch, they want to be recommended influencers based on the performance of their posts for other brands similar to theirs.
  2. When the brand owner has worked with influencers before and knows which ones were successful and which weren’t, they want to find similar influencers to those who have been proven.

While current tools can filter influencer accounts by content genre or audience size (which aren’t good predictors of their actual influence), giving the user the ability to choose the dimension by which they want to sort (eg audience size and followers, as well as number of ongoing relationships with other brands, comment engagement, followers gained for brands etc) would be extremely useful here and a much better guide to likely outcomes.

Ideally, if both of the approaches above are built into a research tool, the user would go through one of the pathways above and be shown a list of suitable influencers for their brand, as well as a prediction of the number of followers (and potentially engagement on their posts) we expect the brand to gain from a sponsored post. We would use followers gained as the default outcome metric because it’s the best proxy for clicks to website and sales, which we can’t track directly (more on this soon).

To make sure we don’t just end up with the largest influencers at the top of the results every time, we’d sort the influencers by expected followers to total followers ratio, since total followers is a rough guide for price per post and this helps our users get the best ROI for each dollar spent.
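
As a rough sketch of that ranking: the account names and numbers below are made up, and `expected_followers` is an assumed stand-in for whatever the prediction model would output.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    handle: str                # hypothetical account name
    total_followers: int       # rough guide to price per post
    expected_followers: float  # assumed model prediction for our brand


def rank_by_efficiency(candidates):
    """Sort by expected followers gained per existing follower, best first."""
    return sorted(
        candidates,
        key=lambda c: c.expected_followers / c.total_followers,
        reverse=True,
    )


candidates = [
    Candidate("big_bikini_account", 1_000_000, 400.0),
    Candidate("small_mummy_blogger", 50_000, 150.0),
    Candidate("mid_fitness_account", 200_000, 300.0),
]

ranked = rank_by_efficiency(candidates)
for c in ranked:
    print(c.handle, c.expected_followers / c.total_followers)
```

Sorting by the raw prediction would put the largest account first; sorting by the ratio puts the small account with the best return per follower at the top instead.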

Visitor and sales tracking via analytics integration

Followers gained can be useful, but ultimately we want to predict sales made from each post. For this we need visibility over what is happening on the website once a post is made. To do this, we could have users of the tool grant access to their website’s Google Analytics (‘GA’) or Facebook Advertising (‘FA’) account, and then retrieve the data for visitors and sales made on each date as a result of a post. In order to get users to opt in to this and relinquish their data, the tool could automatically audit all previous campaigns by going back through their website data and previous promotions and give a report of the results.

With this permission we can then filter out unrelated traffic sources and average daily direct visitors and sales, which would give a much more accurate estimate on the resulting sales from each post. You could also compare site-wide conversion rates for traffic coming from each individual influencer, giving another metric to compare influencers on and allowing you to judge the fit of each audience to your brand.
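
One simple way to do that subtraction, as a sketch only: take the average of the days before the post as the baseline and count anything above it in the days after as coming from the post. The 7-day baseline, the 2-day attribution window and the sample numbers are my assumptions, not tested values.

```python
from statistics import mean


def incremental_lift(daily_values, post_day_index, baseline_days=7, window_days=2):
    """Estimate the lift from a post as observed total minus expected baseline.

    daily_values: one number per day (eg visitors or sales), oldest first.
    post_day_index: index of the day the sponsored post went live.
    """
    start = max(0, post_day_index - baseline_days)
    baseline = daily_values[start:post_day_index]
    if not baseline:
        raise ValueError("need at least one day of data before the post")
    baseline_per_day = mean(baseline)
    window = daily_values[post_day_index:post_day_index + window_days]
    return sum(window) - baseline_per_day * len(window)


def conversion_rate(sales, visitors):
    """Sales divided by visitors for one traffic source (0 if no visitors)."""
    return sales / visitors if visitors else 0.0


# Made-up daily visitor counts; the post goes live on day index 7
visitors = [100, 110, 90, 100, 105, 95, 100, 400, 250, 120]
lift = incremental_lift(visitors, post_day_index=7)
print(lift)
```

The same function works on daily sales. A real version would also need to handle overlapping posts and other promotions running on the same days.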

What data do we need?


To make these tools work, first we need the data. Behind the scenes we need to query the Instagram API to get info about our brands, influencers and their followers, and we will warehouse this data in three tables:

  1. Account table (metrics about the account overall)
  2. Posts table (metrics about influencer and brand posts)
  3. Follower count table (to track the daily changes in follower counts resulting from promotions).
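
To make the three tables concrete, here is a minimal sketch using SQLite with only the basic columns. The table and column names are my own assumptions, and the engineered and ML-derived features would be added as extra columns.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not from any existing tool
SCHEMA = """
CREATE TABLE accounts (
    account_name    TEXT PRIMARY KEY,
    follower_count  INTEGER,
    total_posts     INTEGER,
    total_following INTEGER
);
CREATE TABLE posts (
    post_id       INTEGER PRIMARY KEY,
    account_name  TEXT REFERENCES accounts(account_name),
    posted_at     TEXT,     -- date and time of post
    location      TEXT,
    like_count    INTEGER,
    comment_count INTEGER
);
CREATE TABLE follower_counts (
    account_name   TEXT REFERENCES accounts(account_name),
    observed_on    TEXT,    -- date of observation
    follower_count INTEGER,
    PRIMARY KEY (account_name, observed_on)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Keying the follower count table on account and date means the daily job can only ever write one observation per account per day.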

While we can get some overall metrics straight from the Instagram API (the same as offered by other tools), these alone don’t correlate well with the results you can expect to gain from an influencer’s post. So, below are the features to be included and generated for each table which will then be fed into our prediction models. Many of these features I’ve come up with from my own experience in choosing influencers, and some of them will need additional ML pipelines, particularly when it comes to classifying characteristics of the account.

Table 1: Account metrics

This table gives the overall metrics for each Instagram account (one account per row). Many of these metrics are derived from the other two tables.

Basic features available through API/scraping

  • Follower count
  • Total posts
  • Total following

Engineered features from their post history and features above

  • Average likes per post (last 30 days) – From this we can gauge impressions per post
  • Average comments per post (last 30 days) – A much better metric than likes for audience engagement
  • Follower growth (last 30 days) – How fast is this account growing (you want to cement ongoing relationships early with fast-growing accounts since they have limited inventory for brands and you’ll get better rates if you get in early and lock rates in).
  • Total brands mentioned over last 90 days – Sum of brand accounts (not other influencers) mentioned on sponsored posts in the last 90 days.
  • Average sponsored posts per brand over last 90 days – Using the two metrics above we can see whether brands who have recently worked with the influencer are rebooking them multiple times; this indicates they are getting positive ROI on the posts and is a great sign for other brands.
  • Comment to like ratio – The ratio of users engaging with the content (comments) compared to those we know saw it (likes). This is important since likes indicate someone likes a picture, but aren’t a useful metric for predicting whether a follower will buy. Comments = investment in the influencer.
  • Likes per 10,000 followers – A proxy for engagement (what % of followers are seeing and engaging with the post)
  • Comments per 10,000 followers – A proxy for engagement (what % of followers are seeing and engaging with the post)
  • Number of unique commenters in last 90 days – Repeat commenters indicate high engagement (or comment pods/fake accounts), while many new commenters indicate high growth or inconsistent exposure to their audience.
  • Average number of commenters per post
  • New follower to repeat commenter ratio – Use the two metrics above to determine whether most comments are from repeat commenters (indicating a highly engaged community, or comment pods/fake accounts), or many one-time commenters (indicating that the audience either has low engagement, inconsistency in who is seeing the posts, or many new followers due to high growth).
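
Most of the engineered features above are simple arithmetic once the post history is collected. A sketch with made-up numbers follows; the function names are mine, and the one-time versus repeat split is just one possible reading of the commenter ratio.

```python
from collections import Counter


def per_10k_followers(average_per_post, follower_count):
    """Scale a per-post average to a rate per 10,000 followers."""
    return average_per_post * 10_000 / follower_count


def comment_to_like_ratio(average_comments, average_likes):
    """Comments relative to likes on an average post."""
    return average_comments / average_likes


def commenter_stats(commenters_by_post):
    """Summarise who comments, given one list of commenter handles per post."""
    posts_per_commenter = Counter()
    for commenters in commenters_by_post:
        posts_per_commenter.update(set(commenters))  # count once per post
    unique = len(posts_per_commenter)
    repeat = sum(1 for n in posts_per_commenter.values() if n > 1)
    one_time = unique - repeat
    total_per_post = sum(len(set(c)) for c in commenters_by_post)
    return {
        "unique_commenters": unique,
        "avg_commenters_per_post": total_per_post / len(commenters_by_post),
        "one_time_to_repeat_ratio": one_time / repeat if repeat else float("inf"),
    }


# Made-up numbers for an account with 50k followers
likes_rate = per_10k_followers(1_500, 50_000)
comments_rate = per_10k_followers(60, 50_000)
ratio = comment_to_like_ratio(60, 1_500)
stats = commenter_stats([["ana", "bea", "cat"], ["ana", "dee"], ["ana", "bea", "eve"]])
print(likes_rate, comments_rate, ratio, stats)
```

A low one-time to repeat ratio is ambiguous on its own, since it fits both a loyal community and a comment pod, so it would need to be read alongside the fake account and influencer comment features from the posts table.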

Features requiring additional ML pipelines

  • Genre of content (eg fashion, beauty, fitness – use NLP on hashtags and image classification to detect style of images).
  • Total sponsored posts over last 90 days – sum of posts classified as sponsored by an additional ML pipeline.
  • Is influencer? (True/False). Determine whether this account is an influencer and should be crawled or shown as possible influencer.
  • Is brand? (True/False). Determine whether this account is another online brand and posts mentioning them (and their follower count history) should be tracked.
  • Is follower? Determines whether this account is a regular Instagram user.
  • Age (Scrunch uses facial recognition on posts to predict age, but you could also use NLP on profile/comments).
  • Demographic data of followers, particularly country. This can help a lot in making sure the influencer’s demographic matches that of the brand, especially when it comes to reducing shipping charges and time which can boost the sales made from a campaign’s post. The tools already available are likely doing this by exploring a sample of each influencer’s followers, looking at the location data on their posts, and building a distribution of followers across major countries. At prediction time, this is also useful in determining the buying power of the audience (US followers are more likely to purchase than those from India), the likelihood of the product/brand being attractive to those followers, and whether shipping will be cheap and fast, all boosting conversion rates.

Table 2: Post metrics

This table holds data on each post from influencer and brand accounts (one row per post)

Basic features available through API/scraping

  • Account name
  • Date and time of post
  • Location
  • Number of comments on post
  • Number of likes on post

Engineered features (usually requiring additional ML pipeline)

  • Is sponsored post? (Yes/No) – We need to detect whether a post is a sponsored post or not, so we’ll need a classifier to go through an account’s posts and classify each one. The main features useful in detection are whether a post mentions a brand account, the hashtags used (#ad almost certainly means it is a sponsored post), and whether the post mentions discounts, ‘buy’, or other words signaling promotion.
  • Number of mentions per comment (comments tagging their friends may or may not be more valuable than comments about the post itself).
  • Average comment length (longer comments = higher engagement)
  • Average comment sentiment (positive comments = better audience engagement and more likely to convert to sales)
  • Portion of comments from fake accounts/automatic comments – Many influencers have bought fake followers or have fake accounts following them, but we can detect these accounts by looking at the content of the comments (many post the same comments like “Nice shot!”) and their following/followers ratio etc.
  • Portion of comments from other influencers – Since the Instagram algorithm boosts posts that receive high levels of engagement early, many influencers (especially those with agencies managing their accounts) use comment pods to get likes and comments soon after posting, thus boosting their posts’ engagement and leading to more penetration in their followers’ feeds. Brands (and our models) should ignore these comments when assessing audience engagement.
  • There are also a few opportunities for using NLP on these comments – when followers ask questions (‘?’) on the post or about the product that’s a great sign that they’re engaging with the content (and considering whether it’s right for them), when they mention words like ‘buy’ or ‘bank account’ or ‘afterpay’ they are showing purchasing intent, and when they mention words like ‘reviewed’ or ‘I tried’ or ‘I bought’ it shows they may have already tried the product (and that the product/content may also be a good match for the rest of the audience).
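
Before training any classifiers, plain keyword rules could give a first pass at several of these signals. This sketch is a rule-based stand-in, not the trained classifier described above; the extra hashtag and promo words beyond those named in the list are my own guesses, and the handles are hypothetical.

```python
import re

PROMO_HASHTAGS = {"#ad", "#sponsored"}      # "#sponsored" is an assumed addition
PROMO_WORDS = {"buy", "discount", "code"}   # "code" is an assumed addition
INTENT_PHRASES = ("buy", "bank account", "afterpay")
TRIED_PHRASES = ("reviewed", "i tried", "i bought")
MENTION_PATTERN = r"@[\w.]+"


def find_mentions(text):
    """Return the lowercased @handles in a piece of text."""
    return {m.rstrip(".") for m in re.findall(MENTION_PATTERN, text.lower())}


def sponsored_signals(caption, brand_handles):
    """Rule-based signals that a sponsored-post classifier could use as inputs."""
    text = caption.lower()
    hashtags = set(re.findall(r"#\w+", text))
    words = set(re.findall(r"[a-z']+", text))
    brands = {h.lower() for h in brand_handles}
    return {
        "mentions_brand": bool(find_mentions(caption) & brands),
        "has_promo_hashtag": bool(hashtags & PROMO_HASHTAGS),
        "has_promo_word": bool(words & PROMO_WORDS),
    }


def comment_signals(comment):
    """Crude substring checks for engagement and purchase intent."""
    text = comment.lower()
    return {
        "asks_question": "?" in text,
        "purchase_intent": any(p in text for p in INTENT_PHRASES),
        "already_tried": any(p in text for p in TRIED_PHRASES),
        "friend_mentions": len(find_mentions(comment)),
        "length": len(comment),
    }


caption = "Loving my new mask from @examplebrand! Use code GLOW for a discount #ad"
post_flags = sponsored_signals(caption, ["@examplebrand"])
comment_flags = comment_signals("Does this work for dry skin? Might afterpay it @bestie")
print(post_flags, comment_flags)
```

Substring matching like this will misfire (it can’t tell ‘I’d never buy this’ from real intent), so I’d treat the outputs as features for a model or as a way to pre-label a training set, not as final answers.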

Table 3: Historical follower count

This table holds daily follower counts for each account in our database (one row per observation)

Basic features available through API/scraping

  • Account name
  • Date of observation
  • Follower count
  • 1 day shift (today’s observation minus yesterday’s)
  • 2 day shift (today’s observation minus 2 days ago)
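
The two shift columns can be derived from the raw daily counts. A small sketch with made-up counts, where a missing earlier observation leaves the shift empty:

```python
from datetime import date, timedelta


def add_shifts(observations):
    """Build table rows from a dict mapping date -> follower count (one account)."""
    rows = []
    for day in sorted(observations):
        count = observations[day]
        prev1 = observations.get(day - timedelta(days=1))
        prev2 = observations.get(day - timedelta(days=2))
        rows.append({
            "date": day,
            "follower_count": count,
            # None when we have no observation for the earlier day
            "shift_1d": None if prev1 is None else count - prev1,
            "shift_2d": None if prev2 is None else count - prev2,
        })
    return rows


# Made-up daily counts for one brand account
observations = {
    date(2024, 1, 1): 10_000,
    date(2024, 1, 2): 10_050,
    date(2024, 1, 3): 10_400,
    date(2024, 1, 4): 10_450,
}
rows = add_shifts(observations)
for row in rows:
    print(row)
```

Looking up the previous day by date, instead of taking the previous row, means a skipped crawl shows up as a gap and not as an inflated one-day jump.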

How the tools work

So, we’ve defined our features, now for the moving parts. The first thing we need to do is build the pipeline for crawling and collecting data in the first place. To do this, we can define a list of known influencer and brand accounts by scraping other analytics tools (naughty), or we can crawl Instagram accounts and create a shortlist of accounts with more than 10,000 followers, classifying them by hand in order to build our training set for these classifiers.

Once we have these classifiers, then we can populate our account table and start crawling the posts to populate our other tables. Many of the additional ML pipelines (eg comment sentiment, fake account comments, age etc) will require their own individual training sets, some of which we can find elsewhere and some which will need to be created by humans.

While we’re building our database of accounts (we will hit API rate limits fairly quickly, so this will take a couple weeks to populate), we will also have a daily job running to record the daily follower count of each account, populating our follower count table. This will run at a time where we can maximise the length of time between sponsored posts and the next update, but we’ll use 6am in the influencer’s timezone to start (usually 50% of total impressions will happen in the first 6-8 hours).
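
Scheduling the snapshot for 6am in each influencer’s timezone comes down to a bit of datetime arithmetic. This sketch uses a fixed UTC offset to stay self-contained; a real job would more likely pass a `zoneinfo.ZoneInfo` so that daylight saving is handled.

```python
from datetime import datetime, time, timedelta, timezone


def next_snapshot_utc(now_utc, tz, local_hour=6):
    """Return the next moment it is local_hour:00 in tz, as a UTC datetime."""
    now_local = now_utc.astimezone(tz)
    candidate = datetime.combine(now_local.date(), time(local_hour), tzinfo=tz)
    if candidate <= now_local:
        # Today's slot has already passed locally, so use tomorrow's
        tomorrow = now_local.date() + timedelta(days=1)
        candidate = datetime.combine(tomorrow, time(local_hour), tzinfo=tz)
    return candidate.astimezone(timezone.utc)


# Assumed example: an influencer ten hours ahead of UTC
influencer_tz = timezone(timedelta(hours=10))
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
run_at = next_snapshot_utc(now, influencer_tz)
print(run_at.isoformat())
```

The daily job could then group accounts by their next run time and process each batch when it falls due.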

Once we have enough brand and influencer accounts crawled and analysed (eg 1,000 or more of each) and sufficient historical data of follower counts (at least 2-4 weeks), we can start training our models. Once we’ve trained and validated these models and are happy with our input features and accuracy, we’re ready to serve our predictions via our front-end tools.

Over time, once we have a diverse group of users who have shared their Google Analytics data, we can build out additional models which predict visitors and sales resulting from sponsored posts.

Interacting with the tools


Tool 1: Analyse performance of previous influencer campaigns

Our first tool is aimed at giving the user some insight into how their previous influencer campaigns have actually performed (and at giving us feedback on how well our model has classified which posts were truly sponsored, to feed back into the classifier). To do this, the user would input their brand’s Instagram account name, we’d then analyse the account (if we haven’t already), and give them an outcome report for all posts (or up to 3 without payment) mentioning their brand within the last 90 days.

This gives the user insight on previous posts, but more importantly, if we can encourage the user to share their Google Analytics or Facebook Analytics data with us, we can collect data about the traffic and sales that resulted from each post as well. In return, the user gets a true reflection of the return on investment from sponsored posts, on top of the number of followers gained. The next step to this would be to ask the user for an estimate of the cost of each of the posts so that we can give an estimate of return on ad spend (ROAS). If we are an influencer booking service (such as Hypetap or Tribe), we could automatically access this data from previous campaigns run on the platform, making it even easier. Then, once we have enough of this data, the prediction tool is capable not only of predicting followers gained from influencers (without knowing how much those influencers will cost and therefore the ROI), but also predicting visitors and sales revenue and recommending influencers based on which will lead to the best ROAS – the holy grail of an influencer research tool.
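
ROAS itself is just attributed revenue divided by what the post cost. A sketch of the report with invented revenue figures and hypothetical account names:

```python
def roas(revenue, cost):
    """Return on ad spend: attributed revenue divided by the cost of the post."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return revenue / cost


# Made-up campaign results; revenue would come from the analytics integration
posts = [
    {"influencer": "big_bikini_account", "cost": 1000.0, "revenue": 800.0},
    {"influencer": "small_mummy_blogger", "cost": 150.0, "revenue": 600.0},
]

report = sorted(
    ({**p, "roas": roas(p["revenue"], p["cost"])} for p in posts),
    key=lambda p: p["roas"],
    reverse=True,
)
for line in report:
    print(line["influencer"], line["roas"])
```

Anything with a ROAS below 1 lost money on the post alone, although that ignores repeat purchases and the followers gained.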

Tool 2: Followers, visitors & sales prediction tool

Now that our brand owner has assessed the performance of previous campaigns (if they’ve run any), they want to find new influencers and predict the number of followers/visitors/sales they will gain.

The core concept here is to

  1. Return a list of influencers that will work well for the user’s individual brand (not just their performance overall), then…
  2. Predict the outcome of a post with that influencer (with some confidence intervals for error).

For each influencer in our database, the model will take their overall metrics and the previous outcomes for each of the brands the influencer has worked with, and then give a prediction of outcome for the user’s brand – with previous performance from similar brands weighted more heavily than dissimilar ones. Brand similarity will be calculated using a clustering algorithm, and we can also warn the user if their brand is beyond a threshold of dissimilarity compared to the previous brands that the influencer has worked with, as well as the influencer’s audience itself.
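
To illustrate the weighting idea only: the sketch below swaps the clustering step for plain cosine similarity between hypothetical brand feature vectors, and predicts with a similarity-weighted average of past outcomes. The vectors, the 0.5 warning threshold and the numbers are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_followers_gained(user_brand, past_campaigns, min_similarity=0.5):
    """Similarity-weighted average of one influencer's past outcomes.

    past_campaigns: list of (brand_vector, followers_gained) pairs.
    Assumes non-negative feature vectors, so the weights are never negative.
    Returns (prediction, warn); warn is True when no past brand is similar
    enough to the user's brand to trust the estimate.
    """
    sims = [cosine_similarity(user_brand, vec) for vec, _ in past_campaigns]
    total = sum(sims)
    if total == 0:
        return None, True
    weighted = sum(s * gained for s, (_, gained) in zip(sims, past_campaigns))
    return weighted / total, max(sims) < min_similarity


# Hypothetical genre weights: [beauty, swimwear, fitness]
skincare_brand = [1.0, 0.0, 0.0]
past = [
    ([1.0, 0.0, 0.0], 300),  # another beauty brand gained 300 followers
    ([0.0, 1.0, 0.0], 900),  # a swimwear brand gained 900 followers
]
prediction, warn = predict_followers_gained(skincare_brand, past)
print(prediction, warn)
```

Here the swimwear result is ignored entirely because it has zero similarity to the skincare brand, which is the bikini model versus face mask problem from earlier. A fuller version would also report the spread of the weighted outcomes as a rough error range.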

Once the core tool of follower prediction has been built and is working, there are a few clear paths to make this more valuable and offer value-add services for the user. The first is obviously giving predictions of website visitors and sales as mentioned earlier, which would require the user to share their GA/FA data with us, which we can then use to refine our models for all users. This gives the additional outcome metrics of ‘expected website visitors’, ‘expected sales’, and most importantly ‘expected ROAS’.

Then, once we’ve identified the influencers most likely to result in successful posts, we can offer an influencer management/booking service allowing them to easily send the influencer a booking request, or offer a service where we manage and book the influencer on behalf of the user. Most management platforms have a managed service model as well (eg Hypetap), so integration with a platform like theirs or an influencer booking agency could make this simple. We’d take a cut of the booking fee and gain access to the price paid in return for the referral.

Wrapping it up

So, that’s my wall of text on how I’d build this tool; I’m surprised you made it here. After speaking with the founders/CEOs of a few of the tools on the market, it does appear that none of them are at the point of creating a tool like this for their users, which is a bit disappointing. Since closing Trefiel down I’ve moved on from working with influencers (thank goodness), but I would have happily paid thousands of dollars a month, or a percentage of campaign spend, for a tool like this. So if you work on an influencer or social media analytics platform, please build this!