Machine Learning in GeoMarketing to Predict the Best Branch Locations for a Business, Part 1: Low Complexity

José Luis Domínguez
Oct 9, 2022

Unsupervised learning + GeoMarketing

Background

The pace and dynamism of modern life are opening investment opportunities for many economic sectors, mainly private companies. These opportunities arise from clear needs that citizens have across social contexts shaped by work rhythms, the acquisition of goods and services, food security, transportation and economic development.

With the exponential growth of technology and the spread of mid- to high-end devices, understanding users' usage patterns and habits, inside or outside their homes, is of vital importance for companies, mainly so that they can strategically satisfy needs that become endless opportunities.

That is why it is not enough to understand how users behave on a web page, television, app or streaming platform. It is also necessary to understand where and when that use takes place, that is, which geographic space generates the need to be met.

It has been reported that around 80% of all data can be georeferenced. Because of their complexity, diversity and volume, these data require powerful computational methods.

Given this context, any company or investor commonly asks the following question: which location will allow the greatest number of people from the target audience to reach us and get to know us?

It seems like a simple question, but it is a real challenge for data scientists, because the problem is influenced by the project's objectives and by social and economic dynamics, together with a combination of factors and circumstances that occur at a given time in geographic space.

In this research we will demonstrate how combining machine learning and statistical techniques allows us to develop an artificial intelligence system that determines the best locations for a commercial establishment.

We will develop a simple approach that serves as the basis for more complex analysis. In later investigations we will address the same need with more advanced data science strategies.

Description of the problem

To exemplify a data science problem related to location and investment, we will use machine learning to determine the places where the target audience is most likely to be concentrated, taking into account a high rate of competitors (surrounding businesses), so that business establishments can be placed strategically.

How does Artificial Intelligence participate as a solution?

Unsupervised learning, together with statistical techniques and data mining, does not only study relationships or causality. It also looks for unknown but significant patterns in the data, which supports decision-making and helps prevent investment errors, key information in any project.

In location data analysis, the central area of interest is geographic location. The methodological question, however, is how to address this unique attribute of the data. The separation between observations can be measured by distance using various algorithms, but proximity alone does not solve the problem (Figure 1).

Figure 1. Measurement of distances between pairs of coordinates.
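
As a minimal sketch of how the distance between pairs of coordinates can be measured, the example below uses the haversine (great-circle) formula. This is one common choice and not necessarily the algorithm used in this project. The function names and the Earth radius constant are assumptions made for illustration.

```python
import math

# Assumed constant: mean Earth radius in kilometres.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(points):
    """Pairwise distances (km) for a list of (lat, lon) tuples."""
    return [[haversine_km(*p, *q) for q in points] for p in points]
```

Calling distance_matrix on a list of (lat, lon) tuples returns a symmetric matrix with zeros on the diagonal, which is the usual starting point for proximity-based analysis.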

Geolocated points are characterized by their projected longitude and latitude coordinates (x, y). Lines are sequences of connected geographic points whose ends do not meet. Polygons, in contrast, are closed polylines: sets of connected geographic points that enclose an area (Figure 2).

Figure 2. Territorial elements in vector plane.
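
The three vector elements can be sketched with plain coordinate lists. The example below is an illustration under stated assumptions: coordinates are projected (planar) values, a polygon is a list of vertices closed implicitly, and the helper names are hypothetical. A real project would more likely rely on a GIS library.

```python
import math


def line_length(vertices):
    """Length of a polyline given as a list of (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def polygon_area(vertices):
    """Area of a simple polygon (shoelace formula); the ring closes implicitly."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


def point_in_polygon(point, vertices):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

With these helpers, a shop is a point, an avenue is a polyline with a length, and a block is a polygon with an area that can contain points.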

For this reason, a location problem cannot be solved with geolocated points alone, since territories also contain linear and polygon-shaped features.

This is where artificial intelligence systems allow us to apply different algorithms in several stages, in which coordinate points are not treated as isolated entities of analysis. Instead, we consider how they behave with respect to the geometric and geographic shapes that represent each element in the territory.

What problem should AI solve?

According to national statistics, 33% of businesses fail in the first year and 65% within the first 5 years. These figures are alarming, and the causes can be summed up as poor planning and poor execution, with planning being the most relevant factor in success or failure.

As a key factor, planning is more important than the product being offered. Some of the top marketing-related reasons for failure, with their market impact ratios, are the following:

Inadequate point of sale — 41%
Weak market research — 41%
Inadequate promotion — 38%
Poorly selected target market — 33%

Given the above, we can say that the potential market and the location of the point of sale were inadequately planned in 4 out of 10 businesses. For this reason, evaluating strategic areas before opening a business is of great importance.

That is why we want to test whether an artificial intelligence system can strategically define location areas for opening new branches, since identifying the best location will increase a new branch's chances of long-term success.

Data Scientist Goals

Develop an artificial intelligence system that finds spatial relationships from which to define evaluation criteria for a site to open a branch, based on predictive GeoMarketing models.

  1. Population size and definition of geographic location. The first matters because we are interested in having a larger population to serve with our product or service; the second lets us define our study area (city, state, population, locality).
  2. Population size by geographic location. The total volume of people is not the most important thing. What matters is how this volume is distributed in a defined geographic space, and how this population could be induced to behave when offered a product or service.
  3. Targeting of competitors. Although we will not be the first, it is important to see how target users are influenced by other service providers and, based on this, to reduce losses and maximize profits.
  4. Deduction of opportunity areas. Finally, we need confidence that the selected area will support a profitable supply and demand chain. For this, it is important to differentiate between metrics that validate machine learning models and metrics that establish whether an inference or pattern is true, so we must validate both that our model is accurate and that the inferences are statistically valid.

Answer the question: does machine learning provide a competitive improvement to GeoMarketing when defining areas for opening new businesses?

Type of research and development

Quantitative research. We consider the vectorization of the spatial entities that represent an area of commercial opportunity (blocks, avenues, shops and houses) (Figure 3).

Figure 3. Vectorization of territorial elements.

We integrate a zonal analysis methodology (Figure 4) that evaluates how the population is influenced by the geographic and economic elements located in their residential area (Figure 5).

Figure 4. Zonal statistics of polygons regarding their relationship with other geometric entities.
Figure 5. How zonal statistics influence each element of interest (polygons represent residential houses, that is, the people who live there).
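
A zonal analysis of this kind can be sketched as counting, for each house, the surrounding elements of each type within a fixed radius. The example below is a simplified illustration and not the exact procedure used here. It assumes that houses are reduced to a representative (x, y) point in projected coordinates, that the zone is a circular buffer, and that the function and category names are hypothetical.

```python
import math


def zonal_counts(houses, features, radius):
    """For each house (x, y), count features of each category within `radius`.

    `features` is a list of (category, (x, y)) tuples, for example shops
    or avenue access points, in the same projected units as the houses.
    """
    results = []
    for house in houses:
        counts = {}
        for category, location in features:
            if math.dist(house, location) <= radius:
                counts[category] = counts.get(category, 0) + 1
        results.append(counts)
    return results
```

The resulting per-house counts (for instance, number of competitors nearby) are the kind of variable on which spatial statistics can then be computed.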

With the results of the zonal analysis and spatial relationships, we calculate the Getis-Ord Gi* statistic, which yields a z-score and a p-value for each entity and indicates where entities with high or low values are spatially clustered (Figure 6).

Figure 6. Getis-Ord Gi* indicator.
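
To make the statistic concrete, here is a minimal sketch of the standard Gi* z-score formula in plain Python. It assumes binary distance-band weights (a neighbor is any entity within `bandwidth`, including the entity itself) and a normal approximation for the p-value. Both are assumptions for illustration, as is every name in the code. A production workflow would typically use a spatial statistics library instead.

```python
import math


def getis_ord_gi_star(coords, values, bandwidth):
    """Gi* z-score for each entity, using binary distance-band weights."""
    n = len(values)
    mean = sum(values) / n
    variance = sum(v * v for v in values) / n - mean ** 2
    s = math.sqrt(max(variance, 0.0))
    if s == 0:
        raise ValueError("all values are identical; Gi* is undefined")
    z_scores = []
    for ci in coords:
        # Weight 1 for every entity within the band, the entity itself included.
        w = [1.0 if math.dist(ci, cj) <= bandwidth else 0.0 for cj in coords]
        sum_w = sum(w)
        sum_w2 = sum(wi * wi for wi in w)
        numerator = sum(wi * x for wi, x in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # If every entity is a neighbor, the denominator is 0: report no signal.
        z_scores.append(numerator / denominator if denominator > 0 else 0.0)
    return z_scores


def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))
```

Entities in the middle of a group of high values receive large positive z-scores, while entities in the middle of a group of low values receive large negative ones.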

A feature with a high value is interesting, but it may not be a statistically significant hot spot. To be statistically significant, an entity must have a high value and also be surrounded by other entities with high values. For this reason, after calculating the Getis-Ord Gi*, we perform a polygon-based clustering to establish the best GeoMarketing zones for a business of interest (Figure 7).

Figure 7. Statistical reliability of the results.
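
The clustering step can be approximated in several ways. As one possible sketch, the example below groups significant entities into zones by connecting those that lie within a given distance of each other (connected components). This is an assumed, simplified stand-in for the polygon-based clustering, and the function name, the representative-point input and the 1.96 default threshold are illustrative choices.

```python
import math
from collections import deque


def cluster_hot_spots(coords, z_scores, bandwidth, z_threshold=1.96):
    """Group significant entities (z >= z_threshold) into contiguous zones.

    Two significant entities belong to the same zone when a chain of
    significant entities, each within `bandwidth` of the next, links them.
    Returns a list of zones, each a sorted list of entity indices.
    """
    hot = [i for i, z in enumerate(z_scores) if z >= z_threshold]
    unvisited = set(hot)
    clusters = []
    for start in hot:
        if start not in unvisited:
            continue
        unvisited.discard(start)
        queue = deque([start])
        cluster = []
        while queue:
            i = queue.popleft()
            cluster.append(i)
            near = [j for j in unvisited
                    if math.dist(coords[i], coords[j]) <= bandwidth]
            for j in near:
                unvisited.discard(j)
                queue.append(j)
        clusters.append(sorted(cluster))
    return clusters
```

Each returned zone is a candidate GeoMarketing area: a set of neighboring entities that are all individually significant.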

This allows us to identify areas with high positive values, which indicate a possible local cluster of high values of the analyzed variable (profitable areas for establishing a business), and areas with very low values, which indicate a similar cluster of low values (unprofitable areas for setting up a business) (Figure 8).

Figure 8. Predictive model of areas with greater probability of success to establish businesses.
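
Reading the resulting z-scores can be sketched as a simple classification. The example below assumes the conventional two-sided normal cutoffs of 1.65, 1.96 and 2.58 for 90%, 95% and 99% confidence. The labels and the function name are hypothetical.

```python
def classify_gi_star(z):
    """Label a Gi* z-score as hot spot, cold spot or not significant."""
    # Conventional two-sided normal cutoffs and their confidence levels.
    thresholds = [(2.58, 99), (1.96, 95), (1.65, 90)]
    for cutoff, confidence in thresholds:
        if abs(z) >= cutoff:
            kind = "hot spot" if z > 0 else "cold spot"
            return f"{kind} ({confidence}% confidence)"
    return "not significant"
```

Under this reading, hot spots mark the candidate profitable areas, cold spots the areas to avoid, and everything else shows no statistically significant spatial pattern.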

Workflow Development

Results

José Luis Domínguez

Data scientist who develops sustainable reliability in the processes driven by the development of artificial intelligence in future society.