Hypothesis Testing

GiStarHotSpotAnalysis

The Getis-Ord Gi* statistic is used to identify clusters of high or low values in spatial data. More specifically, it is used to identify hot spots or cold spots by determining whether high or low values are concentrated in specific areas. The GiStarHotSpotAnalysis class provides GeoJikuu’s implementation of the Getis-Ord Gi* hot spot analysis.

The underlying algorithm calculates a Getis-Ord Gi* statistic for each point in the dataset. Positive Gi* values indicate that a given point belongs to a cluster of high values (i.e., hot spots), while negative Gi* values indicate that a point is part of a cluster of low values (i.e., cold spots). The strength of this technique lies in the fact that the Gi* statistic is a Z-score, meaning it can be used to calculate a measure of statistical significance. Essentially, the Getis-Ord Gi* technique is a hypothesis test with the null hypothesis that any observed clusters are due to spatial randomness.
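
Because each Gi* value is a Z-score, it can be converted to a two-sided p-value using the standard normal distribution. The following is a minimal sketch using only Python's standard library. The function name two_sided_p_value is hypothetical and not part of GeoJikuu, and GeoJikuu's own p-value calculation may differ in detail.

```python
import math


def two_sided_p_value(z_score: float) -> float:
    """Two-sided p-value of a Z-score under the standard normal distribution.

    Hypothetical helper for illustration; not part of GeoJikuu.
    """
    # For a standard normal variable Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z_score) / math.sqrt(2))


alpha = 0.05
for z in (-2.5, 0.5, 3.0):
    p = two_sided_p_value(z)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"z = {z:+.2f}, p = {p:.4f}: {verdict}")
```

Using the absolute value of the Z-score means hot spots (positive values) and cold spots (negative values) are tested in the same way.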

For each point in the dataset, the Gi* statistic is calculated from the point's own value, the values of its neighbouring points, and the distances between the point and those neighbours. A critical distance must be defined before conducting a Getis-Ord Gi* hot spot analysis to establish what constitutes a neighbouring point. This critical distance determines the spatial relationships and therefore influences which spatial patterns are identified as significant.
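
The calculation described above can be sketched with NumPy. This is an illustration of the standard Gi* formula under a simplifying assumption: binary weights, where every point within the critical distance (including the point itself) has weight 1 and every other point has weight 0. The function name gi_star_binary is hypothetical, and GeoJikuu's own implementation (for example, its weighting scheme and edge-case handling) may differ.

```python
import numpy as np


def gi_star_binary(coords, values, critical_distance):
    """Gi* Z-scores with binary weights (1 inside the critical distance, else 0).

    Hypothetical helper for illustration; not part of GeoJikuu.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)

    # Pairwise Euclidean distances between all points.
    diffs = coords[:, None, :] - coords[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))

    # Gi* counts a point as part of its own neighbourhood (distance 0).
    weights = (distances <= critical_distance).astype(float)

    # Global mean and (population) standard deviation of the values.
    mean = values.mean()
    std = np.sqrt((values ** 2).mean() - mean ** 2)

    weight_sums = weights.sum(axis=1)
    weight_sq_sums = (weights ** 2).sum(axis=1)

    # Weighted local sum minus its expectation, scaled by its standard error.
    numerator = weights @ values - mean * weight_sums
    denominator = std * np.sqrt((n * weight_sq_sums - weight_sums ** 2) / (n - 1))
    return numerator / denominator


# Two well-separated groups: high values on the left, low values on the right.
coords = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
values = [10, 9, 10, 1, 2, 1]
z_scores = gi_star_binary(coords, values, critical_distance=5)
print(z_scores)
```

In this toy dataset the first three points receive positive Gi* values and the last three receive negative values of the same magnitude, reflecting a cluster of high values and a cluster of low values.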

Importing the Class

The GiStarHotSpotAnalysis class is located in GeoJikuu's hypothesis_testing.hot_spot_analysis module:

from geojikuu.hypothesis_testing.hot_spot_analysis import GiStarHotSpotAnalysis

Coordinate Projection

GeoJikuu's hypothesis testing classes assume that any input coordinates have already been projected to a linear coordinate system. In addition, they require the input distance to be in the same unit as the chosen projection system, so the projection system's unit conversion will be needed as well. For example:

import pandas as pd
from geojikuu.preprocessing.projection import CartesianProjector

cartesian_projector = CartesianProjector("wgs84")

data = {
    "lat": [34.6870676, 34.696109, 34.6525807, 35.7146509, 35.6653623, 35.6856905, 
            34.6608222, 34.6526051, 34.6489583, 34.383289, 34.2561102, 35.1445146,
            33.5597115, 33.5716997, 33.5244701, 33.5153417, 33.5206116, 33.4866878,
            33.5418927, 33.5387101, 33.522766, 33.5106768, 33.3832707, 33.3743253],
    "lon": [135.5237618, 135.5121774, 135.5059984, 139.7963897, 139.7254906, 139.7514867,
            135.5337285, 135.5309408, 135.5178779, 134.9062365, 134.5603363, 136.7803753,
            130.3818748, 130.4030704, 130.4063441, 130.4373212, 130.4841434, 130.5220605,
            130.4081343, 130.4112184, 130.3699527, 130.2558314, 130.2570976, 130.2118296],
    "value": [2, 3, 1, 4, 4, 2, 1, 1, 2, 4, 4, 4, 5, 6, 5, 7, 8, 8, 10, 9, 8, 8, 7, 7]
}

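df = pd.DataFrame.from_dict(data)

# Project the (lat, lon) pairs; the projector also returns the factor used to
# convert between kilometres and the projection system's unit.
results = cartesian_projector.project(list(zip(df["lat"], df["lon"])))
df["cartesian"] = results["cartesian_coordinates"]
unit_conversion = results["unit_conversion"]
df.head()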

	lat	lon	value	cartesian
0	34.687068	135.523762	2	(-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)
1	34.696109	135.512177	3	(-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)
2	34.652581	135.505998	1	(-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)
3	35.714651	139.796390	4	(-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)
4	35.665362	139.725491	4	(-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)

For more information, see: Projection Classes
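
The projected coordinates in the table above have approximately unit length, which is consistent with placing each point on a unit sphere. As an illustration only, the following sketch shows that kind of conversion using the standard library. The helper to_unit_sphere is hypothetical and assumes a simple spherical model; CartesianProjector's actual implementation may differ.

```python
import math


def to_unit_sphere(lat_deg: float, lon_deg: float) -> tuple:
    """Convert latitude/longitude in degrees to (x, y, z) on a unit sphere.

    Hypothetical helper for illustration; assumes a spherical Earth model.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)


# First point of the example dataset.
x, y, z = to_unit_sphere(34.6870676, 135.5237618)
print(x, y, z)
```

Working in such a coordinate system lets straight-line distances between points be compared directly, which is why the critical distance must be expressed in the same unit.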

Creating a GiStarHotSpotAnalysis Object

A GiStarHotSpotAnalysis object is created by passing in a DataFrame and the label of the column that contains the coordinates:

hsa = GiStarHotSpotAnalysis(df, "cartesian")

Running the Getis-Ord Gi* Analysis

Once the object has been created, the run() function can be used to perform the Getis-Ord Gi* analysis on the provided data. As input, the run() function takes the column name of the desired input field and the critical distance for which the analysis will be run. To convert the critical distance from kilometres to the projection system's unit, divide by the unit_conversion variable. Here is an example of running a Getis-Ord Gi* analysis on an input field called "value" with a critical distance of 10 kilometres:

results = hsa.run(input_field="value", critical_distance=(10/unit_conversion))

Output:
  
Getis-Ord Gi* Hot Spot Analysis Summary
---------------------------------------
Statistically Significant Features: 13
    Statistically Significant Hot Spots: 7
    Statistically Significant Cold Spots: 6
Non-Statistically Significant Features: 11
Total Features: 24

Null Hypothesis (H₀): The observed pattern of the variable 'value' in feature ⅈ is the result of spatial randomness alone.
Alpha Level (α): 0.05
Critical Distance: 0.0015690819432247728
Spatial Relationship Function: Inverse Distance

Verdict: Sufficient evidence to reject H₀ when α = 0.05 for features ⅈ ∈ {0, 1, 2, 6, 7, 8, 12, 13, 14, 15, 16, 19, 20}

The output shows that 13 of the 24 points were classed as belonging to statistically significant hot spot or cold spot clusters. Statistical significance indicates that these particular points exhibit clustering that is unlikely to be due to spatial randomness. In this case, 7 of these statistically significant points are hot spots and 6 are cold spots.

The other output can be interpreted as follows:

Null Hypothesis: Conventionally denoted as H₀, this is the assumption that underpins the test.
Alpha Level: If the test results in a p-value that is below the alpha level for a given point, then the null hypothesis can be rejected for that point.
Critical Distance: The distance, in the unit of the projection system, that was used to define a point's neighbours during analysis.
Spatial Relationship Function: The function used to determine the degree to which a point is related to each other point in its neighbourhood. In this case, Inverse Distance is used, which simply means that points are considered less related to each other the further apart they are.
Verdict: Shows the DataFrame IDs of the points for which the null hypothesis can be rejected (i.e., statistically significant points). Note that the list of IDs is truncated for large outputs.
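
Because the critical distance in the summary is reported in the projection system's unit, it can be useful to convert it back to kilometres when checking a run. The sketch below assumes only the relationship stated earlier (kilometres divided by unit_conversion gives projection units). Both helper functions are hypothetical, and the conversion factor is inferred from the summary above rather than taken from CartesianProjector; in practice, use the unit_conversion value returned by the projector.

```python
def km_to_projection_units(distance_km: float, unit_conversion: float) -> float:
    """Convert kilometres to projection units (hypothetical helper)."""
    return distance_km / unit_conversion


def projection_units_to_km(distance_units: float, unit_conversion: float) -> float:
    """Convert projection units back to kilometres (hypothetical helper)."""
    return distance_units * unit_conversion


# Factor implied by the summary above: a 10 km critical distance was reported
# as 0.0015690819432247728 projection units.
unit_conversion = 10 / 0.0015690819432247728

critical_distance = km_to_projection_units(10, unit_conversion)
print(critical_distance)
print(projection_units_to_km(critical_distance, unit_conversion))
```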

The run() function also returns a DataFrame of each point along with its projected coordinates, number of neighbours, Z-score, p-value, whether it is statistically significant, and whether it is classified as belonging to a hot spot or a cold spot:

results.head()

	cartesian	neighbours	z-score	p-value	significant	type
0	(-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)	6	-2.463902	0.013517	TRUE	COLD SPOT
1	(-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)	6	-2.534064	0.011069	TRUE	COLD SPOT
2	(-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)	6	-2.545621	0.010707	TRUE	COLD SPOT
3	(-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)	3	-1.166897	0.243119	FALSE	COLD SPOT
4	(-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)	3	-1.190887	0.233557	FALSE	COLD SPOT

Given that we are working with a Pandas DataFrame, we can easily filter our results to show only those that are statistically significant:

sig_results = results[results['significant'] == "TRUE"]
sig_results

	cartesian	neighbours	z-score	p-value	significant	type
0	(-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)	6	-2.463902	0.013517	TRUE	COLD SPOT
1	(-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)	6	-2.534064	0.011069	TRUE	COLD SPOT
2	(-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)	6	-2.545621	0.010707	TRUE	COLD SPOT
6	(-0.5870113922787347, 0.5761756200345676, 0.5687172234184252)	6	-2.654355	0.007776	TRUE	COLD SPOT
7	(-0.5870415575785646, 0.576261310921995, 0.568599253613823)	6	-2.633585	0.008273	TRUE	COLD SPOT
8	(-0.5869359798276301, 0.5764204930024471, 0.5685468941350358)	6	-2.947968	0.003104	TRUE	COLD SPOT
12	(-0.5398841209446936, 0.6347684310705833, 0.5528057297712873)	7	2.454200	0.013891	TRUE	HOT SPOT
13	(-0.5400439241296583, 0.6344805730945283, 0.5529800741222769)	8	2.319278	0.020112	TRUE	HOT SPOT
14	(-0.5403754616556247, 0.6347965979275287, 0.552293074101157)	8	3.078293	0.002012	TRUE	HOT SPOT
15	(-0.5407756528925909, 0.6345713137039247, 0.5521602494409276)	9	2.834468	0.004469	TRUE	HOT SPOT
16	(-0.5412610703924415, 0.6340905492968818, 0.5522369319141986)	7	2.489708	0.012566	TRUE	HOT SPOT
19	(-0.5403404627390118, 0.6346460904442915, 0.5525002481544843)	8	2.142694	0.031830	TRUE	HOT SPOT
20	(-0.5399828012524008, 0.6351522038234202, 0.5522682793080395)	7	1.978493	0.047546	TRUE	HOT SPOT
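
The same approach extends to other filters, such as isolating hot spots and ranking them by Z-score. The sketch below rebuilds a few rows of the results shown above so that it runs on its own (coordinates are omitted for brevity). Whether the significant column holds text or boolean values is an assumption here, so the flag is normalised before filtering.

```python
import pandas as pd

# A few rows copied from the results above; in practice, use the DataFrame
# returned by run().
results = pd.DataFrame(
    {
        "neighbours": [6, 3, 7, 9],
        "z-score": [-2.463902, -1.166897, 2.454200, 2.834468],
        "p-value": [0.013517, 0.243119, 0.013891, 0.004469],
        "significant": ["TRUE", "FALSE", "TRUE", "TRUE"],
        "type": ["COLD SPOT", "COLD SPOT", "HOT SPOT", "HOT SPOT"],
    },
    index=[0, 3, 12, 15],
)

# Normalise the flag so the filter works for both text and boolean columns.
is_significant = results["significant"].astype(str).str.upper() == "TRUE"

# Keep only statistically significant hot spots, strongest first.
hot_spots = results[is_significant & (results["type"] == "HOT SPOT")]
strongest_first = hot_spots.sort_values("z-score", ascending=False)
print(strongest_first)
```

Filtering on both columns matters because the type column is populated for every point, including those that are not statistically significant.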
