
Hypothesis Testing

STGiStarHotSpotAnalysis

The Getis-Ord Gi* statistic is used to identify clusters of high or low values in spatial data. More specifically, it is used to identify hot spots or cold spots by determining whether high or low values are concentrated in specific areas. The STGiStarHotSpotAnalysis class provides GeoJikuu’s implementation of the Getis-Ord Gi* hot spot analysis in a spacetime/spatiotemporal context.

The underlying algorithm calculates a Getis-Ord Gi* statistic for each point in the dataset. Positive Gi* values indicate that a given point belongs to a spatiotemporal cluster of high values (i.e., hot spots), while negative Gi* values indicate that a point is part of a spatiotemporal cluster of low values (i.e., cold spots). The strength of this technique lies in the fact that the Gi* statistic is a Z-score, meaning it can be used to calculate a measure of statistical significance. Essentially, this spatiotemporal variant of the Getis-Ord Gi* technique is a hypothesis test with the null hypothesis that any observed clusters are due to spatiotemporal randomness.

For each point in the dataset, the Gi* statistic is calculated by considering the point's value in relation to the spatiotemporal distances between it and its neighbouring points, as well as the values of those neighbouring points. A critical distance must be defined before conducting a Getis-Ord Gi* hot spot analysis to establish what constitutes a neighbouring point. This critical distance defines each point's spatiotemporal neighbourhood and therefore influences which patterns are identified as significant through time and space.
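To make the calculation concrete, here is a simplified standalone sketch of the Gi* z-score for a single point. This is an illustration only, not GeoJikuu's internal implementation: the function name, the binary weighting scheme, and the toy data are all invented for the example (GeoJikuu uses an inverse spacetime distance weighting, as discussed later).

```python
import math

def gi_star(points, values, index, critical_distance):
    """Gi* z-score for one point. points: (x, y, t) tuples; values: attribute values."""
    n = len(points)
    xi, yi, ti = points[index]
    # Binary spatiotemporal weights: 1 if within the critical distance
    # (the point itself is included, as Gi* requires), else 0.
    weights = []
    for (xj, yj, tj) in points:
        d = math.sqrt((xi - xj)**2 + (yi - yj)**2 + (ti - tj)**2)
        weights.append(1.0 if d <= critical_distance else 0.0)

    mean = sum(values) / n
    s = math.sqrt(sum(v**2 for v in values) / n - mean**2)
    w_sum = sum(weights)
    numerator = sum(w * v for w, v in zip(weights, values)) - mean * w_sum
    denominator = s * math.sqrt((n * sum(w**2 for w in weights) - w_sum**2) / (n - 1))
    return numerator / denominator  # a z-score

# Toy dataset: high values clustered near (0, 0, 0), low values near (10, 10, 5)
pts = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (10, 10, 5), (11, 10, 5), (10, 11, 6)]
vals = [9, 8, 9, 1, 2, 1]
print(gi_star(pts, vals, 0, 2.0))  # positive z-score: hot spot candidate
print(gi_star(pts, vals, 3, 2.0))  # negative z-score: cold spot candidate
```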

Importing the Class

The STGiStarHotSpotAnalysis class is located in GeoJikuu's hypothesis_testing.hot_spot_analysis module:

from geojikuu.hypothesis_testing.hot_spot_analysis import STGiStarHotSpotAnalysis

Coordinate Projection

GeoJikuu's hypothesis testing classes assume that any input coordinates have already been projected to a linear coordinate system. In addition, they require the input distance to be in the same unit as the chosen projection system, so the projection system's unit conversion will be needed as well. For example:

import pandas as pd

from geojikuu.preprocessing.projection import CartesianProjector

cartesian_projector = CartesianProjector("wgs84")

data = {
    "lat": [34.6870676, 34.696109, 34.6525807, 35.7146509, 35.6653623, 35.6856905, 
            34.6608222, 34.6526051, 34.6489583, 34.383289, 34.2561102, 35.1445146,
            33.5597115, 33.5716997, 33.5244701, 33.5153417, 33.5206116, 33.4866878,
            33.5418927, 33.5387101, 33.522766, 33.5106768, 33.3832707, 33.3743253],
    "lon": [135.5237618, 135.5121774, 135.5059984, 139.7963897, 139.7254906, 139.7514867,
            135.5337285, 135.5309408, 135.5178779, 134.9062365, 134.5603363, 136.7803753,
            130.3818748, 130.4030704, 130.4063441, 130.4373212, 130.4841434, 130.5220605,
            130.4081343, 130.4112184, 130.3699527, 130.2558314, 130.2570976, 130.2118296],
    "date": ["1/01/2000", "5/01/2000", "10/01/2000", "15/01/2000", "20/01/2000", "25/01/2000",
             "1/02/2000", "5/02/2000", "10/02/2000", "15/02/2000", "20/02/2000", "25/02/2000",
             "1/01/2001", "5/01/2001", "10/01/2001", "15/01/2001", "20/01/2001", "25/01/2001",
             "1/02/2001", "5/02/2001", "10/02/2001", "15/02/2001", "20/02/2001", "25/02/2001"],
    "value": [2, 3, 1, 4, 4, 2, 1, 1, 2, 4, 4, 4, 5, 6, 5, 7, 8, 8, 10, 9, 8, 8, 7, 7]
}

df = pd.DataFrame.from_dict(data)

results = cartesian_projector.project(list(zip(df["lat"], df["lon"])))
df["cartesian"] = results["cartesian_coordinates"]
unit_conversion = results["unit_conversion"]
df.head()
   lat        lon         date        value  cartesian
0  34.687068  135.523762  1/01/2000   2      (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)
1  34.696109  135.512177  5/01/2000   3      (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)
2  34.652581  135.505998  10/01/2000  1      (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)
3  35.714651  139.796390  15/01/2000  4      (-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)
4  35.665362  139.725491  20/01/2000  4      (-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)
For more information, see: Projection Classes

Convert Dates to Time Steps

STGiStarHotSpotAnalysis requires dates to be converted to time steps before calling the run() function. For example:

from geojikuu.preprocessing.conversion_tools import DateConvertor
date_convertor = DateConvertor("%d/%m/%Y", "%d/%m/%Y")

dates = df["date"].tolist()

time_steps = []
for date in dates:
    time_steps.append(date_convertor.date_to_days(date))

df['time_step'] = pd.Series(time_steps)
df.head()
   lat        lon         date        value  cartesian                                                       time_step
0  34.687068  135.523762  1/01/2000   2      (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)  730119
1  34.696109  135.512177  5/01/2000   3      (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)  730123
2  34.652581  135.505998  10/01/2000  1      (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)  730128
3  35.714651  139.796390  15/01/2000  4      (-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)  730133
4  35.665362  139.725491  20/01/2000  4      (-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)  730138
For more information, see: DateConvertor

Scale Variables

STGiStarHotSpotAnalysis's algorithm involves calculating spatiotemporal distances and multiplying them by the input variable. To prevent the time steps or the input variable from biasing these calculations, it is important to normalise them so that they are on a similar scale to the projected coordinates. For example:

from geojikuu.preprocessing.normalisation import MinMaxScaler

time_step_scaler = MinMaxScaler(df["time_step"].tolist())
df["scaled_time_step"] = pd.Series(time_step_scaler.scale(df["time_step"].tolist()))

value_scaler = MinMaxScaler(df["value"].tolist())
df["scaled_value"] = pd.Series(value_scaler.scale(df["value"].tolist()))

df.head()
   lat        lon         date        value  cartesian                                                       time_step  scaled_time_step  scaled_value
0  34.687068  135.523762  1/01/2000   2      (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662)  730119     0.000000          0.111111
1  34.696109  135.512177  5/01/2000   3      (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267)  730123     0.009501          0.222222
2  34.652581  135.505998  10/01/2000  1      (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122)  730128     0.021378          0.000000
3  35.714651  139.796390  15/01/2000  4      (-0.6201191587857982, 0.5241083019965597, 0.5837488472665938)  730133     0.033254          0.333333
4  35.665362  139.725491  20/01/2000  4      (-0.6198530442409449, 0.5251996831772221, 0.5830501662256676)  730138     0.045131          0.333333
For more information, see: MinMaxScaler
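For reference, min-max scaling maps each value into the range [0, 1] via x' = (x - min) / (max - min). A minimal sketch of that formula (assuming GeoJikuu's MinMaxScaler follows this standard definition; the function name here is invented for illustration):

```python
def min_max_scale(values):
    """Standard min-max scaling: maps min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# With the first four 'value' entries from the example data,
# 1 maps to 0.0 and 4 maps to 1.0:
print(min_max_scale([2, 3, 1, 4]))
```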

Creating an STGiStarHotSpotAnalysis Object

An STGiStarHotSpotAnalysis object is created by passing in a DataFrame, the label of the column containing the projected coordinates, and the label of the column containing the time steps:

sthsa = STGiStarHotSpotAnalysis(df, "cartesian", "time_step")

Running the Spacetime Getis-Ord Gi* Analysis

Once the object has been created, the run() function can be used to perform the spacetime Getis-Ord Gi* analysis on the provided data. As input, the run() function takes the column name of the desired input field and the critical distance for which the analysis will be run. To convert the critical distance from kilometres to the projection system's unit, divide by the unit_conversion variable. Here is an example of running a spacetime Getis-Ord Gi* analysis on an input field called "value" with a critical distance of 10 kilometres:

results = sthsa.run(input_field="value", critical_distance=10/unit_conversion)
Output:
  
Getis-Ord Gi* Hot Spot Analysis Summary
---------------------------------------
Statistically Significant Features: 4
    Statistically Significant Hot Spots: 2
    Statistically Significant Cold Spots: 2
Non-Statistically Significant Features: 20
Total Features: 24

Null Hypothesis (H₀): The observed pattern of the variable 'value' in feature ⅈ is the result of spatiotemporal randomness alone.
Alpha Level (α): 0.05
Critical Distance: 0.0015690499376772409
Spatial Relationship Function: Inverse Spacetime Distance

Verdict: Sufficient evidence to reject H₀ when α = 0.05 for features ⅈ ∈ {6, 7, 18, 19}

The output shows that 4 of the 24 points were classed as belonging to statistically significant hot spot or cold spot clusters. Statistical significance indicates that these particular points exhibit clustering that is unlikely to be due to spatiotemporal randomness. In this case, 2 of these statistically significant points are hot spots and 2 are cold spots.

The other output can be interpreted as follows:

  • Null Hypothesis: Conventionally denoted as H₀, this is the assumption that underpins the test.
  • Alpha Level: If the test results in a p-value that is below the alpha level for a given point, then the null hypothesis can be rejected for that point.
  • Critical Distance: The distance, in the unit of the projection system, that was used to define a point's neighbours during analysis.
  • Spatial Relationship Function: The function used to determine the degree to which a point is related to each other point in its neighbourhood. In this case, Inverse Spacetime Distance is used, which simply means that points are considered less related to each other the further apart they are in space and time.
  • Verdict: Shows the DataFrame IDs of the points for which the null hypothesis can be rejected (i.e., statistically significant points). Note that the list of IDs is truncated for large outputs.
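To make the "Inverse Spacetime Distance" idea concrete, here is a minimal sketch of such a weighting function. The function name, the eps smoothing term, and the cutoff behaviour are assumptions for illustration; GeoJikuu's internal weighting may differ in its details.

```python
import math

def inverse_spacetime_weight(p, q, critical_distance, eps=1e-12):
    """p, q: (x, y, z, t) spacetime coordinates.
    Nearer pairs in space and time get larger weights."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    if d > critical_distance:
        return 0.0          # outside the neighbourhood: unrelated
    return 1.0 / (d + eps)  # closer pairs -> larger weight

near = inverse_spacetime_weight((0, 0, 0, 0), (1, 0, 0, 0), 5.0)
far = inverse_spacetime_weight((0, 0, 0, 0), (3, 0, 0, 0), 5.0)
print(near > far)  # True: the nearer pair is more strongly related
```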

The run() function also returns a DataFrame of each point along with its projected coordinates (including the temporal coordinate), number of neighbours, Z-score, p-value, whether it is statistically significant, and whether it is classified as belonging to a hot spot or a cold spot:

results.head()
   st_coordinates                                                          neighbours  z-score    p-value   significant  type
0  (-0.5867252094096281, 0.5760951446437298, 0.5690939403658662, 730119)  6           -1.557552  0.119067  FALSE        COLD SPOT
1  (-0.5865446454704655, 0.5761508222375707, 0.5692236896905267, 730123)  6           -1.417533  0.156101  FALSE        COLD SPOT
2  (-0.5867908125787922, 0.5765169810552485, 0.5685989032948122, 730128)  6           -1.904935  0.056459  FALSE        COLD SPOT
3  (-0.6201191587857982, 0.5241083019965597, 0.5837488472665938, 730133)  3           -0.551010  0.581620  FALSE        COLD SPOT
4  (-0.6198530442409449, 0.5251996831772221, 0.5830501662256676, 730138)  3           -0.654974  0.512471  FALSE        COLD SPOT

Given that we are working with a Pandas DataFrame, we can easily filter our results to show only those that are statistically significant:

sig_results = results[results['significant'] == "TRUE"]
sig_results
    st_coordinates                                                          neighbours  z-score    p-value   significant  type
6   (-0.5870113922787347, 0.5761756200345676, 0.5687172234184252, 730150)  6           -2.079184  0.037283  TRUE         COLD SPOT
7   (-0.5870415575785646, 0.576261310921995, 0.568599253613823, 730154)    6           -2.140328  0.032020  TRUE         COLD SPOT
18  (-0.5402864056943888, 0.6346518053114892, 0.5525465463078254, 730516)  8            2.473804  0.013145  TRUE         HOT SPOT
19  (-0.5403404627390118, 0.6346460904442915, 0.5525002481544843, 730520)  8            2.244474  0.024515  TRUE         HOT SPOT
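Further summaries work the same way. For instance, the significant points can be tallied by category with value_counts(); the snippet below uses a small stand-in DataFrame (with column names matching the output above) so that it runs on its own:

```python
import pandas as pd

# Stand-in for the DataFrame returned by run(); the column names
# follow the output shown above, but the rows are illustrative.
results = pd.DataFrame({
    "z-score": [-2.079184, -2.140328, 2.473804, 2.244474, 0.650000],
    "significant": ["TRUE", "TRUE", "TRUE", "TRUE", "FALSE"],
    "type": ["COLD SPOT", "COLD SPOT", "HOT SPOT", "HOT SPOT", "HOT SPOT"],
})

sig_results = results[results["significant"] == "TRUE"]
print(sig_results["type"].value_counts())  # 2 hot spots, 2 cold spots
```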