Datasets & Knowledge Graph

The foundation of MzeeChakula is a dataset that links local Ugandan foods to nutritional data and health conditions. The data is structured as a heterogeneous knowledge graph.

Explore the Graph

For a deep dive into the graph's schema, ontology, and reasoning capabilities, see the Knowledge Graph page.

Data Sources

Food Composition: FAO and local Ugandan food composition tables.
Health Conditions: WHO guidelines and local health ministry data.
Cultural Practices: Surveys and anthropological studies on Ugandan dietary habits.
Market Data: Local market surveys for price and seasonality.

The Knowledge Graph Structure

The graph is heterogeneous: every node and edge has a type, and each node type carries its own feature vector. The node types and relationships are listed below.

Node Types & Features

Node Type	Count	Feature Dim	Description
Food	5,005	8	Features include macronutrients (Protein, Carbs, Fat), energy, and fiber.
Nutrient	30	30	Detailed micronutrient profiles (Vitamins, Minerals).
Condition	4,852	2	Health conditions encoded with severity and type.
Benefit	10,000	10,000	Sparse vector representing specific health benefits.
MealPlan	10,000	4	Historical or template meal plans.
Region	5	-	Central, Western, Eastern, Northern, etc.
Season	12	-	Monthly seasonality nodes.
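As a sanity check, the counts and feature dimensions in the table can be compared against the shapes of the loaded feature matrices. The following is a minimal sketch, not part of the project code: `EXPECTED_SHAPES` and `shape_mismatches` are hypothetical names, and the numbers are copied from the table above.

```python
# (count, feature_dim) per node type, copied from the table above.
# Region and Season list no feature dimension, so their dim is None.
EXPECTED_SHAPES = {
    "Food": (5005, 8),
    "Nutrient": (30, 30),
    "Condition": (4852, 2),
    "Benefit": (10000, 10000),
    "MealPlan": (10000, 4),
    "Region": (5, None),
    "Season": (12, None),
}


def shape_mismatches(actual_shapes, expected=EXPECTED_SHAPES):
    """Return {node_type: (actual, expected)} for each node type whose shape differs.

    Node types without a listed feature dimension are skipped.
    """
    problems = {}
    for node_type, (count, dim) in expected.items():
        if dim is None:
            continue
        actual = actual_shapes.get(node_type)
        if actual != (count, dim):
            problems[node_type] = (actual, (count, dim))
    return problems
```

Pass in a dictionary that maps each node type to the shape of its feature matrix as a tuple; an empty result means the shapes agree with the table.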

Edge Types (Relationships)

CONTAINS: Food $\rightarrow$ Nutrient (Weighted by amount).
AFFECTS: Nutrient $\rightarrow$ Condition (Positive/Negative impact).
AVAILABLE_IN: Food $\rightarrow$ Season.
GROWN_IN: Food $\rightarrow$ Region.
CULTURALLY_RELEVANT: Food $\rightarrow$ Culture.
HAS_PRICE: Food $\rightarrow$ PriceRange.
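PyTorch Geometric identifies each relationship in a heterogeneous graph by a (source, relation, target) triple. The sketch below writes the relationships listed above in that form and shows how a weighted CONTAINS edge could be stored as an edge index plus a weight array. The toy edges and amounts are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# The relationships listed above as (source, relation, target) triples.
EDGE_TYPES = [
    ("Food", "CONTAINS", "Nutrient"),
    ("Nutrient", "AFFECTS", "Condition"),
    ("Food", "AVAILABLE_IN", "Season"),
    ("Food", "GROWN_IN", "Region"),
    ("Food", "CULTURALLY_RELEVANT", "Culture"),
    ("Food", "HAS_PRICE", "PriceRange"),
]

# Toy CONTAINS edges as (food index, nutrient index, amount).
# These values are invented for illustration only.
toy_contains = [(0, 0, 1.3), (0, 2, 23.0), (1, 0, 7.2)]

# COO layout: row 0 holds source indices, row 1 holds target indices.
edge_index = np.array(
    [[src for src, _, _ in toy_contains],
     [dst for _, dst, _ in toy_contains]],
    dtype=np.int64,
)

# One weight per edge, aligned with the columns of edge_index.
edge_weight = np.array([amount for _, _, amount in toy_contains], dtype=np.float32)
```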

Processed Datasets

The data/processed/ directory contains cleaned CSVs ready for analysis and model training:

File	Rows	Description
`food_composition_clean.csv`	5,006	Nutritional breakdown of foods.
`health_conditions_clean.csv`	5,029	Mapping of conditions to dietary needs.
`food_prices_clean.csv`	10,001	Market prices for cost optimization.
`cultural_food_practices_clean.csv`	10,002	Cultural relevance data.
`food_seasonality_clean.csv`	5,001	Availability of foods by month.
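A quick way to inspect one of these files is to read its header and count its data rows. The sketch below uses only the standard library (a dataframe library would work equally well); `summarize_csv` is a hypothetical helper, and no assumption is made about the actual column names.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, number_of_data_rows) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # an empty file yields an empty header
        n_rows = sum(1 for _ in reader)    # every remaining line is a data row
    return header, n_rows
```

For example, `summarize_csv("data/processed/food_composition_clean.csv")` returns the column names and the number of data rows in the food composition file.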

Graph Tensors

For efficient GNN training, the graph is converted into PyTorch Geometric tensors stored in data/graph_tensors/.

import torch
from torch_geometric.data import HeteroData

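# Load the precomputed graph (path is relative to the data/ directory).
# On PyTorch 2.6+, torch.load defaults to weights_only=True and can reject
# pickled objects such as HeteroData, so weights_only=False is passed here.
# Only load files from a trusted source.
data = torch.load('graph_tensors/graph_pyg.pt', weights_only=False)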

print(f"Node types: {data.node_types}")
print(f"Edge types: {data.edge_types}")
print(f"Food features shape: {data['Food'].x.shape}")

Tensor Files

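{Node}_features.npy: NumPy arrays containing the feature vectors for each node type.
{Source}-{Rel}-{Target}_edge_index.npy: Edge index arrays for each relation type, holding source and target node indices (PyTorch Geometric's `edge_index` layout is a `[2, num_edges]` array rather than a dense adjacency matrix).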
node_mappings.json: Maps original IDs (e.g., "Matoke") to graph indices (e.g., 0).
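The naming scheme above makes it possible to load the raw arrays without PyTorch. The following is a minimal sketch under two assumptions: node and relation names contain no hyphen, and all three kinds of file sit in the same directory. `parse_edge_filename` and `load_graph_arrays` are hypothetical helpers, and the internal layout of `node_mappings.json` is not assumed beyond it being valid JSON.

```python
import json
from pathlib import Path

import numpy as np

EDGE_SUFFIX = "_edge_index.npy"
FEATURE_SUFFIX = "_features.npy"


def parse_edge_filename(name):
    """Split '{Source}-{Rel}-{Target}_edge_index.npy' into (source, rel, target)."""
    if not name.endswith(EDGE_SUFFIX):
        raise ValueError(f"not an edge index file: {name}")
    parts = name[: -len(EDGE_SUFFIX)].split("-")
    if len(parts) != 3:
        raise ValueError(f"expected Source-Rel-Target, got: {name}")
    return tuple(parts)


def load_graph_arrays(tensor_dir):
    """Load node features, edge indices, and ID mappings from one directory."""
    tensor_dir = Path(tensor_dir)
    # Node features keyed by node type, e.g. "Food".
    features = {
        p.name[: -len(FEATURE_SUFFIX)]: np.load(p)
        for p in sorted(tensor_dir.glob("*" + FEATURE_SUFFIX))
    }
    # Edge indices keyed by (source, relation, target).
    edges = {
        parse_edge_filename(p.name): np.load(p)
        for p in sorted(tensor_dir.glob("*" + EDGE_SUFFIX))
    }
    mappings = json.loads((tensor_dir / "node_mappings.json").read_text(encoding="utf-8"))
    return features, edges, mappings
```

Keying the edge dictionary by triples keeps it aligned with the edge types that PyTorch Geometric reports through `data.edge_types`.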

Data Processing Pipeline

Ingestion: Raw CSVs are loaded from data/raw/.
Cleaning: Missing values are imputed, and units are standardized.
Graph Construction: Nodes and edges are created in Neo4j.
Enrichment: Context nodes (Season, Region) are linked.
Export: The graph is exported to PyTorch Geometric tensors for training.
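The cleaning and export steps can be illustrated on a toy example. This is a sketch under stated assumptions, not the project's pipeline: the records and their values are invented, median imputation is one possible choice (the imputation method actually used is not specified here), the Neo4j step is skipped, and `clean` and `build_index` are hypothetical helpers.

```python
from statistics import median

import numpy as np

# Invented raw records: mixed units and one missing value.
raw = [
    {"food": "Matoke", "protein": 1.3, "unit": "g"},
    {"food": "Beans", "protein": 7200.0, "unit": "mg"},
    {"food": "Posho", "protein": None, "unit": "g"},
]


def clean(records):
    """Convert all amounts to grams, then fill missing values with the median."""
    to_grams = {"g": 1.0, "mg": 1e-3}
    converted = [
        None if r["protein"] is None else r["protein"] * to_grams[r["unit"]]
        for r in records
    ]
    fill = median(v for v in converted if v is not None)
    return [
        {"food": r["food"], "protein_g": fill if v is None else v}
        for r, v in zip(records, converted)
    ]


def build_index(ids):
    """Map original IDs to contiguous graph indices in first-seen order."""
    mapping = {}
    for item in ids:
        mapping.setdefault(item, len(mapping))
    return mapping


cleaned = clean(raw)
food_index = build_index(r["food"] for r in cleaned)
nutrient_index = build_index(["Protein"])

# CONTAINS edges in COO form: row 0 = food indices, row 1 = nutrient indices.
edge_index = np.array(
    [[food_index[r["food"]] for r in cleaned],
     [nutrient_index["Protein"]] * len(cleaned)],
    dtype=np.int64,
)
# Edge weights carry the nutrient amount in grams.
edge_weight = np.array([r["protein_g"] for r in cleaned], dtype=np.float32)
```

The `food_index` dictionary plays the same role as `node_mappings.json`: it records which graph index each original ID was assigned.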