Datasets & Knowledge Graph
The foundation of MzeeChakula is a comprehensive dataset linking local Ugandan foods to nutritional data and health conditions. This data is structured into a Heterogeneous Knowledge Graph.
Explore the Graph
For a deep dive into the graph's schema, ontology, and reasoning capabilities, see the Knowledge Graph page.
Data Sources
- Food Composition: FAO and local Ugandan food composition tables.
- Health Conditions: WHO guidelines and local health ministry data.
- Cultural Practices: Surveys and anthropological studies on Ugandan dietary habits.
- Market Data: Local market surveys for price and seasonality.
The Knowledge Graph Structure
The graph connects typed nodes (foods, nutrients, health conditions, and context such as region and season) through typed, directed edges. The tables and lists below describe each type.
Node Types & Features
| Node Type | Count | Feature Dim | Description |
|---|---|---|---|
| Food | 5,005 | 8 | Features include macronutrients (Protein, Carbs, Fat), energy, and fiber. |
| Nutrient | 30 | 30 | Detailed micronutrient profiles (Vitamins, Minerals). |
| Condition | 4,852 | 2 | Health conditions encoded with severity and type. |
| Benefit | 10,000 | 10,000 | Sparse vector representing specific health benefits. |
| MealPlan | 10,000 | 4 | Historical or template meal plans. |
| Region | 5 | - | Central, Western, Eastern, Northern, etc. |
| Season | 12 | - | Monthly seasonality nodes. |
Edge Types (Relationships)
- CONTAINS: Food $\rightarrow$ Nutrient (weighted by amount).
- AFFECTS: Nutrient $\rightarrow$ Condition (positive/negative impact).
- AVAILABLE_IN: Food $\rightarrow$ Season.
- GROWN_IN: Food $\rightarrow$ Region.
- CULTURALLY_RELEVANT: Food $\rightarrow$ Culture.
- HAS_PRICE: Food $\rightarrow$ PriceRange.
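Each relationship above can be written as a (source, relation, target) triple, which is also how PyTorch Geometric keys edge types in a `HeteroData` object. The following is a minimal sketch in plain Python; `EDGE_TYPES` and `relations_from` are illustrative names, not part of the project's code.

```python
# Edge types from the list above as (source, relation, target) triples.
# EDGE_TYPES and relations_from are hypothetical names used for illustration.
EDGE_TYPES = [
    ("Food", "CONTAINS", "Nutrient"),
    ("Nutrient", "AFFECTS", "Condition"),
    ("Food", "AVAILABLE_IN", "Season"),
    ("Food", "GROWN_IN", "Region"),
    ("Food", "CULTURALLY_RELEVANT", "Culture"),
    ("Food", "HAS_PRICE", "PriceRange"),
]


def relations_from(node_type):
    """Return the (relation, target) pairs that start at node_type."""
    return [(rel, dst) for src, rel, dst in EDGE_TYPES if src == node_type]


if __name__ == "__main__":
    for rel, dst in relations_from("Food"):
        print(f"Food -[{rel}]-> {dst}")
```

Keeping the schema as data makes it easy to check which relations leave a given node type before querying the graph.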
Processed Datasets
The data/processed/ directory contains cleaned CSVs ready for analysis and model training:
| File | Rows | Description |
|---|---|---|
| food_composition_clean.csv | 5,006 | Nutritional breakdown of foods. |
| health_conditions_clean.csv | 5,029 | Mapping of conditions to dietary needs. |
| food_prices_clean.csv | 10,001 | Market prices for cost optimization. |
| cultural_food_practices_clean.csv | 10,002 | Cultural relevance data. |
| food_seasonality_clean.csv | 5,001 | Availability of foods by month. |
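To sanity-check a cleaned file before training, its columns and row count can be inspected with the standard library alone. This sketch assumes each CSV starts with a single header row; the text does not say whether the "Rows" figures include that header. `summarize_csv` is an illustrative helper name.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV with one header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])        # first row is assumed to be the header
        n_rows = sum(1 for _ in reader)  # every remaining row counts as data
    return header, n_rows


if __name__ == "__main__":
    # Assumed location, following the directory named in the text.
    target = Path("data/processed/food_composition_clean.csv")
    if target.exists():
        columns, n_rows = summarize_csv(target)
        print(f"{target.name}: {n_rows} data rows, columns={columns}")
```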
Graph Tensors
For efficient GNN training, the graph is converted into PyTorch Geometric tensors stored in data/graph_tensors/.
import torch
from torch_geometric.data import HeteroData
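# Load the precomputed graph. The file holds a pickled HeteroData object, so
# recent PyTorch releases (2.6+) may need weights_only=False; only use this
# for files you trust.
data = torch.load('graph_tensors/graph_pyg.pt', weights_only=False)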
print(f"Node types: {data.node_types}")
print(f"Edge types: {data.edge_types}")
print(f"Food features shape: {data['Food'].x.shape}")
Tensor Files
- {Node}_features.npy: NumPy arrays containing node feature vectors.
- {Source}-{Rel}-{Target}_edge_index.npy: Edge index arrays encoding the adjacency for each relation type.
- node_mappings.json: Maps original IDs (e.g., "Matoke") to graph indices (e.g., 0).
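The naming pattern for edge files makes it possible to recover the relation triple from a filename, and the ID mapping can be inverted to translate graph indices back to names. The sketch below assumes that node type and relation names contain no hyphens, and that the mapping for one node type is a flat `{original_id: index}` dictionary; the actual layout of node_mappings.json may differ (for example, nested per node type). Both function names are illustrative.

```python
import json

EDGE_SUFFIX = "_edge_index.npy"


def parse_edge_filename(filename):
    """Split '{Source}-{Rel}-{Target}_edge_index.npy' into (source, rel, target)."""
    if not filename.endswith(EDGE_SUFFIX):
        raise ValueError(f"not an edge index file: {filename}")
    stem = filename[: -len(EDGE_SUFFIX)]
    parts = stem.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected Source-Rel-Target, got: {stem}")
    return tuple(parts)


def invert_mapping(id_to_index):
    """Turn {original_id: index} into {index: original_id}."""
    return {index: original_id for original_id, index in id_to_index.items()}


if __name__ == "__main__":
    print(parse_edge_filename("Food-CONTAINS-Nutrient_edge_index.npy"))
    # Assumed flat layout for one node type, using the example from the text.
    food_ids = json.loads('{"Matoke": 0}')
    print(invert_mapping(food_ids)[0])
```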
Data Processing Pipeline
- Ingestion: Raw CSVs are loaded from data/raw/.
- Cleaning: Missing values are imputed, and units are standardized.
- Graph Construction: Nodes and edges are created in Neo4j.
- Enrichment: Context nodes (Season, Region) are linked.
- Export: The graph is exported to PyTorch Geometric tensors for training.
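The cleaning and export steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's pipeline: the text does not say which imputation method or unit conventions are used, so column-mean imputation and conversion to grams are stand-ins, the Neo4j step is skipped, and all function names and sample IDs are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def to_grams(amount, unit):
    """Standardize a mass to grams; only 'g', 'mg' and 'kg' are handled here."""
    factors = {"g": 1.0, "mg": 1e-3, "kg": 1e3}
    return amount * factors[unit]


def build_index(ids):
    """Assign consecutive integer indices to unique IDs, in first-seen order."""
    mapping = {}
    for item in ids:
        mapping.setdefault(item, len(mapping))
    return mapping


def build_edge_index(pairs, src_index, dst_index):
    """Convert (source_id, target_id) pairs into two parallel index lists."""
    sources = [src_index[s] for s, _ in pairs]
    targets = [dst_index[t] for _, t in pairs]
    return [sources, targets]


if __name__ == "__main__":
    # Toy placeholder rows, not real composition data.
    pairs = [
        ("food_a", "nutrient_x"),
        ("food_a", "nutrient_y"),
        ("food_b", "nutrient_x"),
    ]
    foods = build_index(s for s, _ in pairs)
    nutrients = build_index(t for _, t in pairs)
    print(build_edge_index(pairs, foods, nutrients))
    print(impute_mean([1.0, None, 3.0]))
    print(to_grams(250.0, "mg"))
```

The two-row layout returned by `build_edge_index` mirrors the `[2, num_edges]` shape that PyTorch Geometric expects for an `edge_index` tensor, so the result can be wrapped in a tensor directly.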