A Mathematical Graph-Theoretic Model of Single Point Mutations Associated with Sickle Cell Anemia Disease

Many diseases like cystic fibrosis and sickle cell anemia disease (SCD), among others, arise from single point mutations in the respective proteins. How a single point mutation might lead to a global devastating consequence on a protein remains an intellectual mystery. SCD is a genetic blood-related disorder resulting from mutations in the beta chain of the human hemoglobin protein (simply, β-globin), subsequently affecting the entire human body. Higher mortality and morbidity rates have been reported for patients with SCD, especially in sub-Saharan Africa. Clinical management of SCD often requires specialized interdisciplinary clinicians. SCD presents a major global burden, hence an improved understanding of how single point mutations in β-globin results in different phenotypes of SCD might offer insight into protein engineering, with potential therapeutic intervention in view. By use of mathematical modeling, we built a hierarchical (nested) graph-theoretic model for the β-globin. Subsequently, we quantified the network of interacting amino acid residues, representing them as molecular system of three distinct stages (levels) of interactions. Using our nested graph model, we studied the effect of virtual single point mutations in β-globin that results in varying phenotypes of SCD, visualized by unsupervised machine learning algorithm, the dendrogram.


Introduction
Protein engineering requires deep insight into protein folding and misfolding, thereby underscoring the importance of experimental and computational models to study such complex network and systems. There is limited understanding of how a single point mutation in the β-globin prevents correct folding and maintenance of hemoglobin and subsequently affects the entire protein. Protein folding and maintenance are critical for ensuring proper protein function (Yagishita et al., 2008). Many factors (genetic or non-genetic) can prevent correct protein folding, leading to a protein's inability to perform the appropriate structure-related functions (Cooper & Hausman, 2007). Even though the human body has several biological corrective and clearance mechanisms, in extreme instances, these mechanisms are unable to undo the effect of such misfolded proteins that might arise from severe mutations. Misfolded uncorrected proteins aggregate in and around the cells. Many diseases including SCD, COVID-19 and neurodegenerative diseases (such as Parkinson's, Alzheimer's, Amyotrophic lateral sclerosis) show elevated levels of protein aggregates resulting from mostly severe mutations (Ayyadevara et al., 2017;Balasubramaniam et al., 2019;Kakraba et al., 2019). Conventionally, a mutation refers to changes in amino acid sequences of a protein, occurring by either addition, insertion, deletion or switching of positions in the amino acid residue(s) (Griffiths et al., 2002). Single point mutations (a mutation in only one amino acid residue) may result in varying degrees of severity and phenotypic clinical manifestations (Thom et al., 2013). Illustratively, mutation in the Huntington gene, Parkinson's (e.g. LRRK2, PARK7), cystic fibrosis (e.g. N1303K, deltaF508), diabetes mellitus, sickle-cell anemia (e.g. E6V, V23I, K82N SCD is a genetic hemoglobin disorder affecting many people in different populations, diagnosed at birth upon receiving either one or two sickle cell genes (one from each parent) resulting in continuous production of faulty hemoglobin molecules. The most severe form of the disease, homozygous HbSS, occurs when the faulty hemoglobin producing gene is passed on from both parents to the newborn (Da Guarda et al., 2020). The faulty hemoglobin occurs because of a single point mutation involving the substitution of glutamic acid with valine at the sixth position (denoted as Glu6Val or E6V) of the β-globin. In contrast to healthy red blood cells (RBCs) (that appear 'disc-like round'), sickled RBCs appears crescent, less flexible, less oxygenated (due to reduced cell surface and volume), stiff and have the general tendency to stick together, thus preventing smooth blood transport which may end up clogging the blood vessels (Hartl, D. L., & Ruvolo, 2012). Sickle RBCs have a shorter life span of 10 to 20 days instead of the typical 90 to 120 days for healthy RBCs (Kanter & Kruse-Jarres, 2013;Neumayr et al., 2019). A Characteristic symptom of SCD is anemia (low RBCs cells count) leading to fatigue, occasional difficulty in breathing and retarded growth in children. Pulmonary hypertension associated with SCD is a serious complication estimated to affect 6-11% of patients with SCD. The world health organization (WHO) estimates the global prevalence of sickle cells trait to be about 5%, comprising mainly the sickle cell disease and thalassemia. Yearly, about 30,0000 newborns are diagnosed with significant hemoglobin disorders. In Africa, which carries a major percentage of the global distribution of SCD (SCD prevalence 20% to 45%), several children born with SCD die by the age of five (5). In the United States (US) SCD prevalence is about 1 in every 16,300 Hispanic American and 1 in about every 365 African American births. SCD is among 8 out of 240 specific causes of disease that accounted for 100,000 deaths from 1990 to 2013 worldwide.
Current treatment options are case and symptoms dependent. In the US, Hydroxyurea is prescribed to alleviate episodic pain sensation in SCD patients (Naghavi et al., 2015;Wastnedge et al., 2018). Long time use of Hydroxyurea has been shown to reduce mortality rate by 40% (Neumayr et al., 2019). However, certain adverse events associated with long term use of Hydroxyurea, including alopecia (hair loss), headache, nausea, weight gain and carcinogenic potential continue to impact its global use in SCD clinical management ( . Though L-glutamine has been shown to reduce the occurrence of pain crises related to SCD, the frequency of administration of the drug and monthly usage cost (~$3000 USD) may be challenging to patients (Quinn, 2018). For complete treatment, the state-of-the art is bone marrow therapy or hematopoietic stem cell transplant (HSCT). HSCT procedure is an intensive and risky procedure, ultimately relying on availability of a matching donor. Although there are ongoing research efforts on gene editing therapy for SCD treatment, there are also medical ethics-related challenges and regulatory issues impacting the necessary technological advances required (Gardner, 2018;Neumayr et al., 2019). This underscores the importance of a continuous search for novel treatment for SCD. We are of the view that a great deal of understanding of how a single point mutation in the beta chain of the hemoglobin protein leads to different phenotypes of SCD is critical to protein engineering, drug discovery and design to address the increasing challenges presented by SCD.
Graph theory (a branch of discrete mathematics), concerned with the study of mathematical structures (graphs), is used for pairwise-object relationship modeling. It has been applied to model and study many dynamic systems in an attempt to gain insight into how such complex network systems function (Kyrtsos & Baras, 2012). Quite recently, few works have used mathematical graph-theoretic modeling for examining the effect of single point mutations in a protein domain that results in disease with varying phenotypes and clinical manifestations (Kakraba, 2015;Kakraba & Knisley, 2016;Knisley et al., 2013). Presently, to the best of our knowledge, graphtheoretic modeling has not been employed to examine the effect of single point mutations on in the β-globin that result in SCD with various clinical manifestations. In this work, we extend mathematical graph-theoretic modeling approach in our previous works (Knisley et al., 2013;Kakraba, 2015;Kakraba & Knisley, 2016) to the β-globin and examine the effect of single point mutations associated with SCD in β-globin of the hemoglobin.
In modeling the β-globin, we constructed a 3-level hierarchical vertex-weighted model for the β-globin. We obtained new vertex and edge weighted combinatorial molecular indices for our graph-theoretic model of the β-globin. The molecular descriptors were computed based on standard graph theory definitions and authorsadopted definitions for vertex-weighted graphs. The resulting impact of six (6) single point mutations associated with varying degrees of severity and phenotype of sickle cell anemia in the β-globin are measured by use of the nested graph model. Each virtual single point mutation results in distinct combinatorial invariants (indices or quantitative molecular descriptors) that reflect the local and global effect of each specific mutation on the entire protein. In this way, we captured the underlying consequential structural local and global network changes that stem from each single point mutation in the β-globin. Arguably, typical of our mathematical graph-theoretic modeling is the ability to rationally model how a local perturbation (mutation) results in small or huge global effect on the β-globin.

Graph-theoretic model for β-globin
Although, graph-theoretic modeling adopted in this work is similar to previously used methods (Kakraba, 2015;Kakraba & Knisley, 2016), this work focuses on the beta chain of the hemoglobin protein and broaden previously published combinatorial descriptors to incorporate edge-weight assignments. We adopted some molecular indices from the molecular database from Table 1-3 (Kakraba, 2015; Kakraba & Knisley, 2016) and incorporated other standard and author-adopted graph invariants in this work. Therefore, the molecular descriptors of the subdomain graphs (resulting from the unique partitioning into subsequences), the resulting β-globin graph, and the mutations examined in this work are specific to the β-globin.

Subsequence partition of the β-globin
The fasta sequence of the protein corresponding to the β-globin (PDB ID: 1A3N) was retrieved from the protein databank (Tame & Vallone, 2000; The Protein Data Bank, 1998). Using UCFS Chimera (Pettersen et al., 2004) plug-in built for Cystoscape (Shannon et al., 2003), we visualized and partitioned the β-globin into ten (10) nonoverlapping subsequences represented by Gi, i =1…10. To preserve essential biological information contained in the secondary structures of the β-globin, we avoided cutting through binding sites, beta strands and alpha helixes during partitioning of the β-globin into subsequences. We also ensured each subsequence contained only one type of secondary structure; either a beta strand, an alpha helix, turn, bend or a loop with no more than 25 amino acid residues. A loop region was tolerated to include turns, a 3/10-helix, and an alpha helix with less than 6 residues of amino acids. Table 1 is the resulting non-overlapping subsequences for each subdomain and reason for such partitioning of the β-globin. Subdomain graphs corresponding to the subsequences of the β-globin The subsequences corresponding to each subdomain of the β-globin were submitted to I-TASSER for Ab-initio modeling, and the results used in Cystoscape to generate the interaction or network graphs. Altogether, ten (10) subdomain graphs were generated corresponding toe the ten subsequences in Table 1.
To observe more interactions between the amino acid residues in the subdomain graphs of the β-globin, we chose a proximity measure of six (6) angstroms. We determined the endpoints by the center of mass of each amino acid residue in the 3D structure of β-globin. The wildtype subdomain graph and mutated subdomain graph (E6V) of subsequence corresponding to G1 are shown in Fig.1 (refer to the appendix 1 for other subdomain graphs generated for β-globin). Mutated subdomain graphs were obtained by submitting the mutated subsequences to I-TASSER (Zhang, 2007b(Zhang, , 2007a for ab initio predictive modeling, and subsequently using Cystoscope for the corresponding network graphs(refer to the appendix 2 for other mutated subdomain graphs generated for β-globin).

Hierarchical (Nested Graph) Model for the β-globin
Finally, we denote each of the subdomains with a vertex in a domain graph of the β-globin (H). From the protein sequence submitted to UCFS chimera software (Pettersen et al., 2004), we obtained the network interaction data and using Cystoscope (Shannon et al., 2003) , network (interaction) graph was generated for the β-globin.
Our hierarchical graph has three (3) layers, namely, the lowest level, middle level, and the top level. The lowest level comprises 20-small vertex-weighted amino acids, corresponding to the 20 most essential amino acids. The middle level graph denotes the subgraphs corresponding to the subsequences of the β-globin. We generated ten (10) distinct vertex-weighted subgraphs (each vertex in the subgraph denoting an amino acid) in which the weights of the vertices and edges are graph invariants (molecular descriptors) obtained from the amino acids of the lower-level graph. In the top-level graph, each subdomain graph (subgraph from the middle level) collapses into a single weighted vertex (node). The weights assigned to each subdomain graph as a vertex or node in the top-level graph are based on vertex and edge weighted molecular descriptors of each subdomain graph. The edges/links seen in the β-globin nested domain graph are determined by a nearness or proximity measure threshold distance of 6 angstroms (6Å) between any two adjacent residues of the graph. The molecular descriptors particular to our β-globin model were then calculated for further network analysis. Figure 2 shows a model for the β-globin based on the graph-theoretic approach applied in this work.

Graph-theoretic invariants (molecular descriptors) for Subdomain and Top-level Graph
We computed weighted graph invariants for subdomain graphs for the corresponding subsequences aimed at describing the local properties of the subgraphs. These values become the wildtype values of the subdomain graphs, and subsequently incorporated as vertex-edge weights in the top-level graph (H). The molecular descriptors calculated for the wildtype values of the subdomain graphs are depicted in Table 2. Descriptor e5 was based on assignment of molecular descriptors d13 from our previous work (Kakraba & Knisley, 2016). Weighted domination number, weighted betweenness, and weighted eccentricity measures were used in assessing the impact of specific nodes in the graphs to determine their influences within the graph networks. For a given subdomain or top-level graph, we also considered the absolute value of the change in molar mass per average edge ( ) along the edge between connected vertices Ri and Rj, using equation (1) in which ̅ is the average weighted degree.  With these weights assigned to the edges of the top-level graph (H), we computed other graph centrality measures like weighted order of the graph, degree centrality, closeness centrality, eigenvector centrality and betweenness centrality (Ni et al., 2011) for subdomain graphs and top-level graph using Cystoscape (Shannon et al., 2003).

Virtual Single Point SCD Mutations in the β-globin
To assess the effect of single point mutations on the entire β-globin, we proceeded by selecting six (6) prevalent mutations known to be associated with mild or severe SCD phenotypes in the β-globin from literature (refer to Table 3). In computing the set of graph-theoretic indices corresponding to each mutation, we employed these procedures. First, starting from the residue in the subdomain graphs (Gi), assign amino acid descriptors from the molecular database (Kakraba, 2015;Kakraba & Knisley, 2016) as weights to the vertices of the subdomain graphs and calculate graph invariants specific to each subdomain graph. Use subdomain molecular descriptors from above as weights of the vertices or edges in the top-level graph and subsequently obtain graph invariants for the top-level graph (H). These invariants of the top-level graph represent the wildtype molecular descriptors.
To perform the virtual single point mutations for the mild and severe phenotypes, return to each subdomain graph where a specific mutation occurs, and locate the corresponding amino acid at that position in the amino acid residue. Mutate the residue by substituting or deleting the mutant residue at the position (where appropriate) in the subdomain graph. Consequently, a 'mutant-specific vertex-weighted graph' results for that subdomain graph where the mutation occurred. Next, compute the corresponding graph-theoretic molecular descriptors or invariants for the new vertex-weighted subdomain graph ('mutated subdomain graph') arising from the mutation. Proceed to assign these new set of molecular descriptors to the top-level graph and recompute the molecular descriptors of the top-level graph resulting from the mutation. Each mutation results in graph-theoretic-specific molecular descriptors of the top-level graph. Through this approach, we determine the local and global effect of each single point mutation on the β-globin. The wildtype graph-theoretic molecular descriptors and those arising from the six (6) mutation provide indices useful for viewing the local and global effect of each single point mutations in the β-globin (mutated subdomain graphs depicted in appendix 2). The mutation-specific molecular descriptors for the top-level graph (β-globin) are calculated and reported in Table  4.

Results and Discussion
We used our graph-theoretic molecular weighted invariants or descriptors (refer to Table 4) computed for the wildtype, mild and severe mutations, and employed a statistical unsupervised machine learning algorithm (specifically, a hierarchical clustering) to visualize how each SCD single point mutation varies from the wildtype for the β-globin. In R statistical software (R Core Team, 2014), we used the single linkage function and the Euclidean distance for generating the dendrogram. Our choice of the single-linkage function stems from our effort to decrease any possible biases in the clustering of the single point mutations associated with SCD. In this way, we were able to visualize the effect of the virtual SCD single point mutations on the entire β-globin in relation to the wildtype. The dendrogram (depicted in Fig. 3) is the resulting visual representation of the effect of each single point mutation on the β-globin, and subsequently the entire hemoglobin protein.
Notably, our model ( Fig. 3) classifies the severe mutation, E6V into entirely different and farthest cluster (distance of 0.015 Euclidean distance) away from the wildtype, as an expression of the profound impact of E6V on the β-globin. Consistent with literature and clinical manifestations, E6V (Glu6Val) is a typical sickle-cell anemia mutation of severe clinical outcomes. E6V or Glu6Val is the mutation occurring as a substitution of glutamic acid with valine at position 6 in the β-globin. Characteristic of E6V is the formation of abnormal hemoglobin S subunits that aggregates or sticks to each other. This results in formation of long, rigid rod-like molecules, thereby altering the characteristic functional shape of the red blood cells to form sickle or crescent shape. Consequentially, premature death of sickle-shaped cells arises, leading to shortage of red blood cells (anemia). Severe pain and damage to the organs happens when the rigid sickle-shaped cells block small blood vessels, preventing easy flow of blood (

Conclusions
That single point mutations (local perturbations) in the beta chain of the hemoglobin consequentially translate into either mild or severe global effect on an entire protein has not been explicitly studied. To gain further insight into how single point mutations might impact the entire hemoglobin protein, we built a hierarchical (nested) graph-theoretic model for the β-globin. Subsequently, we quantified the network of interacting amino acid residues, representing them as molecular system of distinct stages (levels) of interactions. Using our nested graph model, we studied the effect of virtual single point mutations on the β-globin, visualized by unsupervised machine learning algorithm, the dendrogram. By employing such mathematical modeling, computational and systems biology methodology, we were able to visualize how a small perturbation such as a single point mutation affects the entire hemoglobin protein. Future research might consider addressing questions such as, why does mutation E6V cluster farthest from the wildtype in our model? Also, what inhibitors (corrector small molecules e.g., amino acid) or graph invariant(s) of the subdomain graph containing E6V would cause the severe Figure 3: Visualizing the effect of single point SCD-associated mutations in the β-globin. In R statistical software, molecular descriptors of the top-level graph (from Table 3) were used with the in-built functions (hclust and single linkage) and the Euclidean distance for clustering the single point mutations in associated with different phenotypes of SCD. Notably, from the cluster, mutation E6V is at 0.015 Euclidean distance away from the wildtype mutation, consistent with literature that E6V is the most severe SCD-associated mutation in the beta chain of the hemoglobin protein.
single point anemia-sickle-cell causing mutation (E6V) to cluster along or closer to the wildtype in our hierarchical graph-theoretic model? Such research questions, when addressed, will provide great insight from a computational therapeutic perspective on small molecule discovery and design to attenuate and or reverse the effects of the severe single point mutation (E6V) causing SCD.
Although, the present work focused on application of graph-theoretic model to examine the effect of single point mutations causing the mild and severe phenotype of SCD in the β-globin, the established mathematical modeling concept applied in this work is extendable to any disease arising from single point mutations in a protein like muscular dystrophy, COVID-19 (manuscript in preparation), among others. The existence of such a predictive graph-theoretic mathematical modeling that reflects systems biology might be informative in understanding unknown consequences of single point mutations in a protein domain, and aid in drug design for therapeutic intervention. This underscores the importance of computational models, especially mathematical graph-theoretic models in epidemiology. Our work adds to the existing knowledge on application of mathematical graph-theoretic modeling to study complex networks and systems biology, with potential therapeutic application in drug discovery and design.

Data Availability (excluding review articles)
Refer to the methods section for source of data from protein data bank

Author Contribution
Samuel Kakraba (SK) and Samuel M. Naandam (SMN) were responsible for concept, application, and interpretation of this work. Edem K. Netsey (EKN) performed works related to graph-theoretic model and generation of molecular descriptors under supervision of SK and SMN. Hierarchical clustering on dataset for the top-level graph invariants was implemented in R statistical software by SK. Manuscript was written by SK, Aayire C. Yadem (ACY), SMN and EKN.

Conflicts of Interest:
Authors declare no conflict of interest for this present study.
Funding Statement: No funding was received for this research work.