## Introduction

Genome-scale metabolic pathways, genome-environment interactions, the immune response, post-transcriptional regulatory mechanisms, and oncohistones represent aspects of a research field connecting the heritable genetic code to other biological codes.^{1–6} The aforementioned genetic code is defined precisely as a noninjective map from the 64 codons to the 20 amino acids. Both finite groups and quantum groups have leading roles in modeling this code.^{7–10} More explicitly, according to Planat *et al.*,^{8} complete quantum information is encoded in the 22 irreducible characters of the small group (240,105) ≌ Z_{5} ⋊ 2*O*, with 2*O* the binary octahedral group. The characters are put in correspondence with the DNA multiplets encoding the proteinogenic amino acids and the multiplicity is reflected in the dimension of the character representation. Further developments were explored in another study by Planat *et al.*,^{11} which showed that the small group (336,118) ≌ Z_{7} ⋊ 2*O* is another model of the genetic code reflecting the symmetry of the Lsm–7 complexes in the spliceosome. The eight-fold symmetric histone complex was subsequently investigated by Planat *et al.*,^{12} with the character table of the group (384,5589) ≌ Z_{8} ⋊ 2*O*.

The latest studies were the first to describe the role of a specific algebraic surface, called the Kummer surface, in the quantum modeling of the genetic code. From then on, we refer to the epigenetic code as all processes that reveal and execute gene expression. This includes DNA methylation processes,^{13} messenger RNA (mRNA) translation preparation, the poly(A) tail, the RNA-induced silencing complex, a vital tool in gene regulation comprising single strands of RNA and double strands of small interfering RNA, and other regulatory nucleotide sequence fragments that are discarded after splicing. Ultimately, this involves a relation between the epigenetic code and morphogenesis.^{14}

Chemical modifications of RNA also drive the metabolism of transcription of the genetic information. Post-transcriptional regulation of gene expression is a hot topic known as epitranscriptomics. There are more than 170 known types of RNA methylation processes but the most common in eukaryotes is the possible methylation of *N*^{6}-methyladenosine (*m*^{6}*A*) on sites with a specific short sequence *RRACH* (*R* = *A* or *G*, *H* = *A*, *U*, or *C*).^{15–17}

To study the epigenetic code (hereinafter referred to as the e-code), we used infinite (finitely generated) groups denoted by *f*_{p}, and their representations over the (2 × 2) matrix group *SL*_{2}(C), where the entries are complex numbers.^{18,19} The significance of this group extends across all fields of physics, as it represents a space-time-spin group. In this study, we applied a mathematical field known as algebraic geometry to define the e-code, which has not been done before.

Our key observation is that an *f*_{p} group associated with a healthy sequence usually approximates a free group *F*_{r}, where the rank *r* equals the number of distinct nucleotides minus one. A sequence deviating from this may suggest a potential e-code deregulation leading to a disease. However, an *f*_{p} group closely resembling a free group does not provide sufficient assurance against a disease. Additional examination of the *SL*_{2}(C) representations of *f*_{p}, termed the character variety, and specifically its basis called a Groebner basis *G* is necessary. The basis *G* comprises a set of surfaces. A surface within *G* containing isolated singularities indicates another potential disease that can be identified specifically, *e.g.*, relating to an oncogene or a neurological disorder.^{19} The e-code we define comprises such algebraic geometric characteristics.

An additional attribute of healthy sequences, which leads to a group *f*_{p} approximating the free group *F*_{r} and not mentioned in the study of Planat *et al.*,^{19} is their connection to aperiodicity. Schrödinger proposed the aperiodicity of living crystals.^{20} Planat *et al.*^{19} characterized aperiodic DNA sequences.^{21} We advanced this concept by introducing the so-called profinite completion *F̂*_{r}. A sequence *f*_{p}^{(l)} of finitely generated groups approaching *F*_{r} emerges by applying *l* repeated substitutions to the generators of *f*_{p}. However, all distinct groups *f*_{p}^{(l)} should possess the same profinite completion *F̂*_{r}, a profinite group.^{22} We present the details below in a manner that is accessible to a non-specialist reader. In the Methods section, we illustrate our mathematical concepts through a few simple pedagogical examples. In the Results section, we apply these concepts to cases of mRNA translation, microRNAs (miRNAs), and *m*^{6}*A* methylation. In the Discussion, we provide additional comments, a summary diagram, and perspectives.

## Methods and preliminary results

### Infinite finitely generated groups f_{p} and free groups F_{r}

#### TATA box

We start with a simple example of an infinite finitely generated group taken from the context of introns. The DNA sequence located in the core promoter region of many eukaryotic genes is the Goldberg–Hogness sequence, also known as the TATA box. This sequence contains a noncoding segment with repeated T and A base pairs. The TATA box serves as the binding site for the TATA-binding protein and other transcription factors in some eukaryotic genes. Its consensus sequence takes the form rel = TATAAAA. Variations in this consensus sequence, resulting from genetic polymorphism, can lead to diseases like Gilbert’s syndrome and immune suppression (https://en.wikipedia.org/wiki/TATA_box).

In our methodology, we defined the group *f*_{p} = 〈A, T|rel〉, which contains an infinite number of elements. There are numerous ways to investigate this group, but we opted for a specific one. This method involves calculating the number of conjugacy classes of subgroups of index *d* of *f*_{p} (a sequence we refer to as the card seq of *f*_{p}). The card seq of *f*_{p} for the selected TATA sequence is [1,1,2,3,2,8,7,10,18,28,···]. Interestingly, the group *H*_{3} = 〈A, T|A^{2} = T^{3}〉 has a similar card seq (at least up to the highest index we can reach with the calculations). The group *H*_{3}, as defined, is isomorphic to the so-called modular group *PSL*(2,Z), the projective special linear group of (2 × 2) matrices of determinant 1 with integer entries. This group has an intriguing topological interpretation as the fundamental group of the trefoil knot manifold. Thus, we find that the group *f*_{p} is close to *H*_{3} as the card seq of both groups is the same, but we can easily verify that *f*_{p} and *H*_{3} are not isomorphic. According to Planat *et al.*,^{23} the Hecke groups *H*_{q} = 〈A, T|A^{2} = T^{q}〉, with *q* = 3 or 4, have a card seq corresponding to healthy TATA box sequences. An *f*_{p} group for a TATA box with a card seq resembling that of a Hecke group with *q* ≠ 3 and *q* ≠ 4, or even that of a group slightly different from *H*_{3} and *H*_{4}, signifies Gilbert’s syndrome.
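The card seq can be illustrated computationally. Conjugacy classes of index-*d* subgroups of a finitely presented group correspond to transitive permutation representations on *d* points, counted up to relabeling of the points. The following is a minimal brute-force sketch built on that correspondence; it is not the software used to obtain the values quoted in this paper, the function names (`card_seq` and its helpers) are illustrative assumptions, and the enumeration is only practical for small indices.

```python
from itertools import permutations, product


def compose(p, q):
    """Return the permutation obtained by applying p first, then q."""
    return tuple(q[i] for i in p)


def inverse(p):
    """Return the inverse of the permutation p."""
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return tuple(inv)


def is_transitive(perms, n):
    """Check that the permutations move point 0 to every one of the n points."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for p in perms:
            if p[i] not in seen:
                seen.add(p[i])
                stack.append(p[i])
    return len(seen) == n


def satisfies(relator, images, n):
    """Check that the relator word evaluates to the identity permutation."""
    result = tuple(range(n))
    for letter in relator:
        result = compose(result, images[letter])
    return result == tuple(range(n))


def card_seq(generators, relator, max_index):
    """Count conjugacy classes of subgroups of index 1..max_index
    in the one-relator group <generators | relator>."""
    counts = []
    for n in range(1, max_index + 1):
        sym = list(permutations(range(n)))
        seen, classes = set(), 0
        for images in product(sym, repeat=len(generators)):
            if images in seen:
                continue
            if not satisfies(relator, dict(zip(generators, images)), n):
                continue
            if not is_transitive(images, n):
                continue
            classes += 1
            # Mark every relabeling (simultaneous conjugate) as already counted.
            for g in sym:
                g_inv = inverse(g)
                seen.add(tuple(compose(compose(g_inv, p), g) for p in images))
        counts.append(classes)
    return counts
```

For example, `card_seq("AT", "TATAAAA", 5)` returns the first five terms for the TATA relation, which can be compared with the sequence quoted above; an empty relator gives the card seq of a free group.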

#### Polyadenylation signals

For our second example, we select a sequence from the context of eukaryotic polyadenylation (https://en.wikipedia.org/wiki/Polyadenylation). Polyadenylation involves the addition of a poly(A) tail to an RNA transcript, usually an mRNA. A consensus poly(A) sequence takes the form rel1 = AAUAAA, which corresponds to a two-generator group of the form *f*_{p} = 〈A, U|rel1〉. The card seq of such a group is found to be [1,1,1,1,1,1,1,1,1,1,···], implying a single conjugacy class for each index. It appears that the free group *F*_{1} = 〈A, U|AU〉, of rank 1, has the same card seq as the *f*_{p} group with relation rel1, even though the two groups are not isomorphic. Another consensus poly(A) sequence takes the form rel2 = UGUAA, which corresponds to a three-generator group of the form *f*_{p} = 〈A, U, G|rel2〉. The card seq of such a group is found to be [1,3,7,26,97,624,4163,···]. Intriguingly, the free group *F*_{2} = 〈A, U, G|AUG〉, of rank 2, has the same card seq as the *f*_{p} group with relation rel2, despite the two groups not being isomorphic. From our perspective, DNA/RNA sequences that lead to *f*_{p} groups closely resembling a free group are considered healthy sequences.^{19,21,23} The standard poly(A) sequences mentioned earlier play a regulatory role in producing mature mRNA during translation. Sequences that generate an *f*_{p} group diverging from a free group *F*_{r} may be indicative of a disease.

### Aperiodic sequences, their attached groups f_{p} and free groups

In this subsection, we elucidate how a group *f*_{p}, with a card seq identified to be close to a free group *F*_{r}, can be linked to an aperiodic sequence and the profinite completion *F̂*_{r}.^{21,23} Consider the motif rel = *TTTATTA*, which serves as a consensus sequence for the transcription factor of the DBX gene in *Drosophila melanogaster* (fruit fly). This gene is involved in neuronal specification and differentiation. The group *f*_{p} = 〈A, T|rel〉 has the same card seq as the free group *F*_{1} of rank 1. Furthermore, by splitting rel into two segments rel = rel_{A}rel_{T} and applying the substitution maps *A* → rel_{A} = *TTTA*, *T* → rel_{T} = *TTA*, we generate the substitution sequence *S*_{DBX} = *A, T, AT, TTTATTA, TTATTATTATTTATTATTATTTA,* ···. On inspection, it is straightforward to observe that all finitely generated groups *f*_{p}^{(l)}, with their generators being *AT, TTTATTA, TTATTATTATTTATTATTATTTA,* ···, respectively, have the card seq of *F*_{1}.

As per the findings of Planat *et al.*,^{23} for a substitution rule to be considered aperiodic it must satisfy two conditions: (1) The substitution matrix *M* must be primitive, meaning that it is a non-negative, irreducible matrix such that *M*^{k} is strictly positive (all entries > 0) for some *k*. Strict positivity is denoted as *M*^{k} ≫ 0. (2) The Perron–Frobenius eigenvalue *λ*_{PF} must be irrational. It is worth noting that the Perron–Frobenius eigenvector of an irreducible non-negative matrix is the only one whose entries are all positive. With the letters ordered (*A*, *T*), and with each column counting the letters in the image of the corresponding letter, the aforementioned sequence has the substitution matrix:

$$M = \begin{pmatrix} 1 & 1 \\ 3 & 2 \end{pmatrix}.$$

One can verify that *M* is primitive as *M*^{2} ≫ 0 and *S*_{DBX} is aperiodic. Of note, numerous other genes have transcription factors with a motif rel generating an aperiodic sequence.^{21}
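The two aperiodicity conditions lend themselves to a short numerical check. The sketch below builds the substitution matrix from the substitution maps, tests primitivity by raising the matrix to successive powers (up to Wielandt's bound (*n* − 1)^{2} + 1), and estimates the Perron–Frobenius eigenvalue by power iteration. The function names and the matrix convention (entry [i][j] counts the letter i in the image of the letter j) are assumptions made for this illustration; deciding irrationality still requires inspecting the characteristic polynomial.

```python
def iterate(word, rules, steps):
    """Apply the substitution letter by letter, `steps` times."""
    for _ in range(steps):
        word = "".join(rules[letter] for letter in word)
    return word


def substitution_matrix(rules, alphabet):
    """Entry [i][j] counts occurrences of alphabet[i] in the image of alphabet[j]."""
    return [[rules[src].count(dst) for src in alphabet] for dst in alphabet]


def mat_mul(a, b):
    """Multiply two square integer matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def primitivity_index(m):
    """Smallest k with all entries of M^k positive, or None if M is not primitive.

    Wielandt's bound guarantees that a primitive n x n matrix satisfies
    M^k > 0 for some k <= (n - 1)^2 + 1, so the search can stop there.
    """
    n = len(m)
    power = m
    for k in range(1, (n - 1) ** 2 + 2):
        if all(x > 0 for row in power for x in row):
            return k
        power = mat_mul(power, m)
    return None


def perron_frobenius(m, iterations=500):
    """Estimate the dominant eigenvalue of a primitive matrix by power iteration."""
    n = len(m)
    v = [1.0 / n] * n  # positive start vector whose entries sum to 1
    lam = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)  # since v sums to 1, sum(Mv) estimates the eigenvalue
        v = [x / lam for x in w]
    return lam


# Substitution maps of the DBX motif quoted in the text.
DBX_RULES = {"A": "TTTA", "T": "TTA"}
DBX_MATRIX = substitution_matrix(DBX_RULES, "AT")
```

Calling `iterate("AT", DBX_RULES, 1)` reproduces the motif *TTTATTA*, and `perron_frobenius(DBX_MATRIX)` approximates the dominant root of *λ*^{2} − 3*λ* − 1, the characteristic polynomial of the matrix under the stated convention.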

### Aperiodic sequences and the profinite groups F̂_{r}

This section can be skipped without affecting the comprehension of the rest of the paper. It endeavors to answer the question of why the aforementioned groups *f*_{p}^{(l)} produce the same card seq as that of the free group *F*_{r}. The tentative answer is that the profinite completion of all groups *f*_{p}^{(l)} is the profinite group *F̂*_{r}.^{22} Here, we describe the necessary ingredients for the layperson.

A group *G* can be considered a topological group by applying the discrete topology, in which the elements of *G* are points of a discrete space, form a discontinuous sequence, and are isolated from each other. Every subset is open in the discrete topology. A profinite group is a topological group that, in a certain sense, is assembled from a system of finite groups. A profinite group requires a system of finite groups and group homomorphisms between them. Given a group *G*, there is a related profinite group *Ĝ* defined as the inverse limit *Ĝ* = lim_{←} *G/N* of the groups *G/N*, where *N* runs through the normal subgroups of *G* of finite index. A normal subgroup is a subgroup that remains invariant under conjugation by members of the group. Each finite quotient group corresponds to a normal subgroup *N* of *G* and the profinite completion *Ĝ* can be perceived as containing an analog of each of these normal subgroups. The profinite group *Ĝ* exhibits several properties: it is nonabelian, residually finite (meaning that for any nonidentity element *g* in *Ĝ*, there exists a finite quotient of *Ĝ* in which *g* is not the identity), and totally disconnected (meaning that the only connected subsets of *Ĝ* are singletons, sets containing only one element). In general, an explicit construction of profinite groups *Ĝ* cannot be obtained.

Consider first the free group *F*_{1} on a single generator. It can be described as a group with one generator, say *a*, and no relations. It consists of all possible finite strings that can be formed by combining the generator and its inverse. It is the infinite cyclic group *Z* = {1, *a*, *a*^{−1}, *a*^{2}, *a*^{−2}, *a*^{3}, *a*^{−3}, ···}. Now, we discuss the profinite completion of *F*_{1}. The profinite group *F̂*_{1} is the direct product of the rings of *p*-adic integers *Z*_{p}, across all primes *p*. It is often denoted as *Z*_{p}^{*}, as it corresponds to the elements of *Z*_{p} with a valuation of zero. The *p*-adic integers are a special class of numbers used in number theory and algebraic geometry.

Considering the profinite group *F̂*_{2},^{22} the subject is complex and connected to the so-called Belyi theorem, a fundamental result that establishes a connection between algebraic curves defined over the algebraic closure of the rationals, *Q̄*, and certain rational functions called Belyi functions. A complex algebraic curve can be represented as a branched covering of the Riemann sphere (the complex projective line P^{1}(C)) branched only over three points (usually taken as 0, 1, and ∞) if and only if the curve itself is defined over a number field, which is a finite extension of the field of rational numbers Q.

In other words, the Belyi theorem implies that an algebraic curve defined over a number field can be mapped to the Riemann sphere in such a way that the ramification (branching) is restricted to just three points. The rational functions that provide these branched coverings are known as Belyi functions. The significance of the Belyi theorem lies in the fact that it provides a method to study algebraic curves defined over number fields by analyzing their ramified coverings and the associated ‘dessins d’enfants’, which are combinatorial objects encoding the ramification data. Specifically, we have the crucial result that:

$$\pi_1^{\text{ét}}\left(\mathbb{P}^1_{\bar{\mathbb{Q}}} \setminus \{0, 1, \infty\}\right) \cong \hat{F}_2,$$

*i.e.*, the so-called étale fundamental group for the triply branched projective line is the profinite group *F̂*_{2}.

### SL_{2}(C) representations of groups f_{p} and a Groebner basis *G*

While the previous section describing profinite groups showcases the importance of algebraic geometry in the context of DNA/RNA sequences, it remains somewhat abstract. To address this, we can consider the representations of an *f*_{p} group over the space-time-spin group *SL*_{2}(C), as we did in previous studies.^{18,19,21} Representations of *f*_{p} in *SL*_{2}(C) are homomorphisms *ρ*: *f*_{p} → *SL*_{2}(C) with character *κ*_{ρ}(*g*) = tr(*ρ*(*g*)), *g* ∊ *f*_{p}. The notation tr(*ρ*(*g*)) signifies the trace of the matrix *ρ*(*g*). The set of characters is used to determine an algebraic set by taking the quotient of the set of representations *ρ* by the group *SL*_{2}(C), which acts by conjugation on representations.^{24,25} In such papers, we showed that the character variety of *f*_{p} is a set comprised of a sequence *X* of multivariate polynomials. A particular basis related to *X* is the Groebner basis *G*(*X*), whose factors define hypersurfaces.
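The reason characters satisfy polynomial relations can be checked numerically. For two matrices *A*, *B* in *SL*_{2}(C), with *x* = tr(*A*), *y* = tr(*B*), and *z* = tr(*AB*), the classical Fricke identity states that tr(*ABA*^{−1}*B*^{−1}) = *x*^{2} + *y*^{2} + *z*^{2} − *xyz* − 2. The sketch below verifies this identity on random matrices. It only illustrates the trace coordinates and is not the Groebner-basis computation itself; all function names are illustrative assumptions.

```python
import random


def mat_mul(a, b):
    """Multiply two 2 x 2 complex matrices given as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def mat_inv(a):
    """Inverse of a determinant-1 matrix (the adjugate)."""
    return ((a[1][1], -a[0][1]), (-a[1][0], a[0][0]))


def trace(a):
    return a[0][0] + a[1][1]


def random_sl2(rng):
    """Draw a, b, c at random and solve ad - bc = 1 for d (a is kept away from 0)."""
    a = complex(rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0))
    b = complex(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    c = complex(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    return ((a, b), (c, (1 + b * c) / a))


def commutator_trace(a, b):
    """Trace of A B A^-1 B^-1."""
    return trace(mat_mul(mat_mul(a, b), mat_mul(mat_inv(a), mat_inv(b))))


def fricke_polynomial(x, y, z):
    """Right-hand side of the Fricke identity in the trace coordinates."""
    return x * x + y * y + z * z - x * y * z - 2


def max_identity_error(samples=100, seed=0):
    """Largest deviation between both sides of the identity over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(samples):
        a, b = random_sl2(rng), random_sl2(rng)
        lhs = commutator_trace(a, b)
        rhs = fricke_polynomial(trace(a), trace(b), trace(mat_mul(a, b)))
        worst = max(worst, abs(lhs - rhs))
    return worst
```

When *A* and *B* commute, the commutator has trace 2, so the traces satisfy *x*^{2} + *y*^{2} + *z*^{2} − *xyz* − 4 = 0. Up to the sign change *x* → −*x*, this is the Cayley cubic that appears repeatedly in the Groebner bases discussed in this paper.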

Our previous paper focused on a possible algebraic approach to topological quantum computing.^{18} In two subsequent papers,^{19,21} we investigated *SL*_{2}(C) representations of short DNA/RNA sequences (*e.g.*, the consensus sequence of a transcription factor or the seed of a miRNA) and related them to a potential disease. For a two-generator group *f*_{p}, the factors are three-dimensional surfaces. In general, these surfaces can be classified by mapping them to a rational surface across five categories.^{19} Often encountered surfaces are degree *p* Del Pezzo surfaces where 1 ≤ *p* ≤ 9. A rational surface may be nonsingular, almost nonsingular (having only isolated singularities), or singular. Almost nonsingular surfaces are key in our context. A simple singularity is referred to as an A-D-E singularity and must be of the type *A*_{n}, *n* ≥ 1, *D*_{n}, *n* ≥ 4, *E*_{6}, *E*_{7}, or *E*_{8}. The A-D-E type is mirrored in the notation we employ. For instance, *S*^{(lA1,mA2,nA3,···)} denotes a surface containing *l* type *A*_{1}, *m* type *A*_{2}, *n* type *A*_{3} singularities, *etc*. A generic surface is the Cayley cubic we encountered in our previous papers, defined as *S*^{(4A1)} = *xyz* + *x*^{2} + *y*^{2} + *z*^{2} − 4.^{19}
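The notation *S*^{(4A1)} can be made concrete with a direct check on the Cayley cubic. A singular point is a point of the surface where all three partial derivatives vanish, and it is of type *A*_{1} (an ordinary double point) when the Hessian determinant is nonzero there. A short hand calculation shows that any singular point of this cubic has *x*^{2} = *y*^{2} = *z*^{2} = 4, so a search over a small integer grid finds all of them. The following sketch, with illustrative function names, performs that search.

```python
from itertools import product


def cayley(x, y, z):
    """The Cayley cubic xyz + x^2 + y^2 + z^2 - 4 from the text."""
    return x * y * z + x * x + y * y + z * z - 4


def gradient(x, y, z):
    """Partial derivatives of the Cayley cubic with respect to x, y, z."""
    return (y * z + 2 * x, x * z + 2 * y, x * y + 2 * z)


def hessian_det(x, y, z):
    """Determinant of the Hessian [[2, z, y], [z, 2, x], [y, x, 2]]."""
    return 8 - 2 * (x * x + y * y + z * z) + 2 * x * y * z


def integer_singular_points(bound=3):
    """Grid points in [-bound, bound]^3 where the cubic and its gradient vanish."""
    return [
        p
        for p in product(range(-bound, bound + 1), repeat=3)
        if cayley(*p) == 0 and gradient(*p) == (0, 0, 0)
    ]


# The four nodes of the Cayley cubic, each with a nondegenerate Hessian.
NODES = integer_singular_points()
```

The search returns four points, and the Hessian determinant is nonzero at each of them, which is consistent with the label 4*A*_{1}.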

For a three-generator group *f*_{p}, the factors of *G*(*X*) are seven-dimensional surfaces of the form *S*_{a,b,c,d}(*x,y,z*). Some of them belong to the Fricke family,^{19} which is associated with the four-punctured sphere. But for a chosen set of parameters *a,b,c,d*, the hypersurface reduces to an ordinary three-dimensional surface. For a four-generator group *f*_{p}, the factors of *G*(*X*) are 14-dimensional surfaces containing four copies of the form *S*(*x,y,z*), *S*(*x,u,v*), *S*(*y,u,v*), and *S*(*z,v,w*) for selected choices of eight parameters.

#### Groebner basis of the TATA box

The Groebner basis for the character variety associated with the *f*_{p} group with relation rel = TATAAAA of the TATA box, as discussed above, is found to be:

*G*_{TATA} = (*z*^{4} − *xy*^{2} − *xyz* + *x*^{2} + *y*^{2} + *yz* − 3*z*^{2} + *x* − 2) (*x*^{2}*z* − *xy* − *xz* + *y* − *z*) *S*^{(A2)} *S*^{(A4)} (*x*^{3} − *z*^{2} − 3*x* + 2),

where *S*^{(A2)} = *x*^{2}*y* − *z*^{3} − *xz* − *y* + 3*z* and *S*^{(A4)} = *xz*^{2} − *x*^{2} − *yz* − *x* + 2 are degree 3 Del Pezzo surfaces. The Groebner basis *G*_{TATA} comprises a degree 2 Del Pezzo surface (Fig. 1a) and a rational scroll, whose analytic expressions are the first two factors. Both surfaces are singular. The next two factors are surfaces with simple singularities of type *A*_{2} and *A*_{4}, respectively. The last factor represents a curve (not a surface).

#### Groebner basis for polyadenylation signals

For the first polyadenylation signal considered in the paragraph describing infinite finitely generated groups, the relation of the *f*_{p} group is rel1 = AAUAAA. The corresponding Groebner basis is:

*G*_{rel1} = 3 rational scrolls × *P*^{2} × *S*^{(4A1)} × *S*^{(A1)} × curve.

The Groebner basis *G*_{rel1} contains three rational scrolls, a surface birationally equivalent to the projective plane *P*^{2}, the Cayley cubic *S*^{(4A1)}, the degree 3 Del Pezzo surface *S*^{(A1)} = *x*^{2}*y* − *xz*^{2} – *xz* + *yz* + *x* − *y* (Fig. 1b) and a curve.

For the second polyadenylation signal considered above in the paragraph describing groups *f*_{p} and *F*_{r}, the relation of the *f*_{p} group is rel2 = UGUAA. The factors of *G*(*X*) are seven-dimensional hypersurfaces *S*_{a,b,c,d}(*x,y,z*). However, by choosing specific parameters, such as *S*_{0,0,0,0}(*x,y,z*) or *S*_{1,1,1,1}(*x,y,z*), we obtained three-dimensional surfaces. These were found to be degree 3 Del Pezzo surfaces with simple singularities of the form *S*^{(lA2)}, with *l* = 1, 2, or 3, quadrics, or curves.

#### Groebner basis of the transcription factor of DBX gene

For the *DBX* gene studied in the paragraph on aperiodic sequences, the Groebner basis takes the form *G*_{DBX} = scroll × *P*^{2} × *S*^{(A4)} × *S*^{(A2)} × *S*^{(4A1)} × curve, where scroll = *y*^{2}*z* − *xy* − *yz* + *x* − *z* and *P*^{2} = *z*^{4} − *x*^{2}*y* + *xz* − 4*z*^{2} + *y* + 2 are singular. The other factors are *DP*^{3} surfaces with isolated singularities, namely *S*^{(A4)} = *yz*^{2} − *y*^{2} − *xz* − *y* + 2 and *S*^{(A2)} = *z*^{3} − *xy*^{2} + *yz* + *x* − 3*z*, together with the Cayley cubic *S*^{(4A1)} and curve = *y*^{3} − *z*^{2} − 3*y* + 2.

## Further results

In this section, we describe additional results related to mRNA metabolism and miRNA.

### Algebraic geometry of mRNA translation

#### Shine-Dalgarno box

Ribosomal RNA is a type of noncoding RNA and is the main component of a macromolecular machine, called the ribosome, whose role is to ensure mRNA translation. The initiation of translation needs the recognition of the appropriate sequences on the mRNA by the ribosome. A major factor in this recognition is an mRNA–ribosomal RNA interaction first proposed by Shine and Dalgarno.^{26} They proposed that the ribosomal nucleotides recognize the complementary purine-rich sequence rel3 = AGGAGGU, which is found approximately eight bases upstream of the start codon AUG in a number of mRNAs found in viruses that affect *Escherichia coli*.

Let us study the group *f*_{p} = 〈A, G, U|rel3〉. The card seq of *f*_{p} is found to be the same as that of the free group *F*_{2}. The *SL*_{2}(C) character variety is a scheme *X* whose Groebner basis *G*(*X*) comprises seven-dimensional surfaces *S*_{a,b,c,d}(*x,y,z*). By projecting to three dimensions, one gets surfaces like *S*_{0,0,0,0}(*x,y,z*) and *S*_{1,1,1,1}(*x,y,z*) as in the paragraph describing *SL*_{2}(C) representations of groups *f*_{p}. We find degree 3 Del Pezzo surfaces with isolated singularities *S*^{(A1)} = *x*^{2}*y* + *yz*^{2} + 4*xz* + 4*y* and *x*^{2}*y* + *yz*^{2} + *x* + *z*^{2} + 6*xz* + 5*y* − 6*z* − 7, *S*^{(A2)} = *xyz* + 2*x*^{2} + *z*^{2} + 4 and *S*^{(A4)} = *xyz* + 3*x*^{2} + *z*^{2} − 5*z*, quadrics, and curves.

#### Kozak consensus sequence

The Kozak consensus sequence is a nucleotide motif that functions as the protein translation initiation site in most eukaryotic mRNA transcripts.^{27} The small (40S) subunit of eukaryotic ribosomes binds initially at the capped 5^{′}-end of the mRNA and then migrates, stopping at the first AUG codon in a favorable context for initiating translation. In eukaryotes, the Kozak sequence ensures that a protein is correctly translated from the genetic message, mediating ribosome assembly and translation initiation. A sequence logo of the most conserved bases around the initiation codon AUG for human mRNAs may be found in the first caption of the Kozak consensus sequence entry (https://en.wikipedia.org/wiki/Kozak) as rel4 = *ACCAUGGC*.

Let us study the group *f*_{p} = 〈A, C, G, U|rel4〉. The card seq of *f*_{p} is found to be the same as that of the free group *F*_{3} of rank 3. This group can be linked to an aperiodic sequence by following the steps given in the paragraph describing aperiodic sequences. By splitting rel4 into four segments rel4 = rel_{A}rel_{C}rel_{G}rel_{U} and applying the substitution maps *C* → rel_{C} = *A*, *A* → rel_{A} = *CCAUG*, *U* → rel_{U} = *G*, *G* → rel_{G} = *C*, we generated the substitution sequence:

*S*_{Kozak} = *C, A, U, G, CAUG, ACCAUGGC, CCAUGA*^{2}*CCAUGGC*^{2}*A,* ···.

On inspection, it is straightforward to observe that all finitely generated groups *f*_{p}^{(l)} with their generators being *CAUG, ACCAUGGC, CCAUGA*^{2}*CCAUGGC*^{2}*A,* ···, respectively, have a card seq of *F*_{3}. With the letters ordered (*A*, *C*, *G*, *U*), and with each column counting the letters in the image of the corresponding letter, the aforementioned sequence has the substitution matrix:

$$M = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}.$$

One can verify that *M* is primitive as *M*^{4} ≫ 0 and *λ*_{PF} ≈ 2.2055694 is the only real (and irrational) solution of the equation *x*^{3} − 2*x*^{2} − 1 = 0. Conditions (1) and (2) for aperiodic sequences are satisfied, implying that the substitution *S*_{Kozak} is aperiodic. Rittaud discussed the connection of the latter Perron–Frobenius eigenvalue to random Fibonacci sequences.^{28}

Mutation of a purine at position −3 with respect to the AUG codon is known to be associated with diseases including a type of thalassemia owing to a bad initiation of alpha-globin.^{27} In our approach, the mutation from rel4 to rel4′ = CCCAUGGC leads to a substitution matrix *M*′ that is no longer primitive, so that the property of aperiodicity of the sequence is lost. However, the card seq of the associated *f*_{p} group is still that of the free group *F*_{3}. No other substitution in the sequence rel4′ can be found to restore the aperiodicity.

### Algebraic geometry of miRNAs

miRNAs are small, single-stranded, noncoding RNA molecules containing approximately 22 nucleotides. miRNAs play crucial roles in RNA silencing and post-transcriptional regulation of gene expression by specifically targeting certain mRNAs for degradation and translational repression (https://en.wikipedia.org/wiki/MicroRNA).^{29} miRNA genes are typically transcribed by RNA polymerase II (Pol II), which binds to a promoter located near the DNA sequence, encoding what will become the hairpin loop of a precursor (pre)-miRNA. Pre-miRNAs are approximately 70 nucleotides long and fold into imperfect stem-loop structures. A miRNA consists of a duplex comprising two strands (−5p and −3p). However, a single strand is selected into the RNA-induced silencing complex to serve as a template during the transcription of a complementary mRNA.^{30,31} For details of the miRNA sequences, we use the miRBase database (https://www.mirbase.org/).^{32,33} It should be emphasized that a given miRNA may have hundreds of different mRNA targets and a single target may be regulated by multiple miRNAs. For previous discussions of how to define an *f*_{p} group from the seed of a miRNA, the reader may consult a recent review.^{19} Below, we focus on other examples.

#### miRNA hsa-mir-122

mir-122 is a tissue-specific miRNA that is highly expressed in the liver.^{34} It is involved in cholesterol accumulation and fatty acid metabolism. It has a leading role in controlling the hepatitis C virus.^{35,36} The seed region for mir-122-5p is seed0 = GGAGUGU. The corresponding group *f*_{p} = 〈A, G, U|seed0〉 has the card seq of the free group *F*_{2}. Let us first check if the seed sequence is aperiodic. By splitting seed0 into three segments seed0 = seed_{A}seed_{G}seed_{U} and applying the substitution maps *A* → seed_{A} = *GG*, *G* → seed_{G} = *AGU*, *U* → seed_{U} = *GU*, one can check that the finitely generated groups *f*_{p}^{(l)} with generators *GGAGUGU, AGUAGUGGAGUGUAGUGU,* ··· possess the card seq of the free group *F*_{2}. Following the method described in the section on aperiodic sequences, their attached groups and free groups, and ordering the letters as (*A*, *G*, *U*) with each column counting the letters in the image of the corresponding letter, one gets the (primitive) substitution matrix:

$$M = \begin{pmatrix} 0 & 1 & 0 \\ 2 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$$

The characteristic polynomial *λ*^{3} − 2*λ*^{2} − 2*λ* + 2 has three real roots. The largest one is the (irrational) Perron–Frobenius eigenvalue *λ*_{PF} ≈ 2.481194. One concludes that the sequence seed0 is aperiodic.

Let us now look at the Groebner basis for the *SL*_{2}(C) representation of *f*_{p} with the method described above. One obtains:

*G*_{mir-122-5p}(0,0,0,0) = 8*yz*(2 − *z*^{2}) and *G*_{mir-122-5p}(1,1,1,1) = −4*z*^{2}(*x* − *z*^{2} + *z* + 1)(*y* + *z*^{3} − *z*^{2} − 2*z*).

One can check that, for all values of the parameters, *G*_{a,b,c,d}(*x, y, z*) only contains factors that are curves and not surfaces.

#### miRNA hsa-mir-503

The slowest evolving miRNA gene in the human species (hsa) is hsa-mir-503 (https://www.mirbase.org/). It regulates gene expression in various pathological processes of diseases, including carcinogenesis, angiogenesis, tissue fibrosis, and oxidative stress.^{37} The seed region of mir-503-5p is seed1 = AGCAGCGG. The corresponding group *f*_{p} = 〈A, C, G|seed1〉 has the card seq of the free group *F*_{2}. For this group, the Groebner basis with parameters (*a,b,c,d*) = (0,0,0,0) is quite simple: *G*_{mir-503-5p}(0,0,0,0) = *S*^{(4A1)}(*x,y,z*), which is the already mentioned Cayley cubic. For (*a,b,c,d*) = (1,1,0,0), *G*_{mir-503-5p}(1,1,0,0) = −3*xyz* *κ*_{3}(*x,y,z*), where *κ*_{3}(*x,y,z*) is the Fricke surface described by Planat *et al.*^{38} For (*a,b,c,d*) = (1,1,1,1), there are several more polynomials, one of which defines the Fricke surface *xyz* + *x*^{2} + *y*^{2} + *z*^{2} − 2*x* − *y* − 2 = 0. The considered seed region for mir-503-3p is GGGUAUU. The surfaces in the Groebner basis are very simple in this case, and no simple singularities exist within them.

#### miRNA hsa-mir-146a

mir-146 is primarily involved in the regulation of inflammation and other processes functioning in the innate immune system. It has a role in neuropathogenesis. The considered seed region for hsa-mir-146a-5p is seed2 = GAGAAC (https://www.mirbase.org/). Again, the corresponding group *f*_{p} = 〈A, C, G|seed2〉 has the card seq of the free group *F*_{2}. The Groebner basis with parameters (*a,b,c,d*) = (0,0,0,0) is *G*_{hsa-146a-5p}(0,0,0,0) = (*xz* + *y* + 2) (*y* − *z*^{2} + 2)^{2} (*x*^{2} + *z*^{2} − 2*y* − 4) *S*^{(3A2)}, where *S*^{(3A2)} = *z*^{3} − *xy* − 2*yz* − 2*x* − 4*z*. The Groebner basis with parameters (*a,b,c,d*) = (1,1,1,1) is of the form *G*_{hsa-146a-5p}(1,1,1,1) = *DP*^{4} × *S*^{(2A2)} × quadric × curves, where *DP*^{4} is a degree 4 Del Pezzo surface.

#### miRNAs and disease

As described previously,^{19} a potential disease is associated with *f*_{p} groups that fail to satisfy at least one of three requirements: (1) the card seq of *f*_{p} should be that of a free group *F*_{r}; (2) the generating sequence should be aperiodic; and (3) the *SL*_{2}(C) character variety of *f*_{p} should have a Groebner basis devoid of isolated singularities even though the *f*_{p} group may have the card seq of a free group.^{19} Following these criteria, the sequence hsa-mir-122-5p is healthy but the sequences hsa-mir-503-5p and hsa-mir-146a-5p are not because criterion three is not satisfied. Additional examples can be found in our previous study.^{19}

In addition to isolated singularities, the Groebner basis may contain unique surfaces that are not simply singular. The *DP*^{4} surface in *G*_{hsa-146a−5p}(1,1,1,1) is an example of a singular surface. Further mathematical evaluation is required to investigate these surfaces.^{39} However, we will not include them in this review.

## Discussion

Figure 2 summarizes our key results. Given a short DNA/RNA sequence rel that represents a consensus sequence in a transcription factor, the seed of a miRNA, or a relevant sequence in mRNA recognition and processing, we constructed a finitely generated group *f*_{p}. The architecture of subgroups, card seq, within this group was computed, as described in the subsection about the infinite finitely generated groups *f*_{p}. If the *f*_{p} card seq matches that of the free group *F*_{r} (of rank *r* = nt − 1, where nt is the number of distinct nucleotides), we proceed to path four; otherwise, a potential disease could be in sight (path three). After reaching path four, the next step involves checking the aperiodicity of rel and the corresponding *f*_{p} group, as described in the subsection about aperiodic sequences and their attached groups *f*_{p}. The final step is to examine the presence (or absence) of isolated singularities in the Groebner basis *G* for the *SL*_{2}(C) character variety associated with *f*_{p}, as outlined in the subsection about *SL*_{2}(C) representations of groups *f*_{p}. For a healthy sequence, the path concludes at six, while a potential disease may be indicated if the path ends at three, seven, or eight.

In Table 1, we provide several examples of paths.^{23,31,36,37,40} All three checks can be performed, even if paths 4 or 5 are not followed. For instance, the termination {7,8} signifies that the sequence fails both in being aperiodic and in being devoid of simple singularities. For sequences with four distinct nucleotides, like the sequence of transcription factor FOS or the Kozak sequence rel4, it is difficult to make a conclusion about the risk of a disease. The generic Groebner basis^{1} *G*(*x,y,z*) always contains a surface with isolated singularities such as *S*^{(4A1)} and *S*^{(3A1)} and there are four copies of them. The termination {6,8} applies for this case.

Sequence | rel | Path |
---|---|---|
EGR1^{23} | GCGTGGGCG | 1→2→4→5→6 |
FOS^{23} | TGAGTCA | 1→2→4→5→{6,8} |
Nanog^{23} | TAATGG | 1→2→4→{7,8} |
DBX | TTTATTA | 1→2→4→5→8 |
TATA | TATAAAA | 1→2→3→{7,8} |
Poly(A) (rel1) | AAUAAA | 1→2→3→{7,8} |
Poly(A) (rel2) | UGUAA | 1→2→4→{7,8} |
Shine-Dalgarno (rel3) | AGGAGGU | 1→2→4→5→8 |
Kozak (rel4) | ACCAUGGC | 1→2→4→5→{6,8} |
Kozak (rel4′) | CCCAUGGC | 1→2→4→7 |
hsa-mir-122-5p^{36} (seed0) | GGAGUGU | 1→2→4→5→6 |
hsa-mir-132-5p (https://fr.wikipedia.org/wiki/Micro-ARN_7 ) | CCGUGGC | 1→2→4→5→6 |
mir-503-5p (seed1)^{37} | AGCAGCGG | 1→2→5→8 |
mir-146a-5p (seed2)^{40} | GAGAAC | 1→2→{7,8} |
hsa-mir-7-5p (https://en.wikipedia.org/wiki/MiR-132 ) | GGAAGA | 1→2→{3,7,8} |
hsa-mir-7-5p | GGAAGAC | 1→2→4→5→6 |
hsa-mir-7-3p | AACAAAU | 1→2→4→7 |
hsa-mir-155-3p^{31,40} | UCCUAC | 1→2→4→{7,8} |
hsa-mir-155-3p | UCCUACA | 1→2→3 |
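The three checks described in this Discussion can be sketched as a small decision function. This is only an illustrative sketch: the function names and the boolean inputs are hypothetical, and the edge labels (3 for a card seq that differs from the free group, 7 for a non-aperiodic sequence, 8 for an isolated singularity, 6 for a sequence that passes all checks) are read off the text and Table 1 rather than from Figure 2 itself. The ambiguous termination {6,8} reported for sequences with four distinct nucleotides is deliberately not modeled.

```python
from typing import List


def path_terminations(matches_free_group: bool,
                      is_aperiodic: bool,
                      has_isolated_singularity: bool) -> List[int]:
    """Return the terminal edge labels for one sequence.

    Assumed labelling (inferred from the text, not from Figure 2):
      3 -> card seq of the group differs from that of the free group
      7 -> the sequence is not aperiodic
      8 -> the Groebner basis contains an isolated singularity
      6 -> all three checks are passed
    """
    failures = []
    if not matches_free_group:
        failures.append(3)
    if not is_aperiodic:
        failures.append(7)
    if has_isolated_singularity:
        failures.append(8)
    # A sequence with no failed check ends at edge 6.
    return failures if failures else [6]


def is_healthy(terminations: List[int]) -> bool:
    """A path is read as healthy only when it ends at edge 6 alone."""
    return terminations == [6]


# Free-group card seq, not aperiodic, isolated singularity present.
print(path_terminations(True, False, True))  # [7, 8]
```

All three checks are evaluated independently here, which mirrors the remark that they can be performed even if paths 4 or 5 are not followed.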

### Algebraic geometry of m^{6}A modifications

As mentioned in the Introduction, a subfield of epigenetics deals with post-transcriptional mRNA modifications. *m*^{6}*A* is the most frequent modification in most eukaryotes. However, *m*^{6}*A* is also present in bacteria, with the consensus motif *GCCAG*.^{41,42} An interesting aspect is that the mRNA *m*^{6}*A* motif in bacteria is distinct from the consensus motif in eukaryotes (*RRACH*).^{43} In Table 2, we provide details of the group generated by these sequences, indicating whether the sequence is aperiodic and whether the Groebner basis of its character variety contains an isolated singularity. The path in the diagram of Figure 2 is shown in Table 1.

Sequence | Group | Aperiodic | Groebner basis | Path |
---|---|---|---|---|
Bacterial | | | | |
GCCAG | F_{2} | 1.83928 | No | 1→2→4→5→6 |
Eukaryote | | | | |
AAACA | F_{1} | No | | 1→2→4→{7,8} |
AAACC | H_{3} | No | | 1→2→{3,7} |
AAACU | F_{2} | No | S^{(A2)}, S^{(A1A2)} No | 1→2→4→7 |
GGACA | F_{2} | 1.83928 | No | 1→2→4→5→8 |
GGACC | F_{2} | No | S^{(A2)}, S^{(A2A2)} No | 1→2→4→7 |
GGACU | F_{3} | No | Unknown | 1→2→4→7 |
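The Group column of Table 2 can be cross-checked against the rank rule *r* = nt − 1 from the Discussion. The sketch below assumes that nt denotes the number of distinct nucleotides in the sequence, an interpretation that is consistent with every row of Table 2 but is not spelled out in this section. The helper name `expected_free_rank` and the dictionary are hypothetical, and only the group labels are compared; the card seq itself is not computed.

```python
def expected_free_rank(sequence: str) -> int:
    """Rank r = nt - 1, with nt taken as the number of distinct nucleotides."""
    return len(set(sequence.upper())) - 1


# Group labels as listed in Table 2 (F = free group, H3 = non-free group).
table2_groups = {
    "GCCAG": "F2",
    "AAACA": "F1",
    "AAACC": "H3",
    "AAACU": "F2",
    "GGACA": "F2",
    "GGACC": "F2",
    "GGACU": "F3",
}

for seq, group in table2_groups.items():
    expected = f"F{expected_free_rank(seq)}"
    verdict = "free-group label" if group == expected else "not free (path three)"
    print(f"{seq}: listed {group}, expected {expected}, {verdict}")
```

Under this reading, AAACC is the only sequence whose listed group differs from the expected free group, in line with its path passing through edge 3.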

Only the bacterial sequence leads to a path terminating at edge 6 of the diagram of Figure 2. In the closest eukaryotic sequence, GGACA (from the viewpoint of group analysis), isolated singularities are found, such as the degree 3 Del Pezzo surface *S*^{(A2A2)} = *y*^{3} − 2*xz* − 4*y*. The other sequences are not aperiodic. From the biological point of view, it is known that an appropriate level of *m*^{6}*A* methylation is beneficial, but driving it artificially may be risky because it may destroy the delicate balance of regulation within the mRNA.
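The value 1.83928 listed for the two aperiodic sequences of Table 2 agrees numerically with the tribonacci constant, the real root of *x*^{3} = *x*^{2} + *x* + 1 (approximately 1.8392868). The sketch below recovers that root by bisection. It assumes, without confirmation from this section, that the Aperiodic column reports the leading eigenvalue of a three-letter substitution; the function name is hypothetical.

```python
def tribonacci_constant(iterations: int = 60) -> float:
    """Real root of x**3 - x**2 - x - 1 on [1, 2], found by bisection."""

    def p(x: float) -> float:
        return x**3 - x**2 - x - 1

    low, high = 1.0, 2.0  # p(1) = -2 < 0 and p(2) = 1 > 0 bracket the root
    for _ in range(iterations):
        mid = (low + high) / 2
        if p(mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


print(tribonacci_constant())
```

The tabulated figure corresponds to this root truncated to five decimals.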

## Conclusions

Our approach is comprehensive and can be applied in numerous contexts beyond those we have considered thus far. It has the potential to impact the search for underlying causes of diseases and aid in the discovery of therapeutic strategies. The e-code, the set of processes that reveal and execute gene expression, has a sophisticated structure that our mathematical approach aims to elucidate.

## Abbreviations

- m6A: *N*^{6}-methyladenosine
- mRNA: messenger RNA
- miRNA: microRNA

## Declarations

### Acknowledgement

The first author would like to acknowledge the contribution of the COST Action CA21169, supported by COST (European Cooperation in Science and Technology).

### Data share statement

Computational data are available from the authors upon reasonable request.

### Funding

Funding was obtained from Quantum Gravity Research in Los Angeles, CA, USA.

### Conflict of interest

The authors declare that they have no conflicts of interest.

### Authors’ contributions

Conceptualization (MP, FF, KI), methodology (MP, DC, RA), software (MP), validation (RA, FF, DC, MMA), formal analysis (MP, MMA), investigation (MP, DC, FF, MMA), writing and original draft preparation (MP), writing, review and editing (MP), visualization (FF, RA), supervision (MP, KI), project administration (KI), and funding acquisition (KI). All authors have read and approved the final version of the manuscript.