
INDEX

CHAPTER ZERO
   PROLEGOMENA
   PREAMBLE
CHAPTER ONE:   INTRODUCTION
   Section One:    Models and Reality
   Section Two:   Prototypes
   Section Three:   Value Engineering
CHAPTER TWO:   Artificial Intelligence
   Section One:   Introduction to Artificial Intelligence
   Section Two:   History of Artificial Intelligence
   Section Three:   Classes of Artificial Intelligence
CHAPTER THREE:   Introduction to Neural Networks
   Section One:   History of Neural Networks
   Section Two:    Types of Neural Networks
CHAPTER FOUR:   DEEP LEARNING
   Section One:   Introduction
   Section Two:   Deep Learning Algorithms
      Part One:   Convolutional Neural Networks (CNNs)
      Part Two:   Long Short Term Memory Networks (LSTMs)
      Part Three:   Recurrent Neural Networks (RNNs)
      Part Four:   Generative Adversarial Networks (GANs)
      Part Five:   Radial Basis Function Networks (RBFNs)
      Part Six:   Multilayer Perceptrons (MLPs)
      Part Seven:   Self Organizing Maps (SOMs)
      Part Eight:   Deep Belief Networks (DBNs)
      Part Nine:   Restricted Boltzmann Machines (RBMs)
      Part Ten:   Autoencoders
CHAPTER FIVE:   AI-Background
   Section One:   Perceptron
   Section Two:   Multilayer Perceptron
   Section Three:   Feed-Forward Network Mapping
   Section Four:   Back Propagation
CHAPTER SIX:   Statistical Methods
   Section One:   Introduction
   Section Two:   Multilayer Perceptron
   Section Three:   Feed-Forward Network Mapping
   Section Four:   Back Propagation
Appendix:   MATHEMATICAL PRELIMINARIES
   Section One:   Introduction to Functions
   Section Two:   Complex Numbers
   Section Three:   Calculus
   Section Four:   Fourier Series
   Section Five:   Fourier Transforms
   Section Six:   Differential Equations
   Section Seven:   Vector Calculus
   Section Eight:   Linear Algebra

CHAPTER ZERO

1.   This webpage is being developed by the Senior Engineer of AscenTrust, LLC.
2.   The Senior Engineer is a graduate of Electrical and Computer Engineering (University of Alberta, Edmonton) and graduated with Distinction in 1969.
3.   This document is meant as an introduction to the broad subject of Artificial Intelligence, with an emphasis on Artificial Neural Networks and their use in computational weather models.
4.   This webpage is in a state of ongoing development, and new material will appear from time to time as the workload permits.
5.   The text and mathematical equations below are rendered in HTML using a web-based LaTeX and JavaScript rendering engine.

PROLEGOMENA

Artificial Intelligence (AI) has a long and checkered history. Academics are fond of making grandiose claims. Unfortunately, the choice of research topics in universities and governmental institutions is mainly directed by the funding requirements of the institution. The AI community has been blessed and cursed with this conundrum. The desire, on the part of researchers, to obtain funding for their research has led to extreme predictions concerning the power of AI software systems.

The majority of modern research in AI systems is concerned with the creation of fairly simple, linear software algorithms. These algorithms are not meant to simulate human cognitive processes. Software algorithms cannot simulate human cognitive processes because we do not understand the interactions of the left and right hemispheres of the brain, much less the interaction of the Soul with the left and right hemispheres of the brain.

Cognition is the "mental action or process of acquiring knowledge and understanding through thought, experience, and the senses". It encompasses all aspects of intellectual functions and processes such as: perception, attention, thought, imagination, intelligence, the formation of knowledge, memory and working memory, judgment and evaluation, reasoning and computation, problem-solving and decision-making, comprehension and production of language. Cognitive processes use existing knowledge and discover new knowledge.

Section One: DEFINITION

Before we attempt to define Intelligence, we need to address the notions of Ontology, Epistemology and Worldview.

Our definitions of ontology and epistemology will be taken from their philosophical and historical roots.

Ontology: Though the use of the term has its roots in the Greek Physics and Metaphysics, its modern use goes back to Clauberg (1622-1665); its special application to the first discussions of metaphysical concepts was made by Christian von Wolff (1679-1754). Prior to this time "the science of being" had retained the titles given it by its founder Aristotle: "first philosophy", "theology", "wisdom". The term "metaphysics" was given a wider extension by Wolff, who divided "real philosophy" into general metaphysics, which he called ontology, and special metaphysics, under which he included cosmology, psychology, and theodicy.

Epistemology: In this document we shall consider epistemology in its historical and broader meaning, which is the usual one in English, as applying to the theory of knowledge. Epistemology is therefore understood as that part of Philosophy which describes, analyses and examines the facts of knowledge as such, and then tests chiefly the value of knowledge and of its various kinds, its conditions of validity, range and limits (critique of knowledge).

Worldview: One's worldview comprises a number of basic beliefs which provide the philosophical and ethical underpinning of an individual's behavioral patterns. These basic beliefs cannot, by definition, be proven (in the logical sense) within the worldview – precisely because they are axioms, and are typically argued from rather than argued for.

If two different worldviews have sufficient common beliefs it may be possible to have a constructive dialogue between them.

On the other hand, if different worldviews are held to be basically incommensurate and irreconcilable, then the situation is one of cultural relativism and would therefore incur the standard criticisms from philosophical realists. Additionally, religious believers might not wish to see their beliefs relativized into something that is only "true for them". Subjective logic is a belief-reasoning formalism where beliefs explicitly are subjectively held by individuals but where a consensus between different worldviews can be achieved.

Dualism: The term dualism, in this document, is employed in opposition to the atheistic presupposition of monism (the view that the Cosmos consists of only one substance), to signify the ordinary view that the existing universe contains two radically distinct kinds of being or substance — matter and spirit, body and mind. Dualism is thus opposed to both materialism and idealism. Idealism, however, of the Berkeleyan type, which maintains the existence of a multitude of distinct substantial minds, may, along with dualism, be described as pluralism. In the philosophy of mind, dualism is the view which holds that mental phenomena are, at least in certain respects, not physical phenomena, or that the mind and the body are distinct and separable from one another.

Intelligence: Human intelligence is marked by complex cognitive feats and high levels of motivation and self-awareness. Intelligence enables humans to remember descriptions of things and use those descriptions in future behaviors. It is a cognitive process. It gives humans the cognitive abilities to learn, form concepts, understand, and reason, including the capacities to recognize patterns, innovate, plan, solve problems, and employ language to communicate. Intelligence enables humans to experience and think.

Intelligence is different from learning. Learning refers to the act of retaining facts and information or abilities and being able to recall them for future use, while intelligence is the cognitive ability of someone to perform these and other processes. There are various tests, such as the Intelligence Quotient (IQ) test, which attempt to quantify intelligence.

Artificial Intelligence: Academics involved in the science and engineering of computing systems have variously proposed definitions of intelligence that include the intelligence demonstrated by machines. Some of these definitions are meant to be general enough to encompass human intelligence as well. An intelligent agent can be defined as a system that perceives its environment and takes actions which maximize its chances of success. It is clear that these systems cannot be modelled only in software.

Kaplan and Haenlein define artificial intelligence as "a system's ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation". Progress in artificial intelligence has been limited to very narrow ranges of mainly software computational modeling, which includes machine learning, deep learning and deep neural networks. All existing AI systems lack the most fundamental part of what makes a human being different from all other entities in the Creation, that being the Soul.

CHAPTER ONE: INTRODUCTION

Section one: Models and Reality

The Academic idea that modelling and simulations can bring us to an understanding of reality is completely false. Certainly, modelling has become an essential and inseparable part of many scientific disciplines, each of which has its own ideas about specific types of modelling. The following was said by John von Neumann:

... the sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work—that is, correctly to describe phenomena from a reasonably wide area.

A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way. All models are simulacra, that is, simplified reflections of reality that, despite being approximations, can be extremely useful. Building and disputing models is fundamental to the academic enterprise. Complete and true representation is impossible, but academic debate often concerns which is the better model for a given task, e.g., which is the more accurate climate model for seasonal forecasting.

Attempts to formalize the principles of the empirical sciences use an interpretation to model reality, in the same way logicians axiomatize the principles of logic. The aim of these attempts is to construct a formal system that will not produce theoretical consequences that are contrary to what is found in reality. Predictions or other statements drawn from such a formal system mirror or map the real world only insofar as these scientific models are true.

For the scientist, a model is also a way in which the human thought processes can be amplified. For instance, models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate and gain intuition about the entity, phenomenon, or process being represented.

Section Two:   Prototypes

A prototype is an early physical model of a product built to test a concept or process. It is a term used in a variety of contexts, including semantics, design, electronics, and software programming. A prototype is generally used to evaluate a new design, to enhance precision by system analysts and users, and, in our case, to do Value Engineering on the product. Prototyping serves to provide specifications for a real, working system rather than a theoretical one. In some design workflow models, creating a prototype (a process sometimes called materialization) is the step between the formalization and the evaluation of an idea.

Prototypes explore the different aspects of an intended design:

  1. A proof-of-principle prototype serves to verify some key functional aspects of the intended design, but usually does not have all the functionality of the final product.
  2. A working prototype represents all or nearly all of the functionality of the final product.
  3. A visual prototype represents the size and appearance, but not the functionality, of the intended design. A form study prototype is a preliminary type of visual prototype in which the geometric features of a design are emphasized, with less concern for color, texture, or other aspects of the final appearance.
  4. A user experience prototype represents enough of the appearance and function of the product that it can be used for user research.
  5. A functional prototype captures both function and appearance of the intended design, though it may be created with different techniques and even different scale from final design.
  6. A paper prototype is a printed or hand-drawn representation of the user interface of a software product. Such prototypes are commonly used for early testing of a software design, and can be part of a software walkthrough to confirm design decisions before more costly levels of design effort are expended.

Section Three:   Value Engineering

Value Engineering is a systematic analysis of the various components and materials of a system (the system under discussion is a Quantum Computer) in order to improve its performance or functionality. In the case of the Quantum Computer Project, the first round of Value Engineering will consist of an analysis of the optical portions of the system; there are two parts to the optics of the Paul Trap Quantum Computer. Value is understood as the ratio of function to cost, and can therefore be improved by either improving the function or reducing the cost. It is a primary tenet of value engineering that basic functions be preserved and not be reduced as a consequence of pursuing value improvements. The term "value management" is sometimes used as a synonym of "value engineering", and both promote the planning and delivery of projects with improved performance.
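In symbols, the working definition of value used throughout this section can be written as

$$ \text{Value} = \frac{\text{Function}}{\text{Cost}} $$

so that, for a fixed set of basic functions, any reduction in life-cycle cost is an increase in value, and any gain in function at fixed cost is likewise an increase in value.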

Value engineering is a key part of all Research and Development Projects within the project management, industrial engineering or architecture body of knowledge, as a technique in which the value of a system's outputs is optimized by crafting a mix of performance (function) and costs. It is based on an analysis investigating systems, equipment, facilities, services, and supplies, with the aim of providing the necessary functions at the lowest life-cycle cost while meeting the required targets in performance, reliability, quality, and safety. In most cases this practice identifies and removes unnecessary expenditures, thereby increasing the value for the manufacturer and/or their customers. In doing so it must not disregard necessary expenditures such as equipment maintenance and the relationships between employees, equipment, and materials: a machinist cannot meet a quota if the drill press is inoperable for lack of maintenance, or if the material handler has not completed the daily checklist, tally, log, invoice, and accounting of the maintenance and materials each machinist needs to maintain the required productivity.

VE follows a structured thought process that is based exclusively on "function", i.e. what something "does", not what it "is". For example, a screwdriver that is being used to stir a can of paint has a "function" of mixing the contents of a paint can, and not the original connotation of securing a screw into a screw-hole. In value engineering, "functions" are always described in a two-word abridgment consisting of an active verb and a measurable noun (what is being done – the verb – and what it is being done to – the noun), and this is done in the most non-prescriptive way possible. In the screwdriver and can of paint example, the most basic function would be "blend liquid", which is less prescriptive than "stir paint": "stir paint" limits the action (by stirring) and limits the application (it only considers paint).

Value engineering uses rational logic (a unique "how" - "why" questioning technique) and the analysis of function to identify relationships that increase value. It is considered a quantitative method, similar to the scientific method, which focuses on hypothesis-conclusion approaches to test relationships, and to operations research, which uses model building to identify predictive relationships.

CHAPTER TWO:   Introduction to Artificial Intelligence

Section One: Introduction

Artificial Intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by humans. Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs.

Note: So far we have not designed or built any Artificial Intelligence machines. All AI efforts have been directed towards building AI applications.

AI applications include advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), generative or creative tools (ChatGPT and AI art), automated decision-making, and competing at the highest level in strategic game systems. AI systems are also being used in weather forecasting.

Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, and large databases of knowledge. In the first decades of the 21st century, highly mathematical and statistical software learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.

The various sub-fields of AI research are centered around particular goals. The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natural language processing, and perception.

General intelligence (the ability to solve an arbitrary problem) is still one of the field's long-term goals, but it is now clearly understood that general intelligence will not be implementable as a purely software application.

To claim success, AI researchers have narrowed the scope of AI and have integrated a wide range of problem-solving techniques, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics and probability. AI also draws upon computer science, psychology, linguistics, philosophy, and many other fields. The AI community has hijacked much of the philosophical language used in the discussion of Reality.

The field was founded on the assumption that human intelligence "can be so precisely described that a machine can be made to simulate it". This idea has proven to be false. The term artificial intelligence has always been overhyped by the Academics who are involved in researching AI. This process of overstating the true technological capabilities of AI is a direct consequence of the search for funding in Academia.

Section Two: History of Artificial Intelligence

Calculating Machines were built in antiquity and improved throughout history by many mathematicians, including the philosopher Gottfried Leibniz. In the early 19th century, Charles Babbage designed a programmable computer (the Analytical Engine), although it was never built. Ada Lovelace speculated that the machine "might compose elaborate and scientific pieces of music of any degree of complexity or extent". (She is often credited as the first programmer because of a set of notes she wrote that completely detail a method for calculating Bernoulli numbers with the Engine.)

Following Babbage, although at first unaware of his earlier work, was Percy Ludgate, a clerk to a corn merchant in Dublin, Ireland. He independently designed a programmable mechanical computer, which he described in a work that was published in 1909.

Vannevar Bush's paper Instrumental Analysis (1936) discussed using existing IBM punch card machines to implement Babbage's design. In the same year he started the Rapid Arithmetical Machine project to investigate the problems of constructing an electronic digital computer.

The first modern computers were the massive code-breaking machines of the Second World War (such as the Z3, ENIAC and Colossus). The latter two of these machines were based on the theoretical foundation laid by Alan Turing and developed by John von Neumann.

In the 1940s and 50s, a handful of scientists from a variety of fields (mathematics, psychology, engineering, economics and political science) began to discuss the possibility of creating an artificial brain. The field of artificial intelligence research was founded as an academic discipline in 1956.

The earliest research into thinking machines was inspired by a confluence of ideas that became prevalent in the late 1930s, 1940s, and early 1950s. Recent research in neurology had shown that the brain was an electrical network of neurons that fired in all-or-nothing pulses. Norbert Wiener's cybernetics described control and stability in electrical networks. Claude Shannon's information theory described digital signals (i.e., all-or-nothing signals). Alan Turing's theory of computation showed that any form of computation could be described digitally. The close relationship between these ideas suggested that it might be possible to construct an electronic brain.
Examples of work in this vein include robots such as W. Grey Walter's turtles and the Johns Hopkins Beast. These machines did not use computers, digital electronics or symbolic reasoning; they were controlled entirely by analog circuitry.

In 1943, Walter Pitts and Warren McCulloch analyzed networks of idealized artificial neurons and showed how they might perform simple logical functions. They were the first to describe what later researchers would call a neural network. One of the students inspired by Pitts and McCulloch was a young Marvin Minsky, then a 24-year-old graduate student. In 1951 (with Dean Edmonds) he built the first neural net machine, the SNARC. Minsky was to become one of the most important leaders and innovators in AI for the next 50 years.

When access to digital computers became possible in the middle fifties, a few scientists instinctively recognized that a machine that could manipulate numbers could also manipulate symbols and that the manipulation of symbols could well be the essence of human thought. This was a new approach to creating thinking machines.

In 1955, Allen Newell and Herbert A. Simon created the "Logic Theorist". The program would eventually prove 38 of the first 52 theorems in Russell and Whitehead's Principia Mathematica, and find new and more elegant proofs for some. Simon said that they had "solved the venerable mind/body problem, explaining how a system composed of matter can have the properties of mind."

Many early AI programs used the same basic algorithm. To achieve some goal (like winning a game or proving a theorem), they proceeded step by step towards it (by making a move or a deduction) as if searching through a maze, backtracking whenever they reached a dead end. This paradigm was called "reasoning as search".

The principal difficulty was that, for many problems, the number of possible paths through the "maze" was simply astronomical (a situation known as a "combinatorial explosion"). Researchers would reduce the search space by using heuristics or "rules of thumb" that would eliminate those paths that were unlikely to lead to a solution.

Newell and Simon tried to capture a general version of this algorithm in a program called the "General Problem Solver". Other "searching" programs were able to accomplish impressive tasks like solving problems in geometry and algebra, such as Herbert Gelernter's Geometry Theorem Prover (1958) and SAINT, written by Minsky's student James Slagle (1961). Other programs searched through goals and subgoals to plan actions, like the STRIPS system developed at the Stanford Research Institute (SRI) to control the behavior of its robot Shakey.

An important goal of AI research is to allow computers to communicate in natural languages like English. An early success was Daniel Bobrow's program STUDENT, which could solve high school algebra word problems.

A semantic net represents concepts (e.g. "house", "door") as nodes and relations among concepts (e.g. "has-a") as links between the nodes. The first AI program to use a semantic net was written by Ross Quillian, and the most successful version was Roger Schank's conceptual dependency theory.

Joseph Weizenbaum's ELIZA could carry out conversations that were so realistic that users occasionally were fooled into thinking they were communicating with a human being and not a program (See ELIZA effect). But in fact, ELIZA had no idea what she was talking about. She simply gave a canned response or repeated back what was said to her, rephrasing her response with a few grammar rules. ELIZA was the first chatterbot.

Section Three: Classes of Artificial Intelligence

Part One:   Search and Optimization

AI can solve many problems by intelligently searching through many possible solutions. Reasoning can be reduced to performing a search. For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule. Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis. Robotics algorithms for moving limbs and grasping objects use local searches in configuration space.

Simple exhaustive searches are rarely sufficient for most real-world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. The solution, for many problems, is to use "heuristics" or "rules of thumb" that prioritize choices in favor of those more likely to reach a goal, and to do so in a smaller number of steps. In some search methodologies, heuristics can also serve to eliminate some choices unlikely to lead to a goal (called "pruning the search tree"). Heuristics supply the program with a "best guess" for the path on which the solution lies, limiting the search to a smaller sample of the space.
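The idea can be made concrete in a few lines of code. The following is a minimal sketch in Python of greedy best-first search; the toy graph, the heuristic values, and the function name are invented for illustration:

    # A minimal sketch of heuristic ("best-first") search on a toy graph.
    # The heuristic scores each node by a guessed distance to the goal.
    import heapq

    def best_first_search(graph, heuristic, start, goal):
        """Always expand the node the heuristic rates as closest to the goal."""
        frontier = [(heuristic[start], start, [start])]
        visited = set()
        while frontier:
            _, node, path = heapq.heappop(frontier)
            if node == goal:
                return path
            if node in visited:
                continue
            visited.add(node)
            for neighbor in graph[node]:
                if neighbor not in visited:
                    heapq.heappush(frontier,
                                   (heuristic[neighbor], neighbor, path + [neighbor]))
        return None  # search space exhausted without reaching the goal

    # Toy problem: states with hand-made heuristic estimates of distance to 'G'.
    graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['A'], 'G': []}
    heuristic = {'S': 3, 'A': 1, 'B': 2, 'G': 0}
    print(best_first_search(graph, heuristic, 'S', 'G'))  # ['S', 'A', 'G']

The heuristic steers the search straight to the goal without ever expanding 'B'; with equal priorities everywhere, the same loop degenerates into blind exhaustive search.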

CHAPTER THREE:   Introduction to Neural Networks

A biological neural network is composed of a group of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive. Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic synapses and other connections are possible. Apart from electrical signalling, there are other forms of signalling that arise from neurotransmitter diffusion.

Artificial intelligence, cognitive modelling, and neural networks are information processing paradigms inspired by biological neural systems. Computerized artificial intelligence is an attempt to develop software libraries which can parse large data sets and re-order these data sets in a predetermined fashion. In the artificial intelligence community, these libraries have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.

Historically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions via access to memory by a number of processors. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems. Unlike the von Neumann model, neural network computing does not separate memory and processing.

Section One:   History of Neural Networks

Our history of neural networks will be limited to the twentieth and twenty-first centuries.

Wilhelm Lenz (1920) and Ernst Ising (1925) created and analyzed the Ising model which is essentially a non-learning artificial recurrent neural network (RNN) consisting of neuron-like threshold elements. In 1972, Shun'ichi Amari made this architecture adaptive. His learning RNN was popularised by John Hopfield in 1982. McCulloch and Pitts (1943) also created a computational model for neural networks based on mathematics and algorithms. They called this model threshold logic. These early models paved the way for neural network research.

In the late 1940s psychologist Donald Hebb created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a 'typical' unsupervised learning rule and its later variants were early models for long term potentiation. These ideas started being applied to computational models in 1948 with Turing's B-type machines.

Farley and Clark (1954) first used computational machines, then called calculators, to simulate a Hebbian network at MIT. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda (1956).

Frank Rosenblatt (1958) created the Perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit.
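The learning rule itself can be stated in a few lines. Below is a minimal sketch in Python of the perceptron update (weights corrected by simple addition and subtraction of the input), trained here on the logical AND function; the data and function name are illustrative:

    # A minimal sketch of Rosenblatt's perceptron learning rule.
    def train_perceptron(samples, epochs=10, lr=1.0):
        # samples: list of (inputs, target) pairs with targets in {0, 1}
        n = len(samples[0][0])
        w, b = [0.0] * n, 0.0
        for _ in range(epochs):
            for x, target in samples:
                activation = sum(wi * xi for wi, xi in zip(w, x)) + b
                prediction = 1 if activation > 0 else 0
                error = target - prediction          # -1, 0, or +1
                # Add or subtract the input vector when the prediction is wrong
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
        return w, b

    # Learn the logical AND function (linearly separable, unlike XOR).
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w, b = train_perceptron(data)
    print(w, b)

Because AND is linearly separable, the weights converge after a few epochs; XOR, by contrast, has no separating line, which is precisely the limitation emphasized in the work discussed next.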

Neural network research stagnated after the publication of machine learning research by Marvin Minsky and Seymour Papert (1969). They discovered two key issues with the computational machines that processed neural networks. The first issue was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second significant issue was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks.

Neural network research became active again when the personal computer achieved greater processing power. Also key in later advances was the backpropagation algorithm. It is an efficient application of the Leibniz chain rule to networks of differentiable nodes. It is also known as the reverse mode of automatic differentiation or reverse accumulation, due to Seppo Linnainmaa (1970). The term "back-propagating errors" was introduced in 1962 by Frank Rosenblatt, but he did not have an implementation of this procedure, although Henry J. Kelley had a continuous precursor of backpropagation already in 1960 in the context of control theory.

In 1982, Paul Werbos applied backpropagation to Multilayer Perceptron (MLP)s in the way that has become standard.

In the late 1970s to early 1980s, interest briefly emerged in theoretically investigating the Ising model of Wilhelm Lenz (1920) and Ernst Ising (1925) in relation to Cayley tree topologies and large neural networks. In 1981, the Ising model was solved exactly for the general case of closed Cayley trees (with loops) with an arbitrary branching ratio and found to exhibit unusual phase transition behavior in its local-apex and long-range site-site correlations.

The parallel distributed processing of the mid-1980s became popular under the name connectionism. The text by Rumelhart and McClelland (1986) provided a full exposition on the use of connectionism in computers to simulate neural processes.

Deep Learning Perceptron: The first deep learning MLP was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965, as the Group Method of Data Handling. This method employs incremental layer-by-layer training based on regression analysis, where useless units in hidden layers are pruned with the help of a validation set.

The first deep learning MLP trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned useful internal representations to classify non-linearly separable pattern classes.

Backpropagation: The backpropagation algorithm is, as noted above, an application of the Leibniz chain rule to networks of differentiable nodes (the reverse mode of automatic differentiation, due to Seppo Linnainmaa, 1970). In 1982, Paul Werbos applied it to MLPs in the way that has become standard, and in 1986 David E. Rumelhart published an experimental analysis of the technique.
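Concretely, reverse-mode differentiation just walks the chain rule backwards through the computation. A minimal sketch in Python, for a toy network with one sigmoid hidden unit and one linear output (all names and numerical values are illustrative):

    # A minimal sketch of backpropagation on a two-weight network:
    # x -> h = sigmoid(w1*x) -> y = w2*h, with squared-error loss.
    import math

    def forward_backward(x, target, w1, w2):
        # Forward pass
        a = w1 * x
        h = 1.0 / (1.0 + math.exp(-a))   # sigmoid activation
        y = w2 * h
        loss = 0.5 * (y - target) ** 2

        # Backward pass: chain rule applied from the loss back to each weight
        dL_dy = y - target
        dL_dw2 = dL_dy * h
        dL_dh = dL_dy * w2
        dL_da = dL_dh * h * (1.0 - h)    # derivative of the sigmoid
        dL_dw1 = dL_da * x
        return loss, dL_dw1, dL_dw2

    loss, g1, g2 = forward_backward(x=1.0, target=1.0, w1=0.5, w2=0.5)
    print(loss, g1, g2)   # gradients used for a gradient-descent weight update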

Self-organizing Maps: Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982. SOMs are artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.

Support Vector Machines: Support vector machines, developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues (Isabelle Guyon et al., 1993; Corinna Cortes, 1995; Vapnik et al., 1997), and simpler methods such as linear classifiers gradually overtook neural networks in popularity. However, neural networks transformed domains such as the prediction of protein structures.

Convolutional Neural Networks: The origin of the Convolutional Neural Networks (CNN) architecture is the "neocognitron" introduced by Kunihiko Fukushima in 1980. It was inspired by work of Hubel and Wiesel in the 1950s and 1960s which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
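The two layer types can be sketched directly in Python with NumPy; the image size, the 3x3 smoothing filter, and the pooling window below are illustrative only:

    # A minimal sketch of a convolutional layer (one shared 3x3 filter)
    # followed by an averaging downsampling layer, as in the neocognitron.
    import numpy as np

    def convolve2d(image, filt):
        fh, fw = filt.shape
        oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                # Same filter at every position: shared (tied) weights
                out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
        return out

    def average_pool(feature_map, size=2):
        h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
        return feature_map[:h*size, :w*size].reshape(h, size, w, size).mean(axis=(1, 3))

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))
    filt = np.ones((3, 3)) / 9.0          # a simple smoothing filter
    features = convolve2d(image, filt)    # 6x6 feature map
    pooled = average_pool(features)       # 3x3 after downsampling
    print(pooled.shape)

Every position of the feature map is computed with the same filter (shared weights), and the averaging layer halves the resolution, which is what gives the architecture its tolerance to small shifts of the input.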

Rectified Linear Unit: In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function. The rectifier has become the most popular activation function for CNNs and deep neural networks in general.

Time Delay Neural Networks: The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance. It did so by utilizing weight sharing in combination with backpropagation training. Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.

In 1988, Wei Zhang and colleagues applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.

In 1989, Yann LeCun trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days. Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang, et al. modified their model by removing the last fully connected layer and applied it for medical image object segmentation in 1991 and breast cancer detection in mammograms in 1994.

In 1990, Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker-independent isolated word recognition system. In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng also used max-pooling, where a downsampling unit computes the maximum of the activations of the units in its patch. Max-pooling is often used in modern CNNs.

LeNet-5, a 7-level CNN by Yann LeCun et al. (1998) that classifies digits, was applied by several banks to recognize hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger and more numerous layers of CNNs, so this technique is constrained by the availability of computing resources.

In 2010, backpropagation training through max-pooling was accelerated by graphics processing units (GPUs) and shown to perform better than other pooling variants. Behnke (2003) relied only on the sign of the gradient on problems such as image reconstruction and face localization, using Rprop, a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992.

In 2011, a deep GPU-based CNN called "DanNet" by Dan Ciresan, Ueli Meier, and Juergen Schmidhuber achieved human-competitive performance for the first time in computer vision contests. Subsequently, a similar GPU-based CNN by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge 2012. A very deep CNN with over 100 layers by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft won the ImageNet 2015 contest.

Artificial Neural Networks: ANNs were able to guarantee shift invariance to deal with small and large natural objects in large cluttered scenes, only when invariance extended beyond shift, to all ANN-learned concepts, such as location, type (object class label), scale, lighting and others. This was realized in Developmental Networks (DNs) whose embodiments are Where-What Networks, WWN-1 (2008) through WWN-7 (2013).

Transformers and their Variants: Many modern large language models such as ChatGPT, GPT-4, and BERT use a feedforward neural network called the Transformer, introduced by Ashish Vaswani et al. in their 2017 paper "Attention Is All You Need." Transformers have increasingly become the model of choice for natural language processing problems, replacing recurrent neural networks (RNNs).

Basic ideas for this go back to 1992, when Juergen Schmidhuber published the Transformer with "linearized self-attention" (save for a normalization operator), which is also called the "linear Transformer." He advertised it as an "alternative to RNNs" that can learn "internal spotlights of attention," and experimentally applied it to problems of variable binding. Here a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns called "FROM" and "TO", which in Transformer terminology are called "key" and "value" for "self-attention." This fast weight "attention mapping" is applied to queries. The 2017 Transformer combines this with a softmax operator and a projection matrix.
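In the 2017 formulation, a single self-attention layer reduces to a few matrix products. A minimal sketch in Python with NumPy; the projection matrices below are random placeholders standing in for trained weights:

    # A minimal sketch of softmax self-attention: queries, keys, and values
    # are projections of the same token embeddings.
    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products
        return softmax(scores) @ V                # attention-weighted values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8)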

Deep learning with unsupervised or self-supervised pre-training: In the 1980s, backpropagation did not work well for deep feedforward neural networks (FNNs) and recurrent neural networks (RNNs). Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potential interactions between input and output. For an FNN, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.

To overcome this problem, Juergen Schmidhuber (1992) proposed a self-supervised hierarchy of RNNs pre-trained one level at a time by self-supervised learning. This "neural history compressor" uses predictive coding to learn internal representations at multiple self-organizing time scales. The deep architecture may be used to reproduce the original data from the top level feature activations. The RNN hierarchy can be "collapsed" into a single RNN, by "distilling" a higher level "chunker" network into a lower level "automatizer" network. In 1993, a chunker solved a deep learning task whose CAP depth exceeded 1000. Such history compressors can substantially facilitate downstream supervised deep learning.

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.

Section Two:   Types of Neural Networks

Part One:   Feedforward Neural Networks

Feedforward Neural Networks: The feedforward neural network was the first and simplest type. In this network the information moves only from the input layer directly through any hidden layers to the output layer without cycles/loops. Feedforward networks can be constructed with various types of units, such as binary McCulloch–Pitts neurons, the simplest of which is the perceptron. Continuous neurons, frequently with sigmoidal activation, are used in the context of backpropagation.

The Group method of data handling: (GMDH) features fully automatic structural and parametric model optimization. The node activation functions are Kolmogorov–Gabor polynomials that permit additions and multiplications. It uses a deep multilayer perceptron with eight layers. It is a supervised learning network that grows layer by layer, where each layer is trained by regression analysis. Useless items are detected using a validation set, and pruned through regularization. The size and depth of the resulting network depends on the task.

Autoencoder: An autoencoder, autoassociator or Diabolo network is similar to the multilayer perceptron (MLP) – with an input layer, an output layer and one or more hidden layers connecting them. However, the output layer has the same number of units as the input layer. Its purpose is to reconstruct its own inputs (instead of emitting a target value). Therefore, autoencoders are unsupervised learning models. An autoencoder is used for unsupervised learning of efficient codings, typically for the purpose of dimensionality reduction and for learning generative models of data.
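As a minimal sketch in Python with NumPy: a linear autoencoder with an 8-dimensional input and a 3-dimensional code, trained by gradient descent on the reconstruction error (all sizes, rates, and iteration counts are illustrative):

    # A minimal sketch of a linear autoencoder: compress 8-d inputs to a
    # 3-d code and reconstruct them; the input itself is the target.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 8))          # unlabeled data
    W_enc = rng.normal(size=(8, 3)) * 0.1  # encoder weights
    W_dec = rng.normal(size=(3, 8)) * 0.1  # decoder weights

    lr = 0.05
    for _ in range(1000):
        code = X @ W_enc                   # encoder: project down to 3 dims
        recon = code @ W_dec               # decoder: map the code back to 8
        err = recon - X                    # reconstruction error
        # Gradients of the mean squared reconstruction error
        grad_dec = code.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    print(np.mean(err ** 2))               # reconstruction loss after training

No labels appear anywhere in the loop, which is what makes this an unsupervised learning model; the 3-dimensional code is the learned low-dimensional encoding.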

Probabilistic neural network: (PNN) is a four-layer feedforward neural network. The layers are input, hidden, pattern/summation, and output. In the PNN algorithm, the parent probability distribution function (PDF) of each class is approximated by a Parzen window and a non-parametric function. Then, using the PDF of each class, the class probability of a new input is estimated and Bayes' rule is employed to allocate it to the class with the highest posterior probability. It was derived from the Bayesian network and a statistical algorithm called Kernel Fisher discriminant analysis. It is used for classification and pattern recognition.
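The core of the algorithm, a Parzen-window density estimate per class followed by a Bayes decision, can be sketched in a few lines of Python (equal class priors are assumed; the data, bandwidth, and names are illustrative):

    # A minimal sketch of the PNN idea: estimate each class-conditional
    # density with Gaussian Parzen windows, then pick the densest class.
    import numpy as np

    def parzen_density(x, samples, sigma=0.5):
        # Average of Gaussian kernels centred on one class's training points
        d2 = np.sum((samples - x) ** 2, axis=1)
        return np.mean(np.exp(-d2 / (2 * sigma ** 2)))

    def pnn_classify(x, classes):
        # classes: dict mapping label -> array of training points
        scores = {label: parzen_density(x, pts) for label, pts in classes.items()}
        return max(scores, key=scores.get)   # highest estimated density wins

    classes = {
        "A": np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2]]),
        "B": np.array([[2.0, 2.0], [2.1, 1.8], [1.9, 2.2]]),
    }
    print(pnn_classify(np.array([1.8, 2.0]), classes))   # -> "B"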

Time delay neural network: (TDNN) is a feedforward architecture for sequential data that recognizes features independent of sequence position. In order to achieve time-shift invariance, delays are added to the input so that multiple data points (points in time) are analyzed together. It usually forms part of a larger pattern recognition system. It has been implemented using a perceptron network whose connection weights were trained with back propagation (supervised learning).

Convolutional neural network: (CNN, ConvNet, shift-invariant or space-invariant network) is a class of deep network, composed of one or more convolutional layers with fully connected layers (matching those in typical artificial neural networks, ANNs) on top. It uses tied weights and pooling layers, in particular max-pooling. It is often structured via Fukushima's convolutional architecture. CNNs are variations of multilayer perceptrons that use minimal preprocessing. This architecture allows CNNs to take advantage of the 2D structure of input data.

Its unit connectivity pattern is inspired by the organization of the visual cortex. Units respond to stimuli in a restricted region of space known as the receptive field. Receptive fields partially overlap, over-covering the entire visual field. Unit response can be approximated mathematically by a convolution operation.

CNNs are suitable for processing visual and other two-dimensional data. They have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate.

Capsule Neural Networks (CapsNet) add structures called capsules to a CNN and reuse output from several capsules to form more stable (with respect to various perturbations) representations.

Examples of applications in computer vision include DeepDream and robot navigation. They have wide applications in image and video recognition, recommender systems and natural language processing.

Deep stacking network: (DSN) (deep convex network) is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Dong. It formulates the learning as a convex optimization problem with a closed-form solution, emphasizing the mechanism's similarity to stacked generalization. Each DSN block is a simple module that is easy to train by itself in a supervised fashion without backpropagation for the entire blocks.
Each block consists of a simplified multi-layer perceptron (MLP) with a single hidden layer. The hidden layer $\boldsymbol{h}$ has logistic sigmoidal units, and the output layer has linear units. Connections between these layers are represented by weight matrix $\boldsymbol{U}$; input-to-hidden-layer connections have weight matrix $\boldsymbol{W}$. Target vectors $\boldsymbol{t}$ form the columns of matrix $\boldsymbol{T}$, and the input data vectors $\boldsymbol{x}$ form the columns of matrix $\boldsymbol{X}$. The matrix of hidden units is $\boldsymbol{H}=\sigma(\boldsymbol{W}^{T}\boldsymbol{X})$, where the function $\sigma$ performs the element-wise logistic sigmoid operation. Modules are trained in order, so lower-layer weights $\boldsymbol{W}$ are known at each stage. Each block estimates the same final label class $\boldsymbol{y}$, and its estimate is concatenated with the original input $\boldsymbol{X}$ to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while downstream blocks' input adds the output of preceding blocks. Learning the upper-layer weight matrix $\boldsymbol{U}$ given the other weights in the network can then be formulated as a convex optimization problem: $$ \min_{\boldsymbol{U}} f=\|\boldsymbol{U}^{T}\boldsymbol{H}-\boldsymbol{T}\|_{F}^{2}, $$ which has a closed-form solution.
Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed feature representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely discriminative tasks, DSNs outperform conventional DBNs.
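Because only the upper-layer weights are free, each block's training step is an ordinary least-squares problem. A minimal sketch in Python with NumPy of the closed-form solve for one block (matrix sizes are illustrative; inputs and targets are stored as columns, as in the text above):

    # A minimal sketch of the convex upper-layer step: with H and T fixed,
    # min_U ||U^T H - T||_F^2 is solved by the normal equations.
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(20, 50))             # 20-dim inputs, 50 samples (columns)
    W = rng.normal(size=(20, 30))             # fixed lower-layer weights
    T = rng.normal(size=(5, 50))              # 5-dim targets (columns)

    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))      # H = sigma(W^T X), shape (30, 50)
    # Closed form: U = (H H^T)^{-1} H T^T
    U = np.linalg.solve(H @ H.T, H @ T.T)     # shape (30, 5)
    print(np.linalg.norm(U.T @ H - T))        # residual of the closed-form fit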

Tensor deep stacking networks: (TDSN) This architecture is a DSN extension. It offers two important improvements: it uses higher-order information from covariance statistics, and it transforms the non-convex problem of a lower-layer to a convex sub-problem of an upper-layer. TDSNs use covariance statistics in a bilinear mapping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order tensor.
While parallelization and scalability are not considered seriously in conventional DNNs, all learning for DSNs and TDSNs is done in batch mode, to allow parallelization. Parallelization allows scaling the design to larger (deeper) architectures and data sets. The basic architecture is suitable for diverse tasks such as classification and regression.

Part Two:   Regulatory Feedback Networks

Regulatory feedback networks started as a model to explain brain phenomena found during recognition including network-wide bursting and difficulty with similarity found universally in sensory recognition. A mechanism to perform optimization during recognition is created using inhibitory feedback connections back to the same inputs that activate them. This reduces requirements during learning and allows learning and updating to be easier while still being able to perform complex recognition.
A regulatory feedback network makes inferences using negative feedback. The feedback is used to find the optimal activation of units. It is most similar to a non-parametric method but is different from K-nearest neighbor in that it mathematically emulates feedforward networks.

Part Three:   Radial basis function (RBF)

Radial basis functions are functions that have a distance criterion with respect to a center. Radial basis functions have been applied as a replacement for the sigmoidal hidden layer transfer characteristic in multi-layer perceptrons. RBF networks have two layers: In the first, input is mapped onto each RBF in the 'hidden' layer.
The RBF chosen is usually a Gaussian. In regression problems the output layer is a linear combination of hidden layer values representing mean predicted output. The interpretation of this output layer value is the same as a regression model in statistics. In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability. Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics. This corresponds to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework.
RBF networks have the advantage of not suffering from local minima in the same way as multi-layer perceptrons. This is because the only parameters that are adjusted in the learning process are the linear mapping from hidden layer to output layer. Linearity ensures that the error surface is quadratic and therefore has a single, easily found minimum. In regression problems this can be found in one matrix operation. In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively re-weighted least squares.
RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions. RBF centres are determined with reference to the distribution of the input data, but without reference to the prediction task. As a result, representational resources may be wasted on areas of the input space that are irrelevant to the task. A common solution is to associate each data point with its own centre, although this can expand the linear system to be solved in the final layer and requires shrinkage techniques to avoid overfitting.
Associating each input datum with an RBF leads naturally to kernel methods such as support vector machines (SVM) and Gaussian processes (the RBF is the kernel function). All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model.
Like Gaussian processes, and unlike SVMs, RBF networks are typically trained in a maximum likelihood framework by maximizing the probability (minimizing the error). SVMs avoid overfitting by maximizing instead a margin. SVMs outperform RBF networks in most classification applications. In regression applications they can be competitive when the dimensionality of the input space is relatively small.
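For regression, the whole training procedure is indeed one matrix solve. A minimal sketch in Python with NumPy, placing one Gaussian centre on each data point and using a small ridge term as the shrinkage discussed above (data, width, and ridge strength are illustrative):

    # A minimal sketch of an RBF regression network: Gaussian hidden units
    # centred on the training points, linear output layer fitted in one solve.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 2 * np.pi, 40)
    y = np.sin(x) + 0.1 * rng.normal(size=40)   # noisy 1-d regression target

    centres, width, ridge = x, 0.5, 1e-6        # one centre per data point
    Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))
    # Output layer is linear in the RBF activations; ridge term = shrinkage
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(len(centres)), Phi.T @ y)

    y_hat = Phi @ w
    print(np.mean((y_hat - y) ** 2))            # training error of the fit

Associating every datum with its own centre, as here, is exactly the choice that makes the final linear system large and the ridge term necessary to avoid overfitting.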

Part Four:   General Regression Neural Network

A GRNN is an associative memory neural network that is similar to the probabilistic neural network but it is used for regression and approximation rather than classification.

Part Five:   Deep Belief Networks

A deep belief network (DBN) is a probabilistic, generative model made up of multiple hidden layers. It can be considered a composition of simple learning modules.
A DBN can be used to generatively pre-train a deep neural network (DNN) by using the learned DBN weights as the initial DNN weights. Various discriminative algorithms can then tune these weights. This is particularly helpful when training data are limited, because poorly initialized weights can significantly hinder learning. These pre-trained weights end up in a region of the weight space that is closer to the optimal weights than random choices. This allows for both improved modeling and faster ultimate convergence.

Part Six:   Recurrent Neural Networks

Recurrent neural networks (RNNs) propagate data forward, but also backwards, from later processing stages to earlier stages. RNNs can be used as general sequence processors.

Fully Recurrent Neural Network: This architecture was developed in the 1980s. Its network creates a directed connection between every pair of units. Each has a time-varying, real-valued (more than just zero or one) activation (output). Each connection has a modifiable real-valued weight. Some of the nodes are called labeled nodes, some output nodes, the rest hidden nodes.
For supervised learning in discrete time settings, training sequences of real-valued input vectors become sequences of activations of the input nodes, one input vector at a time. At each time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections. The system can explicitly activate (independent of incoming signals) some output units at certain time steps.
For each sequence, its error is the sum of the deviations of all activations computed by the network from the corresponding target signals. For a training set of numerous sequences, the total error is the sum of the errors of all individual sequences.
To minimize total error, gradient descent can be used to change each weight in proportion to its derivative with respect to the error, provided the non-linear activation functions are differentiable. The standard method is called "backpropagation through time" or BPTT, a generalization of back-propagation for feedforward networks. A more computationally expensive online variant is called "Real-Time Recurrent Learning" or RTRL. Unlike BPTT this algorithm is local in time but not local in space. An online hybrid between BPTT and RTRL with intermediate complexity exists, with variants for continuous time. A major problem with gradient descent for standard RNN architectures is that error gradients vanish exponentially quickly with the size of the time lag between important events. The Long short-term memory architecture overcomes these problems.
In reinforcement learning settings, no teacher provides target signals. Instead a fitness function or reward function or utility function is occasionally used to evaluate performance, which influences its input stream through output units connected to actuators that affect the environment. Variants of evolutionary computation are often used to optimize the weight matrix.

The Hopfield network (like similar attractor-based networks) is of historic interest although it is not a general RNN, as it is not designed to process sequences of patterns. Instead it requires stationary inputs. It is an RNN in which all connections are symmetric. It guarantees that it will converge. If the connections are trained using Hebbian learning the Hopfield network can perform as robust content-addressable memory, resistant to connection alteration.

The Boltzmann Machine can be thought of as a noisy Hopfield network. It is one of the first neural networks to demonstrate learning of latent variables (hidden units). Boltzmann machine learning was at first slow to simulate, but the contrastive divergence algorithm speeds up training for Boltzmann machines and Products of Experts.

The Self-organizing Map (SOM) uses unsupervised learning. A set of neurons learns to map points in an input space to coordinates in an output space. The input space can have different dimensions and topology from the output space, and SOM attempts to preserve the topological structure of the input.

Learning Vector Quantization (LVQ) can be interpreted as a neural network architecture. Prototypical representatives of the classes, together with an appropriate distance measure, parameterize a distance-based classification scheme.

Simple Recurrent Networks have three layers, with the addition of a set of "context units" in the input layer. These units are fed from the hidden layer or the output layer with a fixed weight of one. At each time step, the input is propagated in a standard feedforward fashion, and then a backpropagation-like learning rule is applied (not performing gradient descent). The fixed back-connections leave a copy of the previous values of the hidden units in the context units (since they propagate over the connections before the learning rule is applied).

Reservoir Computing is a computation framework that may be viewed as an extension of neural networks. Typically an input signal is fed into a fixed (random) dynamical system called a reservoir whose dynamics map the input to a higher dimension. A readout mechanism is trained to map the reservoir to the desired output. Training is performed only at the readout stage. Liquid-state machines are a type of reservoir computing.

The Echo State Network (ESN) employs a sparsely connected random hidden layer. The weights of the output neurons are the only part of the network that is trained. ESNs are good at reproducing certain time series.

The Long Short-Term Memory (LSTM) network avoids the vanishing gradient problem. It works even with long delays between inputs and can handle signals that mix low- and high-frequency components. LSTM RNNs have outperformed other RNNs and other sequence-learning methods, such as HMMs, in applications such as language learning and connected handwriting recognition.

Bi-directional RNNs (BRNNs) use a finite sequence to predict or label each element of the sequence based on both its past and future context. This is done by adding the outputs of two RNNs: one processing the sequence from left to right, the other from right to left. The combined outputs are the predictions of the teacher-given target signals. This technique has proved especially useful when combined with LSTM.

Hierarchical Recurrent Neural Networks connect their elements in various ways to decompose hierarchical behavior into useful subprograms.

Stochastic Neural Network: Distinct from conventional neural networks, a stochastic artificial neural network is used as an approximation to random functions.

Genetic Scale: An RNN (often an LSTM) in which a series is decomposed into a number of scales, where every scale informs the primary length between two consecutive points. A first-order scale consists of a normal RNN, a second-order scale consists of all points separated by two indices, and so on. The Nth-order RNN connects the first and last node. The outputs from all the various scales are treated as a Committee of Machines and the associated scores are used genetically for the next iteration.

Modular: Some studies of human brain function seem to show that the human brain operates as a collection of small networks. This realization gave birth to the concept of modular neural networks, in which several small networks cooperate or compete to solve problems.

A Committee of Machines (CoM) is a collection of different neural networks that together "vote" on a given example. This generally gives a much better result than individual networks. Because neural networks suffer from local minima, starting with the same architecture and training but using randomly different initial weights often gives vastly different results. A CoM tends to stabilize the result.
The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different starting weights rather than training on different randomly selected subsets of the training data.
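
As an illustrative sketch of this distinction, the toy Python/NumPy example below trains several perceptrons of identical architecture on the same data, varying only the random initial weights, and combines them by majority vote. The data, rates and sizes are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def train_one(X, d, w0, epochs=100, r=1.0):
        """Train one perceptron from the given initial weight vector."""
        w = w0.copy()
        for _ in range(epochs):
            for x, t in zip(X, d):
                y = float(w @ x > 0)
                w += r * (t - y) * x                # perceptron update rule
        return w

    # Toy linearly separable data; the first column is the constant bias input.
    X = np.array([[1, 0.2, 0.1], [1, 0.9, 0.8], [1, 0.1, 0.3], [1, 0.8, 0.9]])
    d = np.array([0, 1, 0, 1])

    # Same architecture, same data, different random initial weights.
    committee = [train_one(X, d, rng.normal(size=3)) for _ in range(7)]

    def committee_vote(x):
        votes = [float(w @ x > 0) for w in committee]
        return round(sum(votes) / len(votes))       # majority vote

    print([committee_vote(x) for x in X])           # should print [0, 1, 0, 1]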

The Associative Neural Network (ASNN) is an extension of the committee of machines that combines multiple feedforward neural networks and the k-nearest neighbor technique. It uses the correlation between ensemble responses as a measure of distance among the analyzed cases for the kNN. This corrects the bias of the neural network ensemble. An associative neural network has a memory that can coincide with the training set. If new data become available, the network instantly improves its predictive ability and provides data approximation (self-learns) without retraining. Another important feature of ASNN is the possibility to interpret neural network results by analysis of correlations between data cases in the space of models.

Part Seven:   Physical Neural Networks

A physical neural network includes electrically adjustable resistance material to simulate artificial synapses. Examples include the ADALINE memristor-based neural network. An optical neural network is a physical implementation of an artificial neural network with optical components.

Part Eight:   Dynamic Neural Networks

Dynamic neural networks address nonlinear multivariate behaviour and include (learning of) time-dependent behaviour, such as transient phenomena and delay effects. Techniques to estimate a system process from observed data fall under the general category of system identification.

Part Nine:   Cascading Correlation

Cascade correlation is an architecture and supervised learning algorithm. Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure. Once a new hidden unit has been added to the network, its input-side weights are frozen. This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors. The Cascade-Correlation architecture has several advantages: It learns quickly, determines its own size and topology, retains the structures it has built even if the training set changes and requires no backpropagation.

Part Ten:   Neuro Fuzzy Network

A Neuro-fuzzy Network is a fuzzy inference system (FIS) in the body of an artificial neural network. Depending on the FIS type, several layers simulate the processes involved in fuzzy inference, such as fuzzification, inference, aggregation and defuzzification. Embedding an FIS in the general structure of an ANN has the benefit of using available ANN training methods to find the parameters of the fuzzy system.

Part Eleven:   Compositional Pattern-Producing Networks

Compositional pattern-producing networks (CPPNs) are a variation of artificial neural networks which differ in their set of activation functions and how they are applied. While typical artificial neural networks often contain only sigmoid functions (and sometimes Gaussian functions), CPPNs can include both types of functions and many others. Furthermore, unlike typical artificial neural networks, CPPNs are applied across the entire space of possible inputs so that they can represent a complete image. Since they are compositions of functions, CPPNs in effect encode images at infinite resolution and can be sampled for a particular display at whatever resolution is optimal.

Backpropagation in Neural Networks: In machine learning, backpropagation is a widely used algorithm for training feedforward artificial neural networks or other parameterized networks with differentiable nodes. It is an efficient application of the Leibniz chain rule to such networks. It is also known as the reverse mode of automatic differentiation or reverse accumulation, due to Seppo Linnainmaa (1970). The term "back-propagating error correction" was introduced in 1962 by Frank Rosenblatt, but he did not know how to implement this, although Henry J. Kelley had a continuous precursor of backpropagation already in 1960 in the context of control theory.

Adaptive Systems: Some neural networks are implemented as adaptive systems. An adaptive system is a set of interacting or interdependent entities, real or abstract, forming an integrated whole that together are able to respond to environmental changes or changes in the interacting parts, in a way analogous to either continuous physiological homeostasis or evolutionary adaptation in biology. Feedback loops represent a key feature of adaptive systems, such as ecosystems and individual organisms; or in the human world, communities, organizations, and families. Adaptive systems can be organized into a hierarchy.

Artificial adaptive systems include robots with control systems that utilize negative feedback to maintain desired states.

Section Two:   Perceptron

The perceptron was invented in 1943 by Warren McCulloch and Walter Pitts. The first implementation was a machine built in 1958 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the United States Office of Naval Research.

The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the "Mark 1 perceptron". This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the "neurons". Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors. 

Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This caused the field of neural network research to stagnate for many years, before it was recognised that a feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron).

Single-layer perceptrons are only capable of learning linearly separable patterns. For a classification task with some step activation function, a single node will have a single line dividing the data points forming the patterns. More nodes can create more dividing lines, but those lines must somehow be combined to form more complex classifications. A second layer of perceptrons, or even linear nodes, is sufficient to solve many otherwise non-separable problems.

In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for these classes of network to learn an XOR function.

Nevertheless, the often-cited Minsky/Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s.

Definition

In the modern sense, the perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input $ {\displaystyle \mathbf {x} } $ (a real-valued vector) to an output value $ {\displaystyle f(\mathbf {x} )} $ (a single binary value):

$$ {\displaystyle f(\mathbf {x} )={\begin{cases}1&{\text{if }}\ \mathbf {w} \cdot \mathbf {x} +b>0,\\0&{\text{otherwise}}\end{cases}}} $$

where $ {\displaystyle \mathbf {w} } $ is a vector of real-valued weights, $ {\displaystyle \mathbf {w} \cdot \mathbf {x} } $ is the dot product $ {\displaystyle \sum _{i=1}^{m}w_{i}x_{i}} $ , where $ {\displaystyle m} $ is the number of inputs to the perceptron, and $ {\displaystyle b} $ is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.

The value of $ {\displaystyle f(\mathbf {x} )} $ (0 or 1) is used to classify $ {\displaystyle \mathbf {x} } $ as either a positive or a negative instance, in the case of a binary classification problem. If $ {\displaystyle b} $ is negative, then the weighted combination of inputs must produce a positive value greater than $ {\displaystyle |b|} $ in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the decision boundary. The perceptron learning algorithm does not terminate if the learning set is not linearly separable. If the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean exclusive-or problem. The solution spaces of decision boundaries for all binary functions and learning behaviors have been studied in the literature.

In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network.

Chapter Four:   Deep Learning Algorithms

Section One:   Introduction

Deep learning is a class of machine learning algorithms that use multiple layers to progressively extract higher-level features from the raw input. Deep learning algorithms are highly dependent on their initial training datasets. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human, such as digits, letters or faces.

From another angle, deep learning refers to "computer-simulated" or "automated" human learning from a source (e.g., an image of dogs) to a learned object (dogs). Hence notions such as "deeper" learning or "deepest" learning make sense. The deepest learning refers to fully automatic learning from a source to a final learned object. Deeper learning thus refers to a mixed learning process: a human learning process from a source to a learned semi-object, followed by a computer learning process from the human-learned semi-object to a final learned object.

Most modern deep learning models are based on artificial neural networks, specifically convolutional neural networks (CNNs), although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.

In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation. In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn which features to optimally place in which level on its own. This does not eliminate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.

The word "deep" in "deep learning" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial Credit Assignment Path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited. No universally agreed-upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth higher than 2. CAP of depth 2 has been shown to be a universal approximator in the sense that it can emulate any function. Beyond that, more layers do not add to the function approximator ability of the network. Deep models (CAP > 2) are able to extract better features than shallow models and hence, extra layers help in learning the features effectively.

Deep learning architectures can be constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these layers of abstraction and pick out the features that improve performance.

For supervised learning tasks, deep learning methods reduce the need for feature engineering by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in the representation.

Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are deep belief networks.

Deep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference. The classic universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In 1989, the first proof was published by George Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik. Recent work also showed that universal approximation also holds for non-bounded activation functions.

The universal approximation theorem for deep neural networks concerns the capacity of networks with bounded width whose depth is allowed to grow. Lu et al. proved that if the width of a deep neural network with ReLU activation is strictly larger than the input dimension, then the network can approximate any Lebesgue-integrable function; if the width is less than or equal to the input dimension, then a deep neural network is not a universal approximator. The probabilistic interpretation derives from the field of machine learning. It features inference, as well as the optimization concepts of training and testing, related to fitting and generalization, respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function.

Section Two:   Deep Learning Algorithms

The following is a list of the ten most popular deep learning algorithms:

Part One:   Convolutional Neural Networks (CNNs)

In deep learning, a Convolutional Neural Network (CNN) is a class of artificial neural network most commonly applied to analyze visual imagery. CNNs use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers. They are specifically designed to process pixel data and are used in image recognition and processing. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.

A convolutional neural network consists of an input layer, hidden layers and an output layer. In a convolutional neural network, the hidden layers include one or more layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers.

In a CNN, the input is a tensor with shape: (number of inputs) × (input height) × (input width) × (input channels). After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape: (number of inputs) × (feature map height) × (feature map width) × (feature map channels).

Convolutional layers convolve the input and pass the result to the next layer. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs (e.g., high-resolution images), which would require massive numbers of neurons because each pixel is a relevant input feature. A fully connected layer for an image of size 100 × 100 has 10,000 weights for each neuron in the second layer. Convolution reduces the number of free parameters, allowing the network to be deeper. For example, using a 5 × 5 tiling region, each position with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradient and exploding gradient problems seen during backpropagation in earlier neural networks.
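
To make the weight-sharing argument concrete, here is a minimal Python/NumPy sketch of a single "valid" convolution (implemented, as in CNN practice, as cross-correlation). The 100 × 100 input and 5 × 5 kernel mirror the numbers above; everything else is illustrative.

    import numpy as np

    def conv2d(image, kernel):
        """'Valid' 2-D convolution (cross-correlation, as CNNs use it).
        The same kernel slides over every position of the input, so the
        layer has only kernel.size learnable weights in total."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(100, 100)
    kernel = np.random.rand(5, 5)         # 25 shared weights, versus 10,000
    feature_map = conv2d(image, kernel)   # weights per neuron in a dense layer
    print(feature_map.shape)              # (96, 96)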

To speed processing, standard convolutional layers can be replaced by depthwise separable convolutional layers, which are based on a depthwise convolution followed by a pointwise convolution. The depthwise convolution is a spatial convolution applied independently over each channel of the input tensor, while the pointwise convolution is a standard convolution restricted to the use of 1 × 1 kernels.
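
A sketch of that factorization, reusing conv2d and numpy from the previous example; the shapes and names are illustrative assumptions.

    import numpy as np

    def depthwise_separable_conv(x, depth_kernels, point_kernel):
        """x: (H, W, C_in) input tensor. depth_kernels: (k, k, C_in), one
        spatial filter per input channel. point_kernel: (C_in, C_out),
        the 1 x 1 convolution that mixes channels."""
        C = x.shape[-1]
        # Depthwise: a spatial convolution applied independently per channel.
        dw = np.stack([conv2d(x[:, :, c], depth_kernels[:, :, c])
                       for c in range(C)], axis=-1)
        # Pointwise: a standard convolution restricted to 1 x 1 kernels.
        return dw @ point_kernel          # shape (H-k+1, W-k+1, C_out)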

Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 × 2 are commonly used. Global pooling acts on all the neurons of the feature map. There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map, while average pooling takes the average value.
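
Continuing the sketch above, 2 × 2 local pooling over the feature map might look as follows; the flag selecting max versus average pooling is an illustrative convenience.

    def pool2d(fmap, size=2, mode="max"):
        """Local pooling: combine each size x size cluster of the
        feature map into a single value (max or average)."""
        H, W = fmap.shape
        tiles = (fmap[:H - H % size, :W - W % size]
                 .reshape(H // size, size, W // size, size))
        return tiles.max(axis=(1, 3)) if mode == "max" else tiles.mean(axis=(1, 3))

    print(pool2d(feature_map).shape)      # (48, 48): a 96 x 96 map, 2 x 2 pooled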

Fully connected layers connect every neuron in one layer to every neuron in another layer. It is the same as a traditional Multilayer Perceptron Neural Network (MLP). The flattened matrix goes through a fully connected layer to classify the images.

In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g., 5 by 5 neurons). In a fully connected layer, by contrast, the receptive field is the entire previous layer. Thus, in each convolutional layer, each neuron takes input from a larger area of the input than neurons in previous layers do. This is due to applying the convolution over and over, taking into account the value of a pixel as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.

To manipulate the receptive field size as desired, there are some alternatives to the standard convolutional layer. For example, atrous or dilated convolution expands the receptive field size without increasing the number of parameters by interleaving visible and blind regions. Moreover, a single dilated convolutional layer can comprise filters with multiple dilation ratios, thus having a variable receptive field size.

Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.

The vectors of weights and biases are called filters and represent particular features of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector weighting.

Part Two:   Long Short Term Memory Networks (LSTMs)

Long Short-Term Memory (LSTM) is an artificial neural network used in artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a Recurrent Neural Network (RNN) can process not only single data points (such as images), but also entire sequences of data (such as speech or video). This characteristic makes LSTM networks ideal for processing and predicting data. For example, LSTM is applicable to tasks such as character recognition, connected handwriting recognition, speech recognition, machine translation and speech detection.

The name LSTM refers to the fact that a standard RNN has both "long-term memory" and "short-term memory": the connection weights and biases in the network change once per episode of training, while the activation patterns in the network change once per time step. The LSTM architecture aims to provide a short-term memory for the RNN that can last thousands of time steps, hence "long short-term memory".

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. Forget gates decide what information to discard from a previous state by assigning a previous state, compared to a current input, a value between 0 and 1. A (rounded) value of 1 means to keep the information, and a value of 0 means to discard it. Input gates decide which pieces of new information to store in the current state, using the same system as forget gates. Output gates control which pieces of information in the current state to output by assigning a value from 0 to 1 to the information, considering the previous and current states. Selectively outputting relevant information from the current state allows the LSTM network to maintain useful, long-term dependencies to make predictions, both in current and future time-steps.

LSTM networks are well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models and other sequence learning methods in numerous applications.

In the equations below, the lowercase variables represent vectors. Matrices $ {\displaystyle W_{q}}$ and $ {\displaystyle U_{q}} $ contain, respectively, the weights of the input and recurrent connections, where the subscript $ {\displaystyle _{q}} $ can either be the input gate $ {\displaystyle i} $ , output gate $ {\displaystyle o} $ , the forget gate $ {\displaystyle f} $ or the memory cell $ {\displaystyle c} $ , depending on the activation being calculated.

LSTM with a forget gate: The compact forms of the equations for the forward pass of an LSTM cell with a forget gate are: $$ {\displaystyle {\begin{aligned}f_{t}&=\sigma _{g}(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\i_{t}&=\sigma _{g}(W_{i}x_{t}+U_{i}h_{t-1}+b_{i})\\o_{t}&=\sigma _{g}(W_{o}x_{t}+U_{o}h_{t-1}+b_{o})\\{\tilde {c}}_{t}&=\sigma _{c}(W_{c}x_{t}+U_{c}h_{t-1}+b_{c})\\c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot {\tilde {c}}_{t}\\h_{t}&=o_{t}\odot \sigma _{h}(c_{t})\end{aligned}}} $$ where the initial values are $ {\displaystyle c_{0}=0} $ and $ {\displaystyle h_{0}=0} $ and the operator $ {\displaystyle \odot } $ denotes the Hadamard product (element-wise product). The subscript $ {\displaystyle t} $ indexes the time step. A code transcription of these equations follows the definitions below.
Variables
$ {\displaystyle x_{t}\in \mathbb {R} ^{d}} $ : input vector to the LSTM unit.
$ {\displaystyle f_{t}\in {(0,1)}^{h}} $ : forget gate's activation vector
$ {\displaystyle i_{t}\in {(0,1)}^{h}} $ : input/update gate's activation vector
$ {\displaystyle o_{t}\in {(0,1)}^{h}} $ : output gate's activation vector
$ {\displaystyle h_{t}\in {(-1,1)}^{h}} $ : hidden state vector also known as output vector of the LSTM unit
$ {\displaystyle {\tilde {c}}_{t}\in {(-1,1)}^{h}} $ : cell input activation vector
$ {\displaystyle c_{t}\in \mathbb {R} ^{h}} $ : cell state vector
$ {\displaystyle W\in \mathbb {R} ^{h\times d}} $ , $ {\displaystyle U\in \mathbb {R} ^{h\times h}} $ and $ {\displaystyle b\in \mathbb {R} ^{h}} $ : weight matrices and bias vector parameters which need to be learned during training
where the superscripts $ {\displaystyle d} $ and $ {\displaystyle h} $ refer to the number of input features and number of hidden units, respectively.

Activation functions
$ {\displaystyle \sigma _{g}} $ : sigmoid function.
$ {\displaystyle \sigma _{c}} $ : hyperbolic tangent function.
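
Read literally, the forward pass above transcribes into Python/NumPy as follows. The dictionary-keyed weight matrices, the dimensions and the random initialization are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One forward step of an LSTM cell with a forget gate, following
        the equations above; W, U, b are dicts keyed by q in 'fioc'."""
        f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
        i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
        o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
        c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
        c_t = f_t * c_prev + i_t * c_tilde    # * is the Hadamard product here
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t

    d, h = 4, 3                               # input features, hidden units
    rng = np.random.default_rng(0)
    W = {q: rng.normal(size=(h, d)) for q in 'fioc'}
    U = {q: rng.normal(size=(h, h)) for q in 'fioc'}
    b = {q: np.zeros(h) for q in 'fioc'}
    h_t = c_t = np.zeros(h)                   # h_0 = 0 and c_0 = 0, as above
    for x_t in rng.normal(size=(5, d)):       # a five-step input sequence
        h_t, c_t = lstm_step(x_t, h_t, c_t, W, U, b)
    print(h_t)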

Part Three:   Recurrent Neural Networks (RNNs)

A Recurrent Neural Network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
Recurrent neural networks come in many variants.

Fully Recurrent Neural Networks (FRNN) connect the outputs of all neurons to the inputs of all neurons. This is the most general neural network topology, because all other topologies can be represented by setting some connection weights to zero to simulate the lack of connections between those neurons. Diagrams of an FRNN can be misleading: practical neural network topologies are frequently organized in "layers", and an FRNN drawn unfolded in time gives that appearance, but what appear to be layers are in fact different steps in time of the same fully recurrent network, with the recurrent connections "unfolded" to produce the appearance of layers.

An Elman Network is a three-layer network with the addition of a set of context units. The middle (hidden) layer is connected to these context units with a fixed weight of one. At each time step, the input is fed forward and a learning rule is applied. The fixed back-connections save a copy of the previous values of the hidden units in the context units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence prediction that are beyond the power of a standard multilayer perceptron.

Jordan networks are similar to Elman networks. The context units are fed from the output layer instead of the hidden layer. The context units in a Jordan network are also referred to as the state layer. They have a recurrent connection to themselves.

Elman and Jordan networks are also known as "Simple recurrent networks" (SRN).
Elman network $$ {\displaystyle {\begin{aligned}h_{t}&=\sigma _{h}(W_{h}x_{t}+U_{h}h_{t-1}+b_{h})\\y_{t}&=\sigma _{y}(W_{y}h_{t}+b_{y})\end{aligned}}} $$ Jordan network $$ {\displaystyle {\begin{aligned}h_{t}&=\sigma _{h}(W_{h}x_{t}+U_{h}y_{t-1}+b_{h})\\y_{t}&=\sigma _{y}(W_{y}h_{t}+b_{y})\end{aligned}}} $$ where the variables and functions are as follows (a code sketch follows the definitions):
$ {\displaystyle x_{t}} $ : input vector
$ {\displaystyle h_{t}} $ : hidden layer vector
$ {\displaystyle y_{t}} $ : output vector
$ {\displaystyle W, U} $ and $ {\displaystyle b} $ : parameter matrices and vector
$ {\displaystyle \sigma _{h}} $ and $ {\displaystyle \sigma _{y}} $ : Activation functions
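
A direct transcription of these update equations into Python/NumPy might read as follows, assuming tanh for $ {\displaystyle \sigma _{h}} $ and, purely for illustration, the identity for $ {\displaystyle \sigma _{y}} $.

    import numpy as np

    def elman_step(x_t, h_prev, Wh, Uh, bh, Wy, by):
        """One Elman step: the context units feed h_{t-1} back in."""
        h_t = np.tanh(Wh @ x_t + Uh @ h_prev + bh)   # sigma_h = tanh
        y_t = Wy @ h_t + by                          # sigma_y = identity (assumed)
        return h_t, y_t

    def jordan_step(x_t, y_prev, Wh, Uh, bh, Wy, by):
        """Jordan variant: the context units are fed from the output."""
        h_t = np.tanh(Wh @ x_t + Uh @ y_prev + bh)
        return h_t, Wy @ h_t + by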

The Hopfield network is an RNN in which all connections are symmetric. It requires stationary inputs and is thus not a general RNN, as it does not process sequences of patterns. However, its dynamics are guaranteed to converge. If the connections are trained using Hebbian learning, the Hopfield network can perform as a robust content-addressable memory, resistant to connection alteration.
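
A minimal sketch of Hebbian storage and content-addressable recall in Python/NumPy; the two bipolar patterns are illustrative.

    import numpy as np

    # Hebbian storage: symmetric weights with a zero diagonal.
    patterns = np.array([[1, -1, 1, -1, 1, -1],
                         [1, 1, 1, -1, -1, -1]])
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0)

    def recall(state, steps=10):
        """Asynchronous updates settle into a stored pattern, so the
        network acts as a content-addressable memory."""
        s = state.copy()
        for _ in range(steps):
            for i in np.random.permutation(len(s)):
                s[i] = 1 if W[i] @ s >= 0 else -1
        return s

    noisy = np.array([1, -1, 1, 1, 1, -1])  # first pattern, one bit flipped
    print(recall(noisy))                    # recovers [ 1 -1  1 -1  1 -1]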

Bidirectional Associative Memory (BAM): A BAM network is a variant of a Hopfield network that stores associative data as vectors. The bi-directionality comes from passing information through a matrix and its transpose. Typically, bipolar encoding is preferred to binary encoding of the associative pairs. Recently, stochastic BAM models using Markov stepping were optimized for increased network stability and relevance to real-world applications. A BAM network has two layers, either of which can be driven as an input to recall an association and produce an output on the other layer.

Echo State Network: The echo state network (ESN) has a sparsely connected random hidden layer. The weights of output neurons are the only part of the network that can change (be trained). ESNs are good at reproducing certain time series. A variant for spiking neurons is known as a liquid state machine.
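
A sketch of the idea in Python/NumPy: a fixed, sparse random reservoir whose linear readout alone is fitted, here by least squares on a toy next-step sine prediction. Sizes, densities and scaling constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    n_res = 100
    W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= rng.random((n_res, n_res)) < 0.1            # sparse random connectivity
    W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()    # spectral radius below 1

    def run_reservoir(u):
        """Collect reservoir states for input sequence u; the reservoir
        weights stay fixed and are never trained."""
        x, states = np.zeros(n_res), []
        for u_t in u:
            x = np.tanh(W_in[:, 0] * u_t + W @ x)
            states.append(x)
        return np.array(states)

    u = np.sin(np.arange(300) * 0.2)                 # teach next-step prediction
    X, y = run_reservoir(u[:-1]), u[1:]
    W_out = np.linalg.lstsq(X, y, rcond=None)[0]     # only the readout is trained
    print(np.sqrt(np.mean((X @ W_out - y) ** 2)))    # training error: small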

Independently RNN: The Independently recurrent neural network (IndRNN) addresses the gradient vanishing and exploding problems in the traditional fully connected RNN. Each neuron in one layer only receives its own past state as context information (instead of full connectivity to all other neurons in this layer) and thus neurons are independent of each other's history. The gradient backpropagation can be regulated to avoid gradient vanishing and exploding in order to keep long or short-term memory. The cross-neuron information is explored in the next layers. IndRNN can be robustly trained with the non-saturated nonlinear functions such as ReLU. Using skip connections, deep networks can be trained.

Recursive Neural Network: A recursive neural network is created by applying the same set of weights recursively over a differentiable graph-like structure by traversing the structure in topological order. Such networks are typically also trained by the reverse mode of automatic differentiation. They can process distributed representations of structure, such as logical terms. A special case of recursive neural networks is the RNN whose structure corresponds to a linear chain. Recursive neural networks have been applied to natural language processing. The Recursive Neural Tensor Network uses a tensor-based composition function for all nodes in the tree.

Neural History Compressor: The neural history compressor is an unsupervised stack of RNNs. At the input level, it learns to predict its next input from the previous inputs. Only unpredictable inputs of some RNN in the hierarchy become inputs to the next higher level RNN, which therefore recomputes its internal state only rarely. Each higher level RNN thus studies a compressed representation of the information in the RNN below. This is done such that the input sequence can be precisely reconstructed from the representation at the highest level.
The system effectively minimises the description length or the negative logarithm of the probability of the data. Given a lot of learnable predictability in the incoming data sequence, the highest level RNN can use supervised learning to easily classify even deep sequences with long intervals between important events.
It is possible to distill the RNN hierarchy into two RNNs: the "conscious" chunker (higher level) and the "subconscious" automatizer (lower level). Once the chunker has learned to predict and compress inputs that are unpredictable by the automatizer, then the automatizer can be forced in the next learning phase to predict or imitate through additional units the hidden units of the more slowly changing chunker. This makes it easy for the automatizer to learn appropriate, rarely changing memories across long intervals. In turn, this helps the automatizer to make many of its once unpredictable inputs predictable, such that the chunker can focus on the remaining unpredictable events.
A generative model partially overcame the vanishing gradient problem of automatic differentiation or backpropagation in neural networks in 1992. In 1993, such a system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.

Second-Order RNNs use higher-order weights $ {\displaystyle w{}_{ijk}} $ instead of the standard $ {\displaystyle w{}_{ij}} $ weights, and states can be a product. This allows a direct mapping to a finite-state machine in training, stability, and representation.

Part Four:   Generative Adversarial Networks (GANs)

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning, fully supervised learning, and reinforcement learning.

The core idea of a GAN is based on the "indirect" training through the discriminator, another neural network that can tell how "realistic" the input seems, which itself is also being updated dynamically. This means that the generator is not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner.

The original GAN is defined as the following game:
Each probability space $ {\displaystyle (\Omega ,\mu _{ref})}$ defines a GAN game.
There are 2 players: generator and discriminator.
The generator's strategy set is $ {\displaystyle {\mathcal {P}}(\Omega )} $ , the set of all probability measures $ {\displaystyle \mu _{G}} $ .
The discriminator's strategy set is the set of Markov kernels $ {\displaystyle \mu _{D}:\Omega \to {\mathcal {P}}[0,1]} $ , where $ {\displaystyle {\mathcal {P}}[0,1]} $ is the set of probability measures on $ {\displaystyle [0,1]} $ . The GAN game is a zero-sum game, with objective function $$ {\displaystyle L(\mu _{G},\mu _{D}):=\mathbb {E} _{x\sim \mu _{ref},y\sim \mu _{D}(x)}[\ln y]+\mathbb {E} _{x\sim \mu _{G},y\sim \mu _{D}(x)}[\ln(1-y)].} $$ The generator aims to minimize the objective, and the discriminator aims to maximize it.
The generator's task is to approach $ {\displaystyle \mu _{G}\approx \mu _{ref}} $ , that is, to match its own output distribution as closely as possible to the reference distribution. The discriminator's task is to output a value close to 1 when the input appears to be from the reference distribution, and to output a value close to 0 when the input looks like it came from the generator distribution.
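
For illustration only, this objective can be estimated by Monte-Carlo sampling. In the Python/NumPy sketch below, the one-dimensional reference and generator distributions and the fixed, deterministic toy discriminator are assumptions, not part of any training procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    def gan_objective(ref_samples, gen_samples, D):
        """Monte-Carlo estimate of
        E_{x ~ mu_ref}[ln D(x)] + E_{x ~ mu_G}[ln(1 - D(x))]."""
        return (np.mean(np.log(D(ref_samples))) +
                np.mean(np.log(1.0 - D(gen_samples))))

    x_ref = rng.normal(2.0, 0.5, 1000)              # reference data near 2
    x_gen = rng.normal(0.0, 0.5, 1000)              # generator output near 0
    D = lambda x: 1.0 / (1.0 + np.exp(-(x - 1.0)))  # a fixed toy discriminator

    # The generator would try to lower this value; the discriminator,
    # by changing D, would try to raise it.
    print(gan_objective(x_ref, x_gen, D))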

The generative network generates candidates while the discriminative network evaluates them. The contest operates in terms of data distributions. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The generative network's training objective is to increase the error rate of the discriminative network (i.e., "fool" the discriminator network by producing novel candidates that the discriminator thinks are not synthesized (are part of the true data distribution)).

A known dataset serves as the initial training data for the discriminator. Training involves presenting it with samples from the training dataset until it achieves acceptable accuracy. The generator is trained based on whether it succeeds in fooling the discriminator. Typically, the generator is seeded with randomized input that is sampled from a predefined latent space (e.g. a multivariate normal distribution). Thereafter, candidates synthesized by the generator are evaluated by the discriminator. Independent backpropagation procedures are applied to both networks so that the generator produces better samples, while the discriminator becomes more skilled at flagging synthetic samples. When used for image generation, the generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network.

Part Five:   Radial Basis Function Networks (RBFNs)

Radial basis functions are functions that have a distance criterion with respect to a center. Radial basis functions have been applied as a replacement for the sigmoidal hidden layer transfer characteristic in multi-layer perceptrons. RBF networks have two layers: In the first, input is mapped onto each RBF in the 'hidden' layer. The RBF chosen is usually a Gaussian. In regression problems the output layer is a linear combination of hidden layer values representing mean predicted output. The interpretation of this output layer value is the same as a regression model in statistics. In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability. Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics. This corresponds to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework.

RBF networks have the advantage of avoiding local minima, unlike multi-layer perceptrons. This is because the only parameters adjusted in the learning process belong to the linear mapping from hidden layer to output layer. Linearity ensures that the error surface is quadratic and therefore has a single, easily found minimum. In regression problems this minimum can be found in one matrix operation. In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively re-weighted least squares.
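
A sketch of that single-matrix-operation fit in Python/NumPy, with fixed Gaussian centres and a small ridge (shrinkage) term as discussed above; the data, basis width and centre placement are illustrative.

    import numpy as np

    def rbf_design(X, centres, width=1.0):
        """Gaussian RBF activations: one column per centre."""
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * width ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (200, 1))
    y = np.sin(X[:, 0])                          # toy regression target
    centres = np.linspace(-3, 3, 15)[:, None]    # fixed RBF centres

    # With the centres fixed, only the linear output map is learned, so
    # the fit reduces to one (ridge-regularised) matrix operation.
    Phi = rbf_design(X, centres)
    lam = 1e-6
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(centres)), Phi.T @ y)
    print(np.sqrt(np.mean((Phi @ w - y) ** 2)))  # residual error: small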

RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions. RBF centres are determined with reference to the distribution of the input data, but without reference to the prediction task. As a result, representational resources may be wasted on areas of the input space that are irrelevant to the task. A common solution is to associate each data point with its own centre, although this can expand the linear system to be solved in the final layer and requires shrinkage techniques to avoid overfitting.

Associating each input datum with an RBF leads naturally to kernel methods such as support vector machines (SVM) and Gaussian processes (the RBF is the kernel function). All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model. Like Gaussian processes, and unlike SVMs, RBF networks are typically trained in a maximum likelihood framework by maximizing the probability (minimizing the error). SVMs avoid overfitting by maximizing instead a margin. SVMs outperform RBF networks in most classification applications. In regression applications they can be competitive when the dimensionality of the input space is relatively small.

Part Six:   Multilayer Perceptrons (MLPs)

Multilayer Perceptron (MLP): An MLP is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation). Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer.

An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP utilizes a chain-rule-based supervised learning technique called backpropagation, or the reverse mode of automatic differentiation, for training. Its multiple layers and non-linear activation distinguish an MLP from a linear perceptron: it can distinguish data that are not linearly separable.

Activation Function: If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model. In MLPs some neurons use a nonlinear activation function that was developed to model the frequency of action potentials of biological neurons.
The two historically common activation functions are both sigmoids, and are described by $$ {\displaystyle y(v_{i})=\tanh(v_{i})~~{\textrm {and}}~~y(v_{i})=(1+e^{-v_{i}})^{-1}} $$ . The first is a hyperbolic tangent that ranges from -1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here $ {\displaystyle y_{i}} $ is the output of the $ {\displaystyle i}$ th node (neuron) and $ {\displaystyle v_{i}} $ is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models).
In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids.
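
Translated into code, the activation functions mentioned here are one-liners; the numerically stable logaddexp form of softplus is shown as an illustrative choice.

    import numpy as np

    def tanh_act(v):                 # ranges from -1 to 1
        return np.tanh(v)

    def logistic(v):                 # ranges from 0 to 1
        return 1.0 / (1.0 + np.exp(-v))

    def relu(v):                     # rectified linear unit
        return np.maximum(0.0, v)

    def softplus(v):                 # smooth approximation to ReLU
        return np.logaddexp(0.0, v)  # = log(1 + exp(v)), overflow-safe
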
The term "multilayer perceptron" does not refer to a single perceptron that has multiple layers. Rather, it contains many perceptrons that are organized into layers. An alternative is "multilayer perceptron network". Moreover, MLP "perceptrons" are not perceptrons in the strictest possible sense. True perceptrons are formally a special case of artificial neurons that use a threshold activation function such as the Heaviside step function. MLP perceptrons can employ arbitrary activation functions. A true perceptron performs binary classification, an MLP neuron is free to either perform classification or regression, depending upon its activation function.
The term "multilayer perceptron" has been applied without respect to the nature of the nodes/layers, which can be composed of arbitrarily defined artificial neurons, and not perceptrons specifically. This interpretation avoids the loosening of the definition of "perceptron" to mean an artificial neuron in general.

Part Seven:   Self Organizing Maps (SOMs)

Self-organizing Map (SOM) or self-organizing feature map (SOFM) is an unsupervised machine learning technique used to produce a low-dimensional (typically two-dimensional) representation of a higher dimensional data set while preserving the topological structure of the data. For example, a data set with $ {\displaystyle p} $ variables measured in $ {\displaystyle n} $ observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze.

An SOM is a type of artificial neural network but is trained using competitive learning rather than the error-correction learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. The SOM was introduced by the Finnish professor Teuvo Kohonen in the 1980s and therefore is sometimes called a Kohonen map or Kohonen network.
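
A compact sketch of competitive SOM training in Python/NumPy; the grid size, decay schedules and random data are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.random((500, 3))                  # n = 500 observations, p = 3
    grid = np.array([(i, j) for i in range(8) for j in range(8)])   # 2-D map
    W = rng.random((64, 3))                      # one weight vector per map node

    for t in range(2000):                        # competitive, not error-correction
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((W - x) ** 2).sum(1))   # best-matching unit "wins"
        lr = 0.5 * np.exp(-t / 1000)             # decaying learning rate
        sigma = 3.0 * np.exp(-t / 1000)          # decaying neighbourhood radius
        d2 = ((grid - grid[bmu]) ** 2).sum(1)    # distances on the map grid
        h = np.exp(-d2 / (2 * sigma ** 2))       # neighbourhood function
        W += lr * h[:, None] * (x - W)           # pull neighbourhood toward x

    # Each row of W is now a prototype; nearby grid nodes hold similar
    # prototypes, preserving the topology of the input data.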

Part Eight:   Deep Belief Networks (DBNs)

Deep Belief Network: A DBN is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer.

When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors. After this learning step, a DBN can be further trained with supervision to perform classification.

DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as the visible layer for the next. An RBM is an undirected, generative energy-based model with a "visible" input layer and a hidden layer and connections between but not within layers. This composition leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer is a training set).

Part Nine:   Restricted Boltzmann Machines( RBMs)

Restricted Boltzmann Machine: An RBM is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.

RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction, classification, collaborative filtering, feature learning and topic modelling. They can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes from each of the two groups of units (commonly referred to as the "visible" and "hidden" units respectively) may have a symmetric connection between them; and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.

Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.
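
A minimal CD-1 sketch for a binary RBM in Python/NumPy. The single training pattern, sizes and learning rate are illustrative; stacking such RBMs, as described above, yields a DBN.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0, W, b_v, b_h, lr=0.1):
        """One contrastive-divergence (CD-1) step for a binary RBM:
        sample the hiddens, reconstruct the visibles, resample, and
        nudge the weights toward the data statistics."""
        p_h0 = sigmoid(v0 @ W + b_h)                  # hiddens given data
        h0 = (rng.random(p_h0.shape) < p_h0) * 1.0
        p_v1 = sigmoid(h0 @ W.T + b_v)                # reconstruction
        v1 = (rng.random(p_v1.shape) < p_v1) * 1.0
        p_h1 = sigmoid(v1 @ W + b_h)
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_v += lr * (v0 - v1)
        b_h += lr * (p_h0 - p_h1)

    n_v, n_h = 6, 3            # bipartite: no visible-visible or hidden-hidden links
    W = 0.1 * rng.normal(size=(n_v, n_h))
    b_v, b_h = np.zeros(n_v), np.zeros(n_h)
    v = np.array([1, 1, 1, 0, 0, 0], dtype=float)     # one training pattern
    for _ in range(1000):
        cd1_update(v, W, b_v, b_h)
    p_h = sigmoid(v @ W + b_h)                        # mean-field reconstruction
    print(sigmoid(p_h @ W.T + b_v).round(2))          # close to the pattern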

Part Ten:   Autoencoders

Autoencoder: An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

Variants exist that aim to force the learned representations to assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which are useful as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection and learning the meaning of words. Autoencoders can also act as generative models that randomly generate new data similar to the input (training) data.
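
As a sketch, the encoder/decoder pair can be made concrete with a tiny linear autoencoder trained by gradient descent on synthetic rank-2 data; all shapes, rates and data below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data with 2 underlying degrees of freedom in 4 dimensions.
    Z = rng.normal(size=(500, 2))
    X = Z @ rng.normal(size=(2, 4))

    W_e = 0.1 * rng.normal(size=(4, 2))    # encoding function: input -> code
    W_d = 0.1 * rng.normal(size=(2, 4))    # decoding function: code -> input

    for _ in range(5000):                  # minimise reconstruction error
        code = X @ W_e
        err = code @ W_d - X               # reconstruction minus input
        g_d = code.T @ err / len(X)        # gradient for the decoder
        g_e = X.T @ (err @ W_d.T) / len(X) # gradient for the encoder
        W_d -= 0.01 * g_d
        W_e -= 0.01 * g_e

    # The 2-D code suffices: reconstruction error approaches zero
    # because the data really have only two degrees of freedom.
    print(np.sqrt(np.mean((X @ W_e @ W_d - X) ** 2)))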

Chapter Five:   AI-Background

Section One:   Perceptron

In the modern sense, the perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input $ {\displaystyle \mathbf {x} } $ (a real-valued vector) to an output value $ {\displaystyle f(\mathbf {x} )} $ (a single binary value): $$ {\displaystyle f(\mathbf {x} )={\begin{cases}1&{\text{if }}\ \mathbf {w} \cdot \mathbf {x} +b>0,\\0&{\text{otherwise}}\end{cases}}} $$ where $ {\displaystyle \mathbf {w} } $ is a vector of real-valued weights, $ {\displaystyle \mathbf {w} \cdot \mathbf {x} } $ is the dot product $ {\displaystyle \sum _{i=1}^{m}w_{i}x_{i}} $ , where $ {\displaystyle \mathbf {m} } $ is the number of inputs to the perceptron, and $ {\displaystyle \mathbf {b} } $ is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.
The value of $ {\displaystyle f(\mathbf {x} )} $ (0 or 1) is used to classify $ {\displaystyle \mathbf {x} } $ as either a positive or a negative instance, in the case of a binary classification problem. If $ {\displaystyle \mathbf {b} } $ is negative, then the weighted combination of inputs must produce a positive value greater than $ {\displaystyle |b|} $ in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the decision boundary. The perceptron learning algorithm does not terminate if the learning set is not linearly separable. If the vectors are not linearly separable learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean exclusive-or problem.
In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network.

Learning algorithm

Below is an example of a learning algorithm for a single-layer perceptron. For multilayer perceptrons, where a hidden layer exists, more sophisticated algorithms such as backpropagation must be used. If the activation function or the underlying process being modeled by the perceptron is nonlinear, alternative learning algorithms such as the delta rule can be used as long as the activation function is differentiable. Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions.
When multiple perceptrons are combined in an artificial neural network, each output neuron operates independently of all the others; thus, learning each output can be considered in isolation.

Definitions

We first define some variables:
$ {\displaystyle \mathbf {r} } $ is the learning rate of the perceptron. Learning rate is between 0 and 1. Larger values make the weight changes more volatile.
$ {\displaystyle y=f(\mathbf {z} )} $ denotes the output from the perceptron for an input vector $ {\displaystyle \mathbf {z} } $ .
$ {\displaystyle D=\{(\mathbf {x} _{1},d_{1}),\dots ,(\mathbf {x} _{s},d_{s})\}} $ is the training set of $ {\displaystyle s} $ samples, where:
$ {\displaystyle \mathbf {x} _{j}} $ is the $ {\displaystyle n} $ -dimensional input vector.
$ {\displaystyle d_{j}} $ is the desired output value of the perceptron for that input.

We show the values of the features as follows:
$ {\displaystyle x_{j,i}} $ is the value of the $ {\displaystyle i} $ th feature of the $ {\displaystyle j} $ th training input vector.
$ {\displaystyle x_{j,0}=1} $ .
To represent the weights:
$ {\displaystyle w_{i}} $ is the $ {\displaystyle i} $ th value in the weight vector, to be multiplied by the value of the $ {\displaystyle i} $ th input feature.
Because $ {\displaystyle x_{j,0}=1} $ , the $ {\displaystyle w_{0}} $ is effectively a bias that we use instead of the bias constant $ {\displaystyle b} $ .
To show the time-dependence of $ {\displaystyle \mathbf {w} } $ , we use:
$ {\displaystyle w_{i}(t)} $ is the weight $ {\displaystyle i} $ at time $ {\displaystyle t} $ .

Steps

1.   Initialize the weights. Weights may be initialized to 0 or to a small random value. In the example below, we use 0.
2.   For each example $ {\displaystyle j} $ in our training set $ {\displaystyle D} $ , perform the following steps over the input $ {\displaystyle \mathbf {x} _{j}} $ and desired output $ {\displaystyle d_{j}} $ :
2a.   Calculate the actual output: $$ {\displaystyle {\begin{aligned}y_{j}(t)&=f[\mathbf {w} (t)\cdot \mathbf {x} _{j}]\\&=f[w_{0}(t)x_{j,0}+w_{1}(t)x_{j,1}+w_{2}(t)x_{j,2}+\dotsb +w_{n}(t)x_{j,n}]\end{aligned}}} $$
2b.   Update the weights: $ {\displaystyle w_{i}(t+1)=w_{i}(t)+r\cdot (d_{j}-y_{j}(t))x_{j,i}} $ , for all features $ {\displaystyle 0\leq i\leq n} $ , where $ {\displaystyle r} $ is the learning rate.
For offline learning, the second step may be repeated until the iteration error $ {\displaystyle {\frac {1}{s}}\sum _{j=1}^{s}|d_{j}-y_{j}(t)|} $ is less than a user-specified error threshold $ {\displaystyle \gamma } $ , or a predetermined number of iterations have been completed, where $ {\displaystyle s} $ is again the size of the sample set.
The algorithm updates the weights immediately after steps 2a and 2b are applied to each pair in the training set, rather than waiting until all pairs in the training set have undergone these steps.
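The steps above translate directly into code. The following is a minimal Python sketch; the convention of folding the bias into $ w_{0} $ via $ x_{j,0}=1 $ follows the definitions above, while the function name and the sample data are our own illustrative choices:

def train_perceptron(D, n, r=0.1, epochs=100):
    """Train on D = [(x_j, d_j), ...], where each x_j already has x_j[0] = 1."""
    w = [0.0] * (n + 1)                    # step 1: initialize weights to 0
    for _ in range(epochs):                # repeat for offline learning
        for x, d in D:                     # step 2: for each training pair
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0    # step 2a
            w = [wi + r * (d - y) * xi for wi, xi in zip(w, x)]         # step 2b
    return w

# Learning the AND function (linearly separable, so the algorithm converges):
D = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
print(train_perceptron(D, n=2))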

Section Two:   Multilayer Perceptron

A Multilayer Perceptron (MLP) is a fully connected class of Feedforward Artificial Neural Network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons with threshold activation.
Multilayer perceptrons are sometimes referred to as "vanilla" neural networks, especially when they have a single hidden layer.
An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a chain-rule-based supervised learning technique called backpropagation, or reverse-mode automatic differentiation, for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

Components of MLP

Part One:   Activation Function

If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model. In MLPs some neurons use a nonlinear activation function that was developed to model the frequency of action potentials, or firing, of biological neurons.
The two historically common activation functions are both sigmoids, and are described by $$ {\displaystyle y(v_{i})=\tanh(v_{i})~~{\textrm {and}}~~y(v_{i})=(1+e^{-v_{i}})^{-1}} $$ The first is a hyperbolic tangent that ranges from -1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here $ {\displaystyle y_{i}} $ is the output of the $ {\displaystyle i}$ th node (neuron) and $ {\displaystyle v_{i}}$ is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions.
In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids.
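For reference, these activation functions are one-liners in Python (a sketch using only the standard library; the function names are illustrative):

import math

def tanh_act(v):       # hyperbolic tangent, ranges from -1 to 1
    return math.tanh(v)

def logistic(v):       # logistic sigmoid, ranges from 0 to 1
    return 1.0 / (1.0 + math.exp(-v))

def relu(v):           # rectified linear unit
    return max(0.0, v)

def softplus(v):       # smooth approximation to the rectifier
    return math.log(1.0 + math.exp(v))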

Part Two:   Layers

The MLP consists of three or more layers (an input and an output layer with one or more hidden layers) of nonlinearly-activating nodes. Since MLPs are fully connected, each node in one layer connects with a certain weight $ {\displaystyle w_{ij}} $ to every node in the following layer.
The first type of layer is the Dense Layer, also called the fully-connected layer, and is used for abstract representations of input data. In this layer, neurons connect to every neuron in the preceding layer. In multilayer perceptron networks, these layers are stacked together.
The Convolutional layer is typically used for image analysis tasks. In this layer, the network detects edges, textures, and patterns. The outputs from this layer are then fed into a fully-connected layer for further processing.
The Pooling Layer is used to reduce the size of data input.
The Recurrent Layer is used for text processing with a memory function. Similar to the Convolutional layer, the outputs of recurrent layers are usually fed into a fully-connected layer for further processing.
The Normalization Layer adjusts the output data from previous layers to achieve a regular distribution. This results in improved scalability and model training.

Part Three:   Learning

Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron.
We can represent the degree of error in an output node $ {\displaystyle j} $ in the $ {\displaystyle n} $ th data point (training example) by $ {\displaystyle e_{j}(n)=d_{j}(n)-y_{j}(n)} $ , where $ {\displaystyle d_{j}(n)} $ is the desired target value for the $ {\displaystyle n} $ th data point at node $ {\displaystyle j} $ , and $ {\displaystyle y_{j}(n)} $ is the value produced by the perceptron at node $ {\displaystyle j} $ when the $ {\displaystyle n} $ th data point is given as an input.
The node weights can then be adjusted based on corrections that minimize the error in the entire output for the $ {\displaystyle n} $ th data point, given by
$$ {\displaystyle {\mathcal {E}}(n)={\frac {1}{2}}\sum _{{\text{output node }}j}e_{j}^{2}(n)} $$ . Using gradient descent, the change in each weight $ {\displaystyle w_{ij}} $ is $$ {\displaystyle \Delta w_{ji}(n)=-\eta {\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}y_{i}(n)} $$ where $ {\displaystyle y_{i}(n)} $ is the output of the previous neuron $ {\displaystyle i} $ , and $ {\displaystyle \eta } $ is the learning rate, which is selected to ensure that the weights quickly converge to a response, without oscillations. In the previous expression, $ {\displaystyle {\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}} $ denotes the partial derivative of the error $ {\displaystyle {\mathcal {E}}(n)} $ with respect to the weighted sum $ {\displaystyle v_{j}(n)} $ of the input connections of neuron $ {\displaystyle j} $ .
The derivative to be calculated depends on the induced local field $ {\displaystyle v_{j}} $ , which itself varies. It can be shown that for an output node this derivative can be simplified to,
$$ {\displaystyle -{\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}=e_{j}(n)\phi ^{\prime }(v_{j}(n))} $$ where $ {\displaystyle \phi ^{\prime }} $ is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is $$ {\displaystyle -{\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}=\phi ^{\prime }(v_{j}(n))\sum _{k}-{\frac {\partial {\mathcal {E}}(n)}{\partial v_{k}(n)}}w_{kj}(n)} $$ . This depends on the change in weights of the $ {\displaystyle k} $ th nodes, which represent the output layer. So to change the hidden layer weights, the output layer weights change according to the derivative of the activation function, and so this algorithm represents a backpropagation of the activation function.
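The two derivative formulas above map directly onto code. Below is a minimal numpy sketch of one gradient-descent update for a network with a single hidden layer and logistic activations, for which $ \phi '(v)=y(1-y) $ ; all names (backprop_step, W1, W2, eta) are our own illustrative choices, not a standard API:

import numpy as np

def backprop_step(x, d, W1, W2, eta=0.5):
    """One weight update for a 1-hidden-layer MLP with logistic activations."""
    phi = lambda v: 1.0 / (1.0 + np.exp(-v))    # logistic activation
    y1 = phi(W1 @ x)                            # hidden-layer outputs
    y2 = phi(W2 @ y1)                           # output-layer outputs
    e = d - y2                                  # error e_j(n) at the output nodes
    delta2 = e * y2 * (1 - y2)                  # output node: e_j * phi'(v_j)
    delta1 = y1 * (1 - y1) * (W2.T @ delta2)    # hidden node: phi'(v_j) * sum_k delta_k * w_kj
    W2 = W2 + eta * np.outer(delta2, y1)        # delta w_ji = eta * delta_j * y_i
    W1 = W1 + eta * np.outer(delta1, x)
    return W1, W2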

Part Four:   Terminology

The term "multilayer perceptron" does not refer to a single perceptron that has multiple layers. Rather, it contains many perceptrons that are organized into layers. An alternative is "multilayer perceptron network". Moreover, MLP "perceptrons" are not perceptrons in the strictest possible sense. True perceptrons are formally a special case of artificial neurons that use a threshold activation function such as the Heaviside step function. MLP perceptrons can employ arbitrary activation functions. A true perceptron performs binary classification, an MLP neuron is free to either perform classification or regression, depending upon its activation function.

The term "multilayer perceptron" later was applied without respect to nature of the nodes/layers, which can be composed of arbitrarily defined artificial neurons, and not perceptrons specifically. This interpretation avoids the loosening of the definition of "perceptron" to mean an artificial neuron in general.

Section Three:   Feed-Forward Networks

A Feedforward Neural Network (FNN) is an artificial neural network wherein connections between the nodes do not form a cycle. As such, it is different from its descendant: recurrent neural networks.

The feedforward neural network was the first and simplest type of artificial neural network devised. In this network, the information moves in only one direction—forward—from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.

Part One:   Linear Neural Network

The simplest kind of feedforward neural network is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node. The mean squared errors between these calculated outputs and the given target values are minimized by adjusting the weights. This technique has been known for over two centuries as the method of least squares or linear regression. It was used as a means of finding a good rough linear fit to a set of points by Legendre (1805) and Gauss (1795) for the prediction of planetary movement.
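Because the error surface of a linear network is quadratic, the optimal weights can be found in closed form. A short numpy sketch of such a least-squares fit (the data values are made up for illustration):

import numpy as np

# Fit weights w minimizing the squared error ||Xw - t||^2, where each row of X
# is an input vector (the first entry 1 acts as a constant bias input) and t
# holds the target values.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.1, 1.9, 4.1])
w, *_ = np.linalg.lstsq(X, t, rcond=None)   # classical least squares
print(w)   # approximately [0.033, 2.0]: intercept and slope of the best-fit line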

Part Two:   Single-layer Perceptron

The single-layer perceptron combines a linear neural network with a threshold function. If the output value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically −1). Neurons with this kind of activation function are often called linear threshold units. In the literature the term perceptron often refers to networks consisting of just one of these units. Similar "neurons" were described in physics by Ernst Ising and Wilhelm Lenz for the Ising model in the 1920s,[8] and by Warren McCulloch and Walter Pitts in the 1940s.
A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two.
Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between calculated output and sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent.
Single-layer perceptrons are only capable of learning linearly separable patterns; in 1969 in a famous monograph titled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function. Nonetheless, it was known that multi-layer perceptrons (MLPs) are capable of producing any possible boolean function.
Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [−1,1].
A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function: $$ {\displaystyle f(x)={\frac {1}{1+e^{-x}}}} $$ With this choice, the single-layer network is identical to the logistic regression model, widely used in statistical modeling. The logistic function is one of the family of functions called sigmoid functions because their S-shaped graphs resemble the final-letter lower case of the Greek letter Sigma. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated: $$ {\displaystyle f'(x)=f(x)(1-f(x)).} $$ (The fact that $ {\displaystyle f} $ satisfies the differential equation above can easily be shown by applying the chain rule.)
If the activation function of a single-layer neural network is taken modulo 1, then the network can solve the XOR problem with a single neuron:

$$ {\displaystyle {\begin{aligned}f(x)&=x\mod 1\\f'(x)&=1\end{aligned}}} $$
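This claim can be checked numerically. In the sketch below (our own construction, with hand-picked weights of 0.5 on both inputs) the single mod-1 neuron produces 0 for the inputs (0,0) and (1,1) and 0.5 for the mixed inputs, which a simple threshold turns into XOR:

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = (0.5 * x1 + 0.5 * x2) % 1               # activation f(x) = x mod 1
    print(x1, x2, '->', 1 if y > 0.25 else 0)   # prints the XOR truth table: 0, 1, 1, 0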

Section Four:   Back Propagation

Chapter Six:    Statistical Methods

Section One:   Introduction

Statistical Analysis is a branch of mathematics that concerns the collection, organization, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

Statistical methods are at the core of the Artificial Intelligence movement. These methods are central to the construction of the computational models associated with Neural Networks.

Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation). Descriptive statistics are most often concerned with two sets of properties of a distribution: central tendency seeks to characterize the distribution's central or typical value, while dispersion characterizes the extent to which members of the distribution depart from its center and each other. Inferences on mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena.

A standard statistical procedure involves the collection of data leading to a test of the relationship between two statistical data sets, or a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (null hypothesis is falsely rejected giving a "false positive") and Type II errors (null hypothesis fails to be rejected and an actual relationship between populations is missed giving a "false negative"). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.
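As a small concrete instance of this procedure, the sketch below tests the null hypothesis that two samples share the same mean, assuming SciPy is available; the data are invented for illustration:

from scipy import stats

sample_a = [5.1, 4.9, 5.3, 5.0, 5.2]
sample_b = [5.8, 6.0, 5.7, 6.1, 5.9]
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
# A small p-value (say, below 0.05) leads us to reject the null hypothesis of
# equal means. Rejecting a true null would be a Type I error; failing to
# reject a false null would be a Type II error.
print(t_stat, p_value)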

Statistical measurement processes are also prone to error with regard to the data that they generate. Many of these errors are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also occur. The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems.

Part One:   Data Collection: Sampling

When full census data cannot be collected, statisticians collect sample data by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting through statistical models.

To use a sample as a guide to an entire population, it is important that it truly represents the overall population. Representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent that the sample chosen is actually representative. Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods of experimental design that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.

Sampling theory is part of the mathematical discipline of probability theory. Probability is used in mathematical statistics to study the sampling distributions of sample statistics and, more generally, the properties of statistical procedures. The use of any statistical method is valid when the system or population under consideration satisfies the assumptions of the method. The difference in point of view between classic probability theory and sampling theory is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction—inductively inferring from samples to the parameters of a larger or total population.

Part Two:   Statistical Data Type and Levels of Measurement

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one (injective) transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating-point arithmetic. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.

Other categorizations have been proposed. For example, Mosteller and Tukey distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder described continuous counts, continuous ratios, count ratios, and categorical modes of data.

The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer."

The primary goal of statistical thermodynamics is to derive the classical thermodynamics of materials in terms of the properties of their constituent particles and the interactions between them. In other words, statistical thermodynamics provides a connection between the macroscopic properties of materials in thermodynamic equilibrium, and the microscopic behaviours and motions occurring inside the material.

Whereas statistical mechanics proper involves dynamics, here the attention is focussed on statistical equilibrium (steady state). Statistical equilibrium does not mean that the particles have stopped moving (mechanical equilibrium), rather, only that the ensemble is not evolving.

Fundamental postulate: A sufficient (but not necessary) condition for statistical equilibrium with an isolated system is that the probability distribution is a function only of conserved properties (total energy, total particle numbers, etc.). There are many different equilibrium ensembles that can be considered, and only some of them correspond to thermodynamics. Additional postulates are necessary to motivate why the ensemble for a given system should have one form or another.

A common approach found in many textbooks is to take the equal a priori probability postulate. This postulate states that:

For an isolated system with an exactly known energy and exactly known composition, the system can be found with equal probability in any microstate consistent with that knowledge.

The equal a priori probability postulate therefore provides a motivation for the microcanonical ensemble described below. There are various arguments in favour of the equal a priori probability postulate:

Ergodic hypothesis: An ergodic system is one that evolves over time to explore "all accessible" states: all those with the same energy and composition. In an ergodic system, the microcanonical ensemble is the only possible equilibrium ensemble with fixed energy. This approach has limited applicability, since most systems are not ergodic.

Principle of indifference: In the absence of any further information, we can only assign equal probabilities to each compatible situation.

Maximum information entropy: A more elaborate version of the principle of indifference states that the correct ensemble is the ensemble that is compatible with the known information and that has the largest Gibbs entropy (information entropy).

Other fundamental postulates for statistical mechanics have also been proposed. For example, recent studies show that the theory of statistical mechanics can be built without the equal a priori probability postulate. One such formalism is based on the fundamental thermodynamic relation together with a further set of postulates.

APPENDIX ONE:   MATHEMATICAL PRELIMINARIES

This appendix provides the reader with the minimum mathematical background required for a basic understanding of the mathematical structures underlying Newtonian mechanics, the mechanical theories of Hamilton and Lagrange, the electromagnetic theory of Maxwell, the electrical circuit theory of Kirchhoff, and the Quantum Theory of Schrödinger. We will be using the Harmonic Oscillator Model as a minimum requirement for the study of Power Systems and Nuclear Reactor Theory. We therefore adopt a practical (Engineering) view of model building, which implies: all models are a lie, but some models are useful! We will be comparing the analysis of the harmonic oscillator in many representations.

Section One:   Sets and Functions

Part One: Sets

A set is the mathematical model for a collection of different objects or things; a set contains elements or members, which can be mathematical objects of any kind: numbers, symbols, points in space, lines, other geometrical shapes, variables, or even other sets. The set with no element is the empty set; a set with a single element is a singleton. A set may have a finite number of elements or be an infinite set. Two sets are equal if they have precisely the same elements. Sets form the basis of modern mathematics and provide a common starting point for many areas of science and Engineering. The use of Boolean Logic in the layout of computer chips is strongly influenced by set theory.

Set-builder notation can be used to describe a set that is defined by a predicate, that is, a logical formula that evaluates to true for an element of the set, and false otherwise. In this form, set-builder notation has three parts: a variable, a colon or vertical bar separator, and a predicate. Thus there is a variable on the left of the separator, and a rule on the right of it. These three parts are contained in curly brackets: $$ {\displaystyle \{x\mid \Phi (x)\}} $$ or $$ {\displaystyle \{x:\Phi (x)\}.} $$

The vertical bar (or colon) is a separator that can be read as "such that", "for which", or "with the property that". The formula $ {\displaystyle \Phi (x)} $ is said to be the rule or the predicate. All values of $ {\displaystyle x} $ for which the predicate holds (is true) belong to the set being defined. All values of $ {\displaystyle x} $ for which the predicate does not hold do not belong to the set. Thus $ {\displaystyle \{x\mid \Phi (x)\}} $ is the set of all values of $ {\displaystyle x} $ that satisfy the formula $ {\displaystyle \Phi} $ . It may be the empty set, if no value of $ {\displaystyle x} $ satisfies the formula.
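Set-builder notation has a direct computational analogue in Python's set comprehensions, which may help fix the idea (a sketch over a finite universe of our own choosing):

# { x | 0 <= x < 20 and x is even }, built over the finite universe range(20):
evens = {x for x in range(20) if x % 2 == 0}
print(evens)            # {0, 2, 4, 6, 8, 10, 12, 14, 16, 18}
empty = {x for x in range(20) if x > 100}   # the predicate never holds
print(empty == set())   # True: this is the empty set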

Part Two: Introduction to Functions

A function from a set X to a set Y is an assignment of an element of Y to each element of X. The set X is called the domain of the function and the set Y is called the codomain of the function.

A function, its domain, and its codomain, are declared by the notation f: X→Y, and the value of a function f at an element x of X, denoted by f(x), is called the image of x under f, or the value of f applied to the argument x. Functions are also called maps or mappings.

Two functions f and g are equal if their domain and codomain sets are the same and their output values agree on the whole domain. More formally, given f: X → Y and g: X → Y, we have f = g if and only if f(x) = g(x) for all x ∈ X. The range or image of a function is the set of the images of all elements in the domain.

Part Three: Important Functions

1.   Trigonometric Functions:

Trigonometric functions (also called circular functions) are real functions which relate an angle of a right-angled triangle to ratios of two side lengths. They are widely used in Engineering and form the basis for the analysis of circulation models of the atmosphere. They are the simplest periodic functions, and as such are widely used in the study of electrodynamic phenomena through Fourier analysis.

The trigonometric functions most widely used in modern mathematics are the sine, the cosine, and the tangent. Their reciprocals are respectively the cosecant, the secant, and the cotangent, which are less used. Each of these six trigonometric functions has a corresponding inverse function, and an analog among the hyperbolic functions.

The oldest definitions of trigonometric functions, related to right-angle triangles, define them only for acute angles. To extend the sine and cosine functions to functions whose domain is the whole real line, geometrical definitions using the standard unit circle (i.e., a circle with radius 1 unit) are often used; then the domain of the other functions is the real line with some isolated points removed. Modern definitions express trigonometric functions as infinite series or as solutions of differential equations. This allows extending the domain of sine and cosine functions to the whole complex plane, and the domain of the other trigonometric functions to the complex plane with some isolated points removed.

$$ {\displaystyle \sin \theta ={\frac {\mathrm {opposite} }{\mathrm {hypotenuse} }}}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, {\displaystyle \csc \theta ={\frac {\mathrm {hypotenuse} }{\mathrm {opposite} }}}$$ $$ {\displaystyle \cos \theta ={\frac {\mathrm {adjacent} }{\mathrm {hypotenuse} }}} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, {\displaystyle \sec \theta ={\frac {\mathrm {hypotenuse} }{\mathrm {adjacent} }}}$$ $$ {\displaystyle \tan \theta ={\frac {\mathrm {opposite} }{\mathrm {adjacent} }}}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, {\displaystyle \cot \theta ={\frac {\mathrm {adjacent} }{\mathrm {opposite} }}}$$


2.   The Logarithmic Function:

In mathematics, the logarithm is the inverse function to exponentiation. That means the logarithm of a given number x is the exponent to which another fixed number, the base b, must be raised to produce that number x. In the simplest case, the logarithm counts the number of occurrences of the same factor in repeated multiplication; e.g., since $ {\displaystyle 1000=10\times 10\times 10=10^{3}} $ , the "logarithm base 10" of 1000 is 3, or $ {\displaystyle \log _{10}(1000)=3} $ . The logarithm of x to base b is denoted as $ {\displaystyle \log _{b}(x)} $ .

The logarithm base 10 (that is b = 10) is called the decimal or common logarithm and is commonly used in engineering and Computer Science. The natural logarithm has the number e (that is b ≈ 2.718) as its base; its use is widespread in Engineering, mathematics and physics, because of its simpler integral and derivative. The binary logarithm uses base 2 (that is b = 2) and is frequently used in computer science.

Logarithms were introduced by John Napier in 1614 as a means of simplifying calculations. They were rapidly adopted by navigators, scientists, engineers, surveyors and others to perform high-accuracy computations more easily. Using logarithm tables, tedious multi-digit multiplication steps can be replaced by table look-ups and simpler addition. This is possible because of the fact that the logarithm of a product is the sum of the logarithms of the factors: $$ {\displaystyle \log _{b}(xy)=\log _{b}x+\log _{b}y.}$$ provided that b, x and y are all positive and b ≠ 1. The slide rule, also based on logarithms, allows quick calculations without tables, but at lower precision. The present-day notion of logarithms comes from Leonhard Euler, who connected them to the exponential function in the 18th century, and who also introduced the letter e as the base of natural logarithms.
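A quick numerical check of the product rule, with illustrative values:

import math

x, y = 8.0, 125.0
lhs = math.log10(x * y)               # logarithm of the product
rhs = math.log10(x) + math.log10(y)   # sum of the logarithms
print(lhs, rhs)                       # both print 3.0, since 8 * 125 = 1000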

Logarithmic scales reduce wide-ranging quantities to smaller scopes. For example, the decibel (dB) is a unit used to express ratios as logarithms, mostly for signal power and amplitude (of which sound pressure is a common example). Logarithms are commonplace in Engineering and in measurements of the complexity of algorithms.

3.   The Exponential Function:

The exponential function is a mathematical function denoted by $ {\displaystyle f(x)=\exp(x)}$ or $ {\displaystyle e^{x}}$ (where the argument x is written as an exponent). Unless otherwise specified, the term generally refers to the positive-valued function of a real variable, although it can be extended to the complex numbers or generalized to other mathematical objects like matrices or Lie algebras. The exponential function originated from the notion of exponentiation (repeated multiplication), but modern definitions (there are several equivalent characterizations) allow it to be rigorously extended to all real arguments, including irrational numbers. Its applications in Engineering qualify the exponential function as one of the most important functions in mathematics.

The exponential function satisfies the exponentiation identity: $ {\displaystyle e^{x+y}=e^{x}e^{y}{\text{ for all }}x,y\in \mathbb {R} ,}$ which, along with the definition $ {\displaystyle e=\exp(1)}$ , shows that $ {\displaystyle e^{n}=\underbrace {e\times \cdots \times e} _{n{\text{ factors}}}}$ for positive integers n, and relates the exponential function to the elementary notion of exponentiation. The base of the exponential function, its value at 1,$ {\displaystyle e=\exp(1)}$, is a ubiquitous mathematical constant called Euler's number.

While other continuous nonzero functions $ {\displaystyle f:\mathbb {R} \to \mathbb {R} } $ that satisfy the exponentiation identity are also known as exponential functions, the exponential function exp is the unique real-valued function of a real variable whose derivative is itself and whose value at 0 is 1; that is, $ {\displaystyle \exp '(x)=\exp(x)} $ for all real x, and $ {\displaystyle \exp(0)=1.}$ Thus, exp is sometimes called the natural exponential function to distinguish it from these other exponential functions, which are the functions of the form $ {\displaystyle f(x)=ab^{x},} $ where the base b is a positive real number. The relation $ {\displaystyle b^{x}=e^{x\ln b}}$ for positive b and real or complex x establishes a strong relationship between these functions, which explains this ambiguous terminology.

The real exponential function can also be defined as a power series. This power series definition is readily extended to complex arguments to allow the complex exponential function $ {\displaystyle \exp :\mathbb {C} \to \mathbb {C} }$ to be defined. The complex exponential function takes on all complex values except for 0.
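Both the exponentiation identity and the power-series definition are easy to verify numerically (a sketch with illustrative values):

import math

x, y = 1.5, 2.5
print(math.exp(x + y), math.exp(x) * math.exp(y))      # both equal e**4
# Partial sums of the power series sum over n of x**n / n! approach exp(x):
s = sum(x ** n / math.factorial(n) for n in range(20))
print(s, math.exp(x))                                  # nearly identical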

4.   The Dirac Delta Function:

The delta function was introduced by physicist Paul Dirac as a tool for the normalization of state vectors. It also has uses in probability theory and signal processing. Its validity was disputed until Laurent Schwartz developed the theory of distributions where it is defined as a linear form acting on functions. Joseph Fourier presented what is now called the Fourier integral theorem in his treatise "Théorie analytique de la chaleur" in the form: $$ {\displaystyle f(x)={\frac {1}{2\pi }}\int _{-\infty }^{\infty }\ \ d\alpha \,f(\alpha )\ \int _{-\infty }^{\infty }dp\ \cos(px-p\alpha )\ ,}$$ which is tantamount to the introduction of the $ {\displaystyle \delta}$ -function in the form: $$ {\displaystyle \delta (x-\alpha )={\frac {1}{2\pi }}\int _{-\infty }^{\infty }dp\ \cos(px-p\alpha )\ .}$$ Augustin Cauchy expressed the theorem using exponentials: $$ {\displaystyle f(x)={\frac {1}{2\pi }}\int _{-\infty }^{\infty }\ e^{ipx}\left(\int _{-\infty }^{\infty }e^{-ip\alpha }f(\alpha )\,d\alpha \right)\,dp.} $$ Cauchy pointed out that in some circumstances the order of integration is significant in this result. As justified using the theory of distributions, the Cauchy equation can be rearranged to resemble Fourier's original formulation and expose the $ {\displaystyle \delta}$ -function as: $$ {\displaystyle {\begin{aligned}f(x)&={\frac {1}{2\pi }}\int _{-\infty }^{\infty }e^{ipx}\left(\int _{-\infty }^{\infty }e^{-ip\alpha }f(\alpha )\,d\alpha \right)\,dp\\[4pt]&={\frac {1}{2\pi }}\int _{-\infty }^{\infty }\left(\int _{-\infty }^{\infty }e^{ipx}e^{-ip\alpha }\,dp\right)f(\alpha )\,d\alpha =\int _{-\infty }^{\infty }\delta (x-\alpha )f(\alpha )\,d\alpha ,\end{aligned}}}$$ where the $ {\displaystyle \delta}$ -function is expressed as $$ {\displaystyle \delta (x-\alpha )={\frac {1}{2\pi }}\int _{-\infty }^{\infty }e^{ip(x-\alpha )}\,dp\ .}$$

Section Two:   Complex Numbers

A complex number is an element of a number system that extends the real numbers with a specific element denoted i, called the imaginary unit and satisfying the equation $ {\displaystyle i^{2}= -1}$ ; every complex number can be expressed in the form $ {\displaystyle a +bi}$ , where a and b are real numbers. Because no real number satisfies the above equation, i was called an imaginary number by René Descartes. For the complex number $ {\displaystyle a +bi}$ , a is called the real part and b is called the imaginary part. The set of complex numbers is denoted by either of the symbols $ {\displaystyle \mathbb {C} } $ or C.

Complex numbers allow solutions to all polynomial equations, even those that have no solutions in real numbers. More precisely, the fundamental theorem of algebra asserts that every non-constant polynomial equation with real or complex coefficients has a solution which is a complex number. For example, the equation $ {\displaystyle (x+1)^{2}=-9} $ has no real solution, since the square of a real number cannot be negative, but has the two nonreal complex solutions $ {\displaystyle -1+3i} $ and $ {\displaystyle -1-3i} $ .

Addition, subtraction and multiplication of complex numbers can be naturally defined by using the rule $ {\displaystyle i^{2}= -1}$ combined with the associative, commutative and distributive laws. Every nonzero complex number has a multiplicative inverse. This makes the complex numbers a field that has the real numbers as a subfield.

Cartesian complex plane

The complex numbers can be viewed as a Cartesian plane, called the complex plane. This allows a geometric interpretation of the complex numbers and their operations, and conversely expressing in terms of complex numbers some geometric properties and constructions. For example, the real numbers form the real line which is identified to the horizontal axis of the complex plane. The complex numbers of absolute value one form the unit circle. The addition of a complex number is a translation in the complex plane, and the multiplication by a complex number is a similarity centered at the origin. The complex conjugation is the reflection symmetry with respect to the real axis. The complex absolute value is a Euclidean norm.

The definition of the complex numbers involving two arbitrary real values immediately suggests the use of Cartesian coordinates in the complex plane. The horizontal (real) axis is generally used to display the real part, with increasing values to the right, and the imaginary part marks the vertical (imaginary) axis, with increasing values upwards.

A number plotted in the complex plane may be viewed either as the coordinatized point or as a position vector from the origin to this point. The coordinate values of a complex number z can hence be expressed in its Cartesian, rectangular, or algebraic form.

Notably, the operations of addition and multiplication take on a very natural geometric character, when complex numbers are viewed as position vectors: addition corresponds to vector addition, while multiplication corresponds to multiplying their magnitudes and adding the angles they make with the real axis. Viewed in this way, the multiplication of a complex number by i corresponds to rotating the position vector counterclockwise by a quarter turn (90°) about the origin—a fact which can be expressed algebraically as follows:

$$ {\displaystyle (a+bi)\cdot i=ai+b(i)^{2}=-b+ai.} $$
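Python supports complex numbers natively, so the quarter-turn rotation can be verified directly (a minimal sketch):

z = 3 + 4j
print(z * 1j)      # (-4+3j): the point (3, 4) rotated 90 degrees to (-4, 3)
print(1j ** 2)     # (-1+0j): i squared is -1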

Polar complex plane

Modulus and argument

An alternative option for coordinates in the complex plane is the polar coordinate system, which uses the distance of the point z from the origin (O) and the angle subtended between the positive real axis and the line segment Oz in a counterclockwise sense. This leads to the polar form: $$ {\displaystyle z=re^{i\varphi }=r(\cos \varphi +i\sin \varphi )} $$ of a complex number, where r is the absolute value of z and $ {\displaystyle \varphi } $ is the argument of z.

The absolute value (or modulus or magnitude) of a complex number $ {z = x + yi} $ is:

$$ {\displaystyle r=|z|={\sqrt {x^{2}+y^{2}}}.} $$

If z is a real number (that is, if y = 0), then r = |x|. That is, the absolute value of a real number equals its absolute value as a complex number. By Pythagoras' theorem, the absolute value of a complex number is the distance to the origin of the point representing the complex number in the complex plane.

The argument of z (in many applications referred to as the "phase" φ) is the angle of the radius Oz with the positive real axis, and is written as arg z. As with the modulus, the argument can be found from the rectangular form x + yi—by applying the inverse tangent to the quotient of imaginary-by-real parts. By using a half-angle identity, a single branch of the arctan suffices to cover the range (−π, π) of the arg-function, and avoids a more subtle case-by-case analysis:

$$ {\displaystyle \varphi =\arg(x+yi)={\begin{cases}2\arctan \left({\dfrac {y}{{\sqrt {x^{2}+y^{2}}}+x}}\right)&{\text{if }}y\neq 0{\text{ or }}x>0,\\\pi &{\text{if }}x<0{\text{ and }}y=0,\\{\text{undefined}}&{\text{if }}x=0{\text{ and }}y=0.\end{cases}}} $$

Normally, as given above, the principal value in the interval (−π, π] is chosen. If the arg value is negative, values in the range (−π, π] or [0, 2π) can be obtained by adding 2π. The value of $ \varphi $ is expressed in radians in this article. It can increase by any integer multiple of $ {\displaystyle 2\pi } $ and still give the same angle, viewed as subtended by the positive real axis and the ray from the origin through z. Hence, the arg function is sometimes considered as multivalued. The polar angle for the complex number 0 is indeterminate, but an arbitrary choice of the polar angle 0 is common.

The value of φ equals the result of atan2: $$ {\displaystyle \varphi =\operatorname {atan2} \left(\operatorname {Im} (z),\operatorname {Re} (z)\right).} $$

Together, r and φ give another way of representing complex numbers, the polar form, as the combination of modulus and argument fully specify the position of a point on the plane. Recovering the original rectangular coordinates from the polar form is done by the formula called the trigonometric form: $$ {\displaystyle z=r(\cos \varphi +i\sin \varphi ).} $$

Using Euler's formula this can be written as: $$ {\displaystyle z=re^{i\varphi }{\text{ or }}z=r\exp i\varphi .} $$

Using the cis function, this is sometimes abbreviated to: $$ {\displaystyle z=r\operatorname {\mathrm {cis} } \varphi .} $$

In angle notation, often used in electronics to represent a phasor with amplitude r and phase φ, it is written as[13]: $$ {\displaystyle z=r\angle \varphi .} $$
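The standard library's cmath module converts between the rectangular and polar forms described above (a sketch):

import cmath, math

z = 1 + 1j
r, phi = cmath.polar(z)             # modulus and principal argument
print(r, phi)                       # 1.414... and 0.785... (sqrt(2) and pi/4)
print(cmath.rect(r, phi))           # back to approximately (1+1j)
print(math.atan2(z.imag, z.real))   # the argument via atan2, as above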

Section Three:   Calculus

Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape, and algebra is the study of generalizations of arithmetic operations.

It has two major branches, differential calculus and integral calculus; differential calculus concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.

Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz. Later work, including codifying the idea of limits, put these developments on a more solid conceptual footing. Today, calculus has widespread uses in Engineering and some sciences.

Part One: Differential Calculus

In mathematics, differential calculus is the portion of calculus that studies the rates at which quantities change. The primary object of study in differential calculus is the derivative of a function. The derivative of a function at a point describes the rate of change of the function near that point. The derivative is then simply the slope of the tangent line at the point of contact.

Even though the tangent line only touches a single point at the point of tangency, it can be approximated by a line that goes through two points. This is known as a secant line. If the two points that the secant line goes through are close together, then the secant line closely resembles the tangent line, and, as a result, its slope is also very similar. The advantage of using a secant line is that its slope can be calculated directly. Consider the two points on the graph $ {\displaystyle (x,f(x))}$ and $ {\displaystyle (x+\Delta x,f(x+\Delta x))}$ where $ {\displaystyle \Delta x}$ is a small number. As before, the slope of the line passing through these two points can be calculated with the formula $ {\displaystyle {\text{slope }}={\frac {\Delta y}{\Delta x}}}$ . This gives: $$ {\displaystyle {\text{slope}}={\frac {f(x+\Delta x)-f(x)}{\Delta x}}}$$ As $ {\displaystyle \Delta x}$ gets closer and closer to ${\displaystyle 0}$ , the slope of the secant line gets closer and closer to the slope of the tangent line. This is formally written as

$$ {\displaystyle \lim _{\Delta x\to 0}{\frac {f(x+\Delta x)-f(x)}{\Delta x}}}$$ The expression above means 'as $ {\displaystyle \Delta x} $ gets closer and closer to 0, the slope of the secant line gets closer and closer to a certain value'. The value that is being approached is the derivative of $ {\displaystyle f(x)}$ ; this can be written as $ {\displaystyle f'(x)} $ . If $ {\displaystyle y=f(x)}$ , the derivative can also be written as$ {\displaystyle {\frac {dy}{dx}}}$ , with $ {\displaystyle d} $ representing an infinitesimal change. For example, $ {\displaystyle dx} $ represents an infinitesimal change in x. In summary, if $ {\displaystyle y=f(x)}$ , then the derivative of $ {\displaystyle f(x)} $ is: $$ {\displaystyle {\frac {dy}{dx}}=f'(x)=\lim _{\Delta x\to 0}{\frac {f(x+\Delta x)-f(x)}{\Delta x}}} $$ provided such a limit exists. We have thus succeeded in properly defining the derivative of a function, meaning that the 'slope of the tangent line' now has a precise mathematical meaning. Differentiating a function using the above definition is known as differentiation from first principles. The following is the long version of differentiation from first principles, that the derivative of $ {\displaystyle y=x^{2}}$ is $ {\displaystyle 2x}$ :

$$ {\displaystyle {\begin{aligned}{\frac {dy}{dx}}&=\lim _{\Delta x\to 0}{\frac {f(x+\Delta x)-f(x)}{\Delta x}}\\&=\lim _{\Delta x\to 0}{\frac {(x+\Delta x)^{2}-x^{2}}{\Delta x}}\\&=\lim _{\Delta x\to 0}{\frac {x^{2}+2x\Delta x+(\Delta x)^{2}-x^{2}}{\Delta x}}\\&=\lim _{\Delta x\to 0}{\frac {2x\Delta x+(\Delta x)^{2}}{\Delta x}}\\&=\lim _{\Delta x\to 0}(2x+\Delta x)\\&=2x\end{aligned}}}$$
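The limit definition also lends itself to a numerical check: as $ {\displaystyle \Delta x} $ shrinks, the secant slope approaches the derivative $ {\displaystyle 2x} $ (a sketch; the names are illustrative):

def secant_slope(f, x, dx):
    return (f(x + dx) - f(x)) / dx   # slope of the secant through x and x + dx

f = lambda x: x ** 2
for dx in [0.1, 0.01, 0.001]:
    print(secant_slope(f, 3.0, dx))  # 6.1, 6.01, 6.001: approaching f'(3) = 6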

The process of finding a derivative is called differentiation. Geometrically, the derivative at a point is the slope of the tangent line to the graph of the function at that point, provided that the derivative exists and is defined at that point. For a real-valued function of a single real variable, the derivative of a function at a point generally determines the best linear approximation to the function at that point.

Differential calculus and integral calculus are connected by the fundamental theorem of calculus, which states that differentiation is the reverse process to integration.

Differentiation has applications in nearly all quantitative disciplines. In Engineering, the derivative of the displacement of a moving body with respect to time is the velocity of the body, and the derivative of the velocity with respect to time is acceleration. The derivative of the momentum of a body with respect to time equals the force applied to the body. Derivatives are frequently used to find the maxima and minima of a function. Equations involving derivatives are called differential equations and are fundamental in modeling natural phenomena. Derivatives and their generalizations appear in many fields of mathematics, such as complex analysis, functional analysis, differential geometry, measure theory, and abstract algebra.

Part Two: Integral Calculus

In mathematics, an integral assigns numbers to functions in a way that describes displacement, area, volume, and other concepts that arise by combining infinitesimal data. The process of finding integrals is called integration. Along with differentiation, integration is a fundamental, essential operation of calculus, and serves as a tool to solve problems in Engineering and physics involving the area of an arbitrary shape, the length of a curve, and the volume of a solid.

The integrals enumerated here are those termed definite integrals, which can be interpreted as the signed area of the region in the plane that is bounded by the graph of a given function between two points in the real line. Conventionally, areas above the horizontal axis of the plane are positive while areas below are negative. Integrals also refer to the concept of an antiderivative, a function whose derivative is the given function. In this case, they are called indefinite integrals. The fundamental theorem of calculus relates definite integrals with differentiation and provides a method to compute the definite integral of a function when its antiderivative is known.

Although methods of calculating areas and volumes dated from ancient Greek mathematics, the principles of integration were formulated independently by Isaac Newton and Gottfried Wilhelm Leibniz in the late 17th century, who thought of the area under a curve as an infinite sum of rectangles of infinitesimal width. Bernhard Riemann later gave a rigorous definition of integrals, which is based on a limiting procedure that approximates the area of a curvilinear region by breaking the region into infinitesimally thin vertical slabs. In the early 20th century, Henri Lebesgue generalized Riemann's formulation by introducing what is now referred to as the Lebesgue integral; it is more robust than Riemann's in the sense that a wider class of functions are Lebesgue-integrable.

Integrals may be generalized depending on the type of the function as well as the domain over which the integration is performed. For example, a line integral is defined for functions of two or more variables, and the interval of integration is replaced by a curve connecting the two endpoints of the interval. In a surface integral, the curve is replaced by a piece of a surface in three-dimensional space.

Terminology and notation

In general, the integral of a real-valued function $ {\displaystyle f(x)}$ , with respect to a real variable $ {\displaystyle x}$ , on an interval $ {\displaystyle [a,b]}$ , is written as $$ {\displaystyle \int _{a}^{b}f(x)\,dx.} $$ The integral sign $ {\displaystyle \int} $ represents integration. The symbol dx, called the differential of the variable x, indicates that the variable of integration is $ {\displaystyle x}$ . The function $ {\displaystyle f(x)}$ is called the integrand, the points $ {\displaystyle a}$ and $ {\displaystyle b}$ are called the limits (or bounds) of integration, and the integral is said to be over the interval $ {\displaystyle [a, b]}$ , called the interval of integration. A function is said to be integrable if its integral over its domain is finite. If limits are specified, the integral is called a definite integral. When the limits are omitted, as in: $$ {\displaystyle \int f(x)\,dx,} $$

The integral is called an indefinite integral, which represents a class of functions (the antiderivative) whose derivative is the integrand. The fundamental theorem of calculus relates the evaluation of definite integrals to indefinite integrals. There are several extensions of the notation for integrals to encompass integration on unbounded domains and/or in multiple dimensions.

Integrals appear in many practical situations. For instance, from the length, width and depth of a swimming pool which is rectangular with a flat bottom, one can determine the volume of water it can contain, the area of its surface, and the length of its edge. But if it is oval with a rounded bottom, integrals are required to find exact and rigorous values for these quantities. In each case, one may divide the sought quantity into infinitely many infinitesimal pieces, then sum the pieces to achieve an accurate approximation. For example, to find the area of the region bounded by the graph of the function $ {\displaystyle f(x)={\sqrt {x}}} $ between $ x=0 $ and $ x=1 $ , one can partition the interval into five pieces $ (0,1/5,2/5,\ldots ,1) $ , then fill a rectangle using the right-end height of each piece $ ({\sqrt {1/5}},{\sqrt {2/5}},\ldots ,{\sqrt {1}}) $ and sum their areas to get an approximation of: $$ {\displaystyle \textstyle {\sqrt {\frac {1}{5}}}\left({\frac {1}{5}}-0\right)+{\sqrt {\frac {2}{5}}}\left({\frac {2}{5}}-{\frac {1}{5}}\right)+\cdots +{\sqrt {\frac {5}{5}}}\left({\frac {5}{5}}-{\frac {4}{5}}\right)\approx 0.7497,} $$ which is larger than the exact value. Alternatively, when replacing these subintervals by ones with the left-end height of each piece, the approximation one gets is too low: with twelve such subintervals the approximated area is only 0.6203. However, when the number of pieces increases to infinity, it will reach a limit which is the exact value of the area sought (in this case, 2/3). One writes: $$ {\displaystyle \int _{0}^{1}{\sqrt {x}}\,dx={\frac {2}{3}},} $$ which means 2/3 is the result of a weighted sum of function values, $ {\displaystyle {\sqrt {x}}} $ , multiplied by the infinitesimal step widths, denoted by $ {\displaystyle dx} $ , on the interval $ {\displaystyle [0,1]} $ . There are many ways of formally defining an integral, not all of which are equivalent. The differences exist mostly to deal with differing special cases which may not be integrable under other definitions. The most commonly used definitions are Riemann integrals and Lebesgue integrals.
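The right- and left-endpoint approximations quoted above are easy to reproduce (a Python sketch; the function names are our own):

def right_sum(f, a, b, n):
    h = (b - a) / n
    return sum(f(a + i * h) * h for i in range(1, n + 1))

def left_sum(f, a, b, n):
    h = (b - a) / n
    return sum(f(a + i * h) * h for i in range(n))

f = lambda x: x ** 0.5
print(right_sum(f, 0.0, 1.0, 5))    # about 0.7497, above the exact value 2/3
print(left_sum(f, 0.0, 1.0, 12))    # about 0.6203, below the exact value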

Riemann Integral

The Riemann integral is defined in terms of Riemann sums of functions with respect to tagged partitions of an interval. A tagged partition of a closed interval $ {\displaystyle [a,b]} $ on the real line is a finite sequence: $$ {\displaystyle a=x_{0}\leq t_{1}\leq x_{1}\leq t_{2}\leq x_{2}\leq \cdots \leq x_{n-1}\leq t_{n}\leq x_{n}=b.\,\!} $$ This partitions the interval $ {\displaystyle [a,b]} $ into $ n $ sub-intervals $ {\displaystyle [x_{i-1},x_{i}]} $ indexed by $ i $ , each of which is "tagged" with a distinguished point $ {\displaystyle t_{i}\in [x_{i-1},x_{i}]} $ . A Riemann sum of a function $ f $ with respect to such a tagged partition is defined as

$$ {\displaystyle \sum _{i=1}^{n}f(t_{i})\,\Delta _{i};} $$

Thus each term of the sum is the area of a rectangle with height equal to the function value at the distinguished point of the given sub-interval, and width the same as the width of the sub-interval, $ {\displaystyle \Delta _{i}=x_{i}-x_{i-1}} $ . The mesh of such a tagged partition is the width of the largest sub-interval formed by the partition, $ {\displaystyle \max _{i=1,\dots ,n}\Delta _{i}} $ . The Riemann integral of a function $ {\displaystyle f} $ over the interval $ {\displaystyle [a,b]} $ is equal to $ {\displaystyle S} $ if:

For all $ {\displaystyle \varepsilon >0} $ there exists $ {\displaystyle \delta >0}$ such that, for any tagged partition of $ {\displaystyle [a,b]} $ with mesh less than $ {\displaystyle \delta } $ , $$ {\displaystyle \left|S-\sum _{i=1}^{n}f(t_{i})\,\Delta _{i}\right|<\varepsilon .} $$ When the chosen tags give the maximum (respectively, minimum) value of each interval, the Riemann sum becomes an upper (respectively, lower) Darboux sum, suggesting the close connection between the Riemann integral and the Darboux integral.

Lebesgue Integral

It is often of interest, both in theory and applications, to be able to pass to the limit under the integral. For instance, a sequence of functions can frequently be constructed that approximate, in a suitable sense, the solution to a problem. Then the integral of the solution function should be the limit of the integrals of the approximations. However, many functions that can be obtained as limits are not Riemann-integrable, and so such limit theorems do not hold with the Riemann integral. Therefore, it is of great importance to have a definition of the integral that allows a wider class of functions to be integrated.

Such an integral is the Lebesgue integral, that exploits the following fact to enlarge the class of integrable functions: if the values of a function are rearranged over the domain, the integral of a function should remain the same. Thus Henri Lebesgue introduced the integral bearing his name, explaining this integral thus in a letter to Paul Montel:

I have to pay a certain sum, which I have collected in my pocket. I take the bills and coins out of my pocket and give them to the creditor in the order I find them until I have reached the total sum. This is the Riemann integral. But I can proceed differently. After I have taken all the money out of my pocket I order the bills and coins according to identical values and then I pay the several heaps one after the other to the creditor. This is my integral.

As Folland puts it, "to compute the Riemann integral of $ {\displaystyle f} $ , one partitions the domain $ {\displaystyle [a, b]} $ into subintervals", while in the Lebesgue integral, "one is in effect partitioning the range of $ {\displaystyle f} $ ". The definition of the Lebesgue integral thus begins with a measure, $ {\displaystyle \mu} $ . In the simplest case, the Lebesgue measure $ {\displaystyle \mu (A)} $ of an interval $ {\displaystyle A = [a, b]} $ is its width, $ {\displaystyle b-a} $ , so that the Lebesgue integral agrees with the (proper) Riemann integral when both exist. In more complicated cases, the sets being measured can be highly fragmented, with no continuity and no resemblance to intervals.

Using the "partitioning the range of $ {\displaystyle f} $ " philosophy, the integral of a non-negative function $ {\displaystyle f:\mathbb {R} \to \mathbb {R} } $ should be the sum over $ {\displaystyle t} $ of the areas of the thin horizontal strips between $ {\displaystyle y=t} $ and $ {\displaystyle y=t+dt} $ . This area is just $ {\displaystyle \mu \{x:f(x)>t\}\,dt} $ . Let $ {\displaystyle f^{*}(t)=\mu \{x:f(x)>t\}} $ . The Lebesgue integral of $ {\displaystyle f} $ is then defined by: $$ {\displaystyle \int f=\int _{0}^{\infty }f^{*}(t)\,dt} $$ where the integral on the right is an ordinary improper Riemann integral. For a suitable class of functions (the measurable functions) this defines the Lebesgue integral. A general measurable function $ {\displaystyle f} $ is Lebesgue-integrable if the sum of the absolute values of the areas of the regions between the graph of $ {\displaystyle f} $ and the x-axis is finite: $$ {\displaystyle \int _{E}|f|\,d\mu <+\infty .}$$ In that case, the integral is, as in the Riemannian case, the difference between the area above the $ {\displaystyle x} $ -axis and the area below the $ {\displaystyle x} $ -axis: $$ {\displaystyle \int _{E}f\,d\mu =\int _{E}f^{+}\,d\mu -\int _{E}f^{-}\,d\mu } $$ where $$ {\displaystyle {\begin{alignedat}{3}&f^{+}(x)&&{}={}\max\{f(x),0\}&&{}={}{\begin{cases}f(x),&{\text{if }}f(x)>0,\\0,&{\text{otherwise,}}\end{cases}}\\&f^{-}(x)&&{}={}\max\{-f(x),0\}&&{}={}{\begin{cases}-f(x),&{\text{if }}f(x)<0,\\0,&{\text{otherwise.}}\end{cases}}\end{alignedat}}}$$
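
As a concrete check, this "layer cake" recipe reproduces the value obtained earlier with Riemann sums. Take $ {\displaystyle f(x)={\sqrt {x}}} $ on $ {\displaystyle [0,1]} $ (extended by zero elsewhere). The set where the function exceeds a level $ {\displaystyle t} $ is an interval whose Lebesgue measure is elementary: $$ {\displaystyle f^{*}(t)=\mu \{x:{\sqrt {x}}>t\}=\mu \left((t^{2},1]\right)=1-t^{2}\quad {\text{for }}0\leq t<1,} $$ and $ {\displaystyle f^{*}(t)=0} $ for $ {\displaystyle t\geq 1} $ , so $$ {\displaystyle \int f=\int _{0}^{1}\left(1-t^{2}\right)\,dt=1-{\frac {1}{3}}={\frac {2}{3}},} $$ in agreement with the Riemann integral computed before.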

Section Four:   Fourier Series

A Fourier series is a sum that represents a periodic function as a sum of sine and cosine waves. The frequency of each wave in the sum, or harmonic, is an integer multiple of the periodic function's fundamental frequency. Each harmonic's phase and amplitude can be determined using harmonic analysis. A Fourier series may potentially contain an infinite number of harmonics. Summing part of but not all the harmonics in a function's Fourier series produces an approximation to that function.
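
In the classical real form that Fourier used, a function $ {\displaystyle f} $ with period $ {\displaystyle P} $ is expanded as $$ {\displaystyle f(x)\sim {\frac {a_{0}}{2}}+\sum _{n=1}^{\infty }\left(a_{n}\cos {\frac {2\pi nx}{P}}+b_{n}\sin {\frac {2\pi nx}{P}}\right),} $$ where the amplitudes of the harmonics are recovered by the integrals $$ {\displaystyle a_{n}={\frac {2}{P}}\int _{P}f(x)\cos {\frac {2\pi nx}{P}}\,dx,\qquad b_{n}={\frac {2}{P}}\int _{P}f(x)\sin {\frac {2\pi nx}{P}}\,dx,} $$ the integrals being taken over any interval of length $ {\displaystyle P} $ .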

Almost any periodic function can be represented by a Fourier series that converges. Convergence of Fourier series means that as more and more harmonics from the series are summed, each successive partial Fourier series sum approximates the function better, and equals the function in the limit as the number of harmonics grows without bound.
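
A minimal numerical sketch of this convergence, in Python (the square-wave example and the root-mean-square error metric are illustrative choices, not taken from the text): the square wave $ {\displaystyle \operatorname {sgn} (\sin x)} $ has the classical Fourier series $ {\displaystyle (4/\pi )\sum _{n\,{\text{odd}}}\sin(nx)/n} $ , and the error of the partial sums shrinks as more harmonics are included.

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
    square = np.sign(np.sin(x))                      # the target periodic function

    for top in (1, 3, 9, 33, 129):
        # Partial Fourier sum: only odd harmonics appear for a square wave.
        partial = sum(4 / (np.pi * n) * np.sin(n * x) for n in range(1, top + 1, 2))
        rms = np.sqrt(np.mean((partial - square) ** 2))
        print(top, round(rms, 4))                    # root-mean-square error decreases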

Fourier series can only represent functions that are periodic. However, non-periodic functions can be handled using an extension of the Fourier Series called the Fourier Transform which treats non-periodic functions as periodic with infinite period. This transform thus can generate frequency domain representations of non-periodic functions as well as periodic functions, allowing a waveform to be converted between its time domain representation and its frequency domain representation.

Since Fourier's time, many different approaches to defining and understanding the concept of Fourier series have been discovered, all of which are consistent with one another, but each of which emphasizes different aspects of the topic. Some of the more powerful and elegant approaches are based on mathematical ideas and tools that were not available in Fourier's time. Fourier originally defined the Fourier series for real-valued functions of real arguments, and used the sine and cosine functions as the basis set for the decomposition. Many other Fourier-related transforms have since been defined, extending his initial idea to many applications and birthing an area of mathematics called Fourier analysis.

Section Five:   Fourier Transforms

A Fourier transform (FT) is a mathematical transform that decomposes functions depending on space or time into functions depending on spatial frequency or temporal frequency. That process is also called analysis. A premier engineering application is the decomposition of the waveforms of the electrical signals used in communications technology. The term Fourier transform refers to both the frequency domain representation and the mathematical operation that associates the frequency domain representation to a function of space or time.

The Fourier transform of a function is a complex-valued function representing the complex sinusoids that comprise the original function. For each frequency, the magnitude (absolute value) of the complex value represents the amplitude of a constituent complex sinusoid with that frequency, and the argument of the complex value represents that complex sinusoid's phase offset. The Fourier transform is not limited to functions of time, but the domain of the original function is commonly referred to as the time domain. The Fourier inversion theorem provides a synthesis process that recreates the original function from its frequency domain representation.

Functions that are localized in the time domain have Fourier transforms that are spread out across the frequency domain and vice versa, a phenomenon known as the uncertainty principle. The critical case for this principle is the Gaussian function, of substantial importance in probability theory and statistics as well as in the study of physical phenomena exhibiting normal distribution (e.g., diffusion). The Fourier transform of a Gaussian function is another Gaussian function. Joseph Fourier introduced the transform in his study of heat transfer, where Gaussian functions appear as solutions of the heat equation.

The Fourier transform can be formally defined as an improper Riemann integral, making it an integral transform, although this definition is not suitable for many applications requiring a more sophisticated integration theory. The most important example of a function requiring a sophisticated integration theory is the Dirac delta function.

The Fourier transform can also be generalized to functions of several variables on Euclidean space, sending a function of 3-dimensional 'position space' to a function of 3-dimensional momentum (or a function of space and time to a function of 4-momentum). This idea makes the spatial Fourier transform very natural in the study of waves, as well as in quantum mechanics, where it is important to be able to represent wave solutions as functions of either position or momentum and sometimes both. In general, functions to which Fourier methods are applicable are complex-valued, and possibly vector-valued. Still further generalization is possible to functions on groups, which, besides the original Fourier transform on R or Rn (viewed as groups under addition), notably includes the discrete-time Fourier transform (DTFT, group = Z), the discrete Fourier transform (DFT, group = Z mod N) and the Fourier series or circular Fourier transform (group = S1, the unit circle ≈ closed finite interval with endpoints identified). The latter is routinely employed to handle periodic functions. The fast Fourier transform (FFT) is an algorithm for computing the DFT.

There are several common conventions for defining the Fourier transform of an integrable function $ {\displaystyle f:\mathbb {R} \to \mathbb {C} } $ . One of them is the Fourier transform integral: $$ {\displaystyle {\hat {f}}(\xi )=\int _{-\infty }^{\infty }f(x)\ e^{-i2\pi \xi x}\,dx,\quad \forall \ \xi \in \mathbb {R} .} $$
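
Numerically, this integral is usually approximated with the fast Fourier transform mentioned above. The following Python sketch (the grid sizes are illustrative; the shift bookkeeping is one standard way to line a DFT up with the continuous convention just given) approximates the transform of the Gaussian $ {\displaystyle f(x)=e^{-\pi x^{2}}} $ and confirms the earlier claim that the Fourier transform of a Gaussian is another Gaussian; in this normalization $ {\displaystyle {\hat {f}}(\xi )=e^{-\pi \xi ^{2}}} $ .

    import numpy as np

    N, dx = 4096, 0.01
    x = (np.arange(N) - N // 2) * dx           # symmetric sample grid around x = 0
    f = np.exp(-np.pi * x**2)

    # Discretize f_hat(xi) = integral of f(x) exp(-i 2 pi xi x) dx:
    # ifftshift puts x = 0 at index 0, fft performs the DFT,
    # dx supplies the integration measure, fftshift reorders the frequencies.
    f_hat = dx * np.fft.fftshift(np.fft.fft(np.fft.ifftshift(f)))
    xi = np.fft.fftshift(np.fft.fftfreq(N, d=dx))

    print(np.max(np.abs(f_hat - np.exp(-np.pi * xi**2))))   # tiny: a Gaussian again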

Section Six:   Differential Equations

In mathematics, a differential equation is an equation that relates one or more unknown functions and their derivatives. In applications, the functions generally represent physical quantities, the derivatives represent their rates of change, and the differential equation defines a relationship between the two. Such relations are common; therefore, differential equations play a prominent role in many disciplines including engineering and physics.

The language of operators allows a compact writing for differential equations: if $$ {\displaystyle L=a_{0}(x)+a_{1}(x){\frac {d}{dx}}+\cdots +a_{n}(x){\frac {d^{n}}{dx^{n}}},}$$ is a linear differential operator, then the equation: $$ {\displaystyle a_{0}(x)y+a_{1}(x)y'+a_{2}(x)y''+\cdots +a_{n}(x)y^{(n)}=b(x)} $$ may be rewritten as: $$ {\displaystyle Ly=b(x).} $$

The study of differential equations consists mainly of the study of their solutions (the set of functions that satisfy each equation) and of the properties of their solutions. Only the simplest differential equations are solvable by explicit formulas; however, many properties of solutions of a given differential equation may be determined without computing them exactly.

Since closed-form solutions to differential equations are seldom available, Engineers have become experts at the numerical solution of differential equations using computers. The theory of dynamical systems puts emphasis on qualitative analysis of systems described by differential equations, while many numerical methods have been developed to determine solutions with a given degree of accuracy.
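
As a minimal sketch of such a numerical solution (using scipy's general-purpose initial-value solver; the particular equation and tolerances are illustrative), the simple harmonic oscillator $ {\displaystyle y''+y=0} $ with $ {\displaystyle y(0)=1} $ , $ {\displaystyle y'(0)=0} $ is rewritten as a first-order system, integrated, and compared against the exact solution $ {\displaystyle y(t)=\cos t} $ .

    import numpy as np
    from scipy.integrate import solve_ivp

    def oscillator(t, state):
        y, v = state              # rewrite y'' + y = 0 as y' = v, v' = -y
        return [v, -y]

    t = np.linspace(0, 10, 101)
    sol = solve_ivp(oscillator, (0, 10), [1.0, 0.0], t_eval=t, rtol=1e-8, atol=1e-8)

    print(np.max(np.abs(sol.y[0] - np.cos(t))))   # small: matches cos(t) to the tolerance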

Differential equations can be divided into several types. Apart from describing the properties of the equation itself, these classes of differential equations can help inform the choice of approach to a solution. Commonly used distinctions include whether the equation is ordinary or partial, linear or non-linear, and homogeneous or heterogeneous. This list is far from exhaustive; there are many other properties and subclasses of differential equations which can be very useful in specific contexts.

Part One: Ordinary differential equation and Linear differential equation

A linear differential equation is a differential equation that is defined by a linear polynomial in the unknown function and its derivatives, that is, an equation of the form: $$ {\displaystyle a_{0}(x)y+a_{1}(x)y'+a_{2}(x)y''+\cdots +a_{n}(x)y^{(n)}+b(x)=0,}$$ where $ {\displaystyle a_{0}(x),\ldots ,a_{n}(x)} $ and $ {\displaystyle b(x)} $ are given functions of $ {\displaystyle x} $ .

An ordinary differential equation is an equation containing an unknown function of one real or complex variable x, its derivatives, and some given functions of x. The unknown function is generally represented by a variable (often denoted y), which, therefore, depends on x. Thus x is often called the independent variable of the equation. The term "ordinary" is used in contrast with the term partial differential equation, which may be with respect to more than one independent variable. In general, the solutions of a differential equation cannot be expressed by a closed-form expression and therefore numerical methods are commonly used for solving differential equations on a computer.

Part Two:    Partial differential equations

A partial differential equation is a differential equation that contains unknown multivariable functions and their partial derivatives. Partial Differential Equations are used to formulate problems involving functions of several variables, and are either solved in closed form, or used to create a relevant computer model.

Partial Differential Equations are used to develop a wide variety of models. They describe many phenomena in nature, such as sound, heat, electrostatics, electrodynamics, fluid flow, elasticity, and quantum mechanics. These seemingly distinct physical phenomena can be formalized similarly in terms of PDEs. Just as ordinary differential equations often model one-dimensional dynamical systems, partial differential equations often model multidimensional systems. Stochastic partial differential equations generalize partial differential equations for modeling randomness.
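
For instance, the one-dimensional heat (diffusion) equation $ {\displaystyle u_{t}=\alpha u_{xx}} $ can be integrated with an explicit finite-difference scheme. The Python sketch below is illustrative (the grid sizes, diffusivity, and initial temperature profile are arbitrary choices); the time step is chosen to respect the stability condition $ {\displaystyle \alpha \,\Delta t/\Delta x^{2}\leq 1/2} $ of this scheme.

    import numpy as np

    alpha, nx, nt = 1.0, 101, 2000
    dx = 1.0 / (nx - 1)
    dt = 0.4 * dx**2 / alpha                 # keeps r = alpha*dt/dx^2 below 1/2
    r = alpha * dt / dx**2

    x = np.linspace(0.0, 1.0, nx)
    u = np.sin(np.pi * x)                    # initial profile, held at u = 0 on both ends

    for _ in range(nt):
        # Central-difference update of u_t = alpha * u_xx at the interior points.
        u[1:-1] += r * (u[2:] - 2 * u[1:-1] + u[:-2])

    # For this initial condition the exact solution decays as exp(-pi^2 * alpha * t).
    exact = np.sin(np.pi * x) * np.exp(-np.pi**2 * alpha * nt * dt)
    print(np.max(np.abs(u - exact)))         # small discretization error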

Part Three:    Non-Linear differential equations

A non-linear differential equation is a differential equation that is not a linear equation in the unknown function and its derivatives. There are very few methods of solving nonlinear differential equations exactly; those that are known typically depend on the equation having particular symmetries. Nonlinear differential equations can exhibit very complicated behaviour over extended time intervals, characteristic of chaos. Even the fundamental questions of existence, uniqueness, and extendability of solutions for nonlinear differential equations, and well-posedness of initial and boundary value problems for nonlinear Partial Differential Equations, are hard problems, and their resolution in special cases is considered to be a significant advance in the mathematical theory (cf. Navier–Stokes existence and smoothness). However, if the differential equation is a correctly formulated representation of a meaningful physical process, then one expects it to have a solution.

Linear differential equations frequently appear as approximations to nonlinear equations. These approximations are only valid under restricted conditions. For example, the harmonic oscillator equation is often used as a starting point in representing nonlinear phenomena.
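
The classical illustration, a standard fact rather than anything specific to this text, is the pendulum. The exact equation of motion is nonlinear, $$ {\displaystyle {\ddot {\theta }}+{\frac {g}{L}}\sin \theta =0,} $$ but for small angles $ {\displaystyle \sin \theta \approx \theta } $ , which yields the linear harmonic-oscillator approximation $$ {\displaystyle {\ddot {\theta }}+{\frac {g}{L}}\theta =0,} $$ valid only while $ {\displaystyle \theta } $ remains small.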

Section Seven:   Vector Calculus

Vector calculus, or vector analysis, is concerned with differentiation and integration of vector fields, primarily in 3-dimensional Euclidean space $ {\displaystyle \mathbb {R} ^{3}}$ . The term "vector calculus" is sometimes used as a synonym for the broader subject of multivariable calculus, which spans vector calculus as well as partial differentiation and multiple integration. Vector calculus plays an important role in differential geometry and in the study of partial differential equations. It is used extensively in physics and engineering, especially in the description of electromagnetic fields, quantum mechanics, quantum optics, and fluid flow.

Vector calculus was developed from quaternion analysis by J. Willard Gibbs and Oliver Heaviside near the end of the 19th century, and most of the notation and terminology was established by Gibbs and Edwin Bidwell Wilson in their 1901 book, Vector Analysis. In the conventional form using cross products, vector calculus does not generalize to higher dimensions, while the alternative approach of geometric algebra which uses exterior products does.

Scalar Fields

A scalar field associates a scalar value to every point in a space. The scalar is a mathematical number representing a physical quantity. Examples of scalar fields in applications include the temperature distribution throughout space and the pressure distribution in a fluid. These fields are the subject of scalar field theory.

Vector Fields

A vector field is an assignment of a vector to each point in a space. A vector field in the plane, for instance, can be visualized as a collection of arrows with a given magnitude and direction each attached to a point in the plane. Vector fields are often used to model, for example, the speed and direction of a moving fluid throughout space, or the strength and direction of some force, such as the magnetic or gravitational force, as it changes from point to point.

Vector calculus studies various differential operators defined on scalar or vector fields, which are typically expressed in terms of the del operator $ {\displaystyle \nabla }$ , also known as "nabla". The three basic vector operators are:

1.   The Gradient

The gradient of a scalar-valued differentiable function $ {\displaystyle f}$ of several variables is the vector field (or vector-valued function) $ {\displaystyle \nabla f}$ whose value at a point $ {\displaystyle p}$ is the vector whose components are the partial derivatives of $ {\displaystyle f}$ at $ {\displaystyle p}$ . That is, for $ {\displaystyle f\colon \mathbb {R} ^{n}\to \mathbb {R} }$ , its gradient $ {\displaystyle \nabla f\colon \mathbb {R} ^{n}\to \mathbb {R} ^{n}}$ is defined at the point $ {\displaystyle p=(x_{1},\ldots ,x_{n})}$ in n-dimensional space as the vector.

$$ {\displaystyle \nabla f(p)={\begin{bmatrix}{\frac {\partial f}{\partial x_{1}}}(p)\\\vdots \\{\frac {\partial f}{\partial x_{n}}}(p)\end{bmatrix}}.} $$

The nabla symbol $ {\displaystyle \nabla }$ , written as an upside-down triangle and pronounced "del", denotes the vector differential operator.

The gradient vector can be interpreted as the "direction and rate of fastest increase". If the gradient of a function is non-zero at a point p, the direction of the gradient is the direction in which the function increases most quickly from p, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative.[2] Further, the gradient is the zero vector at a point if and only if it is a stationary point (where the derivative vanishes). The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent.
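
A minimal gradient-ascent sketch in Python (the objective, step size, and iteration count are illustrative choices): starting from the origin and repeatedly stepping along the gradient drives $ {\displaystyle f(x,y)=-(x-1)^{2}-(y+2)^{2}} $ up to its maximizer at $ {\displaystyle (1,-2)} $ , where the gradient vanishes.

    def grad_f(x, y):
        # Gradient of f(x, y) = -(x - 1)^2 - (y + 2)^2.
        return -2 * (x - 1), -2 * (y + 2)

    x, y, step = 0.0, 0.0, 0.1        # start at the origin with a fixed step size
    for _ in range(100):
        gx, gy = grad_f(x, y)
        x, y = x + step * gx, y + step * gy   # move in the direction of fastest increase

    print(round(x, 6), round(y, 6))   # close to the stationary point (1, -2)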

The gradient is dual to the total derivative $ {\displaystyle df}$ : the value of the gradient at a point is a tangent vector – a vector at each point; while the value of the derivative at a point is a cotangent vector – a linear function on vectors. They are related in that the dot product of the gradient of $ {\displaystyle f}$ at a point $ {\displaystyle p}$ with another tangent vector $ {\displaystyle \mathbf {v} }$ equals the directional derivative of $ {\displaystyle f}$ at $ {\displaystyle p}$ along $ {\displaystyle \mathbf {v} }$ ; that is, $$ {\textstyle \nabla f(p)\cdot \mathbf {v} ={\frac {\partial f}{\partial \mathbf {v} }}(p)=df_{p}(\mathbf {v} )}$$ .

2.   The Laplacian

In Engineering, the Laplace operator or Laplacian is a differential operator given by the divergence of the gradient of a scalar function on Euclidean space. It is usually denoted by the symbols $ {\displaystyle \nabla \cdot \nabla }$ or $ {\displaystyle \nabla ^{2}}$ (where $ {\displaystyle \nabla }$ is the nabla operator). In a Cartesian coordinate system, the Laplacian is given by the sum of the second partial derivatives of the function with respect to each independent variable. In other coordinate systems, such as cylindrical and spherical coordinates, the Laplacian also has a useful form.

$$ {\displaystyle \nabla ^{2}f=\nabla \cdot \nabla f}$$

In Cartesian coordinates,

$$ {\displaystyle \nabla ^{2}f={\frac {\partial ^{2}f}{\partial x^{2}}}+{\frac {\partial ^{2}f}{\partial y^{2}}}+{\frac {\partial ^{2}f}{\partial z^{2}}}.}$$

In cylindrical coordinates,

$$ {\displaystyle \nabla ^{2}f={\frac {1}{\rho }}{\frac {\partial }{\partial \rho }}\left(\rho {\frac {\partial f}{\partial \rho }}\right)+{\frac {1}{\rho ^{2}}}{\frac {\partial ^{2}f}{\partial \varphi ^{2}}}+{\frac {\partial ^{2}f}{\partial z^{2}}},}$$

In spherical coordinates:

$$ {\displaystyle \nabla ^{2}f={\frac {1}{r^{2}}}{\frac {\partial }{\partial r}}\left(r^{2}{\frac {\partial f}{\partial r}}\right)+{\frac {1}{r^{2}\sin \theta }}{\frac {\partial }{\partial \theta }}\left(\sin \theta {\frac {\partial f}{\partial \theta }}\right)+{\frac {1}{r^{2}\sin ^{2}\theta }}{\frac {\partial ^{2}f}{\partial \varphi ^{2}}},}$$
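
As a quick check of the spherical form, and a standard computation, the Newtonian potential $ {\displaystyle f=1/r} $ is harmonic away from the origin: it depends only on $ {\displaystyle r} $ , so $$ {\displaystyle \nabla ^{2}{\frac {1}{r}}={\frac {1}{r^{2}}}{\frac {\partial }{\partial r}}\left(r^{2}{\frac {\partial }{\partial r}}{\frac {1}{r}}\right)={\frac {1}{r^{2}}}{\frac {\partial }{\partial r}}\left(-1\right)=0,\qquad r\neq 0.} $$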

The Laplace operator is named after the French mathematician Pierre-Simon de Laplace (1749–1827), who first applied the operator to the study of celestial mechanics: the Laplacian of the gravitational potential due to a given mass density distribution is a constant multiple of that density distribution. Solutions of Laplace's equation are called harmonic functions.

The Laplacian occurs in many differential equations describing physical phenomena: Poisson's equation describes electric and gravitational potentials, the diffusion equation describes heat and fluid flow, the wave equation describes wave propagation, and the Schrödinger equation describes quantum-mechanical wave functions. In image processing and computer vision, the Laplacian operator has been used for various tasks, such as blob and edge detection. The Laplacian is the simplest elliptic operator and is at the core of Hodge theory as well as the results of de Rham cohomology.

3.   The Divergence Theorem

In vector calculus, the divergence theorem, also known as Gauss's theorem or Ostrogradsky's theorem, is a theorem which relates the flux of a vector field through a closed surface to the divergence of the field in the volume enclosed.

More precisely, the divergence theorem states that the surface integral of a vector field over a closed surface, which is called the "flux" through the surface, is equal to the volume integral of the divergence over the region inside the surface. Intuitively, it states that "the sum of all sources of the field in a region (with sinks regarded as negative sources) gives the net flux out of the region".

Suppose $ {\displaystyle V}$ is a subset of $ {\displaystyle \mathbb {R} ^{n}}$ (in the case of n = 3, $ {\displaystyle V}$ represents a volume in three-dimensional space) which is compact and has a piecewise smooth boundary $ {\displaystyle S}$ (also indicated with $ {\displaystyle \partial V=S}$ ). If $ {\displaystyle \mathbf {F} }$ is a continuously differentiable vector field defined on a neighborhood of $ {\displaystyle V}$ , then: $$ {\displaystyle \iiint _{V}\left(\nabla \cdot \mathbf {F} \right)\,dV=\iint _{S}\left(\mathbf {F} \cdot \mathbf {\hat {n}} \right)\,dS,} $$ where $ {\displaystyle \mathbf {\hat {n}} }$ is the outward-pointing unit normal to the boundary.

Hilbert Space

In mathematics and in Quantum Optics, Hilbert spaces allow generalizing the methods of linear algebra and calculus from three-dimensional Euclidean vector spaces to spaces that may be infinite-dimensional. A Hilbert space is a vector space equipped with an inner product which defines a distance function for which it becomes a complete metric space. Hilbert spaces arise naturally and frequently in mathematics and physics, typically as function spaces.

Definition: A Hilbert space $ {\displaystyle H}$ is a real or complex inner product space that is also a complete metric space with respect to the distance function induced by the inner product.

To say that $ {\displaystyle H}$ is a complex inner product space means that $ {\displaystyle H}$ is a complex vector space on which there is an inner product $ {\displaystyle \langle x,y\rangle } $ associating a complex number to each pair of elements $ {\displaystyle x,y} $ of $ {\displaystyle H}$ that satisfies the following properties:

  1. The inner product is conjugate symmetric; that is, the inner product of a pair of elements is equal to the complex conjugate of the inner product of the swapped elements: $ {\displaystyle \langle y,x\rangle ={\overline {\langle x,y\rangle }}\,.}$ Importantly, this implies that $ {\displaystyle \langle x,x\rangle } $ is a real number.
  2. The inner product is linear in its first argument. For all complex numbers $ {\displaystyle a}$ and $ {\displaystyle b,} $ $$ {\displaystyle \langle ax_{1}+bx_{2},y\rangle =a\langle x_{1},y\rangle +b\langle x_{2},y\rangle \,.} $$
  3. The inner product of an element with itself is positive definite:$$ {\displaystyle {\begin{alignedat}{4}\langle x,x\rangle >0&\quad {\text{ if }}x\neq 0,\\\langle x,x\rangle =0&\quad {\text{ if }}x=0\,.\end{alignedat}}}$$
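
Two standard examples, both classical facts: the finite-dimensional space $ {\displaystyle \mathbb {C} ^{n}} $ with the inner product $$ {\displaystyle \langle x,y\rangle =\sum _{k=1}^{n}x_{k}{\overline {y_{k}}},} $$ and the space $ {\displaystyle L^{2}[a,b]} $ of square-integrable functions with $$ {\displaystyle \langle f,g\rangle =\int _{a}^{b}f(x){\overline {g(x)}}\,dx.} $$ The latter is the infinite-dimensional setting in which Fourier series naturally live: the complex exponentials form an orthonormal basis of $ {\displaystyle L^{2}} $ over one period.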

    Section Eight:   Linear Algebra

    Linear algebra is the branch of mathematics concerning linear equations such as:

    $$ {\displaystyle a_{1}x_{1}+\cdots +a_{n}x_{n}=b,} $$

    linear maps such as:

    $$ {\displaystyle (x_{1},\ldots ,x_{n})\mapsto a_{1}x_{1}+\cdots +a_{n}x_{n},} $$

    and their representations in vector spaces and through matrices.

    Linear algebra is central to all areas of mathematics. For instance, linear algebra is fundamental to the design and implementation of modern Artificial Intelligence agents.

    Linear algebra is also used in most sciences and fields of engineering, because it allows modeling many natural phenomena, and computing efficiently with such models. For nonlinear systems, which cannot be modeled with linear algebra, it is often used for dealing with first-order approximations, using the fact that the differential of a multivariate function at a point is the linear map that best approximates the function near that point.

    Part One: Vector Spaces

    A vector space over a field $ {\displaystyle F}$ is a non-empty set $ {\displaystyle V}$ together with two binary operations that satisfy the eight axioms listed below. In this context, the elements of $ {\displaystyle V}$ are commonly called vectors, and the elements of $ {\displaystyle F}$ are called scalars.

    The first operation, called vector addition or simply addition, assigns to any two vectors $ {\displaystyle \mathbf {v}}$ and $ {\displaystyle \mathbf {w}}$ in $ {\displaystyle V}$ a third vector in $ {\displaystyle V}$ which is commonly written as $ {\displaystyle \mathbf {v} + \mathbf {w} }$ and called the sum of these two vectors.

    The second operation, called scalar multiplication, assigns to any scalar $ {\displaystyle a}$ in $ {\displaystyle F}$ and any vector $ {\displaystyle \mathbf {v}}$ in $ {\displaystyle V}$ another vector in $ {\displaystyle V}$ , which is denoted $ {\displaystyle a \mathbf {v}}$ .

    To define a vector space, the eight following axioms must be satisfied for every $ {\displaystyle \mathbf {u} }$ , $ {\displaystyle \mathbf {v} }$ and $ {\displaystyle \mathbf {w} }$ in $ {\displaystyle V}$ , and every $ {\displaystyle a}$ and $ {\displaystyle b}$ in $ {\displaystyle F}$ .

    Subtraction of two vectors can be defined as:

    $$ {\displaystyle \mathbf {v} -\mathbf {w} =\mathbf {v} +(-\mathbf {w} ).} $$

    Axioms

    1.:    Associativity of vector addition:   $ {\displaystyle \mathbf {u} + ( \mathbf {v} + \mathbf {w}) = (\mathbf {u} + \mathbf {v}) + \mathbf {w} }$

    2.:    Commutativity of vector addition:   $ {\displaystyle \mathbf {u} + \mathbf {v} = \mathbf {v} + \mathbf {u} }$

    3.:    Identity element of vector addition:    There exists an element $ {\displaystyle \mathbf {0} \in V }$ , called the zero vector, such that $ {\displaystyle \mathbf {v} + \mathbf {0} = \mathbf {v} }$ for all $ {\displaystyle \mathbf {v} \in V }$ .

    4.:    Inverse elements of vector addition:    For every $ {\displaystyle \mathbf {v} \in V }$ , there exists an element $ {\displaystyle -\mathbf {v} \in V }$ , called the additive inverse of $ {\displaystyle \mathbf {v} }$ , such that $ {\displaystyle \mathbf {v} + (-\mathbf {v}) = \mathbf {0} }$ .

    5.:    Compatibility of scalar multiplication with field multiplication:    $ {\displaystyle a(b\mathbf {v}) = (ab)\mathbf {v} }$

    6.:    Identity element of scalar multiplication:    $ {\displaystyle 1\mathbf {v} = \mathbf {v} }$ , where $ {\displaystyle 1}$ denotes the multiplicative identity in $ {\displaystyle F}$ .

    7.:    Distributivity of scalar multiplication with respect to vector addition:     $ {\displaystyle a(\mathbf {u} + \mathbf {v}) = a\mathbf {u} + a\mathbf {v} }$

    8.:    Distributivity of scalar multiplication with respect to field addition:    $ {\displaystyle (a + b)\mathbf {v} = a\mathbf {v} + b\mathbf {v} }$

    When the scalar field is the real numbers, the vector space is called a real vector space. When the scalar field is the complex numbers, the vector space is called a complex vector space. These two cases are the most common ones, but vector spaces with scalars in an arbitrary field $ {\displaystyle F}$ are also commonly considered. Such a vector space is called an $ {\displaystyle F}$ -vector space.

    Direct consequences of the axioms include that, for every $ {\displaystyle s\in F} $ and $ {\displaystyle \mathbf {v} \in V} $ , one has:

    $$ {\displaystyle 0\mathbf {v} =\mathbf {0} } $$ $$ {\displaystyle s\mathbf {0} =\mathbf {0} } $$ $$ {\displaystyle (-1)\mathbf {v} =-\mathbf {v} } $$

    and that $ {\displaystyle s\mathbf {v} =\mathbf {0} } $ implies $ {\displaystyle s=0}$ or $ {\displaystyle \mathbf {v} =\mathbf {0}} $ .

    Part Two: Matrices

    Definition: A matrix is a rectangular array of numbers, called the entries of the matrix. Matrices are subject to standard operations such as addition and multiplication. Most commonly, a matrix over a field $ {\displaystyle F} $ is a rectangular array of elements of $ {\displaystyle F} $ . A real matrix and a complex matrix are matrices whose entries are respectively real numbers or complex numbers. More general types of entries are discussed below. For instance, this is a real matrix:

    $$ {\displaystyle \mathbf {A} ={\begin{bmatrix}-1.3&0.6\\20.4&5.5\\9.7&-6.2\end{bmatrix}}.}$$

    The numbers, symbols, or expressions in the matrix are called its entries or its elements. The horizontal and vertical lines of entries in a matrix are called rows and columns, respectively.

    Notation: The specifics of symbolic matrix notation vary widely. Matrices are commonly written in square brackets or parentheses, so that an $ {\displaystyle m\times n} $ matrix $ {\displaystyle \mathbf {A} } $ is represented as: $$ {\displaystyle \mathbf {A} ={\begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix}}={\begin{pmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{pmatrix}}.} $$

    This may be abbreviated by writing only a single generic term, possibly along with indices, as in $$ {\displaystyle \mathbf {A} =\left(a_{ij}\right),\quad \left[a_{ij}\right],\quad {\text{or}}\quad \left(a_{ij}\right)_{1\leq i\leq m,\;1\leq j\leq n}}$$ or $ {\displaystyle \mathbf {A} =(a_{i,j})_{1\leq i,j\leq n}} $ in the case that $ {\displaystyle n=m} $ .

    Matrices are usually symbolized using upper-case letters (such as $ {\displaystyle \mathbf {A} } $ in the examples above), while the corresponding lower-case letters, with two subscript indices (e.g., $ {\displaystyle a_{11}} $ or $ {\displaystyle a_{1,1}} $ ), represent the entries.

    The entry in the $ {\displaystyle i } $ -th row and $ {\displaystyle j } $ -th column of a matrix A is sometimes referred to as the $ {\displaystyle i,j } $ or $ {\displaystyle (i,j) } $ entry of the matrix.

    Some programming languages utilize doubly subscripted arrays (or arrays of arrays) to represent an $ {\displaystyle m} $ -by-$ {\displaystyle n} $ matrix. Some programming languages start the numbering of array indexes at zero.
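
    A minimal Python illustration, with zero-based indexing (the variable names are arbitrary):

        # A 2-by-3 matrix represented as a list of rows (an "array of arrays").
        A = [[1, 2, 3],
             [4, 5, 6]]

        # Zero-based indexing: A[0][2] is the entry written a_13 in the
        # one-based mathematical notation above.
        print(A[0][2])               # 3
        print(len(A), len(A[0]))     # 2 rows, 3 columns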

    Matrix Addition: Two matrices must have an equal number of rows and columns to be added, in which case the sum of two matrices A and B will be a matrix which has the same number of rows and columns as A and B. The sum of A and B, denoted A + B, is computed by adding corresponding elements of A and B:

    $$ {\displaystyle {\begin{aligned}\mathbf {A} +\mathbf {B} &={\begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\\\end{bmatrix}}+{\begin{bmatrix}b_{11}&b_{12}&\cdots &b_{1n}\\b_{21}&b_{22}&\cdots &b_{2n}\\\vdots &\vdots &\ddots &\vdots \\b_{m1}&b_{m2}&\cdots &b_{mn}\\\end{bmatrix}}\\&={\begin{bmatrix}a_{11}+b_{11}&a_{12}+b_{12}&\cdots &a_{1n}+b_{1n}\\a_{21}+b_{21}&a_{22}+b_{22}&\cdots &a_{2n}+b_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}+b_{m1}&a_{m2}+b_{m2}&\cdots &a_{mn}+b_{mn}\\\end{bmatrix}}\\\end{aligned}}\,\!} $$

    Matrix Scalar Multiplication: The left scalar multiplication of a matrix A with a scalar $ {\displaystyle \lambda} $ gives another matrix of the same size as A. It is denoted by $ {\displaystyle \lambda} $ A, whose entries are defined by: $$ {\displaystyle (\lambda \mathbf {A} )_{ij}=\lambda \left(\mathbf {A} \right)_{ij}\,,} $$ explicitly: $$ {\displaystyle \lambda \mathbf {A} =\lambda {\begin{pmatrix}A_{11}&A_{12}&\cdots &A_{1m}\\A_{21}&A_{22}&\cdots &A_{2m}\\\vdots &\vdots &\ddots &\vdots \\A_{n1}&A_{n2}&\cdots &A_{nm}\\\end{pmatrix}}={\begin{pmatrix}\lambda A_{11}&\lambda A_{12}&\cdots &\lambda A_{1m}\\\lambda A_{21}&\lambda A_{22}&\cdots &\lambda A_{2m}\\\vdots &\vdots &\ddots &\vdots \\\lambda A_{n1}&\lambda A_{n2}&\cdots &\lambda A_{nm}\\\end{pmatrix}}\,.} $$

    Matrix Multiplication:

    If A is an $ {\displaystyle m\times n} $ matrix and B is an $ {\displaystyle n\times p} $ matrix,

    $$ {\displaystyle \mathbf {A} ={\begin{pmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\\\end{pmatrix}},\quad \mathbf {B} ={\begin{pmatrix}b_{11}&b_{12}&\cdots &b_{1p}\\b_{21}&b_{22}&\cdots &b_{2p}\\\vdots &\vdots &\ddots &\vdots \\b_{n1}&b_{n2}&\cdots &b_{np}\\\end{pmatrix}}} $$

    the matrix product C = AB (denoted without multiplication signs or dots) is defined to be the $ {\displaystyle m\times p} $ matrix:

    $$ {\displaystyle \mathbf {C} ={\begin{pmatrix}c_{11}&c_{12}&\cdots &c_{1p}\\c_{21}&c_{22}&\cdots &c_{2p}\\\vdots &\vdots &\ddots &\vdots \\c_{m1}&c_{m2}&\cdots &c_{mp}\\\end{pmatrix}}} $$

    such that $$ {\displaystyle c_{ij}=a_{i1}b_{1j}+a_{i2}b_{2j}+\cdots +a_{in}b_{nj}=\sum _{k=1}^{n}a_{ik}b_{kj},} $$

    for $ {\displaystyle i = 1, ..., m} $ and $ {\displaystyle j = 1, ..., p} $ .

    That is, the entry $ {\displaystyle c_{ij}} $ of the product is obtained by multiplying term-by-term the entries of the $ {\displaystyle i} $ th row of A and the $ {\displaystyle j} $ th column of B, and summing these $ {\displaystyle n} $ products. In other words, $ {\displaystyle c_{ij}} $ is the dot product of the $ {\displaystyle i} $ th row of A and the $ {\displaystyle j} $ th column of B.

    Therefore, AB can also be written as: $$ {\displaystyle \mathbf {C} ={\begin{pmatrix}a_{11}b_{11}+\cdots +a_{1n}b_{n1}&a_{11}b_{12}+\cdots +a_{1n}b_{n2}&\cdots &a_{11}b_{1p}+\cdots +a_{1n}b_{np}\\a_{21}b_{11}+\cdots +a_{2n}b_{n1}&a_{21}b_{12}+\cdots +a_{2n}b_{n2}&\cdots &a_{21}b_{1p}+\cdots +a_{2n}b_{np}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}b_{11}+\cdots +a_{mn}b_{n1}&a_{m1}b_{12}+\cdots +a_{mn}b_{n2}&\cdots &a_{m1}b_{1p}+\cdots +a_{mn}b_{np}\\\end{pmatrix}}} $$
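
    A short numpy sketch tying the three operations above together (the particular matrices are arbitrary): it forms an entry-wise sum and a scalar multiple, then checks the row-by-column formula for $ {\displaystyle c_{ij}} $ against numpy's built-in matrix product.

        import numpy as np

        A = np.array([[1., 2., 3.],
                      [4., 5., 6.]])          # a 2 x 3 matrix
        B = np.array([[7., 8.],
                      [9., 10.],
                      [11., 12.]])            # a 3 x 2 matrix

        print(A + A)        # matrix addition: corresponding entries are added
        print(2.0 * A)      # scalar multiplication: every entry is doubled

        # c_ij = sum_k a_ik b_kj: dot product of row i of A with column j of B.
        C = np.array([[sum(A[i, k] * B[k, j] for k in range(3))
                       for j in range(2)] for i in range(2)])
        print(np.array_equal(C, A @ B))       # True: matches numpy's product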

    Matrix Transposition:
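
    The transpose of an $ {\displaystyle m\times n} $ matrix $ {\displaystyle \mathbf {A} } $ is the $ {\displaystyle n\times m} $ matrix $ {\displaystyle \mathbf {A} ^{\mathrm {T} }} $ obtained by turning rows into columns and columns into rows: $$ {\displaystyle \left(\mathbf {A} ^{\mathrm {T} }\right)_{ij}=\mathbf {A} _{ji}.} $$ For the real matrix shown earlier, $$ {\displaystyle \mathbf {A} ={\begin{bmatrix}-1.3&0.6\\20.4&5.5\\9.7&-6.2\end{bmatrix}},\qquad \mathbf {A} ^{\mathrm {T} }={\begin{bmatrix}-1.3&20.4&9.7\\0.6&5.5&-6.2\end{bmatrix}}.} $$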
