Experimental Evaluation of an Industrial Technique for the Approximation of Software Functional Size

The Early & Quick sizing techniques, built on ISO standards, have been proposed to derive an early approximation of software functional size when only high-level and incomplete requirements specifications are available. In the literature, there is a lack of research evaluating the performance of such approximation sizing methods. This paper presents an experimental study to evaluate their reproducibility and accuracy. The experimental results show both poor reproducibility and highly inaccurate approximations. In particular, the analysis of the findings indicates that the practitioners could not classify the functional requirements specifications in accordance with their levels of granularity using the rules and the concepts of the Early & Quick COSMIC technique.


INTRODUCTION
Software project managers and technical leaders use all the available information, including the approximation of the software functional size, to estimate the cost and duration of software projects [1][2][3][4][5][6][7]. Estimation of software projects based on measuring software functionality was first proposed by Albrecht [8] in 1979. Several methods that refine Albrecht's concepts and rules, in order to specify their use and applicability, have been standardized by ISO, including COSMIC Function Points [9] and Function Points Analysis (FPA) [10]. Although the functional size of software can be measured accurately with these ISO standards when all the functionality details are available, size measurement is much more challenging and imprecise when the initial requirements are high level and lack details: under these conditions functional size can only be approximated and not measured accurately. Desharnais et al. [11] recommend using functional size approximation techniques for such 'partially documented' software functional requirements specifications.
Functional size approximation techniques can be classified in two (2) main categories, according to Meli [12]:

A. Direct approximation techniques
Direct functional size approximation techniques adopt the "expert opinion" approach, which depends completely on the expertise of the individuals responsible for the approximation of software functional size. This means that these approximations may be influenced by many subjective factors, such as personal relationships in the case of collaborative teams and contractual aspects of the task, which commonly affect team performance. Direct approximation techniques may result in reasonable functional size approximations, but it is challenging to recognize when they are reasonable and when they are not. Examples of such techniques include the following:
Analogy-based approximation technique [13], in which a repository of measured software applications is used. The approximator looks for "similar" pieces of software, calculates their average size, and then assigns an approximate value to the piece of software being approximated. However, the accuracy of this technique is poor [14].
Delphi technique [15], which considers a group approximation approach, rather than an individual one. For example, each individual involved in the approximation constructs an anonymous approximation, and then these individual approximations are combined to achieve an overall size approximation as a group estimate. However, the results are difficult to justify, and this technique is not recommended for software enhancement projects [12].
Three-point approximation technique [16], in which the functional size approximations are collected from experts, and the final approximate functional size is then calculated using the formula: ApproxSize = (Min + 4 × MostLikely + Max) / 6, with a standard deviation σ = (Max − Min) / 6. However, the approximator faces the same challenges as with the Delphi technique [12].
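The three-point formula quoted above can be illustrated with a short sketch. The function name and the sample inputs are hypothetical and serve only to show the arithmetic; they are not taken from [16].

```python
def three_point(min_size: float, most_likely: float, max_size: float) -> tuple[float, float]:
    """Return (ApproxSize, sigma) for a three-point approximation.

    ApproxSize = (Min + 4 * MostLikely + Max) / 6
    sigma      = (Max - Min) / 6
    """
    if not (min_size <= most_likely <= max_size):
        raise ValueError("expected Min <= MostLikely <= Max")
    approx_size = (min_size + 4 * most_likely + max_size) / 6
    sigma = (max_size - min_size) / 6
    return approx_size, sigma


# Hypothetical expert inputs (in function points), for illustration only.
approx, sigma = three_point(60, 80, 130)
print(f"ApproxSize = {approx:.1f}, sigma = {sigma:.1f}")
```

Because the most likely value carries a weight of four, the result stays close to it unless the minimum and maximum are strongly asymmetric.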

B. Derived approximation techniques
Derived approximation techniques are algorithmic or structured, and are based on theoretical or statistical models. A few derived algorithmic functional size approximation techniques have been proposed:
Extrapolative approximation technique [17], which is applied by asking each individual involved in the task to approximate one functional component, and deriving the remaining approximations through statistical or theoretical means. However, the accuracy of this technique is poor and strongly depends on distribution profiles, and the technique is not recommended for enhancement projects [12].
Average complexity approximation technique [18], in which functional components are identified in accordance with the FPA method [10] in order to approximate functional size according to these components.
Early & Quick techniques [19][20][21], which were initially published in 1997 for the original Function Points Analysis sizing method [10]. In this context, the term "Early" refers to the need to obtain a functional size approximation before a significant portion of the software requirements is detailed enough for precise measurement, and the term "Quick" means that such size approximations must typically be obtained rapidly, since they must be provided to management within a short time, in spite of the obvious constraints. As the COSMIC measurement method [9] became adopted as an international standard, the initial design of the Early & Quick technique was extended to the COSMIC measurement method. The initial design of the Early & Quick COSMIC technique was proposed in 2000 [17], and subsequently generalized by Conte et al. [21] in 2004. The Early & Quick techniques are presented in more detail in the next section.
Functional size approximation techniques are in great demand to tackle the lack of precise and detailed software requirements specifications at early phases of the software development life cycle. However, a key finding from our literature survey is that, while there are "opinions" on the performance of these approximation techniques, there is no experimental research evaluating their performance, especially for those based on ISO standards, such as the Early & Quick techniques. The Early & Quick COSMIC technique was selected for evaluation because it refers to the 2nd generation of functional size measurement (FSM) methods, which was developed in the early 2000s to correct weaknesses of the 1st generation of FSM methods initially developed at the end of the 1970s.
In the ISO International vocabulary of basic and general terms in metrology [22], reproducibility and accuracy are defined as follows: Reproducibility, as a condition of measurement: "condition of measurement, out of a set of conditions that includes different locations, operators, measuring systems, and replicate measurements on the same or similar objects"; and Accuracy, as applied to measurement: "closeness of agreement between a measured quantity value and a true quantity value of a measurand".

EARLY & QUICK TECHNIQUES: AN OVERVIEW
The Early & Quick (E&Q) techniques [19,21] define a set of concepts and procedures which combine various functional size approximation approaches to derive an approximation for the functional size of software. They classify functions (i.e. functional processes and data groups in the FPA variant, E&Q FPA) in an analogical and an analytical fashion. E&Q techniques provide the opportunity to use different levels of detail of the software during the functional size approximation process. Therefore, the total amount of functional size uncertainty, expressed within a range of values (minimum, most likely, and maximum), will be the weighted sum of the uncertainty values of individual components.
An E&Q functional size approximation starts with a breakdown of the structure of the software system under study, an example of which is shown in Figure 1 for the FPA variant, E&Q FPA. This figure depicts the elementary functional processes, as well as the logical data groups and their aggregations that represent different levels of detail. These heterogeneous levels of knowledge make it possible to take advantage of all the information available: in other words, the E&Q techniques enable the use of all the available non-detailed information in the functional size approximation process. The elementary functional processes can be grouped into "small", "medium", or "large" typical and general processes. General processes in turn can be grouped into "small", "medium", or "large" macro processes, and the elementary logical data groups can be grouped into multiple data groups.
The functional processes in the E&Q FPA technique correspond to the elementary processes of the standard FPA method (i.e. External Input (EI), External Output (EO), and External Query (EQ)), and in the E&Q COSMIC technique, they correspond exactly to the functional processes of the standard COSMIC method, without distinction as to their type. Both the IFPUG and COSMIC versions [19,21] of the Early & Quick technique provide a large range of size values, with no reference to the specific rules of either sizing method. The Early & Quick technique also provides generic definitions that do not exactly map to the detailed definitions and rules of either method. By design, the Early & Quick technique does not have the same level of detail as the ISO standards.
Typically, the root is the highest level in the hierarchy (i.e. the application level), and lower levels stem from that root, based on the number of software artifacts under study. The method is applied down through the levels until the approximator decides that it is not useful to proceed with further decomposition (i.e. at the functional process level). It is worth mentioning that all the functions provided by the application must be at leaf level, since there is no explicit functionality at higher levels of the hierarchy. Therefore, a functional approximation of all the leaves provides a bottom-up approximation of the whole tree (i.e. the software application).
Aug 10, 2013

Figure 1. Functional hierarchy in the Early & Quick FPA: an example [19,21]

Table 1 provides the descriptions and acronyms of the functional levels in the E&Q techniques: a functional process (FP) represents the smallest software process with autonomy and significance, and corresponds to the functional process in the standard COSMIC method; a general process (GP) consists of a set of two or more average functional processes; a typical process is a particular case of a general process and normally consists of a set of the most frequently occurring operational transactions; a macro process (MP) consists of two or more average general processes; a logical data group (LDG) represents a group of logical data attributes; and a multiple data group (MDG) consists of a set of two or more logical data groups.

Macro Process (MP)
A set of two or more average GPs. The MP can be likened to a relevant subsystem, or even a bounded application, of an overall Information System.

General Process (GP)
A set of two or more average FPs. The GP can be likened to an operational subsystem, which provides an organized, comprehensive response to a specific application goal.

Typical Process (TP)
A particular case of a GP: the set of the most frequently occurring operational transactions. The TP can be found in two "flavors": CRUD (Create, Retrieve, Update, and Delete), and CRUD plus (CRUD with the addition of List and Report).

Functional Process (FP)
The smallest software process with autonomy and significance. The FP allows the user to achieve a single business objective at the operational level.

Multiple Data Group (MDG)
A set of two or more LDGs. Its size is evaluated based on the approximated number of LDGs included.

Logical Data Group (LDG)
A group of logical data attributes. It represents a conceptual entity which is functionally significant as a whole for the user.
The E&Q techniques assign a set of size values (minimum, most likely, and maximum) to each leaf in the hierarchy, based on the analytical and analogical table mentioned earlier. Then, these size values are summed to provide the overall approximation result (minimum, most likely, and maximum). It is worth mentioning that the E&Q FPA technique assigns numerical size values to logical data groups and multiple data groups, whereas the E&Q COSMIC technique identifies "objects of interest", but does not assign any numerical size value to them. A reference manual for approximating function points at early phases of the software development life cycle using the E&Q FPA technique is documented in [23]. This manual describes the E&Q FPA technique without mentioning the need for any other guidelines for its application. More specifically, the goals of this reference manual [23] are: to provide an exhaustive and clear description of the FPA variant of the E&Q technique (i.e. E&Q FPA); and to promote comprehensive and homogeneous application of the E&Q FPA technique by providing a guideline to approximate function points at early phases of the software development life cycle.
The authors of this reference manual [23] mention that it was designed to be applied by practitioners with an "average" to "good" knowledge of standard function point counting (i.e. the standard FPA method) [10]; however, a detailed knowledge of function point counting is not needed, since applying the E&Q FPA technique does not need extensive knowledge of the standard FPA rules and practices.

EXPERIMENTAL DESIGN
This section presents the eight steps of the experiment reported in this paper (see Figure 2).

Identification of experiment objective
The specific experiment objectives were to evaluate the reproducibility and accuracy of the approximation results using only the concepts and rules of the Early & Quick COSMIC technique. In this experiment, 'reproducibility' refers to the degree of closeness between the results of functional size approximation calculated by different approximators on the same case study, and 'accuracy' refers to the degree of closeness between the functional size approximations calculated by the approximators and the reference functional size of the case study, calculated as described below.
It is worth noting that two preliminary steps are recommended in [24] for the use of the E&Q COSMIC technique [21]: identification of the levels of granularity of the requirements specifications; and identification and use of the size scaling factors. However, no details are provided in [24] on how to apply these steps in practice.
The experiment is conducted without taking into consideration the two preliminary steps mentioned above for the usage of the E&Q COSMIC technique [21]: fifteen years after the initial publication of the Early & Quick technique [19], such guidelines are still not available to the industry.

Identification of the case study
The case study in [25] presents the software requirements specifications (SRS) of the first release of the "uObserve" system. This system is intended as a proof of concept for usability testing, and was developed for a research laboratory at the École de technologie supérieure (ÉTS) in Montreal, Canada. The SRS document used in the experiment is written in accordance with the UML 2.0 specifications [26] and IEEE-Std-830 [27] in terms of content and structure. This SRS document [25] consists of 18 pages of textual specifications, divided into three main sections: section 1 provides introductory information, including background, software purpose and scope, software objectives, and references; section 2 provides a high-level description of the software to be developed, a list of the software functionality and features, and the characteristics of the users; and section 3 provides the software functional and non-functional requirements, along with the user interfaces, the hardware interfaces, and the software prototype.
The functional size of the original document of this case study had been previously measured by a team of measurement experts using the international standard for software functional measurement: COSMIC [9]. The measurement experts had an average of fifteen years of industrial experience, were all COSMIC Certified Entry Level practitioners [28], were experienced in functional size measurement, and were active members of the COSMIC Measurement Practice Committee. Table 3 presents the functional size calculated by each member of this team of four experts, and the average functional size of 79.3 CFP. The differences in the functional size measurement results of each individual measurer are due to measurement assumptions made by the four experts, and do not represent the existence of measurement errors. They do, however, reflect the various 'flavors' that develop owing to the assumptions that can be made by different development teams during the development phase of the software [28].
The case study document from [25] describes the functionality of the software system based on 15 use-cases that specify software system functionality in textual form.Table 4 presents the reference classification of the functional components of the software system by applying the E&Q COSMIC technique prepared by the designer of this experiment: 8 use-cases are classified as "small" functional processes; 5 use-cases are classified as "medium" functional processes; and 2 use-cases are classified as "large" functional processes.
On the right-hand side of Table 4 is the approximation of the software functional size in CFP obtained by applying the E&Q COSMIC table (i.e. Table 2). The resulting functional size approximation range is (Min: 57 CFP, Most-likely: 87 CFP, Max: 108 CFP).
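The bottom-up summation behind this reference approximation can be sketched as follows. Table 2 is not reproduced in this section, so the per-class (minimum, most likely, maximum) values below are assumptions introduced for illustration only; they were chosen because, once rounded, they reproduce the totals reported above. The names SIZE_TABLE and approximate_size are hypothetical.

```python
# ASSUMED (min, most likely, max) sizes in CFP per functional-process class.
# These stand in for Table 2, which is not shown in this section.
SIZE_TABLE = {
    "small": (2.0, 3.9, 5.0),
    "medium": (5.0, 6.9, 8.0),
    "large": (8.0, 10.5, 14.0),
}


def approximate_size(counts: dict[str, int]) -> tuple[float, float, float]:
    """Sum the (min, most likely, max) values over all classified leaves."""
    totals = [0.0, 0.0, 0.0]
    for process_class, n in counts.items():
        for i, value in enumerate(SIZE_TABLE[process_class]):
            totals[i] += n * value
    return totals[0], totals[1], totals[2]


# Reference classification of the 15 use-cases (from Table 4 as described in the text).
reference_counts = {"small": 8, "medium": 5, "large": 2}
low, likely, high = approximate_size(reference_counts)
print(f"Min: {low:.0f} CFP, Most-likely: {likely:.0f} CFP, Max: {high:.0f} CFP")
```

The sketch makes visible why misclassification matters: moving a single use-case from one class to another shifts all three totals at once.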

Preparation of the experiment
This activity consisted of three sub activities: materials preparation, pilot testing, and call for participation, as follows:

Materials preparation
Prior to the experimental session, the E&Q COSMIC technique was reviewed by the experiment designer in order to provide a description of the concepts and rules, as well as a procedure for applying the technique.The experiment materials included: a description of the E&Q COSMIC technique; the case study document; a defined set of rules; and a defined set of participant roles.
The original case study document [25] was considered to be detailed and complete [28], and it specifies in detail the functionality that has to be delivered by the software system. Therefore, it allows a standardized functional size measurement method to be used to obtain an accurate measurement of software system functional size. To meet the objective of this experiment in an approximation context, the case study document was modified as follows: 6 use-cases were kept "as is" (i.e. without any modification of their specifications); 4 use-cases were partially modified, by removing portions of specifications; and 5 use-cases were significantly modified by removing use-case specifications entirely.
The average size of 79.3 CFP is the functional size of the uObserve software as developed and implemented: in this experiment, it is indeed the right reference size to evaluate, for approximation purposes, the accuracy of any combination of deletions of details from the set of implemented use-cases of the uObserve software. An alternate strategy, for experimental purposes, would have been to use a set of incomplete requirements which had never been implemented, but this would not provide any basis for evaluating the accuracy of the E&Q COSMIC technique. Opinions of the participants about the altered requirements specifications were not sought since they did not have access to the documentation of the uObserve software as ultimately implemented. This is therefore representative of current practices in industry: approximation is typically done based on incomplete information and no prior knowledge of which specific details are missing.

Pilot testing
A preliminary run of the experiment was performed by the designer of the experiment and an independent expert to identify the potential challenges in the experimental procedures, including: the applicability of the SRS to the experiment, in terms of scope and objective; an estimate of the time required to conduct the experiment; the usability of the data collection forms to be used by the participants in the experiment; and the correctness of the reference classification of the functional components of the software system (see Table 4), verified by having the independent expert perform the experiment activities.

Participants in the experiment
This experiment was staged as part of the 2nd International Symposium in Software Engineering Management (ISSEM 2011). Twelve participants volunteered to conduct the experiment.
The industrial experience profile of the 12 participants involved in the experiment is presented in Figure 3, based on their industrial experience in software engineering topics: the participants had an average of nine years of experience in software requirements analysis and modeling, software development, software documentation, software quality assurance, and software project management.It can be also observed in Figure 3 that participants A1 to A8 have an average of 12 years of industry experience, while participants A9 to A12 have very limited industry experience.In summary, two-thirds of the participants had significant industry experience.

Pre-experiment training
The participants in the experiment were given a one-hour training session, to familiarize them with the E&Q COSMIC technique, the rules to follow, and the roles that would govern the participants' behavior during the experiment. They were then given 30 minutes to read the SRS document.

Conducting the experiment
The participants were handed a printed copy of the case study document and given one (1) hour to: classify the set of software requirements specifications as E&Q COSMIC functional components in accordance with their level of granularity; and then use the statistical table of the Early & Quick COSMIC technique (i.e. Table 2) to calculate an approximate functional size of the software system presented in the modified SRS document (see Figure 4).
The following experimental data were to be captured on forms designed for this purpose: software process types: Functional, General, Typical, or Macro; total number of software processes for each process type; total functional size for each process type and total functional size of the software system; and total effort required to approximate the functional size.

EXPERIMENTAL RESULTS
This section presents the evaluation of the reproducibility of the functional size approximations calculated by the participants in the experiment, as well as the evaluation of the accuracy of their functional size approximations relative to the reference functional size of the case study (see Table 3).

Descriptive data from the Experiment
To explore whether or not the experience of the participants had an impact on the results of the experiment, the results are presented in two groups: results of the 8 participants in Table 5 with an average of 12 years of industry experience; and results of the 4 participants in Table 6 with an average of 1 year of industry experience.
Tables 5 and 6 present the classification of the software processes (i.e. functional components) of the case study used in this experiment (i.e. the uObserve software system) and the functional size approximation calculated by each participant using the E&Q COSMIC table. These tables also present the effort expended in minutes by each participant in conducting the experiment.

Evaluation of the reproducibility of the functional size approximation
To evaluate the reproducibility of the E&Q COSMIC technique, the approximations of the functional size of the 12 participants are compared to the median functional size approximation. For this data set, the median is represented by the approximation of participant A2 (see Table 7). Therefore, the percentage difference in functional size approximation for participant A2 is (Min: 0%, Most-likely: 0%, Max: 0%), and the average percentage difference in approximation is calculated using the percentage difference of the other 11 participants. The plus sign in Table 7 indicates an increase in the percentage difference of the functional size approximation, and the minus sign in Table 7 indicates a decrease. Of the 12 participants in the experiment, the functional size approximations calculated by participants A3, A9, and A5 look like 'reproducible' approximations relative to the median, which is represented by the approximation of participant A2. Even so, their approximations of the functional size yield an average percentage difference of (Min: +5%, Most-likely: −20%, Max: −26%).
Overall, the average percentage difference for the 12 participants is (Min: +158%, Most-likely: +87.4%, Max: +74%), which indicates non-reproducible results for most of the participants. The sources of these large variations in the approximations of the functional size presented in Table 7 are the incorrect identification of the number of software processes and the incorrect classification of the software processes (i.e. the functional components) in the case study. Overall, the results presented in Table 7 indicate that the use of the rules and concepts of the E&Q COSMIC technique by the 12 participants does not provide a 'reproducible' approximation of the functional size of the case study used in the experiment.
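The comparison against a baseline approximation can be sketched as follows. This is a minimal illustration that assumes the percentage difference is the signed relative difference from the baseline; the exact computation behind Table 7 is not given in this section. The function name and the sample values are hypothetical and are not data from the experiment.

```python
def pct_difference(value: float, baseline: float) -> float:
    """Signed percentage difference of `value` relative to `baseline`.

    A positive result means an increase over the baseline,
    a negative result means a decrease.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Hypothetical (min, most likely, max) approximations in CFP, NOT experiment data.
baseline = (50.0, 100.0, 150.0)   # plays the role of the median participant
other = (75.0, 80.0, 150.0)       # another participant's approximation

diffs = tuple(pct_difference(v, b) for v, b in zip(other, baseline))
print("Min: {:+.0f}%, Most-likely: {:+.0f}%, Max: {:+.0f}%".format(*diffs))
```

Keeping the sign, as Table 7 does, distinguishes participants who approximated above the baseline from those who approximated below it.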

Evaluation of the accuracy of the functional size approximation
The functional size approximations of the 12 participants (Tables 5 and 6) are first compared with the average functional size of 79.3 CFP (see Table 3), which was measured by the team of experts using the original version of the SRS document. The Magnitude of Relative Error (MRE) equation [29] is used to calculate the accuracy of the functional size approximations (see Table 8) as follows: MRE = |Reference size − Approximated size| / Reference size. Next, the number of software processes identified by each participant is compared with the reference number of software processes prepared by the designer of this experiment and verified for correctness by the independent expert. Tables 9 and 10 present: the number of software processes identified by the 12 participants (1st column); the correct number of software processes identified by the designer of the experiment and verified by the independent expert (2nd column); and the corresponding Magnitude of Relative Error (MRE) values (3rd column) calculated using the values from the 1st and 2nd columns.
Only participant A6 in Table 9 was able to identify the correct number of software processes explained in the case study. However, this participant could not classify them in accordance with their levels of granularity in the correct E&Q functional classes (see Table 5).
In addition, only participant A11 in Table 10 was able to identify the correct number of software processes explained in the case study, but he misclassified them, which led to a large range of size approximations (Min: 581 CFP, Most-likely: 1185 CFP, Max: 1972 CFP). Furthermore, the functional size approximations of participant A12 look 'reasonable' (see Table 6). However, participant A12 identified only 9 software processes, instead of the correct number of 15 software processes, and could not classify them in accordance with their levels of granularity in the correct E&Q functional classes.
It is worth mentioning that the average MRE of 17% for the participants in Table 10 (i.e. participants with limited industry experience) is a great deal better (i.e. a smaller MRE) than the average MRE of 74% for the participants in Table 9 (i.e. participants with an average of 12 years of industry experience). This is because the participants with limited industry experience identified fewer software processes than the participants with 12 years of industry experience.
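The MRE computation used in this subsection can be sketched as follows, assuming the commonly used definition MRE = |reference − approximated| / reference. The function name is hypothetical; the inputs are the figures quoted in the text (15 reference processes against the 9 identified by participant A12, and the 79.3 CFP reference size against the 1185 CFP most-likely approximation of participant A11).

```python
def mre(reference: float, approximated: float) -> float:
    """Magnitude of Relative Error of an approximation against a reference value."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return abs(reference - approximated) / reference


# Number of software processes: participant A12 identified 9 instead of 15.
count_mre = mre(15, 9)

# Functional size: A11's most-likely approximation against the 79.3 CFP reference.
size_mre = mre(79.3, 1185)

print(f"MRE on process count: {count_mre:.0%}")
print(f"MRE on functional size: {size_mre:.0%}")
```

Because MRE is an absolute value, it reports the magnitude of the error but not whether the participant approximated above or below the reference.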

Most of the participants in Tables 5 and 6 calculated inaccurate functional size approximations, since they had incorrectly identified and classified the software processes (i.e. the functional components) of the case study.Overall, the results presented in this subsection indicate that use of the rules and concepts of the E&Q COSMIC technique by the 12 participants did not help them arrive at an 'accurate' approximation of the functional size of the case study used in the experiment.

Summary of findings
The experimental results presented in sections 4.2 and 4.3 lead to the following findings:
the functional size approximations calculated by the 12 participants using only the rules and the concepts of the E&Q COSMIC technique did not lead to reproducible or accurate results in this experiment;
the incorrect identification and classification of the functional components had an impact on the reproducibility and accuracy of the functional size approximations;
no relationship was observed between the misclassification of the functional components and the amount of detail available in the use-cases; and
the participants with extensive industry experience and those with limited industry experience made similar mistakes, in terms of incorrectly identifying and classifying the software processes in the case study. In other words, the participants with extensive industry experience did not perform better than those with limited industry experience.

Construct validity threats
A construct validity threat is associated with the failure of the experimental setting to reflect the conditions of the technique under study (i.e. the E&Q COSMIC technique). In the case of the experiment reported here, the type of reference manual mentioned in [23] for the E&Q COSMIC technique was not available. To mitigate the risk of this type of threat occurring, the experimental material that was made available to the participants in the experiment was designed to contain equivalent information to that in the reference manual of the E&Q FPA technique [23]. In other words, the material used in the experiment contains a complete description of the E&Q COSMIC technique, including the functional size approximation rules and procedures. The participants in the experiment were not able to correctly classify the software processes in accordance with their levels of granularity, as described "as is" in the proposed E&Q COSMIC technique. The preliminary steps recommended in [24] were not taken into account in the experiment design, owing to the unavailability of related guidelines from the literature fifteen years after the initial publication of the E&Q technique in [19] and eight years after the publication of the COSMIC variant in [21].
A second construct validity threat is the restricted time available in which to conduct the experiment. This lack of time prevented the participants from asking for clarification from their colleagues, or from experts in the field, which is common practice in software development organizations.
A third construct validity threat is the inexperience of the participants with the E&Q COSMIC technique. On the one hand, the E&Q COSMIC technique includes a few simple rules and concepts. In theory, the eight (8) participants who had an average of twelve (12) years of industrial experience at the time of the experiment should be able to apply the technique correctly without extensive training. On the other hand, if those participants have such difficulty in applying the technique consistently, then it could be that the E&Q COSMIC technique itself is immature and in need of much further refinement in its definitions and rules.
Another construct validity threat concerns the different expectations of the participants about the SRS document they were given. On the one hand, the participants were handed a modified version of the SRS document in which some of the use-cases lack details (i.e. functional specifications), which is indeed a challenge. On the other hand, the E&Q COSMIC technique is designed to be used at the early phases of the software development life cycle, when the inputs to an approximation come from customers who have ambiguous and incomplete expectations of the software to be developed: the approximators using the E&Q COSMIC technique will use such incomplete information to approximate the functional size of software for effort estimation purposes. Some variance in the approximators' results is expected, but a good approximation method should lead to a minimum of variation. Therefore, the experiment presented in this paper reproduces such a context. It must be noted that when all the details (i.e. the functional specifications) of the use-cases are available, the E&Q COSMIC technique becomes irrelevant, since with such details the full detailed ISO measurement rules of standard measurement methods like COSMIC [9] can be applied and precise measurement results can be obtained rather than approximations with ranges of values.

Internal validity threats
An internal validity threat is associated with any changes in the design of the experiment, such as lack of discussion or clarification during the experimental period, lack of clear data collection procedures, or description of the concept(s) to be evaluated in the experiment, that could affect the validity of the experimental results. To mitigate the risk of this type of validity threat occurring, a one-hour tutorial session was held prior to the experiment to describe its objectives, scope, and rules, as well as the roles of the participants. The designer of the experiment explained the E&Q COSMIC technique in detail, and opened the door to discussion to clarify the activities and materials of the experiment, including a complete description of the E&Q COSMIC technique, a participant experience survey, and data collection forms.
Moreover, the designer of the experiment conducted a pilot test by performing the experimental activities prior to running the actual experiment. This was done with the help of an independent expert, in order to identify any potential challenge in the experimental procedures, including the applicability of the SRS to the experiment, the time required to conduct the experiment, and the usability of the data collection forms to be used by the participants in the experiment, as well as to verify the correctness of the reference classification of the functional components of the software system. The independent expert had 20 years of experience in requirements analysis and modeling, 6 years of experience in software documentation and software quality assurance, and 3 years of experience in functional size measurement using the COSMIC measurement method.
The independent expert identified the correct number of software processes (15) described in the requirements document [25], and classified 13 of them in accordance with the reference classification proposed by the designer of the experiment. However, the independent expert classified 1 of the software processes as a large functional process, whereas this functional process was deemed by the designer of the experiment to be a medium functional process. The independent expert identified 3 more data movements than the designer of the experiment, and this affected the total number of data movements identified in that functional process and resulted in its classification as a large functional process.
Similarly, the independent expert classified another software process as a medium functional process, whereas this functional process was deemed by the designer of the experiment to be a large one. The independent expert identified 2 fewer data movements than the designer of the experiment, and this affected the total number of data movements identified in that functional process and its classification.
The differences in the classification of the 2 software processes were caused by assumptions made by the independent expert about elements in the Graphical User Interface (GUI) of the software system. This affected the identification of the data movements in each software process. These differences should not be considered as misclassifications, because they reflect the various "flavors" of functional behavior of the software that result from these assumptions [28]. Also, the differences in the classification of these software processes did not affect the final functional size approximation of the software.
Next, the Magnitude of Relative Error (MRE) equation [29] was used to calculate the accuracy of the functional size approximation in Table 4 relative to the reference average functional size of 79.3 CFP (see Table 3). This gives a range of MRE values of (Min: 28%, Most-likely: 8.4%, Max: 36.2%). It is worth noting that the approximate 'Most-likely' functional size of 87 CFP is close to the reference average functional size of 79.3 CFP (i.e. it yields an MRE value of 8.4%). The experiment was designed to apply the E&Q COSMIC technique using a single case study (i.e. the uObserve requirements specifications) with a group of 12 participants. In other words, the experiment tested the reproducibility of the classification process with multiple subjects (i.e. participants) using the same requirements document, in order to obtain multiple ranges of functional size on the same requirements document. Assessment of the ranges introduced in the analytical/statistical table was outside the scope of the design of this experiment.
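The MRE calculation referred to in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly used definition MRE = |reference - approximation| / reference, which may differ in detail from the equation in [29]. The function names `mre` and `mean_mre` and the sample sizes are hypothetical and are not taken from the experiment.

```python
from typing import Iterable


def mre(reference: float, approximation: float) -> float:
    """Magnitude of Relative Error of one approximation against a reference size."""
    if reference == 0:
        raise ValueError("reference size must be non-zero")
    return abs(reference - approximation) / abs(reference)


def mean_mre(reference: float, approximations: Iterable[float]) -> float:
    """Average MRE over several approximations of the same reference size."""
    values = [mre(reference, a) for a in approximations]
    if not values:
        raise ValueError("at least one approximation is required")
    return sum(values) / len(values)


# Hypothetical sizes in CFP, chosen only for illustration (assumption).
reference_size = 100.0
participant_sizes = [80.0, 110.0, 150.0]

for size in participant_sizes:
    print(f"{size:.0f} CFP -> MRE {mre(reference_size, size):.1%}")
print(f"mean MRE: {mean_mre(reference_size, participant_sizes):.1%}")
```

Applying the same function to the minimum, most-likely, and maximum approximated sizes would give a range of MRE values of the kind reported in the text.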
Another potential threat is that all the software processes described in the case study document were only functional processes, and none were higher-level processes (Macro, General, or Typical). To mitigate this threat, future experiments will be designed to apply the E&Q COSMIC technique using multiple case studies with higher process levels (i.e. Macro, General, Typical) in order to assess the ranges introduced in the analytical/statistical table (i.e. minimum, most likely, and maximum size values).

External validity threats
One external validity threat here is associated with the inability to generalize the experimental results beyond the experimental setting. The number of participants in the experiment was limited to 12. However, the experiment involved participants with 2 profiles: experienced participants, and participants with limited experience. Participants A1 to A8 had significant experience in software requirements analysis, modeling, and quality assurance, while participants A9 to A12 had limited experience in these areas. In spite of this, the classification results showed that they all committed similar errors in classifying the software processes of the software system.

DISCUSSION AND FUTURE WORK
This experiment looked into the application of the Early & Quick COSMIC technique using a single case study (i.e. the uObserve requirements specifications). The experiment tested the reproducibility and accuracy of the functional size approximations with multiple subjects (i.e. participants) using the same requirements document. The functional size approximations produced by 12 participants from the software engineering industry, using only the rules and concepts of the E&Q COSMIC technique currently available to the industry, did not lead to results that were either reproducible or accurate. The average MRE of the functional size approximations of the 12 participants, relative to the reference average functional size of 79.3 CFP, was as follows: Min MRE 684%, Most likely MRE 1502%, Max MRE 2546%.
Only 2 participants were able to identify the correct number of software processes: the average MRE of the number of identified software processes was 74% for the participants with an average of 12 years of industry experience, and 17% for the participants with limited industry experience.
None of the 12 participants in the experiment classified the identified software processes in the correct E&Q functional classes, in accordance with their levels of granularity.
This experiment could not take into consideration the two preliminary steps recommended in [24] for the application of the Early & Quick COSMIC technique: identification of the levels of granularity of the software requirements specifications; and identification and use of size scaling factors.
This was mainly because of the unavailability in the literature of guidelines for performing these two steps, even though 15 years have passed since the initial publication of the Early & Quick techniques and 8 years have passed since the publication of the COSMIC variant [21]. The implicit assumption in [24] is that such guidelines would lead to reasonably accurate and reproducible approximations, but there is no supporting evidence that this assumption holds. This experiment has used all the information available on the technique, but could not demonstrate that it led to either reproducible or accurate results.
The experiment reported in this paper was conducted in 2011 with the COSMIC version of the E&Q approximation technique. For the IFPUG version of the E&Q technique, there is an April 2012 edition of a reference manual, version 1.1 of the "Early & Quick Function Points 3.1" [30]. Commercial training and certification is provided in Italy on the basis of this reference manual. However, there is not yet any publicly available information on the performance, in terms of reproducibility and accuracy of approximation results, of practitioners trained or certified on the basis of this reference manual. Organizations providing commercial training should conduct such experiments (themselves, or preferably through an independent third party) with the people they have trained, to demonstrate that training improves the reproducibility and accuracy of the functional size approximations and that the E&Q technique itself can produce reproducible and accurate functional size results.
The Early & Quick COSMIC technique does not require the participants to know the detailed rules and definitions of the standard COSMIC and IFPUG methods. A future experiment may investigate whether or not people with expertise in either ISO standard would come up with better approximation results.
The software measurement industry has recognized that guidelines are needed, but none has been put into the public domain. Consequently, it has yet to be demonstrated that guidelines lead to reasonably reproducible and accurate results. In summary, there is no documented evidence that: a) the initial E&Q COSMIC design works as intended; or that b) the preparatory guidelines designed to support these approximation techniques work as intended.
The software measurement industry and the researchers need to work on developing such guidelines and on verifying that they work as intended. It must be shown that they lead to: a reproducible approximation of functional size; and a reasonably accurate approximation of functional size.
The methodology used in this experiment can be reused to test the contributions of guidelines as they become available in the public domain. In addition, the case study used in this experiment and the quantitative findings of this research can be used as a benchmark to quantitatively test the contributions of any guidelines that are proposed in the future by researchers or practitioners.

Figure 2. Steps of the experiment

Figure 3. Industrial experience (in years) of the 12 participants in the experiment

Figure 4. Overview of the activities of the participants in the experiment

Table 5. Experimental results of the 8 participants with an average of 12 years of experience

Table 6. Experimental results of the 4 participants with an average of 1 year of experience

The industry and the research community recognize the importance of approximate sizing, and size approximation techniques, like the E&Q COSMIC technique, have been proposed, which consist of: a) a procedural part (i.e. identification and classification of the functional components of software); and b) assignment of the numerical size values of the classified functional components using tables of size factors, such as the E&Q COSMIC statistical table.

Table 2 presents both the component ranges and the size values for the Early & Quick COSMIC approximation technique.

Table 2. Early & Quick COSMIC components and size values [21]

Table 3. Functional size of the original case study measured by experts [28]

Table 4. The reference classification & size approximation

Table 7. Percentage difference in functional size approximation

Table 9. Number of software processes identified by participants A1 to A8
Average MRE on software processes: 74%

Table 10. Number of software processes identified by participants A9 to A12