
Research Article Cite This: ACS Comb. Sci. XXXX, XXX, XXX-XXX

Process-Function Data Mining for the Discovery of Solid-State Iron-Oxide PV

Elana Borvick, Assaf Y. Anderson,* Hannah-Noa Barad, Maayan Priel, David A. Keller, Adam Ginsburg, Kevin J. Rietwyk, Simcha Meir, and Arie Zaban*

Department of Chemistry, Institute for Nanotechnology & Advanced Materials, Bar Ilan University, Ramat-Gan 52900, Israel

Supporting Information

ABSTRACT: Data mining tools have been known to be useful for analyzing large material data sets generated by high-throughput methods. Typically, the descriptors used for the analysis are structural descriptors, which can be difficult to obtain and to tune according to the results of the analysis. In this Research Article, we show the use of deposition process parameters as descriptors for analysis of a photovoltaics data set. To create a data set, solar cell libraries were fabricated using iron oxide as the absorber layer deposited using different deposition parameters, and the photovoltaic performance was measured. The data was then used to build models using genetic programing and stepwise regression. These models showed which deposition parameters should be used to get photovoltaic cells with higher performance. The iron oxide library fabricated based on the model predictions showed a higher performance than any of the previous libraries, which demonstrates that deposition process parameters can be used to model photovoltaic performance and lead to higher performing cells. This is a promising technique toward using data mining tools for discovery and fabrication of high performance photovoltaic materials.

KEYWORDS: data mining, iron-oxide, photovoltaics, genetic programing


Because of the increase in population and the growing use of energy, there is a need to develop alternative renewable energy sources, which will minimize pollution. One of these sources, which is already being implemented, is solar energy. Recently, investigations have been performed into using metal oxides for photovoltaics (PV), as they are stable, nontoxic, and abundant materials, which have characteristics making them potential candidates for use in photovoltaics.1 Although these materials have some promising attributes suitable for photovoltaics, they still have to be improved before they reach efficiencies that will make them competitive with options currently on the market. One way to study photovoltaic cells in order to improve their efficiencies is using data mining tools. The present research shows the use of data mining tools to model the effect of the process parameters on the performance of metal oxide photovoltaic cells. This model shows the direct relationship between the process parameters and the performance without requiring structural characterization as shown in Figure 1. Furthermore, the model enables the fabrication of cells with improved photovoltaic performance.

Data mining tools are vital for the study of metal oxide photovoltaics as the cells are being fabricated and characterized using combinatorial methods. Combinatorial methods enable high-throughput fabrication and characterization of a large variety of metal oxides, which leads to the production of large amounts of data.2 Even when studying just a few samples the information obtained can include hundreds to thousands of data points, where each of these points has several characterizations. When further treatments are done on the samples (for example, post annealing) the numbers multiply. Traditional ways of analyzing and plotting data are no longer appropriate to examine this large mass of data. To analyze this data and gain understanding that will lead toward new discoveries and higher PV efficiencies,

Received: August 8, 2017 Revised: October 22, 2017


DOI: 10.1021/acscombsci.7b00121 ACS Comb. Sci. XXXX, XXX, XXX−XXX


ACS Combinatorial Science

Figure 1. Process−structure−function relationships. Process affects both structure and function. The structure−function route can lead to interesting scientific conclusions; however, the process−function route can lead to faster discoveries that can affect future work on a practical level.

sophisticated informatics and statistics tools are needed.3 This requirement leads to combining the world of materials science with the world of data mining.

Data mining is the process of discovering patterns and gaining new knowledge from large amounts of data.4 This technique has been used in many different fields, including finance,5 business,6 and science.7 Although within the scientific community data mining is more commonly used in the field of bioinformatics,8 recent work has been done utilizing data mining algorithms to visualize,9 cluster,10 and model11 materials data. The data mining algorithms have been employed on a variety of material data sets to reach new understandings such as analyzing XRD spectra,12 predicting crystal structures,13 and predicting photovoltaic performance.11 Application of data mining tools to material data sets has shown the importance of these algorithms and how they can help reach new and rapid understanding of material structures and properties.

There is a large variety of data mining tools that are available for analyzing data arising from materials science. The data mining tools have many purposes such as dividing data into groups, summarizing the data and visualizing it,9 classifying the data,10 and building models that describe the data.11 In this study, we concentrate on regression models used to describe the data and understand the underlying trends in the data. There are various algorithms that have been developed for this objective.4 Our work focuses on two methods, genetic programing as implemented in the Eureqa program and stepwise regression, which we use to model the data. Stepwise regression and genetic programing have been used to determine absorption wavelengths that can be used to find the chemistry of leaf samples,14 to predict photovoltaic performance,11 and to identify the structural features associated with hole traps in silicon.15 The models obtained are used to find underlying trends in the data, and the trends found are then used to fabricate cells with improved PV performance. A detailed explanation of the genetic programing and stepwise regression methods, and their implementation and analysis is given in the Experimental Section.

Traditionally, when analyzing a materials science data set, the descriptors chosen for the analysis are the material properties.16 These descriptors include bandgaps,9 thickness,11 structure,17 etc. Although these descriptors are important and can lead to interesting physical discoveries about the material data set, there are several difficulties when working with them. First, these descriptors are often not readily available, since they can involve long, complicated, and costly measurements, or calculations that are not universal, that is, cannot be done for all materials. Moreover, when working with a large variety of materials and data sets it is important to be able to get the same descriptors for all materials,18 which is often not the case for structural characteristics in materials science. Another difficulty in using material property descriptors is that the knowledge resulting from the analysis usually cannot be tested or implemented to verify the model. For example, if we find that a specific bandgap in a material will lead to the functionality we are looking for, we often do not know how to fabricate the new material with the desired bandgap. Furthermore, tailoring the bandgap can lead to other undesired changes in the materials. Accordingly, employing material properties as descriptors leads to interesting conclusions that are not useful on a practical level.

In this study, we use an alternative approach to choosing the model descriptors for studying the absorber layers in photovoltaic cells, which has been suggested elsewhere in studying transparent conductive oxides:19 instead of using structural descriptors we use process descriptors. Process descriptors describe the fabrication process of the material in a data set. The process descriptors can include many deposition parameters, for example deposition duration, temperature, oxygen flow, deposition rate, and more. The relationship between process descriptors, structural descriptors, and functionality of a material is shown in Figure 1, and Table 1

Table 1. Comparison of Structural Descriptors and Process Descriptors(a)

Structure descriptors (acquisition: complex; tuning: hard)
  descriptor           acquisition method
  ...                  calculation through absorption measurement
  ...                  four point probe measurement/impedance measurement
  layer thickness      ellipsometry/cross section/AFM
  composition          calculation/EDS/XRD
  ro-vibrational       Raman
  crystal structure    XRD

Process descriptors (acquisition: simple; tuning: simple)
  descriptors: temperature, deposition duration, target power, pressure
  acquisition method: user determined or material/user determined

(a) This table illustrates how process descriptors are easiest to acquire and to tune. The table elaborates on how the descriptor can be acquired (either by a certain measurement or determined by user or material specifications), how complex it is to acquire each descriptor, and if the descriptor can be controlled or tuned, that is, can another deposition be performed where the descriptor is set to a specific value.

shows a comparison between the process and structure descriptors. The advantage of process descriptors is that they are simple, readily available, and do not involve any additional measurements or analyses. Furthermore, by studying the relationship between the functionality and the process parameters we can obtain an exact “recipe” for a material with the desired functionality. After the relationships between the process and function parameters are understood and the optimal process parameters are found we can study the structure in a targeted manner and see how the deposition parameters also affect the structure, which in turn affects the function. Analyzing the structure after the parameters are optimized will lead to a deeper understanding of the system. The process descriptor approach can lead to faster discoveries that have a practical effect on the scientific research.

To test this approach on experimental data, we employed a data set of photovoltaic cells with an absorber layer deposited from a Fe2O3 target. We chose Fe2O3 since it is an especially interesting material because of its high abundance, stability, low cost,20 and bandgap (∼2 eV); it has been studied in the past for water oxidation applications.21 There is almost no research done on the use of Fe2O3 as a material for photovoltaic systems. One report, on utilizing Fe2O3 instead of TiO2 or ZnO in dye cells, shows that when using Fe2O3 as the electron conducting layer its energy levels can be aligned quite well with the dye HOMO−LUMO states, if the electron conducting layer is coated with reduced graphene oxide.22 There have been two reports on employing Fe2O3 as the absorber layer in photoelectrochemical solar cells. In this case, the Fe2O3 is deposited by pulsed laser deposition onto a transparent conductive oxide and sandwiched with an aqueous electrolyte and a counter electrode (Pt). Somekawa et al.23 showed that they were able to improve efficiencies by thermally treating their electrodes, with their highest efficiency being 0.035%. In the second report, Seki et al.24 doped Fe2O3 with Rh, and demonstrated an improvement in electrical conductivity, due to increased electron mobility. Yet even though they had higher electron mobilities they obtained low photocurrents of up to 2.5 μA cm−2.
In our data set, the Fe−O absorber layer was fabricated by sputtering on a glass substrate to create a library of data points with different deposition parameters, which lead to different functionality. To achieve a diversity of the deposition parameters per library a static mask was used during the deposition.25 This mask divided the library into six sections each with different deposition parameters. The libraries were measured and several models were developed to connect the functionality of the solar cells (maximum performance) to the parameters used to deposit the Fe−O absorber layer. The developed models were established using genetic programing and stepwise regression, and then analyzed to understand the trends observed in the data. To apply the discovered trends from the model and see how they affect the deposited layer another library was fabricated with the same starting material (Fe2O3). The newly prepared library had a higher PV performance than any of the previously fabricated libraries, indicating that our process descriptor based model was essential in choosing the correct deposition parameters in order to improve the material. A flowchart explaining the process done in this work is shown in Figure 2.

Figure 2. Flowchart of the work done in the study. The samples are fabricated, measured, annealed, the data is cleaned and models are built. At the next stage more samples are fabricated and measured according to the obtained models and more accurate models can be built by including the new data in the original data set.

RESULTS AND DISCUSSION

To build the process descriptor model a diverse data set was required, leading to the fabrication of four different libraries. The absorber layer (Fe−O) in each of these libraries is divided into six sections using a static mask during the deposition. Initially, in the four samples fabricated to train the models, the design of experiments was done so that all but one descriptor are kept constant along the sections of each library, and one descriptor changes along the different sections. This was done to ensure we cover the relevant parameter space while still getting insights into the changes taking place. In the library fabricated to test the model, all parameters were changed in each section to enable maximum diversity in process parameters. In future studies all the samples used for training the model could have maximal diversity as the tools used to build the models can tolerate multiple variations. A schematic view of a library and the mask during the deposition can be seen in Figure 3. To analyze the PV parameters of the libraries current−voltage (IV) measurements were performed. The libraries then went through three cycles of post annealing with increasing temperatures, and IV measurements were performed after every cycle of post annealing. The IV maps showing the maximum power (Pmax) for all the points of each library, and the data of process parameters of each point with the measured Pmax are given in the Supporting Information.

Figure 3. Fabrication. (a) Schematic of a standard library used for this study. The library contains a glass substrate covered with a homogeneous layer of a transparent conductive oxide (TCO), and a homogeneous layer of TiO2 is deposited by spray pyrolysis on top of it. The next layer in the library is the absorber Fe2O3 layer deposited by RF sputtering. The absorber layer is deposited using a static mask and as a result is divided into six sections, each deposited using different parameters. The top layer is an array of 169 Au electrical back contacts deposited using RF sputtering through a shadow mask. (b) Top and side views of the static mask which is used to divide the library into six sections.


Table 2. Model Building(a)

method                model
genetic programing    0.2 × temp = 0.3 × power + 0.2 × temp × oxygen + 0.5 × temp^2 + power + 0.5 × power × temp^2 − 0.1 − 0.5 × oxygen − 0.6 × temp^3 − 0.5 × power × temp^3
genetic programing    Pmax = power − oxygen − 1.7 × temp − 0.05 2.4 + temp
genetic programing    Pmax = (power − oxygen − 1.7 × temp)/(2.5 + temp)
genetic programing    Pmax = (power − oxygen − temp)/(2.1 + temp)
stepwise regression   Pmax = 0.04 − 0.5 × temp − 0.2 × oxygen + 0.4 × power − 0.3 × temp × oxygen

(a) Four best models generated by Eureqa and model generated by stepwise regression, with their R2, mean square error (MSE), and complexity.
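As a reading aid, the stepwise regression model listed in Table 2 can be written as a small function and probed numerically. This is only a sketch: the helper `impact_direction`, the finite-difference step, and the operating point are illustrative assumptions, and the inputs are taken to be on the normalized scale used for model building, whose exact definition is not given here.

```python
def pmax_stepwise(temp, oxygen, power):
    """Stepwise-regression model from Table 2 (inputs on the normalized scale)."""
    return 0.04 - 0.5 * temp - 0.2 * oxygen + 0.4 * power - 0.3 * temp * oxygen


def impact_direction(model, point, name, step=1e-3):
    """Sign of the model's response to a small increase in one descriptor."""
    raised = dict(point)
    raised[name] += step
    delta = model(**raised) - model(**point)
    if delta > 0:
        return "positive"
    if delta < 0:
        return "negative"
    return "none"


# Hypothetical operating point in normalized units (not taken from the paper).
point = {"temp": 0.5, "oxygen": 0.5, "power": 0.5}
directions = {name: impact_direction(pmax_stepwise, point, name) for name in point}
print(directions)
```

Because of the temp × oxygen cross term, the strength of the temperature effect depends on the oxygen value and vice versa, so the direction is evaluated at a chosen point rather than read off a single coefficient.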

During the deposition, some of the points in the border of the sections were affected by the deposition of the neighboring sections since there was a small gap between the static mask and the substrate. The points that were affected were removed from the data set, since they did not depict the change between areas accurately. After these points were eliminated, the data was cleaned and normalized and the mean of the top 25th percentile of the data points in each section was taken for each of the libraries. The results generated a data set consisting of 96 data points, where each point included different deposition and post annealing parameters. The data set was used to build a model that showed the impact of the various descriptors (post annealing temperature, oxygen flow, power applied on sputtering target, and deposition duration) on the PV performance. The models with the highest ratings made using genetic programing and the model built using stepwise regression are shown in Table 2. All the models provided by Eureqa are shown in the Supporting Information, Table 1. A detailed explanation of the various processes imposed on the data and the process of building the models is given in the Experimental Section.

The formulas shown for the Eureqa models are the four formulas with the best ratings out of 7 × 10^10 formulas that the program evaluated. As the complexity of the model increases the R2 increases as well, while the MSE decreases, because as more operators and descriptors are added to the calculation a better fit can be obtained. The R2 for the models built using genetic programing show a significantly better fit than those built using stepwise regression. Genetic programming allows for many operators to be used and does not test the significance of each variable added. Any variable or operator that increases the fit to the data will be added to the model. As a result, genetic programing generally produces long complicated models with high R2. Stepwise regression, on the other hand, is a simpler method that tests the significance of each variable before it is added to the model. Stepwise regression models are usually short and simple with lower R2 than the genetic programing models. Although complex models can be useful for making accurate predictions, simple models, like the ones generated using stepwise regression, have an advantage for obtaining qualitative insights into how the various descriptors affect the Pmax. For the purpose of this study the differences in R2 and MSE between the models are insignificant, since our objective was not to make exact and accurate predictions of the Pmax based on the descriptors, but rather to find relationships between the deposition parameters and the Pmax. The direction (positive/negative) and magnitude of the impact each descriptor has on the performance were studied in order to find the trends in the data and are shown in Table 3.

Table 3. Model Analysis(a)

variable       direction   sensitivity (genetic programing)   P value (stepwise regression)
temperature    negative    0.7                                2.3 × 10^−9
power          positive    0.4                                2.3 × 10^−5
oxygen         negative    0.4                                0.03

(a) Relationships between descriptors and performance according to the models built using genetic programing and stepwise regression. The direction (positive/negative) as well as the extent of the impact (which can be seen by the sensitivity (Eureqa) and the P value (stepwise regression)) show similar trends in the various models built.

All the models built, either with Eureqa or stepwise regression, show the same trends. The results show that the power applied to the sputtering target has a positive impact on the performance so that libraries fabricated using higher deposition power achieved better performances. On the other hand, temperature and oxygen flow have a negative impact on the performance. Low temperatures and a low oxygen flow lead to better performance. The oxygen dependency is explained by the fact that iron has many different oxidation states ranging from +6 to −2 (ref 26). Each of these oxides has a different band gap, which affects the photovoltaic performance. The oxygen flow during the deposition affects which iron oxide absorber layer will form, which in turn will affect the band gap and the photovoltaic performance. The temperature and power applied on the target during the deposition affect the size of the grains of material. Lower annealing temperatures and higher power lead to smaller grains, which can affect the performance of the solar cell with this absorber layer.27 In addition, it has been shown in other studies on different materials that post annealing temperature also affects the crystallinity of materials.27b,c As the post annealing temperature increases the material becomes more crystalline, which in turn affects the photovoltaic performance.


Table 4. Second Round of Fabrication(a)

(a) Parameters chosen for test library according to the relationships shown in the models. Each row represents one section of the library, and the Pmax is the average of the top 25th percentile of data points after the data cleaning was performed.
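The per-section Pmax value used throughout (removal of nonphotovoltaic points, removal of outliers beyond three standard deviations, then the mean of the top 25th percentile) can be sketched as follows. Several details are assumptions made only for illustration: a cell is treated as photovoltaic when its Pmax is positive, the population standard deviation is used, and the 75th percentile is computed with linear interpolation; the function name `section_pmax` is hypothetical.

```python
import statistics


def section_pmax(pmax_values, n_sigma=3.0):
    """Mean of the top quartile of one section's Pmax values after cleaning."""
    # Assumption: a cell counts as photovoltaic when its measured Pmax is positive.
    active = [p for p in pmax_values if p > 0]
    if not active:
        return 0.0  # no photovoltaic activity anywhere in this section

    # Drop outliers lying more than n_sigma standard deviations from the mean.
    mean = statistics.fmean(active)
    spread = statistics.pstdev(active)
    kept = [p for p in active if abs(p - mean) <= n_sigma * spread]

    if len(kept) < 2:
        return statistics.fmean(kept)

    # Third quartile (75th percentile) with linear interpolation.
    q3 = statistics.quantiles(kept, n=4, method="inclusive")[2]
    top = [p for p in kept if p >= q3]
    return statistics.fmean(top)
```

Averaging only the top quartile keeps cells degraded by pinholes or droplets from pulling the section value down, at the cost of discarding information about the spread within a section.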

The deposition duration was not included as a parameter in any of the models. This is surprising as it would be expected that the deposition duration would affect the thickness of the layer, which, in turn, would influence the performance. The omission of the deposition duration descriptor from the generated models can be due to the fact that this work did not cover the extreme cases of very short or very long deposition time. When designing the experiment, deposition durations were chosen in order to deposit enough material to create a significant absorber layer, yet not a layer so thick that it would suppress the charge carriers. The limited range of deposition durations may have caused this parameter to be omitted from the model created. Alternatively, the omission of the deposition duration could be due to the noise in the experimental data, which makes the impact of the deposition duration unclear.

On the basis of the results obtained by the models produced, specifically the trends that were observed for the various process descriptors, an additional library was fabricated. The library was prepared to examine the results of the model predictions on the performance of the Fe−O absorber layer. The parameters used are shown in Table 4, and a table presenting the predicted Pmax for the new library according to each of the five models presented is given in the Supporting Information, Table 2. The new library was also divided into six sections with the deposition parameters varying along the different sections. The sputtering target power, which had a positive impact, was set to be high in all sections and the temperature and oxygen flow, which had a negative impact, were set to be low. The values of these parameters were slightly changed along the different sections to cover all the options within this optimized range. The deposition duration was varied along the different sections within the same range, which was understood to be ideal.
The library fabricated based on the trends found in the models showed a significantly better performance than any of the original libraries fabricated for this study. Two out of the six sections of the test library showed higher performances than any of the previous sections from all the libraries, and three of the four remaining sections fell within the top 20 percent of the original data set. A comparison between the Pmax of the Fe−O photovoltaic cells from previous studies in our group to the improved cells fabricated in this study is shown in Figure 4. Eleven libraries with Fe−O were fabricated and measured prior to this study; each of these libraries was saved in our database for future use. Four “new” libraries were deposited in order to form the data set used for this study. The samples that were fabricated for building the model have a wide range of photovoltaic performances; some of the cells in these libraries have slightly better performance than any of the other cells in libraries from previous studies. The PV performance in the last sample was highly improved because of the models made in this study.

Figure 4. Comparison of performance of Fe−O photovoltaic cells from the database to those fabricated as part of this study. Each bar represents a library with 169 different solar cells. The gray (triangle) bars are libraries fabricated for previous studies, the blue (square) bars are the four libraries fabricated for the data set used to build the model, and the red (circle) bar is the library fabricated in order to test the trends shown in the model. As can be seen, the library fabricated based on the results and trends obtained by the model has cells with higher performances than what had been achieved prior to this study.

CONCLUSIONS

In this study, we show an approach of using genetic programing and stepwise regression in conjunction with deposition parameters to model the performance of Fe−O as an absorber layer for PV devices. We then test the models obtained and demonstrate how this approach can lead to useful insights into the relationships between the deposition parameters and performances, as well as generate cells with higher PV performance.

EXPERIMENTAL SECTION

Fabrication of Solar Cell Libraries. Four libraries were fabricated for this study. Each library consisted of a glass substrate coated with a fluorine doped SnO2 (FTO) layer as a transparent conductive oxide (72 mm × 72 mm, Hartford Glass Co., Inc.). On top of the FTO, a homogeneous electron conducting layer of TiO2 was deposited using a home-built spray pyrolysis system.28 The precursor was prepared from titanium tetraisopropoxide mixed with acetylacetone and ethanol to reach a final concentration of 0.1 M. The resulting solution was sprayed on a hot plate at 425 °C to form the TiO2 layer. The Fe−O absorber layer was deposited on top of the TiO2 layer using RF sputtering (AJA International Inc.) from a 2 in. Fe2O3 target (Holland-Moran, 99.9%).29 The deposition was done at room temperature with an argon gas flow of 30 sccm, and total chamber pressure of 3 mTorr. To achieve variation in deposition parameters the deposition was done using a static mask (see Figure 3) and the substrate was rotated by 60° between depositions so that different parts of the substrate were revealed for different deposition parameters. This allowed for the deposition of a library with six sections where each section had different deposition parameters. For each library, we kept all but one parameter constant, that is, changed only one deposition parameter for every library. To complete the solar cell devices electrical back contacts were deposited by RF sputtering from a 2 in. Au target (Petrus, 99.99%). To obtain 169 solar cells per library a shadow mask with an array of 13 × 13 round holes (1.8 mm diameter) was placed on the library.30 The deposition was done at room temperature under argon flow of 30 sccm, with a different oxygen flow in each library as stated above and with a total chamber pressure of 2 mTorr. The target power was set to 100 W and the deposition was set for 10 min, based on the deposition rate, to obtain a thickness of 100 nm. To have electrical contact during IV measurements a metal frame was soldered around the edges of the sample. The samples then underwent three cycles of post annealing in low vacuum, each cycle at a different temperature (50, 100, and 150 °C) for 2 h each. All the characterization measurements were done before the post annealing, as well as in between each post annealing cycle.

Characterization of the Solar Cell Libraries.
Optical characterization was performed by measuring total transmission (TT), total reflectance (TR), and specular reflection (SR). These were measured using a home-built system, which has been reported in depth elsewhere.31 The system consists of an x−y scanning table (Märzhäuser Wetzlar GmbH & Co. KG) with a specular reflectance probe and two integrating spheres connected by optical fibers to CCD array spectrometers (HR4000, Ocean Optics Inc.). Photovoltaic characterization was done by performing I−V measurements on all 169 contacts of all libraries. These measurements were executed using a home-built system, which has been reported in depth in previous work.31 The measurement is taken by illuminating each contact using a xenon laser-driven light source (LDLS, EQ-99FC, ENERGETIQ), and using a Keithley 2400 source meter. Each point was measured both in the ascending and descending voltage scan directions. In this study, we concentrated on the data measured in the ascending scan though no significant hysteresis was visible.

Data Cleaning. The data for each library is divided according to the various sections, defined by their position under the static mask during deposition, and overlapping parts were removed from the data set (the data still remained in the database for possible future use). The removed points can be interesting for a structure−function study, but to use data for a process-function study the exact process parameters must be known. After removing the overlapping points there remained approximately 108 data points for each library. Dividing the points into the different sections with diversity in deposition parameters resulted in approximately 15−20 data points per section. Subsequent to the initial data clean up, each section of the library, with different deposition parameters, is attended to separately. For each section all nonphotovoltaic points were removed, and if none of the points showed photovoltaic activity the section’s Pmax was set to zero. To remove outliers, any point that varied by three standard deviations from the mean of the analyzed section was removed before further analysis. Some variations were found between the different points in each area, as a result of the distance from the target. To see the effect of the distance from the target on the performance an additional descriptor could have been added with the distance of each point from the target but this would add further complexity to the model. To keep the model simple and understandable, the mean of the top 25th percentile of points for each section was used. The top 25% were chosen as opposed to the mean of all the points since it is known that due to pinholes and droplets that are created during the deposition some of the cells have very low currents, which are not typical of the material or process parameters chosen.

Building Models Using Genetic Programing and Stepwise Regression. Genetic Programing. Genetic programing is a modeling technique that is based on the mechanisms of biological evolution.32 The genetic programing modeling technique is used to find solutions to complex problems, such as multidimensional problems or disorganized data. Similar to biological evolution, in this approach a set of randomly generated solutions to a given problem form the “first generation” of solutions. Each solution from the “first generation” is characterized with a “fitness function”, according to the error metric that the user defines, in order to test how well the solution fits the data.
The solutions that show the best fit are modified, and parts of them are combined to create new functions (mutation and crossover), which form the next generation of solutions. This combination and modification process is repeated until it reaches the predefined error threshold or until it is stopped by the user. After many “combination-modification” repetitions the solutions become more complex and their fit to the data improves considerably.

For this study we used the “Eureqa” modeling engine to perform genetic programing on our data set. Eureqa is software developed by Schmidt and Lipson,33 which implements genetic programing in an intuitive and convenient way.34 Eureqa generates a set of solutions with increasing complexity and fit. The complexity is measured by the number of descriptors and mathematical functions used, and the fit is measured according to the error metric the user defines. The program rates the solutions so that those with the lowest complexity and highest fit receive the best ratings. In addition, Eureqa has several tools for further analysis of the generated models. These tools allow the user to summarize the descriptors used in the models, to obtain model predictions on a different data set, and to produce model statistics and plots. One of these tools generates the variable sensitivity report, which summarizes the impact that each descriptor has on the model. The report shows, for each descriptor, whether its impact on the model is positive or negative, where a positive impact means that when there is an increase in the value of the descriptor the result of the model increases, and a negative impact means that when there
is an increase in the value of the descriptor the result of the model decreases. The variable sensitivity report also shows the magnitude of the effect that each descriptor has on the model.

When using Eureqa for this study, the data was split into a 75% training set and a 25% validation set, the latter of which was used to calculate the error. The rows were shuffled before splitting to ensure a random split. The mathematical operators used in this study were addition, subtraction, multiplication, division, square roots, and integer powers. The error metric was set to absolute error (the default).

Stepwise Regression. Stepwise regression is a technique used to model data and to find the significance of descriptors.35 It measures the significance of each descriptor to determine whether the descriptor should be included in the model. It then measures the significance of the cross-products of the descriptors and adds the significant ones to the model. At each step, every term already in the model is retested to see whether it is still important once other descriptors have been added; terms that are no longer important are removed from the model. The significance of each descriptor is measured by calculating the P value of adding the descriptor to the model. The P value is the probability of finding the same or more extreme results when the null hypothesis of a study is true. The “null hypothesis” is defined to be that there is “no difference”, and the P value is a number between zero and one, where a smaller P value provides stronger evidence against the null hypothesis and therefore for a significant difference. The “threshold” of the P value is typically set to 0.05, so that if a P value larger than 0.05 is calculated the null hypothesis is not rejected and the difference is not considered significant.
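
To make the definition of the P value concrete, the following minimal sketch estimates one with a permutation test: the response values are shuffled many times, and the P value is the fraction of shuffles whose correlation with the descriptor is at least as extreme as the observed one. This is a swapped-in illustration, not the test used by regression software, which typically derives P values from an F or t statistic. The data and the function names (`pearson_r`, `permutation_p_value`) are assumptions made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate P(|r| at least as large as observed | no association)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            at_least_as_extreme += 1
    # The +1 terms count the observed arrangement itself.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Synthetic example: one response follows the descriptor, one does not.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
response_linear = [2 * x + 1 for x in descriptor]
response_unrelated = [(x - 4.5) ** 2 for x in descriptor]  # zero linear correlation

p_linear = permutation_p_value(descriptor, response_linear)
p_unrelated = permutation_p_value(descriptor, response_unrelated)

THRESHOLD = 0.05
print("keep descriptor for linear response:   ", p_linear < THRESHOLD)
print("keep descriptor for unrelated response:", p_unrelated < THRESHOLD)
```

With the 0.05 threshold, the descriptor would be kept in the first case and left out in the second.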
In this study, if the P value for a given descriptor was larger than 0.05, it was concluded that the descriptor does not have a significant impact, and the descriptor was excluded from the model. Stepwise regression with 10-fold cross validation was performed on the data using the “stepwiselm” function in MATLAB.36 The function obtained by stepwise regression was then analyzed using the Solver tool in Excel. Solver finds the optimal solution for a function of several variables subject to constraints. In this study we used the evolutionary method offered in Solver to find the parameters that lead to a maximum Pmax according to the function found by stepwise regression.
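
The idea of maximizing a fitted function within parameter bounds by an evolutionary search can be sketched as below. This is a simple elitist scheme with crossover and Gaussian mutation written for illustration; it is not the algorithm implemented in Solver. The model `hypothetical_pmax_model`, its coefficients, and the two bounded descriptors are invented and do not correspond to the function obtained in this study.

```python
import random


def evolve_maximum(objective, bounds, pop_size=40, generations=150,
                   sigma_fraction=0.1, seed=0):
    """Maximize objective(x) over box bounds with a small evolutionary loop."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    n_parents = pop_size // 4
    for _ in range(generations):
        # Selection: keep the best quarter unchanged (elitism).
        population.sort(key=objective, reverse=True)
        parents = population[:n_parents]
        children = []
        while len(children) < pop_size - n_parents:
            mother, father = rng.sample(parents, 2)
            child = []
            for (lo, hi), gene_m, gene_f in zip(bounds, mother, father):
                gene = gene_m if rng.random() < 0.5 else gene_f      # crossover
                gene += rng.gauss(0.0, sigma_fraction * (hi - lo))   # mutation
                child.append(min(hi, max(lo, gene)))                 # enforce bounds
            children.append(child)
        population = parents + children
    best = max(population, key=objective)
    return best, objective(best)


def hypothetical_pmax_model(x):
    """Invented linear model with one cross-product term (illustration only)."""
    a, b = x
    return 1.0 + 0.5 * a - 0.2 * b + 0.1 * a * b


# Assumed allowed ranges for the two hypothetical descriptors.
bounds = [(0.0, 10.0), (0.0, 5.0)]
best_settings, best_value = evolve_maximum(hypothetical_pmax_model, bounds)
print("best settings:", best_settings, "predicted value:", best_value)
```

Because the best candidates are carried over unchanged in every generation, the best predicted value never decreases from one generation to the next.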

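The section-level aggregation described under Data Cleaning can be sketched as follows. Several details are assumptions made only for this example: a point is treated as photovoltaic when its Pmax is greater than zero, the outlier rule is applied once using the sample standard deviation and removes points lying more than three standard deviations from the mean, and the size of the top 25% is rounded to the nearest whole point. The function name `summarize_section` and the numbers are hypothetical.

```python
import statistics


def summarize_section(pmax_values, top_fraction=0.25, n_sigma=3.0):
    """Collapse the points of one library section into a single Pmax value."""
    # Step 1: drop nonphotovoltaic points (assumed here to mean Pmax <= 0).
    active = [p for p in pmax_values if p > 0]
    if not active:
        return 0.0  # no photovoltaic activity in this section

    # Step 2: drop outliers farther than n_sigma standard deviations from the mean.
    if len(active) > 1:
        mean = statistics.mean(active)
        spread = statistics.stdev(active)
        active = [p for p in active if abs(p - mean) <= n_sigma * spread]

    # Step 3: average the best-performing fraction of the remaining points.
    ranked = sorted(active, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return statistics.mean(ranked[:n_top])


# Made-up section: two dead contacts, fifteen ordinary points, one outlier.
section = [0.0, 0.0] + [1.0] * 8 + [2.0] * 4 + [3.0] * 3 + [50.0]
section_pmax = summarize_section(section)
print("representative Pmax for the section:", section_pmax)
```

Averaging only the top fraction keeps contacts degraded by pinholes or droplets from pulling down the value assigned to a set of deposition parameters.
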
Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscombsci.7b00121.

I−V measurements for all four libraries fabricated for building the models, a table with all models provided by Eureqa, a table with the predicted Pmax for the new library according to each of the five models presented in the paper together with the experimental values, and the full data set including the deposition parameters used for all four libraries (PDF)

Corresponding Authors

*E-mail: [email protected]
*E-mail: [email protected]

ORCID

Assaf Y. Anderson: 0000-0003-1657-4415
Kevin J. Rietwyk: 0000-0002-2266-2713

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The research was supported by the Israeli National Nanotechnology Initiative (INNI, FTA project) and by the European Union’s Seventh Program for research, technological development and demonstration under grant agreement no. 309018. E.B. would like to thank the Rieger Foundation for their financial support.

REFERENCES

(1) Rühle, S.; Anderson, A. Y.; Barad, H.-N.; Kupfer, B.; Bouhadana, Y.; Rosh-Hodesh, E.; Zaban, A. All-oxide photovoltaics. J. Phys. Chem. Lett. 2012, 3 (24), 3755−3764.
(2) (a) Keller, D. A.; Ginsburg, A.; Barad, H.-N.; Shimanovich, K.; Bouhadana, Y.; Rosh-Hodesh, E.; Takeuchi, I.; Aviv, H.; Tischler, Y. R.; Anderson, A. Y.; Zaban, A. Utilizing pulsed laser deposition lateral inhomogeneity as a tool in combinatorial material science. ACS Comb. Sci. 2015, 17 (4), 209−216. (b) Majhi, K.; Bertoluzzi, L.; Rietwyk, K. J.; Ginsburg, A.; Keller, D. A.; Lopez-Varo, P.; Anderson, A. Y.; Bisquert, J.; Zaban, A. Combinatorial Investigation and Modelling of MoO3 Hole-Selective Contact in TiO2|Co3O4|MoO3 All-Oxide Solar Cells. Adv. Mater. Interfaces 2016, 3 (1), 1500405.
(3) Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. From data mining to knowledge discovery in databases. AI Magazine 1996, 17 (3), 37.
(4) Friedman, J.; Hastie, T.; Tibshirani, R. The Elements of Statistical Learning; Springer Series in Statistics; Springer: Berlin, 2001; Vol. 1.
(5) Kim, K.-j.; Han, I. Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications 2000, 19 (2), 125−132.
(6) Apte, C.; Liu, B.; Pednault, E. P.; Smyth, P. Business applications of data mining. Commun. ACM 2002, 45 (8), 49−53.
(7) Mjolsness, E.; DeCoste, D. Machine learning for science: state of the art and future prospects. Science 2001, 293 (5537), 2051−2055.
(8) Wang, J. T.; Zaki, M. J.; Toivonen, H. T.; Shasha, D. Introduction to data mining in bioinformatics. In Data Mining in Bioinformatics; Springer, 2005; pp 3−8.
(9) Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.; Tropsha, A.; Curtarolo, S. Materials cartography: Representing and mining materials space using structural and electronic fingerprints. Chem. Mater. 2015, 27 (3), 735−743.
(10) Kusne, A.; Keller, D.; Anderson, A.; Zaban, A.; Takeuchi, I. High-throughput determination of structural phase diagram and constituent phases using GRENDEL. Nanotechnology 2015, 26 (44), 444002.
(11) Yosipof, A.; Nahum, O. E.; Anderson, A. Y.; Barad, H. N.; Zaban, A.; Senderowitz, H. Data Mining and Machine Learning Tools for Combinatorial Material Science of All-Oxide Photovoltaic Cells. Mol. Inf. 2015, 34 (6−7), 367−379.
(12) Long, C.; Hattrick-Simpers, J.; Murakami, M.; Srivastava, R.; Takeuchi, I.; Karen, V. L.; Li, X. Rapid structural mapping of ternary metallic alloy systems using the combinatorial approach and cluster analysis. Rev. Sci. Instrum. 2007, 78 (7), 072217.
(13) Curtarolo, S.; Morgan, D.; Persson, K.; Rodgers, J.; Ceder, G. Predicting Crystal Structures with Data Mining of Quantum Calculations. Phys. Rev. Lett. 2003, 91 (13), 135503.
(14) Kokaly, R. F.; Clark, R. N. Spectroscopic determination of leaf biochemistry using band-depth analysis of absorption features and stepwise multiple linear regression. Remote Sensing of Environment 1999, 67 (3), 267−287.
(15) Mueller, T.; Johlin, E.; Grossman, J. C. Origins of hole traps in hydrogenated nanocrystalline and amorphous silicon revealed through machine learning. Phys. Rev. B: Condens. Matter Mater. Phys. 2014, 89 (11), 115202.
(16) Jain, A.; Hautier, G.; Ong, S. P.; Persson, K. New opportunities for materials informatics: Resources and data mining techniques for uncovering hidden relationships. J. Mater. Res. 2016, 31 (08), 977−994.
(17) Venkatraman, V.; Åstrand, P. O.; Kåre Alsberg, B. Quantitative structure−property relationship modeling of Grätzel solar cell dyes. J. Comput. Chem. 2014, 35 (3), 214−226.
(18) Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Big data of materials science: Critical role of the descriptor. Phys. Rev. Lett. 2015, 114 (10), 105503.
(19) Suh, C.; Biagioni, D.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B. Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics. In 2011 37th IEEE Photovoltaic Specialists Conference, 19−24 June 2011; IEEE, 2011; pp 000762−000767.
(20) Cornell, R. M.; Schwertmann, U. The Iron Oxides: Structure, Properties, Reactions, Occurrences and Uses; John Wiley & Sons, 2003.
(21) (a) Klahr, B.; Gimenez, S.; Fabregat-Santiago, F.; Bisquert, J.; Hamann, T. W. Electrochemical and photoelectrochemical investigation of water oxidation with hematite electrodes. Energy Environ. Sci. 2012, 5 (6), 7626−7636. (b) Badia-Bou, L.; Mas-Marza, E.; Rodenas, P.; Barea, E. M.; Fabregat-Santiago, F.; Gimenez, S.; Peris, E.; Bisquert, J. Water oxidation at hematite photoelectrodes with an iridium-based catalyst. J. Phys. Chem. C 2013, 117 (8), 3826−3833.
(22) Shang, X.; Guo, Z.; Gan, W.; Zhou, R.; Ma, C.; Hu, K.; Niu, H.; Xu, J. Dye-sensitized solar cells with 3D flower-like α-Fe2O3-decorated reduced graphenes oxide as photoanodes. Ionics 2016, 22 (3), 435−443.
(23) Somekawa, S.; Kusumoto, Y.; Abdulla-Al-Mamun, M.; Muruganandham, M.; Horie, Y. Wet-type Fe2O3 solar cells based on Fe2O3 films prepared by laser ablation: Drastic temperature effect. Electrochem. Commun. 2009, 11 (11), 2150−2152.
(24) Seki, M.; Takahashi, M.; Ohshima, T.; Yamahara, H.; Tabata, H. Solid−liquid-type solar cell based on α-Fe2O3 heterostructures for solar energy harvesting. Jpn. J. Appl. Phys. 2014, 53 (5S1), 05FA07.
(25) Rühle, S.; Barad, H.; Bouhadana, Y.; Keller, D.; Ginsburg, A.; Shimanovich, K.; Majhi, K.; Lovrincic, R.; Anderson, A.; Zaban, A. Combinatorial solar cell libraries for the investigation of different metal back contacts for TiO2−Cu2O hetero-junction solar cells. Phys. Chem. Chem. Phys. 2014, 16 (15), 7066−7073.
(26) Russo, U.; Long, G. J. Mössbauer Spectroscopic Studies of the High Oxidation States of Iron. Mössbauer Spectroscopy Applied to Inorganic Chemistry 1989, 3, 289.
(27) (a) Lu, Y.-M.; Hwang, W.-S.; Liu, W.; Yang, J. Effect of RF power on optical and electrical properties of ZnO thin film by magnetron sputtering. Mater. Chem. Phys. 2001, 72 (2), 269−272. (b) Figueiredo, V.; Elangovan, E.; Goncalves, G.; Barquinha, P.; Pereira, L.; Franco, N.; Alves, E.; Martins, R.; Fortunato, E. Effect of post-annealing on the properties of copper oxide thin films obtained from the oxidation of evaporated metallic copper. Appl. Surf. Sci. 2008, 254 (13), 3949−3954. (c) Puchert, M.; Timbrell, P.; Lamb, R. Postdeposition annealing of radio frequency magnetron sputtered ZnO films. J. Vac. Sci. Technol., A 1996, 14 (4), 2220−2230.
(28) Ginsburg, A.; Keller, D. A.; Barad, H.-N.; Rietwyk, K.; Bouhadana, Y.; Anderson, A.; Zaban, A. One-step synthesis of crystalline Mn2O3 thin film by ultrasonic spray pyrolysis. Thin Solid Films 2016, 615, 261−264.
(29) (a) Yan, Z.; Keller, D. A.; Rietwyk, K. J.; Barad, H. N.; Majhi, K.; Ginsburg, A.; Anderson, A. Y.; Zaban, A. Effect of Spinel Inversion on (CoxFe1−x)3O4 All-Oxide Solar Cell Performance. Energy Technology 2016, 4, 809−815. (b) Rühle, S.; Yahav, S.; Greenwald, S.; Zaban, A. Importance of recombination at the TCO/electrolyte interface for high efficiency quantum dot sensitized solar cells. J. Phys. Chem. C 2012, 116 (33), 17473−17478.
(30) Barad, H. N.; Ginsburg, A.; Cohen, H.; Rietwyk, K. J.; Keller, D. A.; Tirosh, S.; Bouhadana, Y.; Anderson, A. Y.; Zaban, A. Hot Electron-Based Solid State TiO2|Ag Solar Cells. Adv. Mater. Interfaces 2016, 3, 1500789.
(31) Anderson, A. Y.; Bouhadana, Y.; Barad, H.-N.; Kupfer, B.; Rosh-Hodesh, E.; Aviv, H.; Tischler, Y. R.; Rühle, S.; Zaban, A. Quantum efficiency and bandgap analysis for combinatorial photovoltaics: sorting activity of Cu−O compounds in all-oxide device libraries. ACS Comb. Sci. 2014, 16 (2), 53−65.
(32) Mitchell, M. An Introduction to Genetic Algorithms; MIT Press, 1998.
(33) Schmidt, M.; Lipson, H. Eureqa; Nutonian Inc., 2016.
(34) (a) Goldstein, E. B.; Coco, G.; Murray, A. B. Prediction of wave ripple characteristics using genetic programming. Cont. Shelf Res. 2013, 71, 1−15. (b) Dubčáková, R. Eureqa: software review. Genetic Programming and Evolvable Machines 2011, 12 (2), 173−178. (c) Schmidt, M.; Lipson, H. Distilling free-form natural laws from experimental data. Science 2009, 324 (5923), 81−85.
(35) (a) Cordell, H. J.; Clayton, D. G. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am. J. Hum. Genet. 2002, 70 (1), 124−141. (b) Liao, X.; Li, Q.; Yang, X.; Zhang, W.; Li, W. Multiobjective optimization for crash safety design of vehicles using stepwise regression model. Struct. Multidiscip. Optim. 2008, 35 (6), 561−569.
(36) MATLAB and Statistics Toolbox; The MathWorks, Inc.: Massachusetts, United States, 2016.