Statistics and Data Management Basics
So far, we have looked primarily at different research methods you might adopt. This can give you a sense of what kind of data you might want to collect, such as interview data from a population you are learning more about, conference proceedings to conduct a text analysis on, or student work that you might analyze.
An important question to consider is, what do you do with that data? That is our next topic, with a focus on statistics. We will briefly cover the process you will generally adopt for data analysis, some statistics pitfalls to be aware of and avoid, and how to select the most appropriate program. We round out this section with a word on ensuring validity in quantitative and qualitative research (research that aims to understand the whys and hows of human behavior through the gathering of non-numerical data), as well as data management essentials. While this is just an overview, you may want to look further into the programs or techniques mentioned here to learn more outside of this lesson.
Statistics Overview
Statistics is the collection and analysis of numerical data. This means statistics will apply most frequently to quantitative research (research that collects and analyzes numerical data in order to test a hypothesis, discover correlations, or describe characteristics), but simple statistics such as percentages can be useful in qualitative research as well. The information here will be useful to many types of research, but especially research that involves numerical information of some type.
First, and before you have collected your data, think about the type of analysis that will be required. For statistics, these can be grouped into two categories: descriptive statistics (statistics that summarize and describe the features or characteristics of a sample or population, such as the mean, median, mode, range, and standard deviation) and inferential statistics (statistics that analyze a subset of data, a sample, and allow conclusions and inferences to be made about a larger population, such as hypothesis testing, confidence intervals, t-tests, regression analysis, and chi-square tests). Put another way, descriptive statistics summarize results through percentages, averages, and other numerical values, while inferential statistics involve performing statistical tests to produce a study’s results.
Your goal may not be to prove a correlation between two variables; you may simply want to compare results through percentages and straightforward descriptive statistics. These are easy to calculate and are a good way of summarizing quantitative information. On the other hand, inferential statistics allow you to develop generalizable results about a population: you cannot record information about your entire population, but you can draw conclusions from a sample using the appropriate statistical tests. These tests can get complicated quickly, and different tests are appropriate for different scenarios.
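As a concrete illustration, the descriptive measures mentioned above can be computed with Python's standard library. The survey numbers below are invented purely for this example.

```python
import statistics

# Hypothetical survey responses: minutes per week students report
# using the library (invented data for illustration)
minutes = [30, 45, 60, 60, 90, 120, 15, 75, 60, 240]

mean = statistics.mean(minutes)      # average of all responses
median = statistics.median(minutes)  # middle value, robust to outliers
mode = statistics.mode(minutes)      # most common response
stdev = statistics.stdev(minutes)    # sample standard deviation

# A simple percentage: share of respondents reporting an hour or more
share = 100 * sum(1 for m in minutes if m >= 60) / len(minutes)

print(f"mean={mean}, median={median}, mode={mode}, stdev={stdev:.1f}")
print(f"{share:.0f}% reported 60+ minutes per week")
```

Note how the median (60) sits well below the mean (79.5) here: a single heavy user pulls the average up, which is exactly the kind of feature descriptive statistics help you notice.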
Second, after identifying which types of statistics you expect to use, you will want to choose the best program for your data analysis needs. Two essential questions are cost and time, particularly whether you will need to invest time in learning a program. For descriptive statistics, and even some statistical tests, Microsoft Excel works well and is readily available. There are a number of programs to consider, which we will talk about later in this section.
The third step is preparing your data. This can be time consuming, but it is essential to take your time to ensure your results are valid and reliable. That requires cleaning your data: reviewing it before it is analyzed, removing any errors you find, and then formatting and separating out the variables so the data are compatible with the program you selected. For one librarian’s experience using quantitative methods and especially statistics, see the Topic 4 References.
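A cleaning pass like the one described can be sketched in a few lines. The raw rows, field names, and validation rules below are hypothetical; a real project would tailor them to its own dataset and analysis program.

```python
# Invented raw survey rows: review entries, drop rows with errors,
# and type the variables before handing the data to an analysis program.
raw_rows = [
    {"id": "001", "visits": "12", "gpa": "3.4"},
    {"id": "002", "visits": "", "gpa": "3.9"},       # missing value -> drop
    {"id": "003", "visits": "seven", "gpa": "2.8"},  # entry error -> drop
    {"id": "004", "visits": "25", "gpa": "3.1"},
]

def clean(rows):
    cleaned = []
    for row in rows:
        try:
            # Separate out and convert the variables the analysis expects
            cleaned.append({"id": row["id"],
                            "visits": int(row["visits"]),
                            "gpa": float(row["gpa"])})
        except ValueError:
            continue  # skip rows that fail validation
    return cleaned

print(clean(raw_rows))  # only the two valid rows remain
```

Keeping the cleaning step as code (rather than hand-editing a spreadsheet) also documents exactly which rows were excluded and why, which supports the validity of your results.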
Statistics Pitfalls
Many of the pitfalls related to statistics come down to the data they are based on: the sample. Before and during data collection, it is beneficial to pay close attention to sample size and to ensure a random sample. This is most often seen in surveys. For a quantitative study, a survey that receives too few replies is unlikely to produce reliable results. Similarly, a survey that is sent only to selected individuals is likely to have biased results. Though there are no hard and fast rules, it is generally recommended to aim for a 20 percent response rate for surveys. For profession-wide surveys that seek to represent librarians in the United States, responses often number in the hundreds. However, it is worth noting that different types of surveys have different functions (one might be conducting a qualitative study exploring participants’ experiences or attitudes, for example, with a survey occurring before interviews are conducted), so the best course is to review the relevant LIS (library and information science) literature and see what response rates and numbers of respondents other authors have received.
One additional pitfall common to statistics is a lack of causal links. Correlation is not causation. Just because two variables are tested together and show statistical significance does not necessarily mean that one causes the other. For instance, a study may be examining whether student use of the college library results in a higher grade point average. The data collected includes students’ ID card swipes into the library building, which is matched with their GPAs over an academic year. If students who use the library more frequently have higher grades in their courses, does this mean the library is the cause? Not necessarily. It is more likely that other factors, such as socioeconomic status, are contributing to the results. In a study, this should be acknowledged and accounted for to avoid claiming causation when only a correlation has been shown.
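The library-visits example can be made concrete with a small, entirely invented dataset in which a hidden factor drives both variables, producing a strong correlation with no direct causal link.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from scratch
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented illustration: a hidden factor (say, weekly study hours)
# drives BOTH library visits and GPA.
study_hours = [2, 4, 6, 8, 10]
visits = [h * 3 for h in study_hours]        # more study -> more visits
gpa = [2.0 + 0.2 * h for h in study_hours]   # more study -> higher GPA

print(round(pearson(visits, gpa), 2))  # visits and GPA correlate perfectly,
# yet the visits themselves are not what raises the GPA
```

The correlation between visits and GPA is as strong as it can possibly be, even though neither variable causes the other; both are driven by the confounder. This is the numerical face of "correlation is not causation."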
Statistical Programs
Statistical programs are not typically necessary for descriptive statistics but can often be useful for inferential statistics. These programs vary in complexity, and it is generally recommended to start with something simpler and move up to the next level if the simpler option does not suit your needs. Another consideration is whether your institution has access to one of these programs, since, depending on the program, a personal subscription can be a significant cost.
Of the many programs available, three cover many potential uses: Microsoft Excel, SPSS, and R. Microsoft Excel is best for descriptive statistics like averages and percentages, and works well for simple statistical tests like t-tests and ANOVA. SPSS is an intermediate-level program that is widely known as the standard for inferential statistics. It has a gradual learning curve, an interface that is easier to use compared to more advanced programs, and allows for the creation of custom tables. R is an advanced-level option: a programming language with a steep learning curve that will require considerable learning time for beginners. However, it is an open-source and very versatile option that is especially strong in data manipulation and predictive modeling.
Coding and Qualitative Research
Now we will look at two concepts in ensuring research validity, as well as data management basics. The linchpins of qualitative and quantitative research reliability are coding and significance, and these will be considered one by one. Coding is the process of labeling and organizing qualitative data in order to identify themes. The type of coding will depend on the data collection method, but following an established process, and ideally having more than one person code the data, helps increase validity and decrease bias. For instance, you may be conducting exploratory research into a topic and use inductive coding, where you generate codes from the data as you analyze it rather than defining them prior to data collection. If you were conducting interviews, this means you could: 1) conduct an initial review of the interview transcripts, reading them for any general themes or concepts that emerge; 2) group the data into more solidified themes that exist across transcripts; and 3) develop codes from the data, which are then applied to the transcripts. To increase validity, you may check for and calculate interrater reliability by having a colleague or research partner also apply codes to a selection of transcripts. In short, coding is the key to making sense of qualitative data.
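One common way to quantify interrater reliability is percent agreement, optionally corrected for chance agreement with Cohen's kappa. The sketch below uses invented codes applied by two hypothetical coders to ten interview excerpts; the code labels and sample size are illustrative only.

```python
from collections import Counter

def percent_agreement(c1, c2):
    # Share of excerpts both coders labeled identically
    matches = sum(a == b for a, b in zip(c1, c2))
    return matches / len(c1)

def cohens_kappa(c1, c2):
    # Agreement corrected for chance: (p_o - p_e) / (1 - p_e)
    n = len(c1)
    p_o = percent_agreement(c1, c2)
    f1, f2 = Counter(c1), Counter(c2)
    # Expected chance agreement from each coder's label frequencies
    p_e = sum(f1[label] * f2[label] for label in f1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes two coders applied to ten interview excerpts
coder_a = ["access", "staff", "access", "space", "staff",
           "access", "space", "staff", "access", "space"]
coder_b = ["access", "staff", "access", "space", "access",
           "access", "space", "staff", "staff", "space"]

print(percent_agreement(coder_a, coder_b))  # 0.8 (8 of 10 match)
print(round(cohens_kappa(coder_a, coder_b), 2))
```

Kappa is lower than raw agreement because some matches would occur by chance alone; reporting both figures gives readers a fuller picture of coding reliability.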
Significance and Quantitative Research
In the context of quantitative research, significance often refers to the statistical significance of the results. Results are found to be significant or not significant based on the statistical test that is conducted, and the significance or lack thereof is applied to the hypotheses. Significance helps tell us whether the results are due to chance or to another factor. It is particularly useful in comparing different scenarios to see whether an intervention resulted in a change. As one example, a researcher may want to know whether their changes to the curriculum for a library instruction class resulted in greater student engagement. After defining student engagement, the researcher could teach some classes using the revised curriculum and some classes using the prior curriculum, and have participants complete a pre- and post-test survey. In comparing the survey results using a t-test (an inferential statistic used to determine whether there is a significant difference between the means of two groups), the researcher would be able to determine whether changes occurred and, if so, whether they were statistically significant given the sample size and effect. This informs the researcher whether their hypothesis that students who participated in the revised curriculum were more engaged was correct.
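To show roughly what such a test computes, here is a minimal sketch of Welch's two-sample t statistic. The engagement scores are invented, and in practice you would let a statistics package (SPSS, R, or Python's scipy) report the p-value and degrees of freedom rather than hand-rolling the test.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    # Welch's t statistic for two independent samples (unequal variances)
    na, nb = len(sample_a), len(sample_b)
    ma, mb = statistics.mean(sample_a), statistics.mean(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Invented post-test engagement scores on a 1-10 scale
revised = [8, 7, 9, 8, 7, 9, 8, 8]  # classes taught with revised curriculum
prior   = [6, 7, 5, 6, 7, 6, 5, 6]  # classes taught with prior curriculum

t = welch_t(revised, prior)
print(round(t, 2))
# The t value is then compared against the t distribution (at the chosen
# alpha level, with Welch-adjusted degrees of freedom) to decide whether
# the difference in means is statistically significant.
```

A large t value relative to the sample size suggests the difference between the two groups is unlikely to be due to chance, which is what "statistically significant" captures.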
Data Management Basics
Data management refers to how your data are collected, organized, and accessed. Any primary information you collect for your research should be part of your data management plan, whether survey questionnaires and responses, interview transcripts and recordings, codebooks, or spreadsheets containing data. Data management is largely determined by the scope and type of your project. Some major considerations include:
- Where will this data be stored? What file formats will be used?
- Who else will need to access this data? How will they do so?
- How long will the data be stored? Will the data be destroyed at a certain point, or preserved?
- How will confidentiality be ensured for participants? What steps will be taken to ensure anonymity?
- How and where will the data be backed up?
- What data can or should be made publicly accessible, if any?
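One lightweight way to start is to record your answers to these questions in a structured form that can live alongside the project files. The plan below is a hypothetical sketch; every value is invented for illustration.

```python
# Hypothetical data management plan, answering the questions above
# (all values invented for illustration)
dmp = {
    "storage": {
        "location": "institutional network drive",
        "formats": ["csv", "docx", "mp3"],
    },
    "access": {
        "who": ["principal investigator", "research partner"],
        "how": "shared encrypted folder",
    },
    "retention": {
        "duration_years": 3,
        "then": "securely destroyed",
    },
    "confidentiality": "names replaced with numeric IDs; key stored separately",
    "backup": "weekly copy to university-managed cloud storage",
    "public_sharing": "de-identified aggregate data only",
}

for section, detail in dmp.items():
    print(f"{section}: {detail}")
```

Writing the plan down in one place, whatever the format, makes it easy to share with collaborators and to include in an IRB application.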
Data management can be a source of concern for new researchers, but think of it as planning ahead for later stages in your research. Once you begin collecting data and your project is well underway, you will know where and how to store the information.
For those conducting a study involving human participants, your research will require approval through your Institutional Review Board (IRB). IRBs are administrative bodies (in colleges and universities, typically committees) that are established to protect the rights and welfare of human research subjects recruited to participate in research conducted through a given institution. IRBs typically require detailed information concerning your study’s aims, population, data collection procedures, and data management plan, especially how participant confidentiality will be guaranteed. For more information on this, see Course 3, Lesson 3: Ethics & Data Management.
Topic 4 References
Becksford, Lisa. “Facing My Fear of Quantitative Research Methods.” The Librarian Parlor. September 24, 2019. https://libparlor.com/2019/09/24/facing-my-fear-of-quantitative-research-methods/