Data Organization in Spreadsheets Flashcards
Spreadsheets are most often used as a multipurpose tool for what?
For data entry, storage, analysis and visualization.
Most spreadsheet programs allow users to perform all of these tasks (data entry, storage, analysis and visualization). Does the paper recommend performing all of them in a spreadsheet program? Why (not)?
No, spreadsheets are best suited for data entry and storage. Analysis and visualization should happen separately. This reduces the risk of contaminating or destroying the raw data in the spreadsheet.
The first rule of data organization is to be consistent. Why is this and what does this mean?
Entering and organizing your data in a consistent way from the start will prevent you and your collaborators from having to spend time harmonizing the data later.
- Use consistent codes for categorical variables
- Use a consistent fixed code for any missing values
- Use consistent variable names
- Use consistent subject identifiers
- Use consistent data layout in multiple files
- Use consistent file names
- Use consistent format for all dates
- Use consistent phrases in your notes
- Be careful about extra spaces within cells
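Several of the points above (consistent codes, extra spaces) can be checked mechanically once the data leaves the spreadsheet. The following is a minimal sketch, not a method from the paper: the helper names `normalize_cell` and `inconsistent_codes`, the example column, and the allowed codes are all assumptions made for illustration.

```python
from collections import Counter


def normalize_cell(value):
    """Strip leading/trailing spaces and collapse internal runs of whitespace."""
    return " ".join(value.split())


def inconsistent_codes(values, allowed):
    """Count cleaned values that are not among the allowed category codes."""
    cleaned = (normalize_cell(v) for v in values)
    return Counter(v for v in cleaned if v not in allowed)


# Hypothetical column with mixed codes and stray spaces.
sex_column = ["male", "Male", " male", "female", "M", "female "]
problems = inconsistent_codes(sex_column, allowed={"male", "female"})
print(problems)
```

Here " male" and "female " pass once the spaces are stripped, while "Male" and "M" are reported as codes that still need harmonizing.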
The second rule of data organization is to choose good names for things. What is meant by this?
It is important to pick good names for things. This can be hard, and so it is worth putting some time and thought into it. As a general rule, do not use spaces, either in variable names or file names. They make programming harder: the analyst will need to surround everything in double quotes, like “glucose 6 weeks”, rather than just writing glucose_6_weeks.
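If names with spaces already exist, they can be converted in bulk. This is a rough sketch under stated assumptions: the function name `to_safe_name` is hypothetical, and lowercasing the result is a choice made here, not a rule from the paper.

```python
import re


def to_safe_name(name):
    """Replace every run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Drop underscores left at the ends, then lowercase (an assumption).
    return cleaned.strip("_").lower()


print(to_safe_name("glucose 6 weeks"))  # glucose_6_weeks
print(to_safe_name("Body Weight (g)"))  # body_weight_g
```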
How should dates be written within a spreadsheet?
As YYYY-MM-DD
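Dates recorded in another layout can be rewritten as YYYY-MM-DD, provided the original layout is known. A minimal sketch using Python's standard `datetime` module; the helper name `to_iso_date` and the sample dates are assumptions for illustration.

```python
from datetime import datetime


def to_iso_date(text, input_format):
    """Parse text with a known input format and return it as YYYY-MM-DD."""
    return datetime.strptime(text.strip(), input_format).date().isoformat()


us_style = to_iso_date("12/6/2015", "%m/%d/%Y")   # month/day/year
eu_style = to_iso_date("06.12.2015", "%d.%m.%Y")  # day.month.year
print(us_style, eu_style)  # 2015-12-06 2015-12-06

# A side benefit of YYYY-MM-DD: plain text sorting is chronological.
dates = ["2015-12-06", "2014-03-21", "2015-02-10"]
print(sorted(dates))
```

The input format has to be supplied explicitly because a string such as 12/6/2015 is ambiguous on its own, which is exactly the problem a single consistent date format avoids.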
Can you leave cells empty?
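No, fill in every cell and use one common code for missing data, such as NA or a hyphen. Avoid numeric codes like 999 or -999, because they are easy to mistake for real values.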
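Filling in every cell lets an analyst tell data that is known to be missing apart from cells that were simply never filled in. The sketch below assumes NA is the chosen code; the column names and values are made up for illustration.

```python
import csv
import io

MISSING_CODE = "NA"  # assumed project-wide code for missing values

# Hypothetical export: subject 102 is marked missing, 103 was left blank.
raw = "id,glucose\n101,134.9\n102,NA\n103,\n"
rows = list(csv.DictReader(io.StringIO(raw)))

marked_missing = [r["id"] for r in rows if r["glucose"].strip() == MISSING_CODE]
left_blank = [r["id"] for r in rows if r["glucose"].strip() == ""]

print("marked missing:", marked_missing)
print("left blank (needs checking):", left_blank)
```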
Can you put more than one piece of information in a cell?
No, the cells in your spreadsheet should each contain one piece of data.
What is the best layout for your data within a spreadsheet?
A single big rectangle with rows corresponding to subjects and columns corresponding to variables.
What does it imply if data does not fit into a set of rectangles?
That maybe spreadsheets are not the best format for them, as spreadsheets are inherently rectangular.
What is a data dictionary?
A separate file that explains what all of the variables are. It is helpful if this is laid out in a rectangular form, so that the data analyst can make use of it in analyses.
What can a data dictionary contain?
- The exact variable name as in the data file
- A version of the variable name that might be used in data visualizations
- A longer explanation of what the variable means
- The measurement units
- Expected minimum and maximum values
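A rectangular data dictionary with the fields listed above can be used directly in checks. The following is an illustrative sketch only: the key names, the variable `glucose_6_weeks`, its units and its range are assumptions, and NA is assumed to be the missing-data code.

```python
# Hypothetical data dictionary: one entry (row) per variable.
data_dictionary = [
    {
        "name": "glucose_6_weeks",
        "plot_name": "Glucose (6 weeks)",
        "description": "Plasma glucose measured at 6 weeks of age",
        "units": "mg/dL",
        "min": 50.0,
        "max": 400.0,
    },
]


def out_of_range(rows, dictionary, missing_code="NA"):
    """Return (row index, variable name, value) for values outside min..max."""
    problems = []
    for entry in dictionary:
        name = entry["name"]
        for i, row in enumerate(rows):
            cell = row[name]
            if cell == missing_code:
                continue  # missing values are not range-checked
            value = float(cell)
            if not entry["min"] <= value <= entry["max"]:
                problems.append((i, name, value))
    return problems


rows = [
    {"glucose_6_weeks": "134.9"},
    {"glucose_6_weeks": "NA"},
    {"glucose_6_weeks": "1349"},  # likely a dropped decimal point
]
problems = out_of_range(rows, data_dictionary)
print(problems)
```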
Can the spreadsheet contain calculations and graphs?
No, the primary data file should only contain the data and nothing else.
Why isn’t it advised to use calculations and graphs in your spreadsheet?
If you are doing calculations in your data file, that likely means you are regularly opening it and typing into it. Doing so incurs some risk that you will accidentally type junk into your data.
Do not use font color or highlighting as data
You might be tempted to highlight particular cells with suspicious data, or rows that should be ignored. Or the font or font color might have some meaning. What should you do instead?
Add another column with an indicator variable (e.g. ‘trusted’, with values TRUE or FALSE).
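Unlike highlighting, an indicator column survives export to a plain-text file and can be used as a filter. A small sketch, assuming a column named trusted that holds the text values TRUE or FALSE; the other column names and values are invented.

```python
import csv
import io

# Hypothetical export with an explicit indicator column.
raw = "id,glucose,trusted\n101,134.9,TRUE\n102,1349,FALSE\n103,98.2,TRUE\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Keep only the rows flagged as trusted; the flag is ordinary data.
trusted_rows = [r for r in rows if r["trusted"] == "TRUE"]
trusted_ids = [r["id"] for r in trusted_rows]
print(trusted_ids)
```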
What is also very important to do?
Make regular backups of your data, in multiple locations.