# Contributing to Open Source Science Accessibility Project
This content is based upon a pre-established [standard](https://github.com/github/opensource.guide) that developers in this project adhere to.
We're excited to hear and learn from you.
We've put together the following guidelines to help you figure out where you can best be helpful.
## Table of Contents
0. [Types of contributions we're looking for](#types-of-contributions-were-looking-for)
0. [Ground rules & expectations](#ground-rules--expectations)
0. [How to contribute](#how-to-contribute)
0. [Style guide](#style-guide)
0. [Setting up your environment](#setting-up-your-environment)
0. [Contribution review process](#contribution-review-process)
0. [Community](#community)
## Types of contributions we're looking for
There are many ways you can directly contribute to the guides (in descending order of need):
* maintenance of scraping code
* science writing
* machine learning and analysis
* integration between science writing and analysis
Interested in making a contribution? Read on!
## Ground rules & expectations
Before we get started, here are a few things we expect from you (and that you should expect from others):
* Anyone who makes constructive contributions, big or small, deserves credit. It is helpful, but not required, for contributors to state their affiliations in code files and documents where appropriate.
* Be kind and thoughtful in your conversations around this project. We all come from different backgrounds and projects, which means we likely have different perspectives on "how open source is done." Try to listen to others rather than convince them that your way is correct.
* Open Source Guides are released with a [Contributor Code of Conduct](./CODE_OF_CONDUCT.md). By participating in this project, you agree to abide by its terms.
* If you open a pull request, please ensure that your contribution passes all tests. If there are test failures, you will need to address them before we can merge your contribution.
* When adding content, please consider if it is widely valuable. Please don't add references or links to things you or your employer have created as others will do so if they appreciate it.
## How to contribute
If you'd like to contribute, start by searching through the [issues](https://github.com/github/opensource.guide/issues) and [pull requests](https://github.com/github/opensource.guide/pulls) to see whether someone else has raised a similar idea or question.
If you don't see your idea listed, and you think it fits into the goals of this guide, do one of the following:
* **If your contribution is minor,** such as a typo fix, open a pull request.
* **If your contribution is major,** such as a new guide, start by opening an issue first. That way, other people can weigh in on the discussion before you do any work.
## Style guide
If you're writing content, see the [style guide](./docs/styleguide.md) to help your prose match the rest of the Guides.
## Setting up your environment
Try to build within the Docker container linked in the README.md.
## Community
Discussions about the Science Accessibility Project take place on this repository's [Issues](https://github.com/russelljjarvis/ScienceAccessibility/issues).
Public conversation is preferred, since everybody can benefit and learn from it, but all communication is welcome, including private communication.
References:
This project is built on top of a great deal of Free and Open Source (FOS) software. Chief among these are [textstat](https://github.com/shivam5992/textstat), NLTK, and GoogleScrape.
textstat implements readability algorithms such as:
Kincaid, J. Peter, et al. "Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel." (1975).
## Other Relevant Links:
https://stackoverflow.com/questions/32429445/is-web-scraping-allowed?fbclid=IwAR1noACRESdM1zld2zKJZ019ZB5CUFj2Xw38hh4fsrJXwZINsE0I09uT2_A
https://www.copyright.gov/fair-use/more-info.html
**[Installation](Documentation/Documentation_Quick_Start.md)** |
**[Documentation](#documentation)** |
**[Contributing](contributing.md)** |
**[Testing](#testing)** |
**[License](license.md)** |
**[Manuscript](Documentation/manuscript.md)** |
[![Build Status](https://travis-ci.com/russelljjarvis/ScienceAccessibility.png)](https://travis-ci.com/russelljjarvis/ScienceAccessibility)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/russelljjarvis/simple_science_access.git/master)
# Overview
Understanding a big word is hard, so when big ideas are written down with lots of big words, the large pile of big words is also hard to understand.
We used a computer to quickly visit and read many different websites to see how hard each piece of writing was to understand. People may avoid learning hard ideas only because they encounter too many hard words in the process. We think we can help by explaining the problem with smaller words, and by creating tools to address the problem.
## Why Are We Doing This?
We want to promote clearer and simpler writing in science by encouraging scientists in the same field to compete with each other to write more clearly.
## How Are we Doing This?
### Machine Estimation of Writing Complexity:
The accessibility of the written word can be approximated by a computer program that reads over the text and estimates the mental difficulty associated with comprehending it. The program maps reading difficulty onto a quantity informed by the cognitive load of the writing and the number of years of schooling needed to decode the language in the document. For convenience, we refer to the difficulty associated with the text as the 'complexity' of the document.
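As a minimal sketch of this kind of estimation (the sample text below is illustrative only), the textstat package that this project builds on can map a passage to a reading grade level:

```python
# Minimal sketch: estimate the 'complexity' of a passage as a reading grade
# level, using the textstat package this project builds on.
import textstat

sample = ("The accessibility of the written word can be approximated by a "
          "program that estimates the schooling needed to decode it.")

# Flesch-Kincaid grade: roughly the years of schooling needed to read the text.
print(textstat.flesch_kincaid_grade(sample))
```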
### How do some well-known texts do?
First, we sample some extremes in writing style and then tabulate the results, so we have some reference points to help us make sense of other results. On the lower and upper limits we have XKCD-inspired writing, [pushing the limits of extremely readable science](http://splasho.com/upgoer5/library.php), and, for comparison, some [machine-generated postmodern nonsense](http://www.elsewhere.org/pomo/).
Higher is worse:
| complexity | texts |
|----------|:-------------:|
| 6.0 | [upgoer5](http://splasho.com/upgoer5/library.php) |
| 9.0 | [readability of science declining](https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMjc3MjUvZWxpZmUtMjc3MjUtdjIucGRm/elife-27725-v2.pdf?_hash=WA%2Fey48HnQ4FpVd6bc0xCTZPXjE5ralhFP2TaMBMp1c%3D) |
| 14.0 | [science of writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) |
| 14.9 | mean wikipedia |
| 16.5 | [mean post modern essay generator](http://www.elsewhere.org/pomo/) |
### Some particular cases:
| complexity | texts |
|----------|:-------------:|
| 13.0 | this readme.md |
| 17.0 | [The number of olfactory stimuli that humans can discriminate is still unknown](https://elifesciences.org/articles/08127)|
| 18.68 | [Intermittent dynamics and hyper-aging in dense colloidal gels](https://www.researchgate.net/publication/244552241_Intermittent_dynamics_and_hyper-aging_in_dense_colloidal_gelsThis_paper_was_originally_presented_as_a_poster_at_the_Faraday_Discussion_123_meeting) |
| 37.0 | [Phytochromobilin C15-Z,syn - C15-E,anti isomerization: concerted or stepwise?](https://www.researchgate.net/profile/Bo_Durbeej/publication/225093436_Phytochromobilin_C15-Zsyn_C15-Eanti_isomerization_Concerted_or_stepwise/links/0912f4fcd237e6701a000000.pdf) |
### Proposed Remedies:
* **1.** Previously we mentioned creating tools to remedy inaccessible academic research. One tool, a natural extension of this work, would enable 'clear writing' tournaments between prominent academic researchers, for example:
| mean complexity | author |
|----------|:-------------:|
| 28.85 | [professor R Gerkin](https://scholar.google.com/citations?user=GzG5kRAAAAAJ&hl=en&oi=sra) |
| 29.8 | [other_author] |
| 30.58 | [other_author] |
Example code for the [proposed tool](https://github.com/russelljjarvis/ScienceAccessibility/blob/dev/Examples/Incentivise_by_competing.ipynb) would allow you to select academic authors, who then compete in a tournament to write simpler text, with their writing contributions scored as above. A more recently maintained version of that code is in this [file](https://github.com/russelljjarvis/ScienceAccessibility/blob/dev/Examples/compete.py).
* **2.** A different proposed remedy is to run the text through a [simplification model](http://nlpprogress.com/english/simplification.html?fbclid=IwAR0B8G7zEmxVYbFWJMOyVTaHWkv4o9tTTFvVpsOcWrUQ777SXpM6KuM-8QI) and evaluate the complexity of the document again after simplification. How different are the scores? A sketch of this comparison follows below.
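A minimal sketch of that before/after comparison, where `simplify()` is a hypothetical placeholder standing in for one of the simplification models linked above:

```python
# Sketch: compare readability before and after simplification.
# simplify() is a hypothetical placeholder for a real simplification model.
import textstat

def simplify(text):
    # A real implementation would call a text-simplification system here.
    return text.replace("utilize", "use").replace("methodology", "method")

original = "We utilize a novel methodology to quantify readability at scale."
simplified = simplify(original)

print("before:", textstat.flesch_kincaid_grade(original))
print("after: ", textstat.flesch_kincaid_grade(simplified))
```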
### A plot of the distribution of science writing versus non-science writing, using the [ART Science corpus](https://www.aber.ac.uk/en/media/departmental/computerscience/cb/art/gz/ART_Corpus.tar.gz):
![image](https://user-images.githubusercontent.com/7786645/53215155-96dbb780-360c-11e9-9280-d8592d31d2f9.png)
The science writing niche is characterized by a mean reading grade level of 18, a neutral to negatively polarized sentiment, and an almost complete absence of subjectivity. Science writing is also more resistant to file compression, meaning that its information entropy is high, due to concise, coded language. These statistical features give us quite a lot to go on when using language style to predict the scientific status of a randomly selected web document. The notion that entropy is generally higher in science is corroborated by the perplexity measure, which quantifies how improbable the particular frequency distribution of words observed in a document is.
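As an illustration of the compression-based entropy idea (a sketch only, not the project's exact feature code; the file name is a placeholder):

```python
# Sketch: use compressibility as a rough proxy for information entropy.
# Text that compresses poorly (ratio closer to 1) carries more information
# per character, which is what we observe for concise, coded science writing.
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size for a whole document."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

# 'some_scraped_document.txt' is a placeholder for a file produced by the scraper.
with open("some_scraped_document.txt") as f:
    print(compression_ratio(f.read()))
```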
## Developer Overview
Non-scientific writing typically exceeds genuine scientific writing in two important respects: in contrast to genuine science, non-science is often expressed in a less complex and more engaging writing style. We believe non-science writing occupies a more accessible niche that academic science writing should also occupy.
Unfortunately, writing styles intended for different audiences are predictably different. We show that computers can learn to guess the type of a written document: blog, Wikipedia, opinion, or traditional science, by first sampling a large variety of web documents and then classifying them using sentiment, complexity, and other variables. By predicting which of several different niches a document occupies, we are able to characterize the different writing types and describe strategies to remedy writing complexity.
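A minimal sketch of that classification step, assuming style features (reading grade, sentiment polarity, subjectivity) have already been extracted; the feature values and labels below are placeholders, not project data:

```python
# Sketch: classify document type from style features.
# Rows are [reading grade, sentiment polarity, subjectivity]; values are illustrative.
from sklearn.ensemble import RandomForestClassifier

X = [
    [6.0, 0.3, 0.6],
    [14.9, 0.1, 0.3],
    [16.5, 0.0, 0.2],
    [18.0, -0.05, 0.05],
]
y = ["blog", "wikipedia", "postmodern", "science"]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([[17.5, -0.02, 0.1]]))  # stylistically closest to 'science'
```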
Multiple stakeholders benefit when science is communicated with lower-complexity expression of ideas. With lower-complexity science writing, knowledge would be more readily transferred into public awareness; additionally, the digital organization of facts derived from journal articles would occur more readily, as successful machine comprehension of documented science would require less human intervention.
The impact of science on society is likely proportional to the accessibility of the written work. Objectively describing the character of the different writing styles will allow us to prescribe how to shift academic science writing into a more accessible niche, where science can compete more aggressively with pseudo-science and blogs.
## Analysis of Text.
Running the scraper is not necessary for analysing the text documents.
## Sentiment Versus Complexity
[An interactive plot of the same thing, where clicking on a data point takes you to the webpage that generated the data point](https://russelljjarvis.github.io/ScienceAccessibility/)
### Word frequencies as clouds:
### Per category
#### Not Science
![image](https://user-images.githubusercontent.com/7786645/52091608-322fbe80-2572-11e9-8553-3e346a8b824e.png)
#### Science
![image](https://user-images.githubusercontent.com/7786645/52091615-352aaf00-2572-11e9-905a-0b75fe0005d7.png)
The observant reader will see that 'et al' occurs in the published literature quite a lot, highlighting the obvious point that science writing often refers to external evidence.
[Similar projects](https://blog.machinebox.io/detect-fake-news-by-building-your-own-classifier-31e516418b1d).
## Building All of the Project (including the scraper)
The internet is in some ways like a big group of computers that are all friends with each other. A scraper is a computer that visits many of the other computers on the internet. The scraper does not have to be friends with the computers it visits; it just needs to know the address at which each computer in the big friendship group can be reached.
The scraping and crawling code for this project is dependency heavy. Who wants to duplicate the building of this whole environment from scratch? No one? We thought so. [Docker is used to provide a universal build and to prevent duplicated effort](https://hub.docker.com/r/russelljarvis/science_accessibility).
If Docker is installed on the base OS, `git clone` this repository and, assuming `build.sh` is executable (`chmod +x build.sh`), run `bash build.sh` to perform the Docker build. To run the Jupyter notebook under Docker, enter the Docker environment interactively, either via a bash shell or via an IPython notebook, and then launch Python via bash on Linux as follows.
Warning: this Docker environment is currently 11.5GB; however, it contains some non-trivial scraping tools.
```BASH
# Log in to Docker Hub with your own account (you will be prompted for credentials).
docker login
# Pull the pre-built image.
docker pull russelljarvis/science_accessibility
# Create a directory for scraped output, then run the container interactively
# (the -v flag attaches a volume at that path).
mkdir -p $HOME/data_words
docker run -it -v $HOME/data_words russelljarvis/science_accessibility
```
```BASH
# Inside the running container, run the example analysis for a single author name.
cd Examples
ipython -i enter_author_name.py "R Gerkin"
```
To run the project, navigate to the Examples directory and then execute:
`python use_scrape.py`, which scrapes search engines for parameters defined in that file.
Once that is done, an analysis program, `use_analysis.py`, is called to run an analysis on the scraped text. This program generates some simple figures. The figures are very basic and serve only as a proof of concept.
Given pre-existing data (pickled files consisting of raw text contents), the analysis file can also be run on its own by executing `python use_analysis.py`. To analyse the scraped texts, the Jupyter notebook `vstrl.ipynb` also contains idioms for plotting and analysis based on scraped data, although it is not maintained. The Bokeh package facilitates attractive interactive plots with mouse-over metrics for each data point.
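A minimal sketch of that kind of mouse-over plot with Bokeh; the data values and column names below are illustrative, not output of the pipeline:

```python
# Sketch: an interactive Bokeh scatter plot with mouse-over tooltips.
# The values and labels are illustrative placeholders.
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show

source = ColumnDataSource(data=dict(
    complexity=[6.0, 14.9, 16.5, 18.0],
    sentiment=[0.3, 0.1, 0.0, -0.05],
    label=["upgoer5", "wikipedia", "postmodern generator", "ART corpus"],
))

p = figure(title="Sentiment versus complexity",
           x_axis_label="complexity (reading grade level)",
           y_axis_label="sentiment polarity")
p.scatter("complexity", "sentiment", source=source, size=10)
p.add_tools(HoverTool(tooltips=[("text", "@label"), ("complexity", "@complexity")]))
show(p)  # opens the interactive plot in a browser
```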
Another file, `Examples/use_code_complexity.py`, reports on the complexity of the code base. This code complexity analysis is not thorough enough to include the third-party modules that are heavily utilized in the analysis; however, we apply the principle of code complexity, within this limited scope, to our own approach, since it is obviously undesirable to advocate for simple language using obfuscated code.
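As one way such a report could be produced (a sketch only; `use_code_complexity.py` may take a different approach), the radon package computes per-function cyclomatic complexity:

```python
# Sketch: report cyclomatic complexity per function using the radon package.
# The analyzed file name is just an example from this repository.
from radon.complexity import cc_visit

with open("use_analysis.py") as f:
    source = f.read()

for block in cc_visit(source):
    print(block.name, block.complexity)
```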
### What about Code Cognitive Complexity?
That is a concern too, and the project takes measures to minimize it. Many modern text editors feature cyclomatic complexity plugins.
This code is glue built on top of a lot of Free and Open Source tools.
Any GNU licenses associated with existing tools may take precedence over this one.
MIT License
Copyright (c) 2019
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
title: 'A Tool for Assessing the Readability of Scientific Publications en Masse'
tags:
- readability
- science communication
- science writing
authors:
- name: Russell Jarvis
affiliation: PhD Candidate Neuroscience, Arizona State University
- name: Patrick McGurrin
affiliation: National Institute of Neurological Disorders and Stroke, National Institutes of Health
- name: Shivam Bansal
affiliation: Senior Data Scientist, H2O.ai
- name: Bradley G Lusk
affiliation: Science The Earth; Mesa, AZ 85201, USA
date: 20 October 2019
bibliography: paper.bib
# Summary
To ensure that writing is accessible to the general population, authors must consider the length of written text, as well as sentence structure, vocabulary, and other language features [@Kutner:2006]. While popular magazines, newspapers, and other outlets purposefully cater language for a wide audience, there is a tendency for academic writing to use more complex, jargon-heavy language [@Plavén-Sigray:2017].
In the age of growing science communication, this tendency for scientists to use more complex language can carry over when writing in more mainstream media, such as blogs and social media. This can make public-facing material difficult to comprehend, undermining efforts to communicate scientific topics to the general public.
While readability tools, such as Readable (https://www.webfx.com/tools/read-able/) and Upgoer5 (https://splasho.com/upgoer5/) currently exist to report on readability of text, they report the complexity of only a single document. In addition, these tools do not address complexity in a more academic-type setting.
To address this, we created a tool that uses a data-driven approach to provide authors with insights into the readability of the entirety of their published scholarly work with regard to other text repositories. The tool first quantifies existing text repositories with varying complexity, and subsequently uses this output as a reference to show how the readability of user-selected written work compares to these other known resources.
This tool also introduces one additional feature for readability comparison and improvement. It allows the entry of two author names to enable a competition as to whose text has the lowest average readability score. Public competitions can often incentivize good practices, and this may be a fun and interactive tool to help improve readability scores over time.
Ultimately, this tool will expand upon current readability metrics by computing a more detailed and comparative look at the complexity of written text. We hope that this will allow scientists and other experts to better monitor the complexity of their writing relative to other text types, leading to the creation of more accessible online material and, with hope, to improved global communication and understanding of complex topics.
# Methods
### Text Analysis Metrics
We built a web-scraping and text analysis infrastructure by extending many existing Free and Open Source (FOS) tools, including Google Scrape, Beautiful Soup, and Selenium.
We first query a number of available text repositories with varying complexity:
| Text Source | Mean Complexity | Description |
|----------|----------|:-------------:|
| Upgoer 5 | 6 | library of texts written using only the 1,000 most commonly occurring English words |
| Wikipedia | 14.9 | free, popular, crowdsourced encyclopedia |
| Post-Modern Essay Generator (PMEG) | 16.5 | generates output consisting of sentences that obey the rules of written English, but without restraints on the semantic conceptual references |
| ART Corpus | 18.68 | library of scientific papers published by The Royal Society of Chemistry |
When the user enters an author's name (or two author names for the competition plot), the tool queries Google Scholar and scrapes results from articles containing the author's name(s).
The Flesch-Kincaid readability score [@Kincaid:1975] - the most commonly used metric to assess readability - is then used to quantify the complexity of all items.
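As a sketch of this scoring step (the document strings below are placeholders, not scraped content), the per-author mean complexity is simply the average grade level over that author's scraped documents:

```python
# Sketch: aggregate Flesch-Kincaid grades over an author's scraped documents.
import textstat

def mean_complexity(documents):
    scores = [textstat.flesch_kincaid_grade(doc) for doc in documents]
    return sum(scores) / len(scores)

# In the real pipeline these texts come from the Google Scholar scraping step.
scraped_docs = ["First scraped article text ...", "Second scraped article text ..."]
print(mean_complexity(scraped_docs))
```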
### Reproducibility
A Docker file and associated container together serve as a self-documenting and portable software environment clone to ensure reproducibility given the hierarchy of software dependencies.
# Output
Data are available here: [Open Science Framework data repository](https://osf.io/dashboard).
## Contextualized Readability Output
The generated plot for contextualized readability information is a histogram binned by readability score, initially populated exclusively by the ART corpus [@Soldatova:2007] data. We use this data because it is a pre-established library of scientific papers. The readability of ART Corpus has also been shown to be comparable to that of other scientific journals [2].
The mean readability scores of Upgoer5 [@Kuhn:2016], Wikipedia, and PMEG [@Bulhak:1996] libraries are labeled on the plot as single data points to contextualize the complexity of the ART corpus data with other text repositories of known complexity.
We also include mean readability scores from two scholarly reference papers, Science Declining Over Time [@Kutner:2006] and Science of Writing [@Gopen:1990], which discuss writing to a broad audience in an academic context. We use these to demonstrate the feasibility of discussing complex content using more accessible language.
Lastly, the mean reading level of the entered author's work is displayed as a boxplot that shares an x-axis with the ART-corpus distribution data. The boxplot depicts the mean, and the first and third quartiles, of the author's specific works. It enables the viewer of the report to get a sense of the underlying variance in the specific author's work, relative to the variance in the ART corpus. We also display single data points for the maximum and minimum scores. Thus, the resulting graph displays the mean writing complexity of the entered author against a distribution of ART corpus content, as well as these other text repositories of known complexity.
![Specific Author Relative to Distribution](figures/boxplot.png)
## Competition Output
The three-author competition plot displays the readability distribution of each author's written work, as scraped and analyzed from Google Scholar. Vertical lines mark the mean readability value for each author. Anonymous authors A and B are co-authors who publish in the same field, so their readability scores should be closely matched, as their scores are derived from some mutual documents. Anonymous author C publishes in an unrelated field and does not co-author with authors A and B.
![Specific Author Relative to Distribution](figures/tournament.png)
# References