Tag Archives: probability

Is there a statistically significant correlation between religious faith and total family income in the US?

Introduction and Aim of the Study

The main aim of this study (which is available here in pdf) is to investigate any possible relationship between religion and family income in the US over the last decade. More precisely, I decided to focus on Protestants, Catholics and those who claimed to belong to no religious community at all (identified as None). According to the data, these three categories were the most common ones in the United States in the period 2000-2012.
The aim of the investigation may therefore be summarized by the following question: “Is there a relationship between the religious faith of a US citizen (Protestant, Catholic or None) and his/her total family income?”
My personal interest derives from a general conviction that some religious communities could be wealthier than others for historical, social or political reasons, and that exploring this may bring to light what is going on behind the scenes. More generally, I think that digging into these matters may reveal subtle patterns hiding behind the data, such as religious discrimination at work resulting in people from a specific community getting higher, more qualified and better paid jobs. Highlighting such aspects is a starting point for broader research on social and financial conditions among and within different religious communities.

General Discussion about the Data of Interest

The research project was based on the data collected in the online-available database of the General Social Survey, 1972-2012 (Citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1) (GSS), which since 1972 has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society.

The dataset is composed of 57061 cases, each corresponding to one interviewed person, for whom several pieces of information (variables) were recorded. In particular, the cases are individuals with the following characteristics: non-institutionalized, English- or Spanish-speaking persons 18 years of age or older, living in the United States. Each respondent was asked several questions about his or her own life, family, community and society.
The data were collected by three main methods:

  • computer-assisted personal interview (CAPI). Data are entered directly into an electronic form on a PC, with the interviewer and the respondent both present in front of the computer at the time of the survey. This differs from CASI (computer-assisted self-interview), in which the respondent is left alone in order to answer the questions more privately.
  • face-to-face interview.
  • telephone interview.

The two variables I took into account from the data set are the following:

  • income06: categorical variable. The respondent was asked into which of the proposed groups his/her total (tax-less) family income of the previous year would fall. There are 25 possible intervals, ranging from a minimum of less than $1,000 to a maximum of more than $150,000, plus a category named “Refused”, which includes all respondents who declined to reveal their financial condition.
  • relig: categorical variable. The respondent was asked which of the proposed communities best identified his/her religious faith. A fuller description of the variable is provided in the exploratory data analysis.

The study is observational because researchers recorded data “in a way that does not directly interfere with how the data arise”. The structure of the survey and the data collection methods are clearly not typical of an experimental setup. In the latter case, in fact, researchers would have sampled individuals and divided them into groups organizing an experiment in order to investigate the possibility of a causal connection between two or more variables.
From the point of view of the generalizability of the study, it is crucial to focus on the population of interest, which in this case includes all non-institutionalized, English- or Spanish-speaking persons 18 years of age or older, living in the United States. According to the 2011 American Community Survey data on language use, 79.2205% of American families speak English at home, while 20.7794% speak Spanish, which adds up to 99.9999%.

This means that we can reasonably generalize the results to the whole US population 18 years of age or older. Furthermore, the data collection methods compensate for one another in terms of potential sources of sampling bias. For instance, CAPI mainly suits computer-friendly persons. This bias may be offset by phone interviews, which enable researchers to reach and persuade less “technology-friendly” people. Last but not least, the face-to-face survey compensates for the unavoidable bias introduced by phone calls, which take for granted a connection to a phone line that may not always be available. In addition, it is necessary to consider that minors generally do not have a clear overview of the family finances. Although they may belong to a particular religious community, they are unlikely to have much insight into the total family income, which means that their contribution to the survey, at least on this aspect, would have been pointless.

All these considerations lead us to the conclusion that the results of the study may be generalized to all US families. However, since the survey is observational, the findings do not imply causal relationships.

Exploratory Data Analysis

In the present section a brief exploratory data analysis is performed. The relevant statistics are provided together with the associated R code.

The first two R functions used are summary and str, which give a broad yet concise view of the data. As is clear, the gss.after.subsetting data set is composed of only two variables, Income and Religion. Both are factors, with 26 levels (actually 25, as I did not take the Refused category into account) and 13 levels respectively. In particular, Religion retains all of its original 13 levels even though only three of them have been selected (Protestant, Catholic and None).

In order to visualize the data in a cleaner way a plot is provided too. The whole data set has been converted into a contingency table, which has been properly plotted in the figure below.

The figure shows fairly clearly the distribution of incomes among and within the three investigated communities. Nevertheless, it is quite hard to identify any particular pattern hiding behind the data. A more complete and rigorous analysis is needed in order to draw any conclusion about a possible association between religious community and family income. For further details about the data see the Appendix at the end of the report.



As stated at the end of the previous section, in order to reach a proper conclusion and answer the question underlying the study, it is necessary to perform a rigorous statistical test on the data set. For the sake of clarity, let us first recap the main aim of the project, which is to answer the following question: “Is there a relationship between the religious faith of a US citizen (Protestant, Catholic or None) and his/her total family income?”

As we are dealing with two categorical variables (Income and Religion), both with more than two levels (25 and 3 respectively), only a hypothesis test is admissible. In particular, as no single parameter of interest can be identified, I performed a theoretical Chi-square test for independence, which is permitted because each scenario (i.e. cell) has at least 5 expected cases. Evidence that this condition is met is provided in the following table, which reports each observed cell count alongside its expected value. As you can see, all cells have expected values well above 5.

Income(x$1000) Protestant-Observed Protestant-Expected Catholic-Observed Catholic-Expected None-Observed None-Expected
Below-1 67.00 65.90 23.00 30.80 30.00 23.20
1-2.999 62.00 56.60 22.00 26.50 19.00 19.90
3-3.999 40.00 41.80 17.00 19.50 19.00 14.70
4-4.999 26.00 28.00 11.00 13.10 14.00 9.90
5-5.999 36.00 42.30 22.00 19.80 19.00 14.90
6-6.999 48.00 51.10 21.00 23.90 24.00 18.00
7-7.999 63.00 59.90 27.00 28.00 19.00 21.10
8-9.999 98.00 94.00 33.00 43.90 40.00 33.10
10-12.499 181.00 179.70 72.00 84.00 74.00 63.30
12.5-14.999 172.00 160.50 70.00 75.00 50.00 56.50
15-17.499 156.00 156.10 70.00 73.00 58.00 55.00
17.5-19.999 118.00 114.80 47.00 53.70 44.00 40.40
20-22.499 174.00 173.60 81.00 81.20 61.00 61.10
22.5-24.999 166.00 169.80 83.00 79.40 60.00 59.80
25-29.999 256.00 241.80 109.00 113.10 75.00 85.10
30-34.999 252.00 263.20 133.00 123.10 94.00 92.70
35-39.999 270.00 250.00 109.00 116.90 76.00 88.00
40-49.999 440.00 417.10 199.00 195.00 120.00 146.90
50-59.999 369.00 370.40 175.00 173.20 130.00 130.40
60-74.999 471.00 453.90 206.00 212.30 149.00 159.80
75-89.999 349.00 345.10 176.00 161.40 103.00 121.50
90-109.999 279.00 284.10 142.00 132.90 96.00 100.00
110-129.999 185.00 185.70 87.00 86.90 66.00 65.40
130-149.999 100.00 116.50 64.00 54.50 48.00 41.00
150-Over 228.00 284.10 155.00 132.90 134.00 100.00

The number of degrees of freedom is df = (R-1) × (C-1), which is equal to df = (25-1) × (3-1) = 48, well above the minimum allowed of 2.
As for independence, the GSS sample was drawn at random, and in any case the number of cases in each cell, as well as the total number of cases, is below 10% of the US population.

Given that, we can state our hypothesis:

  • H0 : (nothing going on): Religion and Total Family Income are independent, meaning that the amount of money earned by a US family per year does not vary by belonging to either the Protestant or the Catholic community, or no religious community at all.
  • HA : Religion and Total Family Income are dependent, meaning that the amount of money earned by a US family per year does vary by belonging to either the Protestant or the Catholic community, or no religious community at all.

Let’s recall that applying the Chi-square test for independence means evaluating whether there is convincing evidence that a set of observed counts O11, O12, O13 … ORC in the R × C cells of the table is unusually different from what might be expected under the null hypothesis. Call the expected counts under the null hypothesis E11, E12, E13 … ERC, computed as

$$E_{row \hspace{1mm} i,\hspace{1mm} col \hspace{1mm} j} = \frac{(row \hspace{1mm} i \hspace{1mm} total) \times (column \hspace{1mm} j \hspace{1mm} total)}{table \hspace{1mm} total}$$
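As a worked instance, take the first cell of the table above (income below $1,000, Protestant). Summing the observed counts in the table gives a row total of 67 + 23 + 30 = 120, a Protestant column total of 4606 and a table total of 8382 (these totals are derived here from the table and are not quoted in the report), so

$$E_{\text{Below-1},\,\text{Protestant}} = \frac{120 \times 4606}{8382} \approx 65.9$$

which agrees with the expected value listed for that cell.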

If certain conditions are met, then the test statistic below follows a chi-square distribution with (R-1) × (C-1) degrees of freedom:

$$ \chi^2 = \frac{(O_{11}-E_{11})^2}{E_{11}} + \frac{(O_{12}-E_{12})^2}{E_{12}} + \cdots + \frac{(O_{RC}-E_{RC})^2}{E_{RC}} $$

The p-value for this test statistic is found by looking at the upper tail of this Chi-square distribution. We consider the upper tail because larger values of chi squared would provide greater evidence against the null hypothesis.
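The computation described above can be sketched in a few lines. The report itself used R; what follows is only an illustrative Python version using the standard library, with the observed counts copied from the table. The function name `chi_square_independence` and the constant `CRITICAL_VALUE_DF48` are assumptions made for this sketch, and instead of a p-value it compares the statistic with the upper 5% critical value for 48 degrees of freedom (about 65.17 in standard chi-square tables).

```python
# Observed counts copied from the table above: rows are the 25 income
# brackets (lowest first), columns are Protestant, Catholic, None.
observed = [
    [67, 23, 30], [62, 22, 19], [40, 17, 19], [26, 11, 14], [36, 22, 19],
    [48, 21, 24], [63, 27, 19], [98, 33, 40], [181, 72, 74], [172, 70, 50],
    [156, 70, 58], [118, 47, 44], [174, 81, 61], [166, 83, 60], [256, 109, 75],
    [252, 133, 94], [270, 109, 76], [440, 199, 120], [369, 175, 130],
    [471, 206, 149], [349, 176, 103], [279, 142, 96], [185, 87, 66],
    [100, 64, 48], [228, 155, 134],
]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count = (row total * column total) / table total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Sum of (O - E)^2 / E over every cell of the table.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected


statistic, df, expected = chi_square_independence(observed)
smallest_expected = min(min(row) for row in expected)

# Assumed constant: upper 5% critical value of the chi-square distribution
# with 48 degrees of freedom, taken from standard tables (about 65.17).
CRITICAL_VALUE_DF48 = 65.17

print(f"df = {df}, smallest expected count = {smallest_expected:.1f}")
print(f"chi-square = {statistic:.2f}, reject H0: {statistic > CRITICAL_VALUE_DF48}")
```

Comparing with a critical value keeps the sketch free of external dependencies; a statistics library would normally be used to obtain the exact p-value.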

The result of the Chi-square test for independence over the data set of interest is the following:

Because we typically test at a significance level of α = 0.05 and the p-value is less than 0.05, the null hypothesis is rejected. That is, the data provide convincing evidence that there is some association between the amount of money earned by a US family per year and belonging to either the Protestant or the Catholic community, or no religious community at all.


The aim of the research project was to investigate whether there could be any association between the total tax-free income of an American family and their belonging to the Protestant community, the Catholic community or no religious community at all. The dataset was taken from the General Social Survey 1972-2012 (GSS), which since 1972 has been monitoring societal change and studying the growing complexity of American society. The original database was subset in order to take into account only the two variables of interest, Income and Religion; the latter was itself subset to select only three levels, Protestant, Catholic and None. Due to the type of data investigated, only a hypothesis test based on the Chi-square test for independence can be performed.

The result of the statistical analysis leads us to reject the null hypothesis and therefore to state that there is some association between the amount of money earned by a US family per year and belonging to either the Protestant or the Catholic community, or no religious community at all.

This could be only the beginning of a wider study of the association between religion and financial condition in the US. Deeper insight into the matter is needed, and more complex statistical tools and techniques must be used in order to reach complete and satisfying conclusions.

APPENDIX – Attached Dataset

by Francesco Pochetti

Car or Goat? The Monty Hall problem

 Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice? (Whitaker, 1990, as quoted by vos Savant 1990a)

Remember the awesome scene in the movie “21”, starring Kevin Spacey, in which his character (Professor Rosa) asks Ben (Jim Sturgess) exactly the question above?

Were you (like me!) among those who at the end of the scene kept staring at the screen wondering “what the hell is going on here?”

Well, if so, take a look at the following post!

The above brain teaser is known among mathematicians and statisticians as the so-called Monty Hall problem, named after the original host (Monty Hall) of an American TV game show who first mentioned the trick.

In that “21” movie scene Ben answers Prof Rosa’s question by claiming that changing door is indeed in his favor, because the probability of getting the car after the host has opened a goat door rises from the initial 33.3% (1/3) to 66.7% (2/3). Is it true?

If you don’t want to go too deep into the mathematical description, you may be satisfied with the following image, which shows qualitatively the reasoning behind Ben’s answer. The player has an equal chance (1/3) of initially selecting the car, goat A or goat B. Suppose the car is behind door 1. If the player selects door 1, the host can open door 2 or 3, but in either case switching turns into a loss for the player (switching loses with probability 1/3). If instead the player selects door 2, the host can only open door 3, meaning that switching turns into a win for our player. The same happens if the player selects door 3; in this case too he wins if he decides to change to the remaining door. This qualitative analysis shows that switching wins in two cases out of three, giving an overall probability of winning of 2/3 = 66.7% when switching, against 1/3 = 33.3% when sticking with the first choice.
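The case-by-case reasoning above can also be checked by brute force. The snippet below is a minimal Python sketch, not part of the original post: it enumerates every combination of car position, first pick and door opened by the host, and adds up the exact probability of winning with each strategy. The name `win_probabilities` is hypothetical, and the sketch assumes the host chooses at random when two goat doors are available.

```python
from fractions import Fraction
from itertools import product

DOORS = (1, 2, 3)


def win_probabilities():
    """Exact chances of winning when staying and when switching."""
    stay = switch = Fraction(0)
    # Car position and first pick are each uniform over the three doors.
    for car, pick in product(DOORS, repeat=2):
        weight = Fraction(1, 9)
        # The host opens a goat door that is neither the pick nor the car.
        openable = [d for d in DOORS if d != pick and d != car]
        for opened in openable:
            # Assumption: the host picks uniformly when two doors qualify.
            p = weight / len(openable)
            # The one door left closed that the player did not pick.
            remaining = next(d for d in DOORS if d not in (pick, opened))
            stay += p * (pick == car)
            switch += p * (remaining == car)
    return stay, switch


stay, switch = win_probabilities()
print(stay, switch)  # prints: 1/3 2/3
```

Using `Fraction` keeps the arithmetic exact, so the result is the 1/3 versus 2/3 split rather than a noisy simulation estimate.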


Now let’s turn to something a bit more sophisticated. Same problem, same solution, but a different and more rigorous path to get there.

We are going to tackle the teaser by applying Bayes’ Theorem on conditional probability, used here in the form “the probability of event A given B is equal to the probability of both A and B divided by the probability of B alone”. To fully understand what goes on behind the scenes, let’s summarize the problem with the decision tree shown below.


The diagram has been built assuming the player always chooses door 1 as his first decision; the upcoming discussion may also be followed using the image below, which is a self-explanatory illustration of the above tree. The probability that the car is hiding behind any given door is exactly 1/3. We begin with the top branch. Given that the car is behind door 1, the host may open door 2 or door 3 with the same probability of 1/2 (remember that the player selected door 1, which hides the car, so the other two doors both hide goats). The joint probability that the car is behind door 1 and the host opens door 2 is therefore (1/3 x 1/2) = 1/6, and the same holds for door 3.

Now let’s move to the second branch. If the car is behind door 2 and the player has selected door 1, then the host can only open door 3, with a probability of 1. This means that the joint probability that the car is behind door 2 and the host opens door 3, with the player on door 1, is (1/3 x 1) = 1/3.

The same reasoning holds for the third branch: the joint probability that the car is behind door 3 and the host opens door 2, with the player on door 1, is (1/3 x 1) = 1/3.

Now let’s apply Bayes Theorem which states that $$P(A|B)=\frac{P(A \; AND \; B)}{P(B)}$$

In our case:

  • A = the player wins by switching from door 1 to door 2 which is the same as the car is behind door 2.
  • B = the host opens door 3.

Summarizing, we’re answering the following question: “What is the probability that the player wins by switching door, given that his first choice was door 1 and the host opens door 3?” Note that the result wouldn’t change if we conditioned on the host opening door 2 instead. Translated into Bayes notation, the question becomes

$$ P(car \, is \, behind \, door \, 2 \, | \, host \, opens \, door \, 3)=\frac{P(car \, is \, behind \, door \, 2 \, AND\, host \, opens \, door \, 3)}{P(host \, opens \, door \, 3)} $$

Computing the probabilities with the help of the previous decision tree, we have:

  • P(car is behind door 2  AND host opens door 3) = 1/3
  • P(host opens door 3) = 1/6 + 1/3 = 1/2

$$ P(car \, is \, behind \, door \, 2 \, | \, host \, opens \, door \, 3)= \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3} \approx 66.7\%$$
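The same arithmetic can be reproduced with exact fractions. This is only a small Python sketch of the calculation above, with the player fixed on door 1; the variable names are illustrative, and the 1/2 entry encodes the assumption that the host chooses at random when both remaining doors hide goats.

```python
from fractions import Fraction

PRIOR = Fraction(1, 3)  # the car is equally likely to be behind each door

# P(host opens door 3 | car location), with the player on door 1.
# Assumption: the host chooses at random when both doors hide goats.
host_opens_3_given_car = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

# Joint probabilities P(car behind door k AND host opens door 3).
joint = {door: PRIOR * p for door, p in host_opens_3_given_car.items()}

p_host_opens_3 = sum(joint.values())       # 1/6 + 1/3 + 0
p_switch_wins = joint[2] / p_host_opens_3  # P(car behind 2 | host opens 3)

print(p_host_opens_3, p_switch_wins)  # prints: 1/2 2/3
```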

Here we are! Finally quod erat demonstrandum! 


That’s it! Cool, isn’t it?


by Francesco Pochetti

What is the gambler’s fallacy?

Imagine you are in a casino at a roulette table waiting to gamble. You have been following the game for a while and you’ve noticed that the last six outcomes were black. Well, it is quite remarkable, isn’t it? Treating red and black as equally likely (that is, ignoring the green zero), the probability of getting six blacks in a row at roulette is 1/64, approximately 1.6%. The chance of getting seven consecutive blacks is half of that, 1/128 or about 0.8%. Figures never lie! You must be very unlucky to get seven blacks in a row. Therefore you put your whole paycheck on red. But… wait a second. Is that right?

Obviously not! What you are missing is something fundamental: consecutive outcomes at the roulette table are independent events, which means that knowing the outcome of one provides no useful information about the outcome of the next. This implies that the probability of getting seven consecutive blacks is indeed 1/128, but the probability of getting a seventh black after the first six is still 1/2, as there is nothing preventing the ball from stopping on either a red or a black pocket.

Unsuspecting gamblers may convince themselves that the odds are in their favor whilst they are not! So, be careful!
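The difference between the two probabilities can be made concrete by listing all possible outcomes. The following Python sketch is an illustration added under the same simplifying assumptions as the text: red and black are equally likely (the green zero is ignored) and spins are independent. It enumerates every sequence of seven spins and counts the relevant ones.

```python
from fractions import Fraction
from itertools import product

# Simplifying assumption from the text: red and black are equally likely
# (the green zero is ignored) and spins are independent.
sequences = list(product("BR", repeat=7))  # all 128 equally likely outcomes

# Sequences made of seven blacks, and sequences whose first six are black.
seven_blacks = [s for s in sequences if all(c == "B" for c in s)]
six_blacks_first = [s for s in sequences if all(c == "B" for c in s[:6])]

# Unconditional probability of seven blacks in a row.
p_seven_blacks = Fraction(len(seven_blacks), len(sequences))
# Conditional probability of a seventh black given six blacks already seen.
p_black_after_six = Fraction(len(seven_blacks), len(six_blacks_first))

print(p_seven_blacks, p_black_after_six)  # prints: 1/128 1/2
```

Only two sequences start with six blacks, and exactly one of them ends in black, which is why the conditional probability stays at 1/2.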

That’s the gambler’s fallacy.  That’s it! Cool, isn’t it?


by Francesco Pochetti