prepared for the conference on
Information and Democratic Society
Representing and Conveying Quantitative Data
Columbia University
If the past is still prologue, the 21st century will continue the leaps forward in science and technology of the century just ended. The complete mapping of the human genome, developing nanotechnology, increasing computational power, exploring the rest of the universe, and many other discoveries will all expand our ability to understand the physical and natural worlds. But what of the social, economic, and political world? What of the world of human relations? How will we learn more about living in an increasingly technology-driven environment where interactions among human beings still dominate? Some of the aforementioned developments will help, but further understanding will be driven by the ability to gather and analyze more data and information, particularly longitudinal in character, and to connect those data in multidisciplinary, cross-national, and sophisticated ways.
Writing in 1986, the late Warren Miller and his colleagues noted that "the foreseeable future is unlikely to bring diminution in the magnitude and complexity of data needs in the social sciences; the greater likelihood is the need for more and still larger and more complex data collections."1 This occurs, according to Miller et al., because of the social sciences' interest in monitoring, explaining, or predicting change; the complexity of human behavior; the need for a large number of cases to support generalizations; and the continuation of existing data series for continuity and comparability, while also adding new measures. This leads to massive expenditures, in the billions of dollars worldwide, to collect social science data, both by the government and through government support for others.2
If you do some searching, you might come to the conclusion that more than enough data have been or are being collected to examine any question. With advances in information and communications technology, many of these data have become easily accessible as well. As Norman Bradburn has pointed out, it appears "we are drowning in data."3 A few years ago Richard T. Campbell developed a chart of the myriad longitudinal surveys covering the 1900 to 1965 birth cohorts, which demonstrated how significantly those surveys overlapped.4
Go to www.fedstats.gov. The entire federal statistical system is at your mouseclick. Fifteen agencies represented on the Interagency Council on Statistical Policy provide the bulk of the data, but more than 50 others are involved in the planning of statistical surveys and designs; training of statisticians; collecting, processing and tabulating data for publication, dissemination, research or analysis; methodological testing or statistical research; forecasting and projecting for government-wide or public use; constructing secondary data series or developing models for generating statistical series or forecasts; and managing or coordinating statistical operations. The Current Population Survey, Education Data, Labor Market and Unemployment Data, Health Data, and Income Data are all there, albeit in the aggregate.
Government-supported information collections such as the National Science Foundation's big three, the Panel Study of Income Dynamics (PSID), the General Social Survey, and the National Election Studies, have provided data series for over 30 years and are accessible through the Internet. The PSID has the advantage of following the same people over time. The Bureau of Labor Statistics-supported National Longitudinal Survey of Labor Market Experiences (NLS) has for almost 35 years provided a fountain of information for social scientists about working, the transition from school to work, and many other facets of growing up and, in earlier surveys, of growing old. With the demise of the older cohorts of the NLS, the National Institute on Aging is now funding the Health and Retirement Study and the Asset and Health Dynamics of the Oldest Old (AHEAD) to study work-to-retirement decisions and how people cope with growing old. The National Institute of Child Health and Human Development (NICHD) is providing support for the Adolescent Health Survey, an examination of the social contexts of adolescent attitudes and behaviors.
In addition, the private and non-profit sector continues to contribute to the creation of social science data through public opinion polls, election day exit polls, market research, and support to study some subjects the government sometimes has difficulty funding, such as sexual behavior.
Across the world more and more data are being collected. International governmental organizations such as the International Monetary Fund, the World Bank, the Organization for Economic Co-operation and Development (OECD), the United Nations and its components such as UNESCO and the World Health Organization, are all in the business of information collection and dissemination. These collections include economic indicators, social indicators, education data, science and technology efforts, and health and epidemiological information. Almost all countries now conduct some sort of Census.
The European Science Foundation has embarked on support for a European counterpart to the General Social Survey. The Luxembourg Income Study is a cooperative research project with membership in 25 countries trying to create a database of social and economic household microdata from different countries and promoting comparative research on the economic and social status of populations in different countries.
Data archives have been in existence for many years. These archives permit the sharing of data, now through electronic means. They also contain the documentation to help social science researchers use the data. Many granting agencies such as the National Science Foundation and the National Institute of Justice require the depositing of data in archives. The Howard Odum Center at the University of North Carolina at Chapel Hill is the oldest, begun in 1924 (www.irss.unc.edu). The Inter-university Consortium for Political and Social Research at the University of Michigan (www.icpsr.umich.edu), around since 1962, is probably the best known. Others include the Institute for Social Science Research at UCLA (www.sscnet.ucla.edu/issr/da), the National Opinion Research Center (www.norc.uchicago.edu), the Roper Center for Public Opinion Research (www.ropercenter.uconn.edu), and the Data and Program Library Service at the University of Wisconsin, Madison (http://dpls.dacc.wisc.edu).
Outside the United States, the Council of European Social Science Data Archives (CESSDA) (http://www.nsd.uib.no/cessda) promotes the acquisition, archiving and distribution of electronic data for social science teaching and research in Europe. The International Federation of Data Organizations for the Social Sciences (IFDO) (http://www.ifdo.org) provides archive information for countries outside of Europe. The International Association for Social Science Information Services and Technology (IASSIST) is another organization dedicated to the issues and concerns of data librarians, data archivists, data producers, and data users (http://datalib.library.ualberta.ca/iassist/index.html).
Given all these collections and archives, what's the problem? Of course, we would like information on everybody and everything connected to social, economic and political behavior, individually and collectively, in all countries. We would like immediate electronic access to all these data, no comparability problems, no confidentiality problems, no missing data, somebody else to fund it all, and, in this age of interdisciplinarity, linkages to biological and physical science data. When I was in Norway this summer, I was introduced to something my hosts called the "Social Science Dream Machine." It only envisioned a fraction of what I just suggested.
The OECD is hosting a series of international workshops aimed at "reinventing the social sciences" to make them more empirical and relevant to policy-makers. Representatives meeting in Ottawa in October concluded that besides the data collection efforts, an electronically linked comprehensive Web-based database system from which researchers could extract relevant world data would greatly increase the potential of social science research.5
The introduction of advanced information and communication technologies has provided opportunities for fulfilling the "dream machine" idea. Yet barriers remain. They can be grouped under five issues: access, linkage, confidentiality and non-response, missing topics, and funding.
There is a sense that, despite the availability of all the data sources noted earlier, they are underutilized and uncoordinated. David Gadd surveyed the use of secondary statistical data at the University of Plymouth in the UK and discovered that only one-third of those interested in the data had used information from the British census, and the percentages using other British-based surveys were in the teens and lower.6 Why is this?
Despite electronic access to archives, many researchers still have difficulty getting the data they need. One problem is that electronic data dissemination is still in its early stages and there are many kinks to work out. We have all experienced frustrations with the Web and its occasional quirkiness. Licensing costs and intellectual property considerations continue to constrain accessibility. Lag times in depositing data in archives are another problem. Data documentation is sometimes not what it should be.
Governments are also reluctant to divulge their data to researchers. Restrictions are placed on access to many countries' administrative data, particularly microdata at the individual level. Often these data are restricted to officials of statistical offices or, if they are available to others, must be used within a statistical office. In Europe, according to Kraus, only Great Britain and Norway have liberalized access to satisfy the needs of social science research.7 In the U.S., the Census Bureau, through its regional data centers, and the National Center for Education Statistics, with its researcher agreements, have experimented with ways for researchers to utilize data from their surveys, although the data centers have their drawbacks due to logistical considerations. The Health and Retirement Study researchers have negotiated agreements with the U.S. government to allow the use of administrative data to verify survey data.
Another difficulty is the lack of sophistication of the most common gateway search engines. There is a clear need for better ways of finding social science data-related material on the net. NESSTAR (the Networked Social Science Tools and Resources project) is one attempt at solving this problem (http://www.nesstar.org). A joint project of the Norwegian Social Science Archive, the UK Data Archive, and the Danish Data Archive, NESSTAR allows users to locate multiple data sources across national boundaries.
The second barrier to a complete social science data system is comparability and linkage across data sets. As I said earlier, there are enormous amounts of data. The difficulty has been linking them together in a way that permits meaningful statistical analysis of particular phenomena. I referred to Campbell's chart earlier. He points out that there were 10 surveys all examining birth cohorts from 1900 to 1965 without any linkage among them.
The comparability problem is acute, especially with international data sets. Most databases are written in the language of the data owner, especially in the case of social survey outputs. Translation programs may eventually solve this problem, but reading election polls in Danish remains a barrier for most people. Data may simply be incompatible because variables are examined in slightly different ways. Different surveys ask dissimilar questions to get at the same issues. The Survey of Income and Program Participation (SIPP) asks questions somewhat differently from the Current Population Survey and also somewhat differently from the Panel Study of Income Dynamics. As Lisa Dillon, Chair of the International Microaccess Data Group, told SCIENCE Magazine: "If the information is not being categorized in the same way in different countries, you don't know if you are comparing apples and oranges."8
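To make the categorization problem concrete, here is a minimal sketch in Python of recoding two surveys' answers to a common scheme before pooling them. The category codes, field names, and the harmonize helper are all hypothetical, invented for illustration; they are not drawn from the SIPP, CPS, or PSID codebooks.

```python
# Hypothetical code-to-category mappings for two imaginary surveys.
# Neither reflects a real codebook.
SURVEY_A_TO_COMMON = {
    1: "employed",
    2: "unemployed",
    3: "not_in_labor_force",
}
SURVEY_B_TO_COMMON = {
    "FT": "employed",
    "PT": "employed",
    "LOOK": "unemployed",
    "OUT": "not_in_labor_force",
}


def harmonize(records, mapping, field):
    """Return copies of the records with `field` recoded to the common scheme.

    Codes that have no entry in the mapping become None, so they can be
    reviewed by hand instead of being silently forced into a category.
    """
    recoded = []
    for record in records:
        new_record = dict(record)
        new_record[field] = mapping.get(record[field])
        recoded.append(new_record)
    return recoded


survey_a = [{"id": "a1", "status": 1}, {"id": "a2", "status": 3}]
survey_b = [{"id": "b1", "status": "PT"}, {"id": "b2", "status": "SEASONAL"}]

pooled = (
    harmonize(survey_a, SURVEY_A_TO_COMMON, "status")
    + harmonize(survey_b, SURVEY_B_TO_COMMON, "status")
)
for row in pooled:
    print(row["id"], row["status"])
```

The unmapped "SEASONAL" code is left as None rather than guessed at, which is the apples-and-oranges case: a category one survey records and the other does not.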
The Luxembourg Employment Study (http://lissy.ceps.lu/LES/les.htm), a project associated with the Luxembourg Income Study, constructed a data bank of labor force surveys from countries with quite different labor market structures. The LES was able to harmonize the microdata from the labor force surveys to facilitate comparative research. This was not an easy task, according to Gaston Schaber, president of Luxembourg's International Networks for Studies in Technology, Environment, Alternatives, Development (CEPS/INSTEAD), who worked on the problem of harmonizing the data from national household surveys.9
As I noted earlier, the European Science Foundation announced last summer a "Blueprint for A European Social Survey" (http://www.esf.org), a research instrument that would systematically measure, at regular intervals, citizens' attitudes relating to a core set of political, social and economic issues. Its findings would be available to researchers through a co-ordinated network of national data archives and other facilities.
To conquer these problems, the National Science Foundation, as part of its program to enhance infrastructure in the social and behavioral sciences, has awarded John Abowd of Cornell University over $4 million over the next five years to create three prototype data sets based on the Census Bureau's demographic and economic products. He and his colleagues will use link information that permits the data sets to be longitudinal in both the household/individual and firm/establishment dimensions. Abowd and his collaborators also hope to advance the knowledge of linkage technology and the statistical properties of linked data so that researchers in all disciplines can use these techniques. They also expect to include data from France, Sweden, and Germany.
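As a rough illustration of what it means for link information to make data longitudinal, the sketch below groups records from separate survey waves by a shared identifier. It is a simple exact-match linkage on hypothetical identifiers (person_id, employer_id) and a hypothetical link_waves helper; it is not the method the Cornell project uses, and real linkage work must also handle missing or inconsistent identifiers.

```python
from collections import defaultdict


def link_waves(waves, key="person_id"):
    """Group records from several survey waves by a shared identifier.

    `waves` maps a year to a list of record dicts. The result maps each
    identifier to a list of (year, record) pairs in chronological order.
    """
    histories = defaultdict(list)
    for year, records in sorted(waves.items()):
        for record in records:
            histories[record[key]].append((year, record))
    return dict(histories)


# Hypothetical data: two waves, with person p1 changing employers.
waves = {
    1999: [{"person_id": "p1", "employer_id": "f2"}],
    1998: [
        {"person_id": "p1", "employer_id": "f1"},
        {"person_id": "p2", "employer_id": "f1"},
    ],
}

histories = link_waves(waves)
for year, record in histories["p1"]:
    print(year, record["employer_id"])
```

Passing key="employer_id" instead would group the same records by firm, which corresponds to the firm/establishment dimension mentioned above.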
In the same competition, Steven Ruggles of the University of Minnesota won a $3.5 million grant over five years to develop an integrated international census database composed of high-precision, high-density samples of individuals and households from seven countries. He also hopes to create an innovative system for worldwide web-based access to both metadata and microdata in a consistent format.
With regard to U.S. federal statistics, legislation to share data among federal statistical agencies has made it through the House of Representatives, but not the Senate. The bill, sponsored by Representative Stephen Horn of California, designates eight agencies involved in the collection of statistics as "statistical data centers" to facilitate data sharing. The eight are: the Bureau of Economic Analysis, the Bureau of the Census, the Bureau of Labor Statistics, the National Agricultural Statistics Service, the National Center for Education Statistics, the National Center for Health Statistics, the Energy Consumption Division in the Department of Energy, and the Division of Science Resources Studies at the National Science Foundation. The centers would be allowed to share statistical data, eliminate redundant reporting requirements, and enter into joint projects to improve the quality and lower the cost of statistical programs. In addition, other federal agencies could share data with the eight centers for "purely statistical purposes." If the bill is enacted, the Bureau of the Census and the Bureau of Labor Statistics would no longer each need to compile their own lists of business establishments, as they must now because current law prohibits these agencies from sharing their lists.
Norman Bradburn has suggested that the social sciences have a unique duality in that "people are both the subjects and objects of the inquiry."10 We often require extensive, sensitive data on individuals and organizations. As such, social scientists need to protect their subjects against privacy invasions and to inform them about what is going to happen to the information they provide. How do social scientists, as Paul Reynolds asks, "demonstrate respect for individuals and organizations through appropriate handling of private and sensitive information while simultaneously making clear the societal benefits of an enhanced understanding of basic phenomena" that we research?11 It is a difficult balancing act.
Sometimes attempts are made to answer the question for us. A few years ago, Senator Charles Grassley of Iowa, disturbed at questions being posed to children by certain researchers, sought to clarify the rules of "informed consent" to make it restrictive to the point where studies of classroom learning, not to mention the Adolescent Health Survey, would have been extremely difficult to conduct.
The recently proposed rules on medical privacy promulgated by the Department of Health and Human Services tried to deal with the problem of identifiers in medical research.12 Some argued that the rules, if enacted, would make epidemiological studies difficult to carry out and would make it impossible to link information from these studies to other data. In response to the proposals, some suggested establishing data centers similar to the Census Bureau model to examine individualized data.
Last year's flap over Office of Management and Budget Circular A-110 and the use of the Freedom of Information Act (FOIA) created new concerns for protecting confidentiality. The Congress had voted that researchers would have to release "all data" connected to a federally funded grant to anyone who filed a request under FOIA. The OMB wrote rules to implement the congressional edict, known as the Shelby provision after its author, Senator Richard Shelby of Alabama. Although the rules mitigated the potential damage somewhat, a recent article in THE SCIENTIST quoted NIH Deputy Director Wendy Baldwin expressing concern about the implementation of Shelby.13 The U.S. Chamber of Commerce has already filed suit against the OMB rules, charging that they did not carry out the Shelby provision's language that "all data" must be released. The SCIENTIST article raises the issue of protecting institutions in a study, such as family planning clinics, from identification in complying with a FOIA request.
Recently, a highly educated person complained to me about receiving the long form of the 2000 Census. She found some of the questions intrusive and considered them an invasion of her privacy. When it was pointed out that all 52 questions on the long form are required by one law or another, she still expressed concern about why the government needs to know some of the answers. The Census is the largest peacetime mobilization of the government. It provides the bedrock data for much of what social scientists study. Yet, we know there is a differential undercount, and sampling, a basic tool in the social sciences, has been rejected for political reasons as a cure for this problem. The concern over low response rates has led the Bureau to spend close to $170 million on marketing.
We also know that the number of people refusing to respond to survey questionnaires is climbing.14 A good survey researcher now has to budget sufficiently for a larger number of callbacks to reach the desired response rate. Yet, at the same time, people are allowing the government to draw blood from them in government health surveys. Is it the impersonal phone call that sounds too much like another telemarketing spiel versus the friendly government worker assuring you of confidentiality that makes the difference here?
Are there substantive areas that need more data and more attention from social scientists? There is clear agreement that more longitudinal studies are the key to answering important questions for social policy. The capability to have data across a person's life cycle, as they do in Denmark for twins, would be immensely valuable. As the population ages, how people use their time is another area for deeper exploration. Of course, the new information economy challenges social scientists to measure e-commerce and to produce better data about the service sector and, in particular, the information technology sector.
New tools will help us study our subjects better. The use of techniques such as Geographic Information Systems (GIS) and remote sensing will add a spatial dimension to our usual individual and socio-economic variables. State and local governments employ GIS applications to study and develop agriculture, crime, transportation, land use, and other policies. Its use as a tool in the social sciences has been lagging. Recognizing this problem, the NSF has awarded Mike Goodchild at the University of California, Santa Barbara $4.3 million to spread the use of GIS in the social sciences. The addition of a spatial dimension to our data will also raise more confidentiality problems, as the ability to pinpoint particular places will make it more difficult to mask identification of those places.
Functional Magnetic Resonance Imaging (fMRI) allows researchers to image the brain. Another winner of the NSF SBE infrastructure competition is Michael Gazzaniga of Dartmouth College who wants to create a National fMRI center to help speed progress in understanding cognitive processes and the neural substrates that underlie them. Gazzaniga hopes to create a common data standard for brain images. The technique has the capacity to locate the neural activity associated with a particular behavior. Along with the mapping of the genome, the potential for differentiating the biological bases of behavior and the non-biological influences on behavior should become clearer.
Current plans and studies will add more and better data to examine and evaluate social phenomena. Let me mention a few. The Census Bureau is currently testing the American Community Survey (ACS). When fully implemented, the ACS will present more timely and up-to-date information on America's communities and will end the need for the decennial long form. The ACS will provide demographic, social, economic, and housing profiles updated every year. By 2003, if Congress provides the funding, a big if at the moment, the Census Bureau hopes to introduce the ACS in every county throughout the country with a sample of 250,000 households per month.
The concern with how children learn will remain on the social science agenda. One component of that agenda is the acquisition of language. The final winner in the social science infrastructure competition was Brian MacWhinney of Carnegie Mellon. He and his colleagues hope to develop computational tools to facilitate the linguistic analysis of transcript data in the Child Language Data Exchange System database.
Another component is collecting developmental data on children themselves. The National Center for Education Statistics is sponsoring the Early Childhood Longitudinal Study (http://nces.ed.gov/ecls). The study will provide national data on: children's status at birth and various points thereafter; children's transitions to nonparental care, early education programs and school; and children's experiences and growth through the fifth grade. A birth cohort of 12,000 children born in 2000 will be studied, and a kindergarten cohort in 995 schools has already experienced the first questionnaire wave.
Along similar lines is the Project on Human Development in Chicago Neighborhoods, sponsored by a number of government agencies and the MacArthur Foundation (http://phdcn.harvard.edu/geninfo.htm). This is a major interdisciplinary study aimed at deepening society's understanding of the causes and pathways of juvenile delinquency, adult crime, substance abuse, and violence. Besides an intensive examination of Chicago neighborhoods, the project contains a series of coordinated longitudinal studies that will follow 7,000 randomly selected children, adolescents, and young adults, looking at the changing circumstances of their lives, as well as the personal characteristics that may lead them toward or away from a variety of antisocial behaviors. The goal is to unravel the complex influences of community, family, and individual factors on human development.
Finally, the question arises of where the funding will come from to support all these great plans for improved data collection and dissemination. Government funding for the social and behavioral sciences has remained fairly steady, improving only in the past few years, although in the age of interdisciplinarity, there is a measurement problem here. Federal statistical budgets are pretty stagnant. As I mentioned, the fight to get the ACS funded is going to be a difficult one.
To help in this matter, the National Science Foundation's SBE directorate has announced another infrastructure competition (http://www.nsf.gov/pubs/2000/nsf0079/nsf0079.htm). Again, there is about $3 million available for between four and eight projects. This competition aims to create or extend innovative large-scale infrastructure projects that promise widespread support to social and behavioral scientists. For this solicitation's purposes, infrastructure means:
(1) Data collections: experiments, surveys, historical, objects of investigation;
(2) Web-based systems for archiving, linking or disseminating data;
(3) Collaboratories for real-time laboratory experimentation and equipment sharing;
(4) Centers: geographical or virtual; to develop a fledgling field, reinvigorate a stagnant field; or jump-start an area that is ripe for major breakthroughs.
The deadline for proposals is August 4, 2000.
More significantly, NSF Director Rita Colwell has discussed the idea of a major initiative in the social and behavioral sciences for the agency's FY 2003 budget. According to the federal budget process timeline, that means 2000 is the year such a proposal needs development and community buy-in. Any initiative will have to be multidisciplinary, multidirectorate, global in scope, have an education component, and perhaps, an industrial component. New SBE Assistant Director Norman Bradburn hopes to consult with the social science community in the coming months to put something together. The former SBE Assistant Director Bennett Bertenthal used to admonish his colleagues in the social and behavioral sciences that it was now time to think big. We hope the community will be up to it.
Acknowledgments: I would like to thank Barbara Torrey, Norman Bradburn, Bill Butz, Stanley Presser, and Robert Groves for their assistance in preparing this paper. Of course, they are not responsible for any errors or omissions.
1. Warren E. Miller, et al., "Large Scale Data Needs" in R.D. Luce, N. Smelser, and D. Gerstein, Leading Edges in Social and Behavioral Science, Russell Sage, 1989.
2. Richard Rockwell, "Data and Statistics: Empirical Bases of the Social Sciences," World Social Science Report 1999, UNESCO, 1999.
3. Norman Bradburn, "The Future of Federal Statistics in the Information Age," Morris Hansen Memorial Lecture, Washington Statistical Society, October 22, 1997.
4. Richard T. Campbell, "A Data-based Revolution in the Social Sciences." ICPSR Bulletin 14(3) pp. 1-4. 1994. Updated and expanded in Robert M. Hauser, Deborah Carr, Taissa S. Hauser, Jeffrey Hayes, Margaret Krecker, Hsiang-Hui Daphne Kuo, William Magee, John Presti, Diane Shinberg, Megan Sweeney, Theresa Thompson-Colon, S.C. Noah Uhrig, and John Robert Warren. "The Class of 1957 After 35 Years: Overview and Preliminary Findings." CDE Working Paper 93-17, Center for Demography and Ecology, The University of Wisconsin-Madison.
5. "Social Sciences Databases in OECD Countries: An Overview," prepared for the workshop on Infrastructure Needs for Social Sciences, Ottawa, Canada, October 6-8, 1999.
6. David Gadd, "Networked Statistics Research Support Survey," Networked Statistics Bulletin, Issue 27, May 1998, University of Plymouth, cited in OECD Workshop Document.
7. Franz Kraus, "Towards A Data Infrastructure for Socio-economic Research on Europe: Improving Access to Official Microdata at the National and the European Level," EURODATA Newsletter No. 7, MZES, Mannheim, 1998, cited in OECD Workshop Document.
8. Wayne Kondro, "Making Social Science Data More Useful," SCIENCE Magazine, 29 October 1999, Volume 286.
9. Ibid.
10. Norman Bradburn, presentation to the National Science Board, March 16, 2000, Arlington, VA.
11. Paul D. Reynolds, "Privacy and Advances in Social and Policy Sciences: Balancing Present Costs and Future Gains," Journal of Official Statistics, Vol. 9, No. 2, 1993, Statistics Sweden.
12. Federal Register, November 3, 1999, pp. 59917-60065.
13. Nadia S. Helm, "A Data Access Conundrum," The Scientist, March 20, 2000.
14. Robert Groves and Mick Couper, Non-Response in Household Interview Surveys, Wiley, 1998.