Installation and General Usage

Installation and General Usage#

Installation#

Python#

To install Python, you should download the installer for the appropriate version from www.python.org/downloads/ and run the installer. Once you have run the installer, you can try entering python --version (or python3 --version if you installed a Python version greater than 3.0) in your terminal/command line to verify the installation worked.

After that, it is recommended you also install several packages to help interact with data. More information on how to use these packages can be found in the Syntax section, but for now I will just discuss how to install them.

For this online book, you are only required to install pandas, numpy, matplotlib, regex, seaborn and scikit-learn, although it is also a good idea to explore and other packages like scipy, plotly if you wish. To install any of these packages, you can just type pip install followed by the package name in the terminal. For example, to install pandas, you would type pip install pandas. If you’re interested, you can find more information about pip here.

I also recommend installing Jupyter Lab as it provides an interactive shell for you to easily conduct data analysis and look at what your data looks like. To do so, just type pip install jupyter lab like above. Once you have installed jupyter lab, you can start it by typing jupyter lab in your terminal. This should open up a Jupyter Lab environment in your default browser, allowing you to view/open various files and create new notebooks for data analysis. You can learn more about this on their website.

Stata#

To install Stata, you need to purchase a license from their website, after which they send you an activation key along with further instructions on how to install the software.

General Usage#

Python#

In Python, data analysis is typically carried out using libraries like Pandas and NumPy, and the approach to data storage and analysis differs from Stata. As Python is a general programming language, there is no requirement to have any datasets used or packages imported. However, after importing the relevant libraries, you can easily load in datasets using pandas, with the primary data structure for data manipulation and analysis being the Pandas DataFrame. With Jupyter Lab notebooks, you can easily view different datasets in different cells of the notebooks, allowing you to easily view and work on multiple datasets at once.

However, as Python is a general programming language reliant on packages for data analyses, it’s general nature means that users often have to combine multiple packages to conduct different types of analyses. For instance, while Pandas is extensively used for data manipulation and basic statistical analyses, additional libraries like NumPy are employed for numerical operations, and SciPy might be used for more advanced statistical techniques. Visualization can be handled with libraries such as Matplotlib or Seaborn, and machine learning tasks might involve libraries like Scikit-learn, TensorFlow, or PyTorch. While this approach involves more work from the user, it also allows for a greater level of customization and adaptability depending on what the user needs.

Stata#

When you’re working in Stata, you always have one dataset loaded into memory at a time. This dataset serves as the primary focus for any operations or analyses you perform. Stata’s memory structure is designed around this single-dataset paradigm, which means that you cannot directly view or manipulate multiple datasets concurrently.

However, you do not need to worry about installing additional packages, as Stata offers a comprehensive set of built-in commands that cover a wide range of statistical analyses and data manipulation techniques. These include functions for descriptive statistics, regression analysis, hypothesis testing, data cleaning, etc, and you might need to combine multiple Python packages to conduct the same analyses.